SeqHound: biological sequence and structure database as a platform for bioinformatics research
2002-01-01
Background SeqHound has been developed as an integrated biological sequence, taxonomy, annotation and 3-D structure database system. It provides a high-performance server platform for bioinformatics research in a locally-hosted environment. Results SeqHound is based on the National Center for Biotechnology Information data model and programming tools. It offers daily updated contents of all Entrez sequence databases in addition to 3-D structural data and information about sequence redundancies, sequence neighbours, taxonomy, complete genomes, functional annotation including Gene Ontology terms and literature links to PubMed. SeqHound is accessible via a web server through a Perl, C or C++ remote API or an optimized local API. It provides functionality necessary to retrieve specialized subsets of sequences, structures and structural domains. Sequences may be retrieved in FASTA, GenBank, ASN.1 and XML formats. Structures are available in ASN.1, XML and PDB formats. Emphasis has been placed on complete genomes, taxonomy, domain and functional annotation as well as 3-D structural functionality in the API, while fielded text indexing functionality remains under development. SeqHound also offers a streamlined WWW interface for simple web-user queries. Conclusions The system has proven useful in several published bioinformatics projects such as the BIND database and offers a cost-effective infrastructure for research. SeqHound will continue to develop and be provided as a service of the Blueprint Initiative at the Samuel Lunenfeld Research Institute. The source code and examples are available under the terms of the GNU public license at the Sourceforge site http://sourceforge.net/projects/slritools/ in the SLRI Toolkit. PMID:12401134
Computation of repetitions and regularities of biologically weighted sequences.
Christodoulakis, M; Iliopoulos, C; Mouchard, L; Perdikuri, K; Tsakalidis, A; Tsichlas, K
2006-01-01
Biological weighted sequences are used extensively in molecular biology as profiles for protein families, in the representation of binding sites and often for the representation of sequences produced by a shotgun sequencing strategy. In this paper, we address three fundamental problems in the area of biologically weighted sequences: (i) computation of repetitions, (ii) pattern matching, and (iii) computation of regularities. Our algorithms can be used as basic building blocks for more sophisticated algorithms applied on weighted sequences.
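As an illustration of problem (ii), the following is a minimal sketch of pattern matching on a weighted sequence. It assumes each position is a mapping from symbol to probability and that positions are independent; the names `occurrence_probability` and `match_positions` and the probability threshold are hypothetical and are not taken from the paper, whose algorithms are more sophisticated than this brute-force scan.

```python
from typing import Dict, List

# A weighted sequence: one symbol -> probability mapping per position (assumed model).
WeightedSeq = List[Dict[str, float]]


def occurrence_probability(wseq: WeightedSeq, pattern: str, start: int) -> float:
    """Probability that `pattern` occurs at `start`, assuming independent positions."""
    if start < 0 or start + len(pattern) > len(wseq):
        return 0.0
    prob = 1.0
    for offset, symbol in enumerate(pattern):
        # A symbol absent from a position has probability 0 there.
        prob *= wseq[start + offset].get(symbol, 0.0)
    return prob


def match_positions(wseq: WeightedSeq, pattern: str, threshold: float) -> List[int]:
    """All start positions where the pattern occurs with probability >= threshold."""
    last_start = len(wseq) - len(pattern)
    return [
        i
        for i in range(last_start + 1)
        if occurrence_probability(wseq, pattern, i) >= threshold
    ]
```

For example, with positions `[{"A": 1.0}, {"C": 0.5, "G": 0.5}, {"T": 1.0}]`, the pattern "CT" occurs at position 1 with probability 0.5.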
A Study of the Comparative Effectiveness of Zoology Prerequisites at Slippery Rock State College.
ERIC Educational Resources Information Center
Morrison, William Sechler
This study compared the effectiveness of three sequences of prerequisite courses required before taking zoology. Sequence 1 prerequisite courses consisted of general biology and human biology; Sequence 2 consisted of general biology; and Sequence 3 required cell biology. Zoology students in the spring of 1972 were given a pretest and a posttest. The mean…
It’s More Than Stamp Collecting: How Genome Sequencing Can Unify Biological Research
Richards, Stephen
2015-01-01
The availability of reference genome sequences, especially the human reference, has revolutionized the study of biology. However, whilst the genomes of some species have been fully sequenced, a wide range of biological problems still cannot be effectively studied for lack of genome sequence information. Here, I identify neglected areas of biology and describe how both targeted species sequencing and more broad taxonomic surveys of the tree of life can address important biological questions. I enumerate the significant benefits that would accrue from sequencing a broader range of taxa, as well as discuss the technical advances in sequencing and assembly methods that would allow for wide-ranging application of whole-genome analysis. Finally, I suggest that in addition to “Big Science” survey initiatives to sequence the tree of life, a modified infrastructure-funding paradigm would better support reference genome sequence generation for research communities most in need. PMID:26003218
It's more than stamp collecting: how genome sequencing can unify biological research.
Richards, Stephen
2015-07-01
The availability of reference genome sequences, especially the human reference, has revolutionized the study of biology. However, while the genomes of some species have been fully sequenced, a wide range of biological problems still cannot be effectively studied for lack of genome sequence information. Here, I identify neglected areas of biology and describe how both targeted species sequencing and more broad taxonomic surveys of the tree of life can address important biological questions. I enumerate the significant benefits that would accrue from sequencing a broader range of taxa, as well as discuss the technical advances in sequencing and assembly methods that would allow for wide-ranging application of whole-genome analysis. Finally, I suggest that in addition to 'big science' survey initiatives to sequence the tree of life, a modified infrastructure-funding paradigm would better support reference genome sequence generation for research communities most in need. Copyright © 2015 Elsevier Ltd. All rights reserved.
Unified Deep Learning Architecture for Modeling Biology Sequence.
Wu, Hongjie; Cao, Chengyuan; Xia, Xiaoyan; Lu, Qiang
2017-10-09
Prediction of the spatial structure or function of biological macromolecules based on their sequence remains an important challenge in bioinformatics. When modeling biological sequences using traditional sequencing models, characteristics, such as long-range interactions between basic units, the complicated and variable output of labeled structures, and the variable length of biological sequences, usually lead to different solutions on a case-by-case basis. This study proposed the use of bidirectional recurrent neural networks based on long short-term memory or a gated recurrent unit to capture long-range interactions by designing the optional reshape operator to adapt to the diversity of the output labels and implementing a training algorithm to support the training of sequence models capable of processing variable-length sequences. Additionally, the merge and pooling operators enhanced the ability to capture short-range interactions between basic units of biological sequences. The proposed deep-learning model and its training algorithm might be capable of solving currently known biological sequence-modeling problems through the use of a unified framework. We validated our model on one of the most difficult biological sequence-modeling problems currently known, with our results indicating the ability of the model to obtain predictions of protein residue interactions that exceeded the accuracy of current popular approaches by 10% based on multiple benchmarks.
Advanced Applications of Next-Generation Sequencing Technologies to Orchid Biology.
Yeh, Chuan-Ming; Liu, Zhong-Jian; Tsai, Wen-Chieh
2018-01-01
Next-generation sequencing technologies are revolutionizing biology by permitting transcriptome sequencing, whole-genome sequencing and resequencing, and genome-wide single nucleotide polymorphism profiling. Orchid research has benefited from this breakthrough, and a few orchid genomes are now available; new biological questions can be approached and new breeding strategies can be designed. The first part of this review describes the unique features of orchid biology. The second part provides an overview of the current next-generation sequencing platforms, many of which are already used in plant laboratories. The third part summarizes the state of orchid transcriptome and genome sequencing and illustrates current achievements. The genetic sequences currently obtained will not only provide a broad scope for the study of orchid biology, but also serve as a starting point for uncovering the mystery of orchid evolution.
Cysteine-containing peptide tag for site-specific conjugation of proteins
Backer, Marina V.; Backer, Joseph M.
2008-04-08
The present invention is directed to a biological conjugate, comprising: (a) a targeting moiety comprising a polypeptide having an amino acid sequence comprising the polypeptide sequence of SEQ ID NO:2 and the polypeptide sequence of a selected targeting protein; and (b) a binding moiety bound to the targeting moiety; the biological conjugate having a covalent bond between the thiol group of SEQ ID NO:2 and a functional group in the binding moiety. The present invention is directed to a biological conjugate, comprising: (a) a targeting moiety comprising a polypeptide having an amino acid sequence comprising the polypeptide sequence of SEQ ID NO:2 and the polypeptide sequence of a selected targeting protein; and (b) a binding moiety that comprises an adapter protein, the adapter protein having a thiol group; the biological conjugate having a disulfide bond between the thiol group of SEQ ID NO:2 and the thiol group of the adapter protein. The present invention is also directed to biological sequences employed in the above biological conjugates, as well as pharmaceutical preparations and methods using the above biological conjugates.
Cysteine-containing peptide tag for site-specific conjugation of proteins
Backer, Marina V.; Backer, Joseph M.
2010-10-05
The present invention is directed to a biological conjugate, comprising: (a) a targeting moiety comprising a polypeptide having an amino acid sequence comprising the polypeptide sequence of SEQ ID NO:2 and the polypeptide sequence of a selected targeting protein; and (b) a binding moiety bound to the targeting moiety; the biological conjugate having a covalent bond between the thiol group of SEQ ID NO:2 and a functional group in the binding moiety. The present invention is directed to a biological conjugate, comprising: (a) a targeting moiety comprising a polypeptide having an amino acid sequence comprising the polypeptide sequence of SEQ ID NO:2 and the polypeptide sequence of a selected targeting protein; and (b) a binding moiety that comprises an adapter protein, the adapter protein having a thiol group; the biological conjugate having a disulfide bond between the thiol group of SEQ ID NO:2 and the thiol group of the adapter protein. The present invention is also directed to biological sequences employed in the above biological conjugates, as well as pharmaceutical preparations and methods using the above biological conjugates.
SeqCompress: an algorithm for biological sequence compression.
Sardaraz, Muhammad; Tahir, Muhammad; Ikram, Ataul Aziz; Bajwa, Hassan
2014-10-01
The growth of Next Generation Sequencing technologies presents significant research challenges, specifically the design of bioinformatics tools that handle massive amounts of data efficiently. The cost of storing biological sequence data has become a noticeable proportion of the total cost of data generation and analysis. In particular, the increase in DNA sequencing rate is significantly outstripping the rate of increase in disk storage capacity, so data volumes may exceed available storage capacity. It is essential to develop algorithms that handle large data sets via better memory management. This article presents a DNA sequence compression algorithm, SeqCompress, that copes with the space complexity of biological sequences. The algorithm is based on lossless data compression and uses a statistical model as well as arithmetic coding to compress DNA sequences. The proposed algorithm is compared with recent specialized compression tools for biological sequences. Experimental results show that the proposed algorithm has better compression gain than other existing algorithms. Copyright © 2014 Elsevier Inc. All rights reserved.
Single-cell sequencing technologies: current and future.
Liang, Jialong; Cai, Wanshi; Sun, Zhongsheng
2014-10-20
Intensively developed in the last few years, single-cell sequencing technologies now present numerous advantages over traditional sequencing methods for solving the problems of biological heterogeneity and low quantities of available biological materials. The application of single-cell sequencing technologies has profoundly changed our understanding of a series of biological phenomena, including gene transcription, embryo development, and carcinogenesis. However, before single-cell sequencing technologies can be used extensively, researchers face the serious challenge of overcoming inherent issues of high amplification bias, low accuracy and reproducibility. Here, we simply summarize the techniques used for single-cell isolation, and review the current technologies used in single-cell genomic, transcriptomic, and epigenomic sequencing. We discuss the merits, defects, and scope of application of single-cell sequencing technologies and then speculate on the direction of future developments. Copyright © 2014 Institute of Genetics and Developmental Biology, Chinese Academy of Sciences, and Genetics Society of China. Published by Elsevier Ltd. All rights reserved.
NASA Astrophysics Data System (ADS)
Mary, Michael Todd
High school students in the United States for the past century have typically taken science courses in a sequence of biology followed by chemistry and concluding with physics. An alternative sequence, typically referred to as "physics first", inverts the traditional sequence by having students begin with physics and end with biology. Proponents of physics first cite advances in biological sciences that have dramatically changed the nature of high school biology and the potential benefit to student learning in math that would accompany taking an algebra-based physics course in the early years of high school to support changing the sequence. Using a quasi-experimental, quantitative research design, the purpose of this study was to investigate the impact of science course sequencing on student achievement in math and science at a school district that offered both course sequences. The Texas state end-of-course exams in biology, chemistry, physics, algebra I and geometry were used as the instruments measuring student achievement in math and science at the end of each academic year. Various statistical models were used to analyze these achievement data. The conclusion was that, for students in this study, the sequence in which students took biology, chemistry, and physics had little or no impact on performance on the end-of-course assessments in each of these courses. Additionally, there was only a minimal effect found with respect to math performance, leading to the conclusion that neither the traditional nor the "physics first" science course sequence presented an advantage for student achievement in math or science.
Next-Generation Sequencing Platforms
NASA Astrophysics Data System (ADS)
Mardis, Elaine R.
2013-06-01
Automated DNA sequencing instruments embody an elegant interplay among chemistry, engineering, software, and molecular biology and have built upon Sanger's founding discovery of dideoxynucleotide sequencing to perform once-unfathomable tasks. Combined with innovative physical mapping approaches that helped to establish long-range relationships between cloned stretches of genomic DNA, fluorescent DNA sequencers produced reference genome sequences for model organisms and for the reference human genome. New types of sequencing instruments that permit amazing acceleration of data-collection rates for DNA sequencing have been developed. The ability to generate genome-scale data sets is now transforming the nature of biological inquiry. Here, I provide an historical perspective of the field, focusing on the fundamental developments that predated the advent of next-generation sequencing instruments and providing information about how these instruments work, their application to biological research, and the newest types of sequencers that can extract data from single DNA molecules.
Using cellular automata to generate image representation for biological sequences.
Xiao, X; Shao, S; Ding, Y; Huang, Z; Chen, X; Chou, K-C
2005-02-01
A novel approach to visualize biological sequences is developed based on cellular automata (Wolfram, S. Nature 1984, 311, 419-424), a set of discrete dynamical systems in which space and time are discrete. By transforming the symbolic sequence codes into digital codes, and using some optimal space-time evolvement rules of cellular automata, a biological sequence can be represented by a unique image, the so-called cellular automata image. Many important features, which are originally hidden in a long and complicated biological sequence, can be clearly revealed through its cellular automata image. With the number of biological sequences entering databanks rapidly increasing in the post-genomic era, it is anticipated that the cellular automata image will become a very useful vehicle for investigation into their key features, identification of their function, as well as revelation of their "fingerprint". It is anticipated that by using the concept of the pseudo amino acid composition (Chou, K.C. Proteins: Structure, Function, and Genetics, 2001, 43, 246-255), the cellular automata image approach can also be used to improve the quality of predicting protein attributes, such as structural class and subcellular location.
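The idea of turning a symbolic sequence into digital codes and evolving them with a cellular automaton can be sketched as follows. This is only an illustration: the 2-bit nucleotide coding, the choice of elementary rule 90, the periodic boundary, and the names `encode`, `step` and `ca_image` are all assumptions, not the coding or evolvement rules selected by the authors.

```python
from typing import List

# Assumed 2-bit digital code per nucleotide (hypothetical, for illustration only).
CODES = {"A": "00", "C": "01", "G": "10", "T": "11"}


def encode(seq: str) -> List[int]:
    """Translate a nucleotide string into a flat list of bits."""
    return [int(bit) for symbol in seq for bit in CODES[symbol]]


def step(cells: List[int], rule: int) -> List[int]:
    """Apply one step of an elementary cellular automaton with periodic boundaries."""
    n = len(cells)
    out = []
    for i in range(n):
        left, center, right = cells[(i - 1) % n], cells[i], cells[(i + 1) % n]
        neighbourhood = (left << 2) | (center << 1) | right
        # Bit `neighbourhood` of the Wolfram rule number gives the new cell state.
        out.append((rule >> neighbourhood) & 1)
    return out


def ca_image(seq: str, rule: int = 90, steps: int = 8) -> List[List[int]]:
    """Rows of the image: the encoded sequence followed by `steps` evolved rows."""
    rows = [encode(seq)]
    for _ in range(steps):
        rows.append(step(rows[-1], rule))
    return rows
```

Each row of the returned grid can be drawn as one line of black and white pixels, so the whole grid forms a two-dimensional picture of the sequence.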
NASA Astrophysics Data System (ADS)
Hurst March, Robin Denise
This investigation compared student achievement and attitudes toward science from three different sequencing approaches used in teaching biology to nonscience students. The three sequencing approaches were the lecture course only, lecture/laboratory courses taken together, and laboratory with previously taken lecture approach. The purposes of this study were to determine if (1) a relationship exists between the Attitude Towards Science in School Assessment (ATSSA) scores (Germann, 1988) and biology achievement, (2) a difference exists among the ATSSA scores and sequencing, (3) a difference exists among the biology achievement scores and sequencing, and (4) the ATSSA is a reliable instrument of science attitude assessment for the undergraduate students in an introductory biology nonmajors laboratory and lecture courses at a research I institution during the fall semester 1996. Fifty-four students comprised the lecture only group, 90 students comprised the lecture and laboratory taken together approach, and 23 students comprised the laboratory only approach. Research questions addressed were (1) What are the differences in student biology achievement as a function of the three different methods of instruction? (2) What are the differences in student attitude towards science as a function of the three different methods of instruction? (3) What is the relationship between post-attitude (ATSSA) and biology achievement for each of the three methods of instruction? An analysis of variance utilized the mean posttest scores on the ATSSA and mean achievement scores as the dependent variables. The independent variables were the three different sequences of enrollment in introductory biology. At the .05 level of significance, it was found that no significant difference existed between the ATSSA and laboratory/lecture sequence. At the .05 level of significance, it was found that no significant difference existed between achievement and laboratory/lecture sequence.
A Pearson product moment correlation was used to see if a relationship existed between posttest ATSSA scores and achievement totals in each sequence. A significant relationship was noted between the ATSSA and achievement in each sequence that involved a laboratory component.
Wan, Shixiang; Zou, Quan
2017-01-01
Multiple sequence alignment (MSA) plays a key role in biological sequence analyses, especially in phylogenetic tree construction. The extreme growth of next-generation sequencing has resulted in a shortage of efficient ultra-large biological sequence alignment approaches for coping with different sequence types. Distributed and parallel computing represents a crucial technique for accelerating ultra-large (e.g. files more than 1 GB) sequence analyses. Based on HAlign and the Spark distributed computing system, we implement a highly cost-efficient and time-efficient HAlign-II tool to address ultra-large multiple biological sequence alignment and phylogenetic tree construction. Experiments on large-scale DNA and protein data sets, with files larger than 1 GB, showed that HAlign-II could save time and space. It outperformed the current software tools. HAlign-II can efficiently carry out MSA and construct phylogenetic trees with ultra-large numbers of biological sequences. HAlign-II shows extremely high memory efficiency and scales well with increases in computing resources. HAlign-II provides a user-friendly web server based on our distributed computing infrastructure. HAlign-II with open-source codes and datasets was established at http://lab.malab.cn/soft/halign.
Biological sequence compression algorithms.
Matsumoto, T; Sadakane, K; Imai, H
2000-01-01
Today, more and more DNA sequences are becoming available. The information about DNA sequences is stored in molecular biology databases. The size and importance of these databases will grow in the future; therefore this information must be stored or communicated efficiently. Furthermore, sequence compression can be used to define similarities between biological sequences. Standard compression algorithms such as gzip or compress cannot compress DNA sequences; they only expand them in size. On the other hand, CTW (Context Tree Weighting Method) can compress DNA sequences to less than two bits per symbol. These algorithms do not use special structures of biological sequences. Two characteristic structures of DNA sequences are known. One is palindromes, also called reverse complements, and the other is approximate repeats. Several specific algorithms for DNA sequences that use these structures can compress them to less than two bits per symbol. In this paper, we improve the CTW so that it can exploit the characteristic structures of DNA sequences. Before encoding the next symbol, the algorithm searches for an approximate repeat and a palindrome using hashing and dynamic programming. If there is a palindrome or an approximate repeat of sufficient length, our algorithm represents it with a length and a distance. With this preprocessing, the new program achieves a slightly higher compression ratio than existing DNA-oriented compression algorithms. We also describe a new compression algorithm for protein sequences.
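The two-bits-per-symbol figure mentioned above is the cost of a plain fixed-length code over the four-letter DNA alphabet, and the "palindrome" structure refers to reverse complements. The sketch below shows both ideas. It is not the CTW-based method of the paper; the function names `pack`, `unpack` and `reverse_complement` are hypothetical, and only the unambiguous symbols A, C, G and T are handled.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def pack(seq: str) -> bytes:
    """Store four bases per byte (two bits each); the last byte is zero-padded."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        byte <<= 2 * (4 - len(chunk))  # left-align a short final chunk
        out.append(byte)
    return bytes(out)


def unpack(data: bytes, length: int) -> str:
    """Inverse of pack; `length` is needed to drop the padding."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    return "".join(bases[:length])


def reverse_complement(seq: str) -> str:
    """Complement every base and reverse the strand."""
    return seq.translate(COMPLEMENT)[::-1]
```

A compressor that exploits reverse complements would look for an earlier substring equal to `reverse_complement` of the upcoming text and, if it is long enough, emit a (length, distance) pair instead of the raw two-bit codes.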
ERIC Educational Resources Information Center
Gaubatz, Julie
2013-01-01
Studies of high-school science course sequences have been limited primarily to a small number of site-specific investigations comparing traditional science sequences (e.g., Biology-Chemistry-Physics: BCP) to various Physics First-influenced sequences (Physics-Chemistry-Biology: PCB). The present study summarizes a five-year program evaluation…
UltraPse: A Universal and Extensible Software Platform for Representing Biological Sequences.
Du, Pu-Feng; Zhao, Wei; Miao, Yang-Yang; Wei, Le-Yi; Wang, Likun
2017-11-14
With the avalanche of biological sequences in public databases, one of the most challenging problems in computational biology is to predict their biological functions and cellular attributes. Most of the existing prediction algorithms can only handle fixed-length numerical vectors. Therefore, it is important to be able to represent biological sequences with various lengths using fixed-length numerical vectors. Although several algorithms, as well as software implementations, have been developed to address this problem, these existing programs can only provide a fixed number of representation modes. Every time a new sequence representation mode is developed, a new program will be needed. In this paper, we propose the UltraPse as a universal software platform for this problem. The function of the UltraPse is not only to generate various existing sequence representation modes, but also to simplify all future programming works in developing novel representation modes. The extensibility of UltraPse is particularly enhanced. It allows the users to define their own representation mode, their own physicochemical properties, or even their own types of biological sequences. Moreover, UltraPse is also the fastest software of its kind. The source code package, as well as the executables for both Linux and Windows platforms, can be downloaded from the GitHub repository.
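To make the notion of a fixed-length numerical representation concrete, here is a minimal sketch of one common representation mode, normalized k-mer composition. It does not use UltraPse or reproduce its interface; the function name `kmer_composition` and its parameters are assumptions made for illustration.

```python
from itertools import product
from typing import List


def kmer_composition(seq: str, k: int = 2, alphabet: str = "ACGT") -> List[float]:
    """Map a sequence of any length to a vector of len(alphabet)**k frequencies."""
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # skip windows containing symbols outside the alphabet
            counts[word] += 1
            total += 1
    # Normalize so that sequences of different lengths are comparable.
    return [counts[w] / total if total else 0.0 for w in kmers]
```

Sequences of 8 or 8,000 residues both yield a 16-dimensional vector for `k=2` over a four-letter alphabet, which is what fixed-input predictors require.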
Meinicke, Peter; Tech, Maike; Morgenstern, Burkhard; Merkl, Rainer
2004-01-01
Background Kernel-based learning algorithms are among the most advanced machine learning methods and have been successfully applied to a variety of sequence classification tasks within the field of bioinformatics. Conventional kernels utilized so far do not provide an easy interpretation of the learnt representations in terms of positional and compositional variability of the underlying biological signals. Results We propose a kernel-based approach to datamining on biological sequences. With our method it is possible to model and analyze positional variability of oligomers of any length in a natural way. On one hand this is achieved by mapping the sequences to an intuitive but high-dimensional feature space, well-suited for interpretation of the learnt models. On the other hand, by means of the kernel trick we can provide a general learning algorithm for that high-dimensional representation because all required statistics can be computed without performing an explicit feature space mapping of the sequences. By introducing a kernel parameter that controls the degree of position-dependency, our feature space representation can be tailored to the characteristics of the biological problem at hand. A regularized learning scheme enables application even to biological problems for which only small sets of example sequences are available. Our approach includes a visualization method for transparent representation of characteristic sequence features. Thereby importance of features can be measured in terms of discriminative strength with respect to classification of the underlying sequences. To demonstrate and validate our concept on a biochemically well-defined case, we analyze E. coli translation initiation sites in order to show that we can find biologically relevant signals. For that case, our results clearly show that the Shine-Dalgarno sequence is the most important signal upstream of a start codon.
The variability in position and composition we found for that signal is in accordance with previous biological knowledge. We also find evidence for signals downstream of the start codon, previously introduced as transcriptional enhancers. These signals are mainly characterized by occurrences of adenine in a region of about 4 nucleotides next to the start codon. Conclusions We showed that the oligo kernel can provide a valuable tool for the analysis of relevant signals in biological sequences. In the case of translation initiation sites we could clearly deduce the most discriminative motifs and their positional variation from example sequences. Attractive features of our approach are its flexibility with respect to oligomer length and position conservation. By means of these two parameters oligo kernels can easily be adapted to different biological problems. PMID:15511290
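A position-dependent oligomer kernel of the kind described above can be sketched as follows. This is a simplified illustration, assuming that each pair of identical oligomers contributes a Gaussian weight of its positional difference and that `sigma` plays the role of the parameter controlling position-dependency; the exact form, the normalization, and the names `oligo_positions` and `oligo_kernel` are assumptions rather than the authors' implementation.

```python
import math
from collections import defaultdict
from typing import Dict, List


def oligo_positions(seq: str, k: int) -> Dict[str, List[int]]:
    """Index every length-k oligomer of `seq` by its start positions."""
    positions: Dict[str, List[int]] = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    return positions


def oligo_kernel(s: str, t: str, k: int = 3, sigma: float = 1.0) -> float:
    """Sum Gaussian weights over all pairs of identical oligomers in s and t.

    Small sigma: only oligomers at (almost) the same position count.
    Large sigma: position matters little and the kernel approaches a
    plain product of oligomer counts.
    """
    pos_s, pos_t = oligo_positions(s, k), oligo_positions(t, k)
    total = 0.0
    for oligo, starts_s in pos_s.items():
        for p in starts_s:
            for q in pos_t.get(oligo, ()):
                total += math.exp(-((p - q) ** 2) / (4.0 * sigma ** 2))
    return total
```

Because the value is computed directly from oligomer positions, no explicit high-dimensional feature vector is ever built, which is the point of the kernel trick mentioned in the abstract.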
IBS: an illustrator for the presentation and visualization of biological sequences.
Liu, Wenzhong; Xie, Yubin; Ma, Jiyong; Luo, Xiaotong; Nie, Peng; Zuo, Zhixiang; Lahrmann, Urs; Zhao, Qi; Zheng, Yueyuan; Zhao, Yong; Xue, Yu; Ren, Jian
2015-10-15
Biological sequence diagrams are fundamental for visualizing various functional elements in protein or nucleotide sequences that enable a summarization and presentation of existing information as well as means of intuitive new discoveries. Here, we present a software package called illustrator of biological sequences (IBS) that can be used for representing the organization of either protein or nucleotide sequences in a convenient, efficient and precise manner. Multiple options are provided in IBS, and biological sequences can be manipulated, recolored or rescaled in a user-defined mode. Also, the final representational artwork can be directly exported into a publication-quality figure. The standalone package of IBS was implemented in JAVA, while the online service was implemented in HTML5 and JavaScript. Both the standalone package and online service are freely available at http://ibs.biocuckoo.org. Contact: renjian.sysu@gmail.com or xueyu@hust.edu.cn Supplementary information: Supplementary data are available at Bioinformatics online. © The Author 2015. Published by Oxford University Press.
IBS: an illustrator for the presentation and visualization of biological sequences
Liu, Wenzhong; Xie, Yubin; Ma, Jiyong; Luo, Xiaotong; Nie, Peng; Zuo, Zhixiang; Lahrmann, Urs; Zhao, Qi; Zheng, Yueyuan; Zhao, Yong; Xue, Yu; Ren, Jian
2015-01-01
Summary: Biological sequence diagrams are fundamental for visualizing various functional elements in protein or nucleotide sequences that enable a summarization and presentation of existing information as well as means of intuitive new discoveries. Here, we present a software package called illustrator of biological sequences (IBS) that can be used for representing the organization of either protein or nucleotide sequences in a convenient, efficient and precise manner. Multiple options are provided in IBS, and biological sequences can be manipulated, recolored or rescaled in a user-defined mode. Also, the final representational artwork can be directly exported into a publication-quality figure. Availability and implementation: The standalone package of IBS was implemented in JAVA, while the online service was implemented in HTML5 and JavaScript. Both the standalone package and online service are freely available at http://ibs.biocuckoo.org. Contact: renjian.sysu@gmail.com or xueyu@hust.edu.cn Supplementary information: Supplementary data are available at Bioinformatics online. PMID:26069263
A Personal Journey of Discovery: Developing Technology and Changing Biology
NASA Astrophysics Data System (ADS)
Hood, Lee
2008-07-01
This autobiographical article describes my experiences in developing chemically based, biological technologies for deciphering biological information: DNA, RNA, proteins, interactions, and networks. The instruments developed include protein and DNA sequencers and synthesizers, as well as ink-jet technology for synthesizing DNA chips. Diverse new strategies for doing biology also arose from novel applications of these instruments. The functioning of these instruments can be integrated to generate powerful new approaches to cloning and characterizing genes from a small amount of protein sequence or to using gene sequences to synthesize peptide fragments so as to characterize various properties of the proteins. I also discuss the five paradigm changes in which I have participated: the development and integration of biological instrumentation; the human genome project; cross-disciplinary biology; systems biology; and predictive, personalized, preventive, and participatory (P4) medicine. Finally, I discuss the origins, the philosophy, some accomplishments, and the future trajectories of the Institute for Systems Biology.
Use of Internet Resources in the Biology Lecture Classroom.
ERIC Educational Resources Information Center
Francis, Joseph W.
2000-01-01
Introduces internet resources that are available for instructional use in biology classrooms. Provides information on video-based technologies to create and capture video sequences, interactive web sites that allow interaction with biology simulations, online texts, and interactive videos that display animated video sequences. (YDS)
Guzzi, Pietro Hiram; Milenkovic, Tijana
2018-05-01
Analogous to genomic sequence alignment that allows for across-species transfer of biological knowledge between conserved sequence regions, biological network alignment can be used to guide the knowledge transfer between conserved regions of molecular networks of different species. Hence, biological network alignment can be used to redefine the traditional notion of a sequence-based homology to a new notion of network-based homology. Analogous to genomic sequence alignment, there exist local and global biological network alignments. Here, we survey prominent and recent computational approaches of each network alignment type and discuss their (dis)advantages. Then, as it was recently shown that the two approach types are complementary, in the sense that they capture different slices of cellular functioning, we discuss the need to reconcile the two network alignment types and present a recent first step in this direction. We conclude with some open research problems on this topic and comment on the usefulness of network alignment in other domains besides computational biology.
Efficient Mining of Interesting Patterns in Large Biological Sequences
Rashid, Md. Mamunur; Karim, Md. Rezaul; Jeong, Byeong-Soo
2012-01-01
Pattern discovery in biological sequences (e.g., DNA sequences) is one of the most challenging tasks in computational biology and bioinformatics. So far, in most approaches, the number of occurrences is a major measure of determining whether a pattern is interesting or not. In computational biology, however, a pattern that is not frequent may still be considered very informative if its actual support frequency exceeds the prior expectation by a large margin. In this paper, we propose a new interesting measure that can provide meaningful biological information. We also propose an efficient index-based method for mining such interesting patterns. Experimental results show that our approach can find interesting patterns within an acceptable computation time. PMID:23105928
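The idea that a rare pattern can still be informative when its observed support far exceeds its prior expectation can be illustrated with a small sketch. This is not the measure proposed by the authors, which the abstract does not specify; it assumes a simple background model in which sequence positions are independent draws from the sequence's own base frequencies. The function names (`count_overlapping`, `expected_count`, `surprise_ratio`) are hypothetical.

```python
from collections import Counter


def count_overlapping(seq: str, pattern: str) -> int:
    """Count occurrences of pattern in seq, allowing overlaps."""
    count = 0
    start = seq.find(pattern)
    while start != -1:
        count += 1
        start = seq.find(pattern, start + 1)
    return count


def expected_count(seq: str, pattern: str) -> float:
    """Expected occurrences under an assumed i.i.d. model built from seq's base frequencies."""
    n = len(seq)
    if n < len(pattern):
        return 0.0
    freqs = Counter(seq)
    prob = 1.0
    for ch in pattern:
        prob *= freqs[ch] / n
    return prob * (n - len(pattern) + 1)


def surprise_ratio(seq: str, pattern: str) -> float:
    """Observed support divided by expected support (0.0 if the pattern cannot occur)."""
    exp = expected_count(seq, pattern)
    if exp == 0.0:
        return 0.0
    return count_overlapping(seq, pattern) / exp
```

Under this assumption, a ratio well above 1 marks a pattern that occurs more often than the background model predicts, even if its absolute count is small.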
Efficient mining of interesting patterns in large biological sequences.
Rashid, Md Mamunur; Karim, Md Rezaul; Jeong, Byeong-Soo; Choi, Ho-Jin
2012-03-01
Pattern discovery in biological sequences (e.g., DNA sequences) is one of the most challenging tasks in computational biology and bioinformatics. So far, in most approaches, the number of occurrences is a major measure of determining whether a pattern is interesting or not. In computational biology, however, a pattern that is not frequent may still be considered very informative if its actual support frequency exceeds the prior expectation by a large margin. In this paper, we propose a new interesting measure that can provide meaningful biological information. We also propose an efficient index-based method for mining such interesting patterns. Experimental results show that our approach can find interesting patterns within an acceptable computation time.
A biological compression model and its applications.
Cao, Minh Duc; Dix, Trevor I; Allison, Lloyd
2011-01-01
A biological compression model, expert model, is presented which is superior to existing compression algorithms in both compression performance and speed. The model is able to compress whole eukaryotic genomes. Most importantly, the model provides a framework for knowledge discovery from biological data. It can be used for repeat element discovery, sequence alignment and phylogenetic analysis. We demonstrate that the model can handle statistically biased sequences and distantly related sequences where conventional knowledge discovery tools often fail.
Olson, Nathan D.; Lund, Steven P.; Zook, Justin M.; Rojas-Cornejo, Fabiola; Beck, Brian; Foy, Carole; Huggett, Jim; Whale, Alexandra S.; Sui, Zhiwei; Baoutina, Anna; Dobeson, Michael; Partis, Lina; Morrow, Jayne B.
2015-01-01
This study presents the results from an interlaboratory sequencing study for which we developed a novel high-resolution method for comparing data from different sequencing platforms for a multi-copy, paralogous gene. The combination of PCR amplification and 16S ribosomal RNA gene (16S rRNA) sequencing has revolutionized bacteriology by enabling rapid identification, frequently without the need for culture. To assess variability between laboratories in sequencing 16S rRNA, six laboratories sequenced the gene encoding the 16S rRNA from Escherichia coli O157:H7 strain EDL933 and Listeria monocytogenes serovar 4b strain NCTC11994. Participants performed sequencing methods and protocols available in their laboratories: Sanger sequencing, Roche 454 pyrosequencing®, or Ion Torrent PGM®. The sequencing data were evaluated on three levels: (1) identity of biologically conserved position, (2) ratio of 16S rRNA gene copies featuring identified variants, and (3) the collection of variant combinations in a set of 16S rRNA gene copies. The same set of biologically conserved positions was identified for each sequencing method. Analytical methods using Bayesian and maximum likelihood statistics were developed to estimate variant copy ratios, which describe the ratio of nucleotides at each identified biologically variable position, as well as the likely set of variant combinations present in 16S rRNA gene copies. Our results indicate that estimated variant copy ratios at biologically variable positions were only reproducible for high throughput sequencing methods. Furthermore, the likely variant combination set was only reproducible with increased sequencing depth and longer read lengths. We also demonstrate novel methods for evaluating variable positions when comparing multi-copy gene sequence data from multiple laboratories generated using multiple sequencing technologies. PMID:27077030
Liu, Fei; Wu, Xiao-Li; Liu, Ying; Chen, Da-Xia; Zhang, De-Li; Yang, Da-Jian
2016-02-01
Isaria farinosa is a pathogen of the host of Ophiocordyceps sinensis. This study analyzed progress in the molecular biology of I. farinosa using bibliometrics and the sequences (including gene sequences) of I. farinosa deposited in NCBI. The results indicated that countries differed in the number of papers published and in the kinds and numbers of sequences (including gene sequences) deposited. China had published the most papers and deposited the most sequences (including gene sequences). America had deposited the most functional genes. Studies of the pathogen focused mainly on biological control, and molecular studies concentrated on phylogenetic classification. In recent years some protease genes and chitinase genes have been studied. As the effect of I. farinosa on the health of O. sinensis increases, and as its whole sequence and more of its pharmacological activities become publicly known, research on the molecular biology of I. farinosa is expected to become deeper and broader. Copyright© by the Chinese Pharmaceutical Association.
Genome sequences of three strains of Aspergillus flavus for the biological control of Aflatoxin
USDA-ARS's Scientific Manuscript database
The genomes of three strains of Aspergillus flavus with demonstrated utility for the biological control of aflatoxin were sequenced. These sequences were assembled with MIRA and annotated with Augustus using A. flavus strain 3357 (NCBI EQ963472) as a reference. Each strain had a genome of 36.3 to ...
SNAD: Sequence Name Annotation-based Designer.
Sidorov, Igor A; Reshetov, Denis A; Gorbalenya, Alexander E
2009-08-14
A growing diversity of biological data is tagged with unique identifiers (UIDs) associated with polynucleotides and proteins to ensure efficient computer-mediated data storage, maintenance, and processing. These identifiers, which are not informative for most people, are often substituted by biologically meaningful names in various presentations to facilitate utilization and dissemination of sequence-based knowledge. This substitution is commonly done manually, which can be tedious and prone to mistakes and omissions. Here we introduce SNAD (Sequence Name Annotation-based Designer), which mediates automatic conversion of sequence UIDs (associated with a multiple alignment or phylogenetic tree, or supplied as a plain text list) into biologically meaningful names and acronyms. This conversion is directed by precompiled or user-defined templates that exploit the wealth of annotation available in cognate entries of external databases. Using examples, we demonstrate how this tool can be used to generate names for practical purposes, particularly in virology. A tool for controllable annotation-based conversion of sequence UIDs into biologically meaningful names and acronyms has been developed and placed into service, fostering links between quality of sequence annotation, and efficiency of communication and knowledge dissemination among researchers.
Parallel algorithms for large-scale biological sequence alignment on Xeon-Phi based clusters.
Lan, Haidong; Chan, Yuandong; Xu, Kai; Schmidt, Bertil; Peng, Shaoliang; Liu, Weiguo
2016-07-19
Computing alignments between two or more sequences is a common operation in computational molecular biology. The continuing growth of biological sequence databases establishes the need for efficient parallel implementations of these operations on modern accelerators. This paper presents new approaches to high performance biological sequence database scanning with the Smith-Waterman algorithm and the first stage of progressive multiple sequence alignment based on the ClustalW heuristic on a Xeon Phi-based compute cluster. Our approach uses a three-level parallelization scheme to take full advantage of the compute power available on this type of architecture; i.e. cluster-level data parallelism, thread-level coarse-grained parallelism, and vector-level fine-grained parallelism. Furthermore, we re-organize the sequence datasets and use Xeon Phi shuffle operations to improve I/O efficiency. Evaluations show that our method achieves a peak overall performance up to 220 GCUPS for scanning real protein sequence databanks on a single node consisting of two Intel E5-2620 CPUs and two Intel Xeon Phi 7110P cards. It also exhibits good scalability in terms of sequence length and size, and number of compute nodes for both database scanning and multiple sequence alignment. Furthermore, the achieved performance is highly competitive in comparison to optimized Xeon Phi and GPU implementations. Our implementation is available at https://github.com/turbo0628/LSDBS-mpi .
Yeast Genomics for Bread, Beer, Biology, Bucks and Breath
NASA Astrophysics Data System (ADS)
Sakharkar, Kishore R.; Sakharkar, Meena K.
The rapid advances and scale up of projects in DNA sequencing during the past two decades have produced complete genome sequences of several eukaryotic species. The versatile genetic malleability of the yeast, and the high degree of conservation between its cellular processes and those of human cells have made it a model of choice for pioneering research in molecular and cell biology. The complete sequence of yeast genome has proven to be extremely useful as a reference towards the sequences of human and for providing systems to explore key gene functions. Yeast has been a ‘legendary model’ for new technologies and gaining new biological insights into basic biological sciences and biotechnology. This chapter describes the awesome power of yeast genetics, genomics and proteomics in understanding of biological function. The applications of yeast as a screening tool to the field of drug discovery and development are highlighted and the traditional importance of yeast for bakers and brewers is discussed.
Wright, Nicholas J.D.; Alston, Gregory L.
2015-01-01
Objective. To design and assess a horizontally integrated biological sciences course sequence and to determine its effectiveness in imparting the foundational science knowledge necessary to successfully progress through the pharmacy school curriculum and produce competent pharmacy school graduates. Design. A 2-semester course sequence integrated principles from several basic science disciplines: biochemistry, molecular biology, cellular biology, anatomy, physiology, and pathophysiology. Each is a 5-credit course taught 5 days per week, with 50-minute class periods. Assessment. Achievement of outcomes was determined with course examinations, student lecture, and an annual skills mastery assessment. The North American Pharmacist Licensure Examination (NAPLEX) results were used as an indicator of competency to practice pharmacy. Conclusion. Students achieved course objectives and program level outcomes. The biological sciences integrated course sequence was successful in providing students with foundational basic science knowledge required to progress through the pharmacy program and to pass the NAPLEX. The percentage of the school’s students who passed the NAPLEX was not statistically different from the national percentage. PMID:26430276
Gholizadeh, S; Firooziyan, S; Ladonni, H; Hajipirloo, H Mohammadzadeh; Djadid, N Dinparast; Hosseini, A; Raz, A
2015-06-01
Anopheles (Cellia) stephensi Liston 1901 is known as an Asian malaria vector. Three biological forms, namely "mysorensis", "intermediate", and "type" have been earlier reported in this species. Nevertheless, the present morphological and molecular information is insufficient to diagnose these forms. During this investigation, An. stephensi biological forms were morphologically identified and sequenced for odorant-binding protein 1 (Obp1) gene. Also, intron I sequences were used to construct phylogenetic trees. Despite nucleotide sequence variation in exon of AsteObp1, nearly 100% identity was observed at the amino acid level among the three biological forms. To overcome the difficulties of using egg morphology characters, intron I sequences of An. stephensi Obp1 open a new molecular route to identifying the biological forms of the main Asian malaria vector. However, multidisciplinary studies are needed to establish the taxonomic status of An. stephensi. Copyright © 2015 Elsevier B.V. All rights reserved.
ERIC Educational Resources Information Center
Browning, Mark
The purpose of the research was to manipulate two aspects of genetics instruction in order to measure their effects on college, introductory biology students' achievement in genetics. One instructional sequence that was used dealt first with monohybrid autosomal inheritance patterns, then sex-linkage. The alternate sequence was the reverse.…
The Importance of Biological Databases in Biological Discovery.
Baxevanis, Andreas D; Bateman, Alex
2015-06-19
Biological databases play a central role in bioinformatics. They offer scientists the opportunity to access a wide variety of biologically relevant data, including the genomic sequences of an increasingly broad range of organisms. This unit provides a brief overview of major sequence databases and portals, such as GenBank, the UCSC Genome Browser, and Ensembl. Model organism databases, including WormBase, The Arabidopsis Information Resource (TAIR), and those made available through the Mouse Genome Informatics (MGI) resource, are also covered. Non-sequence-centric databases, such as Online Mendelian Inheritance in Man (OMIM), the Protein Data Bank (PDB), MetaCyc, and the Kyoto Encyclopedia of Genes and Genomes (KEGG), are also discussed. Copyright © 2015 John Wiley & Sons, Inc.
The technology and biology of single-cell RNA sequencing.
Kolodziejczyk, Aleksandra A; Kim, Jong Kyoung; Svensson, Valentine; Marioni, John C; Teichmann, Sarah A
2015-05-21
The differences between individual cells can have profound functional consequences, in both unicellular and multicellular organisms. Recently developed single-cell mRNA-sequencing methods enable unbiased, high-throughput, and high-resolution transcriptomic analysis of individual cells. This provides an additional dimension to transcriptomic information relative to traditional methods that profile bulk populations of cells. Already, single-cell RNA-sequencing methods have revealed new biology in terms of the composition of tissues, the dynamics of transcription, and the regulatory relationships between genes. Rapid technological developments at the level of cell capture, phenotyping, molecular biology, and bioinformatics promise an exciting future with numerous biological and medical applications. Copyright © 2015 Elsevier Inc. All rights reserved.
Protein location prediction using atomic composition and global features of the amino acid sequence
DOE Office of Scientific and Technical Information (OSTI.GOV)
Cherian, Betsy Sheena, E-mail: betsy.skb@gmail.com; Nair, Achuthsankar S.
2010-01-22
Subcellular location of protein is constructive information in determining its function, screening for drug candidates, vaccine design, annotation of gene products and in selecting relevant proteins for further studies. Computational prediction of subcellular localization deals with predicting the location of a protein from its amino acid sequence. For a computational localization prediction method to be more accurate, it should exploit all possible relevant biological features that contribute to the subcellular localization. In this work, we extracted the biological features from the full length protein sequence to incorporate more biological information. A new biological feature, distribution of atomic composition, is effectively used with multiple physiochemical properties, amino acid composition, three part amino acid composition, and sequence similarity for predicting the subcellular location of the protein. Support Vector Machines are designed for four modules and prediction is made by a weighted voting system. Our system makes predictions with accuracies of 100, 82.47 and 88.81 for the self-consistency test, jackknife test and independent data test respectively. Our results provide evidence that the prediction based on the biological features derived from the full length amino acid sequence gives better accuracy than those derived from N-terminal alone. Considering the features as a distribution within the entire sequence will bring out underlying property distribution to a greater detail to enhance the prediction accuracy.
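Two of the ingredients named above, amino acid composition as a sequence feature and a weighted vote across prediction modules, can be sketched as follows. This is an illustrative sketch only: the abstract does not give the feature definitions, module weights or SVM settings, so the module names, labels and weights used here are hypothetical, and the classifiers themselves are omitted.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(seq: str) -> list:
    """Fraction of each standard residue in seq (non-standard symbols are ignored)."""
    seq = seq.upper()
    counts = [seq.count(a) for a in AMINO_ACIDS]
    total = sum(counts)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [c / total for c in counts]


def weighted_vote(predictions: dict, weights: dict) -> str:
    """Combine per-module labels; each module adds its (assumed) weight to its label."""
    scores = {}
    for module, label in predictions.items():
        scores[label] = scores.get(label, 0.0) + weights[module]
    return max(scores, key=scores.get)
```

For example, if two lower-weight modules agree on one location and a single higher-weight module predicts another, the vote goes to whichever label accumulates the larger total weight.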
ERIC Educational Resources Information Center
Soto, Julio G.; Everhart, Jerry
2016-01-01
Biology faculty at San José State University developed, piloted, implemented, and assessed a freshmen course sequence based on the macro-to micro-teaching approach that was team-taught, and organized around unifying themes. Content learning assessment drove the conceptual framework of our course sequence. Content student learning increased…
Bettenbühl, Mario; Rusconi, Marco; Engbert, Ralf; Holschneider, Matthias
2012-01-01
Complex biological dynamics often generate sequences of discrete events which can be described as a Markov process. The order of the underlying Markovian stochastic process is fundamental for characterizing statistical dependencies within sequences. As an example for this class of biological systems, we investigate the Markov order of sequences of microsaccadic eye movements from human observers. We calculate the integrated likelihood of a given sequence for various orders of the Markov process and use this in a Bayesian framework for statistical inference on the Markov order. Our analysis shows that data from most participants are best explained by a first-order Markov process. This is compatible with recent findings of a statistical coupling of subsequent microsaccade orientations. Our method might prove to be useful for a broad class of biological systems.
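The general idea of comparing Markov orders by integrated likelihood can be sketched as below. This is not the authors' procedure: it assumes a symmetric Dirichlet prior (parameter `alpha`) on each context's transition probabilities, for which the integrated likelihood has a closed form, and a uniform prior over orders, so that the order with the highest evidence is selected. All names are hypothetical.

```python
from collections import defaultdict
from math import lgamma


def log_evidence(seq, order: int, alphabet_size: int, start: int,
                 alpha: float = 1.0) -> float:
    """Log integrated likelihood of seq[start:] under a Markov chain of the given order.

    Requires start >= order. Transition probabilities for each context are
    integrated out against a symmetric Dirichlet(alpha) prior (an assumption).
    """
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(start, len(seq)):
        counts[tuple(seq[i - order:i])][seq[i]] += 1
    total = 0.0
    for ctx_counts in counts.values():
        n = sum(ctx_counts.values())
        total += lgamma(alphabet_size * alpha) - lgamma(alphabet_size * alpha + n)
        for c in ctx_counts.values():
            total += lgamma(alpha + c) - lgamma(alpha)
    return total


def best_order(seq, max_order: int, alpha: float = 1.0) -> int:
    """Order in 0..max_order with the highest evidence (uniform prior over orders)."""
    k = len(set(seq))
    # Every order is scored on the same positions so the evidences are comparable.
    return max(range(max_order + 1),
               key=lambda m: log_evidence(seq, m, k, max_order, alpha))
```

Scoring all orders from the same starting position (`max_order`) keeps the comparison fair, since a higher-order model would otherwise be evaluated on fewer symbols.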
Computing Platforms for Big Biological Data Analytics: Perspectives and Challenges.
Yin, Zekun; Lan, Haidong; Tan, Guangming; Lu, Mian; Vasilakos, Athanasios V; Liu, Weiguo
2017-01-01
The last decade has witnessed an explosion in the amount of available biological sequence data, due to the rapid progress of high-throughput sequencing projects. However, the biological data amount is becoming so great that traditional data analysis platforms and methods can no longer meet the need to rapidly perform data analysis tasks in life sciences. As a result, both biologists and computer scientists are facing the challenge of gaining a profound insight into the deepest biological functions from big biological data. This in turn requires massive computational resources. Therefore, high performance computing (HPC) platforms are highly needed as well as efficient and scalable algorithms that can take advantage of these platforms. In this paper, we survey the state-of-the-art HPC platforms for big biological data analytics. We first list the characteristics of big biological data and popular computing platforms. Then we provide a taxonomy of different biological data analysis applications and a survey of the way they have been mapped onto various computing platforms. After that, we present a case study to compare the efficiency of different computing platforms for handling the classical biological sequence alignment problem. At last we discuss the open issues in big biological data analytics.
Ferles, Christos; Beaufort, William-Scott; Ferle, Vanessa
2017-01-01
The present study devises mapping methodologies and projection techniques that visualize and demonstrate biological sequence data clustering results. The Sequence Data Density Display (SDDD) and Sequence Likelihood Projection (SLP) visualizations represent the input symbolical sequences in a lower-dimensional space in such a way that the clusters and relations of data elements are depicted graphically. Both operate in combination/synergy with the Self-Organizing Hidden Markov Model Map (SOHMMM). The resulting unified framework is in position to analyze automatically and directly raw sequence data. This analysis is carried out with little, or even complete absence of, prior information/domain knowledge.
ERIC Educational Resources Information Center
Cline, Erica; Gogarten, Jennifer
2012-01-01
We describe a laboratory exercise developed for the cell and molecular biology quarter of a year-long majors' undergraduate introductory biology sequence. In an analysis of salmon samples collected by students in their local stores and restaurants, DNA sequencing and phylogenetic analysis were used to detect market substitution of Atlantic salmon…
Ramu, Chenna
2003-07-01
SIRW (http://sirw.embl.de/) is a World Wide Web interface to the Simple Indexing and Retrieval System (SIR) that is capable of parsing and indexing various flat file databases. In addition it provides a framework for doing sequence analysis (e.g. motif pattern searches) for selected biological sequences through keyword search. SIRW is an ideal tool for the bioinformatics community for searching as well as analyzing biological sequences of interest.
Galbadrakh, Bulgan; Lee, Kyung-Eun; Park, Hyun-Seok
2012-12-01
Grammatical inference methods are expected to find grammatical structures hidden in biological sequences. One hopes that studies of grammar serve as an appropriate tool for theory formation. Thus, we have developed JSequitur for automatically generating the grammatical structure of biological sequences in an inference framework of string compression algorithms. Our original motivation was to find any grammatical traits of several cancer genes that can be detected by string compression algorithms. Through this research, we could not find any meaningful unique traits of the cancer genes yet, but we could observe some interesting traits in regards to the relationship among gene length, similarity of sequences, the patterns of the generated grammar, and compression rate.
Biological intuition in alignment-free methods: response to Posada.
Ragan, Mark A; Chan, Cheong Xin
2013-08-01
A recent editorial in Journal of Molecular Evolution highlights opportunities and challenges facing molecular evolution in the era of next-generation sequencing. Abundant sequence data should allow more-complex models to be fit at higher confidence, making phylogenetic inference more reliable and improving our understanding of evolution at the molecular level. However, concern that approaches based on multiple sequence alignment may be computationally infeasible for large datasets is driving the development of so-called alignment-free methods for sequence comparison and phylogenetic inference. The recent editorial characterized these approaches as model-free, not based on the concept of homology, and lacking in biological intuition. We argue here that alignment-free methods have not abandoned models or homology, and can be biologically intuitive.
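As a concrete illustration of what an alignment-free comparison can look like, the sketch below compares two sequences through their k-mer count profiles. It is one simple member of this family of methods, chosen here as an assumption for illustration; the article does not prescribe this particular distance, and the function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def kmer_profile(seq: str, k: int) -> Counter:
    """Counts of all overlapping substrings of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def cosine_distance(p: Counter, q: Counter) -> float:
    """1 minus the cosine similarity of two k-mer count profiles."""
    dot = sum(p[w] * q[w] for w in p.keys() & q.keys())
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    if norm == 0:
        return 1.0  # at least one profile is empty
    return 1.0 - dot / norm
```

No alignment is computed at any point: sequences that share many k-mers in similar proportions receive a small distance, and a matrix of such distances could then feed a distance-based tree-building method.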
BAC sequencing using pooled methods.
Saski, Christopher A; Feltus, F Alex; Parida, Laxmi; Haiminen, Niina
2015-01-01
Shotgun sequencing and assembly of a large, complex genome can be expensive, and accurately reconstructing the true genome sequence can be challenging. Repetitive DNA arrays, paralogous sequences, polyploidy, and heterozygosity are the main factors that plague de novo genome sequencing projects, which typically result in highly fragmented assemblies from which it is difficult to extract biological meaning. Targeted, sub-genomic sequencing offers complexity reduction by removing distal segments of the genome and a systematic mechanism for exploring prioritized genomic content through BAC sequencing. If one isolates and sequences the genome fraction that encodes the relevant biological information, then it is possible to reduce overall sequencing costs and efforts that target a genomic segment. This chapter describes the sub-genome assembly protocol for an organism based upon a BAC tiling path derived from a genome-scale physical map or from fine mapping using BACs to target sub-genomic regions. Methods that are described include BAC isolation and mapping, DNA sequencing, and sequence assembly.
Lee, Byungwook; Kim, Taehyung; Kim, Seon-Kyu; Lee, Kwang H.; Lee, Doheon
2007-01-01
With the advent of automated and high-throughput techniques, the number of patent applications containing biological sequences has been increasing rapidly. However, they have attracted relatively little attention compared to other sequence resources. We have built a database server called Patome, which contains biological sequence data disclosed in patents and published applications, as well as their analysis information. The analysis is divided into two steps. The first is an annotation step in which the disclosed sequences were annotated with RefSeq database. The second is an association step where the sequences were linked to Entrez Gene, OMIM and GO databases, and their results were saved as a gene-patent table. From the analysis, we found that 55% of human genes were associated with patenting. The gene-patent table can be used to identify whether a particular gene or disease is related to patenting. Patome is available at http://www.patome.org/; the information is updated bimonthly. PMID:17085479
Prediction of phenotypes of missense mutations in human proteins from biological assemblies.
Wei, Qiong; Xu, Qifang; Dunbrack, Roland L
2013-02-01
Single nucleotide polymorphisms (SNPs) are the most frequent variation in the human genome. Nonsynonymous SNPs that lead to missense mutations can be neutral or deleterious, and several computational methods have been presented that predict the phenotype of human missense mutations. These methods use sequence-based and structure-based features in various combinations, relying on different statistical distributions of these features for deleterious and neutral mutations. One structure-based feature that has not been studied significantly is the accessible surface area within biologically relevant oligomeric assemblies. These assemblies are different from the crystallographic asymmetric unit for more than half of X-ray crystal structures. We find that mutations in the core of proteins or in the interfaces in biological assemblies are significantly more likely to be disease-associated than those on the surface of the biological assemblies. For structures with more than one protein in the biological assembly (whether the same sequence or different), we find the accessible surface area from biological assemblies provides a statistically significant improvement in prediction over the accessible surface area of monomers from protein crystal structures (P = 6e-5). When adding this information to sequence-based features such as the difference between wildtype and mutant position-specific profile scores, the improvement from biological assemblies is statistically significant but much smaller (P = 0.018). Combining this information with sequence-based features in a support vector machine leads to 82% accuracy on a balanced dataset of 50% disease-associated mutations from SwissVar and 50% neutral mutations from human/primate sequence differences in orthologous proteins. Copyright © 2012 Wiley Periodicals, Inc.
ERIC Educational Resources Information Center
Batzli, Janet M.
2005-01-01
''Why four semesters? How does this track differ from the two-semester course sequence?'' These are the most common questions students have when they learn about the Biology Core Curriculum (Biocore), a unique four-semester honors biology sequence at University of Wisconsin-Madison (UW-Madison). Biocore was first taught at University of Wisconsin…
Nonparametric Combinatorial Sequence Models
NASA Astrophysics Data System (ADS)
Wauthier, Fabian L.; Jordan, Michael I.; Jojic, Nebojsa
This work considers biological sequences that exhibit combinatorial structures in their composition: groups of positions of the aligned sequences are "linked" and covary as one unit across sequences. If multiple such groups exist, complex interactions can emerge between them. Sequences of this kind arise frequently in biology but methodologies for analyzing them are still being developed. This paper presents a nonparametric prior on sequences which allows combinatorial structures to emerge and which induces a posterior distribution over factorized sequence representations. We carry out experiments on three sequence datasets which indicate that combinatorial structures are indeed present and that combinatorial sequence models can more succinctly describe them than simpler mixture models. We conclude with an application to MHC binding prediction which highlights the utility of the posterior distribution induced by the prior. By integrating out the posterior our method compares favorably to leading binding predictors.
String Mining in Bioinformatics
NASA Astrophysics Data System (ADS)
Abouelhoda, Mohamed; Ghanem, Moustafa
Sequence analysis is a major area in bioinformatics encompassing the methods and techniques for studying the biological sequences, DNA, RNA, and proteins, on the linear structure level. The focus of this area is generally on the identification of intra- and inter-molecular similarities. Identifying intra-molecular similarities boils down to detecting repeated segments within a given sequence, while identifying inter-molecular similarities amounts to spotting common segments among two or multiple sequences. From a data mining point of view, sequence analysis is nothing but string- or pattern mining specific to biological strings. For a long time, however, this point of view has been explicitly embraced neither in the data mining nor in the sequence analysis textbooks, which may be attributed to the co-evolution of the two apparently independent fields. In other words, although the word "data-mining" is almost missing in the sequence analysis literature, its basic concepts have been implicitly applied. Interestingly, recent research in biological sequence analysis introduced efficient solutions to many problems in data mining, such as querying and analyzing time series [49,53], extracting information from web pages [20], fighting spam mails [50], detecting plagiarism [22], and spotting duplications in software systems [14].
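The two tasks described above, detecting a repeated segment within one sequence and spotting a common segment between two sequences, can be sketched with simple string routines. These are deliberately naive illustrations with hypothetical function names, not the methods covered in the chapter; practical tools typically rely on suffix trees or suffix arrays for large inputs.

```python
def longest_repeated_substring(s: str) -> str:
    """Longest segment occurring at least twice in s (occurrences may overlap)."""
    # Naive approach: sort all suffixes and compare lexicographic neighbours.
    suffixes = sorted(s[i:] for i in range(len(s)))
    best = ""
    for x, y in zip(suffixes, suffixes[1:]):
        n = 0
        while n < min(len(x), len(y)) and x[n] == y[n]:
            n += 1
        if n > len(best):
            best = x[:n]
    return best


def longest_common_substring(a: str, b: str) -> str:
    """Longest contiguous segment shared by a and b (dynamic programming)."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1  # extend the match ending at (i-1, j-1)
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]
```

The first function addresses intra-molecular similarity and the second inter-molecular similarity; both run in quadratic time or worse, which is acceptable only for short sequences.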
Korber, B T; Osmanov, S; Esparza, J; Myers, G
1994-11-01
The World Health Organization Global Programme on AIDS (WHO/GPA) is conducting a large-scale collaborative study of human immunodeficiency virus type 1 (HIV-1) variation, based in four potential vaccine-trial site countries: Brazil, Rwanda, Thailand, and Uganda. Through the course of this study, it was crucial to keep track of certain attributes of the samples from which the viral nucleotide sequences were derived (e.g., country of origin and viral culture characterization), so that meaningful sequence comparisons could be made. Here we describe a system developed in the context of the WHO/GPA study that summarizes such critical attributes by representing them as standardized characters directly incorporated into sequence names. This nomenclature allows linkage of clinical, phenotypic, and geographic information with molecular data. We propose that other investigators involved in human immunodeficiency virus (HIV) nucleotide sequencing efforts adopt a similar standardized sequence nomenclature to facilitate cross-study sequence comparison. HIV sequence data are being generated at an ever-increasing rate; directly coupled to this increase is our deepening understanding of biological parameters that influence or result from sequence variability. A standardized sequence nomenclature that includes relevant biological information would enable researchers to better utilize the growing body of sequence data, and enhance their ability to interpret the biological implications of their own data through facilitating comparisons with previously published work.
Cost-effectiveness of biological treatment sequences for fistulising Crohn’s disease across Europe
Baji, Petra; Gulácsi, László; Brodszky, Valentin; Végh, Zsuzsanna; Danese, Silvio; Irving, Peter M; Peyrin-Biroulet, Laurent; Schreiber, Stefan; Rencz, Fanni; Lakatos, Péter L; Péntek, Márta
2017-01-01
Background In clinical practice, treatment sequences of biologicals are applied for active fistulising Crohn’s disease; however, underlying health economic analyses are lacking. Objective The purpose of this study was to analyse the cost-effectiveness of different biological sequences including infliximab, biosimilar-infliximab, adalimumab and vedolizumab in nine European countries. Methods A Markov model was developed to compare treatment sequences of one, two and three biologicals from the payer’s perspective on a five-year time horizon. Data on effectiveness and health state utilities were obtained from the literature. Country-specific costs were considered. Calculations were performed with both official list prices and estimated real prices of biologicals. Results Biosimilar-infliximab is the most cost-effective treatment against standard care across the countries (with list prices: €34684–€72551/quality adjusted life year; with estimated real prices: €24364–€56086/quality adjusted life year). The most cost-effective two-agent sequence, except for Germany, is the biosimilar-infliximab–adalimumab therapy compared with single biosimilar-infliximab (with list prices: €58533–€133831/quality adjusted life year; with estimated real prices: €45513–€105875/quality adjusted life year). The cost-effectiveness of the biosimilar-infliximab–adalimumab–vedolizumab three-agent sequence compared with biosimilar-infliximab–adalimumab is €87214–€152901/quality adjusted life year. Conclusions The suggested first-choice biological treatment is biosimilar-infliximab. In case of treatment failure, switching to adalimumab then to vedolizumab provides meaningful additional health gains but at increased costs. Inter-country differences in cost-effectiveness are remarkable due to significant differences in costs. PMID:29511561
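The results above are reported in euros per quality adjusted life year, which is the usual unit of an incremental cost-effectiveness ratio (ICER). The Python sketch below shows that arithmetic only. It assumes the standard ICER definition (difference in cost divided by difference in QALYs); the function name `icer` and all input numbers are hypothetical and are not taken from the study's Markov model.

```python
def icer(cost_new, qaly_new, cost_ref, qaly_ref):
    """Incremental cost-effectiveness ratio of a new strategy versus a reference strategy.

    Returns the extra cost paid per extra quality adjusted life year (QALY) gained.
    """
    delta_qaly = qaly_new - qaly_ref
    if delta_qaly <= 0:
        # With no health gain the ratio is not meaningful as a cost per QALY gained.
        raise ValueError("new strategy must yield more QALYs than the reference")
    return (cost_new - cost_ref) / delta_qaly


if __name__ == "__main__":
    # Hypothetical five-year totals per patient (illustrative values only).
    print(icer(cost_new=50_000, qaly_new=3.0, cost_ref=20_000, qaly_ref=2.5))
```

In a sequence comparison, the reference strategy would be the shorter sequence (for example one agent) and the new strategy the longer one.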
DLocalMotif: a discriminative approach for discovering local motifs in protein sequences.
Mehdi, Ahmed M; Sehgal, Muhammad Shoaib B; Kobe, Bostjan; Bailey, Timothy L; Bodén, Mikael
2013-01-01
Local motifs are patterns of DNA or protein sequences that occur within a sequence interval relative to a biologically defined anchor or landmark. Current protein motif discovery methods do not adequately consider such constraints to identify biologically significant motifs that are only weakly over-represented but spatially confined. Using negatives, i.e. sequences known to not contain a local motif, can further increase the specificity of their discovery. This article introduces the method DLocalMotif that makes use of positional information and negative data for local motif discovery in protein sequences. DLocalMotif combines three scoring functions, measuring degrees of motif over-representation, entropy and spatial confinement, specifically designed to discriminatively exploit the availability of negative data. The method is shown to outperform current methods that use only a subset of these motif characteristics. We apply the method to several biological datasets. The analysis of peroxisomal targeting signals uncovers several novel motifs that occur immediately upstream of the dominant peroxisomal targeting signal-1 signal. The analysis of proline-tyrosine nuclear localization signals uncovers multiple novel motifs that overlap with C2H2 zinc finger domains. We also evaluate the method on classical nuclear localization signals and endoplasmic reticulum retention signals and find that DLocalMotif successfully recovers biologically relevant sequence properties. http://bioinf.scmb.uq.edu.au/dlocalmotif/
Introduction to Single-Cell RNA Sequencing.
Olsen, Thale Kristin; Baryawno, Ninib
2018-04-01
During the last decade, high-throughput sequencing methods have revolutionized the entire field of biology. The opportunity to study entire transcriptomes in great detail using RNA sequencing (RNA-seq) has fueled many important discoveries and is now a routine method in biomedical research. However, RNA-seq is typically performed in "bulk," and the data represent an average of gene expression patterns across thousands to millions of cells; this might obscure biologically relevant differences between cells. Single-cell RNA-seq (scRNA-seq) represents an approach to overcome this problem. By isolating single cells, capturing their transcripts, and generating sequencing libraries in which the transcripts are mapped to individual cells, scRNA-seq allows assessment of fundamental biological properties of cell populations and biological systems at unprecedented resolution. Here, we present the most common scRNA-seq protocols in use today and the basics of data analysis and discuss factors that are important to consider before planning and designing an scRNA-seq project. Copyright © 2018 John Wiley & Sons, Inc.
Introduction to bioinformatics.
Can, Tolga
2014-01-01
Bioinformatics is an interdisciplinary field mainly involving molecular biology and genetics, computer science, mathematics, and statistics. Data intensive, large-scale biological problems are addressed from a computational point of view. The most common problems are modeling biological processes at the molecular level and making inferences from collected data. A bioinformatics solution usually involves the following steps: Collect statistics from biological data. Build a computational model. Solve a computational modeling problem. Test and evaluate a computational algorithm. This chapter gives a brief introduction to bioinformatics by first providing an introduction to biological terminology and then discussing some classical bioinformatics problems organized by the types of data sources. Sequence analysis is the analysis of DNA and protein sequences for clues regarding function and includes subproblems such as identification of homologs, multiple sequence alignment, searching sequence patterns, and evolutionary analyses. Protein structures are three-dimensional data and the associated problems are structure prediction (secondary and tertiary), analysis of protein structures for clues regarding function, and structural alignment. Gene expression data is usually represented as matrices and analysis of microarray data mostly involves statistics analysis, classification, and clustering approaches. Biological networks such as gene regulatory networks, metabolic pathways, and protein-protein interaction networks are usually modeled as graphs and graph theoretic approaches are used to solve associated problems such as construction and analysis of large-scale networks.
The transcriptome of Lutzomyia longipalpis (Diptera: Psychodidae) male reproductive organs.
Azevedo, Renata V D M; Dias, Denise B S; Bretãs, Jorge A C; Mazzoni, Camila J; Souza, Nataly A; Albano, Rodolpho M; Wagner, Glauber; Davila, Alberto M R; Peixoto, Alexandre A
2012-01-01
It has been suggested that genes involved in the reproductive biology of insect disease vectors are potential targets for future alternative methods of control. Little is known about the molecular biology of reproduction in phlebotomine sand flies and there is no information available concerning genes that are expressed in male reproductive organs of Lutzomyia longipalpis, the main vector of American visceral leishmaniasis and a species complex. We generated 2678 high quality ESTs ("Expressed Sequence Tags") of L. longipalpis male reproductive organs that were grouped in 1391 non-redundant sequences (1136 singlets and 255 clusters). BLAST analysis revealed that only 57% of these sequences share similarity with a L. longipalpis female EST database. Although no more than 36% of the non-redundant sequences showed similarity to protein sequences deposited in databases, more than half of them presented the best-match hits with mosquito genes. Gene ontology analysis identified subsets of genes involved in biological processes such as protein biosynthesis and DNA replication, which are probably associated with spermatogenesis. A number of non-redundant sequences were also identified as putative male reproductive gland proteins (mRGPs), also known as male accessory gland protein genes (Acps). The transcriptome analysis of L. longipalpis male reproductive organs is one step further in the study of the molecular basis of the reproductive biology of this important species complex. It has allowed the identification of genes potentially involved in spermatogenesis as well as putative mRGPs sequences, which have been studied in many insect species because of their effects on female post-mating behavior and physiology and their potential role in sexual selection and speciation. These data open a number of new avenues for further research in the molecular and evolutionary reproductive biology of sand flies.
The Transcriptome of Lutzomyia longipalpis (Diptera: Psychodidae) Male Reproductive Organs
Bretãs, Jorge A. C.; Mazzoni, Camila J.; Souza, Nataly A.; Albano, Rodolpho M.; Wagner, Glauber; Davila, Alberto M. R.; Peixoto, Alexandre A.
2012-01-01
Background It has been suggested that genes involved in the reproductive biology of insect disease vectors are potential targets for future alternative methods of control. Little is known about the molecular biology of reproduction in phlebotomine sand flies and there is no information available concerning genes that are expressed in male reproductive organs of Lutzomyia longipalpis, the main vector of American visceral leishmaniasis and a species complex. Methods/Principal Findings We generated 2678 high quality ESTs (“Expressed Sequence Tags”) of L. longipalpis male reproductive organs that were grouped in 1391 non-redundant sequences (1136 singlets and 255 clusters). BLAST analysis revealed that only 57% of these sequences share similarity with a L. longipalpis female EST database. Although no more than 36% of the non-redundant sequences showed similarity to protein sequences deposited in databases, more than half of them presented the best-match hits with mosquito genes. Gene ontology analysis identified subsets of genes involved in biological processes such as protein biosynthesis and DNA replication, which are probably associated with spermatogenesis. A number of non-redundant sequences were also identified as putative male reproductive gland proteins (mRGPs), also known as male accessory gland protein genes (Acps). Conclusions The transcriptome analysis of L. longipalpis male reproductive organs is one step further in the study of the molecular basis of the reproductive biology of this important species complex. It has allowed the identification of genes potentially involved in spermatogenesis as well as putative mRGPs sequences, which have been studied in many insect species because of their effects on female post-mating behavior and physiology and their potential role in sexual selection and speciation. These data open a number of new avenues for further research in the molecular and evolutionary reproductive biology of sand flies. PMID:22496818
Modeling Structure-Function Relationships in Synthetic DNA Sequences using Attribute Grammars
Cai, Yizhi; Lux, Matthew W.; Adam, Laura; Peccoud, Jean
2009-01-01
Recognizing that certain biological functions can be associated with specific DNA sequences has led various fields of biology to adopt the notion of the genetic part. This concept provides a finer level of granularity than the traditional notion of the gene. However, a method of formally relating how a set of parts relates to a function has not yet emerged. Synthetic biology both demands such a formalism and provides an ideal setting for testing hypotheses about relationships between DNA sequences and phenotypes beyond the gene-centric methods used in genetics. Attribute grammars are used in computer science to translate the text of a program source code into the computational operations it represents. By associating attributes with parts, modifying the value of these attributes using rules that describe the structure of DNA sequences, and using a multi-pass compilation process, it is possible to translate DNA sequences into molecular interaction network models. These capabilities are illustrated by simple example grammars expressing how gene expression rates are dependent upon single or multiple parts. The translation process is validated by systematically generating, translating, and simulating the phenotype of all the sequences in the design space generated by a small library of genetic parts. Attribute grammars represent a flexible framework connecting parts with models of biological function. They will be instrumental for building mathematical models of libraries of genetic constructs synthesized to characterize the function of genetic parts. This formalism is also expected to provide a solid foundation for the development of computer assisted design applications for synthetic biology. PMID:19816554
Analysis of plant microbe interactions in the era of next generation sequencing technologies
Knief, Claudia
2014-01-01
Next generation sequencing (NGS) technologies have impressively accelerated research in biological science in recent years by enabling the production of large volumes of sequence data at a drastically lower price per base, compared to traditional sequencing methods. The recent and ongoing developments in the field allow addressing research questions in plant-microbe biology that were not conceivable just a few years ago. The present review provides an overview of NGS technologies and their usefulness for the analysis of microorganisms that live in association with plants. Possible limitations of the different sequencing systems, in particular sources of errors and bias, are critically discussed and methods are presented that help to overcome these shortcomings. A focus will be on the application of NGS methods in metagenomic studies, including the analysis of microbial communities by amplicon sequencing, which can be considered as a targeted metagenomic approach. Different applications of NGS technologies are exemplified by selected research articles that address the biology of the plant associated microbiota to demonstrate the worth of the new methods. PMID:24904612
Wang, Edwin; Zou, Jinfeng; Zaman, Naif; Beitel, Lenore K; Trifiro, Mark; Paliouras, Miltiadis
2013-08-01
Recent tumor genome sequencing confirmed that one tumor often consists of multiple cell subpopulations (clones) which bear different, but related, genetic profiles such as mutation and copy number variation profiles. Thus far, one tumor has been viewed as a whole entity in cancer functional studies. With the advances of genome sequencing and computational analysis, we are able to quantify and computationally dissect clones from tumors, and then conduct clone-based analysis. Emerging technologies such as single-cell genome sequencing and RNA-Seq could profile tumor clones. Thus, we should reconsider how to conduct cancer systems biology studies in the genome sequencing era. We will outline new directions for conducting cancer systems biology by considering that genome sequencing technology can be used for dissecting, quantifying and genetically characterizing clones from tumors. Topics discussed in Part 1 of this review include computationally quantifying of tumor subpopulations; clone-based network modeling, cancer hallmark-based networks and their high-order rewiring principles and the principles of cell survival networks of fast-growing clones. Crown Copyright © 2013. Published by Elsevier Ltd. All rights reserved.
Ma, Lijun; Lee, Letitia; Barani, Igor; Hwang, Andrew; Fogh, Shannon; Nakamura, Jean; McDermott, Michael; Sneed, Penny; Larson, David A; Sahgal, Arjun
2011-11-21
Rapid delivery of multiple shots or isocenters is one of the hallmarks of Gamma Knife radiosurgery. In this study, we investigated whether the temporal order of shots delivered with Gamma Knife Perfexion would significantly influence the biological equivalent dose for complex multi-isocenter treatments. Twenty single-target cases were selected for analysis. For each case, 3D dose matrices of individual shots were extracted and single-fraction equivalent uniform dose (sEUD) values were determined for all possible shot delivery sequences, corresponding to different patterns of temporal dose delivery within the target. We found significant variations in the sEUD values among these sequences exceeding 15% for certain cases. However, the sequences for the actual treatment delivery were found to agree (<3%) and to correlate (R² = 0.98) excellently with the sequences yielding the maximum sEUD values for all studied cases. This result is applicable for both fast and slow growing tumors with α/β values of 2 to 20 according to the linear-quadratic model. In conclusion, despite large potential variations in different shot sequences for multi-isocenter Gamma Knife treatments, current clinical delivery sequences exhibited consistent biological target dosing that approached that maximally achievable for all studied cases.
Function-Based Algorithms for Biological Sequences
ERIC Educational Resources Information Center
Mohanty, Pragyan Sheela P.
2015-01-01
Two problems at two different abstraction levels of computational biology are studied. At the molecular level, efficient pattern matching algorithms in DNA sequences are presented. For gene order data, an efficient data structure is presented capable of storing all gene re-orderings in a systematic manner. A common characteristic of presented…
Finding the missing honey bee genes: lessons learned from a genome upgrade
USDA-ARS's Scientific Manuscript database
The first generation of genome sequence assemblies and annotations have had a significant impact upon our understanding of the biology of the sequenced species, the phylogenetic relationships among species, the study of populations within and across species, and have informed the biology of humans. ...
NullSeq: A Tool for Generating Random Coding Sequences with Desired Amino Acid and GC Contents.
Liu, Sophia S; Hockenberry, Adam J; Lancichinetti, Andrea; Jewett, Michael C; Amaral, Luís A N
2016-11-01
The existence of over- and under-represented sequence motifs in genomes provides evidence of selective evolutionary pressures on biological mechanisms such as transcription, translation, ligand-substrate binding, and host immunity. In order to accurately identify motifs and other genome-scale patterns of interest, it is essential to be able to generate accurate null models that are appropriate for the sequences under study. While many tools have been developed to create random nucleotide sequences, protein coding sequences are subject to a unique set of constraints that complicates the process of generating appropriate null models. There are currently no tools available that allow users to create random coding sequences with specified amino acid composition and GC content for the purpose of hypothesis testing. Using the principle of maximum entropy, we developed a method that generates unbiased random sequences with pre-specified amino acid and GC content, which we have developed into a python package. Our method is the simplest way to obtain maximally unbiased random sequences that are subject to GC usage and primary amino acid sequence constraints. Furthermore, this approach can easily be expanded to create unbiased random sequences that incorporate more complicated constraints such as individual nucleotide usage or even di-nucleotide frequencies. The ability to generate correctly specified null models will allow researchers to accurately identify sequence motifs which will lead to a better understanding of biological processes as well as more effective engineering of biological systems.
Wright, Imogen A.; Travers, Simon A.
2014-01-01
The challenge presented by high-throughput sequencing necessitates the development of novel tools for accurate alignment of reads to reference sequences. Current approaches focus on using heuristics to map reads quickly to large genomes, rather than generating highly accurate alignments in coding regions. Such approaches are, thus, unsuited for applications such as amplicon-based analysis and the realignment phase of exome sequencing and RNA-seq, where accurate and biologically relevant alignment of coding regions is critical. To facilitate such analyses, we have developed a novel tool, RAMICS, that is tailored to mapping large numbers of sequence reads to short lengths (<10 000 bp) of coding DNA. RAMICS utilizes profile hidden Markov models to discover the open reading frame of each sequence and aligns to the reference sequence in a biologically relevant manner, distinguishing between genuine codon-sized indels and frameshift mutations. This approach facilitates the generation of highly accurate alignments, accounting for the error biases of the sequencing machine used to generate reads, particularly at homopolymer regions. Performance improvements are gained through the use of graphics processing units, which increase the speed of mapping through parallelization. RAMICS substantially outperforms all other mapping approaches tested in terms of alignment quality while maintaining highly competitive speed performance. PMID:24861618
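The distinction drawn above between codon-sized indels and frameshift mutations can be shown with a very small Python sketch. It assumes that "codon-sized" means an indel whose length is a multiple of three bases, so the reading frame is preserved. The helper `classify_indel` is hypothetical and is not how RAMICS works internally, since the abstract states that the tool uses profile hidden Markov models.

```python
def classify_indel(length):
    """Classify an insertion or deletion in coding DNA by its length in bases.

    Assumption: a length divisible by 3 keeps the reading frame (codon-sized),
    any other length shifts the frame downstream of the indel.
    """
    if length <= 0:
        raise ValueError("indel length must be a positive number of bases")
    return "codon-sized indel" if length % 3 == 0 else "frameshift"


if __name__ == "__main__":
    for n in (1, 2, 3, 6):
        print(n, classify_indel(n))
```

Single-base indels at homopolymer runs are typical sequencing errors, which is why an aligner that reports them as real frameshifts can mislead downstream analysis of coding regions.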
Metagenomic ventures into outer sequence space.
Dutilh, Bas E
Sequencing DNA or RNA directly from the environment often results in many sequencing reads that have no homologs in the database. These are referred to as "unknowns," and reflect the vast unexplored microbial sequence space of our biosphere, also known as "biological dark matter." However, unknowns also exist because metagenomic datasets are not optimally mined. There is a pressure on researchers to publish and move on, and the unknown sequences are often left for what they are, and conclusions drawn based on reads with annotated homologs. This can cause abundant and widespread genomes to be overlooked, such as the recently discovered human gut bacteriophage crAssphage. The unknowns may be enriched for bacteriophage sequences, the most abundant and genetically diverse component of the biosphere and of sequence space. However, it remains an open question, what is the actual size of biological sequence space? The de novo assembly of shotgun metagenomes is the most powerful tool to address this question.
Robust k-mer frequency estimation using gapped k-mers
Ghandi, Mahmoud; Mohammad-Noori, Morteza
2013-01-01
Oligomers of fixed length, k, commonly known as k-mers, are often used as fundamental elements in the description of DNA sequence features of diverse biological function, or as intermediate elements in the construction of more complex descriptors of sequence features such as position weight matrices. k-mers are very useful as general sequence features because they constitute a complete and unbiased feature set, and do not require parameterization based on incomplete knowledge of biological mechanisms. However, a fundamental limitation in the use of k-mers as sequence features is that as k is increased, larger spatial correlations in DNA sequence elements can be described, but the frequency of observing any specific k-mer becomes very small, and rapidly approaches a sparse matrix of binary counts. Thus any statistical learning approach using k-mers will be susceptible to noisy estimation of k-mer frequencies once k becomes large. Because all molecular DNA interactions have limited spatial extent, gapped k-mers often carry the relevant biological signal. Here we use gapped k-mer counts to more robustly estimate the ungapped k-mer frequencies, by deriving an equation for the minimum norm estimate of k-mer frequencies given an observed set of gapped k-mer frequencies. We demonstrate that this approach provides a more accurate estimate of the k-mer frequencies in real biological sequences using a sample of CTCF binding sites in the human genome. PMID:23861010
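To make the contrast between ungapped and gapped k-mers concrete, the Python sketch below counts both kinds of feature in a short sequence. It assumes that a gapped k-mer is a window of a given length in which k positions are informative and the rest are wildcards, written here as `-`. The function names are hypothetical, and the sketch covers counting only, not the minimum norm estimator derived in the paper.

```python
from collections import Counter
from itertools import combinations


def kmer_counts(seq, k):
    """Count every contiguous (ungapped) k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def gapped_kmer_counts(seq, length, k):
    """Count gapped k-mers: windows of `length` bases with k informative positions.

    Non-informative positions are replaced by '-', so several different
    windows can contribute to the same gapped k-mer.
    """
    counts = Counter()
    for i in range(len(seq) - length + 1):
        window = seq[i:i + length]
        for keep in combinations(range(length), k):
            keep_set = set(keep)
            pattern = "".join(c if j in keep_set else "-" for j, c in enumerate(window))
            counts[pattern] += 1
    return counts


if __name__ == "__main__":
    seq = "ACGACT"
    print(kmer_counts(seq, 3))
    print(gapped_kmer_counts(seq, 3, 2))
```

In the example, the windows `ACG` and `ACT` are distinct 3-mers but both contribute to the gapped feature `AC-`, which is the pooling effect that makes gapped counts less sparse.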
Robust k-mer frequency estimation using gapped k-mers.
Ghandi, Mahmoud; Mohammad-Noori, Morteza; Beer, Michael A
2014-08-01
Oligomers of fixed length, k, commonly known as k-mers, are often used as fundamental elements in the description of DNA sequence features of diverse biological function, or as intermediate elements in the construction of more complex descriptors of sequence features such as position weight matrices. k-mers are very useful as general sequence features because they constitute a complete and unbiased feature set, and do not require parameterization based on incomplete knowledge of biological mechanisms. However, a fundamental limitation in the use of k-mers as sequence features is that as k is increased, larger spatial correlations in DNA sequence elements can be described, but the frequency of observing any specific k-mer becomes very small, and rapidly approaches a sparse matrix of binary counts. Thus any statistical learning approach using k-mers will be susceptible to noisy estimation of k-mer frequencies once k becomes large. Because all molecular DNA interactions have limited spatial extent, gapped k-mers often carry the relevant biological signal. Here we use gapped k-mer counts to more robustly estimate the ungapped k-mer frequencies, by deriving an equation for the minimum norm estimate of k-mer frequencies given an observed set of gapped k-mer frequencies. We demonstrate that this approach provides a more accurate estimate of the k-mer frequencies in real biological sequences using a sample of CTCF binding sites in the human genome.
DNA capture elements for rapid detection and identification of biological agents
NASA Astrophysics Data System (ADS)
Kiel, Johnathan L.; Parker, Jill E.; Holwitt, Eric A.; Vivekananda, Jeeva
2004-08-01
DNA capture elements (DCEs; aptamers) are artificial DNA sequences, from a random pool of sequences, selected for their specific binding to potential biological warfare agents. These sequences were selected by an affinity method using filters to which the target agent was attached and the DNA isolated and amplified by polymerase chain reaction (PCR) in an iterative, increasingly stringent, process. Reporter molecules were attached to the finished sequences. To date, we have made DCEs to Bacillus anthracis spores, Shiga toxin, Venezuelan Equine Encephalitis (VEE) virus, and Francisella tularensis. These DCEs have demonstrated specificity and sensitivity equal to or better than antibody.
Modular protein domains: an engineering approach toward functional biomaterials.
Lin, Charng-Yu; Liu, Julie C
2016-08-01
Protein domains and peptide sequences are a powerful tool for conferring specific functions to engineered biomaterials. Protein sequences with a wide variety of functionalities, including structure, bioactivity, protein-protein interactions, and stimuli responsiveness, have been identified, and advances in molecular biology continue to pinpoint new sequences. Protein domains can be combined to make recombinant proteins with multiple functionalities. The high fidelity of the protein translation machinery results in exquisite control over the sequence of recombinant proteins and the resulting properties of protein-based materials. In this review, we discuss protein domains and peptide sequences in the context of functional protein-based materials, composite materials, and their biological applications. Copyright © 2016 Elsevier Ltd. All rights reserved.
Bastien, Olivier; Maréchal, Eric
2008-08-07
Confidence in pairwise alignments of biological sequences, obtained by various methods such as Blast or Smith-Waterman, is critical for automatic analyses of genomic data. Two statistical models have been proposed. In the asymptotic limit of long sequences, the Karlin-Altschul model is based on the computation of a P-value, assuming that the number of high scoring matching regions above a threshold is Poisson distributed. Alternatively, the Lipman-Pearson model is based on the computation of a Z-value from a random score distribution obtained by a Monte-Carlo simulation. Z-values allow the deduction of an upper bound of the P-value (1/Z-value^2) following the TULIP theorem. Simulated Z-value distributions are known to fit a Gumbel law. This remarkable property was not demonstrated and had no obvious biological support. We built a model of evolution of sequences based on aging, as meant in Reliability Theory, using the fact that the amount of information shared between an initial sequence and the sequences in its lineage (i.e., mutual information in Information Theory) is a decreasing function of time. This quantity is simply measured by a sequence alignment score. In systems aging, the failure rate is related to the system's longevity. The system can be a machine with structured components, or a living entity or population. "Reliability" refers to the ability to operate properly according to a standard. Here, the "reliability" of a sequence refers to the ability to conserve a sufficient functional level at the folded and maturated protein level (positive selection pressure). Homologous sequences were considered as systems 1) having a high redundancy of information reflected by the magnitude of their alignment scores, 2) whose components are the amino acids that can independently be damaged by random DNA mutations.
From these assumptions, we deduced that information shared at each amino acid position evolved with a constant rate, corresponding to the information hazard rate, and that pairwise sequence alignment scores should follow a Gumbel distribution, whose parameters could be given a theoretical rationale. In particular, one parameter corresponds to the information hazard rate. Extreme value distribution of alignment scores, assessed from high-scoring segment pairs following the Karlin-Altschul model, can also be deduced from the Reliability Theory applied to molecular sequences. It reflects the redundancy of information between homologous sequences, under functional conservative pressure. This model also provides a link between concepts of biological sequence analysis and of systems biology.
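The Z-value and the 1/Z^2 bound mentioned in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the scoring function `match_score` is a hypothetical stand-in (ungapped identity count) for a real Blast or Smith-Waterman score, and the function names, the number of shuffles and the seed are assumptions chosen for illustration only.

```python
import random
import statistics

def match_score(a, b):
    """Toy score (assumption): number of identical positions, no gaps."""
    return sum(x == y for x, y in zip(a, b))

def z_value(seq_a, seq_b, n_shuffles=200, seed=0):
    """Z-value of the observed score against scores of shuffled copies of seq_b."""
    rng = random.Random(seed)
    observed = match_score(seq_a, seq_b)
    letters = list(seq_b)
    scores = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)  # Monte-Carlo step: random permutation of seq_b
        scores.append(match_score(seq_a, letters))
    mean = statistics.fmean(scores)
    sd = statistics.stdev(scores)
    if sd == 0:
        raise ValueError("shuffled scores have zero variance")
    return (observed - mean) / sd

def p_value_upper_bound(z):
    """Upper bound 1/Z^2 on the P-value, capped at 1 when it is uninformative."""
    if z <= 1:
        return 1.0
    return 1.0 / (z * z)
```

For example, `p_value_upper_bound(z_value(a, b))` gives a conservative significance estimate for the toy score of `a` against `b`; with a real alignment program only `match_score` would need to be swapped out.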
Teaching Biology around Themes: Teach Proteins and DNA Together.
ERIC Educational Resources Information Center
Offner, Susan
1992-01-01
Proposes as a unifying theme for high school biology the question of "how chromosomes determine what we are." Describes a sequence of lessons in which students learn about proteins, enzymes, and amino acids. Includes three dry laboratory exercises to demonstrate the DNA sequences for sickle cell anemia and cystic fibrosis. (MDH)
Targeted enrichment strategies for next-generation plant biology
Richard Cronn; Brian J. Knaus; Aaron Liston; Peter J. Maughan; Matthew Parks; John V. Syring; Joshua Udall
2012-01-01
The dramatic advances offered by modern DNA sequencers continue to redefine the limits of what can be accomplished in comparative plant biology. Even with recent achievements, however, plant genomes present obstacles that can make it difficult to execute large-scale population and phylogenetic studies on next-generation sequencing platforms. Factors like large genome...
Enabling plant synthetic biology through genome engineering.
Baltes, Nicholas J; Voytas, Daniel F
2015-02-01
Synthetic biology seeks to create new biological systems, including user-designed plants and plant cells. These systems can be employed for a variety of purposes, ranging from producing compounds of industrial or therapeutic value, to reducing crop losses by altering cellular responses to pathogens or climate change. To realize the full potential of plant synthetic biology, techniques are required that provide control over the genetic code - enabling targeted modifications to DNA sequences within living plant cells. Such control is now within reach owing to recent advances in the use of sequence-specific nucleases to precisely engineer genomes. We discuss here the enormous potential provided by genome engineering for plant synthetic biology. Copyright © 2014 Elsevier Ltd. All rights reserved.
The computational linguistics of biological sequences
DOE Office of Scientific and Technical Information (OSTI.GOV)
Searls, D.
1995-12-31
This tutorial was one of eight tutorials selected to be presented at the Third International Conference on Intelligent Systems for Molecular Biology which was held in the United Kingdom from July 16 to 19, 1995. Protein sequences are analogous in many respects, particularly their folding behavior. Proteins have a much richer variety of interactions, but in theory the same linguistic principles could come to bear in describing dependencies between distant residues that arise by virtue of three-dimensional structure. This tutorial will concentrate on nucleic acid sequences.
Makiguchi, Wataru; Tanabe, Junki; Yamada, Hidekazu; Iida, Hiroki; Taura, Daisuke; Ousaka, Naoki; Yashima, Eiji
2015-01-01
Self-recognition and self-discrimination within complex mixtures are of fundamental importance in biological systems, which entirely rely on the preprogrammed monomer sequences and homochirality of biological macromolecules. Here we report artificial chirality- and sequence-selective successive self-sorting of chiral dimeric strands bearing carboxylic acid or amidine groups joined by chiral amide linkers with different sequences through homo- and complementary-duplex formations. A mixture of carboxylic acid dimers linked by racemic-1,2-cyclohexane bis-amides with different amide sequences (NHCO or CONH) self-associate to form homoduplexes in a completely sequence-selective way, the structures of which are different from each other depending on the linker amide sequences. The further addition of an enantiopure amide-linked amidine dimer to a mixture of the racemic carboxylic acid dimers resulted in the formation of a single optically pure complementary duplex with a 100% diastereoselectivity and complete sequence specificity stabilized by the amidinium–carboxylate salt bridges, leading to the perfect chirality- and sequence-selective duplex formation. PMID:26051291
Protein interface classification by evolutionary analysis
2012-01-01
Background Distinguishing biologically relevant interfaces from lattice contacts in protein crystals is a fundamental problem in structural biology. Despite efforts towards the computational prediction of interface character, many issues are still unresolved. Results We present here a protein-protein interface classifier that relies on evolutionary data to detect the biological character of interfaces. The classifier uses a simple geometric measure, number of core residues, and two evolutionary indicators based on the sequence entropy of homolog sequences. Both aim at detecting differential selection pressure between interface core and rim or rest of surface. The core residues, defined as fully buried residues (>95% burial), appear to be fundamental determinants of biological interfaces: their number is in itself a powerful discriminator of interface character and together with the evolutionary measures it is able to clearly distinguish evolved biological contacts from crystal ones. We demonstrate that this definition of core residues leads to distinctively better results than earlier definitions from the literature. The stringent selection and quality filtering of structural and sequence data was key to the success of the method. Most importantly we demonstrate that a more conservative selection of homolog sequences - with relatively high sequence identities to the query - is able to produce a clearer signal than previous attempts. Conclusions An evolutionary approach like the one presented here is key to the advancement of the field, which so far was missing an effective method exploiting the evolutionary character of protein interfaces. Its coverage and performance will only improve over time thanks to the incessant growth of sequence databases. Currently our method reaches an accuracy of 89% in classifying interfaces of the Ponstingl 2003 datasets and it lends itself to a variety of useful applications in structural biology and bioinformatics. 
We made the corresponding software implementation available to the community as an easy-to-use graphical web interface at http://www.eppic-web.org. PMID:23259833
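The sequence-entropy indicators described in this abstract can be sketched in a few lines of Python. This is an illustrative assumption, not the classifier's actual code: the abstract does not state the entropy definition, alphabet or weighting used, so plain Shannon entropy per alignment column is used here, and the function names and the toy alignment are hypothetical.

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy (bits) of the residues in one alignment column."""
    counts = Counter(column)
    total = len(column)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def mean_entropy(alignment, positions):
    """Average column entropy over the given 0-based alignment positions."""
    columns = [[seq[i] for seq in alignment] for i in positions]
    return sum(column_entropy(col) for col in columns) / len(columns)

# Toy alignment of homologs (made-up data): positions 0-1 play the role of
# interface core residues, positions 2-3 the role of rim residues.
alignment = ["ACDK", "ACEK", "ACDR", "ACER"]
core_entropy = mean_entropy(alignment, [0, 1])
rim_entropy = mean_entropy(alignment, [2, 3])
```

A lower entropy in the core than in the rim, as in this toy case, is the kind of differential selection signal the abstract refers to.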
Homology and phylogeny and their automated inference
NASA Astrophysics Data System (ADS)
Fuellen, Georg
2008-06-01
The analysis of the ever-increasing amount of biological and biomedical data can be pushed forward by comparing the data within and among species. For example, an integrative analysis of data from the genome sequencing projects for various species traces the evolution of the genomes and identifies conserved and innovative parts. Here, I review the foundations and advantages of this “historical” approach and evaluate recent attempts at automating such analyses. Biological data is comparable if a common origin exists (homology), as is the case for members of a gene family originating via duplication of an ancestral gene. If the family has relatives in other species, we can assume that the ancestral gene was present in the ancestral species from which all the other species evolved. In particular, describing the relationships among the duplicated biological sequences found in the various species is often possible by a phylogeny, which is more informative than homology statements. Detecting and elaborating on common origins may answer how certain biological sequences developed, and predict what sequences are in a particular species and what their function is. Such knowledge transfer from sequences in one species to the homologous sequences of the other is based on the principle of ‘my closest relative looks and behaves like I do’, often referred to as ‘guilt by association’. To enable knowledge transfer on a large scale, several automated ‘phylogenomics pipelines’ have been developed in recent years, and seven of these will be described and compared. Overall, the examples in this review demonstrate that homology and phylogeny analyses, done on a large (and automated) scale, can give insights into function in biology and biomedicine.
Barnes, D W
2012-04-01
Two of the most commonly used elasmobranch experimental model species are the spiny dogfish Squalus acanthias and the little skate Leucoraja erinacea. Comparative biology and genomics with these species have provided useful information in physiology, pharmacology, toxicology, immunology, evolutionary developmental biology and genetics. A wealth of information has been obtained using in vitro approaches to study isolated cells and tissues from these organisms under circumstances in which the extracellular environment can be controlled. In addition to classical work with primary cell cultures, continuously proliferating cell lines have been derived recently, representing the first cell lines from cartilaginous fishes. These lines have proved to be valuable tools with which to explore functional genomic and biological questions and to test hypotheses at the molecular level. In genomic experiments, complementary (c)DNA libraries have been constructed, and c. 8000 unique transcripts identified, with over 3000 representing previously unknown gene sequences. A sub-set of messenger (m)RNAs has been detected for which the 3' untranslated regions show elements that are remarkably well conserved evolutionarily, representing novel, potentially regulatory gene sequences. The cell culture systems provide physiologically valid tools to study functional roles of these sequences and other aspects of elasmobranch molecular cell biology and physiology. Information derived from the use of in vitro cell cultures is valuable in revealing gene diversity and information for genomic sequence assembly, as well as for identification of new genes and molecular markers, construction of gene-array probes and acquisition of full-length cDNA sequences. © 2012 The Author. Journal of Fish Biology © 2012 The Fisheries Society of the British Isles.
NASA Astrophysics Data System (ADS)
Ma, Lijun; Lee, Letitia; Barani, Igor; Hwang, Andrew; Fogh, Shannon; Nakamura, Jean; McDermott, Michael; Sneed, Penny; Larson, David A.; Sahgal, Arjun
2011-11-01
Rapid delivery of multiple shots or isocenters is one of the hallmarks of Gamma Knife radiosurgery. In this study, we investigated whether the temporal order of shots delivered with Gamma Knife Perfexion would significantly influence the biological equivalent dose for complex multi-isocenter treatments. Twenty single-target cases were selected for analysis. For each case, 3D dose matrices of individual shots were extracted and single-fraction equivalent uniform dose (sEUD) values were determined for all possible shot delivery sequences, corresponding to different patterns of temporal dose delivery within the target. We found significant variations in the sEUD values among these sequences exceeding 15% for certain cases. However, the sequences for the actual treatment delivery were found to agree (<3%) and to correlate (R² = 0.98) excellently with the sequences yielding the maximum sEUD values for all studied cases. This result is applicable for both fast and slow growing tumors with α/β values of 2 to 20 according to the linear-quadratic model. In conclusion, despite large potential variations in different shot sequences for multi-isocenter Gamma Knife treatments, current clinical delivery sequences exhibited consistent biological target dosing that approached that maximally achievable for all studied cases.
Wright, Imogen A; Travers, Simon A
2014-07-01
The challenge presented by high-throughput sequencing necessitates the development of novel tools for accurate alignment of reads to reference sequences. Current approaches focus on using heuristics to map reads quickly to large genomes, rather than generating highly accurate alignments in coding regions. Such approaches are, thus, unsuited for applications such as amplicon-based analysis and the realignment phase of exome sequencing and RNA-seq, where accurate and biologically relevant alignment of coding regions is critical. To facilitate such analyses, we have developed a novel tool, RAMICS, that is tailored to mapping large numbers of sequence reads to short lengths (<10 000 bp) of coding DNA. RAMICS utilizes profile hidden Markov models to discover the open reading frame of each sequence and aligns to the reference sequence in a biologically relevant manner, distinguishing between genuine codon-sized indels and frameshift mutations. This approach facilitates the generation of highly accurate alignments, accounting for the error biases of the sequencing machine used to generate reads, particularly at homopolymer regions. Performance improvements are gained through the use of graphics processing units, which increase the speed of mapping through parallelization. RAMICS substantially outperforms all other mapping approaches tested in terms of alignment quality while maintaining highly competitive speed performance. © The Author(s) 2014. Published by Oxford University Press on behalf of Nucleic Acids Research.
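The distinction this abstract draws between codon-sized indels and frameshifts rests on simple arithmetic: an insertion or deletion whose length is a multiple of three leaves the reading frame intact. The Python sketch below shows only that criterion; it is not how RAMICS works internally (the abstract says it uses profile hidden Markov models), and the function name is a hypothetical choice.

```python
def classify_indel(length):
    """Classify an indel in coding DNA by its length in nucleotides."""
    if length <= 0:
        raise ValueError("indel length must be a positive number of nucleotides")
    # Multiples of 3 add or remove whole codons and preserve the frame.
    return "codon-sized indel" if length % 3 == 0 else "frameshift"
```

In practice a length-1 indel inside a homopolymer run is often a sequencing error rather than a true frameshift, which is why the tool described above also models the error biases of the sequencing machine.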
Hidden Markov models of biological primary sequence information.
Baldi, P; Chauvin, Y; Hunkapiller, T; McClure, M A
1994-01-01
Hidden Markov model (HMM) techniques are used to model families of biological sequences. A smooth and convergent algorithm is introduced to iteratively adapt the transition and emission parameters of the models from the examples in a given family. The HMM approach is applied to three protein families: globins, immunoglobulins, and kinases. In all cases, the models derived capture the important statistical characteristics of the family and can be used for a number of tasks, including multiple alignments, motif detection, and classification. For K sequences of average length N, this approach yields an effective multiple-alignment algorithm which requires O(KN^2) operations, linear in the number of sequences. PMID:8302831
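How an HMM assigns a likelihood to a sequence through its transition and emission parameters can be shown with a minimal forward-algorithm sketch in Python. The two-state model and all of its probabilities are made-up assumptions for illustration; the paper's models use a family-specific architecture that is not described in this abstract.

```python
def forward_likelihood(obs, states, start_p, trans_p, emit_p):
    """Probability of the observed sequence under the HMM (forward algorithm)."""
    if not obs:
        raise ValueError("observation sequence must not be empty")
    # alpha[s] = P(symbols seen so far, current hidden state = s)
    alpha = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    for symbol in obs[1:]:
        alpha = {
            s: emit_p[s][symbol] * sum(alpha[r] * trans_p[r][s] for r in states)
            for s in states
        }
    return sum(alpha.values())

# Hypothetical two-state model over a two-letter alphabet.
STATES = ("H", "L")
START = {"H": 0.5, "L": 0.5}
TRANS = {"H": {"H": 0.9, "L": 0.1}, "L": {"H": 0.2, "L": 0.8}}
EMIT = {"H": {"A": 0.7, "G": 0.3}, "L": {"A": 0.2, "G": 0.8}}
```

Calling `forward_likelihood("AGGA", STATES, START, TRANS, EMIT)` scores one sequence in time proportional to its length times the square of the number of states; scoring K sequences independently is what makes the total cost linear in K.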
The Genome Sequence of Taurine Cattle: A Window to Ruminant Biology and Evolution
USDA-ARS's Scientific Manuscript database
As a major step toward understanding the biology and evolution of ruminants, the cattle genome was sequenced to ~7x coverage using a combined whole genome shotgun and BAC skim approach. The cattle genome contains a minimum of 22,000 genes, with a core set of 14,345 orthologs found in seven mammalian...
Inverse statistical physics of protein sequences: a key issues review
NASA Astrophysics Data System (ADS)
Cocco, Simona; Feinauer, Christoph; Figliuzzi, Matteo; Monasson, Rémi; Weigt, Martin
2018-03-01
In the course of evolution, proteins undergo important changes in their amino acid sequences, while their three-dimensional folded structure and their biological function remain remarkably conserved. Thanks to modern sequencing techniques, sequence data accumulate at unprecedented pace. This provides large sets of so-called homologous, i.e. evolutionarily related protein sequences, to which methods of inverse statistical physics can be applied. Using sequence data as the basis for the inference of Boltzmann distributions from samples of microscopic configurations or observables, it is possible to extract information about evolutionary constraints and thus protein function and structure. Here we give an overview over some biologically important questions, and how statistical-mechanics inspired modeling approaches can help to answer them. Finally, we discuss some open questions, which we expect to be addressed over the next years.
Dong, Zheng; Zhou, Hongyu; Tao, Peng
2018-02-01
PAS domains are widespread in archaea, bacteria, and eukaryota, and play important roles in various functions. In this study, we aim to explore functional evolutionary relationship among proteins in the PAS domain superfamily in view of the sequence-structure-dynamics-function relationship. We collected protein sequences and crystal structure data from RCSB Protein Data Bank of the PAS domain superfamily belonging to three biological functions (nucleotide binding, photoreceptor activity, and transferase activity). Protein sequences were aligned and then used to select sequence-conserved residues and build phylogenetic tree. Three-dimensional structure alignment was also applied to obtain structure-conserved residues. The protein dynamics were analyzed using elastic network model (ENM) and validated by molecular dynamics (MD) simulation. The result showed that the proteins with same function could be grouped by sequence similarity, and proteins in different functional groups displayed statistically significant difference in their vibrational patterns. Interestingly, in all three functional groups, conserved amino acid residues identified by sequence and structure conservation analysis generally have a lower fluctuation than other residues. In addition, the fluctuation of conserved residues in each biological function group was strongly correlated with the corresponding biological function. This research suggested a direct connection in which the protein sequences were related to various functions through structural dynamics. This is a new attempt to delineate functional evolution of proteins using the integrated information of sequence, structure, and dynamics. © 2017 The Protein Society.
Biology First: A History of the Grade Placement of High School Biology
ERIC Educational Resources Information Center
Sheppard, Keith; Robbins, Dennis M.
2006-01-01
This article outlines the history of the high school "general biology" course and details how biology came to be placed first in the traditional order of science subjects (biology-chemistry-physics). The article briefly discusses the implications of the development of this sequence for the present day biology course.
NASA Astrophysics Data System (ADS)
Yang, Hong
Until recently, recovery and analysis of genetic information encoded in ancient DNA sequences from Pleistocene fossils were impossible. Recent advances in molecular biology offered technical tools to obtain ancient DNA sequences from well-preserved Quaternary fossils and opened the possibilities to directly study genetic changes in fossil species to address various biological and paleontological questions. Ancient DNA studies involving Pleistocene fossil material and ancient DNA degradation and preservation in Quaternary deposits are reviewed. The molecular technology applied to isolate, amplify, and sequence ancient DNA is also presented. Authentication of ancient DNA sequences and technical problems associated with modern and ancient DNA contamination are discussed. As illustrated in recent studies on ancient DNA from proboscideans, it is apparent that fossil DNA sequence data can shed light on many aspects of Quaternary research such as systematics and phylogeny, conservation biology, evolutionary theory, molecular taphonomy, and forensic sciences. Improvement of molecular techniques and a better understanding of DNA degradation during fossilization are likely to build on current strengths and to overcome existing problems, making fossil DNA data a unique source of information for Quaternary scientists.
The use of museum specimens with high-throughput DNA sequencers
Burrell, Andrew S.; Disotell, Todd R.; Bergey, Christina M.
2015-01-01
Natural history collections have long been used by morphologists, anatomists, and taxonomists to probe the evolutionary process and describe biological diversity. These biological archives also offer great opportunities for genetic research in taxonomy, conservation, systematics, and population biology. They allow assays of past populations, including those of extinct species, giving context to present patterns of genetic variation and direct measures of evolutionary processes. Despite this potential, museum specimens are difficult to work with because natural postmortem processes and preservation methods fragment and damage DNA. These problems have restricted geneticists’ ability to use natural history collections primarily by limiting how much of the genome can be surveyed. Recent advances in DNA sequencing technology, however, have radically changed this, making truly genomic studies from museum specimens possible. We review the opportunities and drawbacks of the use of museum specimens, and suggest how to best execute projects when incorporating such samples. Several high-throughput (HT) sequencing methodologies, including whole genome shotgun sequencing, sequence capture, and restriction digests (demonstrated here), can be used with archived biomaterials. PMID:25532801
USDA-ARS's Scientific Manuscript database
Coat protein sequences of 33 Potyvirus isolates from legume and Passiflora spp. were sequenced to determine the identity of infecting viruses. Phylogenetic analysis of the sequences revealed the presence of seven distinct virus species....
A Novel Cylindrical Representation for Characterizing Intrinsic Properties of Protein Sequences.
Yu, Jia-Feng; Dou, Xiang-Hua; Wang, Hong-Bo; Sun, Xiao; Zhao, Hui-Ying; Wang, Ji-Hua
2015-06-22
The composition and sequence order of amino acid residues are the two most important characteristics to describe a protein sequence. Graphical representations facilitate visualization of biological sequences and produce biologically useful numerical descriptors. In this paper, we propose a novel cylindrical representation by placing the 20 amino acid residue types in a circle and sequence positions along the z axis. This representation allows visualization of the composition and sequence order of amino acids at the same time. Ten numerical descriptors and one weighted numerical descriptor have been developed to quantitatively describe intrinsic properties of protein sequences on the basis of the cylindrical model. Their applications to similarity/dissimilarity analysis of nine ND5 proteins indicated that these numerical descriptors are more effective than several classical numerical matrices. Thus, the cylindrical representation obtained here provides a new useful tool for visualizing and characterizing protein sequences. An online server is available at http://biophy.dzu.edu.cn:8080/CNumD/input.jsp .
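The geometric idea in this abstract, residue types on a circle and sequence position along the z axis, can be sketched in Python as follows. The details are assumptions: the abstract does not give the order of the 20 residue types around the circle or the radius, so alphabetical one-letter codes at equal angular spacing on a unit circle are used here, and the function name is hypothetical.

```python
import math

# Assumed ordering of the 20 standard residues around the circle.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def cylindrical_points(sequence, radius=1.0):
    """Map each residue to (x, y, z): angle from its type, z from its position."""
    points = []
    for z, residue in enumerate(sequence, start=1):
        k = AMINO_ACIDS.index(residue)  # raises ValueError for other symbols
        theta = 2 * math.pi * k / len(AMINO_ACIDS)
        points.append((radius * math.cos(theta), radius * math.sin(theta), float(z)))
    return points
```

Connecting consecutive points gives a curve on the cylinder whose angular distribution reflects composition and whose progression along z reflects sequence order; numerical descriptors such as those in the paper would be computed from a curve of this kind.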
Rissing, Steven W.
2013-01-01
Most American colleges and universities offer gateway biology courses to meet the needs of three undergraduate audiences: biology and related science majors, many of whom will become biomedical researchers; premedical students meeting medical school requirements and preparing for the Medical College Admissions Test (MCAT); and students completing general education (GE) graduation requirements. Biology textbooks for these three audiences present a topic scope and sequence that correlates with the topic scope and importance ratings of the biology content specifications for the MCAT regardless of the intended audience. Texts for “nonmajors,” GE courses appear derived directly from their publisher's majors text. Topic scope and sequence of GE texts reflect those of “their” majors text and, indirectly, the MCAT. MCAT term density of GE texts equals or exceeds that of their corresponding majors text. Most American universities require a GE curriculum to promote a core level of academic understanding among their graduates. This includes civic scientific literacy, recognized as an essential competence for the development of public policies in an increasingly scientific and technological world. Deriving GE biology and related science texts from majors texts designed to meet very different learning objectives may defeat the scientific literacy goals of most schools’ GE curricula. PMID:24006392
Fuentes-Pardo, Angela P; Ruzzante, Daniel E
2017-10-01
Whole-genome resequencing (WGR) is a powerful method for addressing fundamental evolutionary biology questions that have not been fully resolved using traditional methods. WGR includes four approaches: the sequencing of individuals to a high depth of coverage with either unresolved or resolved haplotypes, the sequencing of population genomes to a high depth by mixing equimolar amounts of unlabelled-individual DNA (Pool-seq) and the sequencing of multiple individuals from a population to a low depth (lcWGR). These techniques require the availability of a reference genome. This, along with the still high cost of shotgun sequencing and the large demand for computing resources and storage, has limited their implementation in nonmodel species with scarce genomic resources and in fields such as conservation biology. Our goal here is to describe the various WGR methods, their pros and cons and potential applications in conservation biology. WGR offers an unprecedented marker density and surveys a wide diversity of genetic variations not limited to single nucleotide polymorphisms (e.g., structural variants and mutations in regulatory elements), increasing their power for the detection of signatures of selection and local adaptation as well as for the identification of the genetic basis of phenotypic traits and diseases. Currently, though, no single WGR approach fulfils all requirements of conservation genetics, and each method has its own limitations and sources of potential bias. We discuss proposed ways to minimize such biases. We envision a not distant future where the analysis of whole genomes becomes a routine task in many nonmodel species and fields including conservation biology. © 2017 John Wiley & Sons Ltd.
From non-random molecular structure to life and mind
NASA Technical Reports Server (NTRS)
Fox, S. W.
1989-01-01
The evolutionary hierarchy molecular structure-->macromolecular structure-->protobiological structure-->biological structure-->biological functions has been traced by experiments. The sequence always moves through protein. Extension of the experiments traces the formation of nucleic acids instructed by proteins. The proteins themselves were, in this picture, instructed by the self-sequencing of precursor amino acids. While the sequence indicated explains the thread of the emergence of life, protein in cellular membrane also provides the only known material basis for the emergence of mind in the context of emergence of life.
Lineage-Specific Biology Revealed by a Finished Genome Assembly of the Mouse
Hillier, LaDeana W.; Zody, Michael C.; Goldstein, Steve; She, Xinwe; Bult, Carol J.; Agarwala, Richa; Cherry, Joshua L.; DiCuccio, Michael; Hlavina, Wratko; Kapustin, Yuri; Meric, Peter; Maglott, Donna; Birtle, Zoë; Marques, Ana C.; Graves, Tina; Zhou, Shiguo; Teague, Brian; Potamousis, Konstantinos; Churas, Christopher; Place, Michael; Herschleb, Jill; Runnheim, Ron; Forrest, Daniel; Amos-Landgraf, James; Schwartz, David C.; Cheng, Ze; Lindblad-Toh, Kerstin; Eichler, Evan E.; Ponting, Chris P.
2009-01-01
The mouse (Mus musculus) is the premier animal model for understanding human disease and development. Here we show that a comprehensive understanding of mouse biology is only possible with the availability of a finished, high-quality genome assembly. The finished clone-based assembly of the mouse strain C57BL/6J reported here has over 175,000 fewer gaps and over 139 Mb more of novel sequence, compared with the earlier MGSCv3 draft genome assembly. In a comprehensive analysis of this revised genome sequence, we are now able to define 20,210 protein-coding genes, over a thousand more than predicted in the human genome (19,042 genes). In addition, we identified 439 long, non–protein-coding RNAs with evidence for transcribed orthologs in human. We analyzed the complex and repetitive landscape of 267 Mb of sequence that was missing or misassembled in the previously published assembly, and we provide insights into the reasons for its resistance to sequencing and assembly by whole-genome shotgun approaches. Duplicated regions within newly assembled sequence tend to be of more recent ancestry than duplicates in the published draft, correcting our initial understanding of recent evolution on the mouse lineage. These duplicates appear to be largely composed of sequence regions containing transposable elements and duplicated protein-coding genes; of these, some may be fixed in the mouse population, but at least 40% of segmentally duplicated sequences are copy number variable even among laboratory mouse strains. Mouse lineage-specific regions contain 3,767 genes drawn mainly from rapidly-changing gene families associated with reproductive functions. The finished mouse genome assembly, therefore, greatly improves our understanding of rodent-specific biology and allows the delineation of ancestral biological functions that are shared with human from derived functions that are not. PMID:19468303
Genome Sequences of Three Strains of Aspergillus flavus for the Biological Control of Aflatoxin
Scheffler, Brian E.; Duke, Mary; Ballard, Linda; Abbas, Hamed K.; Grodowitz, Michael J.
2017-01-01
ABSTRACT Aflatoxin is a carcinogenic contaminant of many commodities that are infected by Aspergillus flavus. Nonaflatoxigenic strains of A. flavus have been utilized as biological control agents. Here, we report the genome sequences from three biocontrol strains. This information will be useful in developing markers for postrelease monitoring of these fungi. PMID:29097466
ERIC Educational Resources Information Center
Douglass, Claudia B.
The primary purpose of the reported study was to identify a possible interaction between the cognitive style of the students and the instructional sequence of the materials and their combined effect on achievement. The subjects were 627 biology students from six midwestern high schools. The students were ranked and classified as field-dependent…
USDA-ARS's Scientific Manuscript database
High-throughput sequencing is often used for studies of the transcriptome, particularly for comparisons between experimental conditions. Due to sequencing costs, a limited number of biological replicates are typically considered in such experiments, leading to low detection power for differential ex...
Taylor, Brandie D; Zheng, Xiaojing; Darville, Toni; Zhong, Wujuan; Konganti, Kranti; Abiodun-Ojo, Olayinka; Ness, Roberta B; O'Connell, Catherine M; Haggerty, Catherine L
2017-01-01
Ideal management of sexually transmitted infections (STI) may require risk markers for pathology or vaccine development. Previously, we identified common genetic variants associated with chlamydial pelvic inflammatory disease (PID) and reduced fecundity. As this explains only a proportion of the long-term morbidity risk, we used whole-exome sequencing to identify biological pathways that may be associated with STI-related infertility. We obtained stored DNA from 43 non-Hispanic black women with PID from the PID Evaluation and Clinical Health Study. Infertility was assessed at a mean of 84 months. Principal component analysis revealed no population stratification. Potential covariates did not significantly differ between groups. Sequencing kernel association test was used to examine associations between aggregates of variants on a single gene and infertility. The results from the sequencing kernel association test were used to choose "focus genes" (P < 0.01; n = 150) for subsequent Ingenuity Pathway Analysis to identify "gene sets" that are enriched in biologically relevant pathways. Pathway analysis revealed that focus genes were enriched in canonical pathways including, IL-1 signaling, P2Y purinergic receptor signaling, and bone morphogenic protein signaling. Focus genes were enriched in pathways that impact innate and adaptive immunity, protein kinase A activity, cellular growth, and DNA repair. These may alter host resistance or immunopathology after infection. Targeted sequencing of biological pathways identified in this study may provide insight into STI-related infertility.
antaRNA: ant colony-based RNA sequence design.
Kleinkauf, Robert; Mann, Martin; Backofen, Rolf
2015-10-01
RNA sequence design has been studied for at least as long as the classical folding problem. Whereas the latter seeks the functional fold of an RNA molecule, inverse folding tries to identify RNA sequences that fold into a function-specific target structure. In combination with RNA-based biotechnology and synthetic biology, reliable RNA sequence design becomes a crucial step to generate novel biochemical components. In this article, the computational tool antaRNA is presented. It is capable of compiling RNA sequences for a given structure that comply in addition with an adjustable full range objective GC-content distribution, specific sequence constraints and additional fuzzy structure constraints. antaRNA applies ant colony optimization meta-heuristics and its superior performance is shown on biological datasets. http://www.bioinf.uni-freiburg.de/Software/antaRNA CONTACT: backofen@informatik.uni-freiburg.de Supplementary data are available at Bioinformatics online. © The Author 2015. Published by Oxford University Press.
Beltman, Joost B; Urbanus, Jos; Velds, Arno; van Rooij, Nienke; Rohr, Jan C; Naik, Shalin H; Schumacher, Ton N
2016-04-02
Next generation sequencing (NGS) of amplified DNA is a powerful tool to describe genetic heterogeneity within cell populations that can both be used to investigate the clonal structure of cell populations and to perform genetic lineage tracing. For applications in which both abundant and rare sequences are biologically relevant, the relatively high error rate of NGS techniques complicates data analysis, as it is difficult to distinguish rare true sequences from spurious sequences that are generated by PCR or sequencing errors. This issue, for instance, applies to cellular barcoding strategies that aim to follow the amount and type of offspring of single cells, by supplying these with unique heritable DNA tags. Here, we use genetic barcoding data from the Illumina HiSeq platform to show that straightforward read threshold-based filtering of data is typically insufficient to filter out spurious barcodes. Importantly, we demonstrate that specific sequencing errors occur at an approximately constant rate across different samples that are sequenced in parallel. We exploit this observation by developing a novel approach to filter out spurious sequences. Application of our new method demonstrates its value in the identification of true sequences amongst spurious sequences in biological data sets.
Integrating Mathematics into the Introductory Biology Laboratory Course
ERIC Educational Resources Information Center
White, James D.; Carpenter, Jenna P.
2008-01-01
Louisiana Tech University has an integrated science curriculum for its mathematics, chemistry, physics, computer science, biology-research track and secondary mathematics and science education majors. The curriculum focuses on the calculus sequence and introductory labs in biology, physics, and chemistry. In the introductory biology laboratory…
YAMAT-seq: an efficient method for high-throughput sequencing of mature transfer RNAs
Shigematsu, Megumi; Honda, Shozo; Loher, Phillipe; Telonis, Aristeidis G.; Rigoutsos, Isidore
2017-01-01
Abstract Besides translation, transfer RNAs (tRNAs) play many non-canonical roles in various biological pathways and exhibit highly variable expression profiles. To unravel the emerging complexities of tRNA biology and molecular mechanisms underlying them, an efficient tRNA sequencing method is required. However, the rigid structure of tRNA has presented a challenge to the development of such methods. We report the development of Y-shaped Adapter-ligated MAture TRNA sequencing (YAMAT-seq), an efficient and convenient method for high-throughput sequencing of mature tRNAs. YAMAT-seq circumvents the issue of inefficient adapter ligation, a characteristic of conventional RNA sequencing methods for mature tRNAs, by employing the efficient and specific ligation of a Y-shaped adapter to mature tRNAs using T4 RNA Ligase 2. Subsequent cDNA amplification and next-generation sequencing successfully yield numerous mature tRNA sequences. YAMAT-seq has high specificity for mature tRNAs and high sensitivity to detect most isoacceptors from a minute amount of total RNA. Moreover, YAMAT-seq shows quantitative capability to estimate expression levels of mature tRNAs, and has high reproducibility and broad applicability for various cell lines. YAMAT-seq thus provides a high-throughput technique for identifying tRNA profiles and their regulations in various transcriptomes, which could play important regulatory roles in translation and other biological processes. PMID:28108659
2014-01-01
Background Next-generation DNA sequencing (NGS) technologies have made huge impacts in many fields of biological research, but especially in evolutionary biology. One area where NGS has shown potential is for high-throughput sequencing of complete mtDNA genomes (of humans and other animals). Despite the increasing use of NGS technologies and a better appreciation of their importance in answering biological questions, there remain significant obstacles to the successful implementation of NGS-based projects, especially for new users. Results Here we present an ‘A to Z’ protocol for obtaining complete human mitochondrial (mtDNA) genomes – from DNA extraction to consensus sequence. Although designed for use on humans, this protocol could also be used to sequence small, organellar genomes from other species, and also nuclear loci. This protocol includes DNA extraction, PCR amplification, fragmentation of PCR products, barcoding of fragments, sequencing using the 454 GS FLX platform, and a complete bioinformatics pipeline (primer removal, reference-based mapping, output of coverage plots and SNP calling). Conclusions All steps in this protocol are designed to be straightforward to implement, especially for researchers who are undertaking next-generation sequencing for the first time. The molecular steps are scalable to large numbers (hundreds) of individuals and all steps post-DNA extraction can be carried out in 96-well plate format. Also, the protocol has been assembled so that individual ‘modules’ can be swapped out to suit available resources. PMID:24460871
Baichoo, Shakuntala; Ouzounis, Christos A
A multitude of algorithms for sequence comparison, short-read assembly and whole-genome alignment have been developed in the general context of molecular biology, to support technology development for high-throughput sequencing, numerous applications in genome biology and fundamental research on comparative genomics. The computational complexity of these algorithms has been previously reported in original research papers, yet this often neglected property has not been reviewed previously in a systematic manner and for a wider audience. We provide a review of space and time complexity of key sequence analysis algorithms and highlight their properties in a comprehensive manner, in order to identify potential opportunities for further research in algorithm or data structure optimization. The complexity aspect is poised to become pivotal as we will be facing challenges related to the continuous increase of genomic data on unprecedented scales and complexity in the foreseeable future, when robust biological simulation at the cell level and above becomes a reality. Copyright © 2017 Elsevier B.V. All rights reserved.
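As one illustration of the kind of space and time trade-off such a review is concerned with, the following Python sketch computes the edit distance between two sequences in O(mn) time while keeping only two rows of the dynamic-programming table, so memory is O(min(m, n)). The function name and the unit costs are assumptions made for illustration; the example is not taken from the review itself.

```python
def edit_distance(a, b):
    """Unit-cost edit distance: O(len(a)*len(b)) time, O(min(len)) space."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter sequence along the stored row
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr  # only the previous row is ever needed
    return prev[-1]
```

Storing the full table would allow the alignment itself to be traced back, at the price of O(mn) memory; the two-row variant returns only the score.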
Stevens, Mark; Viganó, Felicita
2007-04-01
The full-length cDNA of Beet mild yellowing virus (Broom's Barn isolate) was sequenced and cloned into the vector pLitmus 29 (pBMYV-BBfl). The sequence of BMYV-BBfl (5721 bases) shared 96% and 98% nucleotide identity with the other complete sequences of BMYV (BMYV-2ITB, France and BMYV-IPP, Germany respectively). Full-length capped RNA transcripts of pBMYV-BBfl were synthesised and found to be biologically active in Arabidopsis thaliana protoplasts following electroporation or PEG inoculation when the protoplasts were subsequently analysed using serological and molecular methods. The BMYV sequence was modified by inserting DNA that encoded the jellyfish green fluorescent protein (GFP) into the P5 gene close to its 3' end. A. thaliana protoplasts electroporated with these RNA transcripts were biologically active and up to 2% of transfected protoplasts showed GFP-specific fluorescence. The exploitation of these cDNA clones for the study of the biology of beet poleroviruses is discussed.
[Big Data Revolution or Data Hubris? : On the Data Positivism of Molecular Biology].
Gramelsberger, Gabriele
2017-12-01
Genome data, the core of the 2008 proclaimed big data revolution in biology, are automatically generated and analyzed. The transition from the manual laboratory practice of electrophoresis sequencing to automated DNA-sequencing machines and software-based analysis programs was completed between 1982 and 1992. This transition facilitated the first data deluge, which was considerably increased by the second and third generation of DNA-sequencers during the 2000s. However, the strategies for evaluating sequence data were also transformed along with this transition. The paper explores both the computational strategies of automation, as well as the data evaluation culture connected with it, in order to provide a complete picture of the complexity of today's data generation and its intrinsic data positivism. This paper is thereby guided by the question, whether this data positivism is the basis of the big data revolution of molecular biology announced today, or it marks the beginning of its data hubris.
Bhattacharyya, Anamitra; Stilwagen, Stephanie; Reznik, Gary; Feil, Helene; Feil, William S; Anderson, Iain; Bernal, Axel; D'Souza, Mark; Ivanova, Natalia; Kapatral, Vinayak; Larsen, Niels; Los, Tamara; Lykidis, Athanasios; Selkov, Eugene; Walunas, Theresa L; Purcell, Alexander; Edwards, Rob A; Hawkins, Trevor; Haselkorn, Robert; Overbeek, Ross; Kyrpides, Nikos C; Predki, Paul F
2002-10-01
Draft sequencing is a rapid and efficient method for determining the near-complete sequence of microbial genomes. Here we report a comparative analysis of one complete and two draft genome sequences of the phytopathogenic bacterium, Xylella fastidiosa, which causes serious disease in plants, including citrus, almond, and oleander. We present highlights of an in silico analysis based on a comparison of reconstructions of core biological subsystems. Cellular pathway reconstructions have been used to identify a small number of genes, which are likely to reside within the draft genomes but are not captured in the draft assembly. These represented only a small fraction of all genes and were predominantly large and small ribosomal subunit protein components. By using this approach, some of the inherent limitations of draft sequence can be significantly reduced. Despite the incomplete nature of the draft genomes, it is possible to identify several phage-related genes, which appear to be absent from the draft genomes and not the result of insufficient sequence sampling. This region may therefore identify potential host-specific functions. Based on this first functional reconstruction of a phytopathogenic microbe, we spotlight an unusual respiration machinery as a potential target for biological control. We also predicted and developed a new defined growth medium for Xylella.
Quinn, Terrance; Sinkala, Zachariah
2014-01-01
We develop a general method for computing extreme value distribution (Gumbel, 1958) parameters for gapped alignments. Our approach uses mixture distribution theory to obtain associated BLOSUM matrices for gapped alignments, which in turn are used for determining significance of gapped alignment scores for pairs of biological sequences. We compare our results with parameters already obtained in the literature.
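To show how extreme value parameters are typically used once they are known, here is a minimal Python sketch of the standard Gumbel tail in its Karlin-Altschul form, P(S >= x) = 1 - exp(-K m n e^(-lambda x)). The function name and the parameter values in the usage note are placeholders chosen for illustration, not values estimated by the authors.

```python
import math

def gumbel_pvalue(score, m, n, lam, k):
    """P(S >= score) for sequences of lengths m and n under a Gumbel EVD."""
    # Expected number of chance alignments scoring at least `score`.
    expected_hits = k * m * n * math.exp(-lam * score)
    # 1 - exp(-E), computed with expm1 to stay accurate for tiny E.
    return -math.expm1(-expected_hits)
```

For example, `gumbel_pvalue(60, 300, 300, lam=0.27, k=0.04)` uses purely hypothetical values of lambda and K; in practice these depend on the scoring matrix and gap penalties.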
International Space Station Research Plan, Assembly Sequence Rev. F
2000-08-01
muscles; higher risk for bone fracture upon return to Earth; potential for “slipped discs”; diminished ability to quickly respond to emergencies... Office of Biological and Physical Research, International Space Station Research Plan, Assembly Sequence Rev. F, Aug. 2000... Report...Organization Name(s) and Address(es) NASA, Office of Biological and Physical Research Performing Organization Report Number Sponsoring/Monitoring Agency
ERIC Educational Resources Information Center
Flowers, Susan K.; Easter, Carla; Holmes, Andrea; Cohen, Brian; Bednarski, April E.; Mardis, Elaine R.; Wilson, Richard K.; Elgin, Sarah C. R.
2005-01-01
Sequencing of the human genome has ushered in a new era of biology. The technologies developed to facilitate the sequencing of the human genome are now being applied to the sequencing of other genomes. In 2004, a partnership was formed between Washington University School of Medicine Genome Sequencing Center's Outreach Program and Washington…
Method and apparatus for biological sequence comparison
Marr, T.G.; Chang, W.I.
1997-12-23
A method and apparatus are disclosed for comparing biological sequences from a known source of sequences, with a subject (query) sequence. The apparatus takes as input a set of target similarity levels (such as evolutionary distances in units of PAM), and finds all fragments of known sequences that are similar to the subject sequence at each target similarity level, and are long enough to be statistically significant. The invention device filters out fragments from the known sequences that are too short, or have a lower average similarity to the subject sequence than is required by each target similarity level. The subject sequence is then compared only to the remaining known sequences to find the best matches. The filtering member divides the subject sequence into overlapping blocks, each block being sufficiently large to contain a minimum-length alignment from a known sequence. For each block, the filter member compares the block with every possible short fragment in the known sequences and determines a best match for each comparison. The determined set of short fragment best matches for the block provide an upper threshold on alignment values. Regions of a certain length from the known sequences that have a mean alignment value upper threshold greater than a target unit score are concatenated to form a union. The current block is compared to the union and provides an indication of best local alignment with the subject sequence. 5 figs.
Method and apparatus for biological sequence comparison
Marr, Thomas G.; Chang, William I-Wei
1997-01-01
A method and apparatus for comparing biological sequences from a known source of sequences, with a subject (query) sequence. The apparatus takes as input a set of target similarity levels (such as evolutionary distances in units of PAM), and finds all fragments of known sequences that are similar to the subject sequence at each target similarity level, and are long enough to be statistically significant. The invention device filters out fragments from the known sequences that are too short, or have a lower average similarity to the subject sequence than is required by each target similarity level. The subject sequence is then compared only to the remaining known sequences to find the best matches. The filtering member divides the subject sequence into overlapping blocks, each block being sufficiently large to contain a minimum-length alignment from a known sequence. For each block, the filter member compares the block with every possible short fragment in the known sequences and determines a best match for each comparison. The determined set of short fragment best matches for the block provide an upper threshold on alignment values. Regions of a certain length from the known sequences that have a mean alignment value upper threshold greater than a target unit score are concatenated to form a union. The current block is compared to the union and provides an indication of best local alignment with the subject sequence.
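The block-division step of the filter can be sketched in Python as follows. This is only an illustration under stated assumptions: the abstract does not give the block size or overlap, so the sketch assumes blocks of twice the minimum alignment length, advanced by one minimum length at a time, which guarantees that every window of the minimum length lies wholly inside at least one block. The function name `overlapping_blocks` is hypothetical.

```python
def overlapping_blocks(subject, min_len):
    """Split `subject` into overlapping blocks as (start, block) pairs.

    Assumed scheme (not from the patent): blocks are 2*min_len long and
    start every min_len positions; the last block may be shorter.
    """
    if min_len <= 0:
        raise ValueError("min_len must be positive")
    last_start = len(subject) - min_len  # last position a window can start
    return [(start, subject[start:start + 2 * min_len])
            for start in range(0, last_start + 1, min_len)]
```

Each block would then be compared with the short fragments of the known sequences; that scoring step is not shown here because the abstract does not specify it in enough detail.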
Folmar, L.D.; Denslow, N.D.; Wallace, R.A.; LaFleur, G.; Gross, T.S.; Bonomelli, S.; Sullivan, C.V.
1995-01-01
N-terminal amino acid sequences for vitellogenin (Vtg) from six species of teleost fish (striped bass, mummichog, pinfish, brown bullhead, medaka, yellow perch and the sturgeon) are compared with published N-terminal Vtg sequences for the lamprey, clawed frog and domestic chicken. Striped bass and mummichog had 100% identical amino acids between positions 7 and 21, while pinfish, brown bullhead, sturgeon, lamprey, Xenopus and chicken had 87%, 93%, 60%, 47%, 47-60% (for four transcripts) and 40% identical amino acids, respectively, with striped bass for the same positions. Partial sequences obtained for medaka and yellow perch were 100% identical between positions 5 to 10. The potential utility of this conserved sequence for studies on the biochemistry, molecular biology and pathology of vitellogenesis is discussed.
Sequence-specific bias correction for RNA-seq data using recurrent neural networks.
Zhang, Yao-Zhong; Yamaguchi, Rui; Imoto, Seiya; Miyano, Satoru
2017-01-25
The recent success of deep learning techniques in machine learning and artificial intelligence has stimulated a great deal of interest among bioinformaticians, who now wish to bring the power of deep learning to bear on a host of bioinformatical problems. Deep learning is ideally suited for biological problems that require automatic or hierarchical feature representation for biological data when prior knowledge is limited. In this work, we address the sequence-specific bias correction problem for RNA-seq data using Recurrent Neural Networks (RNNs) to model nucleotide sequences without pre-determining sequence structures. The sequence-specific bias of a read is then calculated based on the sequence probabilities estimated by RNNs, and used in the estimation of gene abundance. We explore the application of two popular RNN recurrent units for this task and demonstrate that RNN-based approaches provide a flexible way to model nucleotide sequences without knowledge of predetermined sequence structures. Our experiments show that training an RNN-based nucleotide sequence model is efficient and RNN-based bias correction methods compare well with the state-of-the-art sequence-specific bias correction method on the commonly used MAQC-III data set. RNNs provide an alternative and flexible way to calculate sequence-specific bias without explicitly pre-determining sequence structures.
Streamlining the Design-to-Build Transition with Build-Optimization Software Tools.
Oberortner, Ernst; Cheng, Jan-Fang; Hillson, Nathan J; Deutsch, Samuel
2017-03-17
Scaling-up capabilities for the design, build, and test of synthetic biology constructs holds great promise for the development of new applications in fuels, chemical production, or cellular-behavior engineering. Construct design is an essential component in this process; however, not every designed DNA sequence can be readily manufactured, even using state-of-the-art DNA synthesis methods. Current biological computer-aided design and manufacture tools (bioCAD/CAM) do not adequately consider the limitations of DNA synthesis technologies when generating their outputs. Designed sequences that violate DNA synthesis constraints may require substantial sequence redesign or lead to price-premiums and temporal delays, which adversely impact the efficiency of the DNA manufacturing process. We have developed a suite of build-optimization software tools (BOOST) to streamline the design-build transition in synthetic biology engineering workflows. BOOST incorporates knowledge of DNA synthesis success determinants into the design process to output ready-to-build sequences, preempting the need for sequence redesign. The BOOST web application is available at https://boost.jgi.doe.gov and its Application Program Interfaces (API) enable integration into automated, customized DNA design processes. The herein presented results highlight the effectiveness of BOOST in reducing DNA synthesis costs and timelines.
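The idea of screening a designed sequence against synthesis constraints before ordering it can be sketched in Python. This does not use the BOOST API; the function names, the two constraints chosen (overall GC fraction and homopolymer run length) and every threshold below are assumptions for illustration only, since real limits vary by synthesis vendor.

```python
import re

def gc_fraction(seq):
    """Fraction of G and C bases in the sequence (0.0 for empty input)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0

def longest_homopolymer(seq):
    """Length of the longest run of a single repeated base."""
    runs = re.findall(r"A+|C+|G+|T+", seq.upper())
    return max((len(run) for run in runs), default=0)

def synthesis_violations(seq, gc_min=0.25, gc_max=0.65, max_run=8):
    """Return a list of violated constraints; thresholds are hypothetical."""
    problems = []
    gc = gc_fraction(seq)
    if not gc_min <= gc <= gc_max:
        problems.append(f"GC fraction {gc:.2f} outside [{gc_min}, {gc_max}]")
    run = longest_homopolymer(seq)
    if run > max_run:
        problems.append(f"homopolymer run of {run} exceeds {max_run}")
    return problems
```

A sequence that returns an empty list passes these two checks; a non-empty list points at regions that might need redesign, for instance by synonymous codon substitution.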
Rényi continuous entropy of DNA sequences.
Vinga, Susana; Almeida, Jonas S
2004-12-07
Entropy measures of DNA sequences estimate their randomness or, inversely, their repeatability. L-block Shannon discrete entropy accounts for the empirical distribution of all length-L words and has convergence problems for finite sequences. A new entropy measure that extends Shannon's formalism is proposed. Renyi's quadratic entropy calculated with Parzen window density estimation method applied to CGR/USM continuous maps of DNA sequences constitute a novel technique to evaluate sequence global randomness without some of the former method drawbacks. The asymptotic behaviour of this new measure was analytically deduced and the calculation of entropies for several synthetic and experimental biological sequences was performed. The results obtained were compared with the distributions of the null model of randomness obtained by simulation. The biological sequences have shown a different p-value according to the kernel resolution of Parzen's method, which might indicate an unknown level of organization of their patterns. This new technique can be very useful in the study of DNA sequence complexity and provide additional tools for DNA entropy estimation. The main MATLAB applications developed and additional material are available at the webpage . Specialized functions can be obtained from the authors.
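The two kinds of measure contrasted above can be sketched in Python. This is not the authors' MATLAB code: the corner assignment for the chaos game representation (CGR) is one common convention, the Gaussian kernel of variance 2*sigma^2 is the usual closed form for Rényi quadratic entropy under a Parzen window, and all function names are hypothetical.

```python
import math
from collections import Counter

def block_entropy(seq, L):
    """L-block Shannon entropy (bits) over all overlapping length-L words."""
    words = [seq[i:i + L] for i in range(len(seq) - L + 1)]
    total = len(words)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(words).values())

# Assumed CGR corners of the unit square, one common convention.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}

def cgr_points(seq):
    """Map a DNA sequence to CGR coordinates, starting from the centre."""
    x, y = 0.5, 0.5
    points = []
    for base in seq:
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2, (y + cy) / 2  # move halfway to the base's corner
        points.append((x, y))
    return points

def renyi_quadratic_entropy(points, sigma):
    """H2 = -ln( (1/N^2) * sum_ij G(x_i - x_j; 2*sigma^2) ) in 2-D."""
    var = 2 * sigma ** 2              # variance of the convolved kernel
    norm = 1 / (2 * math.pi * var)    # 2-D Gaussian normalisation
    total = 0.0
    for x1, y1 in points:
        for x2, y2 in points:
            d2 = (x1 - x2) ** 2 + (y1 - y2) ** 2
            total += norm * math.exp(-d2 / (2 * var))
    return -math.log(total / len(points) ** 2)
```

The double loop is O(N^2) and only suitable for short sequences; the kernel width `sigma` plays the role of the kernel resolution mentioned in the abstract.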
Sequence and Structure Dependent DNA-DNA Interactions
NASA Astrophysics Data System (ADS)
Kopchick, Benjamin; Qiu, Xiangyun
Molecular forces between dsDNA strands are largely dominated by electrostatics and have been extensively studied. Quantitative knowledge has been accumulated on how DNA-DNA interactions are modulated by varied biological constituents such as ions, cationic ligands, and proteins. Despite its central role in biology, the sequence of DNA has not received substantial attention and "random" DNA sequences are typically used in biophysical studies. However, ~50% of the human genome is composed of non-random-sequence DNAs, particularly repetitive sequences. Furthermore, covalent modifications of DNA such as methylation play key roles in gene functions. Such DNAs with specific sequences or modifications often take on structures other than the canonical B-form. Here we present a series of quantitative measurements of the DNA-DNA forces with the osmotic stress method on different DNA sequences, from short repeats to the most frequent sequences in the genome, and to modifications such as bromination and methylation. We observe peculiar behaviors that appear to be strongly correlated with the incurred structural changes. We speculate on the causalities in terms of the differences in hydration shell and DNA surface structures.
Networking Biology: The Origins of Sequence-Sharing Practices in Genomics.
Stevens, Hallam
2015-10-01
The wide sharing of biological data, especially nucleotide sequences, is now considered to be a key feature of genomics. Historians and sociologists have attempted to account for the rise of this sharing by pointing to precedents in model organism communities and in natural history. This article supplements these approaches by examining the role that electronic networking technologies played in generating the specific forms of sharing that emerged in genomics. The links between early computer users at the Stanford Artificial Intelligence Laboratory in the 1960s, biologists using local computer networks in the 1970s, and GenBank in the 1980s, show how networking technologies carried particular practices of communication, circulation, and data distribution from computing into biology. In particular, networking practices helped to transform sequences themselves into objects that had value as a community resource.
Unravelling biology and shifting paradigms in cancer with single-cell sequencing.
Baslan, Timour; Hicks, James
2017-08-24
The fundamental operative unit of a cancer is the genetically and epigenetically innovative single cell. Whether proliferating or quiescent, in the primary tumour mass or disseminated elsewhere, single cells govern the parameters that dictate all facets of the biology of cancer. Thus, single-cell analyses provide the ultimate level of resolution in our quest for a fundamental understanding of this disease. Historically, this quest has been hampered by technological shortcomings. In this Opinion article, we argue that the rapidly evolving field of single-cell sequencing has unshackled the cancer research community of these shortcomings. From furthering an elemental understanding of intra-tumoural genetic heterogeneity and cancer genome evolution to illuminating the governing principles of disease relapse and metastasis, we posit that single-cell sequencing promises to unravel the biology of all facets of this disease.
Nanopore-based fourth-generation DNA sequencing technology.
Feng, Yanxiao; Zhang, Yuechuan; Ying, Cuifeng; Wang, Deqiang; Du, Chunlei
2015-02-01
Nanopore-based sequencers, as the fourth-generation DNA sequencing technology, have the potential to quickly and reliably sequence the entire human genome for less than $1000, and possibly for even less than $100. The single-molecule techniques used by this technology allow us to further study the interaction between DNA and protein, as well as between protein and protein. Nanopore analysis opens a new door to molecular biology investigation at the single-molecule scale. In this article, we have reviewed academic achievements in nanopore technology from the past as well as the latest advances, including both biological and solid-state nanopores, and discussed their recent and potential applications. Copyright © 2015 The Authors. Production and hosting by Elsevier Ltd. All rights reserved.
Adaptive compressive learning for prediction of protein-protein interactions from primary sequence.
Zhang, Ya-Nan; Pan, Xiao-Yong; Huang, Yan; Shen, Hong-Bin
2011-08-21
Protein-protein interactions (PPIs) play an important role in biological processes. Although much effort has been devoted to the identification of novel PPIs by integrating experimental biological knowledge, many difficulties remain because sufficient protein structural and functional information is lacking. Methods that predict PPIs from amino acid sequences alone are therefore highly desirable. However, sequence-based predictors often struggle with high dimensionality, which causes over-fitting and high computational complexity, as well as with the redundancy of sequential feature vectors. In this paper, a novel computational approach based on compressed sensing theory is proposed to predict yeast Saccharomyces cerevisiae PPIs from primary sequence and has achieved promising results. The key advantage of the proposed compressed sensing algorithm is that it can compress the original high-dimensional protein sequential feature vector into a much lower but more condensed space taking the sparsity property of the original signal into account. What makes compressed sensing much more attractive in protein sequence analysis is that the compressed signal can be reconstructed from far fewer measurements than what is usually considered necessary in traditional Nyquist sampling theory. Experimental results demonstrate that the proposed compressed sensing method is powerful for analyzing noisy biological data and reducing redundancy in feature vectors. The proposed method represents a new strategy for dealing with high-dimensional protein discrete models and has great potential to be extended to many other complicated biological systems. Copyright © 2011 Elsevier Ltd. All rights reserved.
Crovadore, Julien; Calmin, Gautier; Chablais, Romain; Cochard, Bastien; Schulz, Torsten; Lefort, François
2016-10-06
We report here the whole-genome shotgun sequence of the strain UASWS1507 of the species Pseudomonas graminis, isolated in Switzerland from an apple tree. This is the first genome registered for this species, which is considered as a potential and valuable resource of biological control agents and biofertilizers for agriculture. Copyright © 2016 Crovadore et al.
Mind the gap; seven reasons to close fragmented genome assemblies
USDA-ARS's Scientific Manuscript database
Like other domains of life, research into the biology of filamentous microbes has greatly benefited from the advent of whole-genome sequencing. Next-generation sequencing (NGS) technologies have revolutionized sequencing, making genomic sciences accessible to many academic laboratories including tho...
de Vries, Ronald P; Riley, Robert; Wiebenga, Ad; Aguilar-Osorio, Guillermo; Amillis, Sotiris; Uchima, Cristiane Akemi; Anderluh, Gregor; Asadollahi, Mojtaba; Askin, Marion; Barry, Kerrie; Battaglia, Evy; Bayram, Özgür; Benocci, Tiziano; Braus-Stromeyer, Susanna A; Caldana, Camila; Cánovas, David; Cerqueira, Gustavo C; Chen, Fusheng; Chen, Wanping; Choi, Cindy; Clum, Alicia; Dos Santos, Renato Augusto Corrêa; Damásio, André Ricardo de Lima; Diallinas, George; Emri, Tamás; Fekete, Erzsébet; Flipphi, Michel; Freyberg, Susanne; Gallo, Antonia; Gournas, Christos; Habgood, Rob; Hainaut, Matthieu; Harispe, María Laura; Henrissat, Bernard; Hildén, Kristiina S; Hope, Ryan; Hossain, Abeer; Karabika, Eugenia; Karaffa, Levente; Karányi, Zsolt; Kraševec, Nada; Kuo, Alan; Kusch, Harald; LaButti, Kurt; Lagendijk, Ellen L; Lapidus, Alla; Levasseur, Anthony; Lindquist, Erika; Lipzen, Anna; Logrieco, Antonio F; MacCabe, Andrew; Mäkelä, Miia R; Malavazi, Iran; Melin, Petter; Meyer, Vera; Mielnichuk, Natalia; Miskei, Márton; Molnár, Ákos P; Mulé, Giuseppina; Ngan, Chew Yee; Orejas, Margarita; Orosz, Erzsébet; Ouedraogo, Jean Paul; Overkamp, Karin M; Park, Hee-Soo; Perrone, Giancarlo; Piumi, Francois; Punt, Peter J; Ram, Arthur F J; Ramón, Ana; Rauscher, Stefan; Record, Eric; Riaño-Pachón, Diego Mauricio; Robert, Vincent; Röhrig, Julian; Ruller, Roberto; Salamov, Asaf; Salih, Nadhira S; Samson, Rob A; Sándor, Erzsébet; Sanguinetti, Manuel; Schütze, Tabea; Sepčić, Kristina; Shelest, Ekaterina; Sherlock, Gavin; Sophianopoulou, Vicky; Squina, Fabio M; Sun, Hui; Susca, Antonia; Todd, Richard B; Tsang, Adrian; Unkles, Shiela E; van de Wiele, Nathalie; van Rossen-Uffink, Diana; Oliveira, Juliana Velasco de Castro; Vesth, Tammi C; Visser, Jaap; Yu, Jae-Hyuk; Zhou, Miaomiao; Andersen, Mikael R; Archer, David B; Baker, Scott E; Benoit, Isabelle; Brakhage, Axel A; Braus, Gerhard H; Fischer, Reinhard; Frisvad, Jens C; Goldman, Gustavo H; Houbraken, Jos; Oakley, Berl; Pócsi, István; Scazzocchio, Claudio; Seiboth, Bernhard; vanKuyk, Patricia A; Wortman, Jennifer; Dyer, Paul S; Grigoriev, Igor V
2017-02-14
The fungal genus Aspergillus is of critical importance to humankind. Species include those with industrial applications, important pathogens of humans, animals and crops, a source of potent carcinogenic contaminants of food, and an important genetic model. The genome sequences of eight aspergilli have already been explored to investigate aspects of fungal biology, raising questions about evolution and specialization within this genus. We have generated genome sequences for ten novel, highly diverse Aspergillus species and compared these in detail to sister and more distant genera. Comparative studies of key aspects of fungal biology, including primary and secondary metabolism, stress response, biomass degradation, and signal transduction, revealed both conservation and diversity among the species. Observed genomic differences were validated with experimental studies. This revealed several highlights, such as the potential for sex in asexual species, organic acid production genes being a key feature of black aspergilli, alternative approaches for degrading plant biomass, and indications for the genetic basis of stress response. A genome-wide phylogenetic analysis demonstrated in detail the relationship of the newly genome sequenced species with other aspergilli. Many aspects of biological differences between fungal species cannot be explained by current knowledge obtained from genome sequences. The comparative genomics and experimental study, presented here, allows for the first time a genus-wide view of the biological diversity of the aspergilli and in many, but not all, cases linked genome differences to phenotype. Insights gained could be exploited for biotechnological and medical applications of fungi.
2005-01-01
Sequencing of the human genome has ushered in a new era of biology. The technologies developed to facilitate the sequencing of the human genome are now being applied to the sequencing of other genomes. In 2004, a partnership was formed between Washington University School of Medicine Genome Sequencing Center's Outreach Program and Washington University Department of Biology Science Outreach to create a video tour depicting the processes involved in large-scale sequencing. “Sequencing a Genome: Inside the Washington University Genome Sequencing Center” is a tour of the laboratory that follows the steps in the sequencing pipeline, interspersed with animated explanations of the scientific procedures used at the facility. Accompanying interviews with the staff illustrate different entry levels for a career in genome science. This video project serves as an example of how research and academic institutions can provide teachers and students with access and exposure to innovative technologies at the forefront of biomedical research. Initial feedback on the video from undergraduate students, high school teachers, and high school students provides suggestions for use of this video in a classroom setting to supplement present curricula. PMID:16341256
Short reads from honey bee (Apis sp.) sequencing projects reflect microbial associate diversity
Gerth, Michael; Hurst, Gregory D.D.
2017-01-01
High throughput (or ‘next generation’) sequencing has transformed most areas of biological research and is now a standard method that underpins empirical study of organismal biology and, through comparison of genomes, reveals patterns of evolution. For projects focused on animals, these sequencing methods do not discriminate between the primary target of sequencing (the animal genome) and ‘contaminating’ material, such as associated microbes. A common first step is to filter out these contaminants to allow better assembly of the animal genome or transcriptome. Here, we aimed to assess if these ‘contaminations’ provide information with regard to biologically important microorganisms associated with the individual. To achieve this, we examined whether the short read data from Apis retrieved elements of its well established microbiome. To this end, we screened almost 1,000 short read libraries from honey bee (Apis sp.) DNA sequencing projects for the presence of microbial sequences, and found sequences from known honey bee microbial associates in at least 11% of them. Further to this, we screened ∼500 Apis RNA sequencing libraries for evidence of viral infections, which were found to be present in about half of them. We then used the data to reconstruct draft genomes of three Apis associated bacteria, as well as several viral strains de novo. We conclude that ‘contamination’ in short read sequencing libraries can provide useful genomic information on microbial taxa known to be associated with the target organisms, and may even lead to the discovery of novel associations. Finally, we demonstrate that RNAseq samples from experiments commonly carry uneven viral loads across libraries. We note that variation in viral presence and load may be a confounding feature of differential gene expression analyses, and as such it should be incorporated as a random factor in analyses. PMID:28717593
NRGC: a novel referential genome compression algorithm.
Saha, Subrata; Rajasekaran, Sanguthevar
2016-11-15
Next-generation sequencing techniques produce millions to billions of short reads. The procedure is not only very cost effective but can also be done in a laboratory environment. State-of-the-art sequence assemblers then construct the whole genomic sequence from these reads. Current cutting-edge computing technology makes it possible to build genomic sequences from billions of reads at minimal cost and in minimal time. As a consequence, we see an explosion of biological sequences in recent times. In turn, the cost of storing the sequences in physical memory or transmitting them over the internet is becoming a major bottleneck for research and future medical applications. Data compression techniques are among the most important remedies in this context. We are in need of suitable data compression algorithms that can exploit the inherent structure of biological sequences. Although standard data compression algorithms are prevalent, they are not suitable for compressing biological sequencing data effectively. In this article, we propose a novel referential genome compression algorithm (NRGC) to effectively and efficiently compress genomic sequences. We have done rigorous experiments to evaluate NRGC on a set of real human genomes. The simulation results show that our algorithm is indeed an effective genome compression algorithm that performs better than the best-known algorithms in most cases. Compression and decompression times are also very impressive. The implementations are freely available for non-commercial purposes. They can be downloaded from: http://www.engr.uconn.edu/~rajasek/NRGC.zip Contact: rajasek@engr.uconn.edu. © The Author 2016. Published by Oxford University Press.
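To make the idea of referential compression concrete, the following is a minimal sketch, not the NRGC algorithm itself, whose details the abstract does not give. It assumes a simple greedy scheme: the target genome is encoded as "copy" operations pointing into a reference plus single-base literals where the two differ. The function names (`build_index`, `compress`, `decompress`) and the seed length `k` are hypothetical choices made for illustration.

```python
def build_index(reference, k):
    """Map every k-mer of the reference to the positions where it starts."""
    index = {}
    for i in range(len(reference) - k + 1):
        index.setdefault(reference[i:i + k], []).append(i)
    return index


def compress(target, reference, k=4):
    """Greedy referential encoding (illustrative assumption, not NRGC).

    Emits ("copy", position, length) for stretches found in the reference
    and ("literal", base) for bases that cannot be copied.
    """
    index = build_index(reference, k)
    ops = []
    i = 0
    while i < len(target):
        best_pos, best_len = -1, 0
        # Try every reference position sharing the next k-mer and extend it.
        for pos in index.get(target[i:i + k], []):
            length = 0
            while (pos + length < len(reference)
                   and i + length < len(target)
                   and reference[pos + length] == target[i + length]):
                length += 1
            if length > best_len:
                best_pos, best_len = pos, length
        if best_len >= k:
            ops.append(("copy", best_pos, best_len))
            i += best_len
        else:
            ops.append(("literal", target[i]))
            i += 1
    return ops


def decompress(ops, reference):
    """Rebuild the target from the operations and the same reference."""
    parts = []
    for op in ops:
        if op[0] == "copy":
            _, pos, length = op
            parts.append(reference[pos:pos + length])
        else:
            parts.append(op[1])
    return "".join(parts)


reference = "ACGTACGTTAGCCGATAGGCTA"
target = "ACGTACGTTAGCGGATAGGCTA"  # one substitution relative to the reference
ops = compress(target, reference)
```

In this toy case the 22-base target reduces to two copy operations around a single literal, which is why closely related genomes compress well against a shared reference.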
Ortuño, Francisco M; Valenzuela, Olga; Rojas, Fernando; Pomares, Hector; Florido, Javier P; Urquiza, Jose M; Rojas, Ignacio
2013-09-01
Multiple sequence alignments (MSAs) are widely used in bioinformatics to support other tasks such as structure prediction, biological function analysis or phylogenetic modeling. However, current tools usually provide only partially optimal alignments, as each one is focused on specific biological features. Thus, the same set of sequences can produce different alignments, above all when sequences are less similar. Consequently, researchers and biologists do not agree about which is the most suitable way to evaluate MSAs. Recent evaluations tend to use more complex scores that include further biological features. Among them, 3D structures are increasingly being used to evaluate alignments. Because structures are more conserved in proteins than sequences, scores with structural information are better suited to evaluate more distant relationships between sequences. The proposed multiobjective algorithm, based on the non-dominated sorting genetic algorithm, aims to jointly optimize three objectives: STRIKE score, non-gaps percentage and totally conserved columns. It was assessed on the BAliBASE benchmark, with significance established by the Kruskal-Wallis test (P < 0.01). This algorithm also outperforms other aligners, such as ClustalW, Multiple Sequence Alignment Genetic Algorithm (MSA-GA), PRRP, DIALIGN, Hidden Markov Model Training (HMMT), Pattern-Induced Multi-sequence Alignment (PIMA), MULTIALIGN, Sequence Alignment Genetic Algorithm (SAGA), PILEUP, Rubber Band Technique Genetic Algorithm (RBT-GA) and Vertical Decomposition Genetic Algorithm (VDGA), according to the Wilcoxon signed-rank test (P < 0.05), whereas its results are not significantly different from those of 3D-COFFEE (P > 0.05), with the advantage of being able to use fewer structures. Structural information is included within the objective function to evaluate the obtained alignments more accurately. The source code is available at http://www.ugr.es/~fortuno/MOSAStrE/MO-SAStrE.zip.
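Two of the three objectives named above, non-gaps percentage and totally conserved columns, can be computed directly from an alignment. The sketch below is an illustration under stated assumptions rather than the authors' implementation: the alignment is a list of equal-length strings with "-" as the gap symbol, a column counts as totally conserved only if it is gap-free and holds a single residue, and the function names are hypothetical. The STRIKE score needs structural data and is not shown.

```python
def non_gap_percentage(alignment):
    """Percentage of alignment cells that hold a residue rather than a gap."""
    total_cells = sum(len(row) for row in alignment)
    gap_cells = sum(row.count("-") for row in alignment)
    return 100.0 * (total_cells - gap_cells) / total_cells


def totally_conserved_columns(alignment):
    """Number of gap-free columns in which every sequence has the same residue."""
    conserved = 0
    for column in zip(*alignment):
        if "-" not in column and len(set(column)) == 1:
            conserved += 1
    return conserved


# Toy alignment of three sequences over five columns.
toy_alignment = ["MK-LV",
                 "MKALV",
                 "MR-LV"]

print(non_gap_percentage(toy_alignment))
print(totally_conserved_columns(toy_alignment))  # columns 1, 4 and 5 are conserved
```

A multiobjective optimizer would treat these values, together with a structural score, as separate objectives instead of collapsing them into one weighted sum.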
RBT-GA: a novel metaheuristic for solving the Multiple Sequence Alignment problem.
Taheri, Javid; Zomaya, Albert Y
2009-07-07
Multiple Sequence Alignment (MSA) has always been an active area of research in Bioinformatics. MSA is mainly focused on discovering biologically meaningful relationships among different sequences or proteins in order to investigate the underlying main characteristics/functions. This information is also used to generate phylogenetic trees. This paper presents a novel approach, namely RBT-GA, to solve the MSA problem using a hybrid solution methodology combining the Rubber Band Technique (RBT) and the Genetic Algorithm (GA) metaheuristic. RBT is inspired by the behavior of an elastic Rubber Band (RB) on a plate with several poles, which are analogous to locations in the input sequences that could potentially be biologically related. A GA attempts to mimic the evolutionary processes of life in order to locate optimal solutions in an often very complex landscape. RBT-GA is a population-based optimization algorithm designed to find the optimal alignment for a set of input protein sequences. In this novel technique, each alignment answer is modeled as a chromosome consisting of several poles in the RBT framework. These poles resemble locations in the input sequences that are most likely to be correlated and/or biologically related. A GA-based optimization process improves these chromosomes gradually, yielding a set of mostly optimal answers for the MSA problem. RBT-GA is tested with one of the well-known benchmark suites (BAliBASE 2.0) in this area. The obtained results show the superiority of the proposed technique even in the case of formidable sequences.
YAMAT-seq: an efficient method for high-throughput sequencing of mature transfer RNAs.
Shigematsu, Megumi; Honda, Shozo; Loher, Phillipe; Telonis, Aristeidis G; Rigoutsos, Isidore; Kirino, Yohei
2017-05-19
Besides translation, transfer RNAs (tRNAs) play many non-canonical roles in various biological pathways and exhibit highly variable expression profiles. To unravel the emerging complexities of tRNA biology and the molecular mechanisms underlying them, an efficient tRNA sequencing method is required. However, the rigid structure of tRNA has presented a challenge to the development of such methods. We report the development of Y-shaped Adapter-ligated MAture TRNA sequencing (YAMAT-seq), an efficient and convenient method for high-throughput sequencing of mature tRNAs. YAMAT-seq circumvents the issue of inefficient adapter ligation, a characteristic of conventional RNA sequencing methods for mature tRNAs, by employing the efficient and specific ligation of a Y-shaped adapter to mature tRNAs using T4 RNA Ligase 2. Subsequent cDNA amplification and next-generation sequencing successfully yield numerous mature tRNA sequences. YAMAT-seq has high specificity for mature tRNAs and high sensitivity to detect most isoacceptors from minute amounts of total RNA. Moreover, YAMAT-seq shows quantitative capability to estimate expression levels of mature tRNAs, and has high reproducibility and broad applicability for various cell lines. YAMAT-seq thus provides a high-throughput technique for identifying tRNA profiles and their regulation in various transcriptomes, which could play important regulatory roles in translation and other biological processes. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.
Comparing viral metagenomics methods using a highly multiplexed human viral pathogens reagent
Li, Linlin; Deng, Xutao; Mee, Edward T.; Collot-Teixeira, Sophie; Anderson, Rob; Schepelmann, Silke; Minor, Philip D.; Delwart, Eric
2014-01-01
Unbiased metagenomic sequencing holds significant potential as a diagnostic tool for the simultaneous detection of any previously genetically described viral nucleic acids in clinical samples. Viral genome sequences can also inform on likely phenotypes including drug susceptibility or neutralization serotypes. In this study, we compared the effects of different variables in the laboratory methods often used to generate viral metagenomics libraries on the efficiency of viral detection and virus genome coverage. A biological reagent consisting of 25 different human RNA and DNA viral pathogens was used to estimate the effect of filtration and nuclease digestion, DNA/RNA extraction methods, pre-amplification and the use of different library preparation kits on the detection of viral nucleic acids. Filtration and nuclease treatment led to slight decreases in the percentage of viral sequence reads and number of viruses detected. For nucleic acid extractions, silica spin columns improved viral sequence recovery relative to magnetic beads and Trizol extraction. Pre-amplification using random RT-PCR, while generating more viral sequence reads, resulted in detection of fewer viruses, more overlapping sequences, and lower genome coverage. The ScriptSeq library preparation method retrieved more viruses and a greater fraction of their genomes than the TruSeq and Nextera methods. Viral metagenomics sequencing was able to simultaneously detect up to 22 different viruses in the biological reagent analyzed, including all those detected by qPCR. Further optimization will be required for the detection of viruses in biologically more complex samples such as tissues, blood, or feces. PMID:25497414
Communicating the Benefits of a Full Sequence of High School Science Courses
ERIC Educational Resources Information Center
Nicholas, Catherine Marie
2014-01-01
High school students are generally uninformed about the benefits of enrolling in a full sequence of science courses, therefore only about a third of our nation's high school graduates have completed the science sequence of Biology, Chemistry and Physics. The lack of students completing a full sequence of science courses contributes to the deficit…
ERIC Educational Resources Information Center
Larkin, Douglas B.
2016-01-01
This article examines the process of shifting to a "Physics First" sequence in science course offerings in three school districts in the United States. This curricular sequence reverses the more common U.S. high school sequence of biology/chemistry/physics, and has gained substantial support in the physics education community over the…
Integrated Modular Teaching of Human Biology for Primary Care Practitioners
ERIC Educational Resources Information Center
Glasgow, Michael S.
1977-01-01
Describes the use of integrated modular teaching of the human biology component of the Health Associate Program at Johns Hopkins University, where the goal is to develop an understanding of the sciences as applied to primary care. Discussion covers the module sequence, the human biology faculty, goals of the human biology faculty, laboratory…
Symposium: The Role of Biological Sciences in the Optometric Curriculum.
ERIC Educational Resources Information Center
Rapp, Jerry; And Others
1980-01-01
Papers from a symposium probing some of the curricular elements of the program in biological sciences at a school or college of optometry are provided. The overall program sequence in the biological sciences, microbiology, pharmacology, and the curriculum in the biological sciences from a clinical perspective are discussed. (Author/MLW)
Torque measurements reveal sequence-specific cooperative transitions in supercoiled DNA
Oberstrass, Florian C.; Fernandes, Louis E.; Bryant, Zev
2012-01-01
B-DNA becomes unstable under superhelical stress and is able to adopt a wide range of alternative conformations including strand-separated DNA and Z-DNA. Localized sequence-dependent structural transitions are important for the regulation of biological processes such as DNA replication and transcription. To directly probe the effect of sequence on structural transitions driven by torque, we have measured the torsional response of a panel of DNA sequences using single molecule assays that employ nanosphere rotational probes to achieve high torque resolution. The responses of Z-forming d(pGpC)n sequences match our predictions based on a theoretical treatment of cooperative transitions in helical polymers. “Bubble” templates containing 50–100 bp mismatch regions show cooperative structural transitions similar to B-DNA, although less torque is required to disrupt strand–strand interactions. Our mechanical measurements, including direct characterization of the torsional rigidity of strand-separated DNA, establish a framework for quantitative predictions of the complex torsional response of arbitrary sequences in their biological context. PMID:22474350
Learning Quantitative Sequence-Function Relationships from Massively Parallel Experiments
NASA Astrophysics Data System (ADS)
Atwal, Gurinder S.; Kinney, Justin B.
2016-03-01
A fundamental aspect of biological information processing is the ubiquity of sequence-function relationships—functions that map the sequence of DNA, RNA, or protein to a biochemically relevant activity. Most sequence-function relationships in biology are quantitative, but only recently have experimental techniques for effectively measuring these relationships been developed. The advent of such "massively parallel" experiments presents an exciting opportunity for the concepts and methods of statistical physics to inform the study of biological systems. After reviewing these recent experimental advances, we focus on the problem of how to infer parametric models of sequence-function relationships from the data produced by these experiments. Specifically, we retrace and extend recent theoretical work showing that inference based on mutual information, not the standard likelihood-based approach, is often necessary for accurately learning the parameters of these models. Closely connected with this result is the emergence of "diffeomorphic modes"—directions in parameter space that are far less constrained by data than likelihood-based inference would suggest. Analogous to Goldstone modes in physics, diffeomorphic modes arise from an arbitrarily broken symmetry of the inference problem. An analytically tractable model of a massively parallel experiment is then described, providing an explicit demonstration of these fundamental aspects of statistical inference. This paper concludes with an outlook on the theoretical and computational challenges currently facing studies of quantitative sequence-function relationships.
Hennebert, Elise; Maldonado, Barbara; Ladurner, Peter; Flammang, Patrick; Santos, Romana
2015-01-01
Adhesive secretions occur in both aquatic and terrestrial animals, in which they perform diverse functions. Biological adhesives can therefore be remarkably complex and involve a large range of components with different functions and interactions. However, being mainly protein based, biological adhesives can be characterized by classical molecular methods. This review compiles experimental strategies that were successfully used to identify, characterize and obtain the full-length sequence of adhesive proteins from nine biological models: echinoderms, barnacles, tubeworms, mussels, sticklebacks, slugs, velvet worms, spiders and ticks. A brief description and practical examples are given for a variety of tools used to study adhesive molecules at different levels from genes to secreted proteins. In most studies, proteins, extracted from secreted materials or from adhesive organs, are analysed for the presence of post-translational modifications and submitted to peptide sequencing. The peptide sequences are then used directly for a BLAST search in genomic or transcriptomic databases, or to design degenerate primers to perform RT-PCR, both allowing the recovery of the sequence of the cDNA coding for the investigated protein. These sequences can then be used for functional validation and recombinant production. In recent years, the dual proteomic and transcriptomic approach has emerged as the best way leading to the identification of novel adhesive proteins and retrieval of their complete sequences. PMID:25657842
Natural product-inspired cascade synthesis yields modulators of centrosome integrity.
Dückert, Heiko; Pries, Verena; Khedkar, Vivek; Menninger, Sascha; Bruss, Hanna; Bird, Alexander W; Maliga, Zoltan; Brockmeyer, Andreas; Janning, Petra; Hyman, Anthony; Grimme, Stefan; Schürmann, Markus; Preut, Hans; Hübel, Katja; Ziegler, Slava; Kumar, Kamal; Waldmann, Herbert
2011-12-25
In biology-oriented synthesis, the scaffolds of biologically relevant compound classes inspire the synthesis of focused compound collections enriched in bioactivity. This criterion is, in particular, met by the scaffolds of natural products selected in evolution. The synthesis of natural product-inspired compound collections calls for efficient reaction sequences that preferably combine multiple individual transformations in one operation. Here we report the development of a one-pot, twelve-step cascade reaction sequence that includes nine different reactions and two opposing kinds of organocatalysis. The cascade sequence proceeds within 10-30 min and transforms readily available substrates into complex indoloquinolizines that resemble the core tetracyclic scaffold of numerous polycyclic indole alkaloids. Biological investigation of a corresponding focused compound collection revealed modulators of centrosome integrity, termed centrocountins, which caused fragmented and supernumerary centrosomes, chromosome congression defects, multipolar mitotic spindles, acentrosomal spindle poles and multipolar cell division by targeting the centrosome-associated proteins nucleophosmin and Crm1.
Automated design of genetic toggle switches with predetermined bistability.
Chen, Shuobing; Zhang, Haoqian; Shi, Handuo; Ji, Weiyue; Feng, Jingchen; Gong, Yan; Yang, Zhenglin; Ouyang, Qi
2012-07-20
Synthetic biology aims to rationally construct biological devices with required functionalities. Methods that automate the design of genetic devices without post-hoc adjustment are therefore highly desired. Here we provide a method to predictably design genetic toggle switches with predetermined bistability. To accomplish this task, a biophysical model that links ribosome binding site (RBS) DNA sequence to toggle switch bistability was first developed by integrating a stochastic model with RBS design method. Then, to parametrize the model, a library of genetic toggle switch mutants was experimentally built, followed by establishing the equivalence between RBS DNA sequences and switch bistability. To test this equivalence, RBS nucleotide sequences for different specified bistabilities were in silico designed and experimentally verified. Results show that the deciphered equivalence is highly predictive for the toggle switch design with predetermined bistability. This method can be generalized to quantitative design of other probabilistic genetic devices in synthetic biology.
Shah, Kushani; Thomas, Shelby; Stein, Arnold
2013-01-01
In this report, we describe a 5-week laboratory exercise for undergraduate biology and biochemistry students in which students learn to sequence DNA and to genotype their DNA for selected single nucleotide polymorphisms (SNPs). Students use miniaturized DNA sequencing gels that require approximately 8 min to run. The students perform G, A, T, C Sanger sequencing reactions. They prepare and run the gels, perform Southern blots (which require only 10 min), and detect sequencing ladders using a colorimetric detection system. Students enlarge their sequencing ladders from digital images of their small nylon membranes, and read the sequence manually. They compare their reads with the actual DNA sequence using BLAST2. After mastering the DNA sequencing system, students prepare their own DNA from a cheek swab, polymerase chain reaction-amplify a region of their DNA that encompasses a SNP of interest, and perform sequencing to determine their genotype at the SNP position. A family pedigree can also be constructed. The SNP chosen by the instructor was rs17822931, which is in the ABCC11 gene and is the determinant of human earwax type. Genotypes at the rs17822931 site vary in different ethnic populations. © 2013 by The International Union of Biochemistry and Molecular Biology.
Fractals in biology and medicine
NASA Technical Reports Server (NTRS)
Havlin, S.; Buldyrev, S. V.; Goldberger, A. L.; Mantegna, R. N.; Ossadnik, S. M.; Peng, C. K.; Simons, M.; Stanley, H. E.
1995-01-01
Our purpose is to describe some recent progress in applying fractal concepts to systems of relevance to biology and medicine. We review several biological systems characterized by fractal geometry, with a particular focus on the long-range power-law correlations found recently in DNA sequences containing noncoding material. Furthermore, we discuss the finding that the exponent alpha quantifying these long-range correlations ("fractal complexity") is smaller for coding than for noncoding sequences. We also discuss the application of fractal scaling analysis to the dynamics of heartbeat regulation, and report the recent finding that the normal heart is characterized by long-range "anticorrelations" which are absent in the diseased heart.
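One common way to quantify such long-range correlations is a "DNA walk" followed by a fluctuation analysis. The sketch below is an illustration under stated assumptions, not the authors' code: it assumes a pyrimidine/purine mapping (C or T steps up, A or G steps down), measures the root-mean-square fluctuation F(l) of the walk displacement over windows of length l, and estimates alpha as the slope of log F(l) against log l. All function names are hypothetical. An uncorrelated sequence is expected to give alpha close to 0.5.

```python
import math

# Assumed mapping: pyrimidines step up, purines step down.
STEP = {"C": 1, "T": 1, "A": -1, "G": -1}


def dna_walk(sequence):
    """Cumulative walk y(0..n); unknown symbols contribute a step of 0."""
    walk = [0]
    for base in sequence.upper():
        walk.append(walk[-1] + STEP.get(base, 0))
    return walk


def fluctuation(walk, length):
    """Root-mean-square fluctuation of displacements over windows of `length`."""
    displacements = [walk[i + length] - walk[i]
                     for i in range(len(walk) - length)]
    mean = sum(displacements) / len(displacements)
    variance = sum((d - mean) ** 2 for d in displacements) / len(displacements)
    return math.sqrt(variance)


def estimate_alpha(walk, lengths):
    """Least-squares slope of log F(l) versus log l."""
    points = []
    for length in lengths:
        f = fluctuation(walk, length)
        if f > 0:  # log is undefined for a zero fluctuation
            points.append((math.log(length), math.log(f)))
    if len(points) < 2:
        raise ValueError("need at least two window lengths with F(l) > 0")
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in points)
    denominator = sum((x - mean_x) ** 2 for x, _ in points)
    return numerator / denominator
```

For example, `estimate_alpha(dna_walk(seq), [4, 8, 16, 32, 64])` returns the scaling exponent over those window sizes; the window lengths are an arbitrary choice for illustration.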
Using Maximum Entropy to Find Patterns in Genomes
NASA Astrophysics Data System (ADS)
Liu, Sophia; Hockenberry, Adam; Lancichinetti, Andrea; Jewett, Michael; Amaral, Luis
The existence of over- and under-represented sequence motifs in genomes provides evidence of selective evolutionary pressures on biological mechanisms such as transcription, translation, ligand-substrate binding, and host immunity. To accurately identify motifs and other genome-scale patterns of interest, it is essential to be able to generate accurate null models that are appropriate for the sequences under study. There are currently no tools available that allow users to create random coding sequences with specified amino acid composition and GC content. Using the principle of maximum entropy, we developed a method that generates unbiased random sequences with pre-specified amino acid and GC content. Our method is the simplest way to obtain maximally unbiased random sequences that are subject to GC usage and primary amino acid sequence constraints. This approach can also easily be expanded to create unbiased random sequences that incorporate more complicated constraints such as individual nucleotide usage or even di-nucleotide frequencies. The ability to generate correctly specified null models will allow researchers to accurately identify sequence motifs, which will lead to a better understanding of biological processes. National Institute of General Medical Science, Northwestern University Presidential Fellowship, National Science Foundation, David and Lucile Packard Foundation, Camille Dreyfus Teacher Scholar Award.
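The abstract does not spell out the procedure, so the following is only a sketch of one way the maximum entropy principle can be applied here, under explicit assumptions: the amino acid sequence is fixed, the standard genetic code is used, and the only further constraint is the expected GC fraction. Under those assumptions the maximum entropy distribution picks each synonymous codon with probability proportional to exp(lambda × GC count of the codon), and lambda is found by bisection so that the expected GC content matches the target. All names (`solve_lambda`, `sample_coding_sequence`, and so on) are hypothetical.

```python
import math
import random
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
SYNONYMS = {}
for codon, aa in zip(("".join(c) for c in product(BASES, repeat=3)), AMINO_ACIDS):
    SYNONYMS.setdefault(aa, []).append(codon)


def gc_count(codon):
    """Number of G or C bases in a codon."""
    return sum(base in "GC" for base in codon)


def codon_probabilities(aa, lam):
    """Exponential-family (maximum entropy) weights over synonymous codons."""
    codons = SYNONYMS[aa]
    weights = [math.exp(lam * gc_count(c)) for c in codons]
    total = sum(weights)
    return codons, [w / total for w in weights]


def expected_gc(protein, lam):
    """Expected GC fraction of a coding sequence sampled with parameter lam."""
    gc = 0.0
    for aa in protein:
        codons, probs = codon_probabilities(aa, lam)
        gc += sum(p * gc_count(c) for c, p in zip(codons, probs))
    return gc / (3 * len(protein))


def solve_lambda(protein, target_gc, lo=-20.0, hi=20.0, iterations=60):
    """Bisection on lam; expected_gc is non-decreasing in lam."""
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if expected_gc(protein, mid) < target_gc:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def sample_coding_sequence(protein, target_gc, rng=random):
    """Draw one random coding sequence for `protein` near the target GC."""
    lam = solve_lambda(protein, target_gc)
    chosen = []
    for aa in protein:
        codons, probs = codon_probabilities(aa, lam)
        chosen.append(rng.choices(codons, weights=probs)[0])
    return "".join(chosen)
```

The target GC fraction must lie within the range the protein's synonymous codons can reach; outside that range the bisection simply saturates at one end. Matching an amino acid composition rather than a fixed sequence, or adding di-nucleotide constraints, would need additional multipliers and is not shown.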
BASiNET-BiologicAl Sequences NETwork: a case study on coding and non-coding RNAs identification.
Ito, Eric Augusto; Katahira, Isaque; Vicente, Fábio Fernandes da Rocha; Pereira, Luiz Filipe Protasio; Lopes, Fabrício Martins
2018-06-05
With the emergence of Next Generation Sequencing (NGS) technologies, a large volume of sequence data, in particular from de novo sequencing, has been rapidly produced at relatively low cost. In this context, computational tools are increasingly important to assist in the identification of relevant information to understand the functioning of organisms. This work introduces BASiNET, an alignment-free tool for classifying biological sequences based on feature extraction from complex network measurements. The method initially transforms the sequences and represents them as complex networks. Then it extracts topological measures and constructs a feature vector that is used to classify the sequences. The method was evaluated in the classification of coding and non-coding RNAs of 13 species and compared to the CNCI, PLEK and CPC2 methods. BASiNET outperformed all compared methods in all adopted organisms and datasets. BASiNET classified sequences in all organisms with high accuracy and low standard deviation, showing that the method is robust and not biased by the organism. The proposed methodology is implemented as open source in the R language and is freely available for download at https://cran.r-project.org/package=BASiNET.
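The general idea of turning a sequence into a network and reading off topological measures can be sketched as follows. This is not the BASiNET implementation (which is an R package) and does not reproduce its parameters: it assumes, for illustration only, that nodes are overlapping words of fixed length, that consecutive words are joined by an undirected edge, and that a handful of simple measures form the feature vector. The function names and the choice of measures are hypothetical.

```python
from collections import defaultdict


def sequence_to_network(sequence, word=3, step=1):
    """Undirected network: nodes are words, edges join consecutive words."""
    neighbours = defaultdict(set)
    words = [sequence[i:i + word]
             for i in range(0, len(sequence) - word + 1, step)]
    for first, second in zip(words, words[1:]):
        if first != second:  # ignore self-loops
            neighbours[first].add(second)
            neighbours[second].add(first)
    return neighbours


def topological_features(neighbours):
    """A few simple measures that could feed a classifier."""
    degrees = [len(adjacent) for adjacent in neighbours.values()]
    if not degrees:
        return {"nodes": 0, "edges": 0, "average_degree": 0.0, "max_degree": 0}
    return {
        "nodes": len(degrees),
        "edges": sum(degrees) // 2,  # each undirected edge is counted twice
        "average_degree": sum(degrees) / len(degrees),
        "max_degree": max(degrees),
    }


features = topological_features(sequence_to_network("ACGACG"))
print(features)
```

A feature vector built this way is alignment-free: it depends only on which words follow each other, so it can be computed for every sequence independently before training a classifier.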
Ikeda, Shun; Abe, Takashi; Nakamura, Yukiko; Kibinge, Nelson; Hirai Morita, Aki; Nakatani, Atsushi; Ono, Naoaki; Ikemura, Toshimichi; Nakamura, Kensuke; Altaf-Ul-Amin, Md; Kanaya, Shigehiko
2013-05-01
Biology is increasingly becoming a data-intensive science with the recent progress of the omics fields, e.g. genomics, transcriptomics, proteomics and metabolomics. The species-metabolite relationship database, KNApSAcK Core, has been widely utilized and cited in metabolomics research, and chronological analysis of that research work has helped to reveal recent trends in metabolomics research. To meet the needs of these trends, the KNApSAcK database has been extended by incorporating a secondary metabolic pathway database called Motorcycle DB. We examined the enzyme sequence diversity related to secondary metabolism by means of batch-learning self-organizing maps (BL-SOMs). Initially, we constructed a map by using a big data matrix consisting of the frequencies of all possible dipeptides in the protein sequence segments of plants and bacteria. The enzyme sequence diversity of the secondary metabolic pathways was examined by identifying clusters of segments associated with certain enzyme groups in the resulting map. The extent of diversity of 15 secondary metabolic enzyme groups is discussed. Data-intensive approaches such as BL-SOM applied to big data matrices are needed for systematizing protein sequences. Handling big data has become an inevitable part of biology.
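The input to such a map is a matrix with one row per protein segment and one column per dipeptide. A minimal sketch of how one row could be computed is given below, assuming the 20 standard amino acids (hence 400 possible dipeptides) and overlapping pairs; the function name and the handling of non-standard residues are assumptions for illustration, and the BL-SOM training itself is not shown.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 20 x 20 = 400 ordered dipeptides, in a fixed column order.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_frequencies(segment):
    """Relative frequency of each dipeptide among overlapping residue pairs.

    Pairs containing a non-standard residue are skipped (an assumption).
    """
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(segment) - 1):
        pair = segment[i:i + 2]
        if pair in counts:
            counts[pair] += 1
            total += 1
    if total == 0:
        return [0.0] * len(DIPEPTIDES)
    return [counts[d] / total for d in DIPEPTIDES]


row = dipeptide_frequencies("MKMK")  # pairs: MK, KM, MK
```

Stacking one such row per segment gives the data matrix on which a self-organizing map can be trained.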
Conserved noncoding sequences conserve biological networks and influence genome evolution.
Xie, Jianbo; Qian, Kecheng; Si, Jingna; Xiao, Liang; Ci, Dong; Zhang, Deqiang
2018-05-01
Comparative genomics approaches have identified numerous conserved cis-regulatory sequences near genes in plant genomes. Despite the identification of these conserved noncoding sequences (CNSs), our knowledge of their functional importance and selection remains limited. Here, we used a combination of DNA methylome analysis, microarray expression analyses, and functional annotation to study these sequences in the model tree Populus trichocarpa. Methylation in CG contexts and non-CG contexts was lower in CNSs, particularly CNSs in the 5'-upstream regions of genes, compared with other sites in the genome. We observed that CNSs are enriched in genes with transcription and binding functions, and that this enrichment is also associated with syntenic genes and those from whole-genome duplications, suggesting that cis-regulatory sequences play a key role in genome evolution. We detected a significant positive correlation between CNS number and protein interactions, suggesting that CNSs may have roles in the evolution and maintenance of biological networks. The divergence of CNSs indicates that duplication-degeneration-complementation drives the subfunctionalization of a proportion of duplicated genes from whole-genome duplication. Furthermore, population genomics confirmed that most CNSs are under strong purifying selection and only a small subset of CNSs shows evidence of adaptive evolution. These findings provide a foundation for future studies exploring these key genomic features in the maintenance of biological networks, local adaptation, and transcription.
incaRNAfbinv: a web server for the fragment-based design of RNA sequences
Drory Retwitzer, Matan; Reinharz, Vladimir; Ponty, Yann; Waldispühl, Jérôme; Barash, Danny
2016-01-01
In recent years, new methods for computational RNA design have been developed and applied to various problems in synthetic biology and nanotechnology. Lately, there is considerable interest in incorporating essential biological information when solving the inverse RNA folding problem. Correspondingly, RNAfbinv aims at including biologically meaningful constraints and is the only program to date that performs a fragment-based design of RNA sequences. In doing so it allows the design of sequences that do not necessarily exactly fold into the target, as long as the overall coarse-grained tree graph shape is preserved. Augmented by the weighted sampling algorithm of incaRNAtion, our web server called incaRNAfbinv implements the method devised in RNAfbinv and offers an interactive environment for the inverse folding of RNA using a fragment-based design approach. It takes as input: a target RNA secondary structure; optional sequence and motif constraints; optional target minimum free energy, neutrality and GC content. In addition to the design of synthetic regulatory sequences, it can be used as a pre-processing step for the detection of novel naturally occurring RNAs. The two complementary methodologies RNAfbinv and incaRNAtion are merged together and fully implemented in our web server incaRNAfbinv, available at http://www.cs.bgu.ac.il/incaRNAfbinv. PMID:27185893
Gabanyi, Margaret J; Adams, Paul D; Arnold, Konstantin; Bordoli, Lorenza; Carter, Lester G; Flippen-Andersen, Judith; Gifford, Lida; Haas, Juergen; Kouranov, Andrei; McLaughlin, William A; Micallef, David I; Minor, Wladek; Shah, Raship; Schwede, Torsten; Tao, Yi-Ping; Westbrook, John D; Zimmerman, Matthew; Berman, Helen M
2011-07-01
The Protein Structure Initiative's Structural Biology Knowledgebase (SBKB, URL: http://sbkb.org ) is an open web resource designed to turn the products of the structural genomics and structural biology efforts into knowledge that can be used by the biological community to understand living systems and disease. Here we will present examples on how to use the SBKB to enable biological research. For example, a protein sequence or Protein Data Bank (PDB) structure ID search will provide a list of related protein structures in the PDB, associated biological descriptions (annotations), homology models, structural genomics protein target status, experimental protocols, and the ability to order available DNA clones from the PSI:Biology-Materials Repository. A text search will find publication and technology reports resulting from the PSI's high-throughput research efforts. Web tools that aid in research, including a system that accepts protein structure requests from the community, will also be described. Created in collaboration with the Nature Publishing Group, the Structural Biology Knowledgebase monthly update also provides a research library, editorials about new research advances, news, and an events calendar to present a broader view of structural genomics and structural biology.
Manthey, Seth; Brewe, Eric
2013-01-01
University Modeling Instruction (UMI) is an approach to curriculum and pedagogy that focuses instruction on engaging students in building, validating, and deploying scientific models. Modeling Instruction has been successfully implemented in both high school and university physics courses. Studies within the physics education research (PER) community have identified UMI's positive impacts on learning gains, equity, attitudinal shifts, and self-efficacy. While the success of this pedagogical approach has been recognized within the physics community, the use of models and modeling practices is still being developed for biology. Drawing from the existing research on UMI in physics, we describe the theoretical foundations of UMI and how UMI can be adapted to include an emphasis on models and modeling for undergraduate introductory biology courses. In particular, we discuss our ongoing work to develop a framework for the first semester of a two-semester introductory biology course sequence by identifying the essential basic models for an introductory biology course sequence. PMID:23737628
Manthey, Seth; Brewe, Eric
2013-06-01
University Modeling Instruction (UMI) is an approach to curriculum and pedagogy that focuses instruction on engaging students in building, validating, and deploying scientific models. Modeling Instruction has been successfully implemented in both high school and university physics courses. Studies within the physics education research (PER) community have identified UMI's positive impacts on learning gains, equity, attitudinal shifts, and self-efficacy. While the success of this pedagogical approach has been recognized within the physics community, the use of models and modeling practices is still being developed for biology. Drawing from the existing research on UMI in physics, we describe the theoretical foundations of UMI and how UMI can be adapted to include an emphasis on models and modeling for undergraduate introductory biology courses. In particular, we discuss our ongoing work to develop a framework for the first semester of a two-semester introductory biology course sequence by identifying the essential basic models for an introductory biology course sequence.
ERIC Educational Resources Information Center
Rissing, Steven W.
2013-01-01
Most American colleges and universities offer gateway biology courses to meet the needs of three undergraduate audiences: biology and related science majors, many of whom will become biomedical researchers; premedical students meeting medical school requirements and preparing for the Medical College Admissions Test (MCAT); and students completing…
ERIC Educational Resources Information Center
Auerbach, Anna Jo; Schussler, Elisabeth
2017-01-01
Increasing faculty use of active-learning (AL) pedagogies in college classrooms is a persistent challenge in biology education. A large research-intensive university implemented changes to its biology majors' two-course introductory sequence as outlined by the "Vision and Change in Undergraduate Biology Education" final report. One goal…
Scalable Kernel Methods and Algorithms for General Sequence Analysis
ERIC Educational Resources Information Center
Kuksa, Pavel
2011-01-01
Analysis of large-scale sequential data has become an important task in machine learning and pattern recognition, inspired in part by numerous scientific and technological applications such as the document and text classification or the analysis of biological sequences. However, current computational methods for sequence comparison still lack…
USDA-ARS's Scientific Manuscript database
Modern biological analyses are often assisted by recent technologies making the sequencing of complex genomes both technically possible and feasible. We recently sequenced the tomato genome that, like many eukaryotic genomes, is large and complex. Current sequencing technologies allow the developmen...
Discovery of Escherichia coli CRISPR sequences in an undergraduate laboratory.
Militello, Kevin T; Lazatin, Justine C
2017-05-01
Clustered regularly interspaced short palindromic repeats (CRISPRs) represent a novel type of adaptive immune system found in eubacteria and archaebacteria. CRISPRs have recently generated a lot of attention due to their unique ability to catalog foreign nucleic acids, their ability to destroy foreign nucleic acids in a mechanism that shares some similarity to RNA interference, and the ability to utilize reconstituted CRISPR systems for genome editing in numerous organisms. In order to introduce CRISPR biology into an undergraduate upper-level laboratory, a five-week set of exercises was designed to allow students to examine the CRISPR status of uncharacterized Escherichia coli strains and to allow the discovery of new repeats and spacers. Students started the project by isolating genomic DNA from E. coli and amplifying the iap CRISPR locus using the polymerase chain reaction (PCR). The PCR products were analyzed by Sanger DNA sequencing, and the sequences were examined for the presence of CRISPR repeat sequences. The regions between the repeats, the spacers, were extracted and analyzed with BLASTN searches. Overall, CRISPR loci were sequenced from several previously uncharacterized E. coli strains and one E. coli K-12 strain. Sanger DNA sequencing resulted in the discovery of 36 spacer sequences and their corresponding surrounding repeat sequences. Five of the spacers were homologous to foreign (non-E. coli) DNA. Assessment of the laboratory indicates that improvements were made in the ability of students to answer questions relating to the structure and function of CRISPRs. Future directions of the laboratory are presented and discussed. © 2016 by The International Union of Biochemistry and Molecular Biology, 45(3):262-269, 2017.
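The spacer-extraction step described above can be sketched in a few lines of Python: given a sequenced locus and a known repeat, the spacers are the stretches lying between consecutive copies of the repeat. The sketch assumes exact, non-overlapping repeat matches and uses made-up toy sequences. Real CRISPR repeats can vary slightly between copies, so an actual analysis would likely need approximate matching, and the abstract does not say which tools the students used for this step.

```python
def extract_spacers(locus, repeat):
    """Return the sequences lying between consecutive exact copies of `repeat`."""
    positions = []
    start = locus.find(repeat)
    while start != -1:
        positions.append(start)
        # Continue searching after the current repeat (non-overlapping matches).
        start = locus.find(repeat, start + len(repeat))
    return [
        locus[left + len(repeat):right]
        for left, right in zip(positions, positions[1:])
    ]


if __name__ == "__main__":
    toy_repeat = "GGTT"  # hypothetical repeat, far shorter than a real one
    toy_locus = "AA" + toy_repeat + "ACGTAC" + toy_repeat + "TTTCCC" + toy_repeat + "AA"
    for spacer in extract_spacers(toy_locus, toy_repeat):
        print(spacer)
```

Each extracted spacer could then be saved in FASTA format and submitted to a BLASTN search, as the abstract describes.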
Predicting PDZ domain mediated protein interactions from structure
2013-01-01
Background PDZ domains are structural protein domains that recognize simple linear amino acid motifs, often at protein C-termini, and mediate protein-protein interactions (PPIs) in important biological processes, such as ion channel regulation, cell polarity and neural development. PDZ domain-peptide interaction predictors have been developed based on domain and peptide sequence information. Since domain structure is known to influence binding specificity, we hypothesized that structural information could be used to predict new interactions compared to sequence-based predictors. Results We developed a novel computational predictor of PDZ domain and C-terminal peptide interactions using a support vector machine trained with PDZ domain structure and peptide sequence information. Performance was estimated using extensive cross validation testing. We used the structure-based predictor to scan the human proteome for ligands of 218 PDZ domains and show that the predictions correspond to known PDZ domain-peptide interactions and PPIs in curated databases. The structure-based predictor is complementary to the sequence-based predictor, finding unique known and novel PPIs, and is less dependent on training–testing domain sequence similarity. We used a functional enrichment analysis of our hits to create a predicted map of PDZ domain biology. This map highlights PDZ domain involvement in diverse biological processes, some only found by the structure-based predictor. Based on this analysis, we predict novel PDZ domain involvement in xenobiotic metabolism and suggest new interactions for other processes including wound healing and Wnt signalling. Conclusions We built a structure-based predictor of PDZ domain-peptide interactions, which can be used to scan C-terminal proteomes for PDZ interactions. 
We also show that the structure-based predictor finds many known PDZ mediated PPIs in human that were not found by our previous sequence-based predictor and is less dependent on training–testing domain sequence similarity. Using both predictors, we defined a functional map of human PDZ domain biology and predict novel PDZ domain function. Users may access our structure-based and previous sequence-based predictors at http://webservice.baderlab.org/domains/POW. PMID:23336252
Barling, Adam; Swaminathan, Kankshita; Mitros, Therese; James, Brandon T; Morris, Juliette; Ngamboma, Ornella; Hall, Megan C; Kirkpatrick, Jessica; Alabady, Magdy; Spence, Ashley K; Hudson, Matthew E; Rokhsar, Daniel S; Moose, Stephen P
2013-12-09
The Miscanthus genus of perennial C4 grasses contains promising biofuel crops for temperate climates. However, few genomic resources exist for Miscanthus, which limits understanding of its interesting biology and future genetic improvement. A comprehensive catalog of expressed sequences was generated from a variety of Miscanthus species and tissue types, with an emphasis on characterizing gene expression changes in spring compared to fall rhizomes. Illumina short read sequencing technology was used to produce transcriptome sequences from different tissues and organs during distinct developmental stages for multiple Miscanthus species, including Miscanthus sinensis, Miscanthus sacchariflorus, and their interspecific hybrid Miscanthus × giganteus. More than fifty billion base-pairs of Miscanthus transcript sequence were produced. Overall, 26,230 Sorghum gene models (i.e., ~ 96% of predicted Sorghum genes) had at least five Miscanthus reads mapped to them, suggesting that a large portion of the Miscanthus transcriptome is represented in this dataset. The Miscanthus × giganteus data were used to identify genes preferentially expressed in a single tissue, such as the spring rhizome, using Sorghum bicolor as a reference. Quantitative real-time PCR was used to verify examples of preferential expression predicted via RNA-Seq. Contiguous consensus transcript sequences were assembled for each species and annotated using InterProScan. Sequences from the assembled transcriptome were used to amplify genomic segments from a doubled haploid Miscanthus sinensis and from Miscanthus × giganteus to further disentangle the allelic and paralogous variations in genes. This large expressed sequence tag collection creates a valuable resource for the study of Miscanthus biology by providing detailed gene sequence information and tissue preferred expression patterns. 
We have successfully generated a database of transcriptome assemblies and demonstrated its use in the study of genes of interest. Analysis of gene expression profiles revealed biological pathways that exhibit altered regulation in spring compared to fall rhizomes, which are consistent with their different physiological functions. The expression profiles of the subterranean rhizome provide a better understanding of the biological activities of the underground stem structures that are essential for perenniality and the storage or remobilization of carbon and nutrient resources.
Song, Yuhyun; Leman, Scotland; Monteil, Caroline L.; Heath, Lenwood S.; Vinatzer, Boris A.
2014-01-01
A broadly accepted and stable biological classification system is a prerequisite for biological sciences. It provides the means to describe and communicate about life without ambiguity. Current biological classification and nomenclature use the species as the basic unit and require lengthy and laborious species descriptions before newly discovered organisms can be assigned to a species and be named. The current system is thus inadequate to classify and name the immense genetic diversity within species that is now being revealed by genome sequencing on a daily basis. To address this lack of a general intra-species classification and naming system adequate for today’s speed of discovery of new diversity, we propose a classification and naming system that is exclusively based on genome similarity and that is suitable for automatic assignment of codes to any genome-sequenced organism without requiring any phenotypic or phylogenetic analysis. We provide examples demonstrating that genome similarity-based codes largely align with current taxonomic groups at many different levels in bacteria, animals, humans, plants, and viruses. Importantly, the proposed approach is only slightly affected by the order of code assignment and can thus provide codes that reflect similarity between organisms and that do not need to be revised upon discovery of new diversity. We envision genome similarity-based codes to complement current biological nomenclature and to provide a universal means to communicate unambiguously about any genome-sequenced organism in fields as diverse as biodiversity research, infectious disease control, human and microbial forensics, animal breed and plant cultivar certification, and human ancestry research. PMID:24586551
Biological function in the twilight zone of sequence conservation.
Ponting, Chris P
2017-08-16
Strong DNA conservation among divergent species is an indicator of enduring functionality. With weaker sequence conservation we enter a vast 'twilight zone' in which sequence subject to transient or lower constraint cannot be distinguished easily from neutrally evolving, non-functional sequence. Twilight zone functional sequence is illuminated instead by principles of selective constraint and positive selection using genomic data acquired from within a species' population. Application of these principles reveals that despite being biochemically active, most twilight zone sequence is not functional.
Blount, Benjamin A.; Weenink, Tim; Vasylechko, Serge; Ellis, Tom
2012-01-01
Yeast is an ideal organism for the development and application of synthetic biology, yet there remain relatively few well-characterised biological parts suitable for precise engineering of this chassis. In order to address this current need, we present here a strategy that takes a single biological part, a promoter, and re-engineers it to produce a fine-graded output range promoter library and new regulated promoters desirable for orthogonal synthetic biology applications. A highly constitutive Saccharomyces cerevisiae promoter, PFY1p, was identified by bioinformatic approaches, characterised in vivo and diversified at its core sequence to create a 36-member promoter library. TetR regulation was introduced into PFY1p to create a synthetic inducible promoter (iPFY1p) that functions in an inverter device. Orthogonal and scalable regulation of synthetic promoters was then demonstrated for the first time using customisable Transcription Activator-Like Effectors (TALEs) modified and designed to act as orthogonal repressors for specific PFY1-based promoters. The ability to diversify a promoter at its core sequences and then independently target Transcription Activator-Like Orthogonal Repressors (TALORs) to virtually any of these sequences shows great promise toward the design and construction of future synthetic gene networks that encode complex “multi-wire” logic functions. PMID:22442681
Blount, Benjamin A; Weenink, Tim; Vasylechko, Serge; Ellis, Tom
2012-01-01
Yeast is an ideal organism for the development and application of synthetic biology, yet there remain relatively few well-characterised biological parts suitable for precise engineering of this chassis. In order to address this current need, we present here a strategy that takes a single biological part, a promoter, and re-engineers it to produce a fine-graded output range promoter library and new regulated promoters desirable for orthogonal synthetic biology applications. A highly constitutive Saccharomyces cerevisiae promoter, PFY1p, was identified by bioinformatic approaches, characterised in vivo and diversified at its core sequence to create a 36-member promoter library. TetR regulation was introduced into PFY1p to create a synthetic inducible promoter (iPFY1p) that functions in an inverter device. Orthogonal and scalable regulation of synthetic promoters was then demonstrated for the first time using customisable Transcription Activator-Like Effectors (TALEs) modified and designed to act as orthogonal repressors for specific PFY1-based promoters. The ability to diversify a promoter at its core sequences and then independently target Transcription Activator-Like Orthogonal Repressors (TALORs) to virtually any of these sequences shows great promise toward the design and construction of future synthetic gene networks that encode complex "multi-wire" logic functions.
RBT-GA: a novel metaheuristic for solving the multiple sequence alignment problem
Taheri, Javid; Zomaya, Albert Y
2009-01-01
Background Multiple Sequence Alignment (MSA) has always been an active area of research in Bioinformatics. MSA is mainly focused on discovering biologically meaningful relationships among different sequences or proteins in order to investigate the underlying main characteristics/functions. This information is also used to generate phylogenetic trees. Results This paper presents a novel approach, namely RBT-GA, to solve the MSA problem using a hybrid solution methodology combining the Rubber Band Technique (RBT) and the Genetic Algorithm (GA) metaheuristic. RBT is inspired by the behavior of an elastic Rubber Band (RB) on a plate with several poles, which is analogous to locations in the input sequences that could potentially be biologically related. A GA attempts to mimic the evolutionary processes of life in order to locate optimal solutions in an often very complex landscape. RBT-GA is a population-based optimization algorithm designed to find the optimal alignment for a set of input protein sequences. In this novel technique, each alignment answer is modeled as a chromosome consisting of several poles in the RBT framework. These poles resemble locations in the input sequences that are most likely to be correlated and/or biologically related. A GA-based optimization process improves these chromosomes gradually, yielding a set of mostly optimal answers for the MSA problem. Conclusion RBT-GA is tested with one of the well-known benchmark suites (BALiBASE 2.0) in this area. The obtained results show the superiority of the proposed technique even in the case of formidable sequences. PMID:19594869
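Any GA-based aligner needs a fitness function to rank candidate alignments. The abstract does not state which scoring function RBT-GA uses, so the Python sketch below shows one common choice from the MSA literature, the sum-of-pairs score, purely as an assumption. The match, mismatch and gap values are arbitrary illustrative numbers, not parameters from the paper; a real protein aligner would more likely use a substitution matrix.

```python
from itertools import combinations


def sum_of_pairs(alignment, match=1, mismatch=-1, gap=-2):
    """Score an alignment (equal-length rows, '-' for gaps) column by column."""
    if len({len(row) for row in alignment}) > 1:
        raise ValueError("all aligned rows must have the same length")
    score = 0
    for column in zip(*alignment):
        for a, b in combinations(column, 2):
            if a == "-" and b == "-":
                continue  # gap-gap pairs contribute nothing
            if a == "-" or b == "-":
                score += gap
            elif a == b:
                score += match
            else:
                score += mismatch
    return score


if __name__ == "__main__":
    candidate = ["AC-T", "ACGT", "A-GT"]
    print(sum_of_pairs(candidate))
```

In a GA setting, a function like this would be evaluated on the alignment decoded from each chromosome, and higher-scoring chromosomes would be favoured during selection.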
2011-01-01
Background We present the genome sequence of the tammar wallaby, Macropus eugenii, which is a member of the kangaroo family and the first representative of the iconic hopping mammals that symbolize Australia to be sequenced. The tammar has many unusual biological characteristics, including the longest period of embryonic diapause of any mammal, extremely synchronized seasonal breeding and prolonged and sophisticated lactation within a well-defined pouch. Like other marsupials, it gives birth to highly altricial young, and has a small number of very large chromosomes, making it a valuable model for genomics, reproduction and development. Results The genome has been sequenced to 2 × coverage using Sanger sequencing, enhanced with additional next generation sequencing and the integration of extensive physical and linkage maps to build the genome assembly. We also sequenced the tammar transcriptome across many tissues and developmental time points. Our analyses of these data shed light on mammalian reproduction, development and genome evolution: there is innovation in reproductive and lactational genes, rapid evolution of germ cell genes, and incomplete, locus-specific X inactivation. We also observe novel retrotransposons and a highly rearranged major histocompatibility complex, with many class I genes located outside the complex. Novel microRNAs in the tammar HOX clusters uncover new potential mammalian HOX regulatory elements. Conclusions Analyses of these resources enhance our understanding of marsupial gene evolution, identify marsupial-specific conserved non-coding elements and critical genes across a range of biological systems, including reproduction, development and immunity, and provide new insight into marsupial and mammalian biology and genome evolution. PMID:21854559
Renfree, Marilyn B; Papenfuss, Anthony T; Deakin, Janine E; Lindsay, James; Heider, Thomas; Belov, Katherine; Rens, Willem; Waters, Paul D; Pharo, Elizabeth A; Shaw, Geoff; Wong, Emily S W; Lefèvre, Christophe M; Nicholas, Kevin R; Kuroki, Yoko; Wakefield, Matthew J; Zenger, Kyall R; Wang, Chenwei; Ferguson-Smith, Malcolm; Nicholas, Frank W; Hickford, Danielle; Yu, Hongshi; Short, Kirsty R; Siddle, Hannah V; Frankenberg, Stephen R; Chew, Keng Yih; Menzies, Brandon R; Stringer, Jessica M; Suzuki, Shunsuke; Hore, Timothy A; Delbridge, Margaret L; Patel, Hardip R; Mohammadi, Amir; Schneider, Nanette Y; Hu, Yanqiu; O'Hara, William; Al Nadaf, Shafagh; Wu, Chen; Feng, Zhi-Ping; Cocks, Benjamin G; Wang, Jianghui; Flicek, Paul; Searle, Stephen M J; Fairley, Susan; Beal, Kathryn; Herrero, Javier; Carone, Dawn M; Suzuki, Yutaka; Sugano, Sumio; Toyoda, Atsushi; Sakaki, Yoshiyuki; Kondo, Shinji; Nishida, Yuichiro; Tatsumoto, Shoji; Mandiou, Ion; Hsu, Arthur; McColl, Kaighin A; Lansdell, Benjamin; Weinstock, George; Kuczek, Elizabeth; McGrath, Annette; Wilson, Peter; Men, Artem; Hazar-Rethinam, Mehlika; Hall, Allison; Davis, John; Wood, David; Williams, Sarah; Sundaravadanam, Yogi; Muzny, Donna M; Jhangiani, Shalini N; Lewis, Lora R; Morgan, Margaret B; Okwuonu, Geoffrey O; Ruiz, San Juana; Santibanez, Jireh; Nazareth, Lynne; Cree, Andrew; Fowler, Gerald; Kovar, Christie L; Dinh, Huyen H; Joshi, Vandita; Jing, Chyn; Lara, Fremiet; Thornton, Rebecca; Chen, Lei; Deng, Jixin; Liu, Yue; Shen, Joshua Y; Song, Xing-Zhi; Edson, Janette; Troon, Carmen; Thomas, Daniel; Stephens, Amber; Yapa, Lankesha; Levchenko, Tanya; Gibbs, Richard A; Cooper, Desmond W; Speed, Terence P; Fujiyama, Asao; Graves, Jennifer A M; O'Neill, Rachel J; Pask, Andrew J; Forrest, Susan M; Worley, Kim C
2011-08-29
We present the genome sequence of the tammar wallaby, Macropus eugenii, which is a member of the kangaroo family and the first representative of the iconic hopping mammals that symbolize Australia to be sequenced. The tammar has many unusual biological characteristics, including the longest period of embryonic diapause of any mammal, extremely synchronized seasonal breeding and prolonged and sophisticated lactation within a well-defined pouch. Like other marsupials, it gives birth to highly altricial young, and has a small number of very large chromosomes, making it a valuable model for genomics, reproduction and development. The genome has been sequenced to 2 × coverage using Sanger sequencing, enhanced with additional next generation sequencing and the integration of extensive physical and linkage maps to build the genome assembly. We also sequenced the tammar transcriptome across many tissues and developmental time points. Our analyses of these data shed light on mammalian reproduction, development and genome evolution: there is innovation in reproductive and lactational genes, rapid evolution of germ cell genes, and incomplete, locus-specific X inactivation. We also observe novel retrotransposons and a highly rearranged major histocompatibility complex, with many class I genes located outside the complex. Novel microRNAs in the tammar HOX clusters uncover new potential mammalian HOX regulatory elements. Analyses of these resources enhance our understanding of marsupial gene evolution, identify marsupial-specific conserved non-coding elements and critical genes across a range of biological systems, including reproduction, development and immunity, and provide new insight into marsupial and mammalian biology and genome evolution.
Model annotation for synthetic biology: automating model to nucleotide sequence conversion
Misirli, Goksel; Hallinan, Jennifer S.; Yu, Tommy; Lawson, James R.; Wimalaratne, Sarala M.; Cooling, Michael T.; Wipat, Anil
2011-01-01
Motivation: The need for the automated computational design of genetic circuits is becoming increasingly apparent with the advent of ever more complex and ambitious synthetic biology projects. Currently, most circuits are designed through the assembly of models of individual parts such as promoters, ribosome binding sites and coding sequences. These low level models are combined to produce a dynamic model of a larger device that exhibits a desired behaviour. The larger model then acts as a blueprint for physical implementation at the DNA level. However, the conversion of models of complex genetic circuits into DNA sequences is a non-trivial undertaking due to the complexity of mapping the model parts to their physical manifestation. Automating this process is further hampered by the lack of computationally tractable information in most models. Results: We describe a method for automatically generating DNA sequences from dynamic models implemented in CellML and Systems Biology Markup Language (SBML). We also identify the metadata needed to annotate models to facilitate automated conversion, and propose and demonstrate a method for the markup of these models using RDF. Our algorithm has been implemented in a software tool called MoSeC. Availability: The software is available from the authors' web site http://research.ncl.ac.uk/synthetic_biology/downloads.html. Contact: anil.wipat@ncl.ac.uk Supplementary information: Supplementary data are available at Bioinformatics online. PMID:21296753
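The core idea of the conversion described above is that once every part of a model carries a sequence annotation, a design can be emitted by concatenating the part sequences in order. The Python sketch below illustrates only that final step. The `Part` class, the role names and the toy sequences are hypothetical; the sketch does not parse CellML, SBML or RDF, and it is not the MoSeC algorithm.

```python
from dataclasses import dataclass


@dataclass
class Part:
    """Hypothetical annotated model part (name, role and DNA sequence)."""
    name: str
    role: str
    sequence: str


def assemble(parts):
    """Concatenate part sequences in design order; fail if any part is unannotated."""
    missing = [p.name for p in parts if not p.sequence]
    if missing:
        raise ValueError(f"parts lacking sequence annotation: {missing}")
    return "".join(p.sequence.upper() for p in parts)


if __name__ == "__main__":
    design = [
        Part("toy_promoter", "promoter", "ttgaca"),       # made-up sequence
        Part("toy_rbs", "ribosome_binding_site", "aggagg"),
        Part("toy_cds", "coding_sequence", "atgaaataa"),
    ]
    print(assemble(design))
```

The explicit check for unannotated parts reflects the point made in the abstract: automated conversion is only possible when the model carries the necessary metadata.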
Synthetic Biology Open Language (SBOL) Version 2.1.0.
Beal, Jacob; Cox, Robert Sidney; Grünberg, Raik; McLaughlin, James; Nguyen, Tramy; Bartley, Bryan; Bissell, Michael; Choi, Kiri; Clancy, Kevin; Macklin, Chris; Madsen, Curtis; Misirli, Goksel; Oberortner, Ernst; Pocock, Matthew; Roehner, Nicholas; Samineni, Meher; Zhang, Michael; Zhang, Zhen; Zundel, Zach; Gennari, John H; Myers, Chris; Sauro, Herbert; Wipat, Anil
2016-09-01
Synthetic biology builds upon the techniques and successes of genetics, molecular biology, and metabolic engineering by applying engineering principles to the design of biological systems. The field still faces substantial challenges, including long development times, high rates of failure, and poor reproducibility. One method to ameliorate these problems would be to improve the exchange of information about designed systems between laboratories. The Synthetic Biology Open Language (SBOL) has been developed as a standard to support the specification and exchange of biological design information in synthetic biology, filling a need not satisfied by other pre-existing standards. This document details version 2.1 of SBOL that builds upon version 2.0 published in last year's JIB special issue. In particular, SBOL 2.1 includes improved rules for what constitutes a valid SBOL document, new role fields to simplify the expression of sequence features and how components are used in context, and new best practices descriptions to improve the exchange of basic sequence topology information and the description of genetic design provenance, as well as miscellaneous other minor improvements.
Synthetic Biology Open Language (SBOL) Version 2.1.0.
Beal, Jacob; Cox, Robert Sidney; Grünberg, Raik; McLaughlin, James; Nguyen, Tramy; Bartley, Bryan; Bissell, Michael; Choi, Kiri; Clancy, Kevin; Macklin, Chris; Madsen, Curtis; Misirli, Goksel; Oberortner, Ernst; Pocock, Matthew; Roehner, Nicholas; Samineni, Meher; Zhang, Michael; Zhang, Zhen; Zundel, Zach; Gennari, John; Myers, Chris; Sauro, Herbert; Wipat, Anil
2016-12-18
Synthetic biology builds upon the techniques and successes of genetics, molecular biology, and metabolic engineering by applying engineering principles to the design of biological systems. The field still faces substantial challenges, including long development times, high rates of failure, and poor reproducibility. One method to ameliorate these problems would be to improve the exchange of information about designed systems between laboratories. The Synthetic Biology Open Language (SBOL) has been developed as a standard to support the specification and exchange of biological design information in synthetic biology, filling a need not satisfied by other pre-existing standards. This document details version 2.1 of SBOL that builds upon version 2.0 published in last year’s JIB special issue. In particular, SBOL 2.1 includes improved rules for what constitutes a valid SBOL document, new role fields to simplify the expression of sequence features and how components are used in context, and new best practices descriptions to improve the exchange of basic sequence topology information and the description of genetic design provenance, as well as miscellaneous other minor improvements.
The Dynamics of DNA Sequencing.
ERIC Educational Resources Information Center
Morvillo, Nancy
1997-01-01
Describes a paper-and-pencil activity that helps students understand DNA sequencing and expands student understanding of DNA structure, replication, and gel electrophoresis. Appropriate for advanced biology students who are familiar with the Sanger method. (DDR)
Chaitankar, Vijender; Karakülah, Gökhan; Ratnapriya, Rinki; Giuste, Felipe O.; Brooks, Matthew J.; Swaroop, Anand
2016-01-01
The advent of high throughput next generation sequencing (NGS) has accelerated the pace of discovery of disease-associated genetic variants and genomewide profiling of expressed sequences and epigenetic marks, thereby permitting systems-based analyses of ocular development and disease. Rapid evolution of NGS and associated methodologies presents significant challenges in acquisition, management, and analysis of large data sets and for extracting biologically or clinically relevant information. Here we illustrate the basic design of commonly used NGS-based methods, specifically whole exome sequencing, transcriptome, and epigenome profiling, and provide recommendations for data analyses. We briefly discuss systems biology approaches for integrating multiple data sets to elucidate gene regulatory or disease networks. While we provide examples from the retina, the NGS guidelines reviewed here are applicable to other tissues/cell types as well. PMID:27297499
ERIC Educational Resources Information Center
Luckie, Douglas B.; Hoskinson, Anne-Marie; Griffin, Caleigh E.; Hess, Andrea L.; Price, Katrina J.; Tawa, Alex; Thacker, Samantha M.
2017-01-01
The purpose of this study was to examine the educational impact of an intervention, the inquiry-focused textbook "Integrating Concepts in Biology" ("ICB"), when used in a yearlong introductory biology course sequence. Student learning was evaluated using three published instruments: 1) The Biology Concept Inventory probed depth…
Teaching Biology for a Sustainable Future
ERIC Educational Resources Information Center
Musante, Susan
2011-01-01
Students at Calvin College in Grand Rapids, Michigan, can now take an innovative biology course in which an integrated, interdisciplinary, problem-based approach is used--one that the scientific community itself is promoting. The first course in a four-semester sequence, Biology 123--The Living World: Concepts and Connections--explores real-world…
Bioinformatics: A History of Evolution "In Silico"
ERIC Educational Resources Information Center
Ondrej, Vladan; Dvorak, Petr
2012-01-01
Bioinformatics, biological databases, and the worldwide use of computers have accelerated biological research in many fields, such as evolutionary biology. Here, we describe a primer of nucleotide sequence management and the construction of a phylogenetic tree with two examples; the two selected are from completely different groups of organisms:…
Pooled-BAC sequencing of a black pod resistance region (cBPQTL12) in T. cacao
USDA-ARS's Scientific Manuscript database
Whole genome sequencing (WGS) is an expensive and technically challenging endeavor. An alternative to WGS is to sequence specific chromosomal segments of biological interest (e.g. a QTL interval). This method is cheaper than WGS and reduces the risk of misassembly from distal parts of the genome. Us...
USDA-ARS's Scientific Manuscript database
There is a growing need to combine DNA sequencing technologies to address complex problems in genome biology. These genomic studies routinely generate voluminous image, sequence, and mapping files that should be associated with quality control information (gels, spectra, etc.), and other important ...
BIOPEP database and other programs for processing bioactive peptide sequences.
Minkiewicz, Piotr; Dziuba, Jerzy; Iwaniak, Anna; Dziuba, Marta; Darewicz, Małgorzata
2008-01-01
This review presents the potential for application of computational tools in peptide science based on a sample BIOPEP database and program as well as other programs and databases available via the World Wide Web. The BIOPEP application contains a database of biologically active peptide sequences and a program enabling construction of profiles of the potential biological activity of protein fragments, calculation of quantitative descriptors as measures of the value of proteins as potential precursors of bioactive peptides, and prediction of bonds susceptible to hydrolysis by endopeptidases in a protein chain. Other bioactive and allergenic peptide sequence databases are also presented. Programs enabling the construction of binary and multiple alignments between peptide sequences, the construction of sequence motifs attributed to a given type of bioactivity, searching for potential precursors of bioactive peptides, and the prediction of sites susceptible to proteolytic cleavage in protein chains are available via the Internet as are other approaches concerning secondary structure prediction and calculation of physicochemical features based on amino acid sequence. Programs for prediction of allergenic and toxic properties have also been developed. This review explores the possibilities of cooperation between various programs.
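The profile construction described above (scanning a protein chain for known bioactive fragments) can be illustrated with a minimal Python sketch. This is not BIOPEP's code: the fragment table, the function names `activity_profile` and `occurrence_frequency`, and the per-residue frequency measure are assumptions made for illustration only.

```python
from collections import defaultdict

# Illustrative fragment table (hypothetical, NOT taken from the BIOPEP database).
FRAGMENTS = {
    "VPP": "ACE inhibitor",
    "IPP": "ACE inhibitor",
    "YG": "example activity",
}


def activity_profile(protein, fragments=FRAGMENTS):
    """Map each activity to a sorted list of (0-based start, fragment) hits."""
    profile = defaultdict(list)
    for frag, activity in fragments.items():
        start = protein.find(frag)
        while start != -1:
            profile[activity].append((start, frag))
            # Step by one residue so overlapping occurrences are also found.
            start = protein.find(frag, start + 1)
    return {activity: sorted(hits) for activity, hits in profile.items()}


def occurrence_frequency(profile, protein, activity):
    """Number of hits for one activity divided by the protein length."""
    return len(profile.get(activity, [])) / len(protein)
```

Calling `activity_profile("MVPPAIPPYGVPP")` lists every fragment hit grouped by activity; a real profile would load the fragment table from a curated database rather than a hard-coded dictionary.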
Molecular Biology of the Extremely Thermophilic Archaebacterium, Methanothermus Fervidus.
1988-04-15
have sequenced the 5S rRNA gene and part of the 16S rRNA gene from one of these clusters. The 5S rRNA shows features typical of all archaebacteria and is...gene sequence in all three biological kingdoms and the status of M. thermoautotrophicum. In: Proc. Fifth International Symposium on Microbial Growth on...C1-Compounds. ed. van Verseveld, H.W. and Duine, J.A. pp. 255-260. 2. Reeve, J.N., Beckler, G.S. and Cram, D.S. 1987. Methanogens are archaebacteria
Facile Site-Directed Mutagenesis of Large Constructs Using Gibson Isothermal DNA Assembly.
Yonemoto, Isaac T; Weyman, Philip D
2017-01-01
Site-directed mutagenesis is a commonly used molecular biology technique to manipulate biological sequences, and is especially useful for studying sequence determinants of enzyme function or designing proteins with improved activity. We describe a strategy using Gibson Isothermal DNA Assembly to perform site-directed mutagenesis on large (>~20 kbp) constructs that are outside the effective range of standard techniques such as QuikChange II (Agilent Technologies); the strategy is also more reliable than traditional cloning using restriction enzymes and ligation.
Reverse Genetics and High Throughput Sequencing Methodologies for Plant Functional Genomics
Ben-Amar, Anis; Daldoul, Samia; Reustle, Götz M.; Krczal, Gabriele; Mliki, Ahmed
2016-01-01
In the post-genomic era, increasingly sophisticated genetic tools are being developed with the long-term goal of understanding how the coordinated activity of genes gives rise to a complex organism. With the advent of next generation sequencing combined with effective computational approaches, a wide variety of plant species have been fully sequenced, yielding a wealth of sequence data on the structure and organization of plant genomes. Since thousands of gene sequences are already known, recently developed functional genomics approaches provide powerful tools to analyze plant gene functions through various gene manipulation technologies. Integration of different omics platforms along with gene annotation and computational analysis may provide a complete view at the systems biology level. Reverse genetics methodologies have been extensively investigated for assigning biological function to a specific gene or gene product. We provide here an updated overview of these high-throughput strategies, highlighting recent advances in the knowledge of functional genomics in plants. PMID:28217003
Finding similar nucleotide sequences using network BLAST searches.
Ladunga, Istvan
2009-06-01
The Basic Local Alignment Search Tool (BLAST) is a keystone of bioinformatics due to its performance and user-friendliness. Beginner and intermediate users will learn how to design and submit blastn and Megablast searches on the Web pages at the National Center for Biotechnology Information. We map nucleic acid sequences to genomes, find identical or similar mRNA, expressed sequence tag, and noncoding RNA sequences, and run Megablast searches, which are much faster than blastn. Understanding results is assisted by taxonomy reports, genomic views, and multiple alignments. We interpret expected frequency thresholds, biological significance, and statistical significance. Weak hits provide no evidence, but hints for further analyses. We find genes that may code for homologous proteins by translated BLAST. We reduce false positives by filtering out low-complexity regions. Parsed BLAST results can be integrated into analysis pipelines. Links in the output connect to Entrez, PUBMED, structural, sequence, interaction, and expression databases. This facilitates integration with a wide spectrum of biological knowledge.
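The expected-frequency thresholds mentioned above rest on the Karlin-Altschul statistics, in which the expected number of chance hits is E = K·m·n·exp(-λS) and the bit score is S' = (λS - ln K) / ln 2. The Python sketch below only restates those two formulas; the function names are hypothetical, and the λ and K values in the usage note are placeholders, since the real values depend on the scoring system and are reported by BLAST itself.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalized bit score S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect_value(raw_score, query_len, db_len, lam, k):
    """Expected number of chance hits, E = K * m * n * exp(-lambda * S)."""
    return k * query_len * db_len * math.exp(-lam * raw_score)
```

For example, `expect_value(60, 250, 1_000_000, lam=0.267, k=0.041)` uses placeholder parameters; the same E-value can be recovered from the bit score as `m * n * 2 ** (-bit_score(...))`, which is why bit scores are comparable across scoring systems.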
Huang, Ying; Chen, Shi-Yi; Deng, Feilong
2016-01-01
In silico analysis of DNA sequences is an important area of computational biology in the post-genomic era. Over the past two decades, computational approaches for ab initio prediction of gene structure from genome sequence alone have greatly facilitated our understanding of a variety of biological questions. Although the computational prediction of protein-coding genes is already well established, robustly finding non-coding RNA genes, such as miRNA and lncRNA, remains a challenge. The two main aspects of ab initio gene prediction are the computed values that describe sequence features and the algorithm used to train the discriminant function; various bioinformatic tools employ different combinations of the two. Herein, we briefly review these well-characterized sequence features in eukaryote genomes and their applications to ab initio gene prediction. The main purpose of this article is to provide an overview for beginners who aim to develop related bioinformatic tools.
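Two of the simplest sequence features a beginner might compute before training any discriminant function are GC content and forward-strand open reading frames. The Python sketch below is an illustration under stated assumptions, not a method from the review: the function names, the `min_codons` threshold, and the restriction to the forward strand with ATG starts are all choices made here for brevity.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def gc_content(seq):
    """Fraction of G and C bases in the sequence (0.0 for an empty string)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def find_orfs(seq, min_codons=2):
    """Return sorted (start, end) pairs of forward-strand ORFs.

    An ORF runs from the first ATG in a frame to the next in-frame stop
    codon; `end` is exclusive and includes the stop codon. `min_codons`
    counts the stop codon as well.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)
```

Real gene finders add many more features (codon usage, hexamer frequencies, splice-site signals) and scan the reverse strand too; this sketch only shows the shape of a feature-extraction step.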
Graphene Nanopores for Protein Sequencing.
Wilson, James; Sloman, Leila; He, Zhiren; Aksimentiev, Aleksei
2016-07-19
An inexpensive, reliable method for protein sequencing is essential to unraveling the biological mechanisms governing cellular behavior and disease. Current protein sequencing methods suffer from limitations associated with the size of proteins that can be sequenced, the time, and the cost of the sequencing procedures. Here, we report the results of all-atom molecular dynamics simulations that investigated the feasibility of using graphene nanopores for protein sequencing. We focus our study on the biologically significant phenylalanine-glycine repeat peptides (FG-nups), which are parts of the nuclear pore transport machinery. Surprisingly, we found FG-nups to behave similarly to single stranded DNA: the peptides adhere to graphene and exhibit step-wise translocation when subject to a transmembrane bias or a hydrostatic pressure gradient. Reducing the peptide's charge density or increasing the peptide's hydrophobicity was found to decrease the translocation speed. Yet, unidirectional and stepwise translocation driven by a transmembrane bias was observed even when the ratio of charged to hydrophobic amino acids was as low as 1:8. The nanopore transport of the peptides was found to produce stepwise modulations of the nanopore ionic current correlated with the type of amino acids present in the nanopore, suggesting that protein sequencing by measuring ionic current blockades may be possible.
Sequence diversity of wheat mosaic virus isolates.
Stewart, Lucy R
2016-02-02
Wheat mosaic virus (WMoV), transmitted by eriophyid wheat curl mites (Aceria tosichella) is the causal agent of High Plains disease in wheat and maize. WMoV and other members of the genus Emaravirus evaded thorough molecular characterization for many years due to the experimental challenges of mite transmission and manipulating multisegmented negative sense RNA genomes. Recently, the complete genome sequence of a Nebraska isolate of WMoV revealed eight segments, plus a variant sequence of the nucleocapsid protein-encoding segment. Here, near-complete and partial consensus sequences of five more WMoV isolates are reported and compared to the Nebraska isolate: an Ohio maize isolate (GG1), a Kansas barley isolate (KS7), and three Ohio wheat isolates (H1, K1, W1). Results show two distinct groups of WMoV isolates: Ohio wheat isolate RNA segments had 84% or lower nucleotide sequence identity to the NE isolate, whereas GG1 and KS7 had 98% or higher nucleotide sequence identity to the NE isolate. Knowledge of the sequence variability of WMoV isolates is a step toward understanding virus biology, and potentially explaining observed biological variation. Published by Elsevier B.V.
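Figures such as "84% or lower nucleotide sequence identity" are computed from pairwise alignments. The Python sketch below shows one common definition (matches divided by aligned, non-gap columns); the abstract does not say which definition or software was used, so both the function name and the gap handling are assumptions.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over columns where neither sequence has a gap.

    Both inputs must be rows of the same pairwise alignment, so they
    must have equal length.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == gap or y == gap:
            continue  # gap columns are excluded from the denominator
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0
```

Other definitions divide by the full alignment length or by the shorter sequence length, which can change the reported percentage noticeably for gapped alignments.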
Effect of amino acid substitution on biological activity of cyanophlyctin-β and brevinin-2R
NASA Astrophysics Data System (ADS)
Ghorani-Azam, Adel; Balali-Mood, Mahdi; Aryan, Ehsan; Karimi, Gholamreza; Riahi-Zanjani, Bamdad
2018-04-01
Antimicrobial peptides (AMPs), as ancient immune components, are found in almost all types of living organisms. They are bioactive components with strong antibacterial, antiviral, and anti-tumor properties. In this study, we designed three sequences of antimicrobial peptides to study the effects of structural changes on biological activity compared with the original peptides, cyanophlyctin β and brevinin-2R. For antibacterial activity, two Gram-positive (Staphylococcus aureus and S. epidermidis) and two Gram-negative bacteria (Escherichia coli and Pseudomonas aeruginosa) were assayed. Unlike cyanophlyctin β and brevinin-2R, the synthesized peptides (brevinin-M1, brevinin-M2 and brevinin-M3) showed no considerable antibacterial properties. Hemolytic activity of these peptides was also negligible even at very high concentrations of 2 mg/ml. However, after proteolytic digestion by trypsin, the peptides showed antibacterial activity comparable to their original template sequences. Structural prediction suggested that the motif sequence responsible for antibacterial activity may be re-exposed to the bacterial cell membrane after proteolytic digestion. Also, findings showed that even a small change in primary sequence, and therefore in structure, of peptides may result in a significant alteration in biological activity.
The Human Genome Project: big science transforms biology and medicine.
Hood, Leroy; Rowen, Lee
2013-01-01
The Human Genome Project has transformed biology through its integrated big science approach to deciphering a reference human genome sequence along with the complete sequences of key model organisms. The project exemplifies the power, necessity and success of large, integrated, cross-disciplinary efforts - so-called 'big science' - directed towards complex major objectives. In this article, we discuss the ways in which this ambitious endeavor led to the development of novel technologies and analytical tools, and how it brought the expertise of engineers, computer scientists and mathematicians together with biologists. It established an open approach to data sharing and open-source software, thereby making the data resulting from the project accessible to all. The genome sequences of microbes, plants and animals have revolutionized many fields of science, including microbiology, virology, infectious disease and plant biology. Moreover, deeper knowledge of human sequence variation has begun to alter the practice of medicine. The Human Genome Project has inspired subsequent large-scale data acquisition initiatives such as the International HapMap Project, 1000 Genomes, and The Cancer Genome Atlas, as well as the recently announced Human Brain Project and the emerging Human Proteome Project.
Getting to the root of plant biology: impact of the Arabidopsis genome sequence on root research
Benfey, Philip N.; Bennett, Malcolm; Schiefelbein, John
2010-01-01
Summary Prior to the availability of the genome sequence, the root of Arabidopsis had attracted a small but ardent group of researchers drawn to its accessibility and developmental simplicity. Roots are easily observed when grown on the surface of nutrient agar media, facilitating analysis of responses to stimuli such as gravity and touch. Developmental biologists were attracted to the simple radial organization of primary root tissues, which form a series of concentric cylinders around the central vascular tissue. Equally attractive was the mode of propagation, with stem cells at the tip giving rise to progeny that were confined to cell files. These properties of root development reduced the normal four-dimensional problem of development (three spatial dimensions and time) to a two-dimensional problem, with cell type on the radial axis and developmental time along the longitudinal axis. The availability of the complete Arabidopsis genome sequence has dramatically accelerated traditional genetic research on root biology, and has also enabled entirely new experimental strategies to be applied. Here we review examples of the ways in which availability of the Arabidopsis genome sequence has enhanced progress in understanding root biology. PMID:20409273
The Human Genome Project: big science transforms biology and medicine
2013-01-01
The Human Genome Project has transformed biology through its integrated big science approach to deciphering a reference human genome sequence along with the complete sequences of key model organisms. The project exemplifies the power, necessity and success of large, integrated, cross-disciplinary efforts - so-called ‘big science’ - directed towards complex major objectives. In this article, we discuss the ways in which this ambitious endeavor led to the development of novel technologies and analytical tools, and how it brought the expertise of engineers, computer scientists and mathematicians together with biologists. It established an open approach to data sharing and open-source software, thereby making the data resulting from the project accessible to all. The genome sequences of microbes, plants and animals have revolutionized many fields of science, including microbiology, virology, infectious disease and plant biology. Moreover, deeper knowledge of human sequence variation has begun to alter the practice of medicine. The Human Genome Project has inspired subsequent large-scale data acquisition initiatives such as the International HapMap Project, 1000 Genomes, and The Cancer Genome Atlas, as well as the recently announced Human Brain Project and the emerging Human Proteome Project. PMID:24040834
Object-oriented parsing of biological databases with Python.
Ramu, C; Gemünd, C; Gibson, T J
2000-07-01
While database activities in the biological area are increasing rapidly, rather little is done in the area of parsing them in a simple and object-oriented way. We present here an elegant, simple yet powerful way of parsing biological flat-file databases. We have taken EMBL, SWISSPROT and GENBANK as examples. EMBL and SWISS-PROT do not differ much in the format structure. GENBANK has a very different format structure than EMBL and SWISS-PROT. Extracting the desired fields in an entry (for example a sub-sequence with an associated feature) for later analysis is a constant need in the biological sequence-analysis community: this is illustrated with tools to make new splice-site databases. The interface to the parser is abstract in the sense that the access to all the databases is independent from their different formats, since parsing instructions are hidden.
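The idea of hiding format details behind an abstract entry object can be sketched in a few lines of modern Python. This is not the authors' parser: the class names `Entry` and `EmblLikeParser`, the helper `parse_text`, and the simplified handling of an EMBL-style layout (two-letter line codes, `SQ` starting the sequence block, `//` ending an entry) are assumptions for illustration.

```python
import io
from dataclasses import dataclass, field


@dataclass
class Entry:
    """One database entry: annotation lines keyed by line code, plus sequence."""
    fields: dict = field(default_factory=dict)
    sequence: str = ""

    def get(self, code):
        """Join all lines carrying the given two-letter code into one string."""
        return " ".join(self.fields.get(code, []))


class EmblLikeParser:
    """Iterate over entries of an EMBL-style flat file (simplified sketch)."""

    def __init__(self, handle):
        self.handle = handle

    def __iter__(self):
        entry, in_seq = Entry(), False
        for line in self.handle:
            line = line.rstrip("\n")
            if line.startswith("//"):  # entry terminator
                yield entry
                entry, in_seq = Entry(), False
            elif in_seq:  # sequence block: keep letters, drop spaces and counts
                entry.sequence += "".join(c for c in line if c.isalpha()).upper()
            elif line[:2] == "SQ":  # sequence header line
                in_seq = True
            elif line[:2].strip() and line[:2] != "XX":  # skip blanks and spacers
                entry.fields.setdefault(line[:2], []).append(line[5:].strip())


def parse_text(text):
    """Convenience wrapper: parse entries from a string."""
    return list(EmblLikeParser(io.StringIO(text)))
```

A GenBank reader could expose the same `Entry` interface while using entirely different parsing rules, which is the format independence the abstract describes.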
The Intersection of Physics and Biology
Liphardt, Jan
2017-12-22
In April 1953, Watson and Crick largely defined the program of 20th century biology: obtaining the blueprint of life encoded in the DNA. Fifty years later, in 2003, the sequencing of the human genome was completed. Like any major scientific breakthrough, the sequencing of the human genome raised many more questions than it answered. I'll brief you on some of the big open problems in cell and developmental biology, and I'll explain why approaches, tools, and ideas from the physical sciences are currently reshaping biological research. Super-resolution light microscopies are revealing the intricate spatial organization of cells, single-molecule methods show how molecular machines function, and new probes are clarifying the role of mechanical forces in cell and tissue function. At the same time, Physics stands to gain beautiful new problems in soft condensed matter, quantum mechanics, and non-equilibrium thermodynamics.
Measuring the Evolutionary Rewiring of Biological Networks
Shou, Chong; Bhardwaj, Nitin; Lam, Hugo Y. K.; Yan, Koon-Kiu; Kim, Philip M.; Snyder, Michael; Gerstein, Mark B.
2011-01-01
We have accumulated a large amount of biological network data and expect even more to come. Soon, we anticipate being able to compare many different biological networks as we commonly do for molecular sequences. It has long been believed that many of these networks change, or “rewire”, at different rates. It is therefore important to develop a framework to quantify the differences between networks in a unified fashion. We developed such a formalism based on analogy to simple models of sequence evolution, and used it to conduct a systematic study of network rewiring on all the currently available biological networks. We found that, similar to sequences, biological networks show a decreased rate of change at large time divergences, because of saturation in potential substitutions. However, different types of biological networks consistently rewire at different rates. Using comparative genomics and proteomics data, we found a consistent ordering of the rewiring rates: transcription regulatory, phosphorylation regulatory, genetic interaction, miRNA regulatory, protein interaction, and metabolic pathway network, from fast to slow. This ordering was found in all comparisons we did of matched networks between organisms. To gain further intuition on network rewiring, we compared our observed rewirings with those obtained from simulation. We also investigated how readily our formalism could be mapped to other network contexts; in particular, we showed how it could be applied to analyze changes in a range of “commonplace” networks such as family trees, co-authorships and linux-kernel function dependencies. PMID:21253555
Mohorianu, Irina; Stocks, Matthew Benedict; Wood, John; Dalmay, Tamas; Moulton, Vincent
2013-07-01
Small RNAs (sRNAs) are 20-25 nt non-coding RNAs that act as guides for the highly sequence-specific regulatory mechanism known as RNA silencing. Due to the recent increase in sequencing depth, a highly complex and diverse population of sRNAs in both plants and animals has been revealed. However, the exponential increase in sequencing data has also made the identification of individual sRNA transcripts corresponding to biological units (sRNA loci) more challenging when based exclusively on the genomic location of the constituent sRNAs, hindering existing approaches to identify sRNA loci. To infer the location of significant biological units, we propose an approach for sRNA loci detection called CoLIde (Co-expression based sRNA Loci Identification) that combines genomic location with the analysis of other information such as variation in expression levels (expression pattern) and size class distribution. For CoLIde, we define a locus as a union of regions sharing the same pattern and located in close proximity on the genome. Biological relevance, detected through the analysis of size class distribution, is also calculated for each locus. CoLIde can be applied on ordered (e.g., time-dependent) or un-ordered (e.g., organ, mutant) series of samples both with or without biological/technical replicates. The method reliably identifies known types of loci and shows improved performance on sequencing data from both plants (e.g., A. thaliana, S. lycopersicum) and animals (e.g., D. melanogaster) when compared with existing locus detection techniques. CoLIde is available for use within the UEA Small RNA Workbench which can be downloaded from: http://srna-workbench.cmp.uea.ac.uk.
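The locus definition given above (a union of nearby regions sharing the same expression pattern) can be sketched as a simple interval merge. This is not the CoLIde implementation: the function name, the pattern strings, and the `max_gap` proximity threshold are hypothetical, and the statistical significance step is omitted entirely.

```python
def merge_loci(regions, max_gap=50):
    """Merge neighbouring regions that share an expression pattern.

    `regions` is an iterable of (chrom, start, end, pattern) tuples.
    A region joins the preceding locus when it lies on the same
    chromosome, carries the same pattern, and starts within `max_gap`
    bases of that locus's end. Only the immediately preceding locus is
    considered, so a region with a different pattern breaks the run.
    """
    loci = []
    for chrom, start, end, pattern in sorted(regions):
        last = loci[-1] if loci else None
        if (last is not None and last[0] == chrom and last[3] == pattern
                and start - last[2] <= max_gap):
            last[2] = max(last[2], end)  # extend the current locus
        else:
            loci.append([chrom, start, end, pattern])
    return [tuple(locus) for locus in loci]
```

In this sketch a pattern is just an opaque label per region; in practice it would be derived from expression changes across the ordered or unordered sample series.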
Noncoding sequence classification based on wavelet transform analysis: part II
NASA Astrophysics Data System (ADS)
Paredes, O.; Strojnik, M.; Romo-Vázquez, R.; Vélez-Pérez, H.; Ranta, R.; Garcia-Torales, G.; Scholl, M. K.; Morales, J. A.
2017-09-01
DNA sequences in human genome can be divided into the coding and noncoding ones. We hypothesize that the characteristic periodicities of the noncoding sequences are related to their function. We describe the procedure to identify these characteristic periodicities using the wavelet analysis. Our results show that three groups of noncoding sequences, each one with different biological function, may be differentiated by their wavelet coefficients within specific frequency range.
Fungal genome sequencing: basic biology to biotechnology.
Sharma, Krishna Kant
2016-08-01
The genome sequences provide a first glimpse into the genomic basis of the biological diversity of filamentous fungi and yeast. The genome sequence of the budding yeast, Saccharomyces cerevisiae, with a small genome size, unicellular growth, and rich history of genetic and molecular analyses was a milestone of early genomics in the 1990s. The subsequent completion of the genomes of the fission yeast Schizosaccharomyces pombe and the genetic model Neurospora crassa initiated a revolution in the genomics of the fungal kingdom. Over time, a substantial number of fungal genomes have been sequenced and publicly released, representing the widest sampling of genomes from any eukaryotic kingdom. An ambitious genome-sequencing program provides a wealth of data on metabolic diversity within the fungal kingdom, thereby enhancing research into medical science, agriculture science, ecology, bioremediation, bioenergy, and the biotechnology industry. Fungal genomics has high potential to positively affect human health, environmental health, and the planet's stored energy. With a significant increase in sequenced fungal genomes, the known diversity of genes encoding organic acids, antibiotics, enzymes, and their pathways has increased exponentially. Currently, over a hundred fungal genome sequences are publicly available; however, no inclusive review has been published. This review is an initiative to address the significance of the fungal genome-sequencing program and provides the road map for basic and applied research.
Geisler, Christoph; Jarvis, Donald L
2016-07-01
Spodoptera frugiperda (Sf) cell lines are used to produce several biologicals for human and veterinary use. Recently, it was discovered that all tested Sf cell lines are persistently infected with Sf-rhabdovirus, a novel rhabdovirus. As part of an effort to search for other adventitious viruses, we searched the Sf cell genome and transcriptome for sequences related to Sf-rhabdovirus. To our surprise, we found intact Sf-rhabdovirus N- and P-like ORFs, and partial Sf-rhabdovirus G- and L-like ORFs. The transcribed and genomic sequences matched, indicating the transcripts were derived from the genomic sequences. These appear to be endogenous viral elements (EVEs), which result from the integration of partial viral genetic material into the host cell genome. It is theoretically impossible for the Sf-rhabdovirus-like EVEs to produce infectious virus particles as 1) they are disseminated across 4 genomic loci, 2) the G and L ORFs are incomplete, and 3) the M ORF is missing. Our finding of transcribed virus-like sequences in Sf cells underscores that MPS-based searches for adventitious viruses in cell substrates used to manufacture biologics should take into account both genomic and transcribed sequences to facilitate the identification of transcribed EVEs, and to avoid false positive detection of replication-competent adventitious viruses. Copyright © 2016 International Alliance for Biological Standardization. Published by Elsevier Ltd. All rights reserved.
NASA Astrophysics Data System (ADS)
El-Assaad, Atlal; Dawy, Zaher; Nemer, Georges; Kobeissy, Firas
2017-01-01
The crucial biological role of proteases has become evident with the development of the degradomics discipline, which is concerned with determining protease/substrate pairs that yield breakdown products (BDPs) usable as putative biomarkers of differing biological and clinical significance. In the field of cancer biology, matrix metalloproteinases (MMPs) have been shown to generate protein BDPs that are indicative of malignant growth, while in the field of neural injury, the calpain-2 and caspase-3 proteases generate BDP fragments that are indicative of different neural cell death mechanisms in different injury scenarios. Advanced proteomic techniques have shown remarkable progress in identifying these BDPs experimentally. In this work, we present a bioinformatics-based prediction method that identifies protease-associated BDPs with high precision and efficiency. The method utilizes state-of-the-art sequence matching and alignment algorithms. It starts by locating consensus sequence occurrences and their variants in any set of protein substrates, generating all fragments resulting from cleavage. The method requires O(mn) space and O(Nmn) time, where N, m, and n are the number of protein sequences, the length of the consensus sequence, and the length per protein sequence, respectively. Finally, the proposed methodology is validated against βII-spectrin protein, a validated brain injury biomarker.
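The first step described above, locating consensus occurrences and cutting the substrate there, can be sketched in Python. Note the substitution: this sketch uses a regular expression in place of the alignment algorithms the abstract refers to, and it returns only the fragments of a complete digestion. The function name, the example pattern `DE[VT]D` (a hypothetical variant set around the caspase-3 DEVD motif), and the `cut_offset` parameter are assumptions.

```python
import re


def cleavage_fragments(protein, consensus, cut_offset):
    """Split a protein at every occurrence of a consensus pattern.

    `consensus` is a regular expression; `cut_offset` is the number of
    residues from the start of a match after which the bond is cut.
    A lookahead is used so overlapping occurrences are all reported.
    """
    pattern = re.compile(f"(?=({consensus}))")
    cuts = sorted({m.start() + cut_offset for m in pattern.finditer(protein)})
    cuts = [c for c in cuts if 0 < c < len(protein)]  # ignore cuts at the ends
    bounds = [0] + cuts + [len(protein)]
    return [protein[a:b] for a, b in zip(bounds, bounds[1:])]
```

With `cut_offset=4` the cut falls directly after the four-residue motif, matching cleavage after the final aspartate of DEVD; partial-digestion products would require enumerating subsets of the cut sites.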
Pleurochrysome: A Web Database of Pleurochrysis Transcripts and Orthologs Among Heterogeneous Algae
Fujiwara, Shoko; Takatsuka, Yukiko; Hirokawa, Yasutaka; Tsuzuki, Mikio; Takano, Tomoyuki; Kobayashi, Masaaki; Suda, Kunihiro; Asamizu, Erika; Yokoyama, Koji; Shibata, Daisuke; Tabata, Satoshi; Yano, Kentaro
2016-01-01
Pleurochrysis is a coccolithophorid genus, which belongs to the Coccolithales in the Haptophyta. The genus has been used extensively for biological research, together with Emiliania in the Isochrysidales, to understand distinctive features between the two coccolithophorid-including orders. However, molecular biological research on Pleurochrysis such as elucidation of the molecular mechanism behind coccolith formation has not made great progress at least in part because of lack of comprehensive gene information. To provide such information to the research community, we built an open web database, the Pleurochrysome (http://bioinf.mind.meiji.ac.jp/phapt/), which currently stores 9,023 unique gene sequences (designated as UNIGENEs) assembled from expressed sequence tag sequences of P. haptonemofera as core information. The UNIGENEs were annotated with gene sequences sharing significant homology, conserved domains, Gene Ontology, KEGG Orthology, predicted subcellular localization, open reading frames and orthologous relationship with genes of 10 other algal species, a cyanobacterium and the yeast Saccharomyces cerevisiae. This sequence and annotation information can be easily accessed via several search functions. Besides fundamental functions such as BLAST and keyword searches, this database also offers search functions to explore orthologous genes in the 12 organisms and to seek novel genes. The Pleurochrysome will promote molecular biological and phylogenetic research on coccolithophorids and other haptophytes by helping scientists mine data from the primary transcriptome of P. haptonemofera. PMID:26746174
Impact of cultivation on characterisation of species composition of soil bacterial communities.
McCaig, A E.; Grayston, S J.; Prosser, J I.; Glover, L A.
2001-03-01
The species composition of culturable bacteria in Scottish grassland soils was investigated using a combination of Biolog and 16S rDNA analysis for characterisation of isolates. The inclusion of a molecular approach allowed direct comparison of sequences from culturable bacteria with sequences obtained during analysis of DNA extracted directly from the same soil samples. Bacterial strains were isolated on Pseudomonas isolation agar (PIA), a selective medium, and on tryptone soya agar (TSA), a general laboratory medium. In total, 12 and 21 morphologically different bacterial cultures were isolated on PIA and TSA, respectively. Biolog and sequencing placed PIA isolates in the same taxonomic groups, the majority of cultures belonging to the Pseudomonas (sensu stricto) group. However, analysis of 16S rDNA sequences proved more efficient than Biolog for characterising TSA isolates due to limitations of the Microlog database for identifying environmental bacteria. In general, 16S rDNA sequences from TSA isolates showed high similarities to cultured species represented in sequence databases, although TSA-8 showed only 92.5% similarity to the nearest relative, Bacillus insolitus. In general, there was very little overlap between the culturable and uncultured bacterial communities, although two sequences, PIA-2 and TSA-13, showed >99% similarity to soil clones. A cloning step was included prior to sequence analysis of two isolates, TSA-5 and TSA-14, and analysis of several clones confirmed that these cultures comprised at least four and three sequence types, respectively. All isolate clones were most closely related to uncultured bacteria, with clone TSA-5.1 showing 99.8% similarity to a sequence amplified directly from the same soil sample. Interestingly, one clone, TSA-5.4, clustered within a novel group comprising only uncultured sequences. 
This group, which is associated with the novel, deep-branching Acidobacterium capsulatum lineage, also included clones isolated during direct analysis of the same soil and from a wide range of other sample types studied elsewhere. The study demonstrates the value of fine-scale molecular analysis for identification of laboratory isolates and indicates the culturability of approximately 1% of the total population but under a restricted range of media and cultivation conditions.
Gene Unprediction with Spurio: A tool to identify spurious protein sequences.
Höps, Wolfram; Jeffryes, Matt; Bateman, Alex
2018-01-01
We now have access to the sequences of tens of millions of proteins. These protein sequences are essential for modern molecular biology and computational biology. The vast majority of protein sequences are derived from gene prediction tools and have no experimental supporting evidence for their translation. Despite the increasing accuracy of gene prediction tools there likely exists a large number of spurious protein predictions in the sequence databases. We have developed the Spurio tool to help identify spurious protein predictions in prokaryotes. Spurio searches the query protein sequence against a prokaryotic nucleotide database using tblastn and identifies homologous sequences. The tblastn matches are used to score the query sequence's likelihood of being a spurious protein prediction using a Gaussian process model. The most informative feature is the appearance of stop codons within the presumed translation of homologous DNA sequences. Benchmarking shows that the Spurio tool is able to distinguish spurious from true proteins. However, transposon proteins are prone to be predicted as spurious because of the frequency of degraded homologs found in the DNA sequence databases. Our initial experiments suggest that less than 1% of the proteins in the UniProtKB sequence database are likely to be spurious and that Spurio is able to identify over 60 times more spurious proteins than the AntiFam resource. The Spurio software and source code is available under an MIT license at the following URL: https://bitbucket.org/bateman-group/spurio.
PROFESS: a PROtein Function, Evolution, Structure and Sequence database
Triplet, Thomas; Shortridge, Matthew D.; Griep, Mark A.; Stark, Jaime L.; Powers, Robert; Revesz, Peter
2010-01-01
The proliferation of biological databases and the easy access enabled by the Internet is having a beneficial impact on biological sciences and transforming the way research is conducted. There are ∼1100 molecular biology databases dispersed throughout the Internet. To assist in the functional, structural and evolutionary analysis of the abundant number of novel proteins continually identified from whole-genome sequencing, we introduce the PROFESS (PROtein Function, Evolution, Structure and Sequence) database. Our database is designed to be versatile and expandable and will not confine analysis to a pre-existing set of data relationships. A fundamental component of this approach is the development of an intuitive query system that incorporates a variety of similarity functions capable of generating data relationships not conceived during the creation of the database. The utility of PROFESS is demonstrated by the analysis of the structural drift of homologous proteins and the identification of potential pancreatic cancer therapeutic targets based on the observation of protein–protein interaction networks. Database URL: http://cse.unl.edu/∼profess/ PMID:20624718
Mazandu, Gaston K; Mulder, Nicola J
2012-07-01
Despite ever-increasing amounts of sequence and functional genomics data, there is still a deficiency of functional annotation for many newly sequenced proteins. For Mycobacterium tuberculosis (MTB), more than half of its genome is still uncharacterized, which hampers the search for new drug targets within the bacterial pathogen and limits our understanding of its pathogenicity. As for many other genomes, the annotations of proteins in the MTB proteome were generally inferred from sequence homology, which is effective but its applicability has limitations. We have carried out large-scale biological data integration to produce an MTB protein functional interaction network. Protein functional relationships were extracted from the Search Tool for the Retrieval of Interacting Genes/Proteins (STRING) database, and additional functional interactions from microarray, sequence and protein signature data. The confidence level of protein relationships in the additional functional interaction data was evaluated using a dynamic data-driven scoring system. This functional network has been used to predict functions of uncharacterized proteins using Gene Ontology (GO) terms, and the semantic similarity between these terms measured using a state-of-the-art GO similarity metric. To achieve better trade-off between improvement of quality, genomic coverage and scalability, this prediction is done by observing the key principles driving the biological organization of the functional network. This study yields a new functionally characterized MTB strain CDC1551 proteome, consisting of 3804 and 3698 proteins out of 4195 with annotations in terms of the biological process and molecular function ontologies, respectively. These data can contribute to research into the development of effective anti-tubercular drugs with novel biological mechanisms of action. Copyright © 2011 Elsevier B.V. All rights reserved.
2012-01-01
Background RNA sequencing (RNA-Seq) has emerged as a powerful approach for the detection of differential gene expression with both high-throughput and high resolution capabilities possible depending upon the experimental design chosen. Multiplex experimental designs are now readily available, these can be utilised to increase the numbers of samples or replicates profiled at the cost of decreased sequencing depth generated per sample. These strategies impact on the power of the approach to accurately identify differential expression. This study presents a detailed analysis of the power to detect differential expression in a range of scenarios including simulated null and differential expression distributions with varying numbers of biological or technical replicates, sequencing depths and analysis methods. Results Differential and non-differential expression datasets were simulated using a combination of negative binomial and exponential distributions derived from real RNA-Seq data. These datasets were used to evaluate the performance of three commonly used differential expression analysis algorithms and to quantify the changes in power with respect to true and false positive rates when simulating variations in sequencing depth, biological replication and multiplex experimental design choices. Conclusions This work quantitatively explores comparisons between contemporary analysis tools and experimental design choices for the detection of differential expression using RNA-Seq. We found that the DESeq algorithm performs more conservatively than edgeR and NBPSeq. With regard to testing of various experimental designs, this work strongly suggests that greater power is gained through the use of biological replicates relative to library (technical) replicates and sequencing depth. Strikingly, sequencing depth could be reduced as low as 15% without substantial impacts on false positive or true positive rates. PMID:22985019
Robles, José A; Qureshi, Sumaira E; Stephen, Stuart J; Wilson, Susan R; Burden, Conrad J; Taylor, Jennifer M
2012-09-17
RNA sequencing (RNA-Seq) has emerged as a powerful approach for the detection of differential gene expression with both high-throughput and high resolution capabilities possible depending upon the experimental design chosen. Multiplex experimental designs are now readily available, these can be utilised to increase the numbers of samples or replicates profiled at the cost of decreased sequencing depth generated per sample. These strategies impact on the power of the approach to accurately identify differential expression. This study presents a detailed analysis of the power to detect differential expression in a range of scenarios including simulated null and differential expression distributions with varying numbers of biological or technical replicates, sequencing depths and analysis methods. Differential and non-differential expression datasets were simulated using a combination of negative binomial and exponential distributions derived from real RNA-Seq data. These datasets were used to evaluate the performance of three commonly used differential expression analysis algorithms and to quantify the changes in power with respect to true and false positive rates when simulating variations in sequencing depth, biological replication and multiplex experimental design choices. This work quantitatively explores comparisons between contemporary analysis tools and experimental design choices for the detection of differential expression using RNA-Seq. We found that the DESeq algorithm performs more conservatively than edgeR and NBPSeq. With regard to testing of various experimental designs, this work strongly suggests that greater power is gained through the use of biological replicates relative to library (technical) replicates and sequencing depth. Strikingly, sequencing depth could be reduced as low as 15% without substantial impacts on false positive or true positive rates.
Update on Genomic Databases and Resources at the National Center for Biotechnology Information.
Tatusova, Tatiana
2016-01-01
The National Center for Biotechnology Information (NCBI), as a primary public repository of genomic sequence data, collects and maintains enormous amounts of heterogeneous data. Data for genomes, genes, gene expressions, gene variation, gene families, proteins, and protein domains are integrated with the analytical, search, and retrieval resources through the NCBI website. The text-based search and retrieval system provides a fast and easy way to navigate across diverse biological databases. Comparative genome analysis tools lead to further understanding of evolutionary processes, quickening the pace of discovery. Recent technological innovations have ignited an explosion in genome sequencing that has fundamentally changed our understanding of the biology of living organisms. This huge increase in DNA sequence data presents new challenges for the information management system and the visualization tools. New strategies have been designed to bring order to this genome sequence shockwave and improve the usability of associated data.
Fink, J S; Verhave, M; Kasper, S; Tsukada, T; Mandel, G; Goodman, R H
1988-01-01
cAMP-regulated transcription of the human vasoactive intestinal peptide gene is dependent upon a 17-base-pair DNA element located 70 base pairs upstream from the transcriptional initiation site. This element is similar to sequences in other genes known to be regulated by cAMP and to sequences in several viral enhancers. We have demonstrated that the vasoactive intestinal peptide regulatory element is an enhancer that depends upon the integrity of two CGTCA sequence motifs for biological activity. Mutations in either of the CGTCA motifs diminish the ability of the element to respond to cAMP. Enhancers containing the CGTCA motif from the somatostatin and adenovirus genes compete for binding of nuclear proteins from C6 glioma and PC12 cells to the vasoactive intestinal peptide enhancer, suggesting that CGTCA-containing enhancers interact with similar transacting factors. PMID:2842787
Schoof, Heiko; Zaccaria, Paolo; Gundlach, Heidrun; Lemcke, Kai; Rudd, Stephen; Kolesov, Grigory; Arnold, Roland; Mewes, H. W.; Mayer, Klaus F. X.
2002-01-01
Arabidopsis thaliana is the first plant for which the complete genome has been sequenced and published. Annotation of complex eukaryotic genomes requires more than the assignment of genetic elements to the sequence. Besides completing the list of genes, we need to discover their cellular roles, their regulation and their interactions in order to understand the workings of the whole plant. The MIPS Arabidopsis thaliana Database (MAtDB; http://mips.gsf.de/proj/thal/db) started out as a repository for genome sequence data in the European Scientists Sequencing Arabidopsis (ESSA) project and the Arabidopsis Genome Initiative. Our aim is to transform MAtDB into an integrated biological knowledge resource by integrating diverse data, tools, query and visualization capabilities and by creating a comprehensive resource for Arabidopsis as a reference model for other species, including crop plants. PMID:11752263
Integrated Chemistry and Biology for First-Year College Students
ERIC Educational Resources Information Center
Abdella, Beth R. J.; Walczak, Mary M.; Kandl, Kim A.; Schwinefus, Jeffrey J.
2011-01-01
A three-course sequence for first-year students that integrates beginning concepts in biology and chemistry has been designed. The first two courses that emphasize chemistry and its capacity to inform biological applications are described here. The content of the first course moves from small to large particles with an emphasis on membrane…
Synthetic Biology: Knowledge Accessed by Everyone (Open Sources)
ERIC Educational Resources Information Center
Sánchez Reyes, Patricia Margarita
2016-01-01
Using the principles of biology, along with engineering and with the help of computers, scientists manage to copy DNA sequences from nature and use them to create new organisms. DNA is created through engineering and computer science, managing to create life inside a laboratory. We cannot dismiss the role that synthetic biology could lead in…
Counting Patterns in Degenerated Sequences
NASA Astrophysics Data System (ADS)
Nuel, Grégory
Biological sequences like DNA or proteins are always obtained through a sequencing process which might produce some uncertainty. As a result, such sequences are usually written in a degenerated alphabet where some symbols may correspond to several possible letters (ex: IUPAC DNA alphabet). When counting patterns in such degenerated sequences, the question that naturally arises is: how to deal with degenerated positions? Since most (usually 99%) of the positions are not degenerated, it is considered harmless to discard the degenerated positions in order to get an observation, but the exact consequences of such a practice are unclear. In this paper, we introduce a rigorous method to take into account the uncertainty of sequencing for biological sequences (DNA, proteins). We first introduce a Forward-Backward approach to compute the marginal distribution of the constrained sequence and use it both to perform an Expectation-Maximization estimation of parameters and to derive a heterogeneous Markov distribution for the constrained sequence. This distribution is hence used along with known DFA-based pattern approaches to obtain the exact distribution of the pattern count under the constraints. As an illustration, we consider an EST dataset from the EMBL database. Despite the fact that only 1% of the positions in this dataset are degenerated, we show that not taking into account these positions might lead to erroneous observations, further proving the interest of our approach.
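To make the notion of a degenerated position concrete, the following Python sketch computes a naive expected pattern count over an IUPAC-coded DNA sequence. It is only a baseline and not the Forward-Backward and Markov machinery described in the abstract: it assumes, purely for illustration, that each degenerated symbol is uniformly distributed over its possible letters and that positions are independent. The function name `expected_count` is hypothetical.

```python
# Standard IUPAC nucleotide codes and the letters each one may stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expected_count(seq: str, pattern: str) -> float:
    """Expected number of occurrences of `pattern` (plain ACGT) in `seq`.

    Assumption for this sketch: a degenerated symbol is uniform over its
    possible letters, independently of its neighbours.
    """
    total = 0.0
    for i in range(len(seq) - len(pattern) + 1):
        prob = 1.0
        for symbol, letter in zip(seq[i:i + len(pattern)], pattern):
            options = IUPAC[symbol]
            if letter not in options:
                prob = 0.0
                break
            prob /= len(options)
        total += prob
    return total


print(expected_count("ACGT", "CG"))  # no degenerated position
print(expected_count("ANGT", "CG"))  # N could be the C the pattern needs
```

Discarding the degenerated position in the second call would report zero occurrences, whereas the uniform assumption assigns it a non-zero weight; the abstract's point is that a proper model of this uncertainty can change the observed counts.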
Arrays of nucleic acid probes on biological chips
Chee, Mark; Cronin, Maureen T.; Fodor, Stephen P. A.; Huang, Xiaohua X.; Hubbell, Earl A.; Lipshutz, Robert J.; Lobban, Peter E.; Morris, MacDonald S.; Sheldon, Edward L.
1998-11-17
DNA chips containing arrays of oligonucleotide probes can be used to determine whether a target nucleic acid has a nucleotide sequence identical to or different from a specific reference sequence. The array of probes comprises probes exactly complementary to the reference sequence, as well as probes that differ by one or more bases from the exactly complementary probes.
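The probe layout described above can be sketched in Python as follows. This is an illustrative reading of the abstract, not the patented design: the function names, the choice of placing the substitution at the central base, and the toy reference sequence are assumptions.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def probe_set(reference: str, length: int, position: int) -> tuple[str, list[str]]:
    """Build probes for the reference window starting at `position`.

    Returns the exactly complementary probe and, as an assumed design,
    the three probes that differ from it at the central base only.
    """
    target = reference[position:position + length]
    perfect = reverse_complement(target)
    mid = length // 2
    variants = [
        perfect[:mid] + base + perfect[mid + 1:]
        for base in "ACGT"
        if base != perfect[mid]
    ]
    return perfect, variants


perfect, variants = probe_set("ACGTACG", length=5, position=0)
print(perfect, variants)
```

A target identical to the reference would be expected to hybridize best to the perfect-match probe, while a target carrying a substitution at the interrogated base would favour one of the variants.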
DOE Office of Scientific and Technical Information (OSTI.GOV)
Cang, Zixuan; Mu, Lin; Wu, Kedi
Here, protein function and dynamics are closely related to its sequence and structure. However, prediction of protein function and dynamics from its sequence and structure is still a fundamental challenge in molecular biology. Protein classification, which is typically done through measuring the similarity between proteins based on protein sequence or physical information, serves as a crucial step toward the understanding of protein function and dynamics.
Non-biological synthetic spike-in controls and the AMPtk software pipeline improve mycobiome data
Jonathan M. Palmer; Michelle A. Jusino; Mark T. Banik; Daniel L. Lindner
2018-01-01
High-throughput amplicon sequencing (HTAS) of conserved DNA regions is a powerful technique to characterize microbial communities. Recently, spike-in mock communities have been used to measure accuracy of sequencing platforms and data analysis pipelines. To assess the ability of sequencing platforms and data processing pipelines using fungal internal transcribed spacer...
Potentials of single-cell biology in identification and validation of disease biomarkers.
Niu, Furong; Wang, Diane C; Lu, Jiapei; Wu, Wei; Wang, Xiangdong
2016-09-01
Single-cell biology is considered a new approach to identify and validate disease-specific biomarkers. However, the concern raised by clinicians is how to apply single-cell measurements for clinical practice, translate the message of single-cell systems biology into clinical phenotype or explain alterations of single-cell gene sequencing and function in patient response to therapies. This study is to address the importance and necessity of single-cell gene sequencing in the identification and development of disease-specific biomarkers, the definition and significance of single-cell biology and single-cell systems biology in the understanding of single-cell full picture, the development and establishment of whole-cell models in the validation of targeted biological function and the figure and meaning of single-molecule imaging in single cell to trace intra-single-cell molecule expression, signal, interaction and location. We headline the important role of single-cell biology in the discovery and development of disease-specific biomarkers with a special emphasis on understanding single-cell biological functions, e.g. mechanical phenotypes, single-cell biology, heterogeneity and organization of genome function. We have reason to believe that such multi-dimensional, multi-layer, multi-crossing and stereoscopic single-cell biology definitely benefits the discovery and development of disease-specific biomarkers. © 2016 The Authors. Journal of Cellular and Molecular Medicine published by John Wiley & Sons Ltd and Foundation for Cellular and Molecular Medicine.
Wu, Chung Wah; Evans, Jared M; Huang, Shengbing; Mahoney, Douglas W; Dukek, Brian A; Taylor, William R; Yab, Tracy C; Smyrk, Thomas C; Jen, Jin; Kisiel, John B; Ahlquist, David A
2018-05-25
MicroRNA (miRNA) profiling is an important step in studying biological associations and identifying marker candidates. miRNA exists in isoforms, called isomiRs, which may exhibit distinct properties. With conventional profiling methods, limitations in assay and analysis platforms may compromise isomiR interrogation. We introduce a comprehensive approach to sequence-oriented isomiR annotation (CASMIR) to allow unbiased identification of global isomiRs from small RNA sequencing data. In this approach, small RNA reads are maintained as independent sequences instead of being summarized under miRNA names. IsomiR features are identified through step-wise local alignment against canonical forms and precursor sequences. Through customizing the reference database, CASMIR is applicable to isomiR annotation across species. To demonstrate its application, we investigated isomiR profiles in normal and neoplastic human colorectal epithelia. We also ran miRDeep2, a popular miRNA analysis algorithm to validate isomiRs annotated by CASMIR. With CASMIR, specific and biologically relevant isomiR patterns could be identified. We note that specific isomiRs are often more abundant than their canonical forms. We identify isomiRs that are commonly up-regulated in both colorectal cancer and advanced adenoma, and illustrate advantages in targeting isomiRs as potential biomarkers over canonical forms. Studying miRNAs at the isomiR level could reveal new insight into miRNA biology and inform assay design for specific isomiRs. CASMIR facilitates comprehensive annotation of isomiR features in small RNA sequencing data for isomiR profiling and differential expression analysis.
The sequence of sequencers: The history of sequencing DNA
Heather, James M.; Chain, Benjamin
2016-01-01
Determining the order of nucleic acid residues in biological samples is an integral component of a wide variety of research applications. Over the last fifty years large numbers of researchers have applied themselves to the production of techniques and technologies to facilitate this feat, sequencing DNA and RNA molecules. This time-scale has witnessed tremendous changes, moving from sequencing short oligonucleotides to millions of bases, from struggling towards the deduction of the coding sequence of a single gene to rapid and widely available whole genome sequencing. This article traverses those years, iterating through the different generations of sequencing technology, highlighting some of the key discoveries, researchers, and sequences along the way. PMID:26554401
Toyoda, Tetsuro
2011-01-01
Synthetic biology requires both engineering efficiency and compliance with safety guidelines and ethics. Focusing on the rational construction of biological systems based on engineering principles, synthetic biology depends on a genome-design platform to explore the combinations of multiple biological components or BIO bricks for quickly producing innovative devices. This chapter explains the differences among various platform models and details a methodology for promoting open innovation within the scope of the statutory exemption of patent laws. The detailed platform adopts a centralized evaluation model (CEM), computer-aided design (CAD) bricks, and a freemium model. It is also important for the platform to support the legal aspects of copyrights as well as patent and safety guidelines because intellectual work including DNA sequences designed rationally by human intelligence is basically copyrightable. An informational platform with high traceability, transparency, auditability, and security is required for copyright proof, safety compliance, and incentive management for open innovation in synthetic biology. GenoCon, which we have organized and explained here, is a competition-styled, open-innovation method involving worldwide participants from scientific, commercial, and educational communities that aims to improve the designs of genomic sequences that confer a desired function on an organism. Using only a Web browser, a participating contributor proposes a design expressed with CAD bricks that generate a relevant DNA sequence, which is then experimentally and intensively evaluated by the GenoCon organizers. The CAD bricks that comprise programs and databases as a Semantic Web are developed, executed, shared, reused, and well stocked on the secure Semantic Web platform called the Scientists' Networking System or SciNetS/SciNeS, based on which a CEM research center for synthetic biology and open innovation should be established. Copyright © 2011 Elsevier Inc. 
All rights reserved.
Tumor Biology and Immunology | Center for Cancer Research
The Comparative Brain Tumor Consortium is collaborating with the National Center for Advancing Translational Sciences to complete whole exome sequencing on canine meningioma samples. Results will be published and made publicly available.
ERIC Educational Resources Information Center
School Science Review, 1976
1976-01-01
Describes nine biology experiments, including osmosis, genetics, oxygen content of blood, enzymes in bean seedlings, preparation of bird skins, vascularization in bean seedlings, a game called "sequences" (applied to review situations), a crossword puzzle for human respiration, and physiology of the woodlouse. (CS)
Molecular-Sized DNA or RNA Sequencing Machine | NCI Technology Transfer Center | TTC
The National Cancer Institute's Gene Regulation and Chromosome Biology Laboratory is seeking statements of capability or interest from parties interested in collaborative research to co-develop a molecular-sized DNA or RNA sequencing machine.
Molecular Aspects of Head and Neck Cancer Therapy
Puram, Sidharth V.; Rocco, James W.
2015-01-01
Synopsis In spite of a rapidly expanding understanding of head and neck tumor biology as well as optimization of radiation, chemotherapy, and surgical treatment modalities, head and neck squamous cell carcinoma (HNSCC) remains a major cause of cancer related morbidity and mortality. Although our biologic understanding of these tumors had largely been limited to pathways driving proliferation, survival, and differentiation, the identification of HPV as a major driver of HNSCC, specifically oropharyngeal SCC, as well as recent genomic sequencing analyses of HNSCC has dramatically influenced our understanding of the underlying biology behind carcinogenesis, and in part, our approach to therapy. In particular, we are at a major molecular and clinical crossroads with an explosion of promising diagnostic and therapeutic agents that hold great promise. Here, we summarize our current understanding of HNSCC biology, including a review of recent sequencing analyses, and identify promising areas for potential diagnostic and therapeutic agents. PMID:26568543
Method for high resolution magnetic resonance analysis using magic angle technique
Wind, Robert A.; Hu, Jian Zhi
2003-11-25
A method of performing a magnetic resonance analysis of a biological object that includes placing the biological object in a main magnetic field and in a radio frequency field, the main magnetic field having a static field direction; rotating the biological object at a rotational frequency of less than about 100 Hz around an axis positioned at an angle of about 54°44' relative to the main magnetic static field direction; pulsing the radio frequency to provide a sequence that includes a magic angle turning pulse segment; and collecting data generated by the pulsed radio frequency. According to another embodiment, the radio frequency is pulsed to provide a sequence capable of producing a spectrum that is substantially free of spinning sideband peaks.
Parametric inference for biological sequence analysis.
Pachter, Lior; Sturmfels, Bernd
2004-11-16
One of the major successes in computational biology has been the unification, by using the graphical model formalism, of a multitude of algorithms for annotating and comparing biological sequences. Graphical models that have been applied to these problems include hidden Markov models for annotation, tree models for phylogenetics, and pair hidden Markov models for alignment. A single algorithm, the sum-product algorithm, solves many of the inference problems that are associated with different statistical models. This article introduces the polytope propagation algorithm for computing the Newton polytope of an observation from a graphical model. This algorithm is a geometric version of the sum-product algorithm and is used to analyze the parametric behavior of maximum a posteriori inference calculations for graphical models.
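For readers unfamiliar with the sum-product algorithm, here is a minimal Python sketch of its best-known special case, the forward recursion on a hidden Markov model. It does not implement polytope propagation; the two-state model, its parameters and the `add` argument are invented for illustration. Passing `add=max` swaps summation for maximization, which yields the probability of the single most likely state path, the quantity behind maximum a posteriori inference.

```python
from functools import reduce

# Toy two-state HMM; all numbers are made up for this sketch.
STATES = ("A", "B")
START = {"A": 0.5, "B": 0.5}
TRANS = {"A": {"A": 0.5, "B": 0.5}, "B": {"A": 0.5, "B": 0.5}}
EMIT = {"A": {"x": 0.5, "y": 0.5}, "B": {"x": 0.25, "y": 0.75}}


def forward(obs, add=lambda a, b: a + b):
    """Sum-product recursion along the chain of hidden states.

    alpha[s] combines (with `add`) the weights of all state paths that
    end in state s and emit the observations seen so far.
    """
    alpha = {s: START[s] * EMIT[s][obs[0]] for s in STATES}
    for symbol in obs[1:]:
        alpha = {
            t: reduce(add, (alpha[s] * TRANS[s][t] for s in STATES)) * EMIT[t][symbol]
            for t in STATES
        }
    return reduce(add, alpha.values())


likelihood = forward("xy")          # sum over all hidden paths
best_path = forward("xy", add=max)  # weight of the single best path
print(likelihood, best_path)
```

The recursion costs time proportional to the sequence length times the square of the number of states, rather than enumerating every hidden path.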
Chavshin, Ali Reza; Oshaghi, Mohammad Ali; Vatandoost, Hasan; Hanafi-Bojd, Ahmad Ali; Raeisi, Ahmad; Nikpoor, Fatemeh
2014-01-01
Objective To identify the biological forms, sporozoite rate and molecular characterization of Anopheles stephensi (An. stephensi) in Hormozgan and Sistan-Baluchistan provinces, the most important malarious areas in Iran. Methods Wild live An. stephensi samples were collected from different malarious areas in southern Iran. The biological forms were identified based on the number of egg-ridges. Molecular characterization of biological forms was verified by analysis of the mitochondrial cytochrome oxidase subunit I and II (mtDNA-COI/COII). Plasmodium infection was examined in the wild female specimens by a species-specific nested–PCR method. Results Results showed that all three biological forms, mysorensis, intermediate and type, are present in the study areas. Molecular investigations revealed no genetic variation between mtDNA COI/COII sequences of the biological forms, and no Plasmodium parasites were detected in the collected mosquito samples. Conclusions The presence of three biological forms with identical sequences showed that the known biological forms belong to a single taxon and that the various vectorial capacities reported for these forms are more likely to correspond to other epidemiological factors than to the morphotype of the populations. The lack of malaria parasite infection in An. stephensi, the most important vector of malaria, may be partly due to the success of the ongoing active malaria control program in the region. PMID:24144130
High-resolution characterization of sequence signatures due to non-random cleavage of cell-free DNA.
Chandrananda, Dineika; Thorne, Natalie P; Bahlo, Melanie
2015-06-17
High-throughput sequencing of cell-free DNA fragments found in human plasma has been used to non-invasively detect fetal aneuploidy, monitor organ transplants and investigate tumor DNA. However, many biological properties of this extracellular genetic material remain unknown. Research that further characterizes circulating DNA could substantially increase its diagnostic value by allowing the application of more sophisticated bioinformatics tools that lead to an improved signal to noise ratio in the sequencing data. In this study, we investigate various features of cell-free DNA in plasma using deep-sequencing data from two pregnant women (>70X, >50X) and compare them with matched cellular DNA. We utilize a descriptive approach to examine how the biological cleavage of cell-free DNA affects different sequence signatures such as fragment lengths, sequence motifs at fragment ends and the distribution of cleavage sites along the genome. We show that the size distributions of these cell-free DNA molecules are dependent on their autosomal and mitochondrial origin as well as the genomic location within chromosomes. DNA mapping to particular microsatellites and alpha repeat elements display unique size signatures. We show how cell-free fragments occur in clusters along the genome, localizing to nucleosomal arrays and are preferentially cleaved at linker regions by correlating the mapping locations of these fragments with ENCODE annotation of chromatin organization. Our work further demonstrates that cell-free autosomal DNA cleavage is sequence dependent. The region spanning up to 10 positions on either side of the DNA cleavage site show a consistent pattern of preference for specific nucleotides. This sequence motif is present in cleavage sites localized to nucleosomal cores and linker regions but is absent in nucleosome-free mitochondrial DNA. These background signals in cell-free DNA sequencing data stem from the non-random biological cleavage of these fragments. 
This sequence structure can be harnessed to improve bioinformatics algorithms, in particular for CNV and structural variant detection. Descriptive measures for cell-free DNA features developed here could also be used in biomarker analysis to monitor the changes that occur during different pathological conditions.
Pierrel, Jérôme
2012-01-01
The importance of viruses as model organisms is well-established in molecular biology and Max Delbrück's phage group set standards in the DNA phage field. In this paper, I argue that RNA phages, discovered in the 1960s, were also instrumental in the making of molecular biology. As part of experimental systems, RNA phages stood for messenger RNA (mRNA), genes and genome. RNA was thought to mediate information transfers between DNA and proteins. Furthermore, RNA was more manageable at the bench than DNA due to the availability of specific RNases, enzymes used as chemical tools to analyse RNA. Finally, RNA phages provided scientists with a pure source of mRNA to investigate the genetic code, genes and even a genome sequence. This paper focuses on Walter Fiers' laboratory at Ghent University (Belgium) and their work on the RNA phage MS2. When setting up his Laboratory of Molecular Biology, Fiers planned a comprehensive study of the virus with a strong emphasis on the issue of structure. In his lab, RNA sequencing, now a little-known technique, evolved gradually from a means to solve the genetic code, to a tool for completing the first genome sequence. Thus, I follow the research pathway of Fiers and his 'RNA phage lab' with their evolving experimental system from 1960 to the late 1970s. This study illuminates two decisive shifts in post-war biology: the emergence of molecular biology as a discipline in the 1960s in Europe and of genomics in the 1990s.
A topological approach for protein classification
Cang, Zixuan; Mu, Lin; Wu, Kedi; ...
2015-11-04
Protein function and dynamics are closely related to protein sequence and structure. However, prediction of protein function and dynamics from sequence and structure is still a fundamental challenge in molecular biology. Protein classification, which is typically done through measuring the similarity between proteins based on protein sequence or physical information, serves as a crucial step toward the understanding of protein function and dynamics.
ERIC Educational Resources Information Center
Shah, Kushani; Thomas, Shelby; Stein, Arnold
2013-01-01
In this report, we describe a 5-week laboratory exercise for undergraduate biology and biochemistry students in which students learn to sequence DNA and to genotype their DNA for selected single nucleotide polymorphisms (SNPs). Students use miniaturized DNA sequencing gels that require approximately 8 min to run. The students perform G, A, T, C…
Bacterial Genome Engineering and Synthetic Biology: Combating Pathogens
2016-11-04
engineering and SB methods such as recombineering, clustered regularly interspaced short palindromic repeats (CRISPR), and bacterial cell-cell...Cholera# Yersinia pseudotuberculosis# Staphylococcus aureus* Phage Engineering CRISPR/Cas9 Delivery of CRISPR genes and RNA guides for sequence...bear very close sequence alignment to the harmless strains via the use of the CRISPR/Cas9 system. The CRISPR system specifically targets a DNA sequence
Sequencing Conservation Actions Through Threat Assessments in the Southeastern United States
Robert D. Sutter; Christopher C. Szell
2006-01-01
The identification of conservation priorities is one of the leading issues in conservation biology. We present a project of The Nature Conservancy, called Sequencing Conservation Actions, which prioritizes conservation areas and identifies foci for crosscutting strategies at various geographic scales. We use the term “Sequencing” to mean an ordering of actions over...
FALDO: a semantic standard for describing the location of nucleotide and protein feature annotation.
Bolleman, Jerven T; Mungall, Christopher J; Strozzi, Francesco; Baran, Joachim; Dumontier, Michel; Bonnal, Raoul J P; Buels, Robert; Hoehndorf, Robert; Fujisawa, Takatomo; Katayama, Toshiaki; Cock, Peter J A
2016-06-13
Nucleotide and protein sequence feature annotations are essential to understand biology on the genomic, transcriptomic, and proteomic level. Using Semantic Web technologies to query biological annotations, there was no standard that described this potentially complex location information as subject-predicate-object triples. We have developed an ontology, the Feature Annotation Location Description Ontology (FALDO), to describe the positions of annotated features on linear and circular sequences. FALDO can be used to describe nucleotide features in sequence records, protein annotations, and glycan binding sites, among other features in coordinate systems of the aforementioned "omics" areas. Using the same data format to represent sequence positions that are independent of file formats allows us to integrate sequence data from multiple sources and data types. The genome browser JBrowse is used to demonstrate accessing multiple SPARQL endpoints to display genomic feature annotations, as well as protein annotations from UniProt mapped to genomic locations. Our ontology allows users to uniformly describe - and potentially merge - sequence annotations from multiple sources. Data sources using FALDO can prospectively be retrieved using federalised SPARQL queries against public SPARQL endpoints and/or local private triple stores.
Evolution of Enzyme Superfamilies: Comprehensive Exploration of Sequence-Function Relationships.
Baier, F; Copp, J N; Tokuriki, N
2016-11-22
The sequence and functional diversity of enzyme superfamilies have expanded through billions of years of evolution from a common ancestor. Understanding how protein sequence and functional "space" have expanded, at both the evolutionary and molecular level, is central to biochemistry, molecular biology, and evolutionary biology. Integrative approaches that examine protein sequence, structure, and function have begun to provide comprehensive views of the functional diversity and evolutionary relationships within enzyme superfamilies. In this review, we outline the recent advances in our understanding of enzyme evolution and superfamily functional diversity. We describe the tools that have been used to comprehensively analyze sequence relationships and to characterize sequence and function relationships. We also highlight recent large-scale experimental approaches that systematically determine the activity profiles across enzyme superfamilies. We identify several intriguing insights from this recent body of work. First, promiscuous activities are prevalent among extant enzymes. Second, many divergent proteins retain "function connectivity" via enzyme promiscuity, which can be used to probe the evolutionary potential and history of enzyme superfamilies. Finally, we discuss open questions regarding the intricacies of enzyme divergence, as well as potential research directions that will deepen our understanding of enzyme superfamily evolution.
FALDO: a semantic standard for describing the location of nucleotide and protein feature annotation
Bolleman, Jerven T.; Mungall, Christopher J.; Strozzi, Francesco; ...
2016-06-13
Nucleotide and protein sequence feature annotations are essential to understand biology on the genomic, transcriptomic, and proteomic level. Using Semantic Web technologies to query biological annotations, there was no standard that described this potentially complex location information as subject-predicate-object triples. In this paper, we have developed an ontology, the Feature Annotation Location Description Ontology (FALDO), to describe the positions of annotated features on linear and circular sequences. FALDO can be used to describe nucleotide features in sequence records, protein annotations, and glycan binding sites, among other features in coordinate systems of the aforementioned “omics” areas. Using the same data format to represent sequence positions that are independent of file formats allows us to integrate sequence data from multiple sources and data types. The genome browser JBrowse is used to demonstrate accessing multiple SPARQL endpoints to display genomic feature annotations, as well as protein annotations from UniProt mapped to genomic locations. Our ontology allows users to uniformly describe – and potentially merge – sequence annotations from multiple sources. Finally, data sources using FALDO can prospectively be retrieved using federalised SPARQL queries against public SPARQL endpoints and/or local private triple stores.
Glycan fragment database: a database of PDB-based glycan 3D structures.
Jo, Sunhwan; Im, Wonpil
2013-01-01
The glycan fragment database (GFDB), freely available at http://www.glycanstructure.org, is a database of the glycosidic torsion angles derived from the glycan structures in the Protein Data Bank (PDB). Analogous to protein structure, the structure of an oligosaccharide chain in a glycoprotein, referred to as a glycan, can be characterized by the torsion angles of glycosidic linkages between relatively rigid carbohydrate monomeric units. Knowledge of accessible conformations of biologically relevant glycans is essential in understanding their biological roles. The GFDB provides an intuitive glycan sequence search tool that allows the user to search complex glycan structures. After a glycan search is complete, each glycosidic torsion angle distribution is displayed in terms of the exact match and the fragment match. The exact match results are from the PDB entries that contain the glycan sequence identical to the query sequence. The fragment match results are from the entries with the glycan sequence whose substructure (fragment) or entire sequence is matched to the query sequence, such that the fragment results implicitly include the influences from the nearby carbohydrate residues. In addition, clustering analysis based on the torsion angle distribution can be performed to obtain the representative structures among the searched glycan structures.
Characterizing the D2 statistic: word matches in biological sequences.
Forêt, Sylvain; Wilson, Susan R; Burden, Conrad J
2009-01-01
Word matches are often used in sequence comparison methods, either as a measure of sequence similarity or in the first search steps of algorithms such as BLAST or BLAT. The D2 statistic is the number of matches of words of k letters between two sequences. Recent advances have been made in the characterization of this statistic and in the approximation of its distribution. Here, these results are extended to the case of approximate word matches. We compute the exact value of the variance of the D2 statistic for the case of a uniform letter distribution, and introduce a method to provide accurate approximations of the variance in the remaining cases. This enables the distribution of D2 to be approximated for typical situations arising in biological research. We apply these results to the identification of cis-regulatory modules, and show that this method detects such sequences with a high accuracy. The ability to approximate the distribution of D2 for both exact and approximate word matches will enable the use of this statistic in a more precise manner for sequence comparison, database searches, and identification of transcription factor binding sites.
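As a minimal sketch of the definition given in this abstract, the D2 statistic for exact word matches can be computed by counting every k-letter word in each sequence and summing the products of the counts for shared words. The function names `kmer_counts` and `d2_statistic` are illustrative and not taken from the paper; approximate word matches and the variance approximations the authors describe are not covered here.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping word of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2_statistic(seq_a, seq_b, k):
    """Number of exact k-letter word matches between two sequences.

    Each occurrence of a word in seq_a is paired with each occurrence
    of the same word in seq_b, so the result is the sum over words of
    count_a(word) * count_b(word).
    """
    counts_a = kmer_counts(seq_a, k)
    counts_b = kmer_counts(seq_b, k)
    return sum(n * counts_b[word] for word, n in counts_a.items())
```

For example, `d2_statistic("ACGT", "ACGT", 2)` pairs the words AC, CG and GT once each. Counting words first keeps the cost linear in the sequence lengths rather than comparing every pair of positions.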
Clustering and visualizing similarity networks of membrane proteins.
Hu, Geng-Ming; Mai, Te-Lun; Chen, Chi-Ming
2015-08-01
We proposed a fast and unsupervised clustering method, minimum span clustering (MSC), for analyzing the sequence-structure-function relationship of biological networks, and demonstrated its validity in clustering the sequence/structure similarity networks (SSN) of 682 membrane protein (MP) chains. The MSC clustering of MPs based on their sequence information was found to be consistent with their tertiary structures and functions. For the largest seven clusters predicted by MSC, the consistency in chain function within the same cluster is found to be 100%. From analyzing the edge distribution of SSN for MPs, we found a characteristic threshold distance for the boundary between clusters, over which SSN of MPs could be properly clustered by an unsupervised sparsification of the network distance matrix. The clustering results of MPs from both MSC and the unsupervised sparsification methods are consistent with each other, and have high intracluster similarity and low intercluster similarity in sequence, structure, and function. Our study showed a strong sequence-structure-function relationship of MPs. We discussed evidence of convergent evolution of MPs and suggested applications in finding structural similarities and predicting biological functions of MP chains based on their sequence information. © 2015 Wiley Periodicals, Inc.
FALDO: a semantic standard for describing the location of nucleotide and protein feature annotation
DOE Office of Scientific and Technical Information (OSTI.GOV)
Bolleman, Jerven T.; Mungall, Christopher J.; Strozzi, Francesco
Nucleotide and protein sequence feature annotations are essential to understand biology on the genomic, transcriptomic, and proteomic level. Using Semantic Web technologies to query biological annotations, there was no standard that described this potentially complex location information as subject-predicate-object triples. In this paper, we have developed an ontology, the Feature Annotation Location Description Ontology (FALDO), to describe the positions of annotated features on linear and circular sequences. FALDO can be used to describe nucleotide features in sequence records, protein annotations, and glycan binding sites, among other features in coordinate systems of the aforementioned “omics” areas. Using the same data format to represent sequence positions that are independent of file formats allows us to integrate sequence data from multiple sources and data types. The genome browser JBrowse is used to demonstrate accessing multiple SPARQL endpoints to display genomic feature annotations, as well as protein annotations from UniProt mapped to genomic locations. Our ontology allows users to uniformly describe – and potentially merge – sequence annotations from multiple sources. Finally, data sources using FALDO can prospectively be retrieved using federalised SPARQL queries against public SPARQL endpoints and/or local private triple stores.
Evolutionary distances in the twilight zone--a rational kernel approach.
Schwarz, Roland F; Fletcher, William; Förster, Frank; Merget, Benjamin; Wolf, Matthias; Schultz, Jörg; Markowetz, Florian
2010-12-31
Phylogenetic tree reconstruction is traditionally based on multiple sequence alignments (MSAs) and heavily depends on the validity of this information bottleneck. With increasing sequence divergence, the quality of MSAs decays quickly. Alignment-free methods, on the other hand, are based on abstract string comparisons and avoid potential alignment problems. However, in general they are not biologically motivated and ignore our knowledge about the evolution of sequences. Thus, it is still a major open question how to define an evolutionary distance metric between divergent sequences that makes use of indel information and known substitution models without the need for a multiple alignment. Here we propose a new evolutionary distance metric to close this gap. It uses finite-state transducers to create a biologically motivated similarity score which models substitutions and indels, and does not depend on a multiple sequence alignment. The sequence similarity score is defined in analogy to pairwise alignments and additionally has the positive semi-definite property. We describe its derivation and show in simulation studies and real-world examples that it is more accurate in reconstructing phylogenies than competing methods. The result is a new and accurate way of determining evolutionary distances in and beyond the twilight zone of sequence alignments that is suitable for large datasets.
Thomas, Paul D; Kejariwal, Anish; Campbell, Michael J; Mi, Huaiyu; Diemer, Karen; Guo, Nan; Ladunga, Istvan; Ulitsky-Lazareva, Betty; Muruganujan, Anushya; Rabkin, Steven; Vandergriff, Jody A; Doremieux, Olivier
2003-01-01
The PANTHER database was designed for high-throughput analysis of protein sequences. One of the key features is a simplified ontology of protein function, which allows browsing of the database by biological functions. Biologist curators have associated the ontology terms with groups of protein sequences rather than individual sequences. Statistical models (Hidden Markov Models, or HMMs) are built from each of these groups. The advantage of this approach is that new sequences can be automatically classified as they become available. To ensure accurate functional classification, HMMs are constructed not only for families, but also for functionally distinct subfamilies. Multiple sequence alignments and phylogenetic trees, including curator-assigned information, are available for each family. The current version of the PANTHER database includes training sequences from all organisms in the GenBank non-redundant protein database, and the HMMs have been used to classify gene products across the entire genomes of human and Drosophila melanogaster. The ontology terms and protein families and subfamilies, as well as Drosophila gene classifications, can be browsed and searched for free. Due to outstanding contractual obligations, access to human gene classifications and to protein family trees and multiple sequence alignments will temporarily require a nominal registration fee. PANTHER is publicly available on the web at http://panther.celera.com.
Efficient use of unlabeled data for protein sequence classification: a comparative study.
Kuksa, Pavel; Huang, Pai-Hsi; Pavlovic, Vladimir
2009-04-29
Recent studies in computational primary protein sequence analysis have leveraged the power of unlabeled data. For example, predictive models based on string kernels trained on sequences known to belong to particular folds or superfamilies, the so-called labeled data set, can attain significantly improved accuracy if this data is supplemented with protein sequences that lack any class tags-the unlabeled data. In this study, we present a principled and biologically motivated computational framework that more effectively exploits the unlabeled data by only using the sequence regions that are more likely to be biologically relevant for better prediction accuracy. As overly-represented sequences in large uncurated databases may bias the estimation of computational models that rely on unlabeled data, we also propose a method to remove this bias and improve performance of the resulting classifiers. Combined with state-of-the-art string kernels, our proposed computational framework achieves very accurate semi-supervised protein remote fold and homology detection on three large unlabeled databases. It outperforms current state-of-the-art methods and exhibits significant reduction in running time. The unlabeled sequences used under the semi-supervised setting resemble the unpolished gemstones; when used as-is, they may carry unnecessary features and hence compromise the classification accuracy but once cut and polished, they improve the accuracy of the classifiers considerably.
Feng, Hao; Conneely, Karen N.; Wu, Hao
2014-01-01
DNA methylation is an important epigenetic modification that has essential roles in cellular processes including gene regulation, development and disease and is widely dysregulated in most types of cancer. Recent advances in sequencing technology have enabled the measurement of DNA methylation at single nucleotide resolution through methods such as whole-genome bisulfite sequencing and reduced representation bisulfite sequencing. In DNA methylation studies, a key task is to identify differences under distinct biological contexts, for example, between tumor and normal tissue. A challenge in sequencing studies is that the number of biological replicates is often limited by the costs of sequencing. The small number of replicates leads to unstable variance estimation, which can reduce accuracy to detect differentially methylated loci (DML). Here we propose a novel statistical method to detect DML when comparing two treatment groups. The sequencing counts are described by a lognormal-beta-binomial hierarchical model, which provides a basis for information sharing across different CpG sites. A Wald test is developed for hypothesis testing at each CpG site. Simulation results show that the proposed method yields improved DML detection compared to existing methods, particularly when the number of replicates is low. The proposed method is implemented in the Bioconductor package DSS. PMID:24561809
Genomics and breeding in food crops
USDA-ARS's Scientific Manuscript database
Plant biology is in the midst of a revolution. The generation of tremendous volumes of sequence information introduce new technical challenges into plant biology and agriculture. The relatively new field of bioinformatics addresses these challenges by utilizing efficient data management strategies;...
Agricultural biodiversity in the post-genomics era
USDA-ARS's Scientific Manuscript database
The toolkit available for assessing and utilizing biological diversity within agricultural systems is rapidly expanding. In particular, genome and transcriptome re-sequencing as well as genome complexity reduction techniques are gaining popularity as the cost of generating short read sequence data d...
2013-01-01
Background The revolution in DNA sequencing technology continues unabated, and is affecting all aspects of the biological and medical sciences. The training and recruitment of the next generation of researchers who are able to use and exploit the new technology is severely lacking and potentially negatively influencing research and development efforts to advance genome biology. Here we present a cross-disciplinary course that provides undergraduate students with practical experience in running a next generation sequencing instrument through to the analysis and annotation of the generated DNA sequences. Results Many labs across the world are installing next generation sequencing technology and we show that the undergraduate students produce quality sequence data and were excited to participate in cutting edge research. The students conducted the work flow from DNA extraction, library preparation, running the sequencing instrument, to the extraction and analysis of the data. They sequenced microbes, metagenomes, and a marine mammal, the Californian sea lion, Zalophus californianus. The students met sequencing quality controls, had no detectable contamination in the targeted DNA sequences, provided publication quality data, and became part of an international collaboration to investigate carcinomas in carnivores. Conclusions Students learned important skills for their future education and career opportunities, and a perceived increase in students’ ability to conduct independent scientific research was measured. DNA sequencing is rapidly expanding in the life sciences. Teaching undergraduates to use the latest technology to sequence genomic DNA ensures they are ready to meet the challenges of the genomic era and allows them to participate in annotating the tree of life. PMID:24007365
Loreille, Odile; Ratnayake, Shashikala; Stockwell, Timothy B.; Mallick, Swapan; Skoglund, Pontus; Onorato, Anthony J.; Bergman, Nicholas H.; Reich, David; Irwin, Jodi A.
2018-01-01
High throughput sequencing (HTS) has been used for a number of years in the field of paleogenomics to facilitate the recovery of small DNA fragments from ancient specimens. Recently, these techniques have also been applied in forensics, where they have been used for the recovery of mitochondrial DNA sequences from samples where traditional PCR-based assays fail because of the very short length of endogenous DNA molecules. Here, we describe the biological sexing of a ~4000-year-old Egyptian mummy using shotgun sequencing and two established methods of biological sex determination (RX and RY), by way of mitochondrial genome analysis as a means of sequence data authentication. This particular case of historical interest increases the potential utility of HTS techniques for forensic purposes by demonstrating that data from the more discriminatory nuclear genome can be recovered from the most damaged specimens, even in cases where mitochondrial DNA cannot be recovered with current PCR-based forensic technologies. Although additional work remains to be done before nuclear DNA recovered via these methods can be used routinely in operational casework for individual identification purposes, these results indicate substantial promise for the retrieval of probative individually identifying DNA data from the most limited and degraded forensic specimens. PMID:29494531
Gasc, Cyrielle; Peyretaillade, Eric
2016-01-01
Abstract The recent expansion of next-generation sequencing has significantly improved biological research. Nevertheless, deep exploration of genomes or metagenomic samples remains difficult because of the sequencing depth and the associated costs required. Therefore, different partitioning strategies have been developed to sequence informative subsets of studied genomes. Among these strategies, hybridization capture has proven to be an innovative and efficient tool for targeting and enriching specific biomarkers in complex DNA mixtures. It has been successfully applied in numerous areas of biology, such as exome resequencing for the identification of mutations underlying Mendelian or complex diseases and cancers, and its usefulness has been demonstrated in the agronomic field through the linking of genetic variants to agricultural phenotypic traits of interest. Moreover, hybridization capture has provided access to underexplored, but relevant fractions of genomes through its ability to enrich defined targets and their flanking regions. Finally, on the basis of restricted genomic information, this method has also allowed the expansion of knowledge of nonreference species and ancient genomes and provided a better understanding of metagenomic samples. In this review, we present the major advances and discoveries permitted by hybridization capture and highlight the potency of this approach in all areas of biology. PMID:27105841
Zhu, Hui; Wang, Wen-Xiu; Wang, Bao-Qin; Zhu, Xiao-Fu; Wu, Xu-Jin; Ma, Qing-Yi; Chen, De-Kun
2012-06-29
The giant panda (Ailuropoda melanoleuca) is an endangered species and indigenous to China. Interferon-gamma (IFN-γ) is the only member of type II IFN and is vital for the regulation of host adapted immunity and inflammatory response. Little is known about the IFN-γ gene and its roles in giant panda. In this study, IFN-γ gene of Qinling giant panda was amplified from total blood RNA by RT-PCR, cloned, sequenced and analysed. The open reading frame (ORF) of Qinling giant panda IFN-γ encodes 152 amino acids and is highly similar to Sichuan giant panda with an identity of 99.3% in cDNA sequence. The IFN-γ cDNA sequence was ligated to the pET32a vector and transformed into E. coli BL21 competent cells. Expression of recombinant IFN-γ protein of Qinling giant panda in E. coli was confirmed by SDS-PAGE and Western blot analysis. Biological activity assay indicated that the recombinant IFN-γ protein at the concentration of 4-10 µg/ml activated the giant panda peripheral blood lymphocytes, while at 12 µg/ml it inhibited the activation of the lymphocytes. These findings provide insights into the evolution of giant panda IFN-γ and information regarding amino acid residues essential for their biological activity.
Gasc, Cyrielle; Peyretaillade, Eric; Peyret, Pierre
2016-06-02
The recent expansion of next-generation sequencing has significantly improved biological research. Nevertheless, deep exploration of genomes or metagenomic samples remains difficult because of the sequencing depth and the associated costs required. Therefore, different partitioning strategies have been developed to sequence informative subsets of studied genomes. Among these strategies, hybridization capture has proven to be an innovative and efficient tool for targeting and enriching specific biomarkers in complex DNA mixtures. It has been successfully applied in numerous areas of biology, such as exome resequencing for the identification of mutations underlying Mendelian or complex diseases and cancers, and its usefulness has been demonstrated in the agronomic field through the linking of genetic variants to agricultural phenotypic traits of interest. Moreover, hybridization capture has provided access to underexplored, but relevant fractions of genomes through its ability to enrich defined targets and their flanking regions. Finally, on the basis of restricted genomic information, this method has also allowed the expansion of knowledge of nonreference species and ancient genomes and provided a better understanding of metagenomic samples. In this review, we present the major advances and discoveries permitted by hybridization capture and highlight the potency of this approach in all areas of biology. © The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research.
Development as adaptation: a paradigm for gravitational and space biology
NASA Technical Reports Server (NTRS)
Alberts, Jeffrey R.; Ronca, April E.
2005-01-01
Adaptation is a central precept of biology; it provides a framework for identifying functional significance. We equate mammalian development with adaptation, by viewing the developmental sequence as a series of adaptations to a stereotyped sequence of habitats. In this way development is adaptation. The Norway rat is used as a mammalian model, and the sequence of habitats that is used to define its adaptive-developmental sequence is (a) the uterus, (b) the mother's body, (c) the huddle, and (d) the coterie of pups as they gain independence. Then, within this framework and in relation to each of the habitats, we consider problems of organismal responses to altered gravitational forces (micro-g to hyper-g), especially those encountered during space flight and centrifugation. This approach enables a clearer identification of simple "effects" and active "responses" with respect to gravity. It focuses our attention on functional systems and brings to the fore the manner in which experience shapes somatic adaptation. We argue that this basic developmental approach is not only central to basic issues in gravitational biology, but that it provides a natural tool for understanding the underlying processes that are vital to astronaut health and well-being during long duration flights that will involve adaptation to space flight conditions and eventual re-adaptation to Earth's gravity.
Kuraku, Shigehiro; Zmasek, Christian M; Nishimura, Osamu; Katoh, Kazutaka
2013-07-01
We report a new web server, aLeaves (http://aleaves.cdb.riken.jp/), for homologue collection from diverse animal genomes. In molecular comparative studies involving multiple species, orthology identification is the basis on which most subsequent biological analyses rely. It can be achieved most accurately by explicit phylogenetic inference. More and more species are subjected to large-scale sequencing, but the resultant resources are scattered in independent project-based, and multi-species, but separate, web sites. This complicates data access and is becoming a serious barrier to the comprehensiveness of molecular phylogenetic analysis. aLeaves, launched to overcome this difficulty, collects sequences similar to an input query sequence from various data sources. The collected sequences can be passed on to the MAFFT sequence alignment server (http://mafft.cbrc.jp/alignment/server/), which has been significantly improved in interactivity. This update enables switching between (i) sequence selection using the Archaeopteryx tree viewer, (ii) multiple sequence alignment and (iii) tree inference. This can be performed as a loop until one reaches a sensible data set, which minimizes redundancy for better visibility and handling in phylogenetic inference while covering relevant taxa. The work flow achieved by the seamless link between aLeaves and MAFFT provides a convenient online platform to address various questions in zoology and evolutionary biology.
Information capacity of nucleotide sequences and its applications.
Sadovsky, M G
2006-05-01
The information capacity of nucleotide sequences is defined through the specific entropy of the frequency dictionary of a sequence, determined with respect to another dictionary containing the most probable continuations of shorter strings. This measure distinguishes a sequence both from a random one and from an ordered entity. A comparison of sequences based on their information capacity is studied. An order within the genetic entities is found at length scales ranging from 3 to 8. Some other applications of the developed methodology to genetics, bioinformatics, and molecular biology are discussed.
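The measure described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the author's implementation: the frequency dictionary is taken to be the relative frequencies of overlapping k-mers, the sequence is closed into a ring so that dictionaries of different lengths stay consistent, and the "most probable continuation" is approximated by the estimate f(w1..wk-1) · f(w2..wk) / f(w2..wk-1). The function names are hypothetical.

```python
from collections import Counter
from math import log


def frequency_dictionary(seq, k):
    """Relative frequencies of overlapping k-mers.

    Assumption: the sequence is treated as circular, so every position
    starts exactly one k-mer and the dictionaries for different k agree
    on their marginals.
    """
    ring = seq + seq[:k - 1]
    counts = Counter(ring[i:i + k] for i in range(len(seq)))
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()}


def information_capacity(seq, k):
    """Relative entropy of the observed k-mer dictionary against the
    dictionary reconstructed from shorter words (assumed formula)."""
    if k < 2:
        raise ValueError("k must be at least 2")
    f_k = frequency_dictionary(seq, k)
    f_k1 = frequency_dictionary(seq, k - 1)
    f_k2 = frequency_dictionary(seq, k - 2) if k > 2 else None
    capacity = 0.0
    for word, f in f_k.items():
        # Frequency of the shared middle part; 1.0 when it is empty (k == 2).
        middle = f_k2[word[1:-1]] if f_k2 is not None else 1.0
        expected = f_k1[word[:-1]] * f_k1[word[1:]] / middle
        capacity += f * log(f / expected)
    return capacity


if __name__ == "__main__":
    toy = "ACGT" * 10  # made-up periodic sequence, for illustration only
    for k in (2, 3):
        print(k, information_capacity(toy, k))
```

In this toy case the pairs carry all the order: once 2-mers are known, 3-mers are fully predicted, so the value at k = 3 drops to zero.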
Febrer, Melanie; Goicoechea, Jose Luis; Wright, Jonathan; McKenzie, Neil; Song, Xiang; Lin, Jinke; Collura, Kristi; Wissotski, Marina; Yu, Yeisoo; Ammiraju, Jetty S. S.; Wolny, Elzbieta; Idziak, Dominika; Betekhtin, Alexander; Kudrna, Dave; Hasterok, Robert; Wing, Rod A.; Bevan, Michael W.
2010-01-01
The pooid subfamily of grasses includes some of the most important crop, forage and turf species, such as wheat, barley and Lolium. Developing genomic resources, such as whole-genome physical maps, for analysing the large and complex genomes of these crops and for facilitating biological research in grasses is an important goal in plant biology. We describe a bacterial artificial chromosome (BAC)-based physical map of the wild pooid grass Brachypodium distachyon and integrate this with whole genome shotgun sequence (WGS) assemblies using BAC end sequences (BES). The resulting physical map contains 26 contigs spanning the 272 Mb genome. BES from the physical map were also used to integrate a genetic map. This provides an independent validation and confirmation of the published WGS assembly. Mapped BACs were used in Fluorescence In Situ Hybridisation (FISH) experiments to align the integrated physical map and sequence assemblies to chromosomes with high resolution. The physical, genetic and cytogenetic maps, integrated with whole genome shotgun sequence assemblies, enhance the accuracy and durability of this important genome sequence and will directly facilitate gene isolation. PMID:20976139
Townsley, Brad T; Covington, Michael F; Ichihashi, Yasunori; Zumstein, Kristina; Sinha, Neelima R
2015-01-01
Next Generation Sequencing (NGS) is driving rapid advancement in biological understanding and RNA-sequencing (RNA-seq) has become an indispensable tool for biology and medicine. There is a growing need for access to these technologies although preparation of NGS libraries remains a bottleneck to wider adoption. Here we report a novel method for the production of strand specific RNA-seq libraries utilizing the terminal breathing of double-stranded cDNA to capture and incorporate a sequencing adapter. Breath Adapter Directional sequencing (BrAD-seq) reduces sample handling and requires far fewer enzymatic steps than most available methods to produce high quality strand-specific RNA-seq libraries. The method we present is optimized for 3-prime Digital Gene Expression (DGE) libraries and can easily extend to full transcript coverage shotgun (SHO) type strand-specific libraries and is modularized to accommodate a diversity of RNA and DNA input materials. BrAD-seq offers a highly streamlined and inexpensive option for RNA-seq libraries.
Jayakumar, Amal; Chang, Bonnie X; Widner, Brittany; Bernhardt, Peter; Mulholland, Margaret R; Ward, Bess B
2017-10-01
Biological nitrogen fixation (BNF) was investigated above and within the oxygen-depleted waters of the oxygen-minimum zone of the Eastern Tropical North Pacific Ocean. BNF rates were estimated using an isotope tracer method that overcame the uncertainty of the conventional bubble method by directly measuring the tracer enrichment during the incubations. Highest rates of BNF (~4 nM day⁻¹) occurred in coastal surface waters and lowest detectable rates (~0.2 nM day⁻¹) were found in the anoxic region of offshore stations. BNF was not detectable in most samples from oxygen-depleted waters. The composition of the N₂-fixing assemblage was investigated by sequencing of nifH genes. The diazotrophic assemblage in surface waters contained mainly Proteobacterial sequences (Cluster I nifH), while both Proteobacterial sequences and sequences with high identities to those of anaerobic microbes characterized as Clusters III and IV type nifH sequences were found in the anoxic waters. Our results indicate modest input of N through BNF in oxygen-depleted zones mainly due to the activity of proteobacterial diazotrophs.
Pragmatic turn in biology: From biological molecules to genetic content operators.
Witzany, Guenther
2014-08-26
Erwin Schrödinger's question "What is life?" was answered for decades with "physics + chemistry". The concepts of Alan Turing and John von Neumann introduced a third term: "information". This led to the understanding of nucleic acid sequences as a natural code. Manfred Eigen adapted Hamming's concept of "sequence space". Similar to Hilbert space, in which every ontological entity could be defined by an unequivocal point in a mathematical axiomatic system, in the abstract "sequence space" concept each point represents a unique syntactic structure and the value of their separation represents their dissimilarity. In this concept molecular features of the genetic code evolve by means of self-organisation of matter. Biological selection determines the fittest types among varieties of replication errors of quasi-species. The quasi-species concept dominated evolution theory for many decades. In contrast to this, recent empirical data on the evolution of DNA and its forerunners, the RNA-world and viruses, indicate cooperative agent-based interactions. Group behaviour of quasi-species consortia constitutes de novo and arranges available genetic content for adaptational purposes within real-life contexts that determine epigenetic markings. This review focuses on some fundamental changes in biology, discarding its traditional status as a subdiscipline of physics and chemistry.
Networking Omic Data to Envisage Systems Biological Regulation.
Kalapanulak, Saowalak; Saithong, Treenut; Thammarongtham, Chinae
To understand how biological processes work, it is necessary to explore the systematic regulation governing the behaviour of the processes. Not only driving the normal behavior of organisms, the systematic regulation evidently underlies the temporal responses to surrounding environments (dynamics) and long-term phenotypic adaptation (evolution). The systematic regulation is, in effect, formulated from the regulatory components which collaboratively work together as a network. In the drive to decipher such a code of lives, a spectrum of technologies has continuously been developed in the post-genomic era. With current advances, high-throughput sequencing technologies are tremendously powerful for facilitating genomics and systems biology studies in the attempt to understand system regulation inside the cells. The ability to explore relevant regulatory components which infer transcriptional and signaling regulation, driving core cellular processes, is thus enhanced. This chapter reviews high-throughput sequencing technologies, including second and third generation sequencing technologies, which support the investigation of genomics and transcriptomics data. Utilization of this high-throughput data to form the virtual network of systems regulation is explained, particularly transcriptional regulatory networks. Analysis of the resulting regulatory networks could lead to an understanding of cellular systems regulation at the mechanistic and dynamics levels. The great contribution of the biological networking approach to envisage systems regulation is finally demonstrated by a broad range of examples.
EnsMart: A Generic System for Fast and Flexible Access to Biological Data
Kasprzyk, Arek; Keefe, Damian; Smedley, Damian; London, Darin; Spooner, William; Melsopp, Craig; Hammond, Martin; Rocca-Serra, Philippe; Cox, Tony; Birney, Ewan
2004-01-01
The EnsMart system (www.ensembl.org/EnsMart) provides a generic data warehousing solution for fast and flexible querying of large biological data sets and integration with third-party data and tools. The system consists of a query-optimized database and interactive, user-friendly interfaces. EnsMart has been applied to Ensembl, where it extends its genomic browser capabilities, facilitating rapid retrieval of customized data sets. A wide variety of complex queries, on various types of annotations, for numerous species are supported. These can be applied to many research problems, ranging from SNP selection for candidate gene screening, through cross-species evolutionary comparisons, to microarray annotation. Users can group and refine biological data according to many criteria, including cross-species analyses, disease links, sequence variations, and expression patterns. Both tabulated list data and biological sequence output can be generated dynamically, in HTML, text, Microsoft Excel, and compressed formats. A wide range of sequence types, such as cDNA, peptides, coding regions, UTRs, and exons, with additional upstream and downstream regions, can be retrieved. The EnsMart database can be accessed via a public Web site, or through a Java application suite. Both implementations and the database are freely available for local installation, and can be extended or adapted to `non-Ensembl' data sets. PMID:14707178
DOE Office of Scientific and Technical Information (OSTI.GOV)
Loots, G G; Ovcharenko, I; Collette, N
2007-02-26
Generating the sequence of the human genome represents a colossal achievement for science and mankind. The technical use for the human genome project information holds great promise to cure disease, prevent bioterror threats, as well as to learn about human origins. Yet converting the sequence data into biologically meaningful information has not been immediately obvious, and we are still in the preliminary stages of understanding how the genome is organized, what the functional building blocks are and how these sequences mediate complex biological processes. The overarching goal of this program was to develop novel methods and high throughput strategies for determining the functions of "anonymous" human genes that are evolutionarily deeply conserved in other vertebrates. We coupled analytical tool development and computational predictions regarding gene function with novel high throughput experimental strategies and tested biological predictions in the laboratory. The tools required for comparative genomic data-mining are fundamentally the same whether they are applied to scientific studies of related microbes or the search for functions of novel human genes. For this reason the tools, conceptual framework and the coupled informatics-experimental biology paradigm we developed in this LDRD have many potential scientific applications relevant to LLNL multidisciplinary research in bio-defense, bioengineering, bionanosciences and microbial and environmental genomics.
The sequence of sequencers: The history of sequencing DNA.
Heather, James M; Chain, Benjamin
2016-01-01
Determining the order of nucleic acid residues in biological samples is an integral component of a wide variety of research applications. Over the last fifty years large numbers of researchers have applied themselves to the production of techniques and technologies to facilitate this feat, sequencing DNA and RNA molecules. This time-scale has witnessed tremendous changes, moving from sequencing short oligonucleotides to millions of bases, from struggling towards the deduction of the coding sequence of a single gene to rapid and widely available whole genome sequencing. This article traverses those years, iterating through the different generations of sequencing technology, highlighting some of the key discoveries, researchers, and sequences along the way.
Transcriptome-based differentiation of closely-related Miscanthus lines.
Chouvarine, Philippe; Cooksey, Amanda M; McCarthy, Fiona M; Ray, David A; Baldwin, Brian S; Burgess, Shane C; Peterson, Daniel G
2012-01-01
Distinguishing between individuals is critical to those conducting animal/plant breeding, food safety/quality research, diagnostic and clinical testing, and evolutionary biology studies. Classical genetic identification studies are based on marker polymorphisms, but polymorphism-based techniques are time and labor intensive and often cannot distinguish between closely related individuals. Illumina sequencing technologies provide the detailed sequence data required for rapid and efficient differentiation of related species, lines/cultivars, and individuals in a cost-effective manner. Here we describe the use of Illumina high-throughput exome sequencing, coupled with SNP mapping, as a rapid means of distinguishing between related cultivars of the lignocellulosic bioenergy crop giant miscanthus (Miscanthus × giganteus). We provide the first exome sequence database for Miscanthus species complete with Gene Ontology (GO) functional annotations. A SNP comparative analysis of rhizome-derived cDNA sequences was successfully utilized to distinguish three Miscanthus × giganteus cultivars from each other and from other Miscanthus species. Moreover, the resulting phylogenetic tree generated from SNP frequency data parallels the known breeding history of the plants examined. Some of the giant miscanthus plants exhibit considerable sequence divergence. Here we describe an analysis of Miscanthus in which high-throughput exome sequencing was utilized to differentiate between closely related genotypes despite the current lack of a reference genome sequence. We functionally annotated the exome sequences and provide resources to support Miscanthus systems biology. In addition, we demonstrate the use of the commercial high-performance cloud computing to do computational GO annotation.
Interpreting Microbial Biosynthesis in the Genomic Age: Biological and Practical Considerations
Miller, Ian J.; Chevrette, Marc G.; Kwan, Jason C.
2017-01-01
Genome mining has become an increasingly powerful, scalable, and economically accessible tool for the study of natural product biosynthesis and drug discovery. However, there remain important biological and practical problems that can complicate or obscure biosynthetic analysis in genomic and metagenomic sequencing projects. Here, we focus on limitations of available technology as well as computational and experimental strategies to overcome them. We review the unique challenges and approaches in the study of symbiotic and uncultured systems, as well as those associated with biosynthetic gene cluster (BGC) assembly and product prediction. Finally, to explore sequencing parameters that affect the recovery and contiguity of large and repetitive BGCs assembled de novo, we simulate Illumina and PacBio sequencing of the Salinispora tropica genome focusing on assembly of the salinilactam (slm) BGC. PMID:28587290
The genome sequence of taurine cattle: a window to ruminant biology and evolution.
Elsik, Christine G; Tellam, Ross L; Worley, Kim C; Gibbs, Richard A; Muzny, Donna M; Weinstock, George M; Adelson, David L; Eichler, Evan E; Elnitski, Laura; Guigó, Roderic; Hamernik, Debora L; Kappes, Steve M; Lewin, Harris A; Lynn, David J; Nicholas, Frank W; Reymond, Alexandre; Rijnkels, Monique; Skow, Loren C; Zdobnov, Evgeny M; Schook, Lawrence; Womack, James; Alioto, Tyler; Antonarakis, Stylianos E; Astashyn, Alex; Chapple, Charles E; Chen, Hsiu-Chuan; Chrast, Jacqueline; Câmara, Francisco; Ermolaeva, Olga; Henrichsen, Charlotte N; Hlavina, Wratko; Kapustin, Yuri; Kiryutin, Boris; Kitts, Paul; Kokocinski, Felix; Landrum, Melissa; Maglott, Donna; Pruitt, Kim; Sapojnikov, Victor; Searle, Stephen M; Solovyev, Victor; Souvorov, Alexandre; Ucla, Catherine; Wyss, Carine; Anzola, Juan M; Gerlach, Daniel; Elhaik, Eran; Graur, Dan; Reese, Justin T; Edgar, Robert C; McEwan, John C; Payne, Gemma M; Raison, Joy M; Junier, Thomas; Kriventseva, Evgenia V; Eyras, Eduardo; Plass, Mireya; Donthu, Ravikiran; Larkin, Denis M; Reecy, James; Yang, Mary Q; Chen, Lin; Cheng, Ze; Chitko-McKown, Carol G; Liu, George E; Matukumalli, Lakshmi K; Song, Jiuzhou; Zhu, Bin; Bradley, Daniel G; Brinkman, Fiona S L; Lau, Lilian P L; Whiteside, Matthew D; Walker, Angela; Wheeler, Thomas T; Casey, Theresa; German, J Bruce; Lemay, Danielle G; Maqbool, Nauman J; Molenaar, Adrian J; Seo, Seongwon; Stothard, Paul; Baldwin, Cynthia L; Baxter, Rebecca; Brinkmeyer-Langford, Candice L; Brown, Wendy C; Childers, Christopher P; Connelley, Timothy; Ellis, Shirley A; Fritz, Krista; Glass, Elizabeth J; Herzig, Carolyn T A; Iivanainen, Antti; Lahmers, Kevin K; Bennett, Anna K; Dickens, C Michael; Gilbert, James G R; Hagen, Darren E; Salih, Hanni; Aerts, Jan; Caetano, Alexandre R; Dalrymple, Brian; Garcia, Jose Fernando; Gill, Clare A; Hiendleder, Stefan G; Memili, Erdogan; Spurlock, Diane; Williams, John L; Alexander, Lee; Brownstein, Michael J; Guan, Leluo; Holt, Robert A; Jones, Steven J M; Marra, Marco A; 
Moore, Richard; Moore, Stephen S; Roberts, Andy; Taniguchi, Masaaki; Waterman, Richard C; Chacko, Joseph; Chandrabose, Mimi M; Cree, Andy; Dao, Marvin Diep; Dinh, Huyen H; Gabisi, Ramatu Ayiesha; Hines, Sandra; Hume, Jennifer; Jhangiani, Shalini N; Joshi, Vandita; Kovar, Christie L; Lewis, Lora R; Liu, Yih-Shin; Lopez, John; Morgan, Margaret B; Nguyen, Ngoc Bich; Okwuonu, Geoffrey O; Ruiz, San Juana; Santibanez, Jireh; Wright, Rita A; Buhay, Christian; Ding, Yan; Dugan-Rocha, Shannon; Herdandez, Judith; Holder, Michael; Sabo, Aniko; Egan, Amy; Goodell, Jason; Wilczek-Boney, Katarzyna; Fowler, Gerald R; Hitchens, Matthew Edward; Lozado, Ryan J; Moen, Charles; Steffen, David; Warren, James T; Zhang, Jingkun; Chiu, Readman; Schein, Jacqueline E; Durbin, K James; Havlak, Paul; Jiang, Huaiyang; Liu, Yue; Qin, Xiang; Ren, Yanru; Shen, Yufeng; Song, Henry; Bell, Stephanie Nicole; Davis, Clay; Johnson, Angela Jolivet; Lee, Sandra; Nazareth, Lynne V; Patel, Bella Mayurkumar; Pu, Ling-Ling; Vattathil, Selina; Williams, Rex Lee; Curry, Stacey; Hamilton, Cerissa; Sodergren, Erica; Wheeler, David A; Barris, Wes; Bennett, Gary L; Eggen, André; Green, Ronnie D; Harhay, Gregory P; Hobbs, Matthew; Jann, Oliver; Keele, John W; Kent, Matthew P; Lien, Sigbjørn; McKay, Stephanie D; McWilliam, Sean; Ratnakumar, Abhirami; Schnabel, Robert D; Smith, Timothy; Snelling, Warren M; Sonstegard, Tad S; Stone, Roger T; Sugimoto, Yoshikazu; Takasuga, Akiko; Taylor, Jeremy F; Van Tassell, Curtis P; Macneil, Michael D; Abatepaulo, Antonio R R; Abbey, Colette A; Ahola, Virpi; Almeida, Iassudara G; Amadio, Ariel F; Anatriello, Elen; Bahadue, Suria M; Biase, Fernando H; Boldt, Clayton R; Carroll, Jeffery A; Carvalho, Wanessa A; Cervelatti, Eliane P; Chacko, Elsa; Chapin, Jennifer E; Cheng, Ye; Choi, Jungwoo; Colley, Adam J; de Campos, Tatiana A; De Donato, Marcos; Santos, Isabel K F de Miranda; de Oliveira, Carlo J F; Deobald, Heather; Devinoy, Eve; Donohue, Kaitlin E; Dovc, Peter; Eberlein, Annett; 
Fitzsimmons, Carolyn J; Franzin, Alessandra M; Garcia, Gustavo R; Genini, Sem; Gladney, Cody J; Grant, Jason R; Greaser, Marion L; Green, Jonathan A; Hadsell, Darryl L; Hakimov, Hatam A; Halgren, Rob; Harrow, Jennifer L; Hart, Elizabeth A; Hastings, Nicola; Hernandez, Marta; Hu, Zhi-Liang; Ingham, Aaron; Iso-Touru, Terhi; Jamis, Catherine; Jensen, Kirsty; Kapetis, Dimos; Kerr, Tovah; Khalil, Sari S; Khatib, Hasan; Kolbehdari, Davood; Kumar, Charu G; Kumar, Dinesh; Leach, Richard; Lee, Justin C-M; Li, Changxi; Logan, Krystin M; Malinverni, Roberto; Marques, Elisa; Martin, William F; Martins, Natalia F; Maruyama, Sandra R; Mazza, Raffaele; McLean, Kim L; Medrano, Juan F; Moreno, Barbara T; Moré, Daniela D; Muntean, Carl T; Nandakumar, Hari P; Nogueira, Marcelo F G; Olsaker, Ingrid; Pant, Sameer D; Panzitta, Francesca; Pastor, Rosemeire C P; Poli, Mario A; Poslusny, Nathan; Rachagani, Satyanarayana; Ranganathan, Shoba; Razpet, Andrej; Riggs, Penny K; Rincon, Gonzalo; Rodriguez-Osorio, Nelida; Rodriguez-Zas, Sandra L; Romero, Natasha E; Rosenwald, Anne; Sando, Lillian; Schmutz, Sheila M; Shen, Libing; Sherman, Laura; Southey, Bruce R; Lutzow, Ylva Strandberg; Sweedler, Jonathan V; Tammen, Imke; Telugu, Bhanu Prakash V L; Urbanski, Jennifer M; Utsunomiya, Yuri T; Verschoor, Chris P; Waardenberg, Ashley J; Wang, Zhiquan; Ward, Robert; Weikard, Rosemarie; Welsh, Thomas H; White, Stephen N; Wilming, Laurens G; Wunderlich, Kris R; Yang, Jianqi; Zhao, Feng-Qi
2009-04-24
To understand the biology and evolution of ruminants, the cattle genome was sequenced to about sevenfold coverage. The cattle genome contains a minimum of 22,000 genes, with a core set of 14,345 orthologs shared among seven mammalian species of which 1217 are absent or undetected in noneutherian (marsupial or monotreme) genomes. Cattle-specific evolutionary breakpoint regions in chromosomes have a higher density of segmental duplications, enrichment of repetitive elements, and species-specific variations in genes associated with lactation and immune responsiveness. Genes involved in metabolism are generally highly conserved, although five metabolic genes are deleted or extensively diverged from their human orthologs. The cattle genome sequence thus provides a resource for understanding mammalian evolution and accelerating livestock genetic improvement for milk and meat production.
Chen, Zhao-xue; Huang, Yun-kun; Sun, Ying
2014-02-01
By associating the geometric arrangement of the nine Loshu numbers modulo 5, and by investigating the properties of golden rectangles and the characteristics of the Fibonacci sequence modulo 10 together with the two subsequences of its modular sequence modulo 5, this paper constructs the Loshu-Fibonacci Diagram by strict logical deduction. The diagram discloses the inherent relationship among the Taiji sign, Loshu and the Fibonacci sequence modulo 10, and unites key ideas from Chinese medicine such as holism, symmetry, holographic thought and the pursuit of yin-yang balance. Based on further analysis and reasoning, the authors find that, taking the golden ratio and the Loshu-Fibonacci Diagram as a link, there is a profound and universal association between research in Chinese medicine and modern biology.
Information resources at the National Center for Biotechnology Information.
Woodsmall, R M; Benson, D A
1993-01-01
The National Center for Biotechnology Information (NCBI), part of the National Library of Medicine, was established in 1988 to perform basic research in the field of computational molecular biology as well as build and distribute molecular biology databases. The basic research has led to new algorithms and analysis tools for interpreting genomic data and has been instrumental in the discovery of human disease genes for neurofibromatosis and Kallmann syndrome. The principal database responsibility is the National Institutes of Health (NIH) genetic sequence database, GenBank. NCBI, in collaboration with international partners, builds, distributes, and provides online and CD-ROM access to over 112,000 DNA sequences. Another major program is the integration of multiple sequences databases and related bibliographic information and the development of network-based retrieval systems for Internet access. PMID:8374583
Single-molecule protein sequencing through fingerprinting: computational assessment
NASA Astrophysics Data System (ADS)
Yao, Yao; Docter, Margreet; van Ginkel, Jetty; de Ridder, Dick; Joo, Chirlmin
2015-10-01
Proteins are vital in all biological systems as they constitute the main structural and functional components of cells. Recent advances in mass spectrometry have brought the promise of complete proteomics by helping draft the human proteome. Yet, this commonly used protein sequencing technique has fundamental limitations in sensitivity. Here we propose a method for single-molecule (SM) protein sequencing. A major challenge lies in the fact that proteins are composed of 20 different amino acids, which demands 20 molecular reporters. We computationally demonstrate that it suffices to measure only two types of amino acids to identify proteins and suggest an experimental scheme using SM fluorescence. When achieved, this highly sensitive approach will result in a paradigm shift in proteomics, with major impact in the biological and medical sciences.
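The fingerprinting idea in this abstract can be illustrated with a short sketch. It rests on assumptions that are not in the abstract: the two labelled residue types are taken to be cysteine (C) and lysine (K), the readout is idealised as the error-free order of those residues, and the three toy sequences are made up. The function names are hypothetical.

```python
from collections import defaultdict


def fingerprint(protein, labelled=("C", "K")):
    """Keep only the labelled residues, in order (idealised, noise-free readout)."""
    return "".join(aa for aa in protein if aa in labelled)


def identifiable(proteins, labelled=("C", "K")):
    """Names of proteins whose fingerprint is unique within the given set."""
    groups = defaultdict(list)
    for name, seq in proteins.items():
        groups[fingerprint(seq, labelled)].append(name)
    return sorted(names[0] for names in groups.values() if len(names) == 1)


if __name__ == "__main__":
    # Made-up toy sequences, for illustration only.
    toy = {"p1": "MKCAK", "p2": "MCKAK", "p3": "AKCGK"}
    for name, seq in toy.items():
        print(name, fingerprint(seq))
    print("uniquely identified:", identifiable(toy))
```

Here p1 and p3 share the same reduced sequence and cannot be told apart, while p2 can. On a real proteome the same grouping step would show what fraction of proteins a two-residue readout can resolve.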
Doss, C George Priya; Chakrabarty, Chiranjib; Debajyoti, C; Debottam, S
2014-11-01
Open questions about the recruitment pathways, cell cycle regulation mechanisms, spindle checkpoint assembly, and chromosome segregation processes of centromere proteins are a centre of attention in cancer research. With established databases, a range of computational platforms now makes it possible to examine almost all physiological and biochemical evidence in disease-associated phenotypes. Using existing computational methods, we have utilized the amino acid residues to understand the similarity within the evolutionary variance of different associated centromere proteins. This study of sequence similarity, protein-protein networking, co-expression analysis, and evolutionary trajectory of centromere proteins will speed up the understanding of centromere biology and will create a road map for researchers who are beginning clinical sequencing work using centromere proteins.
Complexity and Entropy Analysis of DNMT1 Gene
USDA-ARS's Scientific Manuscript database
Background: The application of complexity information on DNA sequence and protein in biological processes is well established in this study. Available sequences for the DNMT1 gene, a maintenance methyltransferase responsible for copying DNA methylation patterns to the daughter strands durin...
Adverse Outcome Pathway (AOP) Network Development for Fatty Liver
Adverse outcome pathways (AOPs) are descriptive biological sequences that start from a molecular initiating event (MIE) and end with an adverse health outcome. AOPs provide biological context for high throughput chemical testing and further prioritize environmental health risk re...
Parikh, Rohan C; Du, Xianglin L; Robert, Morgan O; Lairson, David R
2017-01-01
Treatment patterns for metastatic colorectal cancer (mCRC) patients have changed considerably over the last decade with the introduction of new chemotherapies and targeted biologics. These treatments are often administered in various sequences with limited evidence regarding their cost-effectiveness. To conduct a pharmacoeconomic evaluation of commonly administered treatment sequences among elderly mCRC patients. A probabilistic discrete event simulation model assuming Weibull distribution was developed to evaluate the cost-effectiveness of the following common treatment sequences: (a) first-line oxaliplatin/irinotecan followed by second-line oxaliplatin/irinotecan + bevacizumab (OI-OIB); (b) first-line oxaliplatin/irinotecan + bevacizumab followed by second-line oxaliplatin/irinotecan + bevacizumab (OIB-OIB); (c) OI-OIB followed by a third-line targeted biologic (OI-OIB-TB); and (d) OIB-OIB followed by a third-line targeted biologic (OIB-OIB-TB). Input parameters for the model were primarily obtained from the Surveillance, Epidemiology, and End Results-Medicare linked dataset for incident mCRC patients aged 65 years and older diagnosed from January 2004 through December 2009. A probabilistic sensitivity analysis was performed to account for parameter uncertainty. Costs (2014 U.S. dollars) and effectiveness were discounted at an annual rate of 3%. In the base case analyses, at the willingness-to-pay (WTP) threshold of $100,000/quality-adjusted life-year (QALY) gained, the treatment sequence OIB-OIB (vs. OI-OIB) was not cost-effective with an incremental cost-effectiveness ratio (ICER) per patient of $119,007/QALY; OI-OIB-TB (vs. OIB-OIB) was dominated; and OIB-OIB-TB (vs. OIB-OIB) was not cost-effective with an ICER of $405,857/QALY. Results similar to the base case analysis were obtained assuming log-normal distribution. 
Cost-effectiveness acceptability curves derived from a probabilistic sensitivity analysis showed that at a WTP of $100,000/QALY gained, sequence OI-OIB was 34% cost-effective, followed by OIB-OIB (31%), OI-OIB-TB (20%), and OIB-OIB-TB (15%). Overall, survival increases marginally with the addition of targeted biologics, such as bevacizumab, at first line and third line at substantial costs. Treatment sequences with bevacizumab at first line and targeted biologics at third line may not be cost-effective at the commonly used threshold of $100,000/QALY gained, but a marginal decrease in the cost of bevacizumab may make treatment sequences with first-line bevacizumab cost-effective. Future economic evaluations should validate the study results using parameters from ongoing clinical trials. This study was supported in part by a grant from the Agency for Healthcare Research and Quality (R01-HS018956) and in part by a grant from the Cancer Prevention and Research Institute of Texas (RP130051), which were obtained by Du. The authors report no conflicts of interest. Study concept and design were primarily contributed by Parikh, along with the other authors. All authors participated in data collection, and Parikh took the lead in data interpretation and analysis, along with Lairson and Morgan, with assistance from Du. The manuscript was written primarily by Parikh, along with Lairson, Morgan, and Du, and revised by Parikh.
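The cost-effectiveness comparison reported in this abstract rests on two small calculations: discounting yearly costs and effects, and dividing the cost difference by the QALY difference. The sketch below illustrates both. It assumes discounting starts in year 0 and uses hypothetical function names and made-up inputs; only the 3% rate, the $100,000/QALY threshold and the two reported ICER values come from the abstract.

```python
def present_value(yearly_values, rate=0.03):
    """Discounted sum of yearly costs or QALYs (assumption: year 0 is undiscounted)."""
    return sum(v / (1 + rate) ** t for t, v in enumerate(yearly_values))


def icer(cost_a, qaly_a, cost_b, qaly_b):
    """Incremental cost-effectiveness ratio of strategy A versus comparator B.

    Returns None when A is dominated (costs at least as much and is no more
    effective). This sketch only covers the case where A is more effective.
    """
    d_cost = cost_a - cost_b
    d_qaly = qaly_a - qaly_b
    if d_cost >= 0 and d_qaly <= 0:
        return None
    if d_qaly <= 0:
        raise ValueError("A is cheaper but not more effective; not handled in this sketch")
    return d_cost / d_qaly


def cost_effective(icer_value, wtp=100_000):
    """True if the ICER is at or below the willingness-to-pay threshold."""
    return icer_value is not None and icer_value <= wtp


if __name__ == "__main__":
    # Made-up inputs, for illustration only.
    print(icer(150_000, 2.0, 100_000, 1.5))
    # ICER values reported in the abstract, checked against the $100,000 threshold.
    print(cost_effective(119_007), cost_effective(405_857))
```

Both reported ratios exceed the threshold, which matches the abstract's conclusion that those sequences were not cost-effective in the base case.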
Casimiro, Ana C; Vinga, Susana; Freitas, Ana T; Oliveira, Arlindo L
2008-02-07
Motif finding algorithms have developed in their ability to use computationally efficient methods to detect patterns in biological sequences. However, the posterior classification of the output still suffers from some limitations, which makes it difficult to assess the biological significance of the motifs found. Previous work has highlighted the existence of positional bias of motifs in DNA sequences, which might not only indicate that the pattern is important, but also provide hints of the positions where these patterns occur preferentially. We propose to integrate position uniformity tests and over-representation tests to improve the accuracy of the classification of motifs. Using artificial data, we have compared three different statistical tests (Chi-Square, Kolmogorov-Smirnov and a Chi-Square bootstrap) to assess whether a given motif occurs uniformly in the promoter region of a gene. Using the test that performed best on this dataset, we proceeded to study the positional distribution of several well known cis-regulatory elements in the promoter sequences of different organisms (S. cerevisiae, H. sapiens, D. melanogaster, E. coli and several dicotyledonous plants). The results show that position conservation is relevant for the transcriptional machinery. We conclude that many biologically relevant motifs appear heterogeneously distributed in the promoter region of genes, and therefore that non-uniformity is a good indicator of biological relevance and can be used to complement the commonly used over-representation tests. In this article we present the results obtained for the S. cerevisiae data sets.
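A position uniformity test of the kind named in this abstract can be sketched with a plain Pearson chi-square statistic. This is an illustrative sketch, not the authors' procedure: the binning scheme, the number of bins, the function name and the toy positions are all assumptions. The critical value 16.919 is the standard chi-square cut-off for 9 degrees of freedom at the 0.05 level.

```python
def chi_square_uniformity(positions, region_length, n_bins=10):
    """Pearson chi-square statistic for motif start positions (0-based)
    against a uniform distribution over the promoter region.

    Assumptions: region_length is a multiple of n_bins so bins are equal,
    and there are enough occurrences for each expected count to be >= 5.
    """
    observed = [0] * n_bins
    for p in positions:
        observed[min(p * n_bins // region_length, n_bins - 1)] += 1
    expected = len(positions) / n_bins
    return sum((o - expected) ** 2 / expected for o in observed)


# Chi-square critical value for n_bins - 1 = 9 degrees of freedom, alpha = 0.05.
CRITICAL_9DF_005 = 16.919

if __name__ == "__main__":
    # Made-up positions in a 1000 bp promoter, for illustration only.
    spread = list(range(0, 1000, 10))  # evenly spread occurrences
    clustered = [950] * 100            # all occurrences in one bin
    for label, pos in (("spread", spread), ("clustered", clustered)):
        stat = chi_square_uniformity(pos, 1000)
        print(label, stat, "non-uniform" if stat > CRITICAL_9DF_005 else "uniform")
```

A statistic above the critical value rejects uniformity, which in the abstract's argument is taken as a hint of biological relevance.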
Gomulski, Ludvik M; Dimopoulos, George; Xi, Zhiyong; Soares, Marcelo B; Bonaldo, Maria F; Malacrida, Anna R; Gasperi, Giuliano
2008-01-01
Background The medfly, Ceratitis capitata, is a highly invasive agricultural pest that has become a model insect for the development of biological control programs. Despite research into the behavior and classical and population genetics of this organism, the quantity of sequence data available is limited. We have utilized an expressed sequence tag (EST) approach to obtain detailed information on transcriptome signatures that relate to a variety of physiological systems in the medfly; this information emphasizes reproduction, sex determination, and chemosensory perception, since the study was based on normalized cDNA libraries from embryos and adult heads. Results A total of 21,253 high-quality ESTs were obtained from the embryo and head libraries. Clustering analyses performed separately for each library resulted in 5201 embryo and 6684 head transcripts. Considering an estimated 19% overlap in the transcriptomes of the two libraries, they represent about 9614 unique transcripts involved in a wide range of biological processes and molecular functions. Of particular interest are the sequences that share homology with Drosophila genes involved in sex determination, olfaction, and reproductive behavior. The medfly transformer2 (tra2) homolog was identified among the embryonic sequences, and its genomic organization and expression were characterized. Conclusion The sequences obtained in this study represent the first major dataset of expressed genes in a tephritid species of agricultural importance. This resource provides essential information to support the investigation of numerous questions regarding the biology of the medfly and other related species and also constitutes an invaluable tool for the annotation of complete genome sequences. Our study has revealed intriguing findings regarding the transcript regulation of tra2 and other sex determination genes, as well as insights into the comparative genomics of genes implicated in chemosensory reception and reproduction. PMID:18500975
Variation in Symbiodinium ITS2 Sequence Assemblages among Coral Colonies
Stat, Michael; Bird, Christopher E.; Pochon, Xavier; Chasqui, Luis; Chauka, Leonard J.; Concepcion, Gregory T.; Logan, Dan; Takabayashi, Misaki; Toonen, Robert J.; Gates, Ruth D.
2011-01-01
Endosymbiotic dinoflagellates in the genus Symbiodinium are fundamentally important to the biology of scleractinian corals, as well as to a variety of other marine organisms. The genus Symbiodinium is genetically and functionally diverse and the taxonomic nature of the union between Symbiodinium and corals is implicated as a key trait determining the environmental tolerance of the symbiosis. Surprisingly, the question of how Symbiodinium diversity partitions within a species across spatial scales of meters to kilometers has received little attention, but is important to understanding the intrinsic biological scope of a given coral population and adaptations to the local environment. Here we address this gap by describing the Symbiodinium ITS2 sequence assemblages recovered from colonies of the reef building coral Montipora capitata sampled across Kāne'ohe Bay, Hawai'i. A total of 52 corals were sampled in a nested design of Coral Colony(Site(Region)) reflecting spatial scales of meters to kilometers. A diversity of Symbiodinium ITS2 sequences was recovered with the majority of variance partitioning at the level of the Coral Colony. To confirm this result, the Symbiodinium ITS2 sequence diversity in six M. capitata colonies was analyzed in much greater depth with 35 to 55 clones per colony. The ITS2 sequences and quantitative composition recovered from these colonies varied significantly, indicating that each coral hosted a different assemblage of Symbiodinium. The diversity of Symbiodinium ITS2 sequence assemblages retrieved from individual colonies of M. capitata here highlights the problems inherent in interpreting multi-copy and intra-genomically variable molecular markers, and serves as a context for discussing the utility and biological relevance of assigning species names based on Symbiodinium ITS2 genotyping. PMID:21246044
ERIC Educational Resources Information Center
Smedley, Audrey; Smedley, Brian D.
2005-01-01
Racialized science seeks to explain human population differences in health, intelligence, education, and wealth as the consequence of immutable, biologically based differences between "racial" groups. Recent advances in the sequencing of the human genome and in an understanding of biological correlates of behavior have fueled racialized science,…
A practical guide to single-cell RNA-sequencing for biomedical research and clinical applications.
Haque, Ashraful; Engel, Jessica; Teichmann, Sarah A; Lönnberg, Tapio
2017-08-18
RNA sequencing (RNA-seq) is a genomic approach for the detection and quantitative analysis of messenger RNA molecules in a biological sample and is useful for studying cellular responses. RNA-seq has fueled much discovery and innovation in medicine over recent years. For practical reasons, the technique is usually conducted on samples comprising thousands to millions of cells. However, this has hindered direct assessment of the fundamental unit of biology: the cell. Since the first single-cell RNA-sequencing (scRNA-seq) study was published in 2009, many more have been conducted, mostly by specialist laboratories with unique skills in wet-lab single-cell genomics, bioinformatics, and computation. However, with the increasing commercial availability of scRNA-seq platforms, and the rapid ongoing maturation of bioinformatics approaches, a point has been reached where any biomedical researcher or clinician can use scRNA-seq to make exciting discoveries. In this review, we present a practical guide to help researchers design their first scRNA-seq studies, including introductory information on experimental hardware, protocol choice, quality control, data analysis and biological interpretation.
A Predictive Approach to Network Reverse-Engineering
NASA Astrophysics Data System (ADS)
Wiggins, Chris
2005-03-01
A central challenge of systems biology is the "reverse engineering" of transcriptional networks: inferring which genes exert regulatory control over which other genes. Attempting such inference at the genomic scale has only recently become feasible, via data-intensive biological innovations such as DNA microarrays ("DNA chips") and the sequencing of whole genomes. In this talk we present a predictive approach to network reverse-engineering, in which we integrate DNA chip data and sequence data to build a model of the transcriptional network of the yeast S. cerevisiae capable of predicting the response of genes in unseen experiments. The technique can also be used to extract "motifs," sequence elements that act as binding sites for regulatory proteins. We validate by a number of approaches and present a comparison of theoretical prediction vs. experimental data, along with biological interpretations of the resulting model. En route, we will illustrate some basic notions in statistical learning theory (fitting vs. over-fitting; cross-validation; assessing statistical significance), highlighting ways in which physicists can make a unique contribution in data-driven approaches to reverse engineering.
Literature classification for semi-automated updating of biological knowledgebases
2013-01-01
Background As the output of biological assays increases in resolution and volume, the body of specialized biological data, such as functional annotations of gene and protein sequences, enables extraction of higher-level knowledge needed for practical application in bioinformatics. Whereas common types of biological data, such as sequence data, are extensively stored in biological databases, functional annotations, such as immunological epitopes, are found primarily in semi-structured formats or free text embedded in primary scientific literature. Results We defined and applied a machine learning approach for literature classification to support updating of TANTIGEN, a knowledgebase of tumor T-cell antigens. Abstracts from PubMed were downloaded and classified as either "relevant" or "irrelevant" for database update. Training and five-fold cross-validation of a k-NN classifier on 310 abstracts yielded classification accuracy of 0.95, thus showing significant value in support of data extraction from the literature. Conclusion We here propose a conceptual framework for semi-automated extraction of epitope data embedded in scientific literature using principles from text mining and machine learning. The addition of such data will aid in the transition of biological databases to knowledgebases. PMID:24564403
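The k-NN classification with five-fold cross-validation mentioned in this abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions: abstracts are reduced to token sets, similarity is Jaccard overlap, and the ten toy "abstracts" are invented. The abstract does not describe the authors' actual features or distance measure, and the toy data are not meant to reproduce the reported accuracy of 0.95.

```python
from collections import Counter


def jaccard(a, b):
    """Jaccard similarity between two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def knn_predict(train, query_tokens, k=3):
    """Majority label among the k training items most similar to the query."""
    neighbours = sorted(train, key=lambda item: jaccard(item[0], query_tokens),
                        reverse=True)[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


def cross_validate(data, n_folds=5, k=3):
    """Accuracy over n_folds folds; every n_folds-th item forms one test fold."""
    correct = 0
    for fold in range(n_folds):
        test = data[fold::n_folds]
        train = [item for i, item in enumerate(data) if i % n_folds != fold]
        correct += sum(knn_predict(train, tokens, k) == label for tokens, label in test)
    return correct / len(data)


# Invented toy corpus: token sets standing in for PubMed abstracts.
relevant = [{"tumor", "antigen", "epitope"}, {"tumor", "t-cell", "epitope"},
            {"antigen", "t-cell", "peptide"}, {"tumor", "antigen", "peptide"},
            {"epitope", "t-cell", "antigen"}]
irrelevant = [{"genome", "assembly", "soil"}, {"climate", "soil", "model"},
              {"genome", "climate", "assembly"}, {"soil", "model", "genome"},
              {"assembly", "model", "climate"}]
data = []
for rel, irr in zip(relevant, irrelevant):
    data.append((rel, "relevant"))
    data.append((irr, "irrelevant"))

accuracy = cross_validate(data, n_folds=5, k=3)
print(accuracy)
```

Holding out each fold in turn means every abstract is classified by a model that never saw it during training, which is what makes the reported accuracy an estimate of performance on new literature.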
Towards programming languages for genetic engineering of living cells
Pedersen, Michael; Phillips, Andrew
2009-01-01
Synthetic biology aims at producing novel biological systems to carry out some desired and well-defined functions. An ultimate dream is to design these systems at a high level of abstraction using engineering-based tools and programming languages, press a button, and have the design translated to DNA sequences that can be synthesized and put to work in living cells. We introduce such a programming language, which allows logical interactions between potentially undetermined proteins and genes to be expressed in a modular manner. Programs can be translated by a compiler into sequences of standard biological parts, a process that relies on logic programming and prototype databases that contain known biological parts and protein interactions. Programs can also be translated to reactions, allowing simulations to be carried out. While current limitations on available data prevent full use of the language in practical applications, the language can be used to develop formal models of synthetic systems, which are otherwise often presented by informal notations. The language can also serve as a concrete proposal on which future language designs can be discussed, and can help to guide the emerging standard of biological parts which so far has focused on biological, rather than logical, properties of parts. PMID:19369220
Composite Structural Motifs of Binding Sites for Delineating Biological Functions of Proteins
Kinjo, Akira R.; Nakamura, Haruki
2012-01-01
Most biological processes are described as a series of interactions between proteins and other molecules, and interactions are in turn described in terms of atomic structures. To annotate protein functions as sets of interaction states at atomic resolution, and thereby to better understand the relation between protein interactions and biological functions, we conducted exhaustive all-against-all atomic structure comparisons of all known binding sites for ligands including small molecules, proteins and nucleic acids, and identified recurring elementary motifs. By integrating the elementary motifs associated with each subunit, we defined composite motifs that represent context-dependent combinations of elementary motifs. It is demonstrated that function similarity can be better inferred from composite motif similarity compared to the similarity of protein sequences or of individual binding sites. By integrating the composite motifs associated with each protein function, we define meta-composite motifs each of which is regarded as a time-independent diagrammatic representation of a biological process. It is shown that meta-composite motifs provide richer annotations of biological processes than sequence clusters. The present results serve as a basis for bridging atomic structures to higher-order biological phenomena by classification and integration of binding site structures. PMID:22347478
Datasets for evolutionary comparative genomics
Liberles, David A
2005-01-01
Many decisions about genome sequencing projects are directed by perceived gaps in the tree of life, or towards model organisms. With the goal of a better understanding of biology through the lens of evolution, however, there are additional genomes that are worth sequencing. One such rationale for whole-genome sequencing is discussed here, along with other important strategies for understanding the phenotypic divergence of species. PMID:16086856
2013-06-28
of cuts that each fragment should be cut into so the fragments are no greater than a specific length threshold. Additionally, vector sequences and...restriction sites are attached to each fragment while ensuring the restriction sites are unique to each sequence. The vector sequences serve as hooks...for assembly into vector for cloning purposes, and also as primer binding domains for PCR amplification. The restriction sites are added to
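The fragment-cutting step described in this excerpt can be illustrated with a short sketch. It assumes that a sequence is split into the smallest number of near-equal pieces that each respect the length threshold; the excerpt is truncated and does not state how the cut positions are actually chosen, and the function name and toy sequence are hypothetical. Attaching vector sequences and restriction sites is not modeled here.

```python
import math


def split_into_fragments(sequence, max_len):
    """Split a sequence into the fewest near-equal pieces no longer than max_len.

    Hypothetical helper: an even split is an assumption, not the source's rule.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    n_pieces = max(1, math.ceil(len(sequence) / max_len))
    base, extra = divmod(len(sequence), n_pieces)
    fragments, start = [], 0
    for i in range(n_pieces):
        size = base + (1 if i < extra else 0)  # spread the remainder over the first pieces
        fragments.append(sequence[start:start + size])
        start += size
    return fragments


toy_sequence = "ACGT" * 6 + "A"  # 25 nt, invented
fragments = split_into_fragments(toy_sequence, max_len=10)
n_cuts = len(fragments) - 1
print([len(f) for f in fragments], n_cuts)
```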
Complete genome sequence of Bifidobacterium breve CECT 7263, a strain isolated from human milk.
Jiménez, Esther; Villar-Tajadura, M Antonia; Marín, María; Fontecha, Javier; Requena, Teresa; Arroyo, Rebeca; Fernández, Leónides; Rodríguez, Juan M
2012-07-01
Bifidobacterium breve is an actinobacterium frequently isolated from colonic microbiota of breastfeeding babies. Here, we report the complete and annotated genome sequence of a B. breve strain isolated from human milk, B. breve CECT 7263. The genome sequence will provide new insights into the biology of this potential probiotic organism and will allow the characterization of genes related to beneficial properties.
Meyers, Steven R; Khoo, Xiaojuan; Huang, Xin; Walsh, Elisabeth B; Grinstaff, Mark W; Kenan, Daniel J
2009-01-01
Biomaterials used in implants have traditionally been selected based on their mechanical properties, chemical stability, and biocompatibility. However, the durability and clinical efficacy of implantable biomedical devices remain limited in part due to the absence of appropriate biological interactions at the implant interface and the lack of integration into adjacent tissues. Herein, we describe a robust peptide-based coating technology capable of modifying the surface of existing biomaterials and medical devices through the non-covalent binding of modular biofunctional peptides. These peptides contain at least one material binding sequence and at least one biologically active sequence and thus are termed, "Interfacial Biomaterials" (IFBMs). IFBMs can simultaneously bind the biomaterial surface while endowing it with desired biological functionalities at the interface between the material and biological realms. We demonstrate the capabilities of model IFBMs to convert native polystyrene, a bioinert surface, into a bioactive surface that can support a range of cell activities. We further distinguish between simple cell attachment with insufficient integrin interactions, which in some cases can adversely impact downstream biology, versus biologically appropriate adhesion, cell spreading, and cell survival mediated by IFBMs. Moreover, we show that we can use the coating technology to create spatially resolved patterns of fluorophores and cells on substrates and that these patterns retain their borders in culture.
Novel Immune Modulating Cellular Vaccine for Prostate Cancer
2014-10-01
restriction sites. Murine PSMA: The cDNA encoding mPSMA was purchased from Sino Biologicals and was cloned into the HindIII and BamHI sites of pSP73-Sph/A64...sequence) and reverse primer 5’-TATATAGAGCTCTCAGATGTTCCGATACACATCTC-3’ Murine PSMA no signal sequence (mPSMA-SS): Murine PSMA minus the signal sequence...contains a HindIII site for cloning and utilizes an ATG that lies downstream of the signal sequence as the start codon in PSMA-SS (PSMA without signal
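When working with cloning primers like the one quoted in this excerpt, a quick scan for restriction enzyme recognition sequences is a common sanity check. The sketch below is only an illustration: the recognition sequences for HindIII (AAGCTT), BamHI (GGATCC) and SacI (GAGCTC) are standard, but including SacI is an assumption of this example since the excerpt does not name it, and the scan says nothing about how the authors actually used the primer.

```python
# Standard recognition sequences; SacI is added for illustration only.
RECOGNITION_SITES = {"HindIII": "AAGCTT", "BamHI": "GGATCC", "SacI": "GAGCTC"}


def find_sites(seq, sites=RECOGNITION_SITES):
    """Return 0-based start positions of each recognition sequence in seq."""
    seq = seq.upper()
    hits = {}
    for name, motif in sites.items():
        positions = []
        start = seq.find(motif)
        while start != -1:
            positions.append(start)
            start = seq.find(motif, start + 1)
        hits[name] = positions
    return hits


reverse_primer = "TATATAGAGCTCTCAGATGTTCCGATACACATCTC"  # as quoted in the excerpt
hits = find_sites(reverse_primer)
print(hits)
```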
Plasmodium vivax Biology: Insights Provided by Genomics, Transcriptomics and Proteomics
Bourgard, Catarina; Albrecht, Letusa; Kayano, Ana C. A. V.; Sunnerhagen, Per; Costa, Fabio T. M.
2018-01-01
During the last decade, the vast omics field has revolutionized biological research, especially the genomics, transcriptomics and proteomics branches, as technological tools become available to the field researcher and allow difficult question-driven studies to be addressed. Parasitology has greatly benefited from next generation sequencing (NGS) projects, which have resulted in a broadened comprehension of basic parasite molecular biology, ecology and epidemiology. Malariology is one example where application of this technology has greatly contributed to a better understanding of Plasmodium spp. biology and host-parasite interactions. Among the several parasite species that cause human malaria, the neglected Plasmodium vivax presents great research challenges, as in vitro culturing is not yet feasible and functional assays are heavily limited. Therefore, there are gaps in our P. vivax biology knowledge that affect decisions for control policies aiming to eradicate vivax malaria in the near future. In this review, we provide a snapshot of key discoveries already achieved in P. vivax sequencing projects, focusing on developments, hurdles, and limitations currently faced by the research community, as well as perspectives on future vivax malaria research. PMID:29473024
Although recent technological advances in DNA sequencing and computational biology now allow scientists to compare entire microbial genomes, comparisons of closely related bacterial species and individual isolates by whole-genome sequencing approaches remains prohibitively expens...
Mapping and Sequencing the Human Genome
DOE R&D Accomplishments Database
1988-01-01
Numerous meetings have been held and a debate has developed in the biological community over the merits of mapping and sequencing the human genome. In response, a committee to examine the desirability and feasibility of mapping and sequencing the human genome was formed to suggest options for implementing the project. The committee asked many questions. Should the analysis of the human genome be left entirely to the traditionally uncoordinated, but highly successful, support systems that fund the vast majority of biomedical research? Or should a more focused and coordinated additional support system be developed that is limited to encouraging and facilitating the mapping and eventual sequencing of the human genome? If so, how can this be done without distorting the broader goals of biological research that are crucial for any understanding of the data generated in such a human genome project? As the committee became better informed on the many relevant issues, the opinions of its members coalesced, producing a shared consensus of what should be done. This report reflects that consensus.
2014-01-01
Ferns are the only major lineage of vascular plants not represented by a sequenced nuclear genome. This lack of genome sequence information significantly impedes our ability to understand and reconstruct genome evolution not only in ferns, but across all land plants. Azolla and Ceratopteris are ideal and complementary candidates to be the first ferns to have their nuclear genomes sequenced. They differ dramatically in genome size, life history, and habit, and thus represent the immense diversity of extant ferns. Together, this pair of genomes will facilitate myriad large-scale comparative analyses across ferns and all land plants. Here we review the unique biological characteristics of ferns and describe a number of outstanding questions in plant biology that will benefit from the addition of ferns to the set of taxa with sequenced nuclear genomes. We explain why the fern clade is pivotal for understanding genome evolution across land plants, and we provide a rationale for how knowledge of fern genomes will enable progress in research beyond the ferns themselves. PMID:25324969
Sequence co-evolution gives 3D contacts and structures of protein complexes
Hopf, Thomas A; Schärfe, Charlotta P I; Rodrigues, João P G L M; Green, Anna G; Kohlbacher, Oliver; Sander, Chris; Bonvin, Alexandre M J J; Marks, Debora S
2014-01-01
Protein–protein interactions are fundamental to many biological processes. Experimental screens have identified tens of thousands of interactions, and structural biology has provided detailed functional insight for select 3D protein complexes. An alternative rich source of information about protein interactions is the evolutionary sequence record. Building on earlier work, we show that analysis of correlated evolutionary sequence changes across proteins identifies residues that are close in space with sufficient accuracy to determine the three-dimensional structure of the protein complexes. We evaluate prediction performance in blinded tests on 76 complexes of known 3D structure, predict protein–protein contacts in 32 complexes of unknown structure, and demonstrate how evolutionary couplings can be used to distinguish between interacting and non-interacting protein pairs in a large complex. With the current growth of sequences, we expect that the method can be generalized to genome-wide elucidation of protein–protein interaction networks and used for interaction predictions at residue resolution. DOI: http://dx.doi.org/10.7554/eLife.03430.001 PMID:25255213
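To make the idea of correlated sequence changes across two proteins concrete, the sketch below scores a pair of alignment columns with mutual information. This is a deliberately simpler, swapped-in measure chosen for illustration; it is not the statistical model used by the authors, which the abstract does not specify. The toy columns (one residue per species, from a hypothetical paired alignment) are invented.

```python
import math
from collections import Counter


def mutual_information(col_a, col_b):
    """Mutual information (bits) between two equally long alignment columns."""
    if len(col_a) != len(col_b) or not col_a:
        raise ValueError("columns must be non-empty and of equal length")
    n = len(col_a)
    freq_a, freq_b = Counter(col_a), Counter(col_b)
    freq_ab = Counter(zip(col_a, col_b))
    return sum(
        (count / n) * math.log2((count / n) / ((freq_a[a] / n) * (freq_b[b] / n)))
        for (a, b), count in freq_ab.items()
    )


# Invented columns: one position in protein A, two positions in protein B,
# each listing the residue observed in four species.
position_in_a = "KKEE"
covarying_in_b = "DDRR"    # changes in step with position_in_a
independent_in_b = "DRDR"  # changes independently of position_in_a

mi_covarying = mutual_information(position_in_a, covarying_in_b)
mi_independent = mutual_information(position_in_a, independent_in_b)
print(mi_covarying, mi_independent)
```

A high score for an inter-protein column pair is the kind of signal that could hint at residues in contact, although raw mutual information is known to be confounded by phylogeny and indirect correlations.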
NASA Astrophysics Data System (ADS)
Beeman-Cadwallader, Nicole
In 2007 Pioneer High School, a public school in Whittier, California, changed the sequence of its science courses from the traditional Biology-Chemistry-Physics (B-C-P) to Biology-Physics-Chemistry (B-P-C), or "Physics Second." The California Standards Tests (CSTs) scores in Physics and Chemistry from 2004-2012 were used to determine if there were any effects of the Physics Second sequencing on student achievement in those courses. The data was also used to determine whether the Physics Second sequence had an effect on performance in Physics and Chemistry based on gender. Independent t tests and chi-square analysis of the data determined an improvement in student performance in Chemistry but not Physics. The 2x2 Factorial ANOVA analysis revealed that in Physics male students performed better on the CSTs than their female peers. In Chemistry, it was noted that male and female students performed equally well. Neither finding was a result of the change to the "Physics Second" sequencing.
RaptorX server: a resource for template-based protein structure modeling.
Källberg, Morten; Margaryan, Gohar; Wang, Sheng; Ma, Jianzhu; Xu, Jinbo
2014-01-01
Assigning functional properties to a newly discovered protein is a key challenge in modern biology. To this end, computational modeling of the three-dimensional atomic arrangement of the amino acid chain is often crucial in determining the role of the protein in biological processes. We present a community-wide web-based protocol, RaptorX server (http://raptorx.uchicago.edu), for automated protein secondary structure prediction, template-based tertiary structure modeling, and probabilistic alignment sampling. Given a target sequence, RaptorX server is able to detect even remotely related template sequences by means of a novel nonlinear context-specific alignment potential and probabilistic consistency algorithm. Using the protocol presented here it is thus possible to obtain high-quality structural models for many target protein sequences when only distantly related protein domains have experimentally solved structures. At present, RaptorX server can perform secondary and tertiary structure prediction of a 200 amino acid target sequence in approximately 30 min.
75 FR 10507 - Advisory Committee for Biological Sciences; Notice of Meeting
Federal Register 2010, 2011, 2012, 2013, 2014
2010-03-08
....: Introductions and Updates, Presentation and Discussion-- 2011 Budget Report; Undergraduate Education; Collections; and Dimensions of Biodiversity. p.m.: Presentation and Discussion--The Future of Biology; Advances in Sequencing Technology; COV Report; Committee Discussion. March 18, 2010 a.m.: Presentation and...
Goodbye to 'one by one' genetics
Theologis, Athanasios
2001-01-01
The completion of the Arabidopsis thaliana (mustard weed) genome sequence constitutes a major breakthrough in plant biology. It will revolutionize how we answer questions about the biology and evolution of plants as well as how we confront and resolve world-wide agricultural problems. PMID:11305933
Nair, Pradeep S; John, Eugene B
2007-01-01
Aligning specific sequences against a very large number of other sequences is a central aspect of bioinformatics. With the widespread availability of personal computers in biology laboratories, sequence alignment is now often performed locally. This makes it necessary to analyse the performance of personal computers for sequence aligning bioinformatics benchmarks. In this paper, we analyse the performance of a personal computer for the popular BLAST and FASTA sequence alignment suites. Results indicate that these benchmarks have a large number of recurring operations and use memory operations extensively. It seems that the performance can be improved with a bigger L1-cache.
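The recurring, memory-heavy operations noted in this abstract are typical of dynamic-programming alignment kernels. As a hedged illustration, the sketch below computes a Smith-Waterman local alignment score with a linear gap penalty. BLAST and FASTA are heuristic tools and do not simply run this recurrence over whole databases, so this is a stand-in for the general style of inner loop, not a description of either program; the scoring values are assumptions.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score between a and b (linear gap penalty).

    Only two rows of the score matrix are kept in memory at a time.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ch_a == ch_b else mismatch)
            # The same max-of-neighbours update is repeated for every cell.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


print(smith_waterman_score("ACACACTA", "AGCACACA"))
```

Each cell reads three neighbouring cells and writes one, so the loop is dominated by repeated loads and stores over the score rows, which is consistent with the sensitivity to cache size that the abstract reports.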
Design and Analysis of Single-Cell Sequencing Experiments.
Grün, Dominic; van Oudenaarden, Alexander
2015-11-05
Recent advances in single-cell sequencing hold great potential for exploring biological systems with unprecedented resolution. Sequencing the genome of individual cells can reveal somatic mutations and allows the investigation of clonal dynamics. Single-cell transcriptome sequencing can elucidate the cell type composition of a sample. However, single-cell sequencing comes with major technical challenges and yields complex data output. In this Primer, we provide an overview of available methods and discuss experimental design and single-cell data analysis. We hope that these guidelines will enable a growing number of researchers to leverage the power of single-cell sequencing.
Implementing Genome-Driven Oncology
Hyman, David M.; Taylor, Barry S.; Baselga, José
2017-01-01
Early successes in identifying and targeting individual oncogenic drivers, together with the increasing feasibility of sequencing tumor genomes, have brought forth the promise of genome-driven oncology care. As we expand the breadth and depth of genomic analyses, the biological and clinical complexity of its implementation will be unparalleled. Challenges include target credentialing and validation, implementing drug combinations, clinical trial designs, targeting tumor heterogeneity, and deploying technologies beyond DNA sequencing, among others. We review how contemporary approaches are tackling these challenges and will ultimately serve as an engine for biological discovery and increase our insight into cancer and its treatment. PMID:28187282
[GNU Pattern: open source pattern hunter for biological sequences based on SPLASH algorithm].
Xu, Ying; Li, Yi-xue; Kong, Xiang-yin
2005-06-01
The aim was to construct a high-performance open source software engine based on the IBM SPLASH algorithm for later research on pattern discovery. GNU Pattern (Gpat), which efficiently implements the core part of the SPLASH algorithm, was developed using open source software. The full source code of Gpat is available for other researchers to modify under the GNU license. Gpat is a successful implementation of the SPLASH algorithm and can be used as a basic framework for later research on pattern recognition in biological sequences.
NASA Technical Reports Server (NTRS)
Dayhoff, M. O.
1971-01-01
The amino acid sequences of proteins from living organisms are dealt with. The structure of proteins is first discussed; the variation in this structure from one biological group to another is illustrated by the first halves of the sequences of cytochrome c, and a phylogenetic tree is derived from the cytochrome c data. The relative geological times associated with the events of this tree are discussed. Errors which occur in the duplication of cells during the evolutionary process are examined. Particular attention is given to evolution of mutant proteins, globins, ferredoxin, and transfer ribonucleic acids (tRNA's). Finally, a general outline of biological evolution is presented.
Automated selection of synthetic biology parts for genetic regulatory networks.
Yaman, Fusun; Bhatia, Swapnil; Adler, Aaron; Densmore, Douglas; Beal, Jacob
2012-08-17
Raising the level of abstraction for synthetic biology design requires solving several challenging problems, including mapping abstract designs to DNA sequences. In this paper we present the first formalism and algorithms to address this problem. The key steps of this transformation are feature matching, signal matching, and part matching. Feature matching ensures that the mapping satisfies the regulatory relationships in the abstract design. Signal matching ensures that the expression levels of functional units are compatible. Finally, part matching finds a DNA part sequence that can implement the design. Our software tool MatchMaker implements these three steps.
Reading biological processes from nucleotide sequences
NASA Astrophysics Data System (ADS)
Murugan, Anand
Cellular processes have traditionally been investigated by techniques of imaging and biochemical analysis of the molecules involved. The recent rapid progress in our ability to manipulate and read nucleic acid sequences gives us direct access to the genetic information that directs and constrains biological processes. While sequence data is being used widely to investigate genotype-phenotype relationships and population structure, here we use sequencing to understand biophysical mechanisms. We present work on two different systems. First, in chapter 2, we characterize the stochastic genetic editing mechanism that produces diverse T-cell receptors in the human immune system. We do this by inferring statistical distributions of the underlying biochemical events that generate T-cell receptor coding sequences from the statistics of the observed sequences. This inferred model quantitatively describes the potential repertoire of T-cell receptors that can be produced by an individual, providing insight into its potential diversity and the probability of generation of any specific T-cell receptor. Then in chapter 3, we present work on understanding the functioning of regulatory DNA sequences in both prokaryotes and eukaryotes. Here we use experiments that measure the transcriptional activity of large libraries of mutagenized promoters and enhancers and infer models of the sequence-function relationship from this data. For the bacterial promoter, we infer a physically motivated 'thermodynamic' model of the interaction of DNA-binding proteins and RNA polymerase determining the transcription rate of the downstream gene. For the eukaryotic enhancers, we infer heuristic models of the sequence-function relationship and use these models to find synthetic enhancer sequences that optimize inducibility of expression. 
Both projects demonstrate the utility of sequence information in conjunction with sophisticated statistical inference techniques for dissecting underlying biophysical mechanisms.
2010-01-01
Background Primer and probe sequences are the main components of nucleic acid-based detection systems. Biologists use primers and probes for different tasks, some related to the diagnosis and prescription of infectious diseases. The biological literature is the main information source for empirically validated primer and probe sequences. Therefore, it is becoming increasingly important for researchers to navigate this important information. In this paper, we present a four-phase method for extracting and annotating primer/probe sequences from the literature. These phases are: (1) convert each document into a tree of paper sections, (2) detect the candidate sequences using a set of finite state machine-based recognizers, (3) refine problem sequences using a rule-based expert system, and (4) annotate the extracted sequences with their related organism/gene information. Results We tested our approach using a test set composed of 297 manuscripts. The extracted sequences and their organism/gene annotations were manually evaluated by a panel of molecular biologists. The results of the evaluation show that our approach is suitable for automatically extracting DNA sequences, achieving precision/recall rates of 97.98% and 95.77%, respectively. In addition, 76.66% of the detected sequences were correctly annotated with their organism name. The system also provided correct gene-related information for 46.18% of the sequences assigned a correct organism name. Conclusions We believe that the proposed method can facilitate routine tasks for biomedical researchers using molecular methods to diagnose and prescribe different infectious diseases. In addition, the proposed method can be expanded to detect and extract other biological sequences from the literature. The extracted information can also be used to readily update available primer/probe databases or to create new databases from scratch. PMID:20682041
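As a rough illustration of phase (2) and of the precision/recall evaluation reported above, the following Python sketch flags candidate primer/probe strings in free text. It is an assumption-laden simplification: a single regular expression over IUPAC nucleotide codes stands in for the paper's finite state machine-based recognizers, and the 15 to 40 letter length window and the function names `find_candidates` and `precision_recall` are hypothetical choices, not taken from the paper.

```python
import re

# Hypothetical recognizer: an uninterrupted run of 15-40 uppercase IUPAC
# nucleotide codes. The real system uses several finite state machines and a
# rule-based refinement step; this regex is only a stand-in.
CANDIDATE = re.compile(r"\b[ACGTURYSWKMBDHVN]{15,40}\b")


def find_candidates(text):
    """Return every substring of text that looks like a primer/probe."""
    return [match.group(0) for match in CANDIDATE.finditer(text)]


def precision_recall(predicted, gold):
    """Compute precision and recall of predicted items against a gold set."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall
```

A call such as `find_candidates("primer 5'-ATGGCTAGCTAGGCTAACGT-3' was used")` would return the 20-letter sequence; sequences split by spaces or line breaks would need the additional refinement the abstract describes.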
Smith, R F; Wiese, B A; Wojzynski, M K; Davison, D B; Worley, K C
1996-05-01
The BCM Search Launcher is an integrated set of World Wide Web (WWW) pages that organize molecular biology-related search and analysis services available on the WWW by function, and provide a single point of entry for related searches. The Protein Sequence Search Page, for example, provides a single sequence entry form for submitting sequences to WWW servers that offer remote access to a variety of different protein sequence search tools, including BLAST, FASTA, Smith-Waterman, BEAUTY, PROSITE, and BLOCKS searches. Other Launch pages provide access to (1) nucleic acid sequence searches, (2) multiple and pair-wise sequence alignments, (3) gene feature searches, (4) protein secondary structure prediction, and (5) miscellaneous sequence utilities (e.g., six-frame translation). The BCM Search Launcher also provides a mechanism to extend the utility of other WWW services by adding supplementary hypertext links to results returned by remote servers. For example, links to the NCBI's Entrez data base and to the Sequence Retrieval System (SRS) are added to search results returned by the NCBI's WWW BLAST server. These links provide easy access to auxiliary information, such as Medline abstracts, that can be extremely helpful when analyzing BLAST data base hits. For new or infrequent users of sequence data base search tools, we have preset the default search parameters to provide the most informative first-pass sequence analysis possible. We have also developed a batch client interface for Unix and Macintosh computers that allows multiple input sequences to be searched automatically as a background task, with the results returned as individual HTML documents directly to the user's system. The BCM Search Launcher and batch client are available on the WWW at URL http://gc.bcm.tmc.edu:8088/search-launcher.html.
Emerging Tools for Synthetic Genome Design
Lee, Bo-Rahm; Cho, Suhyung; Song, Yoseb; Kim, Sun Chang; Cho, Byung-Kwan
2013-01-01
Synthetic biology is an emerging discipline for designing and synthesizing predictable, measurable, controllable, and transformable biological systems. These newly designed biological systems have great potential for the development of cheaper drugs, green fuels, biodegradable plastics, and targeted cancer therapies over the coming years. Fortunately, our ability to quickly and accurately engineer biological systems that behave predictably has been dramatically expanded by significant advances in DNA-sequencing, DNA-synthesis, and DNA-editing technologies. Here, we review emerging technologies and methodologies in the field of building designed biological systems, and we discuss their future perspectives. PMID:23708771
Hawkins, Troy; Chitale, Meghana; Luban, Stanislav; Kihara, Daisuke
2009-02-15
Protein function prediction is a central problem in bioinformatics, increasing in importance recently due to the rapid accumulation of biological data awaiting interpretation. Sequence data represents the bulk of this new stock and is the obvious target for consideration as input, as newly sequenced organisms often lack any other type of biological characterization. We have previously introduced PFP (Protein Function Prediction) as our sequence-based predictor of Gene Ontology (GO) functional terms. PFP interprets the results of a PSI-BLAST search by extracting and scoring individual functional attributes, searching a wide range of E-value sequence matches, and utilizing conventional data mining techniques to fill in missing information. We have shown it to be effective in predicting both specific and low-resolution functional attributes when sufficient data is unavailable. Here we describe (1) significant improvements to the PFP infrastructure, including the addition of prediction significance and confidence scores, (2) a thorough benchmark of performance and comparisons to other related prediction methods, and (3) applications of PFP predictions to genome-scale data. We applied PFP predictions to uncharacterized protein sequences from 15 organisms. Among these sequences, 60-90% could be annotated with a GO molecular function term at high confidence (≥80%). We also applied our predictions to the protein-protein interaction network of the Malaria plasmodium (Plasmodium falciparum). High confidence GO biological process predictions (≥90%) from PFP increased the number of fully enriched interactions in this dataset from 23% of interactions to 94%. Our benchmark comparison shows significant performance improvement of PFP relative to GOtcha, InterProScan, and PSI-BLAST predictions. This is consistent with the performance of PFP as the overall best predictor in both the AFP-SIG '05 and CASP7 function (FN) assessments. 
PFP is available as a web service at http://dragon.bio.purdue.edu/pfp/. (c) 2008 Wiley-Liss, Inc.
Mykles, Donald L; Burnett, Karen G; Durica, David S; Stillman, Jonathon H
2016-12-01
Crustaceans, and decapods in particular (i.e., crabs, shrimp, and lobsters), are a diverse and ecologically and commercially important group of organisms. Understanding responses to abiotic and biotic factors is critical for developing best practices in aquaculture and assessing the effects of changing environments on the biology of these important animals. A relatively small number of decapod crustacean species have been intensively studied at the molecular level; the availability, experimental tractability, and economic relevance factor into the selection of a particular species as a model. Transcriptomics, using high-throughput next generation sequencing (NGS, coupled with RNA sequencing or RNA-seq) is revolutionizing crustacean biology. The 11 symposium papers in this volume illustrate how RNA-seq is being used to study stress response, molting and limb regeneration, immunity and disease, reproduction and development, neurobiology, and ecology and evolution. This symposium occurred on the 10th anniversary of the symposium, "Genomic and Proteomic Approaches to Crustacean Biology", held at the Society for Integrative and Comparative Biology 2006 meeting. Two participants in the 2006 symposium, the late Paul Gross and David Towle, were recognized as leaders who pioneered the use of molecular techniques that would ultimately foster the transcriptomics research reviewed in this volume. RNA-seq is a powerful tool for hypothesis-driven research, as well as an engine for discovery. It has eclipsed the technologies available in 2006, such as microarrays, expressed sequence tags, and subtractive hybridization screening, as the millions of "reads" from NGS enable researchers to de novo assemble a comprehensive transcriptome without a complete genome sequence. 
The symposium series concludes with a policy paper that gives an overview of the resources available and makes recommendations for developing better tools for functional annotation and pathway and network analysis in organisms in which the genome is not available or is incomplete. © The Author 2016. Published by Oxford University Press on behalf of the Society for Integrative and Comparative Biology. All rights reserved. For permissions please email: journals.permissions@oup.com.
ISOL@: an Italian SOLAnaceae genomics resource.
Chiusano, Maria Luisa; D'Agostino, Nunzio; Traini, Alessandra; Licciardello, Concetta; Raimondo, Enrico; Aversano, Mario; Frusciante, Luigi; Monti, Luigi
2008-03-26
Present-day '-omics' technologies produce overwhelming amounts of data which include genome sequences, information on gene expression (transcripts and proteins) and on cell metabolic status. These data represent multiple aspects of a biological system and need to be investigated as a whole to shed light on the mechanisms which underpin the system functionality. The gathering and convergence of data generated by high-throughput technologies, the effective integration of different data-sources and the analysis of the information content based on comparative approaches are key methods for meaningful biological interpretations. In the frame of the International Solanaceae Genome Project, we propose here ISOLA, an Italian SOLAnaceae genomics resource. ISOLA (available at http://biosrv.cab.unina.it/isola) represents a trial platform and it is conceived as a multi-level computational environment. ISOLA currently consists of two main levels: the genome and the expression level. The cornerstone of the genome level is represented by the Solanum lycopersicum genome draft sequences generated by the International Tomato Genome Sequencing Consortium. Instead, the basic element of the expression level is the transcriptome information from different Solanaceae species, mainly in the form of species-specific comprehensive collections of Expressed Sequence Tags (ESTs). The cross-talk between the genome and the expression levels is based on data source sharing and on tools that enhance data quality, that extract information content from the levels' under parts and produce value-added biological knowledge. ISOLA is the result of a bioinformatics effort that addresses the challenges of the post-genomics era. It is designed to exploit '-omics' data based on effective integration to acquire biological knowledge and to approach a systems biology view. 
Beyond providing experimental biologists with a preliminary annotation of the tomato genome, this effort aims to produce a trial computational environment where different aspects and details are maintained as they are relevant for the analysis of the organization, the functionality and the evolution of the Solanaceae family.
Finding Sequences for over 270 Orphan Enzymes
Shearer, Alexander G.; Altman, Tomer; Rhee, Christine D.
2014-01-01
Despite advances in sequencing technology, there are still significant numbers of well-characterized enzymatic activities for which there are no known associated sequences. These ‘orphan enzymes’ represent glaring holes in our biological understanding, and it is a top priority to reunite them with their coding sequences. Here we report a methodology for resolving orphan enzymes through a combination of database search and literature review. Using this method we were able to reconnect over 270 orphan enzymes with their corresponding sequence. This success points toward how we can systematically eliminate the remaining orphan enzymes and prevent the introduction of future orphan enzymes. PMID:24826896
Nielsen, Morten; Andreatta, Massimo
2017-07-03
Peptides are extensively used to characterize functional or (linear) structural aspects of receptor-ligand interactions in biological systems, e.g. SH2, SH3, PDZ peptide-recognition domains, the MHC membrane receptors and enzymes such as kinases and phosphatases. NNAlign is a method for the identification of such linear motifs in biological sequences. The algorithm aligns the amino acid or nucleotide sequences provided as training set, and generates a model of the sequence motif detected in the data. The webserver allows setting up cross-validation experiments to estimate the performance of the model, as well as evaluations on independent data. Many features of the training sequences can be encoded as input, and the network architecture is highly customizable. The results returned by the server include a graphical representation of the motif identified by the method, performance values and a downloadable model that can be applied to scan protein sequences for occurrence of the motif. While its performance for the characterization of peptide-MHC interactions is widely documented, we extended NNAlign to be applicable to other receptor-ligand systems as well. Version 2.0 supports alignments with insertions and deletions, encoding of receptor pseudo-sequences, and custom alphabets for the training sequences. The server is available at http://www.cbs.dtu.dk/services/NNAlign-2.0. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.
Aziz, Ramy K.; Dwivedi, Bhakti; Akhter, Sajia; Breitbart, Mya; Edwards, Robert A.
2015-01-01
Phages are the most abundant biological entities on Earth and play major ecological roles, yet the current sequenced phage genomes do not adequately represent their diversity, and little is known about the abundance and distribution of these sequenced genomes in nature. Although the study of phage ecology has benefited tremendously from the emergence of metagenomic sequencing, a systematic survey of phage genes and genomes in various ecosystems is still lacking, and fundamental questions about phage biology, lifestyle, and ecology remain unanswered. To address these questions and improve comparative analysis of phages in different metagenomes, we screened a core set of publicly available metagenomic samples for sequences related to completely sequenced phages using the web tool, Phage Eco-Locator. We then adopted and deployed an array of mathematical and statistical metrics for a multidimensional estimation of the abundance and distribution of phage genes and genomes in various ecosystems. Experiments using those metrics individually showed their usefulness in emphasizing the pervasive, yet uneven, distribution of known phage sequences in environmental metagenomes. Using these metrics in combination allowed us to resolve phage genomes into clusters that correlated with their genotypes and taxonomic classes as well as their ecological properties. We propose adding this set of metrics to current metaviromic analysis pipelines, where they can provide insight regarding phage mosaicism, habitat specificity, and evolution. PMID:26005436
ERGC: an efficient referential genome compression algorithm
Saha, Subrata; Rajasekaran, Sanguthevar
2015-01-01
Motivation: Genome sequencing has become faster and more affordable. Consequently, the number of available complete genomic sequences is increasing rapidly. As a result, the cost to store, process, analyze and transmit the data is becoming a bottleneck for research and future medical applications. So, the need for devising efficient data compression and data reduction techniques for biological sequencing data is growing by the day. Although there exists a number of standard data compression algorithms, they are not efficient in compressing biological data. These generic algorithms do not exploit some inherent properties of the sequencing data while compressing. To exploit statistical and information-theoretic properties of genomic sequences, we need specialized compression algorithms. Five different next-generation sequencing data compression problems have been identified and studied in the literature. We propose a novel algorithm for one of these problems known as reference-based genome compression. Results: We have done extensive experiments using five real sequencing datasets. The results on real genomes show that our proposed algorithm is indeed competitive and performs better than the best known algorithms for this problem. It achieves compression ratios that are better than those of the currently best performing algorithms. The time to compress and decompress the whole genome is also very promising. Availability and implementation: The implementations are freely available for non-commercial purposes. They can be downloaded from http://engr.uconn.edu/∼rajasek/ERGC.zip. Contact: rajasek@engr.uconn.edu PMID:26139636
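To make the idea of reference-based compression concrete, here is a minimal Python sketch. It is not ERGC's algorithm, which the abstract does not detail; it is a generic greedy scheme, assumed for illustration, that encodes the target as copy operations (position and length in the reference) plus single-character literals where no match of at least `k` characters exists. The names `build_index`, `compress`, `decompress` and the default `k=4` are hypothetical.

```python
def build_index(reference, k):
    """Map every k-mer of the reference to the positions where it occurs."""
    index = {}
    for i in range(len(reference) - k + 1):
        index.setdefault(reference[i:i + k], []).append(i)
    return index


def compress(target, reference, k=4):
    """Greedily encode target as copies from reference plus literals."""
    index = build_index(reference, k)
    ops, i = [], 0
    while i < len(target):
        best_pos, best_len = -1, 0
        # Try every reference position sharing the next k-mer; keep the longest match.
        for pos in index.get(target[i:i + k], []):
            length = 0
            while (i + length < len(target)
                   and pos + length < len(reference)
                   and target[i + length] == reference[pos + length]):
                length += 1
            if length > best_len:
                best_pos, best_len = pos, length
        if best_len >= k:
            ops.append(("copy", best_pos, best_len))
            i += best_len
        else:
            ops.append(("lit", target[i]))  # no usable match: emit one literal
            i += 1
    return ops


def decompress(ops, reference):
    """Rebuild the target from the operation list and the same reference."""
    out = []
    for op in ops:
        if op[0] == "copy":
            _, pos, length = op
            out.append(reference[pos:pos + length])
        else:
            out.append(op[1])
    return "".join(out)
```

When the target differs from the reference by only a few substitutions, the operation list is much shorter than the target itself, which is the property referential compressors exploit; a practical tool would additionally entropy-code the operations.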
Efficient use of unlabeled data for protein sequence classification: a comparative study
Kuksa, Pavel; Huang, Pai-Hsi; Pavlovic, Vladimir
2009-01-01
Background Recent studies in computational primary protein sequence analysis have leveraged the power of unlabeled data. For example, predictive models based on string kernels trained on sequences known to belong to particular folds or superfamilies, the so-called labeled data set, can attain significantly improved accuracy if this data is supplemented with protein sequences that lack any class tags–the unlabeled data. In this study, we present a principled and biologically motivated computational framework that more effectively exploits the unlabeled data by only using the sequence regions that are more likely to be biologically relevant for better prediction accuracy. As overly-represented sequences in large uncurated databases may bias the estimation of computational models that rely on unlabeled data, we also propose a method to remove this bias and improve performance of the resulting classifiers. Results Combined with state-of-the-art string kernels, our proposed computational framework achieves very accurate semi-supervised protein remote fold and homology detection on three large unlabeled databases. It outperforms current state-of-the-art methods and exhibits significant reduction in running time. Conclusion The unlabeled sequences used under the semi-supervised setting resemble the unpolished gemstones; when used as-is, they may carry unnecessary features and hence compromise the classification accuracy but once cut and polished, they improve the accuracy of the classifiers considerably. PMID:19426450
Score distributions of gapped multiple sequence alignments down to the low-probability tail
NASA Astrophysics Data System (ADS)
Fieth, Pascal; Hartmann, Alexander K.
2016-08-01
Assessing the significance of alignment scores of optimally aligned DNA or amino acid sequences can be achieved via the knowledge of the score distribution of random sequences. But this requires obtaining the distribution in the biologically relevant high-scoring region, where the probabilities are exponentially small. For gapless local alignments of infinitely long sequences this distribution is known analytically to follow a Gumbel distribution. Distributions for gapped local alignments and global alignments of finite lengths can only be obtained numerically. To obtain results for the small-probability region, specific statistical mechanics-based rare-event algorithms can be applied. In previous studies, this was achieved for pairwise alignments. They showed that, contrary to results from previous simple sampling studies, strong deviations from the Gumbel distribution occur in case of finite sequence lengths. Here we extend the studies to multiple sequence alignments with gaps, which are much more relevant for practical applications in molecular biology. We study the distributions of scores over a large range of the support, reaching probabilities as small as 10^-160, for global and local (sum-of-pair scores) multiple alignments. We find that even after suitable rescaling, eliminating the sequence-length dependence, the distributions for multiple alignment differ from the pairwise alignment case. Furthermore, we also show that the previously discussed Gaussian correction to the Gumbel distribution needs to be refined, also for the case of pairwise alignments.
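For the gapless local alignment case mentioned in this abstract, the Gumbel tail is commonly written in the Karlin-Altschul form P(S >= s) ≈ 1 - exp(-K m n exp(-lambda s)). The Python sketch below evaluates that expression; it only illustrates the baseline distribution and not the rare-event sampling or the corrections studied in the paper. The parameters `lam` and `K` depend on the scoring system and must be supplied by the caller; the function name `gumbel_tail` is hypothetical.

```python
import math


def gumbel_tail(score, lam, K, m, n):
    """Approximate P(S >= score) for gapless local alignment of random
    sequences of lengths m and n, using the Gumbel (Karlin-Altschul) form."""
    expected_hits = K * m * n * math.exp(-lam * score)
    # -expm1(-x) equals 1 - exp(-x) but stays accurate when x is tiny,
    # which is exactly the low-probability tail of interest.
    return -math.expm1(-expected_hits)
```

For very large scores the result approaches `K * m * n * exp(-lam * score)`, so the tail decays exponentially in the score; the abstract reports that the distributions for finite-length gapped alignments deviate from this simple form.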
Molecular taxonomic techniques such as DNA barcoding offer interesting new capabilities for studying community biodiversity for applications like biological monitoring. Beyond DNA barcoding, new DNA sequencing technologies (i.e. Next-Generation Sequencing) present even greater po...
Middle Level SS&C Energy Series.
ERIC Educational Resources Information Center
Crow, Linda W.; Aldridge, Bill G.
The project on Scope Sequence and Coordination of Secondary School Science (SS&C) was initiated by the National Science Teachers Association (NSTA) and recommends that all students study science every year and advocates carefully sequenced, well-coordinated instruction in biology, chemistry, earth/space science, and physics. This document…
Tree crops: Advances in insects and disease management
USDA-ARS's Scientific Manuscript database
Advances in next-generation sequencing have enabled genome sequencing to be fast and affordable. Thus today researchers and industries can address new methods in pest and pathogen management. Biological control of insect pests that occur in large areas, such as forests and farming systems of fruit t...
Combining partially ranked data in plant breeding and biology: II. Analysis with Rasch model.
USDA-ARS's Scientific Manuscript database
Many years of breeding experiments, germplasm screening, and molecular biologic experimentation have generated volumes of sequence, genotype, and phenotype information that have been stored in public data repositories. These resources afford genetic and genomic researchers the opportunity to handle ...
Tipping the Balance: Hepatotoxicity and the Four Apical Key Events of Hepatic Steatosis
Adverse outcome pathways (AOPs) are descriptive biological sequences that start from a molecular initiating event (MIE) and end with an adverse health outcome. AOPs provide biological context for high throughput chemical testing and further prioritize environmental health risk r...
The development of current biological monitoring and bioassessment programs was a drastic improvement over previous programs created for monitoring a limited number of specific chemical pollutants. Although these assessment programs are better designed to address the transient an...
Emergence of biological organization through thermodynamic inversion.
Kompanichenko, Vladimir
2014-01-01
Biological organization arises under thermodynamic inversion in prebiotic systems that provide the prevalence of free energy and information contribution over the entropy contribution. The inversion might occur under specific far-from-equilibrium conditions in prebiotic systems oscillating around the bifurcation point. At the inversion moment, (physical) information characteristic of non-biological systems acquires the new features: functionality, purposefulness, and control over the life processes, which transform it into biological information. Random sequences of amino acids and nucleotides, spontaneously synthesized in the prebiotic microsystem, in the primary living unit (probiont) re-assemble into functional sequences, involved into bioinformation circulation through nucleoprotein interaction, resulted in the genetic code emergence. According to the proposed concept, oscillating three-dimensional prebiotic microsystems transformed into probionts in the changeable hydrothermal medium of the early Earth. The inversion concept states that spontaneous (accidental, random) transformations in prebiotic systems cannot produce life; it is only non-spontaneous (perspective, purposeful) transformations, which are the result of thermodynamic inversion, that lead to the negentropy conversion of prebiotic systems into initial living units.
Hammond, R W; Crosslin, J M; Pasini, R; Howell, W E; Mink, G I
1999-07-01
Prunus necrotic ringspot ilarvirus (PNRSV) exists as a number of biologically distinct variants which differ in host specificity, serology, and pathology. Previous nucleotide sequence alignment and phylogenetic analysis of cloned reverse transcription-polymerase chain reaction (RT-PCR) products of several biologically distinct sweet cherry isolates revealed correlations between symptom type and the nucleotide and amino acid sequences of the 3a (putative movement protein) and 3b (coat protein) open reading frames. Based upon this analysis, RT-PCR assays have been developed that can identify isolates displaying different symptoms and serotypes. The incorporation of primers in a multiplex PCR protocol permits rapid detection and discrimination among the strains. The results of PCR amplification using type-specific primers that amplify a portion of the coat protein gene demonstrate that the primer-selection procedure developed for PNRSV constitutes a reliable method of viral strain discrimination in cherry for disease control and will also be useful for examining biological diversity within the PNRSV virus group.
Liu, Yindong; Su, Xiaomei; Lu, Lian; Ding, Linxian; Shen, Chaofeng
2016-03-01
A culture supernatant from Micrococcus luteus containing resuscitation-promoting factor (SRpf) was used to enhance the biological nutrient removal of potentially functional bacteria. The results suggest that SRpf accelerated the start-up process and significantly enhanced biological nutrient removal in a sequencing batch reactor (SBR). PO4(3-)-P removal efficiency increased by over 12% and total nitrogen removal efficiency increased by over 8% in the treatment reactor acclimated with SRpf compared with the reactor without SRpf addition. Illumina high-throughput sequencing analysis showed that SRpf played an essential role in shifts in the composition and diversity of the bacterial community. The phyla Proteobacteria and Actinobacteria, which are closely related to biological nutrient removal, were highly abundant after SRpf addition. This study demonstrates that SRpf acclimation or addition might hold great potential as an efficient and cost-effective alternative for wastewater treatment plants (WWTPs) to meet more stringent operating conditions and legislation.
Orthologs and paralogs - we need to get it right
Jensen, Roy A
2001-01-01
A response to Homologuephobia, by Gregory A Petsko, Genome Biology 2001 2:comment1002.1-1002.2, to An apology for orthologs - or brave new memes by Eugene V Koonin, Genome Biology 2001, 2:comment1005.1-1005.2, and to Can sequence determine function? by John A Gerlt and Patricia C Babbitt, Genome Biology 2000, 1:reviews0005.1-0005.10. PMID:11532207
2005-10-01
U.S. Army Soldier and Biological Chemical Command. Genotyping of Burkholderia mallei: Effective Subspecies Discrimination using Ribotyping and... Report date: 01 OCT 2005. Title and subtitle: Genotyping of Burkholderia mallei: Effective... Also can involve physical, chemical or immunological characteristics. Edgewood Chemical Biological Center. Burkholderia mallei Biology and Pathogenesis.
Complete Genome Sequence of Bifidobacterium breve CECT 7263, a Strain Isolated from Human Milk
Jiménez, Esther; Villar-Tajadura, M. Antonia; Marín, María; Fontecha, Javier; Requena, Teresa; Arroyo, Rebeca; Fernández, Leónides
2012-01-01
Bifidobacterium breve is an actinobacterium frequently isolated from colonic microbiota of breastfeeding babies. Here, we report the complete and annotated genome sequence of a B. breve strain isolated from human milk, B. breve CECT 7263. The genome sequence will provide new insights into the biology of this potential probiotic organism and will allow the characterization of genes related to beneficial properties. PMID:22740680
EST-PAC a web package for EST annotation and protein sequence prediction
Strahm, Yvan; Powell, David; Lefèvre, Christophe
2006-01-01
With the decreasing cost of DNA sequencing technology and the vast diversity of biological resources, researchers increasingly face the basic challenge of annotating a larger number of expressed sequence tags (ESTs) from a variety of species. This typically consists of a series of repetitive tasks, which should be automated and easy to use. The results of these annotation tasks need to be stored and organized in a consistent way. All these operations should be self-installing, platform independent, easy to customize and amenable to using distributed bioinformatics resources available on the Internet. In order to address these issues, we present EST-PAC, a web-oriented multi-platform software package for expressed sequence tag (EST) annotation. EST-PAC provides a solution for the administration of EST and protein sequence annotations accessible through a web interface. Three aspects of EST annotation are automated: 1) searching local or remote biological databases for sequence similarities using Blast services, 2) predicting protein coding sequence from EST data, and 3) annotating predicted protein sequences with functional domain predictions. In practice, EST-PAC integrates the BLASTALL suite, EST-Scan2 and HMMER in a relational database system accessible through a simple web interface. EST-PAC also takes advantage of the relational database to allow consistent storage, powerful queries of results, and management of the annotation process. The system allows users to customize annotation strategies and provides an open-source data-management environment for research and education in bioinformatics. PMID:17147782
2014-01-01
Background Ambiscript is a graphically-designed nucleic acid notation that uses symbol symmetries to support sequence complementation, highlight biologically-relevant palindromes, and facilitate the analysis of consensus sequences. Although the original Ambiscript notation was designed to easily represent consensus sequences for multiple sequence alignments, the notation’s black-on-white ambiguity characters are unable to reflect the statistical distribution of nucleotides found at each position. We now propose a color-augmented ambigraphic notation to encode the frequency of positional polymorphisms in these consensus sequences. Results We have implemented this color-coding approach by creating an Adobe Flash® application ( http://www.ambiscript.org) that shades and colors modified Ambiscript characters according to the prevalence of the encoded nucleotide at each position in the alignment. The resulting graphic helps viewers perceive biologically-relevant patterns in multiple sequence alignments by uniquely combining color, shading, and character symmetries to highlight palindromes and inverted repeats in conserved DNA motifs. Conclusion Juxtaposing an intuitive color scheme over the deliberate character symmetries of an ambigraphic nucleic acid notation yields a highly-functional nucleic acid notation that maximizes information content and successfully embodies key principles of graphic excellence put forth by the statistician and graphic design theorist, Edward Tufte. PMID:24447494
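The abstract above describes shading characters according to how prevalent each nucleotide is at each alignment position. The quantity being encoded can be sketched as a per-column frequency table. This is a minimal illustration, not the Ambiscript application's code; the function name `column_frequencies` and the toy alignment are assumptions made for this example.

```python
from collections import Counter


def column_frequencies(alignment):
    """Return, for each alignment column, a dict of nucleotide -> relative frequency.

    `alignment` is a list of equal-length strings (hypothetical input format).
    """
    if not alignment:
        return []
    width = len(alignment[0])
    if any(len(row) != width for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    freqs = []
    for column in zip(*alignment):
        counts = Counter(base.upper() for base in column)
        freqs.append({base: n / len(column) for base, n in counts.items()})
    return freqs


if __name__ == "__main__":
    for position, table in enumerate(column_frequencies(["ACGT", "ACGA", "ATGA", "ACGA"])):
        print(position, table)
```

A renderer could then map each frequency to a shade or color; that mapping is a design choice the abstract does not specify.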
The Genome Sequence of Taurine Cattle: A window to ruminant biology and evolution
Elsik, Christine G.; Tellam, Ross L.; Worley, Kim C.
2010-01-01
To understand the biology and evolution of ruminants, the cattle genome was sequenced to ∼7× coverage. The cattle genome contains a minimum of 22,000 genes, with a core set of 14,345 orthologs shared among seven mammalian species of which 1,217 are absent or undetected in non-eutherian (marsupial or monotreme) genomes. Cattle-specific evolutionary breakpoint regions in chromosomes have a higher density of segmental duplications, enrichment of repetitive elements, and species-specific variations in genes associated with lactation and immune responsiveness. Genes involved in metabolism are generally highly conserved, although five metabolic genes are deleted or extensively diverged from their human orthologs. The cattle genome sequence thus provides an enabling resource for understanding mammalian evolution and accelerating livestock genetic improvement for milk and meat production. PMID:19390049
Sequence alignment visualization in HTML5 without Java.
Gille, Christoph; Birgit, Weyand; Gille, Andreas
2014-01-01
Java has been extensively used for the visualization of biological data on the web. However, the Java runtime environment is an additional layer of software with its own set of technical problems and security risks. HTML in its new version 5 provides features that for some tasks may render Java unnecessary. Alignment-To-HTML is the first HTML-based interactive visualization for annotated multiple sequence alignments. The server side script interpreter can perform all tasks like (i) sequence retrieval, (ii) alignment computation, (iii) rendering, (iv) identification of homologous structural models and (v) communication with BioDAS-servers. The rendered alignment can be included in web pages and is displayed in all browsers on all platforms including touch screen tablets. The functionality of the user interface is similar to legacy Java applets and includes color schemes, highlighting of conserved and variable alignment positions, row reordering by drag and drop, interlinked 3D visualization and sequence groups. Novel features are (i) support for multiple overlapping residue annotations, such as chemical modifications, single nucleotide polymorphisms and mutations, (ii) mechanisms to quickly hide residue annotations, (iii) export to MS-Word and (iv) sequence icons. Alignment-To-HTML, the first interactive alignment visualization that runs in web browsers without additional software, confirms that to some extent HTML5 is already sufficient to display complex biological data. The low speed at which programs are executed in browsers is still the main obstacle. Nevertheless, we envision an increased use of HTML and JavaScript for interactive biological software. Under GPL at: http://www.bioinformatics.org/strap/toHTML/.
An approach for identification of unknown viruses using sequencing-by-hybridization.
Katoski, Sarah E; Meyer, Hermann; Ibrahim, Sofi
2015-09-01
Accurate identification of biological threat agents, especially RNA viruses, in clinical or environmental samples can be challenging because the concentration of viral genomic material in a given sample is usually low, viral genomic RNA is liable to degradation, and RNA viruses are extremely diverse. A two-tiered approach was used for initial identification, then full genomic characterization of 199 RNA viruses belonging to virus families Arenaviridae, Bunyaviridae, Filoviridae, Flaviviridae, and Togaviridae. A sequencing-by-hybridization (SBH) microarray was used to tentatively identify a viral pathogen, and the identity was then confirmed by guided next-generation sequencing (NGS). After optimization and evaluation of the SBH and NGS methodologies with various virus species and strains, the approach was used to test the ability to identify viruses in blinded samples. The SBH correctly identified two Ebola viruses in the blinded samples within 24 hr, and by using guided amplicon sequencing with 454 GS FLX, the identities of the viruses in both samples were confirmed. SBH provides relatively low-cost screening of biological samples against a panel of viral pathogens that can be custom-designed on a microarray. Once the identity of the virus is deduced from the highest hybridization signal on the SBH microarray, guided (amplicon) NGS can be used not only to confirm the identity of the virus but also to provide further information about the strain or isolate, including a potential genetic manipulation. This approach can be useful in situations where natural or deliberate biological threat incidents might occur and a rapid response is required. © 2015 Wiley Periodicals, Inc.
Andreatta, Massimo; Schafer-Nielsen, Claus; Lund, Ole; Buus, Søren; Nielsen, Morten
2011-01-01
Recent advances in high-throughput technologies have made it possible to generate both gene and protein sequence data at an unprecedented rate and scale thereby enabling entirely new “omics”-based approaches towards the analysis of complex biological processes. However, the amount and complexity of data that even a single experiment can produce seriously challenges researchers with limited bioinformatics expertise, who need to handle, analyze and interpret the data before it can be understood in a biological context. Thus, there is an unmet need for tools allowing non-bioinformatics users to interpret large data sets. We have recently developed a method, NNAlign, which is generally applicable to any biological problem where quantitative peptide data is available. This method efficiently identifies underlying sequence patterns by simultaneously aligning peptide sequences and identifying motifs associated with quantitative readouts. Here, we provide a web-based implementation of NNAlign allowing non-expert end-users to submit their data (optionally adjusting method parameters), and in return receive a trained method (including a visual representation of the identified motif) that subsequently can be used as prediction method and applied to unknown proteins/peptides. We have successfully applied this method to several different data sets including peptide microarray-derived sets containing more than 100,000 data points. NNAlign is available online at http://www.cbs.dtu.dk/services/NNAlign. PMID:22073191
Present Day Biology seen in the Looking Glass of Physics of Complexity
NASA Astrophysics Data System (ADS)
Schuster, P.
Darwin's theory of variation and selection in its simplest form is directly applicable to RNA evolution in vitro as well as to virus evolution, and it allows for quantitative predictions. Understanding evolution at the molecular level is ultimately related to the central paradigm of structural biology: sequence⇒ structure ⇒ function. We elaborate on the state of the art in modeling and understanding evolution of RNA driven by reproduction and mutation. The focus will be laid on the landscape concept—originally introduced by Sewall Wright—and its application to problems in biology. The relation between genotypes and phenotypes is the result of two consecutive mappings from a space of genotypes called sequence space onto a space of phenotypes or structures, and fitness is the result of a mapping from phenotype space into non-negative real numbers. Realistic landscapes as derived from folding of RNA sequences into structures are characterized by two properties: (i) they are rugged in the sense that sequences lying nearby in sequence space may have very different fitness values and (ii) they are characterized by an appreciable degree of neutrality implying that a certain fraction of genotypes and/or phenotypes cannot be distinguished in the selection process. Evolutionary dynamics on realistic landscapes will be studied as a function of the mutation rate, and the role of neutrality in the selection process will be discussed.
A computational genomics pipeline for prokaryotic sequencing projects
Kislyuk, Andrey O.; Katz, Lee S.; Agrawal, Sonia; Hagen, Matthew S.; Conley, Andrew B.; Jayaraman, Pushkala; Nelakuditi, Viswateja; Humphrey, Jay C.; Sammons, Scott A.; Govil, Dhwani; Mair, Raydel D.; Tatti, Kathleen M.; Tondella, Maria L.; Harcourt, Brian H.; Mayer, Leonard W.; Jordan, I. King
2010-01-01
Motivation: New sequencing technologies have accelerated research on prokaryotic genomes and have made genome sequencing operations outside major genome sequencing centers routine. However, no off-the-shelf solution exists for the combined assembly, gene prediction, genome annotation and data presentation necessary to interpret sequencing data. The resulting requirement to invest significant resources into custom informatics support for genome sequencing projects remains a major impediment to the accessibility of high-throughput sequence data. Results: We present a self-contained, automated high-throughput open source genome sequencing and computational genomics pipeline suitable for prokaryotic sequencing projects. The pipeline has been used at the Georgia Institute of Technology and the Centers for Disease Control and Prevention for the analysis of Neisseria meningitidis and Bordetella bronchiseptica genomes. The pipeline is capable of enhanced or manually assisted reference-based assembly using multiple assemblers and modes; gene predictor combining; and functional annotation of genes and gene products. Because every component of the pipeline is executed on a local machine with no need to access resources over the Internet, the pipeline is suitable for projects of a sensitive nature. Annotation of virulence-related features makes the pipeline particularly useful for projects working with pathogenic prokaryotes. Availability and implementation: The pipeline is licensed under the open-source GNU General Public License and available at the Georgia Tech Neisseria Base (http://nbase.biology.gatech.edu/). The pipeline is implemented with a combination of Perl, Bourne Shell and MySQL and is compatible with Linux and other Unix systems. Contact: king.jordan@biology.gatech.edu Supplementary information: Supplementary data are available at Bioinformatics online. PMID:20519285
Equivalent Indels – Ambiguous Functional Classes and Redundancy in Databases
Assmus, Jens; Kleffe, Jürgen; Schmitt, Armin O.; Brockmann, Gudrun A.
2013-01-01
There is considerable interest in studying sequenced variations. However, while the positions of substitutions are uniquely identifiable by sequence alignment, the location of insertions and deletions still poses problems. Each insertion and deletion causes a change of sequence. Yet, due to low complexity or repetitive sequence structures, the same indel can sometimes be annotated in different ways. Two indels which differ in allele sequence and position can be one and the same, i.e. the alternative sequence of the whole chromosome is identical in both cases and, therefore, the two indels are biologically equivalent. In such a case, it is impossible to identify the exact position of an indel merely based on sequence alignment. Thus, variation entries in a mutation database are not necessarily uniquely defined. We prove the existence of a contiguous region around an indel in which all deletions of the same length are biologically identical. Databases often show only one of several possible locations for a given variation. Furthermore, different database entries can represent equivalent variation events. We identified 1,045,590 such problematic entries of insertions and deletions out of 5,860,408 indel entries in the current human database of Ensembl. Equivalent indels are found in sequence regions of different functions like exons, introns or 5' and 3' UTRs. One and the same variation can be assigned to several different functional classifications of which only one is correct. We implemented an algorithm that determines for each indel database entry its complete set of equivalent indels which is uniquely characterized by the indel itself and a given interval of the reference sequence. PMID:23658777
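The idea of a contiguous region in which all deletions of the same length are equivalent can be illustrated with a short sketch. This is not the authors' algorithm; it is a minimal version for deletions only, assuming a plain string reference and 0-based coordinates, and the function name `equivalent_deletion_starts` is hypothetical. It relies on one observation: moving a deletion window one base to the left or right leaves the resulting sequence unchanged exactly when the base entering the window equals the base leaving it.

```python
def equivalent_deletion_starts(ref, start, length):
    """Return every 0-based start at which deleting `length` bases from `ref`
    gives the same sequence as deleting ref[start:start + length]."""
    if length <= 0 or start < 0 or start + length > len(ref):
        raise ValueError("deletion must lie inside the reference")
    # Extend to the left while the base entering the window on the left
    # equals the base leaving it on the right.
    left = start
    while left > 0 and ref[left - 1] == ref[left - 1 + length]:
        left -= 1
    # Extend to the right by the symmetric argument.
    right = start
    while right + length < len(ref) and ref[right] == ref[right + length]:
        right += 1
    return list(range(left, right + 1))


if __name__ == "__main__":
    # Deleting any single T from the run in GATTTTC gives the same result.
    print(equivalent_deletion_starts("GATTTTC", 3, 1))
```

Insertions can be handled analogously by comparing the inserted allele with the flanking reference bases; that case is omitted here to keep the sketch short.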
Hansen, Loren; Kim, Nak-Kyeong; Mariño-Ramírez, Leonardo; Landsman, David
2011-01-01
Meiotic recombination is not distributed uniformly throughout the genome. There are regions of high and low recombination rates called hot and cold spots, respectively. The recombination rate parallels the frequency of DNA double-strand breaks (DSBs) that initiate meiotic recombination. The aim is to identify biological features associated with DSB frequency. We constructed vectors representing various chromatin and sequence-based features for 1179 DSB hot spots and 1028 DSB cold spots. Using a feature selection approach, we have identified five features that distinguish hot from cold spots in Saccharomyces cerevisiae with high accuracy, namely the histone marks H3K4me3, H3K14ac, H3K36me3, and H3K79me3; and GC content. Previous studies have associated H3K4me3, H3K36me3, and GC content with areas of mitotic recombination. H3K14ac and H3K79me3 are novel predictions and thus represent good candidates for further experimental study. We also show nucleosome occupancy maps produced using next generation sequencing exhibit a bias at DSB hot spots and this bias is strong enough to obscure biologically relevant information. A computational approach using feature selection can productively be used to identify promising biological associations. H3K14ac and H3K79me3 are novel predictions of chromatin marks associated with meiotic DSBs. Next generation sequencing can exhibit a bias that is strong enough to lead to incorrect conclusions. Care must be taken when interpreting high throughput sequencing data where systematic biases have been documented. PMID:22242140
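GC content is the one sequence-based feature among the five reported above. As a small illustration of how such a feature might be computed before it is placed in a feature vector, here is a sketch; the function name and the choice to ignore ambiguous bases such as N are assumptions, not details taken from the study.

```python
def gc_content(seq):
    """Fraction of G and C among the unambiguous bases (A, C, G, T) of `seq`."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    # Return 0.0 for empty or fully ambiguous input rather than dividing by zero.
    return gc / acgt if acgt else 0.0


if __name__ == "__main__":
    print(gc_content("GGCCAATT"))
```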
Whittaker, Gary R.; André, Nicole M.; Millet, Jean Kaoru
2018-01-01
The difficulties related to virus taxonomy have been amplified by recent advances in next-generation sequencing and metagenomics, prompting the field to revisit the question of what constitutes a useful viral classification. Here, taking a challenging classification found in coronaviruses, we argue that consideration of biological properties in addition to sequence-based demarcations is critical for generating useful taxonomy that recapitulates complex evolutionary histories. Within the Alphacoronavirus genus, the Alphacoronavirus 1 species encompasses several biologically distinct viruses. We carried out functionally based phylogenetic analysis, centered on the spike gene, which encodes the main surface antigen and primary driver of tropism and pathogenesis. Within the Alphacoronavirus 1 species, we identify clade A (encompassing serotype I feline coronavirus [FCoV] and canine coronavirus [CCoV]) and clade B (grouping serotype II FCoV and CCoV and transmissible gastroenteritis virus [TGEV]-like viruses). We propose this clade designation, along with the newly proposed Alphacoronavirus 2 species, as an improved way to classify the Alphacoronavirus genus. IMPORTANCE Our work focuses on improving the classification of the Alphacoronavirus genus. The Alphacoronavirus 1 species groups viruses of veterinary importance that infect distinct mammalian hosts and includes canine and feline coronaviruses and transmissible gastroenteritis virus. It is the prototype species of the Alphacoronavirus genus; however, it encompasses biologically distinct viruses. To better characterize this prototypical species, we performed phylogenetic analyses based on the sequences of the spike protein, one of the main determinants of tropism and pathogenesis, and reveal the existence of two subgroups or clades that fit with previously established serotype demarcations. We propose a new clade designation to better classify Alphacoronavirus 1 members. PMID:29299531
Mohorianu, Irina; Stocks, Matthew Benedict; Wood, John; Dalmay, Tamas; Moulton, Vincent
2013-01-01
Small RNAs (sRNAs) are 20–25 nt non-coding RNAs that act as guides for the highly sequence-specific regulatory mechanism known as RNA silencing. Due to the recent increase in sequencing depth, a highly complex and diverse population of sRNAs in both plants and animals has been revealed. However, the exponential increase in sequencing data has also made the identification of individual sRNA transcripts corresponding to biological units (sRNA loci) more challenging when based exclusively on the genomic location of the constituent sRNAs, hindering existing approaches to identify sRNA loci. To infer the location of significant biological units, we propose an approach for sRNA loci detection called CoLIde (Co-expression based sRNA Loci Identification) that combines genomic location with the analysis of other information such as variation in expression levels (expression pattern) and size class distribution. For CoLIde, we define a locus as a union of regions sharing the same pattern and located in close proximity on the genome. Biological relevance, detected through the analysis of size class distribution, is also calculated for each locus. CoLIde can be applied on ordered (e.g., time-dependent) or un-ordered (e.g., organ, mutant) series of samples both with or without biological/technical replicates. The method reliably identifies known types of loci and shows improved performance on sequencing data from both plants (e.g., A. thaliana, S. lycopersicum) and animals (e.g., D. melanogaster) when compared with existing locus detection techniques. CoLIde is available for use within the UEA Small RNA Workbench which can be downloaded from: http://srna-workbench.cmp.uea.ac.uk. PMID:23851377
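The locus definition given above (a union of regions that share the same expression pattern and lie close together on the genome) can be sketched as a simple merge over sorted regions. This is a simplified illustration, not the CoLIde implementation: the `Region` type, the pattern strings, and the `max_gap` threshold of 50 nt are all assumptions made for this example, and the size class significance test is not modeled.

```python
from typing import NamedTuple


class Region(NamedTuple):
    start: int
    end: int
    pattern: str  # hypothetical encoding of the expression pattern across samples


def merge_into_loci(regions, max_gap=50):
    """Merge consecutive regions that share a pattern and lie within `max_gap` nt."""
    loci = []
    for region in sorted(regions):
        last = loci[-1] if loci else None
        if last and last.pattern == region.pattern and region.start - last.end <= max_gap:
            # Same pattern and close enough: extend the current locus.
            loci[-1] = Region(last.start, max(last.end, region.end), last.pattern)
        else:
            loci.append(region)
    return loci


if __name__ == "__main__":
    example = [
        Region(100, 120, "UD"),
        Region(130, 150, "UD"),
        Region(400, 420, "UD"),
        Region(425, 440, "DU"),
    ]
    for locus in merge_into_loci(example):
        print(locus)
```

Note that in this sketch a region with a different pattern lying between two same-pattern regions interrupts the merge; whether that is the desired behavior depends on the definition of proximity used.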
DOE Office of Scientific and Technical Information (OSTI.GOV)
Liphardt, Jan
In April 1953, Watson and Crick largely defined the program of 20th century biology: obtaining the blueprint of life encoded in the DNA. Fifty years later, in 2003, the sequencing of the human genome was completed. Like any major scientific breakthrough, the sequencing of the human genome raised many more questions than it answered. I'll brief you on some of the big open problems in cell and developmental biology, and I'll explain why approaches, tools, and ideas from the physical sciences are currently reshaping biological research. Super-resolution light microscopies are revealing the intricate spatial organization of cells, single-molecule methods show how molecular machines function, and new probes are clarifying the role of mechanical forces in cell and tissue function. At the same time, Physics stands to gain beautiful new problems in soft condensed matter, quantum mechanics, and non-equilibrium thermodynamics.
Automated quantitative assessment of proteins' biological function in protein knowledge bases.
Mayr, Gabriele; Lepperdinger, Günter; Lackner, Peter
2008-01-01
Primary protein sequence data are archived in databases together with information regarding corresponding biological functions. In this respect, UniProt/Swiss-Prot is currently the most comprehensive collection and it is routinely cross-examined when trying to unravel the biological role of hypothetical proteins. Bioscientists frequently extract single entries and further evaluate those on a subjective basis. In lieu of a standardized procedure for scoring the existing knowledge regarding individual proteins, we report a computer-assisted method, which we applied to score the present knowledge about any given Swiss-Prot entry. Applying this quantitative score allows the comparison of proteins with respect to their sequence yet highlights the comprehension of functional data. pfs analysis may also be applied for quality control of individual entries or for database management in order to rank entry listings.
BioBrick assembly standards and techniques and associated software tools.
Røkke, Gunvor; Korvald, Eirin; Pahr, Jarle; Oyås, Ove; Lale, Rahmi
2014-01-01
The BioBrick idea was developed to introduce the engineering principles of abstraction and standardization into synthetic biology. BioBricks are DNA sequences that serve a defined biological function and can be readily assembled with any other BioBrick parts to create new BioBricks with novel properties. In order to achieve this, several assembly standards can be used. Which assembly standards a BioBrick is compatible with, depends on the prefix and suffix sequences surrounding the part. In this chapter, five of the most common assembly standards will be described, as well as some of the most used assembly techniques, cloning procedures, and a presentation of the available software tools that can be used for deciding on the best method for assembling of different BioBricks, and searching for BioBrick parts in the Registry of Standard Biological Parts database.
Robarts, Daniel W. H.; Wolfe, Andrea D.
2014-01-01
In the past few decades, many investigations in the field of plant biology have employed selectively neutral, multilocus, dominant markers such as inter-simple sequence repeat (ISSR), random-amplified polymorphic DNA (RAPD), and amplified fragment length polymorphism (AFLP) to address hypotheses at lower taxonomic levels. More recently, sequence-related amplified polymorphism (SRAP) markers have been developed, which are used to amplify coding regions of DNA with primers targeting open reading frames. These markers have proven to be robust and highly variable, on par with AFLP, and are attained through a significantly less technically demanding process. SRAP markers have been used primarily for agronomic and horticultural purposes, developing quantitative trait loci in advanced hybrids and assessing genetic diversity of large germplasm collections. Here, we suggest that SRAP markers should be employed for research addressing hypotheses in plant systematics, biogeography, conservation, ecology, and beyond. We provide an overview of the SRAP literature to date, review descriptive statistics of SRAP markers in a subset of 171 publications, and present relevant case studies to demonstrate the applicability of SRAP markers to the diverse field of plant biology. Results of these selected works indicate that SRAP markers have the potential to enhance the current suite of molecular tools in a diversity of fields by providing an easy-to-use, highly variable marker with inherent biological significance. PMID:25202637
Christen, Matthias; Del Medico, Luca; Christen, Heinz; Christen, Beat
2017-01-01
Recent advances in lower-cost DNA synthesis techniques have enabled new innovations in the field of synthetic biology. Still, efficient design and higher-order assembly of genome-scale DNA constructs remains a labor-intensive process. Given the complexity, computer assisted design tools that fragment large DNA sequences into fabricable DNA blocks are needed to pave the way towards streamlined assembly of biological systems. Here, we present the Genome Partitioner software implemented as a web-based interface that permits multi-level partitioning of genome-scale DNA designs. Without the need for specialized computing skills, biologists can submit their DNA designs to a fully automated pipeline that generates the optimal retrosynthetic route for higher-order DNA assembly. To test the algorithm, we partitioned a 783 kb Caulobacter crescentus genome design. We validated the partitioning strategy by assembling a 20 kb test segment encompassing a difficult to synthesize DNA sequence. Successful assembly from 1 kb subblocks into the 20 kb segment highlights the effectiveness of the Genome Partitioner for reducing synthesis costs and timelines for higher-order DNA assembly. The Genome Partitioner is broadly applicable to translate DNA designs into ready to order sequences that can be assembled with standardized protocols, thus offering new opportunities to harness the diversity of microbial genomes for synthetic biology applications. The Genome Partitioner web tool can be accessed at https://christenlab.ethz.ch/GenomePartitioner.
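The multi-level partitioning idea described in this abstract can be illustrated with a minimal sketch. It assumes fixed block sizes and a fixed overlap between neighbouring blocks; the function names `partition` and `multilevel_partition` and the overlap parameter are hypothetical and are not taken from the Genome Partitioner itself, which additionally optimizes the retrosynthetic route.

```python
def partition(length, block_size, overlap):
    """Return half-open (start, end) intervals that cover [0, length).

    Consecutive blocks share `overlap` bases (an assumption made here so
    that neighbouring blocks could be joined by homology-based assembly).
    """
    if not 0 <= overlap < block_size:
        raise ValueError("overlap must be >= 0 and smaller than block_size")
    step = block_size - overlap
    blocks = []
    start = 0
    while True:
        end = min(start + block_size, length)
        blocks.append((start, end))
        if end == length:
            break
        start += step
    return blocks


def multilevel_partition(length, segment_size, subblock_size, overlap):
    """Split a design into segments, then split each segment into subblocks.

    Returns a list of (segment_interval, [subblock_intervals]) pairs, with
    all coordinates expressed relative to the full design.
    """
    plan = []
    for seg_start, seg_end in partition(length, segment_size, overlap):
        subblocks = [
            (seg_start + s, seg_start + e)
            for s, e in partition(seg_end - seg_start, subblock_size, overlap)
        ]
        plan.append(((seg_start, seg_end), subblocks))
    return plan


# Toy sizes for readability; the abstract reports 1 kb subblocks
# assembled into a 20 kb segment.
print(partition(100, 40, 10))  # [(0, 40), (30, 70), (60, 100)]
```

Calling `multilevel_partition(783_000, 20_000, 1_000, 40)` would give a plan at the scale mentioned in the abstract; the overlap of 40 bases is purely illustrative.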
McDougall, Carmel; Woodcroft, Ben J.
2016-01-01
In nature, numerous mechanisms have evolved by which organisms fabricate biological structures with an impressive array of physical characteristics. Some examples of metazoan biological materials include the highly elastic byssal threads by which bivalves attach themselves to rocks, biomineralized structures that form the skeletons of various animals, and spider silks that are renowned for their exceptional strength and elasticity. The remarkable properties of silks, which are perhaps the best studied biological materials, are the result of the highly repetitive, modular, and biased amino acid composition of the proteins that compose them. Interestingly, similar levels of modularity/repetitiveness and similar bias in amino acid compositions have been reported in proteins that are components of structural materials in other organisms; however, the exact nature and extent of this similarity, and its functional and evolutionary relevance, are unknown. Here, we investigate this similarity and use sequence features common to silks and other known structural proteins to develop a bioinformatics-based method to identify similar proteins from large-scale transcriptome and whole-genome datasets. We show that a large number of proteins identified using this method have roles in biological material formation throughout the animal kingdom. Despite the similarity in sequence characteristics, most of the silk-like structural proteins (SLSPs) identified in this study appear to have evolved independently and are restricted to a particular animal lineage. Although the exact function of many of these SLSPs is unknown, the apparent independent evolution of proteins with similar sequence characteristics in divergent lineages suggests that these features are important for the assembly of biological materials. The identification of these characteristics enables the generation of testable hypotheses regarding the mechanisms by which these proteins assemble and direct the construction of biological materials with diverse morphologies. The SilkSlider predictor software developed here is available at https://github.com/wwood/SilkSlider. PMID:27415783
DOE Office of Scientific and Technical Information (OSTI.GOV)
Khmaladze, Ekaterine; Dzavashvili, Giorgi; Chanturia, Gvantsa
Bacillus anthracis causes the acute fatal disease anthrax, is a proven biological weapon, and is endemic in Georgia, where human and animal cases are reported annually. Furthermore, we present whole-genome sequences of 10 historical B. anthracis strains from Georgia.
Regulatory sequence of cupin family gene
Hood, Elizabeth; Teoh, Thomas
2017-07-25
This invention is in the field of plant biology and agriculture and relates to novel seed specific promoter regions. The present invention further provides methods of producing proteins and other products of interest and methods of controlling expression of nucleic acid sequences of interest using the seed specific promoter regions.
Chemical and radiation mutagenesis: Induction and detection by whole genome sequencing
USDA-ARS's Scientific Manuscript database
Brachypodium distachyon has emerged as an effective model system to address fundamental questions in grass biology. With its small sequenced genome, short generation time and rapidly expanding array of genetic tools B. distachyon is an ideal system to elucidate the molecular basis of important trai...
Prospects: the tomato genome as a cornerstone for gene discovery
USDA-ARS's Scientific Manuscript database
Those involved in the international tomato genome sequencing effort contributed to not only the development of an important genome sequence relevant to a major economic and nutritional crop, but also to the tomato experimental system as a model for plant biology. Without question, prior seminal work...
USDA-ARS's Scientific Manuscript database
Genome evolution influences a parasite's pathogenicity, host-pathogen interactions, environmental constraints, and invasion biology, while genome assemblies form the basis of comparative sequence analyses. Given that closely related organisms typically maintain appreciable synteny, the genome asse...
Practical applications of next-generation sequencing for food-safety research
USDA-ARS's Scientific Manuscript database
Next-generation sequencing (NGS) is a transformative technology that is revolutionizing the biological sciences. However, many researchers remain uncertain as to the best ways to harness the power of NGS and apply it to their own research questions. Here we highlight three case studies of how NGS ...
Genome sequencing efforts in the past decade were aimed at generating draft sequences of many prokaryotic and eukaryotic model organisms. Successful completion of the unicellular eukaryote, worm, fly and human genomes has opened up the new field of molecular biology and function...
USDA-ARS's Scientific Manuscript database
Next-generation sequencing (NGS) technologies are revolutionizing both medical and biological research through generation of massive SNP data sets for identifying heritable genome variation underlying key traits, from rare human diseases to important agronomic phenotypes in crop species. We evaluate...
Khmaladze, Ekaterine; Dzavashvili, Giorgi; Chanturia, Gvantsa; ...
2017-05-11
Bacillus anthracis causes the acute fatal disease anthrax, is a proven biological weapon, and is endemic in Georgia, where human and animal cases are reported annually. Furthermore, we present whole-genome sequences of 10 historical B. anthracis strains from Georgia.
Toxic plants: Effects on reproduction and fetal and embryonic development in livestock
USDA-ARS's Scientific Manuscript database
Reproductive success is dependent on a large number of carefully orchestrated biological events that must occur in a specifically timed sequence. The interference with one or more of these sequences or events may result in total reproductive failure or a more subtle reduction in reproductive potent...
The role of next generation sequencing for the development and testing of veterinary biologics
USDA-ARS's Scientific Manuscript database
Next generation sequencing technology has become widely available and it offers many new opportunities in vaccine technology. Both human and veterinary medicine have numerous examples of adventitious agents being found in live vaccines. In veterinary medicine a continuing trend is the use of viral ...
USDA-ARS's Scientific Manuscript database
Advances in long-read, single molecule real-time sequencing technology and analysis software over the last two years have enabled the efficient production of closed bacterial genome sequences. However, consistent annotation of these genomes has lagged behind the ability to create them, while the avai...
RiboMaker: computational design of conformation-based riboregulation.
Rodrigo, Guillermo; Jaramillo, Alfonso
2014-09-01
The ability to engineer control systems of gene expression is instrumental for synthetic biology. Thus, bioinformatic methods that assist such engineering are appealing because they can guide the sequence design and prevent costly experimental screening. In particular, RNA is an ideal substrate to de novo design regulators of protein expression by following sequence-to-function models. We have implemented a novel algorithm, RiboMaker, aimed at the computational, automated design of bacterial riboregulation. RiboMaker reads the sequence and structure specifications, which codify for a gene regulatory behaviour, and optimizes the sequences of a small regulatory RNA and a 5'-untranslated region for an efficient intermolecular interaction. To this end, it implements an evolutionary design strategy, where random mutations are selected according to a physicochemical model based on free energies. The resulting sequences can then be tested experimentally, providing a new tool for synthetic biology, and also for investigating the riboregulation principles in natural systems. Web server is available at http://ribomaker.jaramillolab.org/. Source code, instructions and examples are freely available for download at http://sourceforge.net/projects/ribomaker/. © The Author 2014. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
Rapid construction of insulated genetic circuits via synthetic sequence-guided isothermal assembly
DOE Office of Scientific and Technical Information (OSTI.GOV)
Torella, JP; Boehm, CR; Lienert, F
2013-12-28
In vitro recombination methods have enabled one-step construction of large DNA sequences from multiple parts. Although synthetic biological circuits can in principle be assembled in the same fashion, they typically contain repeated sequence elements such as standard promoters and terminators that interfere with homologous recombination. Here we use a computational approach to design synthetic, biologically inactive unique nucleotide sequences (UNSes) that facilitate accurate ordered assembly. Importantly, our designed UNSes make it possible to assemble parts with repeated terminator and insulator sequences, and thereby create insulated functional genetic circuits in bacteria and mammalian cells. Using UNS-guided assembly to construct repeating promoter-gene-terminator parts, we systematically varied gene expression to optimize production of a deoxychromoviridans biosynthetic pathway in Escherichia coli. We then used this system to construct complex eukaryotic AND-logic gates for genomic integration into embryonic stem cells. Construction was performed by using a standardized series of UNS-bearing BioBrick-compatible vectors, which enable modular assembly and facilitate reuse of individual parts. UNS-guided isothermal assembly is broadly applicable to the construction and optimization of genetic circuits and particularly those requiring tight insulation, such as complex biosynthetic pathways, sensors, counters and logic gates.
An in silico DNA cloning experiment for the biochemistry laboratory.
Elkins, Kelly M
2011-01-01
This laboratory exercise introduces students to concepts in recombinant DNA technology while accommodating a major semester project in protein purification, structure, and function in a biochemistry laboratory for junior- and senior-level undergraduate students. It is also suitable for forensic science courses focused on DNA biology and advanced high school biology classes. Students begin by examining a plasmid map with the goal of identifying which restriction enzymes may be used to clone a piece of foreign DNA containing a gene of interest into the vector. From the National Center for Biotechnology Information website, students are instructed to retrieve a protein sequence and use Expasy's Reverse Translate program to reverse translate the protein to cDNA. Students then use Integrated DNA Technologies' OligoAnalyzer to predict the complementary DNA strand and obtain DNA recognition sequences for the desired restriction enzymes from New England Biolabs' website. Students add the appropriate DNA restriction sequences to the double-stranded foreign DNA for cloning into the plasmid and infecting Escherichia coli cells. Students are introduced to computational biology tools, molecular biology terminology and the process of DNA cloning in this valuable single session, in silico experiment. This project develops students' understanding of the cloning process as a whole and contrasts with other laboratory and internship experiences in which the students may be involved in only a piece of the cloning process/techniques. Students interested in pursuing postgraduate study and research or employment in an academic biochemistry or molecular biology laboratory or industry will benefit most from this experience. Copyright © 2010 Wiley Periodicals, Inc.
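The steps of this in silico exercise (reverse translation, prediction of the complementary strand, and addition of restriction sites) can be mimicked in a few lines. The sketch below is not the Expasy or OligoAnalyzer tooling named in the abstract; the function names are hypothetical, the codon table picks one arbitrary codon per amino acid from the standard genetic code, and EcoRI (GAATTC) and BamHI (GGATCC) are used only as familiar example sites.

```python
# One codon per amino acid (standard genetic code); the choice among
# synonymous codons is arbitrary and made only for illustration.
CODON = {
    "A": "GCG", "R": "CGT", "N": "AAC", "D": "GAT", "C": "TGC",
    "Q": "CAG", "E": "GAA", "G": "GGC", "H": "CAT", "I": "ATT",
    "L": "CTG", "K": "AAA", "M": "ATG", "F": "TTT", "P": "CCG",
    "S": "AGC", "T": "ACC", "W": "TGG", "Y": "TAT", "V": "GTG",
    "*": "TAA",
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_translate(protein):
    """Map each residue (or '*' for stop) to a single representative codon."""
    return "".join(CODON[aa] for aa in protein.upper())


def reverse_complement(dna):
    """Return the complementary strand, written 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def add_flanking_sites(insert, site5, site3):
    """Flank the insert with restriction sites, refusing internal sites."""
    for site in (site5, site3):
        if site in insert:
            raise ValueError(f"insert already contains the site {site}")
    return site5 + insert + site3


coding = reverse_translate("MKV*")
print(coding)                                            # ATGAAAGTGTAA
print(reverse_complement(coding))                        # TTACACTTTCAT
print(add_flanking_sites(coding, "GAATTC", "GGATCC"))    # GAATTCATGAAAGTGTAAGGATCC
```

The internal-site check reflects the reasoning students apply to the plasmid map: an enzyme that also cuts inside the gene of interest cannot be used for cloning it intact.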
A Multicenter Study To Evaluate the Performance of High-Throughput Sequencing for Virus Detection
Ng, Siemon H. S.; Vandeputte, Olivier; Aljanahi, Aisha; Deyati, Avisek; Cassart, Jean-Pol; Charlebois, Robert L.; Taliaferro, Lanyn P.
2017-01-01
ABSTRACT The capability of high-throughput sequencing (HTS) for detection of known and unknown viruses makes it a powerful tool for broad microbial investigations, such as evaluation of novel cell substrates that may be used for the development of new biological products. However, like any new assay, regulatory applications of HTS need method standardization. Therefore, our three laboratories initiated a study to evaluate performance of HTS for potential detection of viral adventitious agents by spiking model viruses in different cellular matrices to mimic putative materials for manufacturing of biologics. Four model viruses were selected based upon different physical and biochemical properties and commercial availability: human respiratory syncytial virus (RSV), Epstein-Barr virus (EBV), feline leukemia virus (FeLV), and human reovirus (REO). Additionally, porcine circovirus (PCV) was tested by one laboratory. Independent samples were prepared for HTS by spiking intact viruses or extracted viral nucleic acids, singly or mixed, into different HeLa cell matrices (resuspended whole cells, cell lysate, or total cellular RNA). Data were obtained using different sequencing platforms (Roche 454, Illumina HiSeq1500 or HiSeq2500). Bioinformatic analyses were performed independently by each laboratory using available tools, pipelines, and databases. The results showed that comparable virus detection was obtained in the three laboratories regardless of sample processing, library preparation, sequencing platform, and bioinformatic analysis: between 0.1 and 3 viral genome copies per cell were detected for all of the model viruses used. This study highlights the potential for using HTS for sensitive detection of adventitious viruses in complex biological samples containing cellular background. 
IMPORTANCE Recent high-throughput sequencing (HTS) investigations have resulted in unexpected discoveries of known and novel viruses in a variety of sample types, including research materials, clinical materials, and biological products. Therefore, HTS can be a powerful tool for supplementing current methods for demonstrating the absence of adventitious or unwanted viruses in biological products, particularly when using a new cell line. However, HTS is a complex technology with different platforms, which needs standardization for evaluation of biologics. This collaborative study was undertaken to investigate detection of different virus types using two different HTS platforms. The results of the independently performed studies demonstrated a similar sensitivity of virus detection, regardless of the different sample preparation and processing procedures and bioinformatic analyses done in the three laboratories. Comparable HTS detection of different virus types supports future development of reference virus materials for standardization and validation of different HTS platforms. PMID:28932815
Comparison of Methods of Detection of Exceptional Sequences in Prokaryotic Genomes.
Rusinov, I S; Ershova, A S; Karyagina, A S; Spirin, S A; Alexeevski, A V
2018-02-01
Many proteins need recognition of specific DNA sequences for functioning. The number of recognition sites and their distribution along the DNA might be of biological importance. For example, the number of restriction sites is often reduced in prokaryotic and phage genomes to decrease the probability of DNA cleavage by restriction endonucleases. We call a sequence an exceptional one if its frequency in a genome significantly differs from one predicted by some mathematical model. An exceptional sequence could be either under- or over-represented, depending on its frequency in comparison with the predicted one. Exceptional sequences could be considered biologically meaningful, for example, as targets of DNA-binding proteins or as parts of abundant repetitive elements. Several methods are used to predict the frequency of a short sequence in a genome from the actual frequencies of certain of its subsequences. The most popular are methods based on Markov chain models. However, no rigorous comparison of the methods has previously been performed. We compared three methods for the prediction of short sequence frequencies: the maximum-order Markov chain model-based method, the method that uses geometric mean of extended Markovian estimates, and the method that utilizes frequencies of all subsequences including discontiguous ones. We applied them to restriction sites in complete genomes of 2500 prokaryotic species and demonstrated that the results depend greatly on the method used: lists of 5% of the most under-represented sites differed by up to 50%. The method designed by Burge and coauthors in 1992, which utilizes all subsequences of the sequence, showed a higher precision than the other two methods both on prokaryotic genomes and randomly generated sequences after computational imitation of selective pressure. We propose this method as the first choice for detection of exceptional sequences in prokaryotic genomes.
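As a minimal sketch of the first of the three methods compared here, the maximum-order Markov chain estimate predicts the count of a word w = w1...wk from its two longest proper subwords: E(w) = N(w1...wk-1) * N(w2...wk) / N(w2...wk-1). The code below assumes a single DNA strand and counts overlapping occurrences; the function names are hypothetical, and the other two methods in the abstract are not implemented.

```python
def count_overlapping(genome, word):
    """Count occurrences of `word` in `genome`, allowing overlaps."""
    count = 0
    start = genome.find(word)
    while start != -1:
        count += 1
        start = genome.find(word, start + 1)
    return count


def expected_count_max_order(genome, word):
    """Maximum-order Markov estimate of the expected count of `word`.

    Uses the counts of the prefix (word without its last letter), the
    suffix (word without its first letter) and the shared core.
    """
    if len(word) < 3:
        raise ValueError("word must have at least 3 letters")
    n_core = count_overlapping(genome, word[1:-1])
    if n_core == 0:
        return 0.0
    n_prefix = count_overlapping(genome, word[:-1])
    n_suffix = count_overlapping(genome, word[1:])
    return n_prefix * n_suffix / n_core


def observed_to_expected(genome, word):
    """Ratio below 1 suggests under-representation, above 1 over-representation."""
    expected = expected_count_max_order(genome, word)
    if expected == 0:
        return None  # the model gives no usable prediction
    return count_overlapping(genome, word) / expected


toy_genome = "ACGTACGTAC"
print(expected_count_max_order(toy_genome, "CGT"))  # 2.0
print(observed_to_expected(toy_genome, "CGT"))      # 1.0
```

On a real genome one would typically count both strands and attach a significance measure to the ratio; both are left out of this sketch.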
2011-01-01
Background BAC-based physical maps provide for sequencing across an entire genome or a selected sub-genomic region of biological interest. Such a region can be approached with next-generation whole-genome sequencing and assembly as if it were an independent small genome. Using the minimum tiling path as a guide, specific BAC clones representing the prioritized genomic interval are selected, pooled, and used to prepare a sequencing library. Results This pooled BAC approach was taken to sequence and assemble a QTL-rich region, of ~3 Mbp and represented by twenty-seven BACs, on linkage group 5 of the Theobroma cacao cv. Matina 1-6 genome. Using various mixtures of read coverages from paired-end and linear 454 libraries, multiple assemblies of varied quality were generated. Quality was assessed by comparing the assembly of 454 reads with a subset of ten BACs individually sequenced and assembled using Sanger reads. A mixture of reads optimal for assembly was identified. We found, furthermore, that a quality assembly suitable for serving as a reference genome template could be obtained even with a reduced depth of sequencing coverage. Annotation of the resulting assembly revealed several genes potentially responsible for three T. cacao traits: black pod disease resistance, bean shape index, and pod weight. Conclusions Our results, as with other pooled BAC sequencing reports, suggest that pooling portions of a minimum tiling path derived from a BAC-based physical map is an effective method to target sub-genomic regions for sequencing. While we focused on a single QTL region, other QTL regions of importance could be similarly sequenced allowing for biological discovery to take place before a high quality whole-genome assembly is completed. PMID:21794110
Feltus, Frank A; Saski, Christopher A; Mockaitis, Keithanne; Haiminen, Niina; Parida, Laxmi; Smith, Zachary; Ford, James; Staton, Margaret E; Ficklin, Stephen P; Blackmon, Barbara P; Cheng, Chun-Huai; Schnell, Raymond J; Kuhn, David N; Motamayor, Juan-Carlos
2011-07-27
BAC-based physical maps provide for sequencing across an entire genome or a selected sub-genomic region of biological interest. Such a region can be approached with next-generation whole-genome sequencing and assembly as if it were an independent small genome. Using the minimum tiling path as a guide, specific BAC clones representing the prioritized genomic interval are selected, pooled, and used to prepare a sequencing library. This pooled BAC approach was taken to sequence and assemble a QTL-rich region, of ~3 Mbp and represented by twenty-seven BACs, on linkage group 5 of the Theobroma cacao cv. Matina 1-6 genome. Using various mixtures of read coverages from paired-end and linear 454 libraries, multiple assemblies of varied quality were generated. Quality was assessed by comparing the assembly of 454 reads with a subset of ten BACs individually sequenced and assembled using Sanger reads. A mixture of reads optimal for assembly was identified. We found, furthermore, that a quality assembly suitable for serving as a reference genome template could be obtained even with a reduced depth of sequencing coverage. Annotation of the resulting assembly revealed several genes potentially responsible for three T. cacao traits: black pod disease resistance, bean shape index, and pod weight. Our results, as with other pooled BAC sequencing reports, suggest that pooling portions of a minimum tiling path derived from a BAC-based physical map is an effective method to target sub-genomic regions for sequencing. While we focused on a single QTL region, other QTL regions of importance could be similarly sequenced allowing for biological discovery to take place before a high quality whole-genome assembly is completed.
RNAcentral: an international database of ncRNA sequences
Williams, Kelly Porter
2014-10-28
The field of non-coding RNA biology has been hampered by the lack of availability of a comprehensive, up-to-date collection of accessioned RNA sequences. Here we present the first release of RNAcentral, a database that collates and integrates information from an international consortium of established RNA sequence databases. The initial release contains over 8.1 million sequences, including representatives of all major functional classes. A web portal (http://rnacentral.org) provides free access to data, search functionality, cross-references, source code and an integrated genome browser for selected species.
Sequencing technologies - the next generation.
Metzker, Michael L
2010-01-01
Demand has never been greater for revolutionary technologies that deliver fast, inexpensive and accurate genome information. This challenge has catalysed the development of next-generation sequencing (NGS) technologies. The inexpensive production of large volumes of sequence data is the primary advantage over conventional methods. Here, I present a technical review of template preparation, sequencing and imaging, genome alignment and assembly approaches, and recent advances in current and near-term commercially available NGS instruments. I also outline the broad range of applications for NGS technologies, in addition to providing guidelines for platform selection to address biological questions of interest.
Koloniuk, Igor; Fránová, Jana; Sarkisova, Tatiana; Přibylová, Jaroslava
2018-05-04
Strawberry crinkle disease is one of the major diseases that threatens strawberry production. Although the biological properties of the agent, strawberry crinkle virus (SCV), have been thoroughly investigated, its complete genome sequence has never been published. Existing RT-PCR-based detection relies on a partial sequence of the L protein gene, presumably the least expressed viral gene. Here, we present complete sequences of two divergent SCV isolates co-infecting a single plant, Fragaria x ananassa cv. Čačanská raná.
Sequence determination and analysis of the NSs genes of two tospoviruses.
Hallwass, Mariana; Leastro, Mikhail O; Lima, Mirtes F; Inoue-Nagata, Alice K; Resende, Renato O
2012-03-01
The tospoviruses groundnut ringspot virus (GRSV) and zucchini lethal chlorosis virus (ZLCV) cause severe losses in many crops, especially in solanaceous and cucurbit species. In this study, the non-structural NSs gene and the 5'UTRs of these two biologically distinct tospoviruses were cloned and sequenced. The NSs sequences of GRSV and ZLCV were both 1,404 nucleotides long. Pairwise comparison showed that the NSs amino acid sequence of GRSV shared 69.6% identity with that of ZLCV and 75.9% identity with that of TSWV, while the NSs sequences of ZLCV and TSWV shared 67.9% identity. Phylogenetic analysis based on NSs sequences confirmed that these viruses cluster in the American clade.
Biological Nanoplatforms for Self-Assembled Electronics
2015-03-24
as M13 , a virus that infects Escherichia coli. Approximately one billion different amino acid sequences are displayed on different viruses in the...sequence when contained within a phage M13 coat protein sequence, not chemically linked to the surface of phage MS2 VLPs. Thus, binding properties may...gallium arsenide in a bacteriophage M13 phage display library, MS2 VLPs modified with the metal binding peptides do not display the same activity
ERIC Educational Resources Information Center
Becker, David
2005-01-01
In spring of 1998, the Biology Department at Pomona College changed from a two-semester survey introductory biology sequence to a core set of three courses, none of which is a traditional survey course. They had been wrestling for several years with a number of issues regarding the survey courses, including (1) what topics to include and exclude;…
Optimal network alignment with graphlet degree vectors.
Milenković, Tijana; Ng, Weng Leong; Hayes, Wayne; Przulj, Natasa
2010-06-30
Important biological information is encoded in the topology of biological networks. Comparative analyses of biological networks are proving to be valuable, as they can lead to transfer of knowledge between species and give deeper insights into biological function, disease, and evolution. We introduce a new method that uses the Hungarian algorithm to produce optimal global alignment between two networks using any cost function. We design a cost function based solely on network topology and use it in our network alignment. Our method can be applied to any two networks, not just biological ones, since it is based only on network topology. We use our new method to align protein-protein interaction networks of two eukaryotic species and demonstrate that our alignment exposes large and topologically complex regions of network similarity. At the same time, our alignment is biologically valid, since many of the aligned protein pairs perform the same biological function. From the alignment, we predict function of yet unannotated proteins, many of which we validate in the literature. Also, we apply our method to find topological similarities between metabolic networks of different species and build phylogenetic trees based on our network alignment score. The phylogenetic trees obtained in this way bear a striking resemblance to the ones obtained by sequence alignments. Our method detects topologically similar regions in large networks that are statistically significant. It does this independent of protein sequence or any other information external to network topology.
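The core step described in this abstract, an optimal one-to-one node alignment under an arbitrary cost function, can be sketched with a compact Hungarian (Kuhn-Munkres) implementation. This is an illustration under stated assumptions, not the authors' software: the per-node signatures below are one-element tuples standing in for graphlet degree vectors, the L1 distance is an assumed cost, and the names `topology_cost` and `hungarian` are hypothetical.

```python
def topology_cost(sig_a, sig_b):
    """Cost matrix: L1 distance between per-node signature vectors."""
    return [[sum(abs(x - y) for x, y in zip(a, b)) for b in sig_b]
            for a in sig_a]


def hungarian(cost):
    """Minimum-cost assignment of rows to columns in O(n^2 * m) time.

    Requires len(cost) <= len(cost[0]).
    Returns (assignment, total), where assignment[i] is the column of row i.
    """
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    u = [0] * (n + 1)    # row potentials
    v = [0] * (m + 1)    # column potentials
    p = [0] * (m + 1)    # p[j]: row matched to column j (1-based, 0 = free)
    way = [0] * (m + 1)  # previous column on the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while True:  # flip matches along the augmenting path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
            if j0 == 0:
                break
    assignment = [None] * n
    for j in range(1, m + 1):
        if p[j]:
            assignment[p[j] - 1] = j - 1
    total = sum(cost[i][assignment[i]] for i in range(n))
    return assignment, total


# Toy signatures: node degrees in two three-node networks.
cost = topology_cost([(1,), (2,), (3,)], [(3,), (1,), (2,)])
print(hungarian(cost))  # ([1, 2, 0], 0)
```

Because the assignment step only sees the cost matrix, any cost function can be swapped in, which matches the abstract's claim that the method works with any cost function and on any two networks.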
Meyers, Steven R.; Khoo, Xiaojuan; Huang, Xin; Walsh, Elisabeth B.; Grinstaff, Mark W.; Kenan, Daniel J.
2013-01-01
Biomaterials used in implants have traditionally been selected based on their mechanical properties, chemical stability, and biocompatibility. However, the durability and clinical efficacy of implantable biomedical devices remains limited in part due to the absence of appropriate biological interactions at the implant interface and the lack of integration into adjacent tissues. Herein, we describe a robust peptide-based coating technology capable of modifying the surface of existing biomaterials and medical devices through the non-covalent binding of modular biofunctional peptides. These peptides contain at least one material binding sequence and at least one biologically active sequence and thus are termed, “Interfacial Biomaterials” (IFBMs). IFBMs can simultaneously bind the biomaterial surface while endowing it with desired biological functionalities at the interface between the material and biological realms. We demonstrate the capabilities of model IFBMs to convert native polystyrene, a bioinert surface, into a bioactive surface that can support a range of cell activities. We further distinguish between simple cell attachment with insufficient integrin interactions, which in some cases can adversely impact downstream biology, versus biologically appropriate adhesion, cell spreading, and cell survival mediated by IFBMs. Moreover, we show that we can use the coating technology to create spatially resolved patterns of fluorophores and cells on substrates and that these patterns retain their borders in culture. PMID:18929406
The emerging genomics and systems biology research lead to systems genomics studies.
Yang, Mary Qu; Yoshigoe, Kenji; Yang, William; Tong, Weida; Qin, Xiang; Dunker, A; Chen, Zhongxue; Arabnia, Hamid R; Liu, Jun S; Niemierko, Andrzej; Yang, Jack Y
2014-01-01
Synergistically integrating multi-layer genomic data at systems level not only can lead to deeper insights into the molecular mechanisms related to disease initiation and progression, but also can guide pathway-based biomarker and drug target identification. With the advent of high-throughput next-generation sequencing technologies, sequencing both DNA and RNA has generated multi-layer genomic data that can provide DNA polymorphism, non-coding RNA, messenger RNA, gene expression, isoform and alternative splicing information. Systems biology on the other hand studies complex biological systems, particularly systematic study of complex molecular interactions within specific cells or organisms. Genomics and molecular systems biology can be merged into the study of genomic profiles and implicated biological functions at cellular or organism level. The prospectively emerging field can be referred to as systems genomics or genomic systems biology. The Mid-South Bioinformatics Centre (MBC) and Joint Bioinformatics Ph.D. Program of University of Arkansas at Little Rock and University of Arkansas for Medical Sciences are particularly interested in promoting education and research advancement in this prospectively emerging field. Based on past investigations and research outcomes, MBC is further utilizing differential gene and isoform/exon expression from RNA-seq and co-regulation from the ChIP-seq specific for different phenotypes in combination with protein-protein interactions, and protein-DNA interactions to construct high-level gene networks for an integrative genome-phenome investigation at systems biology level.
75 FR 62820 - Screening Framework Guidance for Providers of Synthetic Double-Stranded DNA
Federal Register 2010, 2011, 2012, 2013, 2014
2010-10-13
... I. Summary Synthetic biology, the developing interdisciplinary field that focuses on both the design and fabrication of novel biological components and systems as well as the re-design and fabrication of... develop, maintain, and document protocols to determine if a sequence ``hit'' qualifies as a true...
Visualising "Junk" DNA through Bioinformatics
ERIC Educational Resources Information Center
Elwess, Nancy L.; Latourelle, Sandra M.; Cauthorn, Olivia
2005-01-01
One of the hottest areas of science today is the field in which biology, information technology, and computer science are merged into a single discipline called bioinformatics. This field enables the discovery and analysis of biological data, including nucleotide and amino acid sequences that are easily accessed through the use of computers. As…
Mobile element biology – new possibilities with high-throughput sequencing
Xing, Jinchuan; Witherspoon, David J.; Jorde, Lynn B.
2014-01-01
Mobile elements compose more than half of the human genome, but until recently their large-scale detection was time-consuming and challenging. With the development of new high-throughput sequencing technologies, the complete spectrum of mobile element variation in humans can now be identified and analyzed. Thousands of new mobile element insertions have been discovered, yielding new insights into mobile element biology, evolution, and genomic variation. We review several high-throughput methods, with an emphasis on techniques that specifically target mobile element insertions in humans, and we highlight recent applications of these methods in evolutionary studies and in the analysis of somatic alterations in human cancers. PMID:23312846
NASA Astrophysics Data System (ADS)
Arit, Turkan; Keskin, Burak; Firuzan, Esin; Cavas, Cagin Kandemir; Liu, Liwei; Cavas, Levent
2018-04-01
The report entitled "L. Liu, D. Li, F. Bai, A relative Lempel-Ziv complexity: Application to comparing biological sequences, Chem. Phys. Lett. 530 (2012) 107-112" describes the construction of phylogenetic trees based on the Lempel-Ziv algorithm as a powerful approach. However, the method explained in that paper does not give promising results on a data set of invasive Caulerpa taxifolia in the Mediterranean Sea. In this short note, phylogenetic trees are obtained with the method proposed in the aforementioned paper.
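For readers unfamiliar with the quantity involved, the following is a minimal Python sketch of the classical Lempel-Ziv (1976) phrase-counting complexity, together with a simple normalized distance built from it. It is not the relative complexity measure of Liu et al., whose definition is not given in this abstract; the function names `lz76_complexity` and `lz_distance` and the particular distance formula are assumptions chosen for illustration.

```python
def lz76_complexity(seq: str) -> int:
    """Number of phrases in the Lempel-Ziv (1976) exhaustive parsing of seq."""
    n = len(seq)
    i = 0            # start of the current phrase
    phrases = 0
    while i < n:
        length = 1
        # Extend the phrase while it can still be copied from earlier text.
        # The copy may overlap the phrase itself, hence the bound i+length-1.
        while i + length <= n and seq[i:i + length] in seq[:i + length - 1]:
            length += 1
        phrases += 1
        i += length
    return phrases


def lz_distance(s: str, q: str) -> float:
    """Illustrative normalized distance (an assumption, not the cited method):
    how many extra phrases each sequence needs once the other is known."""
    cs, cq = lz76_complexity(s), lz76_complexity(q)
    extra = max(lz76_complexity(s + q) - cs, lz76_complexity(q + s) - cq)
    return extra / max(cs, cq)


if __name__ == "__main__":
    print(lz76_complexity("0001101001000101"))
    print(lz_distance("ACGTACGTAC", "TTGACCATGA"))
```

A matrix of such pairwise distances could then be passed to any standard tree-building method; which method the note's authors used is not stated here.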
Exploitation of peptide motif sequences and their use in nanobiotechnology.
Shiba, Kiyotaka
2010-08-01
Short amino acid sequences extracted from natural proteins or created using in vitro evolution systems are sometimes associated with particular biological functions. These peptides, called peptide motifs, can serve as functional units for the creation of various tools for nanobiotechnology. In particular, peptide motifs that have the ability to specifically recognize the surfaces of solid materials and to mineralize certain inorganic materials have been linking biological science to material science. Here, I review how these peptide motifs have been isolated from natural proteins or created using in vitro evolution systems, and how they have been used in the nanobiotechnology field. Copyright © 2010 Elsevier Ltd. All rights reserved.
Martinez, Carlos A.; Barr, Kenneth; Kim, Ah-Ram; Reinitz, John
2013-01-01
Synthetic biology offers novel opportunities for elucidating transcriptional regulatory mechanisms and enhancer logic. Complex cis-regulatory sequences—like the ones driving expression of the Drosophila even-skipped gene—have proven difficult to design from existing knowledge, presumably due to the large number of protein-protein interactions needed to drive the correct expression patterns of genes in multicellular organisms. This work discusses two novel computational methods for the custom design of enhancers that employ a sophisticated, empirically validated transcriptional model, optimization algorithms, and synthetic biology. These synthetic elements have both utilitarian and academic value, including improving existing regulatory models as well as evolutionary questions. The first method involves the use of simulated annealing to explore the sequence space for synthetic enhancers whose expression output fit a given search criterion. The second method uses a novel optimization algorithm to find functionally accessible pathways between two enhancer sequences. These paths describe a set of mutations wherein the predicted expression pattern does not significantly vary at any point along the path. Both methods rely on a predictive mathematical framework that maps the enhancer sequence space to functional output. PMID:23732772
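The first method described above, simulated annealing over sequence space, can be sketched in Python as follows. The transcriptional model of the paper is not reproduced here: `toy_cost` is a hypothetical stand-in (Hamming distance to a fixed target string), and the cooling schedule, step count and function names are assumptions for illustration only.

```python
import math
import random

ALPHABET = "ACGT"


def toy_cost(seq, target):
    """Hypothetical stand-in for a model-based fit criterion: Hamming distance."""
    return sum(a != b for a, b in zip(seq, target))


def anneal(target, steps=20000, t_start=2.0, t_end=0.01, seed=0):
    """Search sequence space by single-base mutations with Metropolis acceptance."""
    rng = random.Random(seed)
    current = [rng.choice(ALPHABET) for _ in target]
    cost = toy_cost(current, target)
    best, best_cost = current[:], cost
    for step in range(steps):
        # Geometric cooling from t_start down to t_end.
        temp = t_start * (t_end / t_start) ** (step / max(steps - 1, 1))
        pos = rng.randrange(len(current))
        old = current[pos]
        current[pos] = rng.choice(ALPHABET)
        new_cost = toy_cost(current, target)
        delta = new_cost - cost
        # Always accept improvements; accept worse moves with prob exp(-delta/T).
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            cost = new_cost
            if cost < best_cost:
                best, best_cost = current[:], cost
        else:
            current[pos] = old  # reject: undo the mutation
    return "".join(best), best_cost


if __name__ == "__main__":
    seq, cost = anneal("ACGTTGCAACGTTGCAACGTTGCAACGTTG")
    print(seq, cost)
```

In a realistic setting the cost function would compare a predicted expression pattern with the desired one, which is far more expensive to evaluate than this toy criterion.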
Isothermal folding of a light-up bio-orthogonal RNA origami nanoribbon.
Torelli, Emanuela; Kozyra, Jerzy Wieslaw; Gu, Jing-Ying; Stimming, Ulrich; Piantanida, Luca; Voïtchovsky, Kislon; Krasnogor, Natalio
2018-05-03
RNA presents intriguing roles in many cellular processes and its versatility underpins many different applications in synthetic biology. Nonetheless, RNA origami as a method for nanofabrication is not yet fully explored and the majority of RNA nanostructures are based on natural pre-folded RNA. Here we describe a biologically inert and uniquely addressable RNA origami scaffold that self-assembles into a nanoribbon by seven staple strands. An algorithm is applied to generate a synthetic De Bruijn scaffold sequence that is characterized by the lack of biologically active sites and repetitions larger than a predetermined design parameter. This RNA scaffold and the complementary staples fold in a physiologically compatible isothermal condition. In order to monitor the folding, we designed a new split Broccoli aptamer system. The aptamer is divided into two nonfunctional sequences each of which is integrated into the 5' or 3' end of two staple strands complementary to the RNA scaffold. Using fluorescence measurements and in-gel imaging, we demonstrate that once RNA origami assembly occurs, the split aptamer sequences are brought into close proximity forming the aptamer and turning on the fluorescence. This light-up 'bio-orthogonal' RNA origami provides a prototype that can have potential for in vivo origami applications.
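As background to the De Bruijn scaffold mentioned above, here is a short Python sketch of the standard recursive construction of a De Bruijn sequence of order n, in which every n-mer occurs exactly once when the sequence is read cyclically, so no repeat of length n or more exists. This is the textbook construction only; the authors' algorithm additionally excludes biologically active sites, which is not modelled here, and the function name `de_bruijn` is an assumption.

```python
def de_bruijn(alphabet: str, n: int) -> str:
    """Return a cyclic De Bruijn sequence of order n over the given alphabet."""
    k = len(alphabet)
    a = [0] * (k * n)
    sequence = []

    def db(t: int, p: int) -> None:
        if t > n:
            # Emit the prefix only when its period p divides n.
            if n % p == 0:
                sequence.extend(a[1:p + 1])
        else:
            a[t] = a[t - p]
            db(t + 1, p)
            for j in range(a[t - p] + 1, k):
                a[t] = j
                db(t + 1, t)

    db(1, 1)
    return "".join(alphabet[i] for i in sequence)


if __name__ == "__main__":
    scaffold = de_bruijn("ACGU", 3)
    print(len(scaffold), scaffold)
```

For a linear rather than circular molecule, the first n-1 letters would need to be appended to the end so that every n-mer is still present once.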
Advances in systems biology: computational algorithms and applications.
Huang, Yufei; Zhao, Zhongming; Xu, Hua; Shyr, Yu; Zhang, Bing
2012-01-01
The 2012 International Conference on Intelligent Biology and Medicine (ICIBM 2012) was held on April 22-24, 2012 in Nashville, Tennessee, USA. The conference featured six technical sessions, one tutorial session, one workshop, and 3 keynote presentations that covered state-of-the-art research activities in genomics, systems biology, and intelligent computing. In addition to a major emphasis on the next generation sequencing (NGS)-driven informatics, ICIBM 2012 aligned significant interests in systems biology and its applications in medicine. We highlight in this editorial the selected papers from the meeting that address the developments of novel algorithms and applications in systems biology.
Multiple hypothesis tracking for cluttered biological image sequences.
Chenouard, Nicolas; Bloch, Isabelle; Olivo-Marin, Jean-Christophe
2013-11-01
In this paper, we present a method for simultaneously tracking thousands of targets in biological image sequences, which is of major importance in modern biology. The complexity and inherent randomness of the problem lead us to propose a unified probabilistic framework for tracking biological particles in microscope images. The framework includes realistic models of particle motion and existence and of fluorescence image features. For the track extraction process per se, the very cluttered conditions motivate the adoption of a multiframe approach that enforces tracking decision robustness to poor imaging conditions and to random target movements. We tackle the large-scale nature of the problem by adapting the multiple hypothesis tracking algorithm to the proposed framework, resulting in a method with a favorable tradeoff between the model complexity and the computational cost of the tracking procedure. When compared to the state-of-the-art tracking techniques for bioimaging, the proposed algorithm is shown to be the only method providing high-quality results despite the critically poor imaging conditions and the dense target presence. We thus demonstrate the benefits of advanced Bayesian tracking techniques for the accurate computational modeling of dynamical biological processes, which is promising for further developments in this domain.
Karaboga, D; Aslan, S
2016-04-27
The great majority of biological sequences share significant similarity with other sequences as a result of evolutionary processes, and identifying these sequence similarities is one of the most challenging problems in bioinformatics. In this paper, we present a discrete artificial bee colony (ABC) algorithm, which is inspired by the intelligent foraging behavior of real honey bees, for the detection of highly conserved residue patterns or motifs within sequences. Experimental studies on three different data sets showed that the proposed discrete model, by adhering to the fundamental scheme of the ABC algorithm, produced competitive or better results than other metaheuristic motif discovery techniques.
Rapid and accurate pyrosequencing of angiosperm plastid genomes
Moore, Michael J; Dhingra, Amit; Soltis, Pamela S; Shaw, Regina; Farmerie, William G; Folta, Kevin M; Soltis, Douglas E
2006-01-01
Background Plastid genome sequence information is vital to several disciplines in plant biology, including phylogenetics and molecular biology. The past five years have witnessed a dramatic increase in the number of completely sequenced plastid genomes, fuelled largely by advances in conventional Sanger sequencing technology. Here we report a further significant reduction in time and cost for plastid genome sequencing through the successful use of a newly available pyrosequencing platform, the Genome Sequencer 20 (GS 20) System (454 Life Sciences Corporation), to rapidly and accurately sequence the whole plastid genomes of the basal eudicot angiosperms Nandina domestica (Berberidaceae) and Platanus occidentalis (Platanaceae). Results More than 99.75% of each plastid genome was simultaneously obtained during two GS 20 sequence runs, to an average depth of coverage of 24.6× in Nandina and 17.3× in Platanus. The Nandina and Platanus plastid genomes shared essentially identical gene complements and possessed the typical angiosperm plastid structure and gene arrangement. To assess the accuracy of the GS 20 sequence, over 45 kilobases of sequence were generated for each genome using conventional sequencing. Overall error rates of 0.043% and 0.031% were observed in GS 20 sequence for Nandina and Platanus, respectively. More than 97% of all observed errors were associated with homopolymer runs, with ~60% of all errors associated with homopolymer runs of 5 or more nucleotides and ~50% of all errors associated with regions of extensive homopolymer runs. No substitution errors were present in either genome. Error rates were generally higher in the single-copy and noncoding regions of both plastid genomes relative to the inverted repeat and coding regions. Conclusion Highly accurate and essentially complete sequence information was obtained for the Nandina and Platanus plastid genomes using the GS 20 System. 
More importantly, the high accuracy observed in the GS 20 plastid genome sequence was generated for a significant reduction in time and cost over traditional shotgun-based genome sequencing techniques, although with approximately half the coverage of previously reported GS 20 de novo genome sequence. The GS 20 should be broadly applicable to angiosperm plastid genome sequencing, and therefore promises to expand the scale of plant genetic and phylogenetic research dramatically. PMID:16934154
Sequencing of Seven Haloarchaeal Genomes Reveals Patterns of Genomic Flux
Lynch, Erin A.; Langille, Morgan G. I.; Darling, Aaron; Wilbanks, Elizabeth G.; Haltiner, Caitlin; Shao, Katie S. Y.; Starr, Michael O.; Teiling, Clotilde; Harkins, Timothy T.; Edwards, Robert A.; Eisen, Jonathan A.; Facciotti, Marc T.
2012-01-01
We report the sequencing of seven genomes from two haloarchaeal genera, Haloferax and Haloarcula. Ease of cultivation and the existence of well-developed genetic and biochemical tools for several diverse haloarchaeal species make haloarchaea a model group for the study of archaeal biology. The unique physiological properties of these organisms also make them good candidates for novel enzyme discovery for biotechnological applications. Seven genomes were sequenced to ∼20× coverage and assembled to an average of 50 contigs (range 5 scaffolds - 168 contigs). Comparisons of protein-coding gene complements revealed large-scale differences in COG functional group enrichment between these genera. Analysis of genes encoding machinery for DNA metabolism reveals genera-specific expansions of the general transcription factor TATA binding protein as well as a history of extensive duplication and horizontal transfer of the proliferating cell nuclear antigen. Insights gained from this study emphasize the importance of haloarchaea for investigation of archaeal biology. PMID:22848480
Vipsita, Swati; Rath, Santanu Kumar
2015-01-01
Protein superfamily classification deals with the problem of predicting the family membership of a newly discovered amino acid sequence. Although many trivial alignment methods have already been developed by previous researchers, the present trend demands the application of computational intelligence techniques. As the size of biological databases grows exponentially, retrieval and inference of essential knowledge in the biological domain become a very cumbersome task. This problem can be easily handled using intelligent techniques due to their tolerance for imprecision, uncertainty, approximate reasoning, and partial truth. This paper discusses the various global and local features extracted from full-length protein sequences which are used for the approximation and generalisation of the classifier. The various parameters used for evaluating the performance of the classifiers are also discussed. This review article can therefore point present researchers in the right direction for improving on the existing methods.
Defining functional distance using manifold embeddings of gene ontology annotations
Lerman, Gilad; Shakhnovich, Boris E.
2007-01-01
Although rigorous measures of similarity for sequence and structure are now well established, the problem of defining functional relationships has been particularly daunting. Here, we present several manifold embedding techniques to compute distances between Gene Ontology (GO) functional annotations and consequently estimate functional distances between protein domains. To evaluate accuracy, we correlate the functional distance to the well established measures of sequence, structural, and phylogenetic similarities. Finally, we show that manual classification of structures into folds and superfamilies is mirrored by proximity in the newly defined function space. We show how functional distances place structure–function relationships in biological context resulting in insight into divergent and convergent evolution. The methods and results in this paper can be readily generalized and applied to a wide array of biologically relevant investigations, such as accuracy of annotation transference, the relationship between sequence, structure, and function, or coherence of expression modules. PMID:17595300
2010-01-01
Background In bioinformatics it is common to search for a pattern of interest in a potentially large set of rather short sequences (upstream gene regions, proteins, exons, etc.). Although many methodological approaches allow practitioners to compute the distribution of a pattern count in a random sequence generated by a Markov source, no specific developments have taken into account the counting of occurrences in a set of independent sequences. We aim to address this problem by deriving efficient approaches and algorithms to perform these computations both for low and high complexity patterns in the framework of homogeneous or heterogeneous Markov models. Results The latest advances in the field allowed us to use a technique of optimal Markov chain embedding based on deterministic finite automata to introduce three innovative algorithms. Algorithm 1 is the only one able to deal with heterogeneous models. It also permits to avoid any product of convolution of the pattern distribution in individual sequences. When working with homogeneous models, Algorithm 2 yields a dramatic reduction in the complexity by taking advantage of previous computations to obtain moment generating functions efficiently. In the particular case of low or moderate complexity patterns, Algorithm 3 exploits power computation and binary decomposition to further reduce the time complexity to a logarithmic scale. All these algorithms and their relative interest in comparison with existing ones were then tested and discussed on a toy-example and three biological data sets: structural patterns in protein loop structures, PROSITE signatures in a bacterial proteome, and transcription factors in upstream gene regions. On these data sets, we also compared our exact approaches to the tempting approximation that consists in concatenating the sequences in the data set into a single sequence. 
Conclusions Our algorithms prove to be effective and able to handle real data sets with multiple sequences, as well as biological patterns of interest, even when the latter display a high complexity (PROSITE signatures for example). In addition, these exact algorithms allow us to avoid the edge effect observed under the single sequence approximation, which leads to erroneous results, especially when the marginal distribution of the model displays a slow convergence toward the stationary distribution. We end up with a discussion on our method and on its potential improvements. PMID:20205909
Nuel, Gregory; Regad, Leslie; Martin, Juliette; Camproux, Anne-Claude
2010-01-26
In bioinformatics it is common to search for a pattern of interest in a potentially large set of rather short sequences (upstream gene regions, proteins, exons, etc.). Although many methodological approaches allow practitioners to compute the distribution of a pattern count in a random sequence generated by a Markov source, no specific developments have taken into account the counting of occurrences in a set of independent sequences. We aim to address this problem by deriving efficient approaches and algorithms to perform these computations both for low and high complexity patterns in the framework of homogeneous or heterogeneous Markov models. The latest advances in the field allowed us to use a technique of optimal Markov chain embedding based on deterministic finite automata to introduce three innovative algorithms. Algorithm 1 is the only one able to deal with heterogeneous models. It also permits to avoid any product of convolution of the pattern distribution in individual sequences. When working with homogeneous models, Algorithm 2 yields a dramatic reduction in the complexity by taking advantage of previous computations to obtain moment generating functions efficiently. In the particular case of low or moderate complexity patterns, Algorithm 3 exploits power computation and binary decomposition to further reduce the time complexity to a logarithmic scale. All these algorithms and their relative interest in comparison with existing ones were then tested and discussed on a toy-example and three biological data sets: structural patterns in protein loop structures, PROSITE signatures in a bacterial proteome, and transcription factors in upstream gene regions. On these data sets, we also compared our exact approaches to the tempting approximation that consists in concatenating the sequences in the data set into a single sequence. 
Our algorithms prove to be effective and able to handle real data sets with multiple sequences, as well as biological patterns of interest, even when the latter display a high complexity (PROSITE signatures for example). In addition, these exact algorithms allow us to avoid the edge effect observed under the single sequence approximation, which leads to erroneous results, especially when the marginal distribution of the model displays a slow convergence toward the stationary distribution. We end up with a discussion on our method and on its potential improvements.
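To make the setting concrete, the Python sketch below computes the exact distribution of the number of (possibly overlapping) occurrences of a single word in a sequence generated by a homogeneous order-1 Markov chain, by running the chain jointly with a pattern-matching DFA, and then combines independent sequences by plain convolution. This is the naive baseline that the abstract's algorithms improve upon, not Algorithm 1, 2 or 3; all function names and the toy model are assumptions.

```python
from collections import defaultdict


def kmp_dfa(pattern, alphabet):
    """DFA on prefix lengths 0..m; entering state m marks the end of an occurrence."""
    m = len(pattern)
    fail = [0] * (m + 1)          # fail[j]: longest proper border of pattern[:j]
    k = 0
    for i in range(1, m):
        while k and pattern[i] != pattern[k]:
            k = fail[k]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i + 1] = k
    delta = [dict() for _ in range(m + 1)]
    for q in range(m + 1):
        for c in alphabet:
            if q < m and pattern[q] == c:
                delta[q][c] = q + 1
            elif q == 0:
                delta[q][c] = 0
            else:
                delta[q][c] = delta[fail[q]][c]
    return delta


def count_distribution(pattern, n, init, trans):
    """P(count = c) for one sequence of length n >= 1 under an order-1 Markov chain."""
    delta = kmp_dfa(pattern, list(init))
    m = len(pattern)
    dist = defaultdict(float)     # (dfa state, last letter, count) -> probability
    for c, p in init.items():
        q = delta[0][c]
        dist[(q, c, int(q == m))] += p
    for _ in range(n - 1):
        nxt = defaultdict(float)
        for (q, last, cnt), p in dist.items():
            for c, pc in trans[last].items():
                q2 = delta[q][c]
                nxt[(q2, c, cnt + int(q2 == m))] += p * pc
        dist = nxt
    out = defaultdict(float)
    for (_, _, cnt), p in dist.items():
        out[cnt] += p
    return dict(out)


def convolve(d1, d2):
    """Distribution of the sum of two independent counts."""
    out = defaultdict(float)
    for a, pa in d1.items():
        for b, pb in d2.items():
            out[a + b] += pa * pb
    return dict(out)


if __name__ == "__main__":
    init = {"A": 0.5, "B": 0.5}
    trans = {"A": {"A": 0.5, "B": 0.5}, "B": {"A": 0.5, "B": 0.5}}
    one = count_distribution("AA", 3, init, trans)
    print(one, convolve(one, one))
```

Because each sequence is handled separately, no occurrence can straddle two sequences, which is the edge effect the abstract attributes to the concatenation approximation.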
DWARF – a data warehouse system for analyzing protein families
Fischer, Markus; Thai, Quan K; Grieb, Melanie; Pleiss, Jürgen
2006-01-01
Background The emerging field of integrative bioinformatics provides the tools to organize and systematically analyze vast amounts of highly diverse biological data and thus allows one to gain a novel understanding of complex biological systems. The data warehouse DWARF applies integrative bioinformatics approaches to the analysis of large protein families. Description The data warehouse system DWARF integrates data on sequence, structure, and functional annotation for protein fold families. The underlying relational data model consists of three major sections representing entities related to the protein (biochemical function, source organism, classification to homologous families and superfamilies), the protein sequence (position-specific annotation, mutant information), and the protein structure (secondary structure information, superimposed tertiary structure). Tools for extracting, transforming and loading data from publicly available resources (ExPDB, GenBank, DSSP) are provided to populate the database. The data can be accessed by an interface for searching and browsing, and by analysis tools that operate on annotation, sequence, or structure. We applied DWARF to the family of α/β-hydrolases to host the Lipase Engineering database. Release 2.3 contains 6138 sequences and 167 experimentally determined protein structures, which are assigned to 37 superfamilies and 103 homologous families. Conclusion DWARF has been designed for constructing databases of large structurally related protein families and for evaluating their sequence-structure-function relationships by a systematic analysis of sequence, structure and functional annotation. It has been applied to predict biochemical properties from sequence, and serves as a valuable tool for protein engineering. PMID:17094801
ERGC: an efficient referential genome compression algorithm.
Saha, Subrata; Rajasekaran, Sanguthevar
2015-11-01
Genome sequencing has become faster and more affordable. Consequently, the number of available complete genomic sequences is increasing rapidly. As a result, the cost to store, process, analyze and transmit the data is becoming a bottleneck for research and future medical applications. So, the need for devising efficient data compression and data reduction techniques for biological sequencing data is growing by the day. Although there exist a number of standard data compression algorithms, they are not efficient in compressing biological data. These generic algorithms do not exploit some inherent properties of the sequencing data while compressing. To exploit statistical and information-theoretic properties of genomic sequences, we need specialized compression algorithms. Five different next-generation sequencing data compression problems have been identified and studied in the literature. We propose a novel algorithm for one of these problems known as reference-based genome compression. We have done extensive experiments using five real sequencing datasets. The results on real genomes show that our proposed algorithm is indeed competitive and performs better than the best known algorithms for this problem. It achieves compression ratios that are better than those of the currently best performing algorithms. The time to compress and decompress the whole genome is also very promising. The implementations are freely available for non-commercial purposes. They can be downloaded from http://engr.uconn.edu/∼rajasek/ERGC.zip. Contact: rajasek@engr.uconn.edu. © The Author 2015. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
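The general idea of reference-based compression can be illustrated with the following Python sketch, which encodes a target sequence as a list of copy operations against a reference plus literal runs for unmatched bases. It is a generic greedy scheme under assumed names (`build_index`, `compress`, `decompress`) and an assumed seed length `k`; it is not the ERGC algorithm, whose details are not given in this abstract.

```python
def build_index(reference: str, k: int) -> dict:
    """Map every k-mer of the reference to the list of positions where it starts."""
    index = {}
    for i in range(len(reference) - k + 1):
        index.setdefault(reference[i:i + k], []).append(i)
    return index


def compress(target: str, reference: str, k: int = 4) -> list:
    """Greedy encoding: ('M', pos, length) copies from the reference, ('L', text) literals."""
    index = build_index(reference, k)
    ops, literal, i = [], [], 0
    while i < len(target):
        best_pos, best_len = -1, 0
        for pos in index.get(target[i:i + k], []):
            length = k
            # Extend the seed match as far as both sequences agree.
            while (i + length < len(target) and pos + length < len(reference)
                   and target[i + length] == reference[pos + length]):
                length += 1
            if length > best_len:
                best_pos, best_len = pos, length
        if best_len >= k:
            if literal:                       # flush pending unmatched bases
                ops.append(("L", "".join(literal)))
                literal = []
            ops.append(("M", best_pos, best_len))
            i += best_len
        else:
            literal.append(target[i])
            i += 1
    if literal:
        ops.append(("L", "".join(literal)))
    return ops


def decompress(ops: list, reference: str) -> str:
    """Rebuild the target from the operation list and the same reference."""
    parts = []
    for op in ops:
        if op[0] == "M":
            _, pos, length = op
            parts.append(reference[pos:pos + length])
        else:
            parts.append(op[1])
    return "".join(parts)


if __name__ == "__main__":
    ref = "ACGTACGTTTGACCAGGTTAACC"
    tgt = "ACGTACGTTTGACCTGGTTAACC"   # one substitution relative to ref
    ops = compress(tgt, ref)
    print(ops, decompress(ops, ref) == tgt)
```

Real tools would additionally entropy-code the operation list; the sketch stops at the match/literal representation.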
Deep sequencing of cardiac microRNA-mRNA interactomes in clinical and experimental cardiomyopathy
Matkovich, Scot J.; Dorn, Gerald W.
2018-01-01
Summary MicroRNAs are a family of short (~21 nucleotide) noncoding RNAs that serve key roles in cellular growth and differentiation and the response of the heart to stress stimuli. As the sequence-specific recognition element of RNA-induced silencing complexes (RISCs), microRNAs bind mRNAs and prevent their translation via mechanisms that may include transcript degradation and/or prevention of ribosome binding. Short microRNA sequences and the ability of microRNAs to bind to mRNA sites having only partial/imperfect sequence complementarity complicate purely computational analyses of microRNA-mRNA interactomes. Furthermore, computational microRNA target prediction programs typically ignore biological context, and therefore the principal determinants of microRNA-mRNA binding: the presence and quantity of each. To address these deficiencies we describe an empirical method, developed via studies of stressed and failing hearts, to determine disease-induced changes in microRNAs, mRNAs, and the mRNAs targeted to the RISC, without cross-linking mRNAs to RISC proteins. Deep sequencing methods are used to determine RNA abundances, delivering unbiased, quantitative RNA data limited only by their annotation in the genome of interest. We describe the laboratory bench steps required to perform these experiments, experimental design strategies to achieve an appropriate number of sequencing reads per biological replicate, and computer-based processing tools and procedures to convert large raw sequencing data files into gene expression measures useful for differential expression analyses. PMID:25836573
Deep sequencing of cardiac microRNA-mRNA interactomes in clinical and experimental cardiomyopathy.
Matkovich, Scot J; Dorn, Gerald W
2015-01-01
MicroRNAs are a family of short (~21 nucleotide) noncoding RNAs that serve key roles in cellular growth and differentiation and the response of the heart to stress stimuli. As the sequence-specific recognition element of RNA-induced silencing complexes (RISCs), microRNAs bind mRNAs and prevent their translation via mechanisms that may include transcript degradation and/or prevention of ribosome binding. Short microRNA sequences and the ability of microRNAs to bind to mRNA sites having only partial/imperfect sequence complementarity complicate purely computational analyses of microRNA-mRNA interactomes. Furthermore, computational microRNA target prediction programs typically ignore biological context, and therefore the principal determinants of microRNA-mRNA binding: the presence and quantity of each. To address these deficiencies we describe an empirical method, developed via studies of stressed and failing hearts, to determine disease-induced changes in microRNAs, mRNAs, and the mRNAs targeted to the RISC, without cross-linking mRNAs to RISC proteins. Deep sequencing methods are used to determine RNA abundances, delivering unbiased, quantitative RNA data limited only by their annotation in the genome of interest. We describe the laboratory bench steps required to perform these experiments, experimental design strategies to achieve an appropriate number of sequencing reads per biological replicate, and computer-based processing tools and procedures to convert large raw sequencing data files into gene expression measures useful for differential expression analyses.
Ruff, Kiersten M; Roberts, Stefan; Chilkoti, Ashutosh; Pappu, Rohit V
2018-06-24
Proteins and synthetic polymers can undergo phase transitions in response to changes to intensive solution parameters such as temperature, proton chemical potentials (pH), and hydrostatic pressure. For proteins and protein-based polymers, the information required for stimulus responsive phase transitions is encoded in their amino acid sequence. Here, we review some of the key physical principles that govern the phase transitions of archetypal intrinsically disordered protein polymers (IDPPs). These are disordered proteins with highly repetitive amino acid sequences. Advances in recombinant technologies have enabled the design and synthesis of protein sequences of a variety of sequence complexities and lengths. We summarize insights that have been gleaned from the design and characterization of IDPPs that undergo thermo-responsive phase transitions and build on these insights to present a general framework for IDPPs with pH and pressure responsive phase behavior. In doing so, we connect the stimulus responsive phase behavior of IDPPs with repetitive sequences to the coil-to-globule transitions that these sequences undergo at the single chain level in response to changes in stimuli. The proposed framework and ongoing studies of stimulus responsive phase behavior of designed IDPPs have direct implications in bioengineering, where designing sequences with bespoke material properties broadens the spectrum of applications, and in biology and medicine for understanding the sequence-specific driving forces for the formation of protein-based membraneless organelles as well as biological matrices that act as scaffolds for cells and mediators of cell-to-cell communication. Copyright © 2018. Published by Elsevier Ltd.
USDA-ARS's Scientific Manuscript database
Complementing quantitative methods with sequence data analysis is a major goal of the post-genome era of biology. In this study, we analyzed Illumina HiSeq sequence data derived from 11 US Holstein bulls in order to identify putative causal mutations associated with calving and conformation traits. ...
Shannon C.K. Straub; Mark Fishbein; Tatyana Livshult; Zachary Foster; Matthew Parks; Kevin Weitemier; Richard C. Cronn; Aaron Liston
2011-01-01
Milkweeds (Asclepias L.) have been extensively investigated in diverse areas of evolutionary biology and ecology; however, there are few genetic resources available to facilitate and complement these studies. This study explored how low coverage genome sequencing of the common milkweed (Asclepias syriaca L.) could be useful in...
Giardia lamblia: Molecular Studies of an Early Branching Eukaryote
USDA-ARS's Scientific Manuscript database
The rapid advance in our understanding of the biology of Giardia lamblia over the last several years is due in part to the complete DNA sequencing of the 11.7 Mb genome of this diplomonad. Insight on the molecular nature of G. lamblia has been gained by searching the genome using query sequences fr...
Topic Sequence and Emphasis Variability of Selected Organic Chemistry Textbooks
ERIC Educational Resources Information Center
Houseknecht, Justin B.
2010-01-01
Textbook choice has a significant effect upon course success. Among the factors that influence this decision, two of the most important are material organization and emphasis. This paper examines the sequencing of 19 organic chemistry topics, 21 concepts and skills, and 7 biological topics within nine of the currently available organic textbooks.…
The W22 genome: a foundation for maize functional genomics and transposon biology
USDA-ARS's Scientific Manuscript database
The maize W22 inbred has served as a platform for maize genetics since the mid twentieth century. To streamline maize genome analyses, we have sequenced and de novo assembled a W22 reference genome using small-read sequencing technologies. We show that significant structural heterogeneity exists in ...
Applications and Extensions of pClust to Big Microbial Proteomic Data
ERIC Educational Resources Information Center
Lockwood, Svetlana
2016-01-01
The goal of biological sciences is to understand the biomolecular mechanics of living organisms. Proteins serve as the foundation for organisms' functional analysis, and sequence analysis has been shown to be invaluable in answering questions about individual organisms. The first step in any sequence analysis is alignment and it is common that even…
USDA-ARS's Scientific Manuscript database
Little is known about genetic variation of Lymantria dispar multiple nucleopolyhedrovirus (LdMNPV; Baculoviridae: Alphabaculovirus) at the nucleotide sequence level. To obtain a more comprehensive view of genetic diversity among isolates of LdMNPV, partial sequences of the lef-8 gene were generated...
Teaching the Process of Molecular Phylogeny and Systematics: A Multi-Part Inquiry-Based Exercise
ERIC Educational Resources Information Center
Lents, Nathan H.; Cifuentes, Oscar E.; Carpi, Anthony
2010-01-01
Three approaches to molecular phylogenetics are demonstrated to biology students as they explore molecular data from "Homo sapiens" and four related primates. By analyzing DNA sequences, protein sequences, and chromosomal maps, students are repeatedly challenged to develop hypotheses regarding the ancestry of the five species. Although…
ERIC Educational Resources Information Center
Bowling, Bethany; Zimmer, Erin; Pyatt, Robert E.
2014-01-01
Although the development of next-generation (NextGen) sequencing technologies has revolutionized genomic research and medicine, the incorporation of these topics into the classroom is challenging, given an implied high degree of technical complexity. We developed an easy-to-implement, interactive classroom activity investigating the similarities…
Technology-Enhanced Research in the Science Classroom.
ERIC Educational Resources Information Center
Francis, Joseph W.
1997-01-01
Describes a project where students use the Internet as a research tool. Discusses using e-mail to access molecular biology databases and identify proteins using amino acid sequences, obtaining complete amino acid sequences using the world wide web, using telnet to access library resources on the Internet, and various stages of protein analysis…
Information theory applications for biological sequence analysis.
Vinga, Susana
2014-05-01
Information theory (IT) addresses the analysis of communication systems and has been widely applied in molecular biology. In particular, alignment-free sequence analysis and comparison greatly benefited from concepts derived from IT, such as entropy and mutual information. This review covers several aspects of IT applications, ranging from genome global analysis and comparison, including block-entropy estimation and resolution-free metrics based on iterative maps, to local analysis, comprising the classification of motifs, prediction of transcription factor binding sites and sequence characterization based on linguistic complexity and entropic profiles. IT has also been applied to high-level correlations that combine DNA, RNA or protein features with sequence-independent properties, such as gene mapping and phenotype analysis, and has also provided models based on communication systems theory to describe information transmission channels at the cell level and also during evolutionary processes. While not exhaustive, this review attempts to categorize existing methods and to indicate their relation with broader transversal topics such as genomic signatures, data compression and complexity, time series analysis and phylogenetic classification, providing a resource for future developments in this promising area.
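As a rough illustration of the entropy concept the review above refers to, the following Python sketch estimates the Shannon entropy of a sequence from its empirical k-mer (block) frequencies. The function name `block_entropy` and the plug-in frequency estimator are assumptions made for this example; the review itself does not prescribe an implementation.

```python
import math
from collections import Counter


def block_entropy(seq: str, k: int = 1) -> float:
    """Shannon entropy (in bits) of the empirical k-mer distribution of seq.

    Hypothetical helper for illustration: overlapping k-mers are counted and
    their relative frequencies are plugged into H = -sum(p * log2(p)).
    """
    if k < 1 or len(seq) < k:
        raise ValueError("need 1 <= k <= len(seq)")
    # Collect all overlapping blocks of length k.
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    counts = Counter(kmers)
    total = len(kmers)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


if __name__ == "__main__":
    # A uniform base composition gives the maximum of 2 bits per base;
    # a homopolymer carries no information.
    print(block_entropy("ACGT", k=1))
    print(block_entropy("AAAA", k=1))
```

The plug-in estimator is biased downward for short sequences and large k, so values from small samples should be read as rough indications only.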
The value of new genome references.
Worley, Kim C; Richards, Stephen; Rogers, Jeffrey
2017-09-15
Genomic information has become a ubiquitous and almost essential aspect of biological research. Over the last 10-15 years, the cost of generating sequence data from DNA or RNA samples has dramatically declined and our ability to interpret those data increased just as remarkably. Although it is still possible for biologists to conduct interesting and valuable research on species for which genomic data are not available, the impact of having access to a high quality whole genome reference assembly for a given species is nothing short of transformational. Research on a species for which we have no DNA or RNA sequence data is restricted in fundamental ways. In contrast, even access to an initial draft quality genome (see below for definitions) opens a wide range of opportunities that are simply not available without that reference genome assembly. Although a complete discussion of the impact of genome sequencing and assembly is beyond the scope of this short paper, the goal of this review is to summarize the most common and highest impact contributions that whole genome sequencing and assembly has had on comparative and evolutionary biology. Copyright © 2016. Published by Elsevier Inc.
Human genome project: revolutionizing biology through leveraging technology
NASA Astrophysics Data System (ADS)
Dahl, Carol A.; Strausberg, Robert L.
1996-04-01
The Human Genome Project (HGP) is an international project to develop genetic, physical, and sequence-based maps of the human genome. Since the inception of the HGP it has been clear that substantially improved technology would be required to meet the scientific goals, particularly in order to acquire the complete sequence of the human genome, and that these technologies coupled with the information forthcoming from the project would have a dramatic effect on the way biomedical research is performed in the future. In this paper, we discuss the state-of-the-art for genomic DNA sequencing, technological challenges that remain, and the potential technological paths that could yield substantially improved genomic sequencing technology. The impact of the technology developed from the HGP is broad-reaching and a discussion of other research and medical applications that are leveraging HGP-derived DNA analysis technologies is included. The multidisciplinary approach to the development of new technologies that has been successful for the HGP provides a paradigm for facilitating new genomic approaches toward understanding the biological role of functional elements and systems within the cell, including those encoded within genomic DNA and their molecular products.
Young, Robert S
2016-07-01
Frequent evolutionary birth and death events have created a large quantity of biologically important, lineage-specific DNA within mammalian genomes. The birth and death of DNA sequences is so frequent that the total number of these insertions and deletions in the human population remains unknown, although there are differences between these groups, e.g. transposable elements contribute predominantly to sequence insertion. Functional turnover - where the activity of a locus is specific to one lineage, but the underlying DNA remains conserved - can also drive birth and death. However, this does not appear to be a major driver of divergent transcriptional regulation. Both sequence and functional turnover have contributed to the birth and death of thousands of functional promoters in the human and mouse genomes. These findings reveal the pervasive nature of evolutionary birth and death and suggest that lineage-specific regions may play an important but previously underappreciated role in human biology and disease. © 2016 The Authors BioEssays Published by WILEY Periodicals, Inc.
StrBioLib: a Java library for development of custom computational structural biology applications.
Chandonia, John-Marc
2007-08-01
StrBioLib is a library of Java classes useful for developing software for computational structural biology research. StrBioLib contains classes to represent and manipulate protein structures, biopolymer sequences, sets of biopolymer sequences, and alignments between biopolymers based on either sequence or structure. Interfaces are provided to interact with commonly used bioinformatics applications, including (psi)-blast, modeller, muscle and Primer3, and tools are provided to read and write many file formats used to represent bioinformatic data. The library includes a general-purpose neural network object with multiple training algorithms, the Hooke and Jeeves non-linear optimization algorithm, and tools for efficient C-style string parsing and formatting. StrBioLib is the basis for the Pred2ary secondary structure prediction program, is used to build the astral compendium for sequence and structure analysis, and has been extensively tested through use in many smaller projects. Examples and documentation are available at the site below. StrBioLib may be obtained under the terms of the GNU LGPL license from http://strbio.sourceforge.net/
Discovering Motifs in Biological Sequences Using the Micron Automata Processor.
Roy, Indranil; Aluru, Srinivas
2016-01-01
Finding approximately conserved sequences, called motifs, across multiple DNA or protein sequences is an important problem in computational biology. In this paper, we consider the (l, d) motif search problem of identifying one or more motifs of length l present in at least q of the n given sequences, with each occurrence differing from the motif in at most d substitutions. The problem is known to be NP-complete, and the largest solved instance reported to date is (26,11). We propose a novel algorithm for the (l,d) motif search problem using streaming execution over a large set of non-deterministic finite automata (NFA). This solution is designed to take advantage of the micron automata processor, a new technology close to deployment that can simultaneously execute multiple NFA in parallel. We demonstrate the capability for solving much larger instances of the (l, d) motif search problem using the resources available within a single automata processor board, by estimating run-times for problem instances (39,18) and (40,17). The paper serves as a useful guide to solving problems using this new accelerator technology.
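To make the (l, d) motif search problem stated in the abstract above concrete, here is a minimal brute-force Python sketch. It is not the authors' NFA-based method: it simply enumerates every candidate string of length l and keeps those that occur, with at most d substitutions, in at least q of the input sequences. The function names (`hamming`, `occurs`, `ld_motif_search`) are hypothetical, and the 4^l enumeration is only practical for very small l.

```python
from itertools import product


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def occurs(motif: str, seq: str, d: int) -> bool:
    """True if some window of seq is within d substitutions of motif."""
    l = len(motif)
    return any(hamming(motif, seq[i:i + l]) <= d
               for i in range(len(seq) - l + 1))


def ld_motif_search(seqs, l, d, q=None, alphabet="ACGT"):
    """Brute-force (l, d) motif search with quorum q (default: all sequences).

    Illustrative only: examines all len(alphabet)**l candidates.
    """
    if q is None:
        q = len(seqs)
    hits = []
    for cand in product(alphabet, repeat=l):
        motif = "".join(cand)
        # Count the sequences that contain an approximate occurrence.
        if sum(occurs(motif, s, d) for s in seqs) >= q:
            hits.append(motif)
    return hits


if __name__ == "__main__":
    print(ld_motif_search(["AACGT", "CCGTA", "TCGTT"], l=3, d=0))
```

The exponential candidate space is what makes instances such as (26,11) hard, and it is the reason the paper turns to parallel automata hardware rather than enumeration of this kind.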
Draft De Novo Transcriptome of the Rat Kangaroo Potorous tridactylus as a Tool for Cell Biology
Udy, Dylan B.; Voorhies, Mark; Chan, Patricia P.; Lowe, Todd M.; Dumont, Sophie
2015-01-01
The rat kangaroo (long-nosed potoroo, Potorous tridactylus) is a marsupial native to Australia. Cultured rat kangaroo kidney epithelial cells (PtK) are commonly used to study cell biological processes. These mammalian cells are large, adherent, and flat, and contain large and few chromosomes—and are thus ideal for imaging intra-cellular dynamics such as those of mitosis. Despite this, neither the rat kangaroo genome nor transcriptome have been sequenced, creating a challenge for probing the molecular basis of these cellular dynamics. Here, we present the sequencing, assembly and annotation of the draft rat kangaroo de novo transcriptome. We sequenced 679 million reads that mapped to 347,323 Trinity transcripts and 20,079 Unigenes. We present statistics emerging from transcriptome-wide analyses, and analyses suggesting that the transcriptome covers full-length sequences of most genes, many with multiple isoforms. We also validate our findings with a proof-of-concept gene knockdown experiment. We expect that this high quality transcriptome will make rat kangaroo cells a more tractable system for linking molecular-scale function and cellular-scale dynamics. PMID:26252667
Parallel Implementation of MAFFT on CUDA-Enabled Graphics Hardware.
Zhu, Xiangyuan; Li, Kenli; Salah, Ahmad; Shi, Lin; Li, Keqin
2015-01-01
Multiple sequence alignment (MSA) constitutes an extremely powerful tool for many biological applications including phylogenetic tree estimation, secondary structure prediction, and critical residue identification. However, aligning large biological sequences with popular tools such as MAFFT requires long runtimes on sequential architectures. Due to the ever increasing sizes of sequence databases, there is increasing demand to accelerate this task. In this paper, we demonstrate how graphic processing units (GPUs), powered by the compute unified device architecture (CUDA), can be used as an efficient computational platform to accelerate the MAFFT algorithm. To fully exploit the GPU's capabilities for accelerating MAFFT, we have optimized the sequence data organization to eliminate the bandwidth bottleneck of memory access, designed a memory allocation and reuse strategy to make full use of limited memory of GPUs, proposed a new modified-run-length encoding (MRLE) scheme to reduce memory consumption, and used high-performance shared memory to speed up I/O operations. Our implementation tested in three NVIDIA GPUs achieves speedup up to 11.28 on a Tesla K20m GPU compared to the sequential MAFFT 7.015.
WormBase 2014: new views of curated biology
Harris, Todd W.; Baran, Joachim; Bieri, Tamberlyn; Cabunoc, Abigail; Chan, Juancarlos; Chen, Wen J.; Davis, Paul; Done, James; Grove, Christian; Howe, Kevin; Kishore, Ranjana; Lee, Raymond; Li, Yuling; Muller, Hans-Michael; Nakamura, Cecilia; Ozersky, Philip; Paulini, Michael; Raciti, Daniela; Schindelman, Gary; Tuli, Mary Ann; Auken, Kimberly Van; Wang, Daniel; Wang, Xiaodong; Williams, Gary; Wong, J. D.; Yook, Karen; Schedl, Tim; Hodgkin, Jonathan; Berriman, Matthew; Kersey, Paul; Spieth, John; Stein, Lincoln; Sternberg, Paul W.
2014-01-01
WormBase (http://www.wormbase.org/) is a highly curated resource dedicated to supporting research using the model organism Caenorhabditis elegans. With an electronic history predating the World Wide Web, WormBase contains information ranging from the sequence and phenotype of individual alleles to genome-wide studies generated using next-generation sequencing technologies. In recent years, we have expanded the contents to include data on additional nematodes of agricultural and medical significance, bringing the knowledge of C. elegans to bear on these systems and providing support for underserved research communities. Manual curation of the primary literature remains a central focus of the WormBase project, providing users with reliable, up-to-date and highly cross-linked information. In this update, we describe efforts to organize the original atomized and highly contextualized curated data into integrated syntheses of discrete biological topics. Next, we discuss our experiences coping with the vast increase in available genome sequences made possible through next-generation sequencing platforms. Finally, we describe some of the features and tools of the new WormBase Web site that help users better find and explore data of interest. PMID:24194605
Wan, Cen; Lees, Jonathan G; Minneci, Federico; Orengo, Christine A; Jones, David T
2017-10-01
Accurate gene or protein function prediction is a key challenge in the post-genome era. Most current methods perform well on molecular function prediction, but struggle to provide useful annotations relating to biological process functions due to the limited power of sequence-based features in that functional domain. In this work, we systematically evaluate the predictive power of temporal transcription expression profiles for protein function prediction in Drosophila melanogaster. Our results show significantly better performance on predicting protein function when transcription expression profile-based features are integrated with sequence-derived features, compared with the sequence-derived features alone. We also observe that the combination of expression-based and sequence-based features leads to further improvement of accuracy on predicting all three domains of gene function. Based on the optimal feature combinations, we then propose a novel multi-classifier-based function prediction method for Drosophila melanogaster proteins, FFPred-fly+. Interpreting our machine learning models also allows us to identify some of the underlying links between biological processes and developmental stages of Drosophila melanogaster.
Linking soil biology and chemistry in biological soil crust using isolate exometabolomics
DOE Office of Scientific and Technical Information (OSTI.GOV)
Swenson, Tami L.; Karaoz, Ulas; Swenson, Joel M.
Metagenomic sequencing provides a window into microbial community structure and metabolic potential; however, linking these data to exogenous metabolites that microorganisms process and produce (the exometabolome) remains challenging. Previously, we observed strong exometabolite niche partitioning among bacterial isolates from biological soil crust (biocrust). For this study, we examine native biocrust to determine if these patterns are reproduced in the environment. Overall, most soil metabolites display the expected relationship (positive or negative correlation) with four dominant bacteria following a wetting event and across biocrust developmental stages. For metabolites that were previously found to be consumed by an isolate, 70% are negatively correlated with the abundance of the isolate's closest matching environmental relative in situ, whereas for released metabolites, 67% were positively correlated. Our results demonstrate that metabolite profiling, shotgun sequencing and exometabolomics may be successfully integrated to functionally link microbial community structure with environmental chemistry in biocrust.
Linking soil biology and chemistry in biological soil crust using isolate exometabolomics
Swenson, Tami L.; Karaoz, Ulas; Swenson, Joel M.; ...
2018-01-02
Metagenomic sequencing provides a window into microbial community structure and metabolic potential; however, linking these data to exogenous metabolites that microorganisms process and produce (the exometabolome) remains challenging. Previously, we observed strong exometabolite niche partitioning among bacterial isolates from biological soil crust (biocrust). For this study, we examine native biocrust to determine if these patterns are reproduced in the environment. Overall, most soil metabolites display the expected relationship (positive or negative correlation) with four dominant bacteria following a wetting event and across biocrust developmental stages. For metabolites that were previously found to be consumed by an isolate, 70% are negatively correlated with the abundance of the isolate's closest matching environmental relative in situ, whereas for released metabolites, 67% were positively correlated. Our results demonstrate that metabolite profiling, shotgun sequencing and exometabolomics may be successfully integrated to functionally link microbial community structure with environmental chemistry in biocrust.
Potential in vivo roles of nucleic acid triple-helices
Buske, Fabian A
2011-01-01
The ability of double-stranded DNA to form a triple-helical structure by hydrogen bonding with a third strand is well established, but the biological functions of these structures remain largely unknown. There is considerable albeit circumstantial evidence for the existence of nucleic triplexes in vivo and their potential participation in a variety of biological processes including chromatin organization, DNA repair, transcriptional regulation and RNA processing has been investigated in a number of studies to date. There is also a range of possible mechanisms to regulate triplex formation through differential expression of triplex-forming RNAs, alteration of chromatin accessibility, sequence unwinding and nucleotide modifications. With the advent of next generation sequencing technology combined with targeted approaches to isolate triplexes, it is now possible to survey triplex formation with respect to their genomic context, abundance and dynamical changes during differentiation and development, which may open up new vistas in understanding genome biology and gene regulation. PMID:21525785
Technological advances in precision medicine and drug development.
Maggi, Elaine; Patterson, Nicole E; Montagna, Cristina
New technologies are rapidly becoming available to expand the arsenal of tools accessible for precision medicine and to support the development of new therapeutics. Advances in liquid biopsies, which analyze cells, DNA, RNA, proteins, or vesicles isolated from the blood, have gained particular interest for their uses in acquiring information reflecting the biology of tumors and metastatic tissues. Through advancements in DNA sequencing that have merged unprecedented accuracy with affordable cost, personalized treatments based on genetic variations are becoming a real possibility. Extraordinary progress has been achieved in the development of biological therapies aimed to even further advance personalized treatments. We provide a summary of current and future applications of blood based liquid biopsies and how new technologies are utilized for the development of biological therapeutic treatments. We discuss current and future sequencing methods with an emphasis on how technological advances will support the progress in the field of precision medicine.
Integrating sequence and structural biology with DAS
Prlić, Andreas; Down, Thomas A; Kulesha, Eugene; Finn, Robert D; Kähäri, Andreas; Hubbard, Tim JP
2007-01-01
Background The Distributed Annotation System (DAS) is a network protocol for exchanging biological data. It is frequently used to share annotations of genomes and protein sequence. Results Here we present several extensions to the current DAS 1.5 protocol. These provide new commands to share alignments, three dimensional molecular structure data, add the possibility for registration and discovery of DAS servers, and provide a convention how to provide different types of data plots. We present examples of web sites and applications that use the new extensions. We operate a public registry of DAS sources, which now includes entries for more than 250 distinct sources. Conclusion Our DAS extensions are essential for the management of the growing number of services and exchange of diverse biological data sets. In addition the extensions allow new types of applications to be developed and scientific questions to be addressed. The registry of DAS sources is available at PMID:17850653
Liu, Wei-Guo; Liang, Cun-Zhen; Yang, Jin-Sheng; Wang, Gui-Ping; Liu, Miao-Miao
2013-02-01
The bacterial diversity in a biological desulfurization reactor operated continuously for 1 year was studied by the 16S rDNA cloning and sequencing method. Forty clones were randomly selected and their partial 16S rDNA genes (ca. 1,400 bp) were sequenced and searched with BLAST. The results indicated that there were dominant bacteria in the biological desulfurization reactor, where 33 clones belonged to 3 different published phyla, while 1 clone belonged to an unknown phylum. The dominant bacterial community in the system was Proteobacteria, which accounted for 85.3%. The bacterial community composition, in descending order of abundance, was as follows: gamma-Proteobacteria (55.9%), beta-Proteobacteria (17.6%), Actinobacteridae (8.8%), delta-Proteobacteria (5.9%), alpha-Proteobacteria (5.9%), and Sphingobacteria (2.9%). Halothiobacillus sp. ST15 and Thiobacillus sp. UAM-I were the major desulfurization strains.
The planarian flatworm: an in vivo model for stem cell biology and nervous system regeneration
Gentile, Luca; Cebrià, Francesc; Bartscherer, Kerstin
2011-01-01
Planarian flatworms are an exception among bilaterians in that they possess a large pool of adult stem cells that enables them to promptly regenerate any part of their body, including the brain. Although known for two centuries for their remarkable regenerative capabilities, planarians have only recently emerged as an attractive model for studying regeneration and stem cell biology. This revival is due in part to the availability of a sequenced genome and the development of new technologies, such as RNA interference and next-generation sequencing, which facilitate studies of planarian regeneration at the molecular level. Here, we highlight why planarians are an exciting tool in the study of regeneration and its underlying stem cell biology in vivo, and discuss the potential promises and current limitations of this model organism for stem cell research and regenerative medicine. PMID:21135057
Tandem Repeats in Proteins: Prediction Algorithms and Biological Role.
Pellegrini, Marco
2015-01-01
Tandem repetitions in protein sequence and structure is a fascinating subject of research which has been a focus of study since the late 1990s. In this survey, we give an overview on the multi-faceted aspects of research on protein tandem repeats (PTR for short), including prediction algorithms, databases, early classification efforts, mechanisms of PTR formation and evolution, and synthetic PTR design. We also touch on the rather open issue of the relationship between PTR and flexibility (or disorder) in proteins. Detection of PTR either from protein sequence or structure data is challenging due to inherent high (biological) signal-to-noise ratio that is a key feature of this problem. As early in silico analytic tools have been key enablers for starting this field of study, we expect that current and future algorithmic and statistical breakthroughs will have a high impact on the investigations of the biological role of PTR.
BioPartsDB: a synthetic biology workflow web-application for education and research.
Stracquadanio, Giovanni; Yang, Kun; Boeke, Jef D; Bader, Joel S
2016-11-15
Synthetic biology has become a widely used technology, and expanding applications in research, education and industry require progress tracking for team-based DNA synthesis projects. Although some vendors are beginning to supply multi-kilobase sequence-verified constructs, synthesis workflows starting with short oligos remain important for cost savings and pedagogical benefit. We developed BioPartsDB as an open source, extendable workflow management system for synthetic biology projects with entry points for oligos and larger DNA constructs and ending with sequence-verified clones. BioPartsDB is released under the MIT license and available for download at https://github.com/baderzone/biopartsdb. Additional documentation and video tutorials are available at https://github.com/baderzone/biopartsdb/wiki. An Amazon Web Services image is available from the AWS Market Place (ami-a01d07c8). Contact: joel.bader@jhu.edu. © The Author 2016. Published by Oxford University Press.
Linking soil biology and chemistry in biological soil crust using isolate exometabolomics.
Swenson, Tami L; Karaoz, Ulas; Swenson, Joel M; Bowen, Benjamin P; Northen, Trent R
2018-01-02
Metagenomic sequencing provides a window into microbial community structure and metabolic potential; however, linking these data to exogenous metabolites that microorganisms process and produce (the exometabolome) remains challenging. Previously, we observed strong exometabolite niche partitioning among bacterial isolates from biological soil crust (biocrust). Here we examine native biocrust to determine if these patterns are reproduced in the environment. Overall, most soil metabolites display the expected relationship (positive or negative correlation) with four dominant bacteria following a wetting event and across biocrust developmental stages. For metabolites that were previously found to be consumed by an isolate, 70% are negatively correlated with the abundance of the isolate's closest matching environmental relative in situ, whereas for released metabolites, 67% were positively correlated. Our results demonstrate that metabolite profiling, shotgun sequencing and exometabolomics may be successfully integrated to functionally link microbial community structure with environmental chemistry in biocrust.
Revisiting Robustness and Evolvability: Evolution in Weighted Genotype Spaces
Partha, Raghavendran; Raman, Karthik
2014-01-01
Robustness and evolvability are highly intertwined properties of biological systems. The relationship between these properties determines how biological systems are able to withstand mutations and show variation in response to them. Computational studies have explored the relationship between these two properties using neutral networks of RNA sequences (genotype) and their secondary structures (phenotype) as a model system. However, these studies have assumed every mutation to a sequence to be equally likely; the differences in the likelihood of the occurrence of various mutations, and the consequence of probabilistic nature of the mutations in such a system have previously been ignored. Associating probabilities to mutations essentially results in the weighting of genotype space. We here perform a comparative analysis of weighted and unweighted neutral networks of RNA sequences, and subsequently explore the relationship between robustness and evolvability. We show that assuming an equal likelihood for all mutations (as in an unweighted network), underestimates robustness and overestimates evolvability of a system. In spite of discarding this assumption, we observe that a negative correlation between sequence (genotype) robustness and sequence evolvability persists, and also that structure (phenotype) robustness promotes structure evolvability, as observed in earlier studies using unweighted networks. We also study the effects of base composition bias on robustness and evolvability. Particularly, we explore the association between robustness and evolvability in a sequence space that is AU-rich – sequences with an AU content of 80% or higher, compared to a normal (unbiased) sequence space. We find that evolvability of both sequences and structures in an AU-rich space is lower than in the normal space, and robustness is higher. We also observe that AU-rich populations evolving on neutral networks of phenotypes, can access less phenotypic variation compared to normal populations evolving on neutral networks. PMID:25390641
Cho, Jin-Young; Lee, Hyoung-Joo; Jeong, Seul-Ki; Paik, Young-Ki
2017-12-01
Mass spectrometry (MS) is a widely used proteome analysis tool for biomedical science. In an MS-based bottom-up proteomic approach to protein identification, sequence database (DB) searching has been routinely used because of its simplicity and convenience. However, searching a sequence DB with multiple variable modification options can increase processing time and false-positive errors in large and complicated MS data sets. Spectral library searching is an alternative solution, avoiding the limitations of sequence DB searching and allowing the detection of more peptides with high sensitivity. Unfortunately, this technique has less proteome coverage, resulting in limitations in the detection of novel and whole peptide sequences in biological samples. To solve these problems, we previously developed the "Combo-Spec Search" method, which uses manually multiple references and simulated spectral library searching to analyze whole proteomes in a biological sample. In this study, we have developed a new analytical interface tool called "Epsilon-Q" to enhance the functions of both the Combo-Spec Search method and label-free protein quantification. Epsilon-Q automatically performs multiple spectral library searching, class-specific false-discovery rate control, and result integration. It has a user-friendly graphical interface and demonstrates good performance in identifying and quantifying proteins by supporting standard MS data formats and spectrum-to-spectrum matching powered by SpectraST. Furthermore, when the Epsilon-Q interface is combined with the Combo-Spec search method, called the Epsilon-Q system, it shows a synergistic function by outperforming other sequence DB search engines for identifying and quantifying low-abundance proteins in biological samples. The Epsilon-Q system can be a versatile tool for comparative proteome analysis based on multiple spectral libraries and label-free quantification.
Halmillawewa, Anupama P; Restrepo-Córdoba, Marcela; Perry, Benjamin J; Yost, Christopher K; Hynes, Michael F
2016-02-01
Bacteriophages may play an important role in regulating population size and diversity of the root nodule symbiont Rhizobium leguminosarum, as well as participating in horizontal gene transfer. Although phages that infect this species have been isolated in the past, our knowledge of their molecular biology, and especially of genome composition, is extremely limited, and this lack of information impacts on the ability to assess phage population dynamics and limits potential agricultural applications of rhizobiophages. To help address this deficit in available sequence and biological information, the complete genome sequence of the Myoviridae temperate phage PPF1 that infects R. leguminosarum biovar viciae strain F1 was determined. The genome is 54,506 bp in length with an average G+C content of 61.9 %. The genome contains 94 putative open reading frames (ORFs) and 74.5 % of these predicted ORFs share homology at the protein level with previously reported sequences in the database. However, putative functions could only be assigned to 25.5 % (24 ORFs) of the predicted genes. PPF1 was capable of efficiently lysogenizing its rhizobial host R. leguminosarum F1. The site-specific recombination system of the phage targets an integration site that lies within a putative tRNA-Pro (CGG) gene in R. leguminosarum F1. Upon integration, the phage is capable of restoring the disrupted tRNA gene, owing to the 50 bp homologous sequence (att core region) it shares with its rhizobial host genome. Phage PPF1 is the first temperate phage infecting members of the genus Rhizobium for which a complete genome sequence, as well as other biological data such as the integration site, is available.
Biology of Symbioses between Marine Invertebrates and Intracellular Bacteria
1991-01-21
bisphosphate carboxylase (RubisCO) from symbiotic bacteria of various origins, b) To continue methods development for 16S rRNA sequencing from symbionts in...frozen and badly preserved specimens, and c) To use these new techniques to sequence 16s DNA from a variety of symbionts a) RubisCO We have cloned the...gene coding for RubisCO from the sulfur oxidizing symbiont of the gastropod Alviniconcha hessleri. Nucleotide sequence analysis of the cloned fragment
NASA Astrophysics Data System (ADS)
Kamide, Norihiro; Kaneiwa, Ken
An extended full computation-tree logic, CTLS*, is introduced as a Kripke semantics with a sequence modal operator. This logic can appropriately represent hierarchical tree structures where sequence modal operators in CTLS* are applied to tree structures. An embedding theorem of CTLS* into CTL* is proved. The validity, satisfiability and model-checking problems of CTLS* are shown to be decidable. An illustrative example of biological taxonomy is presented using CTLS* formulas.
Human Chromosome 7: DNA Sequence and Biology
Scherer, Stephen W.; Cheung, Joseph; MacDonald, Jeffrey R.; Osborne, Lucy R.; Nakabayashi, Kazuhiko; Herbrick, Jo-Anne; Carson, Andrew R.; Parker-Katiraee, Layla; Skaug, Jennifer; Khaja, Razi; Zhang, Junjun; Hudek, Alexander K.; Li, Martin; Haddad, May; Duggan, Gavin E.; Fernandez, Bridget A.; Kanematsu, Emiko; Gentles, Simone; Christopoulos, Constantine C.; Choufani, Sanaa; Kwasnicka, Dorota; Zheng, Xiangqun H.; Lai, Zhongwu; Nusskern, Deborah; Zhang, Qing; Gu, Zhiping; Lu, Fu; Zeesman, Susan; Nowaczyk, Malgorzata J.; Teshima, Ikuko; Chitayat, David; Shuman, Cheryl; Weksberg, Rosanna; Zackai, Elaine H.; Grebe, Theresa A.; Cox, Sarah R.; Kirkpatrick, Susan J.; Rahman, Nazneen; Friedman, Jan M.; Heng, Henry H. Q.; Pelicci, Pier Giuseppe; Lo-Coco, Francesco; Belloni, Elena; Shaffer, Lisa G.; Pober, Barbara; Morton, Cynthia C.; Gusella, James F.; Bruns, Gail A. P.; Korf, Bruce R.; Quade, Bradley J.; Ligon, Azra H.; Ferguson, Heather; Higgins, Anne W.; Leach, Natalia T.; Herrick, Steven R.; Lemyre, Emmanuelle; Farra, Chantal G.; Kim, Hyung-Goo; Summers, Anne M.; Gripp, Karen W.; Roberts, Wendy; Szatmari, Peter; Winsor, Elizabeth J. T.; Grzeschik, Karl-Heinz; Teebi, Ahmed; Minassian, Berge A.; Kere, Juha; Armengol, Lluis; Pujana, Miguel Angel; Estivill, Xavier; Wilson, Michael D.; Koop, Ben F.; Tosi, Sabrina; Moore, Gudrun E.; Boright, Andrew P.; Zlotorynski, Eitan; Kerem, Batsheva; Kroisel, Peter M.; Petek, Erwin; Oscier, David G.; Mould, Sarah J.; Döhner, Hartmut; Döhner, Konstanze; Rommens, Johanna M.; Vincent, John B.; Venter, J. Craig; Li, Peter W.; Mural, Richard J.; Adams, Mark D.; Tsui, Lap-Chee
2010-01-01
DNA sequence and annotation of the entire human chromosome 7, encompassing nearly 158 million nucleotides of DNA and 1917 gene structures, are presented. To generate a higher order description, additional structural features such as imprinted genes, fragile sites, and segmental duplications were integrated at the level of the DNA sequence with medical genetic data, including 440 chromosome rearrangement breakpoints associated with disease. This approach enabled the discovery of candidate genes for developmental diseases including autism. PMID:12690205
Complete Genome Sequence of Bifidobacterium bifidum S17
Zhurina, Daria; Zomer, Aldert; Gleinser, Marita; Brancaccio, Vincenco Francesco; Auchter, Marc; Waidmann, Mark S.; Westermann, Christina; van Sinderen, Douwe; Riedel, Christian U.
2011-01-01
Here, we report on the first completely annotated genome sequence of a Bifidobacterium bifidum strain. B. bifidum S17, isolated from feces of a breast-fed infant, was shown to strongly adhere to intestinal epithelial cells and has potent anti-inflammatory activity in vitro and in vivo. The genome sequence will provide new insights into the biology of this potential probiotic organism and allow for the characterization of the molecular mechanisms underlying its beneficial properties. PMID:21037011
2012-01-01
Background As a human replacement, the crab-eating macaque (Macaca fascicularis) is an invaluable non-human primate model for biomedical research, but the lack of genetic information on this primate has represented a significant obstacle for its broader use. Results Here, we sequenced the transcriptome of 16 tissues originating from two individuals of crab-eating macaque (male and female), and identified genes to resolve the main obstacles for understanding the biological response of the crab-eating macaque. From 4 million reads with 1.4 billion base sequences, 31,786 isotigs containing genes similar to those of humans, 12,672 novel isotigs, and 348,160 singletons were identified using the GS FLX sequencing method. Approximately 86% of human genes were represented among the genes sequenced in this study. Additionally, 175 tissue-specific transcripts were identified, 81 of which were experimentally validated. In total, 4,314 alternative splicing (AS) events were identified and analyzed. Intriguingly, 10.4% of AS events were associated with transposable element (TE) insertions. Finally, investigation of TE exonization events and evolutionary analysis were conducted, revealing interesting phenomena of human-specific amplified trends in TE exonization events. Conclusions This report represents the first large-scale transcriptome sequencing and genetic analyses of M. fascicularis and could contribute to its utility for biomedical research and basic biology. PMID:22554259
Blanden, Melanie J; Suazo, Kiall F; Hildebrandt, Emily R; Hardgrove, Daniel S; Patel, Meet; Saunders, William P; Distefano, Mark D; Schmidt, Walter K; Hougland, James L
2018-02-23
Protein prenylation is a post-translational modification that has been most commonly associated with enabling protein trafficking to and interaction with cellular membranes. In this process, an isoprenoid group is attached to a cysteine near the C terminus of a substrate protein by protein farnesyltransferase (FTase) or protein geranylgeranyltransferase type I or II (GGTase-I and GGTase-II). FTase and GGTase-I have long been proposed to specifically recognize a four-amino acid CAAX C-terminal sequence within their substrates. Surprisingly, genetic screening reveals that yeast FTase can modify sequences longer than the canonical CAAX sequence, specifically C(x)3X sequences with four amino acids downstream of the cysteine. Biochemical and cell-based studies using both peptide and protein substrates reveal that mammalian FTase orthologs can also prenylate C(x)3X sequences. As the search to identify physiologically relevant C(x)3X proteins begins, this new prenylation motif nearly doubles the number of proteins within the yeast and human proteomes that can be explored as potential FTase substrates. This work expands our understanding of prenylation's impact within the proteome, establishes the biologically relevant reactivity possible with this new motif, and opens new frontiers in determining the impact of non-canonically prenylated proteins on cell function. © 2018 by The American Society for Biochemistry and Molecular Biology, Inc.
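The difference between the two motifs described above is only the position of the cysteine relative to the C terminus: three downstream residues for the canonical motif, four for the extended one. The following is a minimal sketch of that positional check, not the screening method used in the study. The function name and the example sequences are hypothetical, and the sketch deliberately ignores any residue-class preferences at the downstream positions.

```python
def c_terminal_motifs(seq: str) -> list[str]:
    """Return which cysteine-positioned C-terminal motifs a protein sequence carries.

    Assumption: a motif is defined purely by where the cysteine sits.
      CAAX   -> cysteine with exactly three residues downstream (4th from the end)
      C(x)3X -> cysteine with exactly four residues downstream (5th from the end)
    """
    seq = seq.upper()
    found = []
    if len(seq) >= 4 and seq[-4] == "C":
        found.append("CAAX")
    if len(seq) >= 5 and seq[-5] == "C":
        found.append("C(x)3X")
    return found


if __name__ == "__main__":
    # Hypothetical sequences, for illustration only.
    for s in ("MKCVLS", "MKCVLSA", "MKCCVLS", "MKVLS"):
        print(s, c_terminal_motifs(s))
```

A sequence with cysteines at both positions reports both motifs, which is why the function returns a list rather than a single label.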
Cattaneo, Luigi; Fasanelli, Monica; Andreatta, Olaf; Bonifati, Domenico Marco; Barchiesi, Guido; Caruana, Fausto
2012-03-01
Empirical evidence indicates that cognitive consequences of cerebellar lesions tend to be mild and less important than the symptoms due to lesions to cerebral areas. By contrast, imaging studies consistently report strong cerebellar activity during tasks of action observation and action understanding. This has been interpreted as part of the automatic motor simulation process that takes place in the context of action observation. The function of the cerebellum as a sequencer during executed movements makes it a good candidate, within the framework of embodied cognition, for a pivotal role in understanding the timing of action sequences. Here, we investigated a cohort of eight patients with chronic, first-ever, isolated, ischemic lesions of the cerebellum. The experimental task consisted in identifying a plausible sequence of pictures from a randomly ordered group of still frames extracted from (a) a complex action performed by a human actor ("biological action" test) or (b) a complex physical event occurring to an inanimate object ("folk physics" test). A group of 16 healthy participants was used as control. The main result showed that cerebellar patients performed significantly worse than controls in both sequencing tasks, but performed much worse in the "biological action" test than in the "folk physics" test. The dissociation described here suggests that observed sequences of simple motor acts seem to be represented differentially from other sequences in the cerebellum.
Company profile: Complete Genomics Inc.
Reid, Clifford
2011-02-01
Complete Genomics Inc. is a life sciences company that focuses on complete human genome sequencing. It is taking a completely different approach to DNA sequencing than other companies in the industry. Rather than building a general-purpose platform for sequencing all organisms and all applications, it has focused on a single application - complete human genome sequencing. The company's Complete Genomics Analysis Platform (CGA™ Platform) comprises an integrated package of biochemistry, instrumentation and software that sequences human genomes at the highest quality, lowest cost and largest scale available. Complete Genomics offers a turnkey service that enables customers to outsource their human genome sequencing to the company's genome sequencing center in Mountain View, CA, USA. Customers send in their DNA samples, the company does all the library preparation, DNA sequencing, assembly and variant analysis, and customers receive research-ready data that they can use for biological discovery.
Kim, Jejoong; Park, Sohee; Blake, Randolph
2011-01-01
Background Anomalous visual perception is a common feature of schizophrenia plausibly associated with impaired social cognition that, in turn, could affect social behavior. Past research suggests impairment in biological motion perception in schizophrenia. Behavioral and functional magnetic resonance imaging (fMRI) experiments were conducted to verify the existence of this impairment, to clarify its perceptual basis, and to identify accompanying neural concomitants of those deficits. Methodology/Findings In Experiment 1, we measured ability to detect biological motion portrayed by point-light animations embedded within masking noise. Experiment 2 measured discrimination accuracy for pairs of point-light biological motion sequences differing in the degree of perturbation of the kinematics portrayed in those sequences. Experiment 3 measured BOLD signals using event-related fMRI during a biological motion categorization task. Compared to healthy individuals, schizophrenia patients performed significantly worse on both the detection (Experiment 1) and discrimination (Experiment 2) tasks. Consistent with the behavioral results, the fMRI study revealed that healthy individuals exhibited strong activation to biological motion, but not to scrambled motion in the posterior portion of the superior temporal sulcus (STSp). Interestingly, strong STSp activation was also observed for scrambled or partially scrambled motion when the healthy participants perceived it as normal biological motion. On the other hand, STSp activation in schizophrenia patients was not selective to biological or scrambled motion. Conclusion Schizophrenia is accompanied by difficulties discriminating biological from non-biological motion, and associated with those difficulties are altered patterns of neural responses within brain area STSp. 
The perceptual deficits exhibited by schizophrenia patients may be an exaggerated manifestation of neural events within STSp associated with perceptual errors made by healthy observers on these same tasks. The present findings fit within the context of theories of delusion involving perceptual and cognitive processes. PMID:21625492
NASA Astrophysics Data System (ADS)
Serra, Reviewed By Martin J.
2000-01-01
Genomics is one of the most rapidly expanding areas of science. This book is an outgrowth of a series of lectures given by one of the former heads (CRC) of the Human Genome Initiative. The book is designed to reach a wide audience, from biologists with little chemical or physical science background through engineers, computer scientists, and physicists with little current exposure to the chemical or biological principles of genetics. The text starts with a basic review of the chemical and biological properties of DNA. However, without either a biochemistry background or a supplemental biochemistry text, this chapter and much of the rest of the text would be difficult to digest. The second chapter is designed to put DNA into the context of the larger chromosomal unit. Specialized chromosomal structures and sequences (centromeres, telomeres) are introduced, leading to a section on chromosome organization and purification. The next 4 chapters cover the physical (hybridization, electrophoresis), chemical (polymerase chain reaction), and biological (genetic) techniques that provide the backbone of genomic analysis. These chapters cover in significant detail the fundamental principles underlying each technique and provide a firm background for the remainder of the text. Chapters 7-9 consider the need and methods for the development of physical maps. Chapter 7 primarily discusses chromosomal localization techniques, including in situ hybridization, FISH, and chromosome paintings. The next two chapters focus on the development of libraries and clones. In particular, Chapter 9 considers the limitations of current mapping and clone production. The current state and future of DNA sequencing is covered in the next three chapters. The first considers the current methods of DNA sequencing - especially gel-based methods of analysis, although other possible approaches (mass spectrometry) are introduced. 
Much of the chapter addresses the limitations of current methods, including analysis of error in sequencing and current bottlenecks in the sequencing effort. The next chapter describes the steps necessary to scale current technologies for the sequencing of entire genomes. Chapter 12 examines alternate methods for DNA sequencing. Initially, methods of single-molecule sequencing and sequencing by microscopy are introduced; the majority of the chapter is devoted to the development of DNA sequencing methods using chip microarrays and hybridization. The remaining chapters (13-15) consider the uses and analysis of DNA sequence information. The initial focus is on the identification of genes. Several examples are given of the use of DNA sequence information for diagnosis of inherited or infectious diseases. The sequence-specific manipulation of DNA is discussed in Chapter 14. The final chapter deals with the implications of large-scale sequencing, including methods for identifying genes and finding errors in DNA sequences, to the development of computer algorithms for the interpretation of DNA sequence information. The text figures are black and white line drawings that, although clearly done, seem a bit primitive for 1999. While I appreciated the simplicity of the drawings, many students accustomed to more colorful presentations will find them wanting. The four color figures in the center of the text seem an afterthought and add little to the text's clarity. Each chapter has a set of additional reading sources, mostly primary sources. Often, specialized topics are offset into boxes that provide clarification and amplification without cluttering the text. An appendix includes a list of the Web-based database resources. As an undergraduate instructor who has previously taught biochemistry, molecular biology, and a course on the human genome, I found many interesting tidbits and amplifications throughout the text. 
I would recommend this book as a text for an advanced undergraduate or beginning graduate course in genomics. Although the text works through several examples of genetic and genome analysis, additional problem/homework sets would need to be developed to ensure student comprehension. The text steers clear of the ethical implications of the Human Genome Initiative and remains true to its subtitle, The Science and Technology.
Graph mining for next generation sequencing: leveraging the assembly graph for biological insights.
Warnke-Sommer, Julia; Ali, Hesham
2016-05-06
The assembly of Next Generation Sequencing (NGS) reads remains a challenging task. This is especially true for the assembly of metagenomics data that originate from environmental samples potentially containing hundreds to thousands of unique species. The principal objective of current assembly tools is to assemble NGS reads into contiguous stretches of sequence called contigs while maximizing for both accuracy and contig length. The end goal of this process is to produce longer contigs with the major focus being on assembly only. Sequence read assembly is an aggregative process, during which read overlap relationship information is lost as reads are merged into longer sequences or contigs. The assembly graph is information rich and capable of capturing the genomic architecture of an input read data set. We have developed a novel hybrid graph in which nodes represent sequence regions at different levels of granularity. This model, utilized in the assembly and analysis pipeline Focus, presents a concise yet feature rich view of a given input data set, allowing for the extraction of biologically relevant graph structures for graph mining purposes. Focus was used to create hybrid graphs to model metagenomics data sets obtained from the gut microbiomes of five individuals with Crohn's disease and eight healthy individuals. Repetitive and mobile genetic elements are found to be associated with hybrid graph structure. Using graph mining techniques, a comparative study of the Crohn's disease and healthy data sets was conducted with focus on antibiotics resistance genes associated with transposase genes. Results demonstrated significant differences in the phylogenetic distribution of categories of antibiotics resistance genes in the healthy and diseased patients. Focus was also evaluated as a pure assembly tool and produced excellent results when compared against the Meta-velvet, Omega, and UD-IDBA assemblers. 
Mining the hybrid graph can reveal biological phenomena captured by its structure. We demonstrate the advantages of considering assembly graphs as data-mining support in addition to their role as frameworks for assembly.
RDNAnalyzer: A tool for DNA secondary structure prediction and sequence analysis.
Afzal, Muhammad; Shahid, Ahmad Ali; Shehzadi, Abida; Nadeem, Shahid; Husnain, Tayyab
2012-01-01
RDNAnalyzer is an innovative computer-based tool designed for DNA secondary structure prediction and sequence analysis. It can randomly generate a DNA sequence, or users can upload sequences of their own interest in RAW format. It uses and extends the Nussinov dynamic programming algorithm and has various applications for sequence analysis. It predicts the DNA secondary structure and base pairings. It also provides tools for sequence analyses routinely performed by biological scientists, such as DNA replication, reverse complement generation, transcription, translation, sequence-specific information (total number of nucleotide bases and ATGC base contents along with their respective percentages), and a sequence cleaner. RDNAnalyzer is a unique tool developed in Microsoft Visual Studio 2008 using Microsoft Visual C# and Windows Presentation Foundation and provides a user-friendly environment for sequence analysis. It is freely available. http://www.cemb.edu.pk/sw.html Abbreviations: RDNAnalyzer - Random DNA Analyser, GUI - Graphical user interface, XAML - Extensible Application Markup Language.
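For readers unfamiliar with the Nussinov algorithm named above, the following is a minimal sketch of the textbook recurrence, which maximizes the number of nested base pairs. It is not RDNAnalyzer's code (the tool is written in C# and extends the algorithm in ways the abstract does not detail). The function names, the restriction to A-T and G-C pairs, and the `min_loop` parameter are assumptions made for illustration. A reverse-complement helper is included because the abstract lists that operation.

```python
def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence over the alphabet ACGT."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def nussinov_max_pairs(seq: str, min_loop: int = 3) -> int:
    """Maximum number of nested Watson-Crick pairs (textbook Nussinov recurrence).

    Assumptions: only A-T and G-C pairs count, and a pair (i, k) must enclose
    at least `min_loop` unpaired positions.
    """
    seq = seq.upper()
    pairs = {("A", "T"), ("T", "A"), ("G", "C"), ("C", "G")}
    n = len(seq)
    if n == 0:
        return 0
    # dp[i][j] = best pair count for the subsequence seq[i..j] (inclusive).
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i + 1][j]  # case 1: position i stays unpaired
            for k in range(i + min_loop + 1, j + 1):  # case 2: i pairs with k
                if (seq[i], seq[k]) in pairs:
                    inside = dp[i + 1][k - 1]
                    outside = dp[k + 1][j] if k + 1 <= j else 0
                    best = max(best, 1 + inside + outside)
            dp[i][j] = best
    return dp[0][n - 1]


if __name__ == "__main__":
    print(reverse_complement("ATGC"))
    print(nussinov_max_pairs("GGGAAATCCC"))
```

The recurrence only counts pairs; recovering the actual pairing would require a traceback over the same table.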
Snake Genome Sequencing: Results and Future Prospects
Kerkkamp, Harald M. I.; Kini, R. Manjunatha; Pospelov, Alexey S.; Vonk, Freek J.; Henkel, Christiaan V.; Richardson, Michael K.
2016-01-01
Snake genome sequencing is in its infancy—very much behind the progress made in sequencing the genomes of humans, model organisms and pathogens relevant to biomedical research, and agricultural species. We provide here an overview of some of the snake genome projects in progress, and discuss the biological findings, with special emphasis on toxinology, from the small number of draft snake genomes already published. We discuss the future of snake genomics, pointing out that new sequencing technologies will help overcome the problem of repetitive sequences in assembling snake genomes. Genome sequences are also likely to be valuable in examining the clustering of toxin genes on the chromosomes, in designing recombinant antivenoms and in studying the epigenetic regulation of toxin gene expression. PMID:27916957
Snake Genome Sequencing: Results and Future Prospects.
Kerkkamp, Harald M I; Kini, R Manjunatha; Pospelov, Alexey S; Vonk, Freek J; Henkel, Christiaan V; Richardson, Michael K
2016-12-01
Snake genome sequencing is in its infancy, very much behind the progress made in sequencing the genomes of humans, model organisms and pathogens relevant to biomedical research, and agricultural species. We provide here an overview of some of the snake genome projects in progress, and discuss the biological findings, with special emphasis on toxinology, from the small number of draft snake genomes already published. We discuss the future of snake genomics, pointing out that new sequencing technologies will help overcome the problem of repetitive sequences in assembling snake genomes. Genome sequences are also likely to be valuable in examining the clustering of toxin genes on the chromosomes, in designing recombinant antivenoms and in studying the epigenetic regulation of toxin gene expression.
Evolution of microbes and viruses: a paradigm shift in evolutionary biology?
Koonin, Eugene V.; Wolf, Yuri I.
2012-01-01
When Charles Darwin formulated the central principles of evolutionary biology in the Origin of Species in 1859 and the architects of the Modern Synthesis integrated these principles with population genetics almost a century later, the principal if not the sole objects of evolutionary biology were multicellular eukaryotes, primarily animals and plants. Before the advent of efficient gene sequencing, all attempts to extend evolutionary studies to bacteria had been futile. Sequencing of the rRNA genes in thousands of microbes allowed the construction of the three-domain “ribosomal Tree of Life” that was widely thought to have resolved the evolutionary relationships between the cellular life forms. However, subsequent massive sequencing of numerous, complete microbial genomes revealed novel evolutionary phenomena, the most fundamental of these being: (1) pervasive horizontal gene transfer (HGT), in large part mediated by viruses and plasmids, that shapes the genomes of archaea and bacteria and calls for a radical revision (if not abandonment) of the Tree of Life concept, (2) Lamarckian-type inheritance that appears to be critical for antivirus defense and other forms of adaptation in prokaryotes, and (3) evolution of evolvability, i.e., dedicated mechanisms for evolution such as vehicles for HGT and stress-induced mutagenesis systems. In the non-cellular part of the microbial world, phylogenomics and metagenomics of viruses and related selfish genetic elements revealed enormous genetic and molecular diversity and extremely high abundance of viruses that come across as the dominant biological entities on earth. Furthermore, the perennial arms race between viruses and their hosts is one of the defining factors of evolution. Thus, microbial phylogenomics adds new dimensions to the fundamental picture of evolution even as the principle of descent with modification discovered by Darwin and the laws of population genetics remain at the core of evolutionary biology. 
PMID:22993722
New Biological Sciences, Sociology and Education
ERIC Educational Resources Information Center
Youdell, Deborah
2016-01-01
Since the Human Genome Project mapped the gene sequence, new biological sciences have been generating a raft of new knowledges about the mechanisms and functions of the molecular body. One area of work that has particular potential to speak to sociology of education, is the emerging field of epigenetics. Epigenetics moves away from the mapped…
ERIC Educational Resources Information Center
Smith, Jason T.; Harris, Justine C.; Lopez, Oscar J.; Valverde, Laura; Borchert, Glen M.
2015-01-01
The sequencing of whole genomes and the analysis of genetic information continue to fundamentally change biological and medical research. Unfortunately, the people best suited to interpret this data (biologically trained researchers) are commonly discouraged by their own perceived computational limitations. To address this, we developed a course…
Hypervariable minisatellites: recombinators or innocent bystanders?
Jarman, A P; Wells, R A
1989-11-01
It has become apparent in recent years that unexpectedly large numbers of minisatellites exist within the eukaryotic genome. Their use in genetics is well known, but as with any new class of sequence, there is also much speculation about their involvement in a range of biological processes. How much is known of their biology?
Genomic science provides new insights into the biology of forest trees
Andrew Groover
2015-01-01
Forest biology is undergoing a fundamental change fostered by the application of genomic science to longstanding questions surrounding the evolution, adaptive traits, development, and environmental interactions of tree species. Genomic science has made major technical leaps in recent years, most notably with the advent of 'next generation sequencing' but...
ERIC Educational Resources Information Center
Zhang, Xiaorong
2011-01-01
We incorporated a bioinformatics component into the freshman biology course that allows students to explore cystic fibrosis (CF), a common genetic disorder, using bioinformatics tools and skills. Students learn about CF through searching genetic databases, analyzing genetic sequences, and observing the three-dimensional structures of proteins…
USDA-ARS's Scientific Manuscript database
Progress in studying the biology of Trichinella spp. was greatly advanced with the publication and analysis of the draft genome sequence of T. spiralis. Those data provide a basis for constructing testable hypotheses concerning parasite physiology, immunology, and genetics. They also provide tools...
Leveraging the rice genome sequence for monocot comparative and translational genomics.
Lohithaswa, H C; Feltus, F A; Singh, H P; Bacon, C D; Bailey, C D; Paterson, A H
2007-07-01
Common genome anchor points across many taxa greatly facilitate translational and comparative genomics and will improve our understanding of the Tree of Life. To add to the repertoire of genomic tools applicable to the study of monocotyledonous plants in general, we aligned Allium and Musa ESTs to Oryza BAC sequences and identified candidate Allium-Oryza and Musa-Oryza conserved intron-scanning primers (CISPs). A random sampling of 96 CISP primer pairs, representing loci from 11 of the 12 chromosomes in rice, was tested on seven members of the order Poales and on representatives of the Arecales, Asparagales, and Zingiberales monocot orders. The single-copy amplification success rates of Allium (31.3%), Cynodon (31.4%), Hordeum (30.2%), Musa (37.5%), Oryza (61.5%), Pennisetum (33.3%), Sorghum (47.9%), Zea (33.3%), Triticum (30.2%), and representatives of the palm family (32.3%) suggest that subsets of these primers will provide DNA markers suitable for comparative and translational genomics in orphan crops, as well as for applications in conservation biology, ecology, invasion biology, population biology, systematic biology, and related fields.
Harkness, Robert W; Mittermaier, Anthony K
2017-11-01
G-quadruplexes (GQs) are four-stranded nucleic acid secondary structures formed by guanosine (G)-rich DNA and RNA sequences. It is becoming increasingly clear that cellular processes including gene expression and mRNA translation are regulated by GQs. GQ structures have been extensively characterized; however, little attention to date has been paid to their conformational dynamics, despite the fact that many biological GQ sequences populate multiple structures of similar free energies, leading to an ensemble of exchanging conformations. The impact of these dynamics on biological function is currently not well understood. Recently, structural dynamics have been demonstrated to entropically stabilize GQ ensembles, potentially modulating gene expression. Transient, low-populated states in GQ ensembles may additionally regulate nucleic acid interactions and function. This review will underscore the interplay of GQ dynamics and biological function, focusing on several dynamic processes for biological GQs and the characterization of GQ dynamics by nuclear magnetic resonance (NMR) spectroscopy in conjunction with other biophysical techniques. This article is part of a Special Issue entitled: Biophysics in Canada, edited by Lewis Kay, John Baenziger, Albert Berghuis and Peter Tieleman. Copyright © 2017 Elsevier B.V. All rights reserved.
Mapping the patent landscape of synthetic biology for fine chemical production pathways.
Carbonell, Pablo; Gök, Abdullah; Shapira, Philip; Faulon, Jean-Loup
2016-09-01
A goal of synthetic biology bio-foundries is to innovate through an iterative design/build/test/learn pipeline. In assessing the value of new chemical production routes, the intellectual property (IP) novelty of the pathway is important. Exploratory studies can be carried out using knowledge of the patent/IP landscape for synthetic biology and metabolic engineering. In this paper, we perform an assessment of pathways as potential targets for chemical production across the full catalogue of reachable chemicals in the extended metabolic space of chassis organisms, as computed by the retrosynthesis-based algorithm RetroPath. Our database for reactions processed by sequences in heterologous pathways was screened against the PatSeq database, a comprehensive collection of more than 150M sequences present in patent grants and applications. We also examine related patent families using Derwent Innovations. This large-scale computational study provides useful insights into the IP landscape of synthetic biology for fine and specialty chemicals production. © 2016 The Authors. Microbial Biotechnology published by John Wiley & Sons Ltd and Society for Applied Microbiology.
Del Medico, Luca; Christen, Heinz; Christen, Beat
2017-01-01
Recent advances in lower-cost DNA synthesis techniques have enabled new innovations in the field of synthetic biology. Still, efficient design and higher-order assembly of genome-scale DNA constructs remains a labor-intensive process. Given the complexity, computer assisted design tools that fragment large DNA sequences into fabricable DNA blocks are needed to pave the way towards streamlined assembly of biological systems. Here, we present the Genome Partitioner software implemented as a web-based interface that permits multi-level partitioning of genome-scale DNA designs. Without the need for specialized computing skills, biologists can submit their DNA designs to a fully automated pipeline that generates the optimal retrosynthetic route for higher-order DNA assembly. To test the algorithm, we partitioned a 783 kb Caulobacter crescentus genome design. We validated the partitioning strategy by assembling a 20 kb test segment encompassing a difficult to synthesize DNA sequence. Successful assembly from 1 kb subblocks into the 20 kb segment highlights the effectiveness of the Genome Partitioner for reducing synthesis costs and timelines for higher-order DNA assembly. The Genome Partitioner is broadly applicable to translate DNA designs into ready to order sequences that can be assembled with standardized protocols, thus offering new opportunities to harness the diversity of microbial genomes for synthetic biology applications. The Genome Partitioner web tool can be accessed at https://christenlab.ethz.ch/GenomePartitioner. PMID:28531174
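The multi-level partitioning idea in the abstract above can be illustrated with a minimal sketch. This is not the Genome Partitioner's algorithm: it only cuts a design into consecutive blocks at each level and ignores synthesis constraints and the overlaps a real assembly would need. The function names `partition` and `multilevel_partition` and the block sizes are assumptions chosen for illustration.

```python
def partition(start, end, block_size):
    """Split the half-open interval [start, end) into consecutive blocks
    of at most block_size bases."""
    return [(s, min(s + block_size, end)) for s in range(start, end, block_size)]


def multilevel_partition(start, end, level_sizes):
    """Recursively partition [start, end) using one block size per level,
    listed from the largest level to the smallest.

    Returns a list of (start, end, children) tuples, where children is the
    partition of that block at the next level down.
    """
    if not level_sizes:
        return []
    size, rest = level_sizes[0], level_sizes[1:]
    return [
        (s, e, multilevel_partition(s, e, rest))
        for s, e in partition(start, end, size)
    ]


# Hypothetical 45 kb design split into 20 kb segments, then 1 kb subblocks.
tree = multilevel_partition(0, 45_000, [20_000, 1_000])
for seg_start, seg_end, subblocks in tree:
    print(seg_start, seg_end, len(subblocks))
```

Keeping the tree explicit makes the assembly route readable in reverse: each parent block is assembled from its children.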
Jones, Roger A C; Kehoe, Monica A
2016-07-01
Current approaches used to name within-species, plant virus phylogenetic groups are often misleading and illogical. They involve names based on biological properties, sequence differences and geographical, country or place-association designations, or any combination of these. This type of nomenclature is becoming increasingly unsustainable as numbers of sequences of the same virus from new host species and different parts of the world increase. Moreover, this increase is accelerating as world trade and agriculture expand, and climate change progresses. Serious consequences for virus research and disease management might arise from incorrect assumptions made when current within-species phylogenetic group names incorrectly identify properties of group members. This could result in development of molecular tools that incorrectly target dangerous virus strains, potentially leading to unjustified impediments to international trade or failure to prevent such strains being introduced to countries, regions or continents formerly free of them. Dangerous strains might be missed or misdiagnosed by diagnostic laboratories and monitoring programs, and new cultivars with incorrect strain-specific resistances released. Incorrect deductions are possible during phylogenetic analysis of plant virus sequences and errors from strain misidentification during molecular and biological virus research activities. A nomenclature system for within-species plant virus phylogenetic group names is needed which avoids such problems. We suggest replacing all other naming approaches with Latinized numerals, restricting biologically based names only to biological strains and removing geographically based names altogether. Our recommendations have implications for biosecurity authorities, diagnostic laboratories, disease-management programs, plant breeders and researchers.
A Novel Center Star Multiple Sequence Alignment Algorithm Based on Affine Gap Penalty and K-Band
NASA Astrophysics Data System (ADS)
Zou, Quan; Shan, Xiao; Jiang, Yi
Multiple sequence alignment is one of the most important topics in computational biology, but existing methods still cannot handle large data sets. With the development of copy-number variant (CNV) and single nucleotide polymorphism (SNP) research, many researchers want to align large numbers of similar sequences to detect CNVs and SNPs. In this paper, we propose a novel multiple sequence alignment algorithm based on affine gap penalty and k-band. It aligns sequences more quickly and accurately, which will be helpful for mining CNVs and SNPs. Experiments demonstrate the performance of our algorithm.
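The two ingredients named in the abstract, an affine gap penalty and a k-band, can be sketched for a single pairwise alignment, the building block of a center star method. The sketch below is a generic banded dynamic program with three score matrices, not the authors' implementation. The scoring values and the function name `banded_affine_score` are assumptions. A gap of length L is assumed to cost `gap_open + (L - 1) * gap_extend`.

```python
NEG = float("-inf")


def banded_affine_score(a, b, k, match=1, mismatch=-1, gap_open=-3, gap_extend=-1):
    """Global alignment score of a and b with affine gaps, restricted to
    cells (i, j) with |i - j| <= k (the k-band)."""
    n, m = len(a), len(b)
    if abs(n - m) > k:
        raise ValueError("band too narrow for these sequence lengths")
    # M: last column is a match/mismatch; X: a[i-1] aligned to a gap;
    # Y: b[j-1] aligned to a gap. Cells outside the band stay at -inf.
    M = [[NEG] * (m + 1) for _ in range(n + 1)]
    X = [[NEG] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, min(n, k) + 1):
        X[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, min(m, k) + 1):
        Y[0][j] = gap_open + (j - 1) * gap_extend
    for i in range(1, n + 1):
        # Only visit columns inside the band around the main diagonal.
        for j in range(max(1, i - k), min(m, i + k) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]) + s
            X[i][j] = max(M[i - 1][j] + gap_open,
                          X[i - 1][j] + gap_extend,
                          Y[i - 1][j] + gap_open)
            Y[i][j] = max(M[i][j - 1] + gap_open,
                          Y[i][j - 1] + gap_extend,
                          X[i][j - 1] + gap_open)
    return max(M[n][m], X[n][m], Y[n][m])


print(banded_affine_score("ACGT", "AGT", k=1))
```

For clarity the sketch allocates full matrices and fills only the band; a memory-efficient version would store just the 2k + 1 diagonals, which is where the speed-up on similar sequences comes from.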
Worley, K C; Wiese, B A; Smith, R F
1995-09-01
BEAUTY (BLAST enhanced alignment utility) is an enhanced version of the NCBI's BLAST data base search tool that facilitates identification of the functions of matched sequences. We have created new data bases of conserved regions and functional domains for protein sequences in NCBI's Entrez data base, and BEAUTY allows this information to be incorporated directly into BLAST search results. A Conserved Regions Data Base, containing the locations of conserved regions within Entrez protein sequences, was constructed by (1) clustering the entire data base into families, (2) aligning each family using our PIMA multiple sequence alignment program, and (3) scanning the multiple alignments to locate the conserved regions within each aligned sequence. A separate Annotated Domains Data Base was constructed by extracting the locations of all annotated domains and sites from sequences represented in the Entrez, PROSITE, BLOCKS, and PRINTS data bases. BEAUTY performs a BLAST search of those Entrez sequences with conserved regions and/or annotated domains. BEAUTY then uses the information from the Conserved Regions and Annotated Domains data bases to generate, for each matched sequence, a schematic display that allows one to directly compare the relative locations of (1) the conserved regions, (2) annotated domains and sites, and (3) the locally aligned regions matched in the BLAST search. In addition, BEAUTY search results include World-Wide Web hypertext links to a number of external data bases that provide a variety of additional types of information on the function of matched sequences. This convenient integration of protein families, conserved regions, annotated domains, alignment displays, and World-Wide Web resources greatly enhances the biological informativeness of sequence similarity searches. 
BEAUTY searches can be performed remotely on our system using the "BCM Search Launcher" World-Wide Web pages (URL is http://gc.bcm.tmc.edu:8088/search-launcher/launcher.html).
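Step (3) of the Conserved Regions Data Base construction, scanning a multiple alignment to locate conserved regions, can be sketched as follows. This is an illustrative simplification and not the PIMA-based procedure used by BEAUTY: a column is treated as conserved when its most common residue is not a gap and reaches an identity threshold, and runs of conserved columns are reported. The function name `conserved_regions` and the parameters `min_identity` and `min_len` are assumptions.

```python
from collections import Counter


def conserved_regions(alignment, min_identity=0.8, min_len=3):
    """Return half-open (start, end) column ranges of an alignment in which
    every column is conserved, keeping only runs of at least min_len columns."""
    ncols = len(alignment[0])
    assert all(len(row) == ncols for row in alignment), "rows must be aligned"
    regions, run_start = [], None
    # Iterate one column past the end so a run touching the last column is closed.
    for col in range(ncols + 1):
        conserved = False
        if col < ncols:
            column = [row[col] for row in alignment]
            residue, count = Counter(column).most_common(1)[0]
            conserved = residue != "-" and count / len(column) >= min_identity
        if conserved and run_start is None:
            run_start = col
        elif not conserved and run_start is not None:
            if col - run_start >= min_len:
                regions.append((run_start, col))
            run_start = None
    return regions


family = ["MKTAYIAK", "MKTGYIAK", "MKT-YIAR"]
print(conserved_regions(family, min_identity=1.0))
```

The resulting column ranges could then be mapped back to per-sequence coordinates, which is the form needed to draw them next to the locally aligned regions from a search.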
SALAD database: a motif-based database of protein annotations for plant comparative genomics
Mihara, Motohiro; Itoh, Takeshi; Izawa, Takeshi
2010-01-01
Proteins often have several motifs with distinct evolutionary histories. Proteins with similar motifs have similar biochemical properties and thus related biological functions. We constructed a unique comparative genomics database termed the SALAD database (http://salad.dna.affrc.go.jp/salad/) from plant-genome-based proteome data sets. We extracted evolutionarily conserved motifs by MEME software from 209 529 protein-sequence annotation groups selected by BLASTP from the proteome data sets of 10 species: rice, sorghum, Arabidopsis thaliana, grape, a lycophyte, a moss, 3 algae, and yeast. Similarity clustering of each protein group was performed by pairwise scoring of the motif patterns of the sequences. The SALAD database provides a user-friendly graphical viewer that displays a motif pattern diagram linked to the resulting bootstrapped dendrogram for each protein group. Amino-acid-sequence-based and nucleotide-sequence-based phylogenetic trees for motif combination alignment, a logo comparison diagram for each clade in the tree, and a Pfam-domain pattern diagram are also available. We also developed a viewer named ‘SALAD on ARRAYs’ to view arbitrary microarray data sets of paralogous genes linked to the same dendrogram in a window. The SALAD database is a powerful tool for comparing protein sequences and can provide valuable hints for biological analysis. PMID:19854933
RNA regulators responding to ribosomal protein S15 are frequent in sequence space
Slinger, Betty L.; Meyer, Michelle M.
2016-01-01
There are several natural examples of distinct RNA structures that interact with the same ligand to regulate the expression of homologous genes in different organisms. One essential question regarding this phenomenon is whether such RNA regulators are the result of convergent or divergent evolution. Are the RNAs derived from some common ancestor and diverged to the point where we cannot identify the similarity, or have multiple solutions to the same biological problem arisen independently? A key variable in assessing these alternatives is how frequently such regulators arise within sequence space. Ribosomal protein S15 is autogenously regulated via an RNA regulator in many bacterial species; four apparently distinct regulators have been functionally validated in different bacterial phyla. Here, we explore how frequently such regulators arise within a partially randomized sequence population. We find many RNAs that interact specifically with ribosomal protein S15 from Geobacillus kaustophilus with biologically relevant dissociation constants. Furthermore, of the six sequences we characterize, four show regulatory activity in an Escherichia coli reporter assay. Subsequent footprinting and mutagenesis analysis indicates that protein binding proximal to regulatory features such as the Shine–Dalgarno sequence is sufficient to enable regulation, suggesting that regulation in response to S15 is relatively easily acquired. PMID:27580716
Novel green tissue-specific synthetic promoters and cis-regulatory elements in rice.
Wang, Rui; Zhu, Menglin; Ye, Rongjian; Liu, Zuoxiong; Zhou, Fei; Chen, Hao; Lin, Yongjun
2015-12-11
As an important part of synthetic biology, synthetic promoters have gradually become a hotspot in current biology. The purposes of the present study were to synthesize green tissue-specific promoters and to discover green tissue-specific cis-elements. We first assembled several regulatory sequences related to tissue-specific expression in different combinations, aiming to obtain novel green tissue-specific synthetic promoters. GUS assays of the transgenic plants indicated that 5 synthetic promoters showed green tissue-specific expression patterns and different expression efficiencies in various tissues. Subsequently, we scanned and counted the cis-elements in different tissue-specific promoters based on the plant cis-elements database PLACE and the rice cDNA microarray database CREP for green tissue-specific cis-element discovery, resulting in 10 potential cis-elements. The flanking sequence of one potential core element (GEAT) was predicted by bioinformatics. Then, the combination of GEAT and its flanking sequence was functionally identified using a synthetic promoter. GUS assays of the transgenic plants proved its green tissue-specificity. Furthermore, the function of the GEAT flanking sequence was analyzed in detail with site-directed mutagenesis. Our study provides an example for the synthesis of rice tissue-specific promoters and develops a feasible method for screening and functional identification of tissue-specific cis-elements with their flanking sequences at the genome-wide level in rice.
Effect of hot acid hydrolysis and hot chlorine dioxide stage on bleaching effluent biodegradability.
Gomes, C M; Colodette, J L; Delantonio, N R N; Mounteer, A H; Silva, C M
2007-01-01
The hot acid hydrolysis followed by chlorine dioxide (A/D*) and hot chlorine dioxide (D*) technologies have proven very useful for bleaching of eucalyptus kraft pulp. Although the characteristics and biodegradability of effluents from conventional chlorine dioxide bleaching are well known, such information is not yet available for effluents derived from hot acid hydrolysis and hot chlorine dioxide bleaching. This study discusses the characteristics and biodegradability of such effluents. Combined whole effluents from the complete sequences DEpD, D*EpD, A/D*EpD and ADEpD, and from the pre-bleaching sequences DEp, D*Ep, A/D*Ep and ADEp were characterized by quantifying their colour, AOX and organic load (BOD, COD, TOC). These effluents were also evaluated for their treatability by simulation of an activated sludge system. It was concluded that treatment in the laboratory sequencing batch reactor was efficient for removal of COD, BOD and TOC of all effluents. However, colour increased after biological treatment, with the greatest increase found for the effluent produced using the AD technology. Biological treatment was less efficient at removing AOX of effluents from the sequences with D*, A/D* and AD as the first stages, when compared to the reference D stage; there was evidence of the lower treatability of these organochlorine compounds from these sequences.
Mining SNPs from EST sequences using filters and ensemble classifiers.
Wang, J; Zou, Q; Guo, M Z
2010-05-04
Abundant single nucleotide polymorphisms (SNPs) provide the most complete information for genome-wide association studies. However, due to the bottleneck of manual discovery of putative SNPs and the inaccessibility of the original sequencing reads, it is essential to develop a more efficient and accurate computational method for automated SNP detection. We propose a novel computational method to rapidly find true SNPs in publicly available EST (expressed sequence tag) databases; this method is implemented as SNPDigger. EST sequences are clustered and aligned. SNP candidates are then obtained according to a measure of redundant frequency. Several new informative biological features, such as the structural neighbor profiles and the physical position of the SNP, were extracted from EST sequences, and the effectiveness of these features was demonstrated. An ensemble classifier, which employs a carefully selected feature set, was included for the imbalanced training data. The sensitivity and specificity of our method both exceeded 80% for human genetic data in the cross validation. Our method enables detection of SNPs from the user's own EST dataset and can be used on species for which there is no genome data. Our tests showed that this method can effectively guide SNP discovery in ESTs and can reduce the cost of biological analyses.
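The candidate-selection step, in which SNP candidates are obtained from aligned ESTs according to how often a variant recurs, can be sketched as below. This is not SNPDigger's code, and the abstract does not define its redundancy measure. The sketch assumes a simple rule: a column is a candidate when its second most frequent base is seen in at least `min_minor_count` reads. The function name `snp_candidates` and that parameter are hypothetical.

```python
from collections import Counter


def snp_candidates(aligned_reads, min_minor_count=2):
    """Return (column, major_base, minor_base, minor_count) for each alignment
    column whose second most frequent base occurs at least min_minor_count times."""
    ncols = len(aligned_reads[0])
    assert all(len(r) == ncols for r in aligned_reads), "reads must be aligned"
    candidates = []
    for col in range(ncols):
        # Ignore gaps and ambiguity codes; count only unambiguous bases.
        counts = Counter(r[col] for r in aligned_reads if r[col] in "ACGT")
        if len(counts) < 2:
            continue
        (major, _), (minor, minor_count) = counts.most_common(2)
        if minor_count >= min_minor_count:
            candidates.append((col, major, minor, minor_count))
    return candidates


cluster = ["ACGTAC", "ACGTAC", "ACCTAC", "ACCTAC", "ACGTAT"]
print(snp_candidates(cluster))
```

Requiring the minor base to recur filters out isolated sequencing errors, such as the single T in the last column of the example. Candidates that pass would then go to the feature extraction and ensemble classification described in the abstract.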
Epstein-Barr Virus Sequence Variation—Biology and Disease
Tzellos, Stelios; Farrell, Paul J.
2012-01-01
Some key questions in Epstein-Barr virus (EBV) biology center on whether naturally occurring sequence differences in the virus affect infection or EBV associated diseases. Understanding the pattern of EBV sequence variation is also important for possible development of EBV vaccines. At present EBV isolates worldwide can be grouped into Type 1 and Type 2, a classification based on the EBNA2 gene sequence. Type 1 EBV is the most prevalent worldwide but Type 2 is common in parts of Africa. Type 1 transforms human B cells into lymphoblastoid cell lines much more efficiently than Type 2 EBV. Molecular mechanisms that may account for this difference in cell transformation are now becoming clearer. Advances in sequencing technology will greatly increase the amount of whole EBV genome data for EBV isolated from different parts of the world. Study of regional variation of EBV strains independent of the Type 1/Type 2 classification and systematic investigation of the relationship between viral strains, infection and disease will become possible. The recent discovery that specific mutation of the EBV EBNA3B gene may be linked to development of diffuse large B cell lymphoma illustrates the importance that mutations in the virus genome may have in infection and human disease. PMID:25436768
Pruitt, Wendy M.; Robinson, Lucy C.
2008-01-01
Research based laboratory courses have been shown to stimulate student interest in science and to improve scientific skills. We describe here a project developed for a semester-long research-based laboratory course that accompanies a genetics lecture course. The project was designed to allow students to become familiar with the use of bioinformatics tools and molecular biology and genetic approaches while carrying out original research. Students were required to present their hypotheses, experiments, and results in a comprehensive lab report. The lab project concerned the yeast casein kinase 1 (CK1) protein kinase Yck2. CK1 protein kinases are present in all organisms and are well conserved in primary structure. These enzymes display sequence features that differ from other protein kinase subfamilies. Students identified such sequences within the CK1 subfamily, chose a sequence to analyze, used available structural data to determine possible functions for their sequences, and designed mutations within the sequences. After generating the mutant alleles, these were expressed in yeast and tested for function by using two growth assays. The student response to the project was positive, both in terms of knowledge and skills increases and interest in research, and several students are continuing the analysis of mutant alleles as summer projects. PMID:19047427
NASA Astrophysics Data System (ADS)
Streets, Aaron M.; Cao, Chen; Zhang, Xiannian; Huang, Yanyi
2016-03-01
Phenotype classification of single cells reveals biological variation that is masked in ensemble measurement. This heterogeneity is found in gene and protein expression as well as in cell morphology. Many techniques are available to probe phenotypic heterogeneity at the single cell level, for example quantitative imaging and single-cell RNA sequencing, but it is difficult to perform multiple assays on the same single cell. In order to directly track correlation between morphology and gene expression at the single cell level, we developed a microfluidic platform for quantitative coherent Raman imaging and immediate RNA sequencing (RNA-Seq) of single cells. With this device we actively sort and trap cells for analysis with stimulated Raman scattering microscopy (SRS). The cells are then processed in parallel pipelines for lysis, and preparation of cDNA for high-throughput transcriptome sequencing. SRS microscopy offers three-dimensional imaging with chemical specificity for quantitative analysis of protein and lipid distribution in single cells. Meanwhile, the microfluidic platform facilitates single-cell manipulation, minimizes contamination, and furthermore, provides improved RNA-Seq detection sensitivity and measurement precision, which is necessary for differentiating biological variability from technical noise. By combining coherent Raman microscopy with RNA sequencing, we can better understand the relationship between cellular morphology and gene expression at the single-cell level.
Harper, B; McClain, S; Ganko, E W
2012-08-01
Global regulatory agencies require bioinformatic sequence analysis as part of their safety evaluation for transgenic crops. Analysis typically focuses on encoded proteins and adjacent endogenous flanking sequences. Recently, regulatory expectations have expanded to include all reading frames of the inserted DNA. The intent is to provide biologically relevant results that can be used in the overall assessment of safety. This paper evaluates the relevance of assessing the allergenic potential of all DNA reading frames found in common food genes using methods considered for the analysis of T-DNA sequences used in transgenic crops. FASTA and BLASTX algorithms were used to compare genes from maize, rice, soybean, cucumber, melon, watermelon, and tomato using international regulatory guidance. Results show that BLASTX for maize yielded 7254 alignments that exceeded allergen similarity thresholds and 210,772 alignments that matched eight or more consecutive amino acids with an allergen; other crops produced similar results. This analysis suggests that each nontransgenic crop has a much greater potential for allergenic risk than what has been observed clinically. We demonstrate that a meaningful safety assessment is unlikely to be provided by using methods with inherently high frequencies of false positive alignments when broadly applied to all reading frames of DNA sequence. Copyright © 2012 Elsevier Inc. All rights reserved.
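The criterion of matching eight or more consecutive amino acids with an allergen amounts to an exact shared-substring test, which can be sketched as follows. This is an illustration of the criterion only, not the FASTA or BLASTX workflow used in the study. The function name `shared_kmers` and the example sequences are made up for the demonstration.

```python
def shared_kmers(query, allergen, k=8):
    """Return (position, peptide) for every window of k consecutive residues
    in query that also occurs exactly in allergen."""
    allergen_kmers = {allergen[i:i + k] for i in range(len(allergen) - k + 1)}
    return [
        (i, query[i:i + k])
        for i in range(len(query) - k + 1)
        if query[i:i + k] in allergen_kmers
    ]


# Hypothetical sequences, chosen so that exactly one 8-residue window is shared.
allergen = "MKVLAAGIVGLSAQ"
query = "TTVLAAGIVGPP"
print(shared_kmers(query, allergen))
```

Because the test is applied to every window of every translated reading frame against every known allergen, the number of comparisons grows quickly, which is consistent with the large numbers of chance matches the abstract reports.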
PlantRNA, a database for tRNAs of photosynthetic eukaryotes.
Cognat, Valérie; Pawlak, Gaël; Duchêne, Anne-Marie; Daujat, Magali; Gigant, Anaïs; Salinas, Thalia; Michaud, Morgane; Gutmann, Bernard; Giegé, Philippe; Gobert, Anthony; Maréchal-Drouard, Laurence
2013-01-01
PlantRNA database (http://plantrna.ibmp.cnrs.fr/) compiles transfer RNA (tRNA) gene sequences retrieved from fully annotated plant nuclear, plastidial and mitochondrial genomes. The set of annotated tRNA gene sequences has been manually curated for maximum quality and confidence. The novelty of this database resides in the inclusion of biological information relevant to the function of all the tRNAs entered in the library. This includes 5'- and 3'-flanking sequences, A and B box sequences, region of transcription initiation and poly(T) transcription termination stretches, tRNA intron sequences, aminoacyl-tRNA synthetases and enzymes responsible for tRNA maturation and modification. Finally, data on mitochondrial import of nuclear-encoded tRNAs as well as the bibliome for the respective tRNAs and tRNA-binding proteins are also included. The current annotation concerns complete genomes from 11 organisms: five flowering plants (Arabidopsis thaliana, Oryza sativa, Populus trichocarpa, Medicago truncatula and Brachypodium distachyon), a moss (Physcomitrella patens), two green algae (Chlamydomonas reinhardtii and Ostreococcus tauri), one glaucophyte (Cyanophora paradoxa), one brown alga (Ectocarpus siliculosus) and a pennate diatom (Phaeodactylum tricornutum). The database will be regularly updated and implemented with new plant genome annotations so as to provide extensive information on tRNA biology to the research community.
The Alveolate Perkinsus marinus: Biological Insights from EST Gene Discovery
2010-01-01
Background Perkinsus marinus, a protozoan parasite of the eastern oyster Crassostrea virginica, has devastated natural and farmed oyster populations along the Atlantic and Gulf coasts of the United States. It is classified as a member of the Perkinsozoa, a recently established phylum considered close to the ancestor of ciliates, dinoflagellates, and apicomplexans, and a key taxon for understanding unique adaptations (e.g. parasitism) within the Alveolata. Despite intense parasite pressure, no disease-resistant oysters have been identified and no effective therapies have been developed to date. Results To gain insight into the biological basis of the parasite's virulence and pathogenesis mechanisms, and to identify genes encoding potential targets for intervention, we generated >31,000 5' expressed sequence tags (ESTs) derived from four trophozoite libraries generated from two P. marinus strains. Trimming and clustering of the sequence tags yielded 7,863 unique sequences, some of which carry a spliced leader. Similarity searches revealed that 55% of these had hits in protein sequence databases, of which 1,729 had their best hit with proteins from the chromalveolates (E-value ≤ 1e-5). Some sequences are similar to those proven to be targets for effective intervention in other protozoan parasites, and include not only proteases, antioxidant enzymes, and heat shock proteins, but also those associated with relict plastids, such as acetyl-CoA carboxylase and methylerythritol phosphate pathway components, and those involved in glycan assembly, protein folding/secretion, and parasite-host interactions. Conclusions Our transcriptome analysis of P. marinus, the first for any member of the Perkinsozoa, contributes new insight into its biology and taxonomic position.
It provides a very informative, albeit preliminary, glimpse into the expression of genes encoding functionally relevant proteins as potential targets for chemotherapy, and evidence for the presence of a relict plastid. Further, although P. marinus sequences display significant similarity to those from both apicomplexans and dinoflagellates, the presence of trans-spliced transcripts confirms the previously established affinities with the latter. The EST analysis reported herein, together with the recently completed sequence of the P. marinus genome and the development of transfection methodology, should result in improved intervention strategies against dermo disease. PMID:20374649
SS-Wrapper: a package of wrapper applications for similarity searches on Linux clusters
Wang, Chunlin; Lefkowitz, Elliot J
2004-01-01
Background Large-scale sequence comparison is a powerful tool for biological inference in modern molecular biology. Comparing new sequences to those in annotated databases is a useful source of functional and structural information about these sequences. Using software such as the basic local alignment search tool (BLAST) or HMMPFAM to identify statistically significant matches between newly sequenced segments of genetic material and those in databases is an important task for most molecular biologists. Searching algorithms are intrinsically slow and data-intensive, especially in light of the rapid growth of biological sequence databases due to the emergence of high throughput DNA sequencing techniques. Thus, traditional bioinformatics tools are impractical on PCs and even on dedicated UNIX servers. To take advantage of larger databases and more reliable methods, high performance computation becomes necessary. Results We describe the implementation of SS-Wrapper (Similarity Search Wrapper), a package of wrapper applications that can parallelize similarity search applications on a Linux cluster. Our wrapper utilizes a query segmentation-search (QS-search) approach to parallelize sequence database search applications. It takes into consideration load balancing between each node on the cluster to maximize resource usage. QS-search is designed to wrap many different search tools, such as BLAST and HMMPFAM using the same interface. This implementation does not alter the original program, so newly obtained programs and program updates should be accommodated easily. Benchmark experiments using QS-search to optimize BLAST and HMMPFAM showed that QS-search accelerated the performance of these programs almost linearly in proportion to the number of CPUs used. 
We have also implemented a wrapper that utilizes a database segmentation approach (DS-BLAST) that provides a complementary solution for BLAST searches when the database is too large to fit into the memory of a single node. Conclusions Used together, QS-search and DS-BLAST provide a flexible solution to adapt sequential similarity searching applications in high performance computing environments. Their ease of use and their ability to wrap a variety of database search programs provide an analytical architecture to assist both the seasoned bioinformaticist and the wet-bench biologist. PMID:15511296
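The query segmentation idea with load balancing can be sketched as a scheduling problem: divide the query sequences among cluster nodes so that each node receives a similar amount of work. The sketch below is not SS-Wrapper's implementation, whose balancing strategy the abstract does not detail. It assumes that query length approximates search cost and uses a greedy longest-first assignment. The function name `segment_queries` and the query names are hypothetical.

```python
import heapq


def segment_queries(query_lengths, n_nodes):
    """Assign queries to nodes greedily: take queries longest first and give
    each one to the node with the smallest total assigned length so far."""
    heap = [(0, node) for node in range(n_nodes)]  # (current load, node id)
    heapq.heapify(heap)
    assignment = {node: [] for node in range(n_nodes)}
    ordered = sorted(query_lengths.items(), key=lambda kv: (-kv[1], kv[0]))
    for name, length in ordered:
        load, node = heapq.heappop(heap)
        assignment[node].append(name)
        heapq.heappush(heap, (load + length, node))
    return assignment


queries = {"q1": 900, "q2": 500, "q3": 400, "q4": 300, "q5": 100}
print(segment_queries(queries, n_nodes=2))
```

Each node would then run the unmodified search program on its own query subset against the full database, and the per-node outputs would be concatenated. This is why the approach leaves the wrapped program untouched.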
Currin, Andrew; Swainston, Neil; Day, Philip J.
2015-01-01
The amino acid sequence of a protein affects both its structure and its function. Thus, the ability to modify the sequence, and hence the structure and activity, of individual proteins in a systematic way, opens up many opportunities, both scientifically and (as we focus on here) for exploitation in biocatalysis. Modern methods of synthetic biology, whereby increasingly large sequences of DNA can be synthesised de novo, allow an unprecedented ability to engineer proteins with novel functions. However, the number of possible proteins is far too large to test individually, so we need means for navigating the ‘search space’ of possible protein sequences efficiently and reliably in order to find desirable activities and other properties. Enzymologists distinguish binding (Kd) and catalytic (kcat) steps. In a similar way, judicious strategies have blended design (for binding, specificity and active site modelling) with the more empirical methods of classical directed evolution (DE) for improving kcat (where natural evolution rarely seeks the highest values), especially with regard to residues distant from the active site and where the functional linkages underpinning enzyme dynamics are both unknown and hard to predict. Epistasis (where the ‘best’ amino acid at one site depends on that or those at others) is a notable feature of directed evolution. The aim of this review is to highlight some of the approaches that are being developed to allow us to use directed evolution to improve enzyme properties, often dramatically. We note that directed evolution differs in a number of ways from natural evolution, including in particular the available mechanisms and the likely selection pressures. Thus, we stress the opportunities afforded by techniques that enable one to map sequence to (structure and) activity in silico, as an effective means of modelling and exploring protein landscapes.
Because known landscapes may be assessed and reasoned about as a whole, simultaneously, this offers opportunities for protein improvement not readily available to natural evolution on rapid timescales. Intelligent landscape navigation, informed by sequence-activity relationships and coupled to the emerging methods of synthetic biology, offers scope for the development of novel biocatalysts that are both highly active and robust. PMID:25503938
On avoided words, absent words, and their application to biological sequence analysis.
Almirantis, Yannis; Charalampopoulos, Panagiotis; Gao, Jia; Iliopoulos, Costas S; Mohamed, Manal; Pissis, Solon P; Polychronopoulos, Dimitris
2017-01-01
The deviation of the observed frequency of a word w from its expected frequency in a given sequence x is used to determine whether or not the word is avoided. This concept is particularly useful in DNA linguistic analysis. The value of the deviation of w, denoted by [Formula: see text], effectively characterises the extent of a word by its edge contrast in the context in which it occurs. A word w of length [Formula: see text] is a [Formula: see text]-avoided word in x if [Formula: see text], for a given threshold [Formula: see text]. Notice that such a word may be completely absent from x. Hence, computing all such words naïvely can be a very time-consuming procedure, in particular for large k. In this article, we propose an [Formula: see text]-time and [Formula: see text]-space algorithm to compute all [Formula: see text]-avoided words of length k in a given sequence of length n over a fixed-sized alphabet. We also present a time-optimal [Formula: see text]-time algorithm to compute all [Formula: see text]-avoided words (of any length) in a sequence of length n over an integer alphabet of size [Formula: see text]. In addition, we provide a tight asymptotic upper bound for the number of [Formula: see text]-avoided words over an integer alphabet and the expected length of the longest one. We make available an implementation of our algorithm. Experimental results, using both real and synthetic data, show the efficiency and applicability of our implementation in biological sequence analysis. The systematic search for avoided words is particularly useful for biological sequence analysis. We present a linear-time and linear-space algorithm for the computation of avoided words of length k in a given sequence x. We suggest a modification to this algorithm so that it computes all avoided words of x, irrespective of their length, within the same time complexity. We also present combinatorial results with regards to avoided words and absent words.
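The idea of an avoided word can be illustrated with a naive sketch. The abstract elides its formulas, so the code below assumes a commonly used definition: the expected count of w is f(prefix) * f(suffix) / f(infix), and the deviation is (observed - expected) / max(sqrt(expected), 1). The function names, the default alphabet and the brute-force enumeration are illustrative assumptions; this is not the linear-time algorithm the abstract describes.

```python
from itertools import product
from math import sqrt


def occurrences(u, x):
    """Count (possibly overlapping) occurrences of u in x."""
    return sum(1 for i in range(len(x) - len(u) + 1) if x[i:i + len(u)] == u)


def deviation(w, x):
    """Deviation of the observed count of w from its expected count.

    Assumed model: expected = f(w[:-1]) * f(w[1:]) / f(w[1:-1]),
    taken as 0 when the infix w[1:-1] does not occur in x.
    """
    if len(w) < 3:
        raise ValueError("this sketch needs words of length at least 3")
    infix = occurrences(w[1:-1], x)
    expected = occurrences(w[:-1], x) * occurrences(w[1:], x) / infix if infix else 0.0
    return (occurrences(w, x) - expected) / max(sqrt(expected), 1.0)


def avoided_words(x, k, rho, alphabet="ACGT"):
    """Brute force: every length-k word whose deviation is at most rho."""
    words = ("".join(t) for t in product(alphabet, repeat=k))
    return sorted(w for w in words if deviation(w, x) <= rho)
```

Because the sketch enumerates every word of length k, its cost grows exponentially with k, which is the inefficiency the abstract's algorithm is designed to avoid.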
Wang, Dongping; Ries, Tessa R.; Pierson, Leland S.; Pierson, Elizabeth A.
2018-01-01
Phenazines are bacterial secondary metabolites and play important roles in the antagonistic activity of the biological control strain P. chlororaphis 30–84 against take-all disease of wheat. The expression of the P. chlororaphis 30–84 phenazine biosynthetic operon (phzXYFABCD) is dependent on the PhzR/PhzI quorum sensing system located immediately upstream of the biosynthetic operon as well as other regulatory systems including Gac/Rsm. Bioinformatic analysis of the sequence between the divergently oriented phzR and phzX promoters identified features within the 5’-untranslated region (5’-UTR) of phzX that are conserved only among 2OHPCA producing Pseudomonas. The conserved sequence features are potentially capable of producing secondary structures that negatively modulate one or both promoters. Transcriptional and translational fusion assays revealed that deleting a 90-bp sequence in the 5’-UTR of phzX led to up to 4-fold greater expression of the reporters carrying the deletion compared to the controls, which indicated that this sequence negatively modulates phenazine gene expression both transcriptionally and translationally. This 90-bp sequence was deleted from the P. chlororaphis 30–84 chromosome, resulting in 30-84Enh, which produces significantly more phenazine than the wild-type while retaining quorum sensing control. The transcriptional expression of phzR/phzI and the amount of AHL signal produced by 30-84Enh were also significantly greater than for the wild-type, suggesting that this 90-bp sequence also negatively affects expression of the quorum sensing genes. In addition, deletion of the 90-bp sequence partially relieved RsmE-mediated translational repression, indicating a role for Gac/RsmE interaction. 
Compared to the wild-type, enhanced phenazine production by 30-84Enh resulted in improvement in fungal inhibition, biofilm formation, extracellular DNA release and suppression of take-all disease of wheat in soil without negative consequences on growth or rhizosphere persistence. This work provides greater insight into the regulation of phenazine biosynthesis with potential applications for improved biological control. PMID:29451920
A Linked Series of Laboratory Exercises in Molecular Biology Utilizing Bioinformatics and GFP
ERIC Educational Resources Information Center
Medin, Carey L.; Nolin, Katie L.
2011-01-01
Molecular biologists commonly use bioinformatics to map and analyze DNA and protein sequences and to align different DNA and protein sequences for comparison. Additionally, biologists can create and view 3D models of protein structures to further understand intramolecular interactions. The primary goal of this 10-week laboratory was to introduce…
Xu, Guogang; Hu, Juan; Fang, Xiangqun; Zhang, Xuelin; Wang, Junfeng; Guo, Yinghua; Li, Tianzhi; Chen, Zhenghong; Dai, Wenkui; Liu, Changting
2014-03-13
To explore the changes of Pseudomonas aeruginosa in space flight, we present the draft genome sequence of P. aeruginosa strain LCT-PA220, which originated from a P. aeruginosa strain, ATCC 27853, that traveled on the Shenzhou-VIII spacecraft.
ERIC Educational Resources Information Center
American Association of Physics Teachers (NJ1), 2009
2009-01-01
Physics First represents an organizational alternative to the traditional high school science sequence. It calls for a re-sequencing of high school courses so that students study physics before chemistry and biology. The purpose of this pamphlet is to provide: (1) Basic information and rationale for the Physics First curriculum; (2) Strategies for…
Kelly Ivors; Matteo Garbelotto; Ineke De Vries; Peter Bonants
2006-01-01
Investigating the population genetics of Phytophthora ramorum, the causal agent of sudden oak death (SOD), is critical to understanding the biology and epidemiology of this important phytopathogen. Raw sequence data (445,000 reads) of P. ramorum was provided by the Joint Genome Institute. Our objective was to develop and utilize...
ERIC Educational Resources Information Center
Griffin, Vernetta; McMiller, Tracee; Jones, Erika; Johnson, Casonya M.
2003-01-01
A 14-week, undergraduate-level Genetics and Population Biology course at Morgan State University was modified to include a demonstration of functional genomics in the research laboratory. Students performed a rudimentary sequence analysis of the "Caenorhabditis elegans" genome and further characterized three sequences that were predicted to encode…
The Numbers Speak: Physics First Supports Math Performance
ERIC Educational Resources Information Center
Glasser, Howard M.
2012-01-01
More schools in the United States have begun teaching physics to ninth-graders, but there continues to be limited evidence that such a change benefits students. Many arguments in favor of Physics First and the inverted sequence of physics-chemistry-biology are based more on the intellectual logic of the sequence than on measured outcomes. Paul…
Cloning of the poly(ADP-ribose) Gene from Rat Liver.
1986-09-24
Levinson, Ph.D. (Cetus Corp., Berkeley). 5. Amino acid analysis done in UCSF Bioanal. Lab. ... Proteolytic degradation, isolation of peptide and amino acid sequences ... technique developed for enzyme quantitation in biological materials. The amino acid sequence of the enzyme has so far been determined because the amino
New developments in ancient genomics.
Millar, Craig D; Huynen, Leon; Subramanian, Sankar; Mohandesan, Elmira; Lambert, David M
2008-07-01
Ancient DNA research is on the crest of a 'third wave' of progress due to the introduction of a new generation of DNA sequencing technologies. Here we review the advantages and disadvantages of the four new DNA sequencers that are becoming available to researchers. These machines now allow the recovery of orders of magnitude more DNA sequence data, albeit as short sequence reads. Hence, the potential reassembly of complete ancient genomes seems imminent, and when used to screen libraries of ancient sequences, these methods are cost effective. This new wealth of data is also likely to herald investigations into the functional properties of extinct genes and gene complexes and will improve our understanding of the biological basis of extinct phenotypes.
Nawrocki, Eric P.; Burge, Sarah W.
2013-01-01
The development of RNA bioinformatic tools began more than 30 years ago with the description of the Nussinov and Zuker dynamic programming algorithms for single sequence RNA secondary structure prediction. Since then, many tools have been developed for various RNA sequence analysis problems such as homology search, multiple sequence alignment, de novo RNA discovery, read-mapping, and many more. In this issue, we have collected a sampling of reviews and original research that demonstrate some of the many ways bioinformatics is integrated with current RNA biology research. PMID:23948768
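For readers unfamiliar with the Nussinov approach named above, a minimal sketch of base-pair maximisation by dynamic programming might look as follows. The function name, the allowed pair set (Watson-Crick plus G-U wobble) and the default minimum hairpin loop of three unpaired bases are assumptions made for illustration; the sketch returns only the pair count and does not model the energy terms used by Zuker-style methods.

```python
def nussinov_max_pairs(seq, min_loop=3):
    """Maximum number of nested base pairs in an RNA sequence.

    Assumptions for this sketch: Watson-Crick and G-U wobble pairs are
    allowed, and paired positions must enclose at least `min_loop`
    unpaired bases.
    """
    pairs = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(seq)
    if n == 0:
        return 0
    # dp[i][j] holds the best pair count for the subsequence seq[i..j].
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]  # case 1: position j stays unpaired
            # case 2: position j pairs with some k, splitting the interval
            for k in range(i, j - min_loop):
                if (seq[k], seq[j]) in pairs:
                    left = dp[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + dp[k + 1][j - 1])
            dp[i][j] = best
    return dp[0][n - 1]
```

The triple loop gives cubic running time in the sequence length, which is the usual cost quoted for this recurrence.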
Valenzuela, Nicole
2009-07-01
Painted turtles (Chrysemys picta) are representatives of a vertebrate clade whose biology and phylogenetic position hold a key to our understanding of fundamental aspects of vertebrate evolution. These features make them an ideal emerging model system. Extensive ecological and physiological research provides the context in which to place new research advances in evolutionary genetics, genomics, evolutionary developmental biology, and ecological developmental biology which are enabled by current resources, such as a bacterial artificial chromosome (BAC) library of C. picta, and the imminent development of additional ones such as genome sequences and cDNA and expressed sequence tag (EST) libraries. This integrative approach will allow the research community to continue making advances to provide functional and evolutionary explanations for the lability of biological traits found not only among reptiles but vertebrates in general. Moreover, because humans and reptiles share a common ancestor, and given the ease of using nonplacental vertebrates in experimental biology compared with mammalian embryos, painted turtles are also an emerging model system for biomedical research. For example, painted turtles have been studied to understand many biological responses to overwintering and anoxia, as potential sentinels for environmental xenobiotics, and as a model to decipher the ecology and evolution of sexual development and reproduction. Thus, painted turtles are an excellent reptilian model system for studies with human health, environmental, ecological, and evolutionary significance.
NCBI-compliant genome submissions: tips and tricks to save time and money.
Pirovano, Walter; Boetzer, Marten; Derks, Martijn F L; Smit, Sandra
2017-03-01
Genome sequences nowadays play a central role in molecular biology and bioinformatics. These sequences are shared with the scientific community through sequence databases. The sequence repositories of the International Nucleotide Sequence Database Collaboration (INSDC, comprising GenBank, ENA and DDBJ) are the largest in the world. Preparing an annotated sequence in such a way that it will be accepted by the database is challenging because many validation criteria apply. In our opinion, it is an undesirable situation that researchers who want to submit their sequence need either a lot of experience or help from partners to get the job done. To save valuable time and money, we list a number of recommendations for people who want to submit an annotated genome to a sequence database, as well as for tool developers, who could help to ease the process. © The Author 2015. Published by Oxford University Press. For Permissions, please email: journals.permissions@oup.com.
Solid phase sequencing of biopolymers
Cantor, Charles; Koster, Hubert
2010-09-28
This invention relates to methods for detecting and sequencing target nucleic acid sequences, to mass modified nucleic acid probes and arrays of probes useful in these methods, and to kits and systems which contain these probes. Useful methods involve hybridizing the nucleic acids or nucleic acids which represent complementary or homologous sequences of the target to an array of nucleic acid probes. These probes comprise a single-stranded portion, an optional double-stranded portion and a variable sequence within the single-stranded portion. The molecular weights of the hybridized nucleic acids of the set can be determined by mass spectroscopy, and the sequence of the target determined from the molecular weights of the fragments. Nucleic acids whose sequences can be determined include DNA or RNA in biological samples such as patient biopsies and environmental samples. Probes may be fixed to a solid support such as a hybridization chip to facilitate automated molecular weight analysis and identification of the target sequence.
ASGARD: an open-access database of annotated transcriptomes for emerging model arthropod species.
Zeng, Victor; Extavour, Cassandra G
2012-01-01
The increased throughput and decreased cost of next-generation sequencing (NGS) have shifted the bottleneck in genomic research from sequencing to annotation, analysis and accessibility. This is particularly challenging for research communities working on organisms that lack the basic infrastructure of a sequenced genome, or an efficient way to utilize whatever sequence data may be available. Here we present a new database, the Assembled Searchable Giant Arthropod Read Database (ASGARD). This database is a repository and search engine for transcriptomic data from arthropods that are of high interest to multiple research communities but currently lack sequenced genomes. We demonstrate the functionality and utility of ASGARD using de novo assembled transcriptomes from the milkweed bug Oncopeltus fasciatus, the cricket Gryllus bimaculatus and the amphipod crustacean Parhyale hawaiensis. We have annotated these transcriptomes to assign putative orthology, coding region determination, protein domain identification and Gene Ontology (GO) term annotation to all possible assembly products. ASGARD allows users to search all assemblies by orthology annotation, GO term annotation or Basic Local Alignment Search Tool. User-friendly features of ASGARD include search term auto-completion suggestions based on database content, the ability to download assembly product sequences in FASTA format, direct links to NCBI data for predicted orthologs and graphical representation of the location of protein domains and matches to similar sequences from the NCBI non-redundant database. ASGARD will be a useful repository for transcriptome data from future NGS studies on these and other emerging model arthropods, regardless of sequencing platform, assembly or annotation status. This database thus provides easy, one-stop access to multi-species annotated transcriptome information. 
We anticipate that this database will be useful for members of multiple research communities, including developmental biology, physiology, evolutionary biology, ecology, comparative genomics and phylogenomics. Database URL: asgard.rc.fas.harvard.edu.
Bertelli, Claire; Aeby, Sébastien; Chassot, Bérénice; Clulow, James; Hilfiker, Olivier; Rappo, Samuel; Ritzmann, Sébastien; Schumacher, Paolo; Terrettaz, Céline; Benaglio, Paola; Falquet, Laurent; Farinelli, Laurent; Gharib, Walid H; Goesmann, Alexander; Harshman, Keith; Linke, Burkhard; Miyazaki, Ryo; Rivolta, Carlo; Robinson-Rechavi, Marc; van der Meer, Jan Roelof; Greub, Gilbert
2015-01-01
With the widespread availability of high-throughput sequencing technologies, sequencing projects have become pervasive in the molecular life sciences. The huge bulk of data generated daily must be analyzed further by biologists with skills in bioinformatics and by "embedded bioinformaticians," i.e., bioinformaticians integrated in wet lab research groups. Thus, students interested in molecular life sciences must be trained in the main steps of genomics: sequencing, assembly, annotation and analysis. To reach that goal, a practical course has been set up for master students at the University of Lausanne: the "Sequence a genome" class. At the beginning of the academic year, a few bacterial species whose genome is unknown are provided to the students, who sequence and assemble the genome(s) and perform manual annotation. Here, we report the progress of the first class from September 2010 to June 2011 and the results obtained by seven master students who specifically assembled and annotated the genome of Estrella lausannensis, an obligate intracellular bacterium related to Chlamydia. The draft genome of Estrella is composed of 29 scaffolds encompassing 2,819,825 bp that encode for 2233 putative proteins. Estrella also possesses a 9136 bp plasmid that encodes for 14 genes, among which we found an integrase and a toxin/antitoxin module. Like all other members of the Chlamydiales order, Estrella possesses a highly conserved type III secretion system, considered as a key virulence factor. The annotation of the Estrella genome also allowed the characterization of the metabolic abilities of this strictly intracellular bacterium. Altogether, the students provided the scientific community with the Estrella genome sequence and a preliminary understanding of the biology of this recently-discovered bacterial genus, while learning to use cutting-edge technologies for sequencing and to perform bioinformatics analyses.
Shrimankar, D D; Sathe, S R
2016-01-01
Sequence alignment is an important tool for describing the relationships between DNA sequences. Many sequence alignment algorithms exist, differing in efficiency, in their models of the sequences, and in the relationship between sequences. The focus of this study is to obtain an optimal alignment between two sequences of biological data, particularly DNA sequences. The algorithm is discussed with particular emphasis on time, speedup, and efficiency optimizations. Parallel programming presents a number of critical challenges to application developers. Today's supercomputers often consist of clusters of SMP nodes. Programming paradigms such as OpenMP and MPI are used to write parallel codes for such architectures. OpenMP programs cannot scale beyond a single SMP node, whereas programs written in MPI can span multiple SMP nodes, at the cost of internode communication overhead. In this work, we explore the tradeoffs between using OpenMP and MPI. We demonstrate that communication overhead is significant even in OpenMP loop execution and increases with the number of participating cores. We also present a communication model to approximate the overhead from communication in OpenMP loops. Our results are of interest for a large variety of input data files. We have developed our own load balancing and cache optimization techniques for the message passing model. Our experimental results show that these techniques give the best performance of our parallel algorithm for various values of input parameters, such as sequence size and tile size, on a wide variety of multicore architectures.
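To make the parallelisation setting concrete, the sketch below computes a global alignment score by filling the dynamic programming table one anti-diagonal at a time. This is a generic Needleman-Wunsch style recurrence with assumed scoring values (match 1, mismatch -1, gap -2), not the authors' algorithm or tiling scheme. Cells on the same anti-diagonal depend only on earlier anti-diagonals, which is the independence that OpenMP or MPI implementations typically exploit.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score of a and b, filled in wavefront order.

    The scoring values are illustrative assumptions. The inner loop
    visits the cells of one anti-diagonal; those cells are mutually
    independent and could be computed in parallel.
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap  # prefix of a aligned against gaps only
    for j in range(1, m + 1):
        score[0][j] = j * gap  # prefix of b aligned against gaps only
    for d in range(2, n + m + 1):  # d = i + j indexes the anti-diagonal
        for i in range(max(1, d - m), min(n, d - 1) + 1):
            j = d - i
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
    return score[n][m]
```

In a tiled variant, each cell of this loop would stand for a block of the table rather than a single entry, which is where a parameter such as tile size would enter.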
Diverse nucleotide compositions and sequence fluctuation in Rubisco protein genes
NASA Astrophysics Data System (ADS)
Holden, Todd; Dehipawala, S.; Cheung, E.; Bienaime, R.; Ye, J.; Tremberger, G., Jr.; Schneider, P.; Lieberman, D.; Cheung, T.
2011-10-01
The Rubisco protein-enzyme is arguably the most abundant protein on Earth. The biology dogma of transcription and translation necessitates the study of the Rubisco genes and Rubisco-like genes in various species. A stronger correlation of the fractal dimension of the atomic number fluctuation along a DNA sequence with Shannon entropy has been observed in the studied Rubisco-like gene sequences, suggesting more diverse evolutionary pressures and constraints in the Rubisco sequences. The strategy of using metal for structural stabilization appears to be an ancient mechanism, with data from the porphobilinogen deaminase gene in Capsaspora owczarzaki and Monosiga brevicollis. Using the chi-square distance probability, our analysis supports the conjecture that the more ancient Rubisco-like sequence in Microcystis aeruginosa would have experienced very different evolutionary pressure and biochemical constraints compared to Bordetella bronchiseptica, the two microbes occupying either end of the correlation graph. Our exploratory study would indicate that a high fractal dimension Rubisco sequence would support a high carbon dioxide rate via the Michaelis-Menten coefficient, with implications for the control of the whooping cough pathogen Bordetella bronchiseptica, a microbe containing a high fractal dimension Rubisco-like sequence (2.07). Using the internal comparison of chi-square distance probability for 16S rRNA (~ E-22) versus the radiation repair Rec-A gene (~ E-05) in the high GC content Deinococcus radiodurans, our analysis supports the conjecture that high GC content microbes containing a Rubisco-like sequence are likely to include an extra-terrestrial origin, relative to Deinococcus radiodurans. A similar photosynthesis process that could utilize host star radiation would not compete with radiation-resistant processes from the biology dogma perspective in environments such as Mars and exoplanets.
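The abstract does not state how its Shannon entropy was computed. As a rough illustration only, the sketch below assumes the simplest reading, the entropy in bits of the single-symbol composition of a sequence; the function name is hypothetical and the authors' actual procedure may differ.

```python
from collections import Counter
from math import log2


def shannon_entropy(seq):
    """Shannon entropy (bits per symbol) of the symbol composition of seq.

    Assumption: entropy is taken over single-symbol frequencies, which
    may not match the measure used in the abstract above.
    """
    if not seq:
        raise ValueError("sequence must be non-empty")
    total = len(seq)
    counts = Counter(seq)
    return -sum((c / total) * log2(c / total) for c in counts.values())
```

Under this reading a DNA sequence with equal proportions of A, C, G and T reaches the maximum of 2 bits per nucleotide, and a homopolymer scores 0.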