Science.gov

Sample records for accurate functional annotation

  1. Woods: A fast and accurate functional annotator and classifier of genomic and metagenomic sequences.

    PubMed

    Sharma, Ashok K; Gupta, Ankit; Kumar, Sanjiv; Dhakan, Darshan B; Sharma, Vineet K

    2015-07-01

    Functional annotation of the gigantic metagenomic data is one of the major time-consuming and computationally demanding tasks, which is currently a bottleneck for the efficient analysis. The commonly used homology-based methods to functionally annotate and classify proteins are extremely slow. Therefore, to achieve faster and accurate functional annotation, we have developed an orthology-based functional classifier 'Woods' by using a combination of machine learning and similarity-based approaches. Woods displayed a precision of 98.79% on independent genomic dataset, 96.66% on simulated metagenomic dataset and >97% on two real metagenomic datasets. In addition, it performed >87 times faster than BLAST on the two real metagenomic datasets. Woods can be used as a highly efficient and accurate classifier with high-throughput capability which facilitates its usability on large metagenomic datasets. PMID:25863333

  2. Accurate protein structure annotation through competitive diffusion of enzymatic functions over a network of local evolutionary similarities.

    PubMed

    Venner, Eric; Lisewski, Andreas Martin; Erdin, Serkan; Ward, R Matthew; Amin, Shivas R; Lichtarge, Olivier

    2010-01-01

    High-throughput Structural Genomics yields many new protein structures without known molecular function. This study aims to uncover these missing annotations by globally comparing select functional residues across the structural proteome. First, Evolutionary Trace Annotation, or ETA, identifies which proteins have local evolutionary and structural features in common; next, these proteins are linked together into a proteomic network of ETA similarities; then, starting from proteins with known functions, competing functional labels diffuse link-by-link over the entire network. Every node is thus assigned a likelihood z-score for every function, and the most significant one at each node wins and defines its annotation. In high-throughput controls, this competitive diffusion process recovered enzyme activity annotations with 99% and 97% accuracy at half-coverage for the third and fourth Enzyme Commission (EC) levels, respectively. This corresponds to false positive rates 4-fold lower than nearest-neighbor and 5-fold lower than sequence-based annotations. In practice, experimental validation of the predicted carboxylesterase activity in a protein from Staphylococcus aureus illustrated the effectiveness of this approach in the context of an increasingly drug-resistant microbe. This study further links molecular function to a small number of evolutionarily important residues recognizable by Evolutionary Tracing and it points to the specificity and sensitivity of functional annotation by competitive global network diffusion. A web server is at http://mammoth.bcm.tmc.edu/networks.

  3. Algal functional annotation tool

    SciTech Connect

    2012-07-12

    Abstract BACKGROUND: Progress in genome sequencing is proceeding at an exponential pace, and several new algal genomes are becoming available every year. One of the challenges facing the community is the association of protein sequences encoded in the genomes with biological function. While most genome assembly projects generate annotations for predicted protein sequences, they are usually limited and integrate functional terms from a limited number of databases. Another challenge is the use of annotations to interpret large lists of 'interesting' genes generated by genome-scale datasets. Previously, these gene lists had to be analyzed across several independent biological databases, often on a gene-by-gene basis. In contrast, several annotation databases, such as DAVID, integrate data from multiple functional databases and reveal underlying biological themes of large gene lists. While several such databases have been constructed for animals, none is currently available for the study of algae. Due to renewed interest in algae as potential sources of biofuels and the emergence of multiple algal genome sequences, a significant need has arisen for such a database to process the growing compendiums of algal genomic data. DESCRIPTION: The Algal Functional Annotation Tool is a web-based comprehensive analysis suite integrating annotation data from several pathway, ontology, and protein family databases. The current version provides annotation for the model alga Chlamydomonas reinhardtii, and in the future will include additional genomes. The site allows users to interpret large gene lists by identifying associated functional terms, and their enrichment. Additionally, expression data for several experimental conditions were compiled and analyzed to provide an expression-based enrichment search. A tool to search for functionally-related genes based on gene expression across these conditions is also provided. Other features include dynamic visualization of genes on KEGG

  4. Algal functional annotation tool

    2012-07-12

    Abstract BACKGROUND: Progress in genome sequencing is proceeding at an exponential pace, and several new algal genomes are becoming available every year. One of the challenges facing the community is the association of protein sequences encoded in the genomes with biological function. While most genome assembly projects generate annotations for predicted protein sequences, they are usually limited and integrate functional terms from a limited number of databases. Another challenge is the use of annotations tomore » interpret large lists of 'interesting' genes generated by genome-scale datasets. Previously, these gene lists had to be analyzed across several independent biological databases, often on a gene-by-gene basis. In contrast, several annotation databases, such as DAVID, integrate data from multiple functional databases and reveal underlying biological themes of large gene lists. While several such databases have been constructed for animals, none is currently available for the study of algae. Due to renewed interest in algae as potential sources of biofuels and the emergence of multiple algal genome sequences, a significant need has arisen for such a database to process the growing compendiums of algal genomic data. DESCRIPTION: The Algal Functional Annotation Tool is a web-based comprehensive analysis suite integrating annotation data from several pathway, ontology, and protein family databases. The current version provides annotation for the model alga Chlamydomonas reinhardtii, and in the future will include additional genomes. The site allows users to interpret large gene lists by identifying associated functional terms, and their enrichment. Additionally, expression data for several experimental conditions were compiled and analyzed to provide an expression-based enrichment search. A tool to search for functionally-related genes based on gene expression across these conditions is also provided. Other features include dynamic visualization of genes on

  5. Algal functional annotation tool

    SciTech Connect

    Lopez, D.; Casero, D.; Cokus, S. J.; Merchant, S. S.; Pellegrini, M.

    2012-07-01

    The Algal Functional Annotation Tool is a web-based comprehensive analysis suite integrating annotation data from several pathway, ontology, and protein family databases. The current version provides annotation for the model alga Chlamydomonas reinhardtii, and in the future will include additional genomes. The site allows users to interpret large gene lists by identifying associated functional terms, and their enrichment. Additionally, expression data for several experimental conditions were compiled and analyzed to provide an expression-based enrichment search. A tool to search for functionally-related genes based on gene expression across these conditions is also provided. Other features include dynamic visualization of genes on KEGG pathway maps and batch gene identifier conversion.

  6. PSSP-RFE: Accurate Prediction of Protein Structural Class by Recursive Feature Extraction from PSI-BLAST Profile, Physical-Chemical Property and Functional Annotations

    PubMed Central

    Yu, Sanjiu; Zhang, Yuan; Luo, Zhong; Yang, Hua; Zhou, Yue; Zheng, Xiaoqi

    2014-01-01

    Protein structure prediction is critical to functional annotation of the massively accumulated biological sequences, which prompts an imperative need for the development of high-throughput technologies. As a first and key step in protein structure prediction, protein structural class prediction becomes an increasingly challenging task. Amongst most homological-based approaches, the accuracies of protein structural class prediction are sufficiently high for high similarity datasets, but still far from being satisfactory for low similarity datasets, i.e., below 40% in pairwise sequence similarity. Therefore, we present a novel method for accurate and reliable protein structural class prediction for both high and low similarity datasets. This method is based on Support Vector Machine (SVM) in conjunction with integrated features from position-specific score matrix (PSSM), PROFEAT and Gene Ontology (GO). A feature selection approach, SVM-RFE, is also used to rank the integrated feature vectors through recursively removing the feature with the lowest ranking score. The definitive top features selected by SVM-RFE are input into the SVM engines to predict the structural class of a query protein. To validate our method, jackknife tests were applied to seven widely used benchmark datasets, reaching overall accuracies between 84.61% and 99.79%, which are significantly higher than those achieved by state-of-the-art tools. These results suggest that our method could serve as an accurate and cost-effective alternative to existing methods in protein structural classification, especially for low similarity datasets. PMID:24675610

  7. Functional annotation of hypothetical proteins - A review.

    PubMed

    Sivashankari, Selvarajan; Shanmughavel, Piramanayagam

    2006-12-29

    The complete human genome sequences in the public database provide ways to understand the blue print of life. As of June 29, 2006, 27 archaeal, 326 bacterial and 21 eukaryotes is complete genomes are available and the sequencing for 316 bacterial, 24 archaeal, 126 eukaryotic genomes are in progress. The traditional biochemical/molecular experiments can assign accurate functions for genes in these genomes. However, the process is time-consuming and costly. Despite several efforts, only 50-60 % of genes have been annotated in most completely sequenced genomes. Automated genome sequence analysis and annotation may provide ways to understand genomes. Thus, determination of protein function is one of the challenging problems of the post-genome era. This demands bioinformatics to predict functions of un-annotated protein sequences by developing efficient tools. Here, we discuss some of the recent and popular approaches developed in Bioinformatics to predict functions for hypothetical proteins.

  8. Effective function annotation through catalytic residue conservation.

    PubMed

    George, Richard A; Spriggs, Ruth V; Bartlett, Gail J; Gutteridge, Alex; MacArthur, Malcolm W; Porter, Craig T; Al-Lazikani, Bissan; Thornton, Janet M; Swindells, Mark B

    2005-08-30

    Because of the extreme impact of genome sequencing projects, protein sequences without accompanying experimental data now dominate public databases. Homology searches, by providing an opportunity to transfer functional information between related proteins, have become the de facto way to address this. Although a single, well annotated, close relationship will often facilitate sufficient annotation, this situation is not always the case, particularly if mutations are present in important functional residues. When only distant relationships are available, the transfer of function information is more tenuous, and the likelihood of encountering several well annotated proteins with different functions is increased. The consequence for a researcher is a range of candidate functions with little way of knowing which, if any, are correct. Here, we address the problem directly by introducing a computational approach to accurately identify and segregate related proteins into those with a functional similarity and those where function differs. This approach should find a wide range of applications, including the interpretation of genomics/proteomics data and the prioritization of targets for high-throughput structure determination. The method is generic, but here we concentrate on enzymes and apply high-quality catalytic site data. In addition to providing a series of comprehensive benchmarks to show the overall performance of our approach, we illustrate its utility with specific examples that include the correct identification of haptoglobin as a nonenzymatic relative of trypsin, discrimination of acid-d-amino acid ligases from a much larger ligase pool, and the successful annotation of BioH, a structural genomics target.

  9. Facilitating functional annotation of chicken microarray data

    PubMed Central

    2009-01-01

    Background Modeling results from chicken microarray studies is challenging for researchers due to little functional annotation associated with these arrays. The Affymetrix GenChip chicken genome array, one of the biggest arrays that serve as a key research tool for the study of chicken functional genomics, is among the few arrays that link gene products to Gene Ontology (GO). However the GO annotation data presented by Affymetrix is incomplete, for example, they do not show references linked to manually annotated functions. In addition, there is no tool that facilitates microarray researchers to directly retrieve functional annotations for their datasets from the annotated arrays. This costs researchers amount of time in searching multiple GO databases for functional information. Results We have improved the breadth of functional annotations of the gene products associated with probesets on the Affymetrix chicken genome array by 45% and the quality of annotation by 14%. We have also identified the most significant diseases and disorders, different types of genes, and known drug targets represented on Affymetrix chicken genome array. To facilitate functional annotation of other arrays and microarray experimental datasets we developed an Array GO Mapper (AGOM) tool to help researchers to quickly retrieve corresponding functional information for their dataset. Conclusion Results from this study will directly facilitate annotation of other chicken arrays and microarray experimental datasets. Researchers will be able to quickly model their microarray dataset into more reliable biological functional information by using AGOM tool. The disease, disorders, gene types and drug targets revealed in the study will allow researchers to learn more about how genes function in complex biological systems and may lead to new drug discovery and development of therapies. The GO annotation data generated will be available for public use via AgBase website and will be updated on regular

  10. Structural and functional annotation of the porcine immunome

    PubMed Central

    2013-01-01

    Background The domestic pig is known as an excellent model for human immunology and the two species share many pathogens. Susceptibility to infectious disease is one of the major constraints on swine performance, yet the structure and function of genes comprising the pig immunome are not well-characterized. The completion of the pig genome provides the opportunity to annotate the pig immunome, and compare and contrast pig and human immune systems. Results The Immune Response Annotation Group (IRAG) used computational curation and manual annotation of the swine genome assembly 10.2 (Sscrofa10.2) to refine the currently available automated annotation of 1,369 immunity-related genes through sequence-based comparison to genes in other species. Within these genes, we annotated 3,472 transcripts. Annotation provided evidence for gene expansions in several immune response families, and identified artiodactyl-specific expansions in the cathelicidin and type 1 Interferon families. We found gene duplications for 18 genes, including 13 immune response genes and five non-immune response genes discovered in the annotation process. Manual annotation provided evidence for many new alternative splice variants and 8 gene duplications. Over 1,100 transcripts without porcine sequence evidence were detected using cross-species annotation. We used a functional approach to discover and accurately annotate porcine immune response genes. A co-expression clustering analysis of transcriptomic data from selected experimental infections or immune stimulations of blood, macrophages or lymph nodes identified a large cluster of genes that exhibited a correlated positive response upon infection across multiple pathogens or immune stimuli. Interestingly, this gene cluster (cluster 4) is enriched for known general human immune response genes, yet contains many un-annotated porcine genes. A phylogenetic analysis of the encoded proteins of cluster 4 genes showed that 15% exhibited an accelerated

  11. Rapid identification of sequences for orphan enzymes to power accurate protein annotation.

    PubMed

    Ramkissoon, Kevin R; Miller, Jennifer K; Ojha, Sunil; Watson, Douglas S; Bomar, Martha G; Galande, Amit K; Shearer, Alexander G

    2013-01-01

    The power of genome sequencing depends on the ability to understand what those genes and their proteins products actually do. The automated methods used to assign functions to putative proteins in newly sequenced organisms are limited by the size of our library of proteins with both known function and sequence. Unfortunately this library grows slowly, lagging well behind the rapid increase in novel protein sequences produced by modern genome sequencing methods. One potential source for rapidly expanding this functional library is the "back catalog" of enzymology--"orphan enzymes," those enzymes that have been characterized and yet lack any associated sequence. There are hundreds of orphan enzymes in the Enzyme Commission (EC) database alone. In this study, we demonstrate how this orphan enzyme "back catalog" is a fertile source for rapidly advancing the state of protein annotation. Starting from three orphan enzyme samples, we applied mass-spectrometry based analysis and computational methods (including sequence similarity networks, sequence and structural alignments, and operon context analysis) to rapidly identify the specific sequence for each orphan while avoiding the most time- and labor-intensive aspects of typical sequence identifications. We then used these three new sequences to more accurately predict the catalytic function of 385 previously uncharacterized or misannotated proteins. We expect that this kind of rapid sequence identification could be efficiently applied on a larger scale to make enzymology's "back catalog" another powerful tool to drive accurate genome annotation.

  12. Functional annotation of hypothetical proteins – A review

    PubMed Central

    Sivashankari, Selvarajan; Shanmughavel, Piramanayagam

    2006-01-01

    The complete human genome sequences in the public database provide ways to understand the blue print of life. As of June 29, 2006, 27 archaeal, 326 bacterial and 21 eukaryotes is complete genomes are available and the sequencing for 316 bacterial, 24 archaeal, 126 eukaryotic genomes are in progress. The traditional biochemical/molecular experiments can assign accurate functions for genes in these genomes. However, the process is time-consuming and costly. Despite several efforts, only 50-60 % of genes have been annotated in most completely sequenced genomes. Automated genome sequence analysis and annotation may provide ways to understand genomes. Thus, determination of protein function is one of the challenging problems of the post-genome era. This demands bioinformatics to predict functions of un-annotated protein sequences by developing efficient tools. Here, we discuss some of the recent and popular approaches developed in Bioinformatics to predict functions for hypothetical proteins. PMID:17597916

  13. Functional Annotations of Paralogs: A Blessing and a Curse

    PubMed Central

    Zallot, Rémi; Harrison, Katherine J.; Kolaczkowski, Bryan; de Crécy-Lagard, Valérie

    2016-01-01

    Gene duplication followed by mutation is a classic mechanism of neofunctionalization, producing gene families with functional diversity. In some cases, a single point mutation is sufficient to change the substrate specificity and/or the chemistry performed by an enzyme, making it difficult to accurately separate enzymes with identical functions from homologs with different functions. Because sequence similarity is often used as a basis for assigning functional annotations to genes, non-isofunctional gene families pose a great challenge for genome annotation pipelines. Here we describe how integrating evolutionary and functional information such as genome context, phylogeny, metabolic reconstruction and signature motifs may be required to correctly annotate multifunctional families. These integrative analyses can also lead to the discovery of novel gene functions, as hints from specific subgroups can guide the functional characterization of other members of the family. We demonstrate how careful manual curation processes using comparative genomics can disambiguate subgroups within large multifunctional families and discover their functions. We present the COG0720 protein family as a case study. We also discuss strategies to automate this process to improve the accuracy of genome functional annotation pipelines. PMID:27618105

  14. Functional Annotations of Paralogs: A Blessing and a Curse.

    PubMed

    Zallot, Rémi; Harrison, Katherine J; Kolaczkowski, Bryan; de Crécy-Lagard, Valérie

    2016-01-01

    Gene duplication followed by mutation is a classic mechanism of neofunctionalization, producing gene families with functional diversity. In some cases, a single point mutation is sufficient to change the substrate specificity and/or the chemistry performed by an enzyme, making it difficult to accurately separate enzymes with identical functions from homologs with different functions. Because sequence similarity is often used as a basis for assigning functional annotations to genes, non-isofunctional gene families pose a great challenge for genome annotation pipelines. Here we describe how integrating evolutionary and functional information such as genome context, phylogeny, metabolic reconstruction and signature motifs may be required to correctly annotate multifunctional families. These integrative analyses can also lead to the discovery of novel gene functions, as hints from specific subgroups can guide the functional characterization of other members of the family. We demonstrate how careful manual curation processes using comparative genomics can disambiguate subgroups within large multifunctional families and discover their functions. We present the COG0720 protein family as a case study. We also discuss strategies to automate this process to improve the accuracy of genome functional annotation pipelines. PMID:27618105

  15. A Machine Learning Approach for Accurate Annotation of Noncoding RNAs.

    PubMed

    Song, Yinglei; Liu, Chunmei; Wang, Zhi

    2015-01-01

    Searching genomes to locate noncoding RNA genes with known secondary structure is an important problem in bioinformatics. In general, the secondary structure of a searched noncoding RNA is defined with a structure model constructed from the structural alignment of a set of sequences from its family. Computing the optimal alignment between a sequence and a structure model is the core part of an algorithm that can search genomes for noncoding RNAs. In practice, a single structure model may not be sufficient to capture all crucial features important for a noncoding RNA family. In this paper, we develop a novel machine learning approach that can efficiently search genomes for noncoding RNAs with high accuracy. During the search procedure, a sequence segment in the searched genome sequence is processed and a feature vector is extracted to represent it. Based on the feature vector, a classifier is used to determine whether the sequence segment is the searched ncRNA or not. Our testing results show that this approach is able to efficiently capture crucial features of a noncoding RNA family. Compared with existing search tools, it significantly improves the accuracy of genome annotation. PMID:26357266

  16. A Machine Learning Approach for Accurate Annotation of Noncoding RNAs

    PubMed Central

    Liu, Chunmei; Wang, Zhi

    2016-01-01

    Searching genomes to locate noncoding RNA genes with known secondary structure is an important problem in bioinformatics. In general, the secondary structure of a searched noncoding RNA is defined with a structure model constructed from the structural alignment of a set of sequences from its family. Computing the optimal alignment between a sequence and a structure model is the core part of an algorithm that can search genomes for noncoding RNAs. In practice, a single structure model may not be sufficient to capture all crucial features important for a noncoding RNA family. In this paper, we develop a novel machine learning approach that can efficiently search genomes for noncoding RNAs with high accuracy. During the search procedure, a sequence segment in the searched genome sequence is processed and a feature vector is extracted to represent it. Based on the feature vector, a classifier is used to determine whether the sequence segment is the searched ncRNA or not. Our testing results show that this approach is able to efficiently capture crucial features of a noncoding RNA family. Compared with existing search tools, it significantly improves the accuracy of genome annotation. PMID:26357266

  17. Functional Annotation Analytics of Rhodopseudomonas palustris Genomes

    PubMed Central

    Simmons, Shaneka S.; Isokpehi, Raphael D.; Brown, Shyretha D.; McAllister, Donee L.; Hall, Charnia C.; McDuffy, Wanaki M.; Medley, Tamara L.; Udensi, Udensi K.; Rajnarayanan, Rajendram V.; Ayensu, Wellington K.; Cohly, Hari H.P.

    2011-01-01

    Rhodopseudomonas palustris, a nonsulphur purple photosynthetic bacteria, has been extensively investigated for its metabolic versatility including ability to produce hydrogen gas from sunlight and biomass. The availability of the finished genome sequences of six R. palustris strains (BisA53, BisB18, BisB5, CGA009, HaA2 and TIE-1) combined with online bioinformatics software for integrated analysis presents new opportunities to determine the genomic basis of metabolic versatility and ecological lifestyles of the bacteria species. The purpose of this investigation was to compare the functional annotations available for multiple R. palustris genomes to identify annotations that can be further investigated for strain-specific or uniquely shared phenotypic characteristics. A total of 2,355 protein family Pfam domain annotations were clustered based on presence or absence in the six genomes. The clustering process identified groups of functional annotations including those that could be verified as strain-specific or uniquely shared phenotypes. For example, genes encoding water/glycerol transport were present in the genome sequences of strains CGA009 and BisB5, but absent in strains BisA53, BisB18, HaA2 and TIE-1. Protein structural homology modeling predicted that the two orthologous 240 aa R. palustris aquaporins have water-specific transport function. Based on observations in other microbes, the presence of aquaporin in R. palustris strains may improve freeze tolerance in natural conditions of rapid freezing such as nitrogen fixation at low temperatures where access to liquid water is a limiting factor for nitrogenase activation. In the case of adaptive loss of aquaporin genes, strains may be better adapted to survive in conditions of high-sugar content such as fermentation of biomass for biohydrogen production. Finally, web-based resources were developed to allow for interactive, user-defined selection of the relationship between protein family annotations and the R

  18. UP-TORR: online tool for accurate and Up-to-Date annotation of RNAi Reagents.

    PubMed

    Hu, Yanhui; Roesel, Charles; Flockhart, Ian; Perkins, Lizabeth; Perrimon, Norbert; Mohr, Stephanie E

    2013-09-01

    RNA interference (RNAi) is a widely adopted tool for loss-of-function studies but RNAi results only have biological relevance if the reagents are appropriately mapped to genes. Several groups have designed and generated RNAi reagent libraries for studies in cells or in vivo for Drosophila and other species. At first glance, matching RNAi reagents to genes appears to be a simple problem, as each reagent is typically designed to target a single gene. In practice, however, the reagent-gene relationship is complex. Although the sequences of oligonucleotides used to generate most types of RNAi reagents are static, the reference genome and gene annotations are regularly updated. Thus, at the time a researcher chooses an RNAi reagent or analyzes RNAi data, the most current interpretation of the RNAi reagent-gene relationship, as well as related information regarding specificity (e.g., predicted off-target effects), can be different from the original interpretation. Here, we describe a set of strategies and an accompanying online tool, UP-TORR (for Updated Targets of RNAi Reagents; www.flyrnai.org/up-torr), useful for accurate and up-to-date annotation of cell-based and in vivo RNAi reagents. Importantly, UP-TORR automatically synchronizes with gene annotations daily, retrieving the most current information available, and for Drosophila, also synchronizes with the major reagent collections. Thus, UP-TORR allows users to choose the most appropriate RNAi reagents at the onset of a study, as well as to perform the most appropriate analyses of results of RNAi-based studies.

  19. Coding exon-structure aware realigner (CESAR) utilizes genome alignments for accurate comparative gene annotation.

    PubMed

    Sharma, Virag; Elghafari, Anas; Hiller, Michael

    2016-06-20

    Identifying coding genes is an essential step in genome annotation. Here, we utilize existing whole genome alignments to detect conserved coding exons and then map gene annotations from one genome to many aligned genomes. We show that genome alignments contain thousands of spurious frameshifts and splice site mutations in exons that are truly conserved. To overcome these limitations, we have developed CESAR (Coding Exon-Structure Aware Realigner) that realigns coding exons, while considering reading frame and splice sites of each exon. CESAR effectively avoids spurious frameshifts in conserved genes and detects 91% of shifted splice sites. This results in the identification of thousands of additional conserved exons and 99% of the exons that lack inactivating mutations match real exons. Finally, to demonstrate the potential of using CESAR for comparative gene annotation, we applied it to 188 788 exons of 19 865 human genes to annotate human genes in 99 other vertebrates. These comparative gene annotations are available as a resource (http://bds.mpi-cbg.de/hillerlab/CESAR/). CESAR (https://github.com/hillerlab/CESAR/) can readily be applied to other alignments to accurately annotate coding genes in many other vertebrate and invertebrate genomes. PMID:27016733

  20. Fast and accurate semantic annotation of bioassays exploiting a hybrid of machine learning and user confirmation

    PubMed Central

    Bunin, Barry A.; Litterman, Nadia K.; Schürer, Stephan C.; Visser, Ubbo

    2014-01-01

    Bioinformatics and computer aided drug design rely on the curation of a large number of protocols for biological assays that measure the ability of potential drugs to achieve a therapeutic effect. These assay protocols are generally published by scientists in the form of plain text, which needs to be more precisely annotated in order to be useful to software methods. We have developed a pragmatic approach to describing assays according to the semantic definitions of the BioAssay Ontology (BAO) project, using a hybrid of machine learning based on natural language processing, and a simplified user interface designed to help scientists curate their data with minimum effort. We have carried out this work based on the premise that pure machine learning is insufficiently accurate, and that expecting scientists to find the time to annotate their protocols manually is unrealistic. By combining these approaches, we have created an effective prototype for which annotation of bioassay text within the domain of the training set can be accomplished very quickly. Well-trained annotations require single-click user approval, while annotations from outside the training set domain can be identified using the search feature of a well-designed user interface, and subsequently used to improve the underlying models. By drastically reducing the time required for scientists to annotate their assays, we can realistically advocate for semantic annotation to become a standard part of the publication process. Once even a small proportion of the public body of bioassay data is marked up, bioinformatics researchers can begin to construct sophisticated and useful searching and analysis algorithms that will provide a diverse and powerful set of tools for drug discovery researchers. PMID:25165633

  1. Fast and accurate semantic annotation of bioassays exploiting a hybrid of machine learning and user confirmation.

    PubMed

    Clark, Alex M; Bunin, Barry A; Litterman, Nadia K; Schürer, Stephan C; Visser, Ubbo

    2014-01-01

    Bioinformatics and computer aided drug design rely on the curation of a large number of protocols for biological assays that measure the ability of potential drugs to achieve a therapeutic effect. These assay protocols are generally published by scientists in the form of plain text, which needs to be more precisely annotated in order to be useful to software methods. We have developed a pragmatic approach to describing assays according to the semantic definitions of the BioAssay Ontology (BAO) project, using a hybrid of machine learning based on natural language processing, and a simplified user interface designed to help scientists curate their data with minimum effort. We have carried out this work based on the premise that pure machine learning is insufficiently accurate, and that expecting scientists to find the time to annotate their protocols manually is unrealistic. By combining these approaches, we have created an effective prototype for which annotation of bioassay text within the domain of the training set can be accomplished very quickly. Well-trained annotations require single-click user approval, while annotations from outside the training set domain can be identified using the search feature of a well-designed user interface, and subsequently used to improve the underlying models. By drastically reducing the time required for scientists to annotate their assays, we can realistically advocate for semantic annotation to become a standard part of the publication process. Once even a small proportion of the public body of bioassay data is marked up, bioinformatics researchers can begin to construct sophisticated and useful searching and analysis algorithms that will provide a diverse and powerful set of tools for drug discovery researchers. PMID:25165633

  2. Accurate Transposable Element Annotation Is Vital When Analyzing New Genome Assemblies

    PubMed Central

    Platt, Roy N.; Blanco-Berdugo, Laura; Ray, David A.

    2016-01-01

    Transposable elements (TEs) are mobile genetic elements with the ability to replicate themselves throughout the host genome. In some taxa TEs reach copy numbers in hundreds of thousands and can occupy more than half of the genome. The increasing number of reference genomes from nonmodel species has begun to outpace efforts to identify and annotate TE content and methods that are used vary significantly between projects. Here, we demonstrate variation that arises in TE annotations when less than optimal methods are used. We found that across a variety of taxa, the ability to accurately identify TEs based solely on homology decreased as the phylogenetic distance between the queried genome and a reference increased. Next we annotated repeats using homology alone, as is often the case in new genome analyses, and a combination of homology and de novo methods as well as an additional manual curation step. Reannotation using these methods identified a substantial number of new TE subfamilies in previously characterized genomes, recognized a higher proportion of the genome as repetitive, and decreased the average genetic distance within TE families, implying recent TE accumulation. Finally, these finding—increased recognition of younger TEs—were confirmed via an analysis of the postman butterfly (Heliconius melpomene). These observations imply that complete TE annotation relies on a combination of homology and de novo–based repeat identification, manual curation, and classification and that relying on simple, homology-based methods is insufficient to accurately describe the TE landscape of a newly sequenced genome. PMID:26802115

  3. MitoFish and MitoAnnotator: A Mitochondrial Genome Database of Fish with an Accurate and Automatic Annotation Pipeline

    PubMed Central

    Iwasaki, Wataru; Fukunaga, Tsukasa; Isagozawa, Ryota; Yamada, Koichiro; Maeda, Yasunobu; Satoh, Takashi P.; Sado, Tetsuya; Mabuchi, Kohji; Takeshima, Hirohiko; Miya, Masaki; Nishida, Mutsumi

    2013-01-01

    Mitofish is a database of fish mitochondrial genomes (mitogenomes) that includes powerful and precise de novo annotations for mitogenome sequences. Fish occupy an important position in the evolution of vertebrates and the ecology of the hydrosphere, and mitogenomic sequence data have served as a rich source of information for resolving fish phylogenies and identifying new fish species. The importance of a mitogenomic database continues to grow at a rapid pace as massive amounts of mitogenomic data are generated with the advent of new sequencing technologies. A severe bottleneck seems likely to occur with regard to mitogenome annotation because of the overwhelming pace of data accumulation and the intrinsic difficulties in annotating sequences with degenerating transfer RNA structures, divergent start/stop codons of the coding elements, and the overlapping of adjacent elements. To ease this data backlog, we developed an annotation pipeline named MitoAnnotator. MitoAnnotator automatically annotates a fish mitogenome with a high degree of accuracy in approximately 5 min; thus, it is readily applicable to data sets of dozens of sequences. MitoFish also contains re-annotations of previously sequenced fish mitogenomes, enabling researchers to refer to them when they find annotations that are likely to be erroneous or while conducting comparative mitogenomic analyses. For users who need more information on the taxonomy, habitats, phenotypes, or life cycles of fish, MitoFish provides links to related databases. MitoFish and MitoAnnotator are freely available at http://mitofish.aori.u-tokyo.ac.jp/ (last accessed August 28, 2013); all of the data can be batch downloaded, and the annotation pipeline can be used via a web interface. PMID:23955518

  4. MitoFish and MitoAnnotator: a mitochondrial genome database of fish with an accurate and automatic annotation pipeline.

    PubMed

    Iwasaki, Wataru; Fukunaga, Tsukasa; Isagozawa, Ryota; Yamada, Koichiro; Maeda, Yasunobu; Satoh, Takashi P; Sado, Tetsuya; Mabuchi, Kohji; Takeshima, Hirohiko; Miya, Masaki; Nishida, Mutsumi

    2013-11-01

    Mitofish is a database of fish mitochondrial genomes (mitogenomes) that includes powerful and precise de novo annotations for mitogenome sequences. Fish occupy an important position in the evolution of vertebrates and the ecology of the hydrosphere, and mitogenomic sequence data have served as a rich source of information for resolving fish phylogenies and identifying new fish species. The importance of a mitogenomic database continues to grow at a rapid pace as massive amounts of mitogenomic data are generated with the advent of new sequencing technologies. A severe bottleneck seems likely to occur with regard to mitogenome annotation because of the overwhelming pace of data accumulation and the intrinsic difficulties in annotating sequences with degenerating transfer RNA structures, divergent start/stop codons of the coding elements, and the overlapping of adjacent elements. To ease this data backlog, we developed an annotation pipeline named MitoAnnotator. MitoAnnotator automatically annotates a fish mitogenome with a high degree of accuracy in approximately 5 min; thus, it is readily applicable to data sets of dozens of sequences. MitoFish also contains re-annotations of previously sequenced fish mitogenomes, enabling researchers to refer to them when they find annotations that are likely to be erroneous or while conducting comparative mitogenomic analyses. For users who need more information on the taxonomy, habitats, phenotypes, or life cycles of fish, MitoFish provides links to related databases. MitoFish and MitoAnnotator are freely available at http://mitofish.aori.u-tokyo.ac.jp/ (last accessed August 28, 2013); all of the data can be batch downloaded, and the annotation pipeline can be used via a web interface.

  5. Considerations to improve functional annotations in biological databases.

    PubMed

    Benítez-Páez, Alfonso

    2009-12-01

    Despite the great effort to design efficient systems allowing the electronic indexation of information concerning genes, proteins, structures, and interactions published daily in scientific journals, some problems are still observed in specific tasks such as functional annotation. The annotation of function is a critical issue for bioinformatic routines, such as for instance, in functional genomics and the further prediction of unknown protein function, which are highly dependent of the quality of existing annotations. Some information management systems evolve to efficiently incorporate information from large-scale projects, but often, annotation of single records from the literature is difficult and slow. In this short report, functional characterizations of a representative sample of the entire set of uncharacterized proteins from Escherichia coli K12 was compiled from Swiss-Prot, PubMed, and EcoCyc and demonstrate a functional annotation deficit in biological databases. Some issues are postulated as causes of the lack of annotation, and different solutions are evaluated and proposed to avoid them. The hope is that as a consequence of these observations, there will be new impetus to improve the speed and quality of functional annotation and ultimately provide updated, reliable information to the scientific community. PMID:20050264

  6. A Maximum-Entropy approach for accurate document annotation in the biomedical domain.

    PubMed

    Tsatsaronis, George; Macari, Natalia; Torge, Sunna; Dietze, Heiko; Schroeder, Michael

    2012-01-01

    The increasing number of scientific literature on the Web and the absence of efficient tools used for classifying and searching the documents are the two most important factors that influence the speed of the search and the quality of the results. Previous studies have shown that the usage of ontologies makes it possible to process document and query information at the semantic level, which greatly improves the search for the relevant information and makes one step further towards the Semantic Web. A fundamental step in these approaches is the annotation of documents with ontology concepts, which can also be seen as a classification task. In this paper we address this issue for the biomedical domain and present a new automated and robust method, based on a Maximum Entropy approach, for annotating biomedical literature documents with terms from the Medical Subject Headings (MeSH).The experimental evaluation shows that the suggested Maximum Entropy approach for annotating biomedical documents with MeSH terms is highly accurate, robust to the ambiguity of terms, and can provide very good performance even when a very small number of training documents is used. More precisely, we show that the proposed algorithm obtained an average F-measure of 92.4% (precision 99.41%, recall 86.77%) for the full range of the explored terms (4,078 MeSH terms), and that the algorithm's performance is resilient to terms' ambiguity, achieving an average F-measure of 92.42% (precision 99.32%, recall 86.87%) in the explored MeSH terms which were found to be ambiguous according to the Unified Medical Language System (UMLS) thesaurus. Finally, we compared the results of the suggested methodology with a Naive Bayes and a Decision Trees classification approach, and we show that the Maximum Entropy based approach performed with higher F-Measure in both ambiguous and monosemous MeSH terms.

  7. Automatic annotation of protein function based on family identification.

    PubMed

    Abascal, Federico; Valencia, Alfonso

    2003-11-15

    Although genomes are being sequenced at an impressive rate, the information generated tells us little about protein function, which is slow to characterize by traditional methods. Automatic protein function annotation based on computational methods has alleviated this imbalance. The most powerful current approach for inferring the function of new proteins is by studying the annotations of their homologues, since their common origin is assumed to be reflected in their structure and function. Unfortunately, as proteins evolve they acquire new functions, so annotation based on homology must be carried out in the context of orthologues or subfamilies. Evolution adds new complications through domain shuffling: homology (or orthology) frequently corresponds to domains rather than complete proteins. Moreover, the function of a protein may be seen as the result of combining the functions of its domains. Additionally, automatic annotation has to deal with problems related to the annotations in the databases: errors (which are likely to be propagated), inconsistencies, or different degrees of function specification. We describe a method that addresses these difficulties for the annotation of protein function. Sequence relationships are detected and measured to obtain a map of the sequence space, which is searched for differentiated groups of proteins (similar to islands on the map), which are expected to have a common function and correspond to groups of orthologues or subfamilies. This mapmaking is done by applying a clustering algorithm based on Normalized cuts in graphs. The domain problem is addressed in a simple way: pairwise local alignments are analyzed to determine the extent to which they cover the entire sequence lengths of the two proteins. This analysis determines both what homologues are preferred for functional inheritance and the level of confidence of the annotation. To alleviate the problems associated with database annotations, the information on all the

  8. Integrative structural annotation of de novo RNA-Seq provides an accurate reference gene set of the enormous genome of the onion (Allium cepa L.).

    PubMed

    Kim, Seungill; Kim, Myung-Shin; Kim, Yong-Min; Yeom, Seon-In; Cheong, Kyeongchae; Kim, Ki-Tae; Jeon, Jongbum; Kim, Sunggil; Kim, Do-Sun; Sohn, Seong-Han; Lee, Yong-Hwan; Choi, Doil

    2015-02-01

    The onion (Allium cepa L.) is one of the most widely cultivated and consumed vegetable crops in the world. Although a considerable amount of onion transcriptome data has been deposited into public databases, the sequences of the protein-coding genes are not accurate enough to be used, owing to non-coding sequences intermixed with the coding sequences. We generated a high-quality, annotated onion transcriptome from de novo sequence assembly and intensive structural annotation using the integrated structural gene annotation pipeline (ISGAP), which identified 54,165 protein-coding genes among 165,179 assembled transcripts totalling 203.0 Mb by eliminating the intron sequences. ISGAP performed reliable annotation, recognizing accurate gene structures based on reference proteins, and ab initio gene models of the assembled transcripts. Integrative functional annotation and gene-based SNP analysis revealed a whole biological repertoire of genes and transcriptomic variation in the onion. The method developed in this study provides a powerful tool for the construction of reference gene sets for organisms based solely on de novo transcriptome data. Furthermore, the reference genes and their variation described here for the onion represent essential tools for molecular breeding and gene cloning in Allium spp.

  9. CLASS2: accurate and efficient splice variant annotation from RNA-seq reads

    PubMed Central

    Song, Li; Sabunciyan, Sarven; Florea, Liliana

    2016-01-01

    Next generation sequencing of cellular RNA is making it possible to characterize genes and alternative splicing in unprecedented detail. However, designing bioinformatics tools to accurately capture splicing variation has proven difficult. Current programs can find major isoforms of a gene but miss lower abundance variants, or are sensitive but imprecise. CLASS2 is a novel open source tool for accurate genome-guided transcriptome assembly from RNA-seq reads based on the model of splice graph. An extension of our program CLASS, CLASS2 jointly optimizes read patterns and the number of supporting reads to score and prioritize transcripts, implemented in a novel, scalable and efficient dynamic programming algorithm. When compared against reference programs, CLASS2 had the best overall accuracy and could detect up to twice as many splicing events with precision similar to the best reference program. Notably, it was the only tool to produce consistently reliable transcript models for a wide range of applications and sequencing strategies, including ribosomal RNA-depleted samples. Lightweight and multi-threaded, CLASS2 requires <3GB RAM and can analyze a 350 million read set within hours, and can be widely applied to transcriptomics studies ranging from clinical RNA sequencing, to alternative splicing analyses, and to the annotation of new genomes. PMID:26975657

  10. Consistency of VDJ Rearrangement and Substitution Parameters Enables Accurate B Cell Receptor Sequence Annotation

    PubMed Central

    Ralph, Duncan K.; Matsen, Frederick A.

    2016-01-01

    VDJ rearrangement and somatic hypermutation work together to produce antibody-coding B cell receptor (BCR) sequences for a remarkable diversity of antigens. It is now possible to sequence these BCRs in high throughput; analysis of these sequences is bringing new insight into how antibodies develop, in particular for broadly-neutralizing antibodies against HIV and influenza. A fundamental step in such sequence analysis is to annotate each base as coming from a specific one of the V, D, or J genes, or from an N-addition (a.k.a. non-templated insertion). Previous work has used simple parametric distributions to model transitions from state to state in a hidden Markov model (HMM) of VDJ recombination, and assumed that mutations occur via the same process across sites. However, codon frame and other effects have been observed to violate these parametric assumptions for such coding sequences, suggesting that a non-parametric approach to modeling the recombination process could be useful. In our paper, we find that indeed large modern data sets suggest a model using parameter-rich per-allele categorical distributions for HMM transition probabilities and per-allele-per-position mutation probabilities, and that using such a model for inference leads to significantly improved results. We present an accurate and efficient BCR sequence annotation software package using a novel HMM “factorization” strategy. This package, called partis (https://github.com/psathyrella/partis/), is built on a new general-purpose HMM compiler that can perform efficient inference given a simple text description of an HMM. PMID:26751373

  11. Consistency of VDJ Rearrangement and Substitution Parameters Enables Accurate B Cell Receptor Sequence Annotation.

    PubMed

    Ralph, Duncan K; Matsen, Frederick A

    2016-01-01

    VDJ rearrangement and somatic hypermutation work together to produce antibody-coding B cell receptor (BCR) sequences for a remarkable diversity of antigens. It is now possible to sequence these BCRs in high throughput; analysis of these sequences is bringing new insight into how antibodies develop, in particular for broadly-neutralizing antibodies against HIV and influenza. A fundamental step in such sequence analysis is to annotate each base as coming from a specific one of the V, D, or J genes, or from an N-addition (a.k.a. non-templated insertion). Previous work has used simple parametric distributions to model transitions from state to state in a hidden Markov model (HMM) of VDJ recombination, and assumed that mutations occur via the same process across sites. However, codon frame and other effects have been observed to violate these parametric assumptions for such coding sequences, suggesting that a non-parametric approach to modeling the recombination process could be useful. In our paper, we find that indeed large modern data sets suggest a model using parameter-rich per-allele categorical distributions for HMM transition probabilities and per-allele-per-position mutation probabilities, and that using such a model for inference leads to significantly improved results. We present an accurate and efficient BCR sequence annotation software package using a novel HMM "factorization" strategy. This package, called partis (https://github.com/psathyrella/partis/), is built on a new general-purpose HMM compiler that can perform efficient inference given a simple text description of an HMM. PMID:26751373

  12. Biocuration of functional annotation at the European nucleotide archive

    PubMed Central

    Gibson, Richard; Alako, Blaise; Amid, Clara; Cerdeño-Tárraga, Ana; Cleland, Iain; Goodgame, Neil; ten Hoopen, Petra; Jayathilaka, Suran; Kay, Simon; Leinonen, Rasko; Liu, Xin; Pallreddy, Swapna; Pakseresht, Nima; Rajan, Jeena; Rosselló, Marc; Silvester, Nicole; Smirnov, Dmitriy; Toribio, Ana Luisa; Vaughan, Daniel; Zalunin, Vadim; Cochrane, Guy

    2016-01-01

    The European Nucleotide Archive (ENA; http://www.ebi.ac.uk/ena) is a repository for the submission, maintenance and presentation of nucleotide sequence data and related sample and experimental information. In this article we report on ENA in 2015 regarding general activity, notable published data sets and major achievements. This is followed by a focus on sustainable biocuration of functional annotation, an area which has particularly felt the pressure of sequencing growth. The importance of functional annotation, how it can be submitted and the shifting role of the biocurator in the context of increasing volumes of data are all discussed. PMID:26615190

  13. Assessment of protein set coherence using functional annotations

    PubMed Central

    Chagoyen, Monica; Carazo, Jose M; Pascual-Montano, Alberto

    2008-01-01

    Background Analysis of large-scale experimental datasets frequently produces one or more sets of proteins that are subsequently mined for functional interpretation and validation. To this end, a number of computational methods have been devised that rely on the analysis of functional annotations. Although current methods provide valuable information (e.g. significantly enriched annotations, pairwise functional similarities), they do not specifically measure the degree of homogeneity of a protein set. Results In this work we present a method that scores the degree of functional homogeneity, or coherence, of a set of proteins on the basis of the global similarity of their functional annotations. The method uses statistical hypothesis testing to assess the significance of the set in the context of the functional space of a reference set. As such, it can be used as a first step in the validation of sets expected to be homogeneous prior to further functional interpretation. Conclusion We evaluate our method by analysing known biologically relevant sets as well as random ones. The known relevant sets comprise macromolecular complexes, cellular components and pathways described for Saccharomyces cerevisiae, which are mostly significantly coherent. Finally, we illustrate the usefulness of our approach for validating 'functional modules' obtained from computational analysis of protein-protein interaction networks. Matlab code and supplementary data are available at PMID:18937846

  14. Accurate density functional thermochemistry for larger molecules.

    SciTech Connect

    Raghavachari, K.; Stefanov, B. B.; Curtiss, L. A.; Lucent Tech.

    1997-06-20

    Density functional methods are combined with isodesmic bond separation reaction energies to yield accurate thermochemistry for larger molecules. Seven different density functionals are assessed for the evaluation of heats of formation, Delta H 0 (298 K), for a test set of 40 molecules composed of H, C, O and N. The use of bond separation energies results in a dramatic improvement in the accuracy of all the density functionals. The B3-LYP functional has the smallest mean absolute deviation from experiment (1.5 kcal mol/f).

  15. Functional annotation of colon cancer risk SNPs

    PubMed Central

    Yao, Lijing; Tak, Yu Gyoung; Berman, Benjamin P.; Farnham, Peggy J.

    2014-01-01

    Colorectal cancer (CRC) is a leading cause of cancer-related deaths in the United States. Genome-wide association studies (GWAS) have identified single nucleotide polymorphisms (SNPs) associated with increased risk for CRC. A molecular understanding of the functional consequences of this genetic variation has been complicated because each GWAS SNP is a surrogate for hundreds of other SNPs, most of which are located in non-coding regions. Here we use genomic and epigenomic information to test the hypothesis that the GWAS SNPs and/or correlated SNPs are in elements that regulate gene expression, and identify 23 promoters and 28 enhancers. Using gene expression data from normal and tumour cells, we identify 66 putative target genes of the risk-associated enhancers (10 of which were also identified by promoter SNPs). Employing CRISPR nucleases, we delete one risk-associated enhancer and identify genes showing altered expression. We suggest that similar studies be performed to characterize all CRC risk-associated enhancers. PMID:25268989

  16. Optimizing high performance computing workflow for protein functional annotation.

    PubMed

    Stanberry, Larissa; Rekepalli, Bhanu; Liu, Yuan; Giblock, Paul; Higdon, Roger; Montague, Elizabeth; Broomall, William; Kolker, Natali; Kolker, Eugene

    2014-09-10

    Functional annotation of newly sequenced genomes is one of the major challenges in modern biology. With modern sequencing technologies, the protein sequence universe is rapidly expanding. Newly sequenced bacterial genomes alone contain over 7.5 million proteins. The rate of data generation has far surpassed that of protein annotation. The volume of protein data makes manual curation infeasible, whereas a high compute cost limits the utility of existing automated approaches. In this work, we present an improved and optmized automated workflow to enable large-scale protein annotation. The workflow uses high performance computing architectures and a low complexity classification algorithm to assign proteins into existing clusters of orthologous groups of proteins. On the basis of the Position-Specific Iterative Basic Local Alignment Search Tool the algorithm ensures at least 80% specificity and sensitivity of the resulting classifications. The workflow utilizes highly scalable parallel applications for classification and sequence alignment. Using Extreme Science and Engineering Discovery Environment supercomputers, the workflow processed 1,200,000 newly sequenced bacterial proteins. With the rapid expansion of the protein sequence universe, the proposed workflow will enable scientists to annotate big genome data. PMID:25313296

  17. Optimizing high performance computing workflow for protein functional annotation.

    PubMed

    Stanberry, Larissa; Rekepalli, Bhanu; Liu, Yuan; Giblock, Paul; Higdon, Roger; Montague, Elizabeth; Broomall, William; Kolker, Natali; Kolker, Eugene

    2014-09-10

    Functional annotation of newly sequenced genomes is one of the major challenges in modern biology. With modern sequencing technologies, the protein sequence universe is rapidly expanding. Newly sequenced bacterial genomes alone contain over 7.5 million proteins. The rate of data generation has far surpassed that of protein annotation. The volume of protein data makes manual curation infeasible, whereas a high compute cost limits the utility of existing automated approaches. In this work, we present an improved and optmized automated workflow to enable large-scale protein annotation. The workflow uses high performance computing architectures and a low complexity classification algorithm to assign proteins into existing clusters of orthologous groups of proteins. On the basis of the Position-Specific Iterative Basic Local Alignment Search Tool the algorithm ensures at least 80% specificity and sensitivity of the resulting classifications. The workflow utilizes highly scalable parallel applications for classification and sequence alignment. Using Extreme Science and Engineering Discovery Environment supercomputers, the workflow processed 1,200,000 newly sequenced bacterial proteins. With the rapid expansion of the protein sequence universe, the proposed workflow will enable scientists to annotate big genome data.

  18. Functional annotation of non-coding sequence variants

    PubMed Central

    Ritchie, Graham R. S.; Dunham, Ian; Zeggini, Eleftheria; Flicek, Paul

    2016-01-01

    Identifying functionally relevant variants against the background of ubiquitous genetic variation is a major challenge in human genetics. For variants that fall in protein-coding regions our understanding of the genetic code and splicing allow us to identify likely candidates, but interpreting variants that fall outside of genic regions is more difficult. Here we present a new tool, GWAVA, which supports prioritisation of non-coding variants by integrating a range of annotations. PMID:24487584

  19. Gene3D: comprehensive structural and functional annotation of genomes.

    PubMed

    Yeats, Corin; Lees, Jonathan; Reid, Adam; Kellam, Paul; Martin, Nigel; Liu, Xinhui; Orengo, Christine

    2008-01-01

    Gene3D provides comprehensive structural and functional annotation of most available protein sequences, including the UniProt, RefSeq and Integr8 resources. The main structural annotation is generated through scanning these sequences against the CATH structural domain database profile-HMM library. CATH is a database of manually derived PDB-based structural domains, placed within a hierarchy reflecting topology, homology and conservation and is able to infer more ancient and divergent homology relationships than sequence-based approaches. This data is supplemented with Pfam-A, other non-domain structural predictions (i.e. coiled coils) and experimental data from UniProt. In order to enhance the investigations possible with this data, we have also incorporated a variety of protein annotation resources, including protein-protein interaction data, GO functional assignments, KEGG pathways, FUNCAT functional descriptions and links to microarray expression data. All of this data can be accessed through a newly re-designed website that has a focus on flexibility and clarity, with searches that can be restricted to a single genome or across the entire sequence database. Currently Gene3D contains over 3.5 million domain assignments for nearly 5 million proteins including 527 completed genomes. This is available at: http://gene3d.biochem.ucl.ac.uk/ PMID:18032434

  20. Eliciting the Functional Taxonomy from protein annotations and taxa.

    PubMed

    Falda, Marco; Lavezzo, Enrico; Fontana, Paolo; Bianco, Luca; Berselli, Michele; Formentin, Elide; Toppo, Stefano

    2016-08-18

    The advances of omics technologies have triggered the production of an enormous volume of data coming from thousands of species. Meanwhile, joint international efforts like the Gene Ontology (GO) consortium have worked to provide functional information for a vast amount of proteins. With these data available, we have developed FunTaxIS, a tool that is the first attempt to infer functional taxonomy (i.e. how functions are distributed over taxa) combining functional and taxonomic information. FunTaxIS is able to define a taxon specific functional space by exploiting annotation frequencies in order to establish if a function can or cannot be used to annotate a certain species. The tool generates constraints between GO terms and taxa and then propagates these relations over the taxonomic tree and the GO graph. Since these constraints nearly cover the whole taxonomy, it is possible to obtain the mapping of a function over the taxonomy. FunTaxIS can be used to make functional comparative analyses among taxa, to detect improper associations between taxa and functions, and to discover how functional knowledge is either distributed or missing. A benchmark test set based on six different model species has been devised to get useful insights on the generated taxonomic rules.

  1. Eliciting the Functional Taxonomy from protein annotations and taxa

    PubMed Central

    Falda, Marco; Lavezzo, Enrico; Fontana, Paolo; Bianco, Luca; Berselli, Michele; Formentin, Elide; Toppo, Stefano

    2016-01-01

    The advances of omics technologies have triggered the production of an enormous volume of data coming from thousands of species. Meanwhile, joint international efforts like the Gene Ontology (GO) consortium have worked to provide functional information for a vast amount of proteins. With these data available, we have developed FunTaxIS, a tool that is the first attempt to infer functional taxonomy (i.e. how functions are distributed over taxa) combining functional and taxonomic information. FunTaxIS is able to define a taxon specific functional space by exploiting annotation frequencies in order to establish if a function can or cannot be used to annotate a certain species. The tool generates constraints between GO terms and taxa and then propagates these relations over the taxonomic tree and the GO graph. Since these constraints nearly cover the whole taxonomy, it is possible to obtain the mapping of a function over the taxonomy. FunTaxIS can be used to make functional comparative analyses among taxa, to detect improper associations between taxa and functions, and to discover how functional knowledge is either distributed or missing. A benchmark test set based on six different model species has been devised to get useful insights on the generated taxonomic rules. PMID:27534507

  2. Eliciting the Functional Taxonomy from protein annotations and taxa.

    PubMed

    Falda, Marco; Lavezzo, Enrico; Fontana, Paolo; Bianco, Luca; Berselli, Michele; Formentin, Elide; Toppo, Stefano

    2016-01-01

    The advances of omics technologies have triggered the production of an enormous volume of data coming from thousands of species. Meanwhile, joint international efforts like the Gene Ontology (GO) consortium have worked to provide functional information for a vast amount of proteins. With these data available, we have developed FunTaxIS, a tool that is the first attempt to infer functional taxonomy (i.e. how functions are distributed over taxa) combining functional and taxonomic information. FunTaxIS is able to define a taxon specific functional space by exploiting annotation frequencies in order to establish if a function can or cannot be used to annotate a certain species. The tool generates constraints between GO terms and taxa and then propagates these relations over the taxonomic tree and the GO graph. Since these constraints nearly cover the whole taxonomy, it is possible to obtain the mapping of a function over the taxonomy. FunTaxIS can be used to make functional comparative analyses among taxa, to detect improper associations between taxa and functions, and to discover how functional knowledge is either distributed or missing. A benchmark test set based on six different model species has been devised to get useful insights on the generated taxonomic rules. PMID:27534507

  3. Protein function annotation by local binding site surface similarity.

    PubMed

    Spitzer, Russell; Cleves, Ann E; Varela, Rocco; Jain, Ajay N

    2014-04-01

    Hundreds of protein crystal structures exist for proteins whose function cannot be confidently determined from sequence similarity. Surflex-PSIM, a previously reported surface-based protein similarity algorithm, provides an alternative method for hypothesizing function for such proteins. The method now supports fully automatic binding site detection and is fast enough to screen comprehensive databases of protein binding sites. The binding site detection methodology was validated on apo/holo cognate protein pairs, correctly identifying 91% of ligand binding sites in holo structures and 88% in apo structures where corresponding sites existed. For correctly detected apo binding sites, the cognate holo site was the most similar binding site 87% of the time. PSIM was used to screen a set of proteins that had poorly characterized functions at the time of crystallization, but were later biochemically annotated. Using a fully automated protocol, this set of 8 proteins was screened against ∼60,000 ligand binding sites from the PDB. PSIM correctly identified functional matches that predated query protein biochemical annotation for five out of the eight query proteins. A panel of 12 currently unannotated proteins was also screened, resulting in a large number of statistically significant binding site matches, some of which suggest likely functions for the poorly characterized proteins.

  4. Functional annotations in bacterial genomes based on small RNA signatures.

    PubMed

    Sridhar, Jayavel; Rafi, Ziauddin Ahamed

    2008-04-04

    One of the key challenges in computational genomics is annotating coding genes and identification of regulatory RNAs in complete genomes. An attempt is made in this study which uses the regulatory RNA locations and their conserved flanking genes identified within the genomic backbone of template genome to search for similar RNA locations in query genomes. The search is based on recently reported coexistence of small RNAs and their conserved flanking genes in related genomes. Based on our study, 54 additional sRNA locations and functions of 96 uncharacterized genes are predicted in two draft genomes viz., Serratia marcesens Db1 and Yersinia enterocolitica 8081. Although most of the identified additional small RNA regions and their corresponding flanking genes are homologous in nature, the proposed anchoring technique could successfully identify four non-homologous small RNA regions in Y. enterocolitica genome also. The KEGG Orthology (KO) based automated functional predictions confirms the predicted functions of 65 flanking genes having defined KO numbers, out of the total 96 predictions made by this method. This coexistence based method shows more sensitivity than controlled vocabularies in locating orthologous gene pairs even in the absence of defined Orthology numbers. All functional predictions made by this study in Y. enterocolitica 8081 were confirmed by the recently published complete genome sequence and annotations. This study also reports the possible regions of gene rearrangements in these two genomes and further characterization of such RNA regions could shed more light on their possible role in genome evolution.

  5. Functional annotations in bacterial genomes based on small RNA signatures

    PubMed Central

    Sridhar, Jayavel; Rafi, Ziauddin Ahamed

    2008-01-01

    One of the key challenges in computational genomics is annotating coding genes and identification of regulatory RNAs in complete genomes. An attempt is made in this study which uses the regulatory RNA locations and their conserved flanking genes identified within the genomic backbone of template genome to search for similar RNA locations in query genomes. The search is based on recently reported coexistence of small RNAs and their conserved flanking genes in related genomes. Based on our study, 54 additional sRNA locations and functions of 96 uncharacterized genes are predicted in two draft genomes viz., Serratia marcesens Db1 and Yersinia enterocolitica 8081. Although most of the identified additional small RNA regions and their corresponding flanking genes are homologous in nature, the proposed anchoring technique could successfully identify four non-homologous small RNA regions in Y. enterocolitica genome also. The KEGG Orthology (KO) based automated functional predictions confirms the predicted functions of 65 flanking genes having defined KO numbers, out of the total 96 predictions made by this method. This coexistence based method shows more sensitivity than controlled vocabularies in locating orthologous gene pairs even in the absence of defined Orthology numbers. All functional predictions made by this study in Y. enterocolitica 8081 were confirmed by the recently published complete genome sequence and annotations. This study also reports the possible regions of gene rearrangements in these two genomes and further characterization of such RNA regions could shed more light on their possible role in genome evolution. PMID:18478081

  6. Functional phylogenomics analysis of bacteria and archaea using consistent genome annotation with UniFam

    SciTech Connect

    Chai, Juanjuan; Kora, Guruprasad; Ahn, Tae-Hyuk; Hyatt, Doug; Pan, Chongle

    2014-10-09

    To supply some background, phylogenetic studies have provided detailed knowledge on the evolutionary mechanisms of genes and species in Bacteria and Archaea. However, the evolution of cellular functions, represented by metabolic pathways and biological processes, has not been systematically characterized. Many clades in the prokaryotic tree of life have now been covered by sequenced genomes in GenBank. This enables a large-scale functional phylogenomics study of many computationally inferred cellular functions across all sequenced prokaryotes. Our results show a total of 14,727 GenBank prokaryotic genomes were re-annotated using a new protein family database, UniFam, to obtain consistent functional annotations for accurate comparison. The functional profile of a genome was represented by the biological process Gene Ontology (GO) terms in its annotation. The GO term enrichment analysis differentiated the functional profiles between selected archaeal taxa. 706 prokaryotic metabolic pathways were inferred from these genomes using Pathway Tools and MetaCyc. The consistency between the distribution of metabolic pathways in the genomes and the phylogenetic tree of the genomes was measured using parsimony scores and retention indices. The ancestral functional profiles at the internal nodes of the phylogenetic tree were reconstructed to track the gains and losses of metabolic pathways in evolutionary history. In conclusion, our functional phylogenomics analysis shows divergent functional profiles of taxa and clades. Such function-phylogeny correlation stems from a set of clade-specific cellular functions with low parsimony scores. On the other hand, many cellular functions are sparsely dispersed across many clades with high parsimony scores. These different types of cellular functions have distinct evolutionary patterns reconstructed from the prokaryotic tree.

  7. Functional Annotation of Putative Regulatory Elements at Cancer Susceptibility Loci

    PubMed Central

    Rosse, Stephanie A; Auer, Paul L; Carlson, Christopher S

    2014-01-01

    Most cancer-associated genetic variants identified from genome-wide association studies (GWAS) do not obviously change protein structure, leading to the hypothesis that the associations are attributable to regulatory polymorphisms. Translating genetic associations into mechanistic insights can be facilitated by knowledge of the causal regulatory variant (or variants) responsible for the statistical signal. Experimental validation of candidate functional variants is onerous, making bioinformatic approaches necessary to prioritize candidates for laboratory analysis. Thus, a systematic approach for recognizing functional (and, therefore, likely causal) variants in noncoding regions is an important step toward interpreting cancer risk loci. This review provides a detailed introduction to current regulatory variant annotations, followed by an overview of how to leverage these resources to prioritize candidate functional polymorphisms in regulatory regions. PMID:25288875

  8. Functional annotation of rare gene aberration drivers of pancreatic cancer.

    PubMed

    Tsang, Yiu Huen; Dogruluk, Turgut; Tedeschi, Philip M; Wardwell-Ozgo, Joanna; Lu, Hengyu; Espitia, Maribel; Nair, Nikitha; Minelli, Rosalba; Chong, Zechen; Chen, Fengju; Chang, Qing Edward; Dennison, Jennifer B; Dogruluk, Armel; Li, Min; Ying, Haoqiang; Bertino, Joseph R; Gingras, Marie-Claude; Ittmann, Michael; Kerrigan, John; Chen, Ken; Creighton, Chad J; Eterovic, Karina; Mills, Gordon B; Scott, Kenneth L

    2016-01-01

    As we enter the era of precision medicine, characterization of cancer genomes will directly influence therapeutic decisions in the clinic. Here we describe a platform enabling functionalization of rare gene mutations through their high-throughput construction, molecular barcoding and delivery to cancer models for in vivo tumour driver screens. We apply these technologies to identify oncogenic drivers of pancreatic ductal adenocarcinoma (PDAC). This approach reveals oncogenic activity for rare gene aberrations in genes including NAD Kinase (NADK), which regulates NADP(H) homeostasis and cellular redox state. We further validate mutant NADK, whose expression provides gain-of-function enzymatic activity leading to a reduction in cellular reactive oxygen species and tumorigenesis, and show that depletion of wild-type NADK in PDAC cell lines attenuates cancer cell growth in vitro and in vivo. These data indicate that annotating rare aberrations can reveal important cancer signalling pathways representing additional therapeutic targets. PMID:26806015

  9. Functional annotation of rare gene aberration drivers of pancreatic cancer

    PubMed Central

    Tsang, Yiu Huen; Dogruluk, Turgut; Tedeschi, Philip M.; Wardwell-Ozgo, Joanna; Lu, Hengyu; Espitia, Maribel; Nair, Nikitha; Minelli, Rosalba; Chong, Zechen; Chen, Fengju; Chang, Qing Edward; Dennison, Jennifer B.; Dogruluk, Armel; Li, Min; Ying, Haoqiang; Bertino, Joseph R.; Gingras, Marie-Claude; Ittmann, Michael; Kerrigan, John; Chen, Ken; Creighton, Chad J.; Eterovic, Karina; Mills, Gordon B.; Scott, Kenneth L.

    2016-01-01

    As we enter the era of precision medicine, characterization of cancer genomes will directly influence therapeutic decisions in the clinic. Here we describe a platform enabling functionalization of rare gene mutations through their high-throughput construction, molecular barcoding and delivery to cancer models for in vivo tumour driver screens. We apply these technologies to identify oncogenic drivers of pancreatic ductal adenocarcinoma (PDAC). This approach reveals oncogenic activity for rare gene aberrations in genes including NAD Kinase (NADK), which regulates NADP(H) homeostasis and cellular redox state. We further validate mutant NADK, whose expression provides gain-of-function enzymatic activity leading to a reduction in cellular reactive oxygen species and tumorigenesis, and show that depletion of wild-type NADK in PDAC cell lines attenuates cancer cell growth in vitro and in vivo. These data indicate that annotating rare aberrations can reveal important cancer signalling pathways representing additional therapeutic targets. PMID:26806015

  10. Functional classification of CATH superfamilies: a domain-based approach for protein function annotation

    PubMed Central

    Das, Sayoni; Lee, David; Sillitoe, Ian; Dawson, Natalie L.; Lees, Jonathan G.; Orengo, Christine A.

    2015-01-01

    Motivation: Computational approaches that can predict protein functions are essential to bridge the widening function annotation gap especially since <1.0% of all proteins in UniProtKB have been experimentally characterized. We present a domain-based method for protein function classification and prediction of functional sites that exploits functional sub-classification of CATH superfamilies. The superfamilies are sub-classified into functional families (FunFams) using a hierarchical clustering algorithm supervised by a new classification method, FunFHMMer. Results: FunFHMMer generates more functionally coherent groupings of protein sequences than other domain-based protein classifications. This has been validated using known functional information. The conserved positions predicted by the FunFams are also found to be enriched in known functional residues. Moreover, the functional annotations provided by the FunFams are found to be more precise than other domain-based resources. FunFHMMer currently identifies 110 439 FunFams in 2735 superfamilies which can be used to functionally annotate > 16 million domain sequences. Availability and implementation: All FunFam annotation data are made available through the CATH webpages (http://www.cathdb.info). The FunFHMMer webserver (http://www.cathdb.info/search/by_funfhmmer) allows users to submit query sequences for assignment to a CATH FunFam. Contact: sayoni.das.12@ucl.ac.uk Supplementary information: Supplementary data are available at Bioinformatics online. PMID:26139634

  11. SIFTER search: a web server for accurate phylogeny-based protein function prediction.

    PubMed

    Sahraeian, Sayed M; Luo, Kevin R; Brenner, Steven E

    2015-07-01

    We are awash in proteins discovered through high-throughput sequencing projects. As only a minuscule fraction of these have been experimentally characterized, computational methods are widely used for automated annotation. Here, we introduce a user-friendly web interface for accurate protein function prediction using the SIFTER algorithm. SIFTER is a state-of-the-art sequence-based gene molecular function prediction algorithm that uses a statistical model of function evolution to incorporate annotations throughout the phylogenetic tree. Due to the resources needed by the SIFTER algorithm, running SIFTER locally is not trivial for most users, especially for large-scale problems. The SIFTER web server thus provides access to precomputed predictions on 16 863 537 proteins from 232 403 species. Users can explore SIFTER predictions with queries for proteins, species, functions, and homologs of sequences not in the precomputed prediction set. The SIFTER web server is accessible at http://sifter.berkeley.edu/ and the source code can be downloaded.

  12. CATH FunFHMMer web server: protein functional annotations using functional family assignments.

    PubMed

    Das, Sayoni; Sillitoe, Ian; Lee, David; Lees, Jonathan G; Dawson, Natalie L; Ward, John; Orengo, Christine A

    2015-07-01

    The widening function annotation gap in protein databases and the increasing number and diversity of the proteins being sequenced presents new challenges to protein function prediction methods. Multidomain proteins complicate the protein sequence-structure-function relationship further as new combinations of domains can expand the functional repertoire, creating new proteins and functions. Here, we present the FunFHMMer web server, which provides Gene Ontology (GO) annotations for query protein sequences based on the functional classification of the domain-based CATH-Gene3D resource. Our server also provides valuable information for the prediction of functional sites. The predictive power of FunFHMMer has been validated on a set of 95 proteins where FunFHMMer performs better than BLAST, Pfam and CDD. Recent validation by an independent international competition ranks FunFHMMer as one of the top function prediction methods in predicting GO annotations for both the Biological Process and Molecular Function Ontology. The FunFHMMer web server is available at http://www.cathdb.info/search/by_funfhmmer.

  13. CATH FunFHMMer web server: protein functional annotations using functional family assignments

    PubMed Central

    Das, Sayoni; Sillitoe, Ian; Lee, David; Lees, Jonathan G.; Dawson, Natalie L.; Ward, John; Orengo, Christine A.

    2015-01-01

    The widening function annotation gap in protein databases and the increasing number and diversity of the proteins being sequenced presents new challenges to protein function prediction methods. Multidomain proteins complicate the protein sequence–structure–function relationship further as new combinations of domains can expand the functional repertoire, creating new proteins and functions. Here, we present the FunFHMMer web server, which provides Gene Ontology (GO) annotations for query protein sequences based on the functional classification of the domain-based CATH-Gene3D resource. Our server also provides valuable information for the prediction of functional sites. The predictive power of FunFHMMer has been validated on a set of 95 proteins where FunFHMMer performs better than BLAST, Pfam and CDD. Recent validation by an independent international competition ranks FunFHMMer as one of the top function prediction methods in predicting GO annotations for both the Biological Process and Molecular Function Ontology. The FunFHMMer web server is available at http://www.cathdb.info/search/by_funfhmmer. PMID:25964299

  14. Functional annotation of introns in mitochondrial genome--a brief review.

    PubMed

    Anandakumar, Shanmugam; Ravindran, Suda Parimala; Shanmughavel, Piramanayagam

    2016-01-01

    The present study is to decipher the non-coding regions present in mitochondrial genomes that cause diseases in humans and predict their functional roles through comparative genomics approach followed by functional annotation of these segments.

  15. A SPECTRAL APPROACH INTEGRATING FUNCTIONAL GENOMIC ANNOTATIONS FOR CODING AND NONCODING VARIANTS

    PubMed Central

    IONITA-LAZA, IULIANA; MCCALLUM, KENNETH; XU, BIN; BUXBAUM, JOSEPH

    2015-01-01

    Over the past few years, substantial effort has been put into the functional annotation of variation in human genome sequence. Such annotations can play a critical role in identifying putatively causal variants among the abundant natural variation that occurs at a locus of interest. The main challenges in using these various annotations include their large numbers, and their diversity. Here we develop an unsupervised approach to integrate these different annotations into one measure of functional importance (Eigen), that, unlike most existing methods, is not based on any labeled training data. We show that the resulting meta-score has better discriminatory ability using disease associated and putatively benign variants from published studies (in both coding and noncoding regions) compared with the recently proposed CADD score. Across varied scenarios, the Eigen score performs generally better than any single individual annotation, representing a powerful single functional score that can be incorporated in fine-mapping studies. PMID:26727659

  16. Neurolinguistic Annotated Bibliography (Brain Research and Language Function) with Implications for Education.

    ERIC Educational Resources Information Center

    Davis, Wesley K.

    This bibliography presents annotations of 91 journal articles, books, chapters in books, and conference papers dating from 1967 to 1984 concerning neurolinguistics, language processing, and educational implications of brain research. The annotated bibliography includes eight items on neuroanatomy and language function; 20 items on neurolinguistics…

  17. GO-FAANG meeting: A gathering on functional annotation of animal genomes

    Technology Transfer Automated Retrieval System (TEKTRAN)

    The FAANG (Functional Annotation of Animal Genomes) Consortium recently held a Gathering On FAANG (GO-FAANG) Workshop in Washington, DC on October 7-8, 2015. This consortium is a grass-roots organization formed to advance the annotation of newly assembled genomes of non-model organisms (www.faang.or...

  18. Identifying functionally important cis-peptide containing segments in proteins and their utility in molecular function annotation.

    PubMed

    Das, Sreetama; Ramakumar, Suryanarayanarao; Pal, Debnath

    2014-12-01

    Cis-peptide embedded segments are rare in proteins but often highlight their important role in molecular function when they do occur. The high evolutionary conservation of these segments illustrates this observation almost universally, although no attempt has been made to systematically use this information for the purpose of function annotation. In the present study, we demonstrate how geometric clustering and level-specific Gene Ontology molecular-function terms (also known as annotations) can be used in a statistically significant manner to identify cis-embedded segments in a protein linked to its molecular function. The present study identifies novel cis-peptide fragments, which are subsequently used for fragment-based function annotation. Annotation recall benchmarks interpreted using the receiver-operator characteristic plot returned an area-under-curve > 0.9, corroborating the utility of the annotation method. In addition, we identified cis-peptide fragments occurring in conjunction with functionally important trans-peptide fragments, providing additional insights into molecular function. We further illustrate the applicability of our method in function annotation where homology-based annotation transfer is not possible. The findings of the present study add to the repertoire of function annotation approaches and also facilitate engineering, design and allied studies around the cis-peptide neighborhood of proteins.

  19. Interpreting functional effects of coding variants: challenges in proteome-scale prediction, annotation and assessment.

    PubMed

    Shameer, Khader; Tripathi, Lokesh P; Kalari, Krishna R; Dudley, Joel T; Sowdhamini, Ramanathan

    2016-09-01

    Accurate assessment of genetic variation in human DNA sequencing studies remains a nontrivial challenge in clinical genomics and genome informatics. Ascribing functional roles and/or clinical significances to single nucleotide variants identified from a next-generation sequencing study is an important step in genome interpretation. Experimental characterization of all the observed functional variants is yet impractical; thus, the prediction of functional and/or regulatory impacts of the various mutations using in silico approaches is an important step toward the identification of functionally significant or clinically actionable variants. The relationships between genotypes and the expressed phenotypes are multilayered and biologically complex; such relationships present numerous challenges and at the same time offer various opportunities for the design of in silico variant assessment strategies. Over the past decade, many bioinformatics algorithms have been developed to predict functional consequences of single nucleotide variants in the protein coding regions. In this review, we provide an overview of the bioinformatics resources for the prediction, annotation and visualization of coding single nucleotide variants. We discuss the currently available approaches and major challenges from the perspective of protein sequence, structure, function and interactions that require consideration when interpreting the impact of putatively functional variants. We also discuss the relevance of incorporating integrated workflows for predicting the biomedical impact of the functionally important variations encoded in a genome, exome or transcriptome. Finally, we propose a framework to classify variant assessment approaches and strategies for incorporation of variant assessment within electronic health records.

  20. Method for the Compound Annotation of Conjugates in Nontargeted Metabolomics Using Accurate Mass Spectrometry, Multistage Product Ion Spectra and Compound Database Searching.

    PubMed

    Ogura, Tairo; Bamba, Takeshi; Tai, Akihiro; Fukusaki, Eiichiro

    2015-01-01

    Owing to biotransformation, xenobiotics are often found in conjugated form in biological samples such as urine and plasma. Liquid chromatography coupled with accurate mass spectrometry with multistage collision-induced dissociation provides spectral information concerning these metabolites in complex materials. Unfortunately, compound databases typically do not contain a sufficient number of records for such conjugates. We report here on the development of a novel protocol, referred to as ChemProphet, to annotate compounds, including conjugates, using compound databases such as PubChem and ChemSpider. The annotation of conjugates involves three steps: 1. Recognition of the type and number of conjugates in the sample; 2. Compound search and annotation of the deconjugated form; and 3. In silico evaluation of the candidate conjugate. ChemProphet assigns a spectrum to each candidate by automatically exploring the substructures corresponding to the observed product ion spectrum. When finished, it annotates the candidates assigning a rank for each candidate based on the calculated score that ranks its relative likelihood. We assessed our protocol by annotating a benchmark dataset by including the product ion spectra for 102 compounds, annotating the commercially available standard for quercetin 3-glucuronide, and by conducting a model experiment using urine from mice that had been administered a green tea extract. The results show that by using the ChemProphet approach, it is possible to annotate not only the deconjugated molecules but also the conjugated molecules using an automatic interpretation method based on deconjugation that involves multistage collision-induced dissociation and in silico calculated conjugation.

  1. Annotating the Function of the Human Genome with Gene Ontology and Disease Ontology.

    PubMed

    Hu, Yang; Zhou, Wenyang; Ren, Jun; Dong, Lixiang; Wang, Yadong; Jin, Shuilin; Cheng, Liang

    2016-01-01

    Increasing evidences indicated that function annotation of human genome in molecular level and phenotype level is very important for systematic analysis of genes. In this study, we presented a framework named Gene2Function to annotate Gene Reference into Functions (GeneRIFs), in which each functional description of GeneRIFs could be annotated by a text mining tool Open Biomedical Annotator (OBA), and each Entrez gene could be mapped to Human Genome Organisation Gene Nomenclature Committee (HGNC) gene symbol. After annotating all the records about human genes of GeneRIFs, 288,869 associations between 13,148 mRNAs and 7,182 terms, 9,496 associations between 948 microRNAs and 533 terms, and 901 associations between 139 long noncoding RNAs (lncRNAs) and 297 terms were obtained as a comprehensive annotation resource of human genome. High consistency of term frequency of individual gene (Pearson correlation = 0.6401, p = 2.2e - 16) and gene frequency of individual term (Pearson correlation = 0.1298, p = 3.686e - 14) in GeneRIFs and GOA shows our annotation resource is very reliable. PMID:27635398

  2. Annotating the Function of the Human Genome with Gene Ontology and Disease Ontology

    PubMed Central

    Hu, Yang; Zhou, Wenyang; Ren, Jun; Dong, Lixiang

    2016-01-01

    Increasing evidences indicated that function annotation of human genome in molecular level and phenotype level is very important for systematic analysis of genes. In this study, we presented a framework named Gene2Function to annotate Gene Reference into Functions (GeneRIFs), in which each functional description of GeneRIFs could be annotated by a text mining tool Open Biomedical Annotator (OBA), and each Entrez gene could be mapped to Human Genome Organisation Gene Nomenclature Committee (HGNC) gene symbol. After annotating all the records about human genes of GeneRIFs, 288,869 associations between 13,148 mRNAs and 7,182 terms, 9,496 associations between 948 microRNAs and 533 terms, and 901 associations between 139 long noncoding RNAs (lncRNAs) and 297 terms were obtained as a comprehensive annotation resource of human genome. High consistency of term frequency of individual gene (Pearson correlation = 0.6401, p = 2.2e − 16) and gene frequency of individual term (Pearson correlation = 0.1298, p = 3.686e − 14) in GeneRIFs and GOA shows our annotation resource is very reliable. PMID:27635398

  3. Annotating the Function of the Human Genome with Gene Ontology and Disease Ontology

    PubMed Central

    Hu, Yang; Zhou, Wenyang; Ren, Jun; Dong, Lixiang

    2016-01-01

    Increasing evidences indicated that function annotation of human genome in molecular level and phenotype level is very important for systematic analysis of genes. In this study, we presented a framework named Gene2Function to annotate Gene Reference into Functions (GeneRIFs), in which each functional description of GeneRIFs could be annotated by a text mining tool Open Biomedical Annotator (OBA), and each Entrez gene could be mapped to Human Genome Organisation Gene Nomenclature Committee (HGNC) gene symbol. After annotating all the records about human genes of GeneRIFs, 288,869 associations between 13,148 mRNAs and 7,182 terms, 9,496 associations between 948 microRNAs and 533 terms, and 901 associations between 139 long noncoding RNAs (lncRNAs) and 297 terms were obtained as a comprehensive annotation resource of human genome. High consistency of term frequency of individual gene (Pearson correlation = 0.6401, p = 2.2e − 16) and gene frequency of individual term (Pearson correlation = 0.1298, p = 3.686e − 14) in GeneRIFs and GOA shows our annotation resource is very reliable.

  4. Measuring semantic similarities by combining gene ontology annotations and gene co-function networks

    SciTech Connect

    Peng, Jiajie; Uygun, Sahra; Kim, Taehyong; Wang, Yadong; Rhee, Seung Y.; Chen, Jin

    2015-02-14

    Background: Gene Ontology (GO) has been used widely to study functional relationships between genes. The current semantic similarity measures rely only on GO annotations and GO structure. This limits the power of GO-based similarity because of the limited proportion of genes that are annotated to GO in most organisms. Results: We introduce a novel approach called NETSIM (network-based similarity measure) that incorporates information from gene co-function networks in addition to using the GO structure and annotations. Using metabolic reaction maps of yeast, Arabidopsis, and human, we demonstrate that NETSIM can improve the accuracy of GO term similarities. We also demonstrate that NETSIM works well even for genomes with sparser gene annotation data. We applied NETSIM on large Arabidopsis gene families such as cytochrome P450 monooxygenases to group the members functionally and show that this grouping could facilitate functional characterization of genes in these families. Conclusions: Using NETSIM as an example, we demonstrated that the performance of a semantic similarity measure could be significantly improved after incorporating genome-specific information. NETSIM incorporates both GO annotations and gene co-function network data as a priori knowledge in the model. Therefore, functional similarities of GO terms that are not explicitly encoded in GO but are relevant in a taxon-specific manner become measurable when GO annotations are limited.

  5. Measuring semantic similarities by combining gene ontology annotations and gene co-function networks

    DOE PAGES

    Peng, Jiajie; Uygun, Sahra; Kim, Taehyong; Wang, Yadong; Rhee, Seung Y.; Chen, Jin

    2015-02-14

    Background: Gene Ontology (GO) has been used widely to study functional relationships between genes. The current semantic similarity measures rely only on GO annotations and GO structure. This limits the power of GO-based similarity because of the limited proportion of genes that are annotated to GO in most organisms. Results: We introduce a novel approach called NETSIM (network-based similarity measure) that incorporates information from gene co-function networks in addition to using the GO structure and annotations. Using metabolic reaction maps of yeast, Arabidopsis, and human, we demonstrate that NETSIM can improve the accuracy of GO term similarities. We also demonstratemore » that NETSIM works well even for genomes with sparser gene annotation data. We applied NETSIM on large Arabidopsis gene families such as cytochrome P450 monooxygenases to group the members functionally and show that this grouping could facilitate functional characterization of genes in these families. Conclusions: Using NETSIM as an example, we demonstrated that the performance of a semantic similarity measure could be significantly improved after incorporating genome-specific information. NETSIM incorporates both GO annotations and gene co-function network data as a priori knowledge in the model. Therefore, functional similarities of GO terms that are not explicitly encoded in GO but are relevant in a taxon-specific manner become measurable when GO annotations are limited.« less

  6. New in protein structure and function annotation: hotspots, single nucleotide polymorphisms and the 'Deep Web'.

    PubMed

    Bromberg, Yana; Yachdav, Guy; Ofran, Yanay; Schneider, Reinhard; Rost, Burkhard

    2009-05-01

    The rapidly increasing quantity of protein sequence data continues to widen the gap between available sequences and annotations. Comparative modeling suggests some aspects of the 3D structures of approximately half of all known proteins; homology- and network-based inferences annotate some aspect of function for a similar fraction of the proteome. For most known protein sequences, however, there is detailed knowledge about neither their function nor their structure. Comprehensive efforts towards the expert curation of sequence annotations have failed to meet the demand of the rapidly increasing number of available sequences. Only the automated prediction of protein function in the absence of homology can close the gap between available sequences and annotations in the foreseeable future. This review focuses on two novel methods for automated annotation, and briefly presents an outlook on how modern web software may revolutionize the field of protein sequence annotation. First, predictions of protein binding sites and functional hotspots, and the evolution of these into the most successful type of prediction of protein function from sequence will be discussed. Second, a new tool, comprehensive in silico mutagenesis, which contributes important novel predictions of function and at the same time prepares for the onset of the next sequencing revolution, will be described. While these two new sub-fields of protein prediction represent the breakthroughs that have been achieved methodologically, it will then be argued that a different development might further change the way biomedical researchers benefit from annotations: modern web software can connect the worldwide web in any browser with the 'Deep Web' (ie, proprietary data resources). The availability of this direct connection, and the resulting access to a wealth of data, may impact drug discovery and development more than any existing method that contributes to protein annotation.

  7. Mercator: a fast and simple web server for genome scale functional annotation of plant sequence data.

    PubMed

    Lohse, Marc; Nagel, Axel; Herter, Thomas; May, Patrick; Schroda, Michael; Zrenner, Rita; Tohge, Takayuki; Fernie, Alisdair R; Stitt, Mark; Usadel, Björn

    2014-05-01

    Next-generation technologies generate an overwhelming amount of gene sequence data. Efficient annotation tools are required to make these data amenable to functional genomics analyses. The Mercator pipeline automatically assigns functional terms to protein or nucleotide sequences. It uses the MapMan 'BIN' ontology, which is tailored for functional annotation of plant 'omics' data. The classification procedure performs parallel sequence searches against reference databases, compiles the results and computes the most likely MapMan BINs for each query. In the current version, the pipeline relies on manually curated reference classifications originating from the three reference organisms (Arabidopsis, Chlamydomonas, rice), various other plant species that have a reviewed SwissProt annotation, and more than 2000 protein domain and family profiles at InterPro, CDD and KOG. Functional annotations predicted by Mercator achieve accuracies above 90% when benchmarked against manual annotation. In addition to mapping files for direct use in the visualization software MapMan, Mercator provides graphical overview charts, detailed annotation information in a convenient web browser interface and a MapMan-to-GO translation table to export results as GO terms. Mercator is available free of charge via http://mapman.gabipd.org/web/guest/app/Mercator.

  8. Expression profiling of hypothetical genes in Desulfovibrio vulgaris leads to improved functional annotation

    SciTech Connect

    Elias, Dwayne A.; Mukhopadhyay, Aindrila; Joachimiak, Marcin P.; Drury, Elliott C.; Redding, Alyssa M.; Yen, Huei-Che B.; Fields, Matthew W.; Hazen, Terry C.; Arkin, Adam P.; Keasling, Jay D.; Wall, Judy D.

    2008-10-27

    Hypothetical and conserved hypothetical genes account for>30percent of sequenced bacterial genomes. For the sulfate-reducing bacterium Desulfovibrio vulgaris Hildenborough, 347 of the 3634 genes were annotated as conserved hypothetical (9.5percent) along with 887 hypothetical genes (24.4percent). Given the large fraction of the genome, it is plausible that some of these genes serve critical cellular roles. The study goals were to determine which genes were expressed and provide a more functionally based annotation. To accomplish this, expression profiles of 1234 hypothetical and conserved genes were used from transcriptomic datasets of 11 environmental stresses, complemented with shotgun LC-MS/MS and AMT tag proteomic data. Genes were divided into putatively polycistronic operons and those predicted to be monocistronic, then classified by basal expression levels and grouped according to changes in expression for one or multiple stresses. 1212 of these genes were transcribed with 786 producing detectable proteins. There was no evidence for expression of 17 predicted genes. Except for the latter, monocistronic gene annotation was expanded using the above criteria along with matching Clusters of Orthologous Groups. Polycistronic genes were annotated in the same manner with inferences from their proximity to more confidently annotated genes. Two targeted deletion mutants were used as test cases to determine the relevance of the inferred functional annotations.

  9. SIFTER search: a web server for accurate phylogeny-based protein function prediction

    DOE PAGES

    Sahraeian, Sayed M.; Luo, Kevin R.; Brenner, Steven E.

    2015-05-15

    We are awash in proteins discovered through high-throughput sequencing projects. As only a minuscule fraction of these have been experimentally characterized, computational methods are widely used for automated annotation. Here, we introduce a user-friendly web interface for accurate protein function prediction using the SIFTER algorithm. SIFTER is a state-of-the-art sequence-based gene molecular function prediction algorithm that uses a statistical model of function evolution to incorporate annotations throughout the phylogenetic tree. Due to the resources needed by the SIFTER algorithm, running SIFTER locally is not trivial for most users, especially for large-scale problems. The SIFTER web server thus provides access tomore » precomputed predictions on 16 863 537 proteins from 232 403 species. Users can explore SIFTER predictions with queries for proteins, species, functions, and homologs of sequences not in the precomputed prediction set. Lastly, the SIFTER web server is accessible at http://sifter.berkeley.edu/ and the source code can be downloaded.« less

  10. SIFTER search: a web server for accurate phylogeny-based protein function prediction.

    PubMed

    Sahraeian, Sayed M; Luo, Kevin R; Brenner, Steven E

    2015-07-01

    We are awash in proteins discovered through high-throughput sequencing projects. As only a minuscule fraction of these have been experimentally characterized, computational methods are widely used for automated annotation. Here, we introduce a user-friendly web interface for accurate protein function prediction using the SIFTER algorithm. SIFTER is a state-of-the-art sequence-based gene molecular function prediction algorithm that uses a statistical model of function evolution to incorporate annotations throughout the phylogenetic tree. Due to the resources needed by the SIFTER algorithm, running SIFTER locally is not trivial for most users, especially for large-scale problems. The SIFTER web server thus provides access to precomputed predictions on 16 863 537 proteins from 232 403 species. Users can explore SIFTER predictions with queries for proteins, species, functions, and homologs of sequences not in the precomputed prediction set. The SIFTER web server is accessible at http://sifter.berkeley.edu/ and the source code can be downloaded. PMID:25979264

  11. SIFTER search: a web server for accurate phylogeny-based protein function prediction

    SciTech Connect

    Sahraeian, Sayed M.; Luo, Kevin R.; Brenner, Steven E.

    2015-05-15

    We are awash in proteins discovered through high-throughput sequencing projects. As only a minuscule fraction of these have been experimentally characterized, computational methods are widely used for automated annotation. Here, we introduce a user-friendly web interface for accurate protein function prediction using the SIFTER algorithm. SIFTER is a state-of-the-art sequence-based gene molecular function prediction algorithm that uses a statistical model of function evolution to incorporate annotations throughout the phylogenetic tree. Due to the resources needed by the SIFTER algorithm, running SIFTER locally is not trivial for most users, especially for large-scale problems. The SIFTER web server thus provides access to precomputed predictions on 16 863 537 proteins from 232 403 species. Users can explore SIFTER predictions with queries for proteins, species, functions, and homologs of sequences not in the precomputed prediction set. Lastly, the SIFTER web server is accessible at http://sifter.berkeley.edu/ and the source code can be downloaded.

  12. What's Normal? Accurately and Efficiently Assessing Menstrual Function.

    PubMed

    Takemoto, Darcie M; Beharry, Meera S

    2015-09-01

    Many young women are unsure of what constitutes normal menses. By asking focused questions, pediatric providers can quickly and accurately assess menstrual function and dispel anxiety and myths. In this article, we review signs and symptoms of normal versus pathologic menstrual functioning and provide suggestions to improve menstrual history taking.

  13. Assessing functional annotation transfers with inter-species conserved coexpression: application to Plasmodium falciparum

    PubMed Central

    2010-01-01

    Background Plasmodium falciparum is the main causative agent of malaria. Of the 5 484 predicted genes of P. falciparum, about 57% do not have sufficient sequence similarity to characterized genes in other species to warrant functional assignments. Non-homology methods are thus needed to obtain functional clues for these uncharacterized genes. Gene expression data have been widely used in the recent years to help functional annotation in an intra-species way via the so-called Guilt By Association (GBA) principle. Results We propose a new method that uses gene expression data to assess inter-species annotation transfers. Our approach starts from a set of likely orthologs between a reference species (here S. cerevisiae and D. melanogaster) and a query species (P. falciparum). It aims at identifying clusters of coexpressed genes in the query species whose coexpression has been conserved in the reference species. These conserved clusters of coexpressed genes are then used to assess annotation transfers between genes with low sequence similarity, enabling reliable transfers of annotations from the reference to the query species. The approach was used with transcriptomic data sets of P. falciparum, S. cerevisiae and D. melanogaster, and enabled us to propose with high confidence new/refined annotations for several dozens hypothetical/putative P. falciparum genes. Notably, we revised the annotation of genes involved in ribosomal proteins and ribosome biogenesis and assembly, thus highlighting several potential drug targets. Conclusions Our approach uses both sequence similarity and gene expression data to help inter-species gene annotation transfers. Experiments show that this strategy improves the accuracy achieved when using solely sequence similarity and outperforms the accuracy of the GBA approach. In addition, our experiments with P. falciparum show that it can infer a function for numerous hypothetical genes. PMID:20078859

  14. A statistical framework to predict functional non-coding regions in the human genome through integrated analysis of annotation data.

    PubMed

    Lu, Qiongshi; Hu, Yiming; Sun, Jiehuan; Cheng, Yuwei; Cheung, Kei-Hoi; Zhao, Hongyu

    2015-05-27

    Identifying functional regions in the human genome is a major goal in human genetics. Great efforts have been made to functionally annotate the human genome either through computational predictions, such as genomic conservation, or high-throughput experiments, such as the ENCODE project. These efforts have resulted in a rich collection of functional annotation data of diverse types that need to be jointly analyzed for integrated interpretation and annotation. Here we present GenoCanyon, a whole-genome annotation method that performs unsupervised statistical learning using 22 computational and experimental annotations thereby inferring the functional potential of each position in the human genome. With GenoCanyon, we are able to predict many of the known functional regions. The ability of predicting functional regions as well as its generalizable statistical framework makes GenoCanyon a unique and powerful tool for whole-genome annotation. The GenoCanyon web server is available at http://genocanyon.med.yale.edu.

  15. BioBuilder as a database development and functional annotation platform for proteins

    PubMed Central

    Navarro, J Daniel; Talreja, Naveen; Peri, Suraj; Vrushabendra, BM; Rashmi, BP; Padma, N; Surendranath, Vineeth; Jonnalagadda, Chandra Kiran; Kousthub, PS; Deshpande, Nandan; Shanker, K; Pandey, Akhilesh

    2004-01-01

    Background The explosion in biological information creates the need for databases that are easy to develop, easy to maintain and can be easily manipulated by annotators who are most likely to be biologists. However, deployment of scalable and extensible databases is not an easy task and generally requires substantial expertise in database development. Results BioBuilder is a Zope-based software tool that was developed to facilitate intuitive creation of protein databases. Protein data can be entered and annotated through web forms along with the flexibility to add customized annotation features to protein entries. A built-in review system permits a global team of scientists to coordinate their annotation efforts. We have already used BioBuilder to develop Human Protein Reference Database , a comprehensive annotated repository of the human proteome. The data can be exported in the extensible markup language (XML) format, which is rapidly becoming as the standard format for data exchange. Conclusions As the proteomic data for several organisms begins to accumulate, BioBuilder will prove to be an invaluable platform for functional annotation and development of customizable protein centric databases. BioBuilder is open source and is available under the terms of LGPL. PMID:15099404

  16. Global profiling of Shewanella oneidensis MR-1: expression of hypothetical genes and improved functional annotations.

    PubMed

    Kolker, Eugene; Picone, Alex F; Galperin, Michael Y; Romine, Margaret F; Higdon, Roger; Makarova, Kira S; Kolker, Natali; Anderson, Gordon A; Qiu, Xiaoyun; Auberry, Kenneth J; Babnigg, Gyorgy; Beliaev, Alex S; Edlefsen, Paul; Elias, Dwayne A; Gorby, Yuri A; Holzman, Ted; Klappenbach, Joel A; Konstantinidis, Konstantinos T; Land, Miriam L; Lipton, Mary S; McCue, Lee-Ann; Monroe, Matthew; Pasa-Tolic, Ljiljana; Pinchuk, Grigoriy; Purvine, Samuel; Serres, Margrethe H; Tsapin, Sasha; Zakrajsek, Brian A; Zhu, Wenhong; Zhou, Jizhong; Larimer, Frank W; Lawrence, Charles E; Riley, Monica; Collart, Frank R; Yates, John R; Smith, Richard D; Giometti, Carol S; Nealson, Kenneth H; Fredrickson, James K; Tiedje, James M

    2005-02-01

    The gamma-proteobacterium Shewanella oneidensis strain MR-1 is a metabolically versatile organism that can reduce a wide range of organic compounds, metal ions, and radionuclides. Similar to most other sequenced organisms, approximately 40% of the predicted ORFs in the S. oneidensis genome were annotated as uncharacterized "hypothetical" genes. We implemented an integrative approach by using experimental and computational analyses to provide more detailed insight into gene function. Global expression profiles were determined for cells after UV irradiation and under aerobic and suboxic growth conditions. Transcriptomic and proteomic analyses confidently identified 538 hypothetical genes as expressed in S. oneidensis cells both as mRNAs and proteins (33% of all predicted hypothetical proteins). Publicly available analysis tools and databases and the expression data were applied to improve the annotation of these genes. The annotation results were scored by using a seven-category schema that ranked both confidence and precision of the functional assignment. We were able to identify homologs for nearly all of these hypothetical proteins (97%), but could confidently assign exact biochemical functions for only 16 proteins (category 1; 3%). Altogether, computational and experimental evidence provided functional assignments or insights for 240 more genes (categories 2-5; 45%). These functional annotations advance our understanding of genes involved in vital cellular processes, including energy conversion, ion transport, secondary metabolism, and signal transduction. We propose that this integrative approach offers a valuable means to undertake the enormous challenge of characterizing the rapidly growing number of hypothetical proteins with each newly sequenced genome. PMID:15684069

  17. Global profiling of Shewanella oneidensis MR-1: Expression of hypothetical genes and improved functional annotations

    SciTech Connect

    Picone, Alex F.; Galperin, Michael Y.; Romine, Margaret; Higdon, Roger; Makarova, Kira S.; Kolker, Natali; Anderson, Gordon A; Qiu, Xiaoyun; Babnigg, Gyorgy; Beliaev, Alexander S; Edlefsen, Paul; Elias, Dwayne A.; Gorby, Dr. Yuri A.; Holzman, Ted; Klappenbach, Joel; Konstantinidis, Konstantinos T; Land, Miriam L; Lipton, Mary S.; McCue, Lee Ann; Monroe, Matthew; Pasa-Tolic, Ljiljana; Pinchuk, Grigoriy; Purvine, Samuel; Serres, Margrethe H.; Tsapin, Sasha; Zakrajsek, Brian A.; Zhu, Wenguang; Zhou, Jizhong; Larimer, Frank W; Lawrence, Charles E.; Riley, Monica; Collart, Frank; YatesIII, John R.; Smith, Richard D.; Nealson, Kenneth H.; Fredrickson, James K; Tiedje, James M.

    2005-01-01

    The gamma-proteobacterium Shewanella oneidensis strain MR-1 is a metabolically versatile organism that can reduce a wide range of organic compounds, metal ions, and radionuclides. Similar to most other sequenced organisms, approximate to40% of the predicted ORFs in the S. oneidensis genome were annotated as uncharacterized "hypothetical" genes. We implemented an integrative approach by using experimental and computational analyses to provide more detailed insight into gene function. Global expression profiles were determined for cells after UV irradiation and under aerobic and suboxic growth conditions. Transcriptomic and proteomic analyses confidently identified 538 hypothetical genes as expressed in S. oneidensis cells both as mRNAs and proteins (33% of all predicted hypothetical proteins). Publicly available analysis tools and databases and the expression data were applied to improve the annotation of these genes. The annotation results were scored by using a seven-category schema that ranked both confidence and precision of the functional assignment. We were able to identify homologs for nearly all of these hypothetical proteins (97%), but could confidently assign exact biochemical functions for only 16 proteins (category 1; 3%). Altogether, computational and experimental evidence provided functional assignments or insights for 240 more genes (categories 2-5; 45%). These functional annotations advance our understanding of genes involved in vital cellular processes, including energy conversion, ion transport, secondary metabolism, and signal transduction. We propose that this integrative approach offers a valuable means to undertake the enormous challenge of characterizing the rapidly growing number of hypothetical proteins with each newly sequenced genome.

  18. Functional annotation from the genome sequence of the giant panda.

    PubMed

    Huo, Tong; Zhang, Yinjie; Lin, Jianping

    2012-08-01

    The giant panda is one of the most critically endangered species due to the fragmentation and loss of its habitat. Studying the functions of proteins in this animal, especially specific trait-related proteins, is therefore necessary to protect the species. In this work, the functions of these proteins were investigated using the genome sequence of the giant panda. Data on 21,001 proteins and their functions were stored in the Giant Panda Protein Database, in which the proteins were divided into two groups: 20,179 proteins whose functions can be predicted by GeneScan formed the known-function group, whereas 822 proteins whose functions cannot be predicted by GeneScan comprised the unknown-function group. For the known-function group, we further classified the proteins by molecular function, biological process, cellular component, and tissue specificity. For the unknown-function group, we developed a strategy in which the proteins were filtered by cross-Blast to identify panda-specific proteins under the assumption that proteins related to the panda-specific traits in the unknown-function group exist. After this filtering procedure, we identified 32 proteins (2 of which are membrane proteins) specific to the giant panda genome as compared against the dog and horse genomes. Based on their amino acid sequences, these 32 proteins were further analyzed by functional classification using SVM-Prot, motif prediction using MyHits, and interacting protein prediction using the Database of Interacting Proteins. Nineteen proteins were predicted to be zinc-binding proteins, thus affecting the activities of nucleic acids. The 32 panda-specific proteins will be further investigated by structural and functional analysis.

  19. Genome-wide functional annotation of Phomopsis longicolla isolate MSPL 10-6.

    PubMed

    Darwish, Omar; Li, Shuxian; Matthews, Benjamin; Alkharouf, Nadim

    2016-06-01

    Phomopsis seed decay of soybean is caused primarily by the seed-borne fungal pathogen Phomopsis longicolla (syn. Diaporthe longicolla). This disease severely decreases soybean seed quality, reduces seedling vigor and stand establishment, and suppresses yield. It is one of the most economically important soybean diseases. In this study we annotated the entire genome of P. longicolla isolate MSPL 10-6, which was isolated from field-grown soybean seed in Mississippi, USA. This study represents the first reported genome-wide functional annotation of a seed borne fungal pathogen in the Diaporthe-Phomopsis complex. The P. longicolla genome annotation will enable research into the genetic basis of fungal infection of soybean seed and provide information for the study of soybean-fungal interactions. The genome annotation will also be a valuable resource for the research and agricultural communities. It will aid in the development of new control strategies for this pathogen. The annotations can be found from: http://bioinformatics.towson.edu/phomopsis_longicolla/download.html. NCBI accession number is: AYRD00000000. PMID:27222801

  20. Application of comparative biology in GO functional annotation: the mouse model.

    PubMed

    Drabkin, Harold J; Christie, Karen R; Dolan, Mary E; Hill, David P; Ni, Li; Sitnikov, Dmitry; Blake, Judith A

    2015-10-01

    The Gene Ontology (GO) is an important component of modern biological knowledge representation with great utility for computational analysis of genomic and genetic data. The Gene Ontology Consortium (GOC) consists of a large team of contributors including curation teams from most model organism database groups as well as curation teams focused on representation of data relevant to specific human diseases. Key to the generation of consistent and comprehensive annotations is the development and use of shared standards and measures of curation quality. The GOC engages all contributors to work to a defined standard of curation that is presented here in the context of annotation of genes in the laboratory mouse. Comprehensive understanding of the origin, epistemology, and coverage of GO annotations is essential for most effective use of GO resources. Here the application of comparative approaches to capturing functional data in the mouse system is described. PMID:26141960

  1. Guidelines for the functional annotation of microRNAs using the Gene Ontology.

    PubMed

    Huntley, Rachael P; Sitnikov, Dmitry; Orlic-Milacic, Marija; Balakrishnan, Rama; D'Eustachio, Peter; Gillespie, Marc E; Howe, Doug; Kalea, Anastasia Z; Maegdefessel, Lars; Osumi-Sutherland, David; Petri, Victoria; Smith, Jennifer R; Van Auken, Kimberly; Wood, Valerie; Zampetaki, Anna; Mayr, Manuel; Lovering, Ruth C

    2016-05-01

    MicroRNA regulation of developmental and cellular processes is a relatively new field of study, and the available research data have not been organized to enable its inclusion in pathway and network analysis tools. The association of gene products with terms from the Gene Ontology is an effective method to analyze functional data, but until recently there has been no substantial effort dedicated to applying Gene Ontology terms to microRNAs. Consequently, when performing functional analysis of microRNA data sets, researchers have had to rely instead on the functional annotations associated with the genes encoding microRNA targets. In consultation with experts in the field of microRNA research, we have created comprehensive recommendations for the Gene Ontology curation of microRNAs. This curation manual will enable provision of a high-quality, reliable set of functional annotations for the advancement of microRNA research. Here we describe the key aspects of the work, including development of the Gene Ontology to represent this data, standards for describing the data, and guidelines to support curators making these annotations. The full microRNA curation guidelines are available on the GO Consortium wiki (http://wiki.geneontology.org/index.php/MicroRNA_GO_annotation_manual).

  2. SNPnexus: a web server for functional annotation of novel and publicly known genetic variants (2012 update).

    PubMed

    Dayem Ullah, Abu Z; Lemoine, Nicholas R; Chelala, Claude

    2012-07-01

    Broader functional annotation of single nucleotide variations is a valuable mean for prioritizing targets in further disease studies and large-scale genotyping projects. We originally developed SNPnexus to assess the potential significance of known and novel SNPs on the major transcriptome, proteome, regulatory and structural variation models in order to identify the phenotypically important variants. Being committed to providing continuous support to the scientific community, we have substantially improved SNPnexus over time by incorporating a broader range of variations such as insertions/deletions, block substitutions, IUPAC codes submission and region-based analysis, expanding the query size limit, and most importantly including additional categories for the assessment of functional impact. SNPnexus provides a comprehensive set of annotations for genomic variation data by characterizing related functional consequences at the transcriptome/proteome levels of seven major annotation systems with in-depth analysis of potential deleterious effects, inferring physical and cytogenetic mapping, reporting information on HapMap genotype/allele data, finding overlaps with potential regulatory elements, structural variations and conserved elements, and retrieving links with previously reported genetic disease studies. SNPnexus has a user-friendly web interface with an improved query structure, enhanced functional annotation categories and flexible output presentation making it practically useful for biologists. SNPnexus is freely available at http://www.snp-nexus.org.

  3. Guidelines for the functional annotation of microRNAs using the Gene Ontology

    PubMed Central

    D'Eustachio, Peter; Smith, Jennifer R.; Zampetaki, Anna

    2016-01-01

    MicroRNA regulation of developmental and cellular processes is a relatively new field of study, and the available research data have not been organized to enable its inclusion in pathway and network analysis tools. The association of gene products with terms from the Gene Ontology is an effective method to analyze functional data, but until recently there has been no substantial effort dedicated to applying Gene Ontology terms to microRNAs. Consequently, when performing functional analysis of microRNA data sets, researchers have had to rely instead on the functional annotations associated with the genes encoding microRNA targets. In consultation with experts in the field of microRNA research, we have created comprehensive recommendations for the Gene Ontology curation of microRNAs. This curation manual will enable provision of a high-quality, reliable set of functional annotations for the advancement of microRNA research. Here we describe the key aspects of the work, including development of the Gene Ontology to represent this data, standards for describing the data, and guidelines to support curators making these annotations. The full microRNA curation guidelines are available on the GO Consortium wiki (http://wiki.geneontology.org/index.php/MicroRNA_GO_annotation_manual). PMID:26917558

  4. Guidelines for the functional annotation of microRNAs using the Gene Ontology.

    PubMed

    Huntley, Rachael P; Sitnikov, Dmitry; Orlic-Milacic, Marija; Balakrishnan, Rama; D'Eustachio, Peter; Gillespie, Marc E; Howe, Doug; Kalea, Anastasia Z; Maegdefessel, Lars; Osumi-Sutherland, David; Petri, Victoria; Smith, Jennifer R; Van Auken, Kimberly; Wood, Valerie; Zampetaki, Anna; Mayr, Manuel; Lovering, Ruth C

    2016-05-01

    MicroRNA regulation of developmental and cellular processes is a relatively new field of study, and the available research data have not been organized to enable its inclusion in pathway and network analysis tools. The association of gene products with terms from the Gene Ontology is an effective method to analyze functional data, but until recently there has been no substantial effort dedicated to applying Gene Ontology terms to microRNAs. Consequently, when performing functional analysis of microRNA data sets, researchers have had to rely instead on the functional annotations associated with the genes encoding microRNA targets. In consultation with experts in the field of microRNA research, we have created comprehensive recommendations for the Gene Ontology curation of microRNAs. This curation manual will enable provision of a high-quality, reliable set of functional annotations for the advancement of microRNA research. Here we describe the key aspects of the work, including development of the Gene Ontology to represent this data, standards for describing the data, and guidelines to support curators making these annotations. The full microRNA curation guidelines are available on the GO Consortium wiki (http://wiki.geneontology.org/index.php/MicroRNA_GO_annotation_manual). PMID:26917558

  5. Structural and Functional Annotation of the Porcine Immunome

    Technology Transfer Automated Retrieval System (TEKTRAN)

    The domestic pig is known as an excellent model for human immunology and the two species share many pathogens. Susceptibility to infectious disease is one of the major constraints on swine performance, yet the structure and function of genes comprising the pig immunome are not well-characterized. H...

  6. Disentangling the Effects of Colocalizing Genomic Annotations to Functionally Prioritize Non-coding Variants within Complex-Trait Loci.

    PubMed

    Trynka, Gosia; Westra, Harm-Jan; Slowikowski, Kamil; Hu, Xinli; Xu, Han; Stranger, Barbara E; Klein, Robert J; Han, Buhm; Raychaudhuri, Soumya

    2015-07-01

    Identifying genomic annotations that differentiate causal from trait-associated variants is essential to fine mapping disease loci. Although many studies have identified non-coding functional annotations that overlap disease-associated variants, these annotations often colocalize, complicating the ability to use these annotations for fine mapping causal variation. We developed a statistical approach (Genomic Annotation Shifter [GoShifter]) to assess whether enriched annotations are able to prioritize causal variation. GoShifter defines the null distribution of an annotation overlapping an allele by locally shifting annotations; this approach is less sensitive to biases arising from local genomic structure than commonly used enrichment methods that depend on SNP matching. Local shifting also allows GoShifter to identify independent causal effects from colocalizing annotations. Using GoShifter, we confirmed that variants in expression quantitative trail loci drive gene-expression changes though DNase-I hypersensitive sites (DHSs) near transcription start sites and independently through 3' UTR regulation. We also showed that (1) 15%-36% of trait-associated loci map to DHSs independently of other annotations; (2) loci associated with breast cancer and rheumatoid arthritis harbor potentially causal variants near the summits of histone marks rather than full peak bodies; (3) variants associated with height are highly enriched in embryonic stem cell DHSs; and (4) we can effectively prioritize causal variation at specific loci.

  7. SARA: a server for function annotation of RNA structures.

    PubMed

    Capriotti, Emidio; Marti-Renom, Marc A

    2009-07-01

    Recent interest in non-coding RNA transcripts has resulted in a rapid increase of deposited RNA structures in the Protein Data Bank. However, a characterization and functional classification of the RNA structure and function space have only been partially addressed. Here, we introduce the SARA program for pair-wise alignment of RNA structures as a web server for structure-based RNA function assignment. The SARA server relies on the SARA program, which aligns two RNA structures based on a unit-vector root-mean-square approach. The likely accuracy of the SARA alignments is assessed by three different P-values estimating the statistical significance of the sequence, secondary structure and tertiary structure identity scores, respectively. Our benchmarks, which relied on a set of 419 RNA structures with known SCOR structural class, indicate that at a negative logarithm of mean P-value higher or equal than 2.5, SARA can assign the correct or a similar SCOR class to 81.4% and 95.3% of the benchmark set, respectively. The SARA server is freely accessible via the World Wide Web at http://sgu.bioinfo.cipf.es/services/SARA/.

  8. An Approach to Function Annotation for Proteins of Unknown Function (PUFs) in the Transcriptome of Indian Mulberry.

    PubMed

    Dhanyalakshmi, K H; Naika, Mahantesha B N; Sajeevan, R S; Mathew, Oommen K; Shafi, K Mohamed; Sowdhamini, Ramanathan; N Nataraja, Karaba

    2016-01-01

    The modern sequencing technologies are generating large volumes of information at the transcriptome and genome level. Translation of this information into a biological meaning is far behind the race due to which a significant portion of proteins discovered remain as proteins of unknown function (PUFs). Attempts to uncover the functional significance of PUFs are limited due to lack of easy and high throughput functional annotation tools. Here, we report an approach to assign putative functions to PUFs, identified in the transcriptome of mulberry, a perennial tree commonly cultivated as host of silkworm. We utilized the mulberry PUFs generated from leaf tissues exposed to drought stress at whole plant level. A sequence and structure based computational analysis predicted the probable function of the PUFs. For rapid and easy annotation of PUFs, we developed an automated pipeline by integrating diverse bioinformatics tools, designated as PUFs Annotation Server (PUFAS), which also provides a web service API (Application Programming Interface) for a large-scale analysis up to a genome. The expression analysis of three selected PUFs annotated by the pipeline revealed abiotic stress responsiveness of the genes, and hence their potential role in stress acclimation pathways. The automated pipeline developed here could be extended to assign functions to PUFs from any organism in general. PUFAS web server is available at http://caps.ncbs.res.in/pufas/ and the web service is accessible at http://capservices.ncbs.res.in/help/pufas. PMID:26982336

  9. An Approach to Function Annotation for Proteins of Unknown Function (PUFs) in the Transcriptome of Indian Mulberry

    PubMed Central

    Dhanyalakshmi, K. H.; Naika, Mahantesha B. N.; Sajeevan, R. S.; Mathew, Oommen K.; Shafi, K. Mohamed; Sowdhamini, Ramanathan; N. Nataraja, Karaba

    2016-01-01

    The modern sequencing technologies are generating large volumes of information at the transcriptome and genome level. Translation of this information into a biological meaning is far behind the race due to which a significant portion of proteins discovered remain as proteins of unknown function (PUFs). Attempts to uncover the functional significance of PUFs are limited due to lack of easy and high throughput functional annotation tools. Here, we report an approach to assign putative functions to PUFs, identified in the transcriptome of mulberry, a perennial tree commonly cultivated as host of silkworm. We utilized the mulberry PUFs generated from leaf tissues exposed to drought stress at whole plant level. A sequence and structure based computational analysis predicted the probable function of the PUFs. For rapid and easy annotation of PUFs, we developed an automated pipeline by integrating diverse bioinformatics tools, designated as PUFs Annotation Server (PUFAS), which also provides a web service API (Application Programming Interface) for a large-scale analysis up to a genome. The expression analysis of three selected PUFs annotated by the pipeline revealed abiotic stress responsiveness of the genes, and hence their potential role in stress acclimation pathways. The automated pipeline developed here could be extended to assign functions to PUFs from any organism in general. PUFAS web server is available at http://caps.ncbs.res.in/pufas/ and the web service is accessible at http://capservices.ncbs.res.in/help/pufas. PMID:26982336

  10. An Approach to Function Annotation for Proteins of Unknown Function (PUFs) in the Transcriptome of Indian Mulberry.

    PubMed

    Dhanyalakshmi, K H; Naika, Mahantesha B N; Sajeevan, R S; Mathew, Oommen K; Shafi, K Mohamed; Sowdhamini, Ramanathan; N Nataraja, Karaba

    2016-01-01

    The modern sequencing technologies are generating large volumes of information at the transcriptome and genome level. Translation of this information into a biological meaning is far behind the race due to which a significant portion of proteins discovered remain as proteins of unknown function (PUFs). Attempts to uncover the functional significance of PUFs are limited due to lack of easy and high throughput functional annotation tools. Here, we report an approach to assign putative functions to PUFs, identified in the transcriptome of mulberry, a perennial tree commonly cultivated as host of silkworm. We utilized the mulberry PUFs generated from leaf tissues exposed to drought stress at whole plant level. A sequence and structure based computational analysis predicted the probable function of the PUFs. For rapid and easy annotation of PUFs, we developed an automated pipeline by integrating diverse bioinformatics tools, designated as PUFs Annotation Server (PUFAS), which also provides a web service API (Application Programming Interface) for a large-scale analysis up to a genome. The expression analysis of three selected PUFs annotated by the pipeline revealed abiotic stress responsiveness of the genes, and hence their potential role in stress acclimation pathways. The automated pipeline developed here could be extended to assign functions to PUFs from any organism in general. PUFAS web server is available at http://caps.ncbs.res.in/pufas/ and the web service is accessible at http://capservices.ncbs.res.in/help/pufas.

  11. The development of PIPA: an integrated and automated pipeline for genome-wide protein function annotation

    PubMed Central

    Yu, Chenggang; Zavaljevski, Nela; Desai, Valmik; Johnson, Seth; Stevens, Fred J; Reifman, Jaques

    2008-01-01

    Background Automated protein function prediction methods are needed to keep pace with high-throughput sequencing. With the existence of many programs and databases for inferring different protein functions, a pipeline that properly integrates these resources will benefit from the advantages of each method. However, integrated systems usually do not provide mechanisms to generate customized databases to predict particular protein functions. Here, we describe a tool termed PIPA (Pipeline for Protein Annotation) that has these capabilities. Results PIPA annotates protein functions by combining the results of multiple programs and databases, such as InterPro and the Conserved Domains Database, into common Gene Ontology (GO) terms. The major algorithms implemented in PIPA are: (1) a profile database generation algorithm, which generates customized profile databases to predict particular protein functions, (2) an automated ontology mapping generation algorithm, which maps various classification schemes into GO, and (3) a consensus algorithm to reconcile annotations from the integrated programs and databases. PIPA's profile generation algorithm is employed to construct the enzyme profile database CatFam, which predicts catalytic functions described by Enzyme Commission (EC) numbers. Validation tests show that CatFam yields average recall and precision larger than 95.0%. CatFam is integrated with PIPA. We use an association rule mining algorithm to automatically generate mappings between terms of two ontologies from annotated sample proteins. Incorporating the ontologies' hierarchical topology into the algorithm increases the number of generated mappings. In particular, it generates 40.0% additional mappings from the Clusters of Orthologous Groups (COG) to EC numbers and a six-fold increase in mappings from COG to GO terms. The mappings to EC numbers show a very high precision (99.8%) and recall (96.6%), while the mappings to GO terms show moderate precision (80.0%) and

  12. Curation of the genome annotation of Pichia pastoris (Komagataella phaffii) CBS7435 from gene level to protein function.

    PubMed

    Valli, Minoska; Tatto, Nadine E; Peymann, Armin; Gruber, Clemens; Landes, Nils; Ekker, Heinz; Thallinger, Gerhard G; Mattanovich, Diethard; Gasser, Brigitte; Graf, Alexandra B

    2016-09-01

    As manually curated and non-automated BLAST analysis of the published Pichia pastoris genome sequences revealed many differences between the gene annotations of the strains GS115 and CBS7435, RNA-Seq analysis, supported by proteomics, was performed to improve the genome annotation. Detailed analysis of sequence alignment and protein domain predictions were made to extend the functional genome annotation to all P. pastoris sequences. This allowed the identification of 492 new ORFs, 4916 hypothetical UTRs and the correction of 341 incorrect ORF predictions, which were mainly due to the presence of upstream ATG or erroneous intron predictions. Moreover, 175 previously erroneously annotated ORFs need to be removed from the annotation. In total, we have annotated 5325 ORFs. Regarding the functionality of those genes, we improved all gene and protein descriptions. Thereby, the percentage of ORFs with functional annotation was increased from 48% to 73%. Furthermore, we defined functional groups, covering 25 biological cellular processes of interest, by grouping all genes that are part of the defined process. All data are presented in the newly launched genome browser and database available at www.pichiagenome.org In summary, we present a wide spectrum of curation of the P. pastoris genome annotation from gene level to protein function. PMID:27388471

  13. Algal Functional Annotation Tool from the DOE-UCLA Institute for Genomics and Proteomics

    DOE Data Explorer

    Lopez, David

    The Algal Functional Annotation Tool is a bioinformatics resource to visualize pathway maps, identify enriched biological terms, or convert gene identifiers to elucidate biological function in silico. These types of analysis have been catered to support lists of gene identifiers, such as those coming from transcriptome gene expression analysis. By analyzing the functional annotation of an interesting set of genes, common biological motifs may be elucidated and a first-pass analysis can point further research in the right direction. Currently, the following databases have been parsed, processed, and added to the tool: 1( Kyoto Encyclopedia of Genes and Genomes (KEGG) Pathways Database, 2) MetaCyc Encyclopedia of Metabolic Pathways, 3) Panther Pathways Database, 4) Reactome Pathways Database, 5) Gene Ontology, 6) MapMan Ontology, 7) KOG (Eukaryotic Clusters of Orthologous Groups), 5)Pfam, 6) InterPro.

  14. Genome-wide functional annotation and structural verification of metabolic ORFeome of Chlamydomonas reinhardtii

    PubMed Central

    2011-01-01

    Background Recent advances in the field of metabolic engineering have been expedited by the availability of genome sequences and metabolic modelling approaches. The complete sequencing of the C. reinhardtii genome has made this unicellular alga a good candidate for metabolic engineering studies; however, the annotation of the relevant genes has not been validated and the much-needed metabolic ORFeome is currently unavailable. We describe our efforts on the functional annotation of the ORF models released by the Joint Genome Institute (JGI), prediction of their subcellular localizations, and experimental verification of their structural annotation at the genome scale. Results We assigned enzymatic functions to the translated JGI ORF models of C. reinhardtii by reciprocal BLAST searches of the putative proteome against the UniProt and AraCyc enzyme databases. The best match for each translated ORF was identified and the EC numbers were transferred onto the ORF models. Enzymatic functional assignment was extended to the paralogs of the ORFs by clustering ORFs using BLASTCLUST. In total, we assigned 911 enzymatic functions, including 886 EC numbers, to 1,427 transcripts. We further annotated the enzymatic ORFs by prediction of their subcellular localization. The majority of the ORFs are predicted to be compartmentalized in the cytosol and chloroplast. We verified the structure of the metabolism-related ORF models by reverse transcription-PCR of the functionally annotated ORFs. Following amplification and cloning, we carried out 454FLX and Sanger sequencing of the ORFs. Based on alignment of the 454FLX reads to the ORF predicted sequences, we obtained more than 90% coverage for more than 80% of the ORFs. In total, 1,087 ORF models were verified by 454 and Sanger sequencing methods. We obtained expression evidence for 98% of the metabolic ORFs in the algal cells grown under constant light in the presence of acetate. Conclusions We functionally annotated approximately 1

  15. Functional annotation of an expressed sequence tag library from Haliotis diversicolor and analysis of its plant-like sequences.

    PubMed

    Jiang, Jing-Zhe; Zhang, Wei; Guo, Zhi-Xun; Cai, Chen-Chen; Su, You-Lu; Wang, Rui-Xuan; Wang, Jiang-Yong

    2011-09-01

    The small abalone, Haliotis diversicolor, is a widely distributed and cultured species in the subtropical coastal area of China. To identify and classify functional genes of this important species, a normalized expressed sequence tag (EST) library, including 7069 high quality ESTs from the total body of H. diversicolor, was analyzed. A total of 4781 unigenes were assembled and 2991 novel abalone genes were identified. The GC content, codon and amino acid usage of the transcriptome were analyzed. For the accurate annotation of the abalone library, different influencing factors were evaluated. The gene ontology (GO) database provided a higher annotation rate (69.6%), and sequences longer than 800bp were easily subjected to a BLAST search. The taxonomy of the BLAST results showed that lancelet and invertebrates are most closely related to abalone. Sixty-seven identified plant-like genes were further examined by reverse transcription-polymerase chain reaction (RT-PCR) and sequencing, only seven of these were real transcripts in abalone. Phylogenic trees were also constructed to illustrate the positions of two Cystatin sequences and one Calmodulin protein sequence identified in abalone. To perform functional classification, three different databases (GO, KEGG and COG) were used and 60 immune or disease-related unigenes were determined. This work has greatly enlarged the known gene pool of H. diversicolor and will have important implications for future molecular and genetic analyses in this organism.

  16. Protein intrinsic disorder within the Potyvirus genus: from proteome-wide analysis to functional annotation.

    PubMed

    Charon, Justine; Theil, Sébastien; Nicaise, Valérie; Michon, Thierry

    2016-02-01

    Within proteins, intrinsically disordered regions (IDRs) are devoid of stable secondary and tertiary structures under physiological conditions and rather exist as dynamic ensembles of inter-converting conformers. Although ubiquitous in all domains of life, the intrinsic disorder content is highly variable in viral genomes. Over the years, functional annotations of disordered regions at the scale of the whole proteome have been conducted for several animal viruses. But to date, similar studies applied to plant viruses are still missing. Based on disorder prediction tools combined with annotation programs and evolutionary studies, we analyzed the intrinsic disorder content in Potyvirus, using a 10-species dataset representative of this genus diversity. In this paper, we revealed that: (i) the Potyvirus proteome displays high disorder content, (ii) disorder is conserved during Potyvirus evolution, suggesting a functional advantage of IDRs, (iii) IDRs evolve faster than ordered regions, and (iv) IDRs may be associated with major biological functions required for the Potyvirus cycle. Notably, the proteins P1, Coat protein (CP) and Viral genome-linked protein (VPg) display a high content of conserved disorder, enriched in specific motifs mimicking eukaryotic functional modules and suggesting strategies of host machinery hijacking. In these three proteins, IDRs are particularly conserved despite their high amino acid polymorphism, indicating a link to adaptive processes. Through this comprehensive study, we further investigate the biological relevance of intrinsic disorder in Potyvirus biology and we propose a functional annotation of potyviral proteome IDRs. PMID:26699268

  17. Protein intrinsic disorder within the Potyvirus genus: from proteome-wide analysis to functional annotation.

    PubMed

    Charon, Justine; Theil, Sébastien; Nicaise, Valérie; Michon, Thierry

    2016-02-01

    Within proteins, intrinsically disordered regions (IDRs) are devoid of stable secondary and tertiary structures under physiological conditions and rather exist as dynamic ensembles of inter-converting conformers. Although ubiquitous in all domains of life, the intrinsic disorder content is highly variable in viral genomes. Over the years, functional annotations of disordered regions at the scale of the whole proteome have been conducted for several animal viruses. But to date, similar studies applied to plant viruses are still missing. Based on disorder prediction tools combined with annotation programs and evolutionary studies, we analyzed the intrinsic disorder content in Potyvirus, using a 10-species dataset representative of this genus diversity. In this paper, we revealed that: (i) the Potyvirus proteome displays high disorder content, (ii) disorder is conserved during Potyvirus evolution, suggesting a functional advantage of IDRs, (iii) IDRs evolve faster than ordered regions, and (iv) IDRs may be associated with major biological functions required for the Potyvirus cycle. Notably, the proteins P1, Coat protein (CP) and Viral genome-linked protein (VPg) display a high content of conserved disorder, enriched in specific motifs mimicking eukaryotic functional modules and suggesting strategies of host machinery hijacking. In these three proteins, IDRs are particularly conserved despite their high amino acid polymorphism, indicating a link to adaptive processes. Through this comprehensive study, we further investigate the biological relevance of intrinsic disorder in Potyvirus biology and we propose a functional annotation of potyviral proteome IDRs.

  18. Coordinated international action to accelerate genome-to-phenome with FAANG, The Functional Annotation of Animal Genomes project

    Technology Transfer Automated Retrieval System (TEKTRAN)

    We describe the organization of a nascent international effort - the "Functional Annotation of ANimal Genomes" project - whose aim is to produce comprehensive maps of functional elements in the genomes of domesticated animal species....

  19. The Function of Annotations in the Comprehension of Scientific Texts: Cognitive Load Effects and the Impact of Verbal Ability

    ERIC Educational Resources Information Center

    Wallen, Erik; Plass, Jan L.; Brunken, Roland

    2005-01-01

    Students participated in a study (n = 98) investigating the effectiveness of three types of annotations on three learning outcome measures. The annotations were designed to support the cognitive processes in the comprehension of scientific texts, with a function to aid either the process of selecting relevant information, organizing the…

  20. Functional phylogenomics analysis of bacteria and archaea using consistent genome annotation with UniFam

    DOE PAGES

    Chai, Juanjuan; Kora, Guruprasad; Ahn, Tae-Hyuk; Hyatt, Doug; Pan, Chongle

    2014-10-09

    To supply some background, phylogenetic studies have provided detailed knowledge on the evolutionary mechanisms of genes and species in Bacteria and Archaea. However, the evolution of cellular functions, represented by metabolic pathways and biological processes, has not been systematically characterized. Many clades in the prokaryotic tree of life have now been covered by sequenced genomes in GenBank. This enables a large-scale functional phylogenomics study of many computationally inferred cellular functions across all sequenced prokaryotes. Our results show a total of 14,727 GenBank prokaryotic genomes were re-annotated using a new protein family database, UniFam, to obtain consistent functional annotations for accuratemore » comparison. The functional profile of a genome was represented by the biological process Gene Ontology (GO) terms in its annotation. The GO term enrichment analysis differentiated the functional profiles between selected archaeal taxa. 706 prokaryotic metabolic pathways were inferred from these genomes using Pathway Tools and MetaCyc. The consistency between the distribution of metabolic pathways in the genomes and the phylogenetic tree of the genomes was measured using parsimony scores and retention indices. The ancestral functional profiles at the internal nodes of the phylogenetic tree were reconstructed to track the gains and losses of metabolic pathways in evolutionary history. In conclusion, our functional phylogenomics analysis shows divergent functional profiles of taxa and clades. Such function-phylogeny correlation stems from a set of clade-specific cellular functions with low parsimony scores. On the other hand, many cellular functions are sparsely dispersed across many clades with high parsimony scores. These different types of cellular functions have distinct evolutionary patterns reconstructed from the prokaryotic tree.« less

  1. Transcriptomal changes and functional annotation of the developing non-human primate choroid plexus

    PubMed Central

    Ek, C. Joakim; Nathanielsz, Peter; Li, Cun; Mallard, Carina

    2015-01-01

    The choroid plexuses are small organs that protrude into each brain ventricle producing cerebrospinal fluid that constantly bathes the brain. These organs differentiate early in development just after neural closure at a stage when the brain is little vascularized. In recent years the plexus has been shown to have a much more active role in brain development than previously appreciated thereby it can influence both neurogenesis and neural migration by secreting factors into the CSF. However, much of choroid plexus developmental function is still unclear. Most previous studies on this organ have been undertaken in rodents but translation into humans is not straightforward since they have a different timing of brain maturation processes. We have collected choroid plexus from three fetal gestational ages of a non-human primate, the baboon, which has much closer brain development to humans. The transcriptome of the plexuses was determined by next generation sequencing and Ingenuity Pathway Analysis software was used to annotate functions and enrichment of pathways of changes in the transcriptome. The number of unique transcripts decreased with development and the majority of differentially expressed transcripts were down-regulated through development suggesting a more complex and active plexus earlier in fetal development. The functional annotation indicated changes across widespread biological functions in plexus development. In particular we find age-dependent regulation of genes associated with annotation categories: Gene Expression, Development of Cardiovascular System, Nervous System Development and Molecular Transport. Our observations support the idea that the choroid plexus has roles in shaping brain development. PMID:25814924

  2. Report on the 2011 Critical Assessment of Function Annotation (CAFA) meeting

    SciTech Connect

    Friedberg, Iddo

    2015-01-21

    The Critical Assessment of Function Annotation meeting was held July 14-15, 2011 at the Austria Conference Center in Vienna, Austria. There were 73 registered delegates at the meeting. We thank the DOE for this award. It helped us organize and support a scientific meeting AFP 2011 as a special interest group (SIG) meeting associated with the ISMB 2011 conference. The conference was held in Vienna, Austria, in July 2011. The AFP SIG was held on July 15-16, 2011 (immediately preceding the conference). The meeting consisted of two components, the first being a series of talks (invited and contributed) and discussion sections dedicated to protein function research, with an emphasis on the theory and practice of computational methods utilized in functional annotation. The second component provided a large-scale assessment of computational methods through participation in the Critical Assessment of Functional Annotation (CAFA). The meeting was exciting and, based on feedback, quite successful. There were 73 registered participants. The schedule was only slightly different from the one proposed, due to two cancellations. Dr. Olga Troyanskaya has canceled and we invited Dr. David Jones instead. Similarly, instead of Dr. Richard Roberts, Dr. Simon Kasif gave a closing keynote. The remaining invited speakers were Janet Thornton (EBI) and Amos Bairoch (University of Geneva).

  3. [Functional annotation of rice WRKY transcription factors based on their transcriptional features].

    PubMed

    Liyun, Li; Jianan, Shi; Shuo, Yang; Caiqiang, Sun; Guozhen, Liu

    2016-02-01

    Transcription factors regulate alteration of transcription levels. Recently, huge amount of transcriptomic data are accumulated via the application of high throughput sequencing technology, and it is reasonable to postulate that in-depth analysis of transcription data could be used to enhance gene annotation. In this study, we chose the gene family of rice WRKY transcription factors. Based on literature search, the transcriptional data under different biological processes, including biotic and abiotic stress, development, and nutrient absorption and hormone treatments were analyzed systematically. To the end, we summarize the list of differentially expressed WRKY genes. We also expect that such information will enrich their functional annotation and also provide direct clues for subsequent functional studies. PMID:26907776

  4. Functional annotations for the Saccharomyces cerevisiae genome: the knowns and the known unknowns

    PubMed Central

    Christie, Karen R.; Hong, Eurie L.; Cherry, J. Michael

    2011-01-01

    The quest to characterize each of the genes of the yeast Saccharomyces cerevisiae has propelled the development and application of novel high-throughput (HTP) experimental techniques. To handle the enormous amount of information generated by these techniques, new bioinformatics tools and resources are needed. Gene Ontology (GO) annotations curated by the Saccharomyces Genome Database (SGD) have facilitated the development of algorithms that analyze HTP data and help predict functions for poorly characterized genes in S. cerevisiae and other organisms. Here, we describe how published results are incorporated into GO annotations at SGD and why researchers can benefit from using these resources wisely to analyze their HTP data and predict gene functions. PMID:19577472

  5. Functional annotation and kinetic characterization of PhnO from Salmonella enterica.

    PubMed Central

    Errey, James C.; Blanchard, John S.

    2008-01-01

    Phosphorus is an essential nutrient for all living organisms. Under conditions of inorganic phosphate starvation, genes from the Pho regulon are induced allowing microorganisms to use phosphonates as a source of phosphorus. The phnO gene was previously annotated as a transcriptional regulator of unknown function due to sequence homology with members of the GCN5-related N-acyltransferase family (GNAT). PhnO can now be functionally annotated as an aminoalkylphosphonic acid N-acetyltransferase which is able to acetylate a range of aminoalkylphosphonic acids. Studies revealed that PhnO proceeds via an ordered, sequential kinetic mechanism with acetyl-CoA binding first followed by aminoalkylphosphonate. Attack by the amine on the thioester of AcCoA generates the tetrahedral intermediate that collapses to generate the products. The enzyme also requires a divalent metal ion for activity, which is the first example of this requirement for a GNAT family member. PMID:16503658

  6. Gene Expression and Functional Annotation of the Human and Mouse Choroid Plexus Epithelium

    PubMed Central

    Janssen, Sarah F.; van der Spek, Sophie J. F.; ten Brink, Jacoline B.; Essing, Anke H. W.; Gorgels, Theo G. M. F.; van der Spek, Peter J.; Jansonius, Nomdo M.; Bergen, Arthur A. B.

    2013-01-01

    Background The choroid plexus epithelium (CPE) is a lobed neuro-epithelial structure that forms the outer blood-brain barrier. The CPE protrudes into the brain ventricles and produces the cerebrospinal fluid (CSF), which is crucial for brain homeostasis. Malfunction of the CPE is possibly implicated in disorders like Alzheimer disease, hydrocephalus or glaucoma. To study human genetic diseases and potential new therapies, mouse models are widely used. This requires a detailed knowledge of similarities and differences in gene expression and functional annotation between the species. The aim of this study is to analyze and compare gene expression and functional annotation of healthy human and mouse CPE. Methods We performed 44k Agilent microarray hybridizations with RNA derived from laser dissected healthy human and mouse CPE cells. We functionally annotated and compared the gene expression data of human and mouse CPE using the knowledge database Ingenuity. We searched for common and species specific gene expression patterns and function between human and mouse CPE. We also made a comparison with previously published CPE human and mouse gene expression data. Results Overall, the human and mouse CPE transcriptomes are very similar. Their major functionalities included epithelial junctions, transport, energy production, neuro-endocrine signaling, as well as immunological, neurological and hematological functions and disorders. The mouse CPE presented two additional functions not found in the human CPE: carbohydrate metabolism and a more extensive list of (neural) developmental functions. We found three genes specifically expressed in the mouse CPE compared to human CPE, being ACE, PON1 and TRIM3 and no human specifically expressed CPE genes compared to mouse CPE. Conclusion Human and mouse CPE transcriptomes are very similar, and display many common functionalities. Nonetheless, we also identified a few genes and pathways which suggest that the CPE between mouse and

  7. The Protein Information Resource: an integrated public resource of functional annotation of proteins

    PubMed Central

    Wu, Cathy H.; Huang, Hongzhan; Arminski, Leslie; Castro-Alvear, Jorge; Chen, Yongxing; Hu, Zhang-Zhi; Ledley, Robert S.; Lewis, Kali C.; Mewes, Hans-Werner; Orcutt, Bruce C.; Suzek, Baris E.; Tsugita, Akira; Vinayaka, C. R.; Yeh, Lai-Su L.; Zhang, Jian; Barker, Winona C.

    2002-01-01

    The Protein Information Resource (PIR) serves as an integrated public resource of functional annotation of protein data to support genomic/proteomic research and scientific discovery. The PIR, in collaboration with the Munich Information Center for Protein Sequences (MIPS) and the Japan International Protein Information Database (JIPID), produces the PIR-International Protein Sequence Database (PSD), the major annotated protein sequence database in the public domain, containing about 250 000 proteins. To improve protein annotation and the coverage of experimentally validated data, a bibliography submission system is developed for scientists to submit, categorize and retrieve literature information. Comprehensive protein information is available from iProClass, which includes family classification at the superfamily, domain and motif levels, structural and functional features of proteins, as well as cross-references to over 40 biological databases. To provide timely and comprehensive protein data with source attribution, we have introduced a non-redundant reference protein database, PIR-NREF. The database consists of about 800 000 proteins collected from PIR-PSD, SWISS-PROT, TrEMBL, GenPept, RefSeq and PDB, with composite protein names and literature data. To promote database interoperability, we provide XML data distribution and open database schema, and adopt common ontologies. The PIR web site (http://pir.georgetown.edu/) features data mining and sequence analysis tools for information retrieval and functional identification of proteins based on both sequence and annotation information. The PIR databases and other files are also available by FTP (ftp://nbrfa.georgetown.edu/pir_databases). PMID:11752247

  8. Gene annotation and functional analysis of a newly sequenced Synechococcus strain.

    PubMed

    Li, Y; Rao, N N; Yang, Y; Zhang, Y; Gu, Y N

    2015-10-16

    Synechococcus sp PCC 7336 represents a newly sequenced strain, and its genome is obviously different from that of other Synechococcus strains. In this analysis, local alignment and annotation databases were constructed and combined with various bioinformatic tools to carry out gene annotation and functional analysis of this strain. From this analysis, we identified 5096 protein-coding genes and 47 RNA genes. Of these, 116 genes that were classified into 9 categories were associated with photosynthesis, and type V polymerase proteins that were identified are unique for this strain. An additional 107 genes were closely related to signal transduction pathways, which primarily comprised parts of two-component regulatory systems. Gene ontogeny analysis showed that 2377 genes were annotated with a total number of 9791 functional categories, and specifically that 41 genes distributed in 4 protein complexes were involved in oxidative phosphorylation. Clusters of orthologous groups classification showed that there were 1463 homologous proteins associated with 17 specific metabolic pathways, and that most of the proteins participated in primary metabolic processes such as binding and catalysis. The phylogenetic tree based on 16S rRNA sequences indicated that Synechococcus PCC 7336 is highly likely to represent a new branch.

  9. The Saccharomyces Genome Database: Gene Product Annotation of Function, Process, and Component.

    PubMed

    Cherry, J Michael

    2015-12-01

    An ontology is a highly structured form of controlled vocabulary. Each entry in the ontology is commonly called a term. These terms are used when talking about an annotation. However, each term has a definition that, like the definition of a word found within a dictionary, provides the complete usage and detailed explanation of the term. It is critical to consult a term's definition because the distinction between terms can be subtle. The use of ontologies in biology started as a way of unifying communication between scientific communities and to provide a standard dictionary for different topics, including molecular functions, biological processes, mutant phenotypes, chemical properties and structures. The creation of ontology terms and their definitions often requires debate to reach agreement but the result has been a unified descriptive language used to communicate knowledge. In addition to terms and definitions, ontologies require a relationship used to define the type of connection between terms. In an ontology, a term can have more than one parent term, the term above it in an ontology, as well as more than one child, the term below it in the ontology. Many ontologies are used to construct annotations in the Saccharomyces Genome Database (SGD), as in all modern biological databases; however, Gene Ontology (GO), a descriptive system used to categorize gene function, is the most extensively used ontology in SGD annotations. Examples included in this protocol illustrate the structure and features of this ontology.

  10. Functional annotation of the human chromosome 7 "missing" proteins: a bioinformatics approach.

    PubMed

    Ranganathan, Shoba; Khan, Javed M; Garg, Gagan; Baker, Mark S

    2013-06-01

    The chromosome-centric human proteome project aims to systematically map all human proteins, chromosome by chromosome, in a gene-centric manner through dedicated efforts from national and international teams. This mapping will lead to a knowledge-based resource defining the full set of proteins encoded in each chromosome and laying the foundation for the development of a standardized approach to analyze the massive proteomic data sets currently being generated. The neXtProt database lists 946 proteins as the human proteome of chromosome 7. However, 170 (18%) proteins of human chromosome 7 have no evidence at the proteomic, antibody, or structural levels and are considered "missing" in this study as they lack experimental support. We have developed a protocol for the functional annotation of these "missing" proteins by integrating several bioinformatics analysis and annotation tools, sequential BLAST homology searches, protein domain/motif and gene ontology (GO) mapping, and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis. Using the BLAST search strategy, homologues for reviewed non-human mammalian proteins with protein evidence were identified for 90 "missing" proteins while another 38 had reviewed non-human mammalian homologues. Putative functional annotations were assigned to 27 of the remaining 43 novel proteins. Proteotypic peptides have been computationally generated to facilitate rapid identification of these proteins. Four of the "missing" chromosome 7 proteins have been substantiated by the ENCODE proteogenomic peptide data.

  11. Data for constructing insect genome content matrices for phylogenetic analysis and functional annotation

    PubMed Central

    Rosenfeld, Jeffrey; Foox, Jonathan; DeSalle, Rob

    2015-01-01

    Twenty one fully sequenced and well annotated insect genomes were used to construct genome content matrices for phylogenetic analysis and functional annotation of insect genomes. To examine the role of e-value cutoff in ortholog determination we used scaled e-value cutoffs and a single linkage clustering approach.. The present communication includes (1) a list of the genomes used to construct the genome content phylogenetic matrices, (2) a nexus file with the data matrices used in phylogenetic analysis, (3) a nexus file with the Newick trees generated by phylogenetic analysis, (4) an excel file listing the Core (CORE) genes and Unique (UNI) genes found in five insect groups, and (5) a figure showing a plot of consistency index (CI) versus percent of unannotated genes that are apomorphies in the data set for gene losses and gains and bar plots of gains and losses for four consistency index (CI) cutoffs. PMID:26862572

  12. Data for constructing insect genome content matrices for phylogenetic analysis and functional annotation.

    PubMed

    Rosenfeld, Jeffrey; Foox, Jonathan; DeSalle, Rob

    2016-03-01

    Twenty one fully sequenced and well annotated insect genomes were used to construct genome content matrices for phylogenetic analysis and functional annotation of insect genomes. To examine the role of e-value cutoff in ortholog determination we used scaled e-value cutoffs and a single linkage clustering approach.. The present communication includes (1) a list of the genomes used to construct the genome content phylogenetic matrices, (2) a nexus file with the data matrices used in phylogenetic analysis, (3) a nexus file with the Newick trees generated by phylogenetic analysis, (4) an excel file listing the Core (CORE) genes and Unique (UNI) genes found in five insect groups, and (5) a figure showing a plot of consistency index (CI) versus percent of unannotated genes that are apomorphies in the data set for gene losses and gains and bar plots of gains and losses for four consistency index (CI) cutoffs. PMID:26862572

  13. Analysis and functional annotation of expressed sequence tags of water buffalo.

    PubMed

    Bajetha, Garima; Bhati, Jyotika; Sarika; Iquebal, M A; Rai, Anil; Arora, Vasu; Kumar, Dinesh

    2013-01-01

    An elucidated genome of domestic livestock river buffalo will contribute enormously to economy and better understanding of genome evolution as well. An attempt is made to obtain genomic information on buffalo, based on total Expressed Sequence Tags (ESTs) of Bubalus bubalis available in public domain. These ESTs were annotated and classified into 15 different functional categories based on their homology to the known proteins. Interestingly, 41.79% of the contigs were found to be buffalo specific novel ESTs with respect to other species used in analysis which needs further studies. Also, 224 pSNPs (putative Single Nucleotide Polymorphism) were detected. This study will provide a home base for further genomic studies of buffalo and comparative studies enabling a starting point for the genome annotation of the organism. Supplementary materials are available for this article online.

  14. Finding New Order in Biological Functions from the Network Structure of Gene Annotations

    PubMed Central

    Glass, Kimberly; Girvan, Michelle

    2015-01-01

    The Gene Ontology (GO) provides biologists with a controlled terminology that describes how genes are associated with functions and how functional terms are related to one another. These term-term relationships encode how scientists conceive the organization of biological functions, and they take the form of a directed acyclic graph (DAG). Here, we propose that the network structure of gene-term annotations made using GO can be employed to establish an alternative approach for grouping functional terms that captures intrinsic functional relationships that are not evident in the hierarchical structure established in the GO DAG. Instead of relying on an externally defined organization for biological functions, our approach connects biological functions together if they are performed by the same genes, as indicated in a compendium of gene annotation data from numerous different sources. We show that grouping terms by this alternate scheme provides a new framework with which to describe and predict the functions of experimentally identified sets of genes. PMID:26588252

  15. An Introduction to Genome Annotation.

    PubMed

    Campbell, Michael S; Yandell, Mark

    2015-12-17

    Genome projects have evolved from large international undertakings to tractable endeavors for a single lab. Accurate genome annotation is critical for successful genomic, genetic, and molecular biology experiments. These annotations can be generated using a number of approaches and available software tools. This unit describes methods for genome annotation and a number of software tools commonly used in gene annotation.

  16. Genome-Wide Functional Annotation of Human Protein-Coding Splice Variants Using Multiple Instance Learning.

    PubMed

    Panwar, Bharat; Menon, Rajasree; Eksi, Ridvan; Li, Hong-Dong; Omenn, Gilbert S; Guan, Yuanfang

    2016-06-01

    The vast majority of human multiexon genes undergo alternative splicing and produce a variety of splice variant transcripts and proteins, which can perform different functions. These protein-coding splice variants (PCSVs) greatly increase the functional diversity of proteins. Most functional annotation algorithms have been developed at the gene level; the lack of isoform-level gold standards is an important intellectual limitation for currently available machine learning algorithms. The accumulation of a large amount of RNA-seq data in the public domain greatly increases our ability to examine the functional annotation of genes at isoform level. In the present study, we used a multiple instance learning (MIL)-based approach for predicting the function of PCSVs. We used transcript-level expression values and gene-level functional associations from the Gene Ontology database. A support vector machine (SVM)-based 5-fold cross-validation technique was applied. Comparatively, genes with multiple PCSVs performed better than single PCSV genes, and performance also improved when more examples were available to train the models. We demonstrated our predictions using literature evidence of ADAM15, LMNA/C, and DMXL2 genes. All predictions have been implemented in a web resource called "IsoFunc", which is freely available for the global scientific community through http://guanlab.ccmb.med.umich.edu/isofunc . PMID:27142340

  17. Parallel-META 2.0: Enhanced Metagenomic Data Analysis with Functional Annotation, High Performance Computing and Advanced Visualization

    PubMed Central

    Song, Baoxing; Xu, Jian; Ning, Kang

    2014-01-01

    The metagenomic method directly sequences and analyses genome information from microbial communities. The main computational tasks for metagenomic analyses include taxonomical and functional structure analysis for all genomes in a microbial community (also referred to as a metagenomic sample). With the advancement of Next Generation Sequencing (NGS) techniques, the number of metagenomic samples and the data size for each sample are increasing rapidly. Current metagenomic analysis is both data- and computation- intensive, especially when there are many species in a metagenomic sample, and each has a large number of sequences. As such, metagenomic analyses require extensive computational power. The increasing analytical requirements further augment the challenges for computation analysis. In this work, we have proposed Parallel-META 2.0, a metagenomic analysis software package, to cope with such needs for efficient and fast analyses of taxonomical and functional structures for microbial communities. Parallel-META 2.0 is an extended and improved version of Parallel-META 1.0, which enhances the taxonomical analysis using multiple databases, improves computation efficiency by optimized parallel computing, and supports interactive visualization of results in multiple views. Furthermore, it enables functional analysis for metagenomic samples including short-reads assembly, gene prediction and functional annotation. Therefore, it could provide accurate taxonomical and functional analyses of the metagenomic samples in high-throughput manner and on large scale. PMID:24595159

  18. Parallel-META 2.0: enhanced metagenomic data analysis with functional annotation, high performance computing and advanced visualization.

    PubMed

    Su, Xiaoquan; Pan, Weihua; Song, Baoxing; Xu, Jian; Ning, Kang

    2014-01-01

    The metagenomic method directly sequences and analyses genome information from microbial communities. The main computational tasks for metagenomic analyses include taxonomical and functional structure analysis for all genomes in a microbial community (also referred to as a metagenomic sample). With the advancement of Next Generation Sequencing (NGS) techniques, the number of metagenomic samples and the data size for each sample are increasing rapidly. Current metagenomic analysis is both data- and computation- intensive, especially when there are many species in a metagenomic sample, and each has a large number of sequences. As such, metagenomic analyses require extensive computational power. The increasing analytical requirements further augment the challenges for computation analysis. In this work, we have proposed Parallel-META 2.0, a metagenomic analysis software package, to cope with such needs for efficient and fast analyses of taxonomical and functional structures for microbial communities. Parallel-META 2.0 is an extended and improved version of Parallel-META 1.0, which enhances the taxonomical analysis using multiple databases, improves computation efficiency by optimized parallel computing, and supports interactive visualization of results in multiple views. Furthermore, it enables functional analysis for metagenomic samples including short-reads assembly, gene prediction and functional annotation. Therefore, it could provide accurate taxonomical and functional analyses of the metagenomic samples in high-throughput manner and on large scale.

  19. Accurate ionization potential of semiconductors from efficient density functional calculations

    NASA Astrophysics Data System (ADS)

    Ye, Lin-Hui

    2016-07-01

    Despite its huge successes in total-energy-related applications, the Kohn-Sham scheme of density functional theory cannot get reliable single-particle excitation energies for solids. In particular, it has not been able to calculate the ionization potential (IP), one of the most important material parameters, for semiconductors. We illustrate that an approximate exact-exchange optimized effective potential (EXX-OEP), the Becke-Johnson exchange, can be used to largely solve this long-standing problem. For a group of 17 semiconductors, we have obtained the IPs to an accuracy similar to that of the much more sophisticated G W approximation (GWA), with the computational cost of only local-density approximation/generalized gradient approximation. The EXX-OEP, therefore, is likely as useful for solids as for finite systems. For solid surfaces, the asymptotic behavior of the vx c has effects similar to those of finite systems which, when neglected, typically cause the semiconductor IPs to be underestimated. This may partially explain why standard GWA systematically underestimates the IPs and why using the same GWA procedures has not been able to get an accurate IP and band gap at the same time.

  20. De novo RNA-Seq and functional annotation of Sarcoptes scabiei canis.

    PubMed

    Hu, Li; Zhao, YaE; Yang, YuanJun; Niu, DongLing; Wang, RuiLing; Cheng, Juan; Yang, Fan

    2016-07-01

    The transcriptomic data of Sarcoptes is still lacking in the public database due to the difficulty in extracting high-quality RNA from tiny mites with thick chitin. In this study, total RNA was extracted from live Sarcoptes mites for quality assessment, RNA-Seq, functional annotation, and coding region (CD) prediction and verification. The results showed that the sample JMQ-lngm was qualified for cDNA library construction. Firstly, Agilent 2100 detection showed that the RNA baseline was smooth and the 18S peak was single. Second, the Illumina platform generated 65.78M clean reads and 20,826 unigenes with 35.43M were assembled, occupying 62.98 % of the 56.26M genome. In total, 15,034 unigenes were annotated in seven functional databases. Finally, 13,122 CDs were detected in the 20,826 unigenes, of which 70 complete CDs were matched with Sarcoptes manually in non-redundant nucleotide (NT). Three CDs with indels ≥10 bp were verified. Those results indicated that peritrophin sequences of JMQ-lngm missed 35 bp during the assembly; the pressure-sensitive sodium channel sequences of all the six Sarcoptes scabiei canis isolates were confirmed to be 90 bp shorter than that of a Sarcoptes scabiei hominis isolate; three introns remained in PH chlorine ion channel gating sequences of JMQ-lngm. Moreover, the allergen gene prediction for JMQ-lngm indicated that 61 unigenes were matched with 19 allergen genes of Dermatophagoides, of which Der 1, Der 3, Der 8, and Der 10 had been confirmed in NT. In conclusion, this study successfully completed the RNA-Seq and functional annotation of S. s. canis for the first time, which provides molecular data for future studies on the identification and pathogenic genes of Sarcoptidae. PMID:26997341

  1. BambooGDB: a bamboo genome database with functional annotation and an analysis platform

    PubMed Central

    Zhao, Hansheng; Peng, Zhenhua; Fei, Benhua; Li, Lubin; Hu, Tao; Gao, Zhimin; Jiang, Zehui

    2014-01-01

    Bamboo, as one of the most important non-timber forest products and fastest-growing plants in the world, represents the only major lineage of grasses that is native to forests. Recent success on the first high-quality draft genome sequence of moso bamboo (Phyllostachys edulis) provides new insights on bamboo genetics and evolution. To further extend our understanding on bamboo genome and facilitate future studies on the basis of previous achievements, here we have developed BambooGDB, a bamboo genome database with functional annotation and analysis platform. The de novo sequencing data, together with the full-length complementary DNA and RNA-seq data of moso bamboo composed the main contents of this database. Based on these sequence data, a comprehensively functional annotation for bamboo genome was made. Besides, an analytical platform composed of comparative genomic analysis, protein–protein interactions network, pathway analysis and visualization of genomic data was also constructed. As discovery tools to understand and identify biological mechanisms of bamboo, the platform can be used as a systematic framework for helping and designing experiments for further validation. Moreover, diverse and powerful search tools and a convenient browser were incorporated to facilitate the navigation of these data. As far as we know, this is the first genome database for bamboo. Through integrating high-throughput sequencing data, a full functional annotation and several analysis modules, BambooGDB aims to provide worldwide researchers with a central genomic resource and an extensible analysis platform for bamboo genome. BambooGDB is freely available at http://www.bamboogdb.org/. Database URL: http://www.bamboogdb.org PMID:24602877

  2. The SOFG Anatomy Entry List (SAEL): An Annotation Tool for Functional Genomics Data

    PubMed Central

    Parkinson, Helen; Aitken, Stuart; Baldock, Richard A.; Bard, Jonathan B. L.; Burger, Albert; Hayamizu, Terry F.; Rector, Alan; Ringwald, Martin; Rogers, Jeremy; Rosse, Cornelius; Stoeckert, Christian J.

    2004-01-01

    A great deal of data in functional genomics studies needs to be annotated with low-resolution anatomical terms. For example, gene expression assays based on manually dissected samples (microarray, SAGE, etc.) need high-level anatomical terms to describe sample origin. First-pass annotation in high-throughput assays (e.g. large-scale in situ gene expression screens or phenotype screens) and bibliographic applications, such as selection of keywords, would also benefit from a minimum set of standard anatomical terms. Although only simple terms are required, the researcher faces serious practical problems of inconsistency and confusion, given the different aims and the range of complexity of existing anatomy ontologies. A Standards and Ontologies for Functional Genomics (SOFG) group therefore initiated discussions between several of the major anatomical ontologies for higher vertebrates. As we report here, one result of these discussions is a simple, accessible, controlled vocabulary of gross anatomical terms, the SOFG Anatomy Entry List (SAEL). The SAEL is available from http://www.sofg.org and is intended as a resource for biologists, curators, bioinformaticians and developers of software supporting functional genomics. It can be used directly for annotation in the contexts described above. Importantly, each term is linked to the corresponding term in each of the major anatomy ontologies. Where the simple list does not provide enough detail or sophistication, therefore, the researcher can use the SAEL to choose the appropriate ontology and move directly to the relevant term as an entry point. The SAEL links will also be used to support computational access to the respective ontologies. PMID:18629134

  3. High-throughput comparison, functional annotation, and metabolic modeling of plant genomes using the PlantSEED resource.

    PubMed

    Seaver, Samuel M D; Gerdes, Svetlana; Frelin, Océane; Lerma-Ortiz, Claudia; Bradbury, Louis M T; Zallot, Rémi; Hasnain, Ghulam; Niehaus, Thomas D; El Yacoubi, Basma; Pasternak, Shiran; Olson, Robert; Pusch, Gordon; Overbeek, Ross; Stevens, Rick; de Crécy-Lagard, Valérie; Ware, Doreen; Hanson, Andrew D; Henry, Christopher S

    2014-07-01

    The increasing number of sequenced plant genomes is placing new demands on the methods applied to analyze, annotate, and model these genomes. Today's annotation pipelines result in inconsistent gene assignments that complicate comparative analyses and prevent efficient construction of metabolic models. To overcome these problems, we have developed the PlantSEED, an integrated, metabolism-centric database to support subsystems-based annotation and metabolic model reconstruction for plant genomes. PlantSEED combines SEED subsystems technology, first developed for microbial genomes, with refined protein families and biochemical data to assign fully consistent functional annotations to orthologous genes, particularly those encoding primary metabolic pathways. Seamless integration with its parent, the prokaryotic SEED database, makes PlantSEED a unique environment for cross-kingdom comparative analysis of plant and bacterial genomes. The consistent annotations imposed by PlantSEED permit rapid reconstruction and modeling of primary metabolism for all plant genomes in the database. This feature opens the unique possibility of model-based assessment of the completeness and accuracy of gene annotation and thus allows computational identification of genes and pathways that are restricted to certain genomes or need better curation. We demonstrate the PlantSEED system by producing consistent annotations for 10 reference genomes. We also produce a functioning metabolic model for each genome, gapfilling to identify missing annotations and proposing gene candidates for missing annotations. Models are built around an extended biomass composition representing the most comprehensive published to date. To our knowledge, our models are the first to be published for seven of the genomes analyzed. PMID:24927599

  4. Identification and functional annotation of mycobacterial septum formation genes using cell division mutants of Escherichia coli.

    PubMed

    Gaiwala Sharma, Sujata S; Kishore, Vimal; Raghunand, Tirumalai R

    2016-01-01

    The major virulence trait of Mycobacterium tuberculosis is its ability to enter a latent state in the face of robust host immunity. Clues to the molecular basis of latency can emerge from understanding the mechanism of cell division, beginning with identification of proteins involved in this process. Using complementation of Escherichia coli mutants, we functionally annotated M. tuberculosis and Mycobacterium smegmatis homologs of divisome proteins FtsW and AmiC. Our results demonstrate that E. coli can be used as a surrogate model to discover mycobacterial cell division genes, and should prove invaluable in delineating the mechanisms of this fundamental process in mycobacteria.

  5. Functional annotation of heart enriched mitochondrial genes GBAS and CHCHD10 through guilt by association.

    PubMed

    Martherus, Ruben S R M; Sluiter, Willem; Timmer, Erika D J; VanHerle, Sabina J V; Smeets, Hubert J M; Ayoubi, Torik A Y

    2010-11-12

    Despite the mitochondria ubiquitous nature many of their components display divergences in their expression profile across different tissues. Using the bioinformatics-approach of guilt by association (GBA) we exploited these variations to predict the function of two so far poorly annotated genes: Coiled-coil-helix-coiled-coil-helix domain containing 10 (CHCHD10) and glioblastoma amplified sequence (GBAS). We predicted both genes to be involved in oxidative phosphorylation. Through in vitro experiments using gene-knockdown we could indeed confirm this and furthermore we asserted CHCHD10 to play a role in complex IV activity. PMID:20888800

  6. Genome, Functional Gene Annotation, and Nuclear Transformation of the Heterokont Oleaginous Alga Nannochloropsis oceanica CCMP1779

    PubMed Central

    Tsai, Chia-Hong; Bullard, Blair; Cornish, Adam J.; Harvey, Christopher; Reca, Ida-Barbara; Thornburg, Chelsea; Achawanantakun, Rujira; Buehl, Christopher J.; Campbell, Michael S.; Cavalier, David; Childs, Kevin L.; Clark, Teresa J.; Deshpande, Rahul; Erickson, Erika; Armenia Ferguson, Ann; Handee, Witawas; Kong, Que; Li, Xiaobo; Liu, Bensheng; Lundback, Steven; Peng, Cheng; Roston, Rebecca L.; Sanjaya; Simpson, Jeffrey P.; TerBush, Allan; Warakanont, Jaruswan; Zäuner, Simone; Farre, Eva M.; Hegg, Eric L.; Jiang, Ning; Kuo, Min-Hao; Lu, Yan; Niyogi, Krishna K.; Ohlrogge, John; Osteryoung, Katherine W.; Shachar-Hill, Yair; Sears, Barbara B.; Sun, Yanni; Takahashi, Hideki; Yandell, Mark; Shiu, Shin-Han; Benning, Christoph

    2012-01-01

    Unicellular marine algae have promise for providing sustainable and scalable biofuel feedstocks, although no single species has emerged as a preferred organism. Moreover, adequate molecular and genetic resources prerequisite for the rational engineering of marine algal feedstocks are lacking for most candidate species. Heterokonts of the genus Nannochloropsis naturally have high cellular oil content and are already in use for industrial production of high-value lipid products. First success in applying reverse genetics by targeted gene replacement makes Nannochloropsis oceanica an attractive model to investigate the cell and molecular biology and biochemistry of this fascinating organism group. Here we present the assembly of the 28.7 Mb genome of N. oceanica CCMP1779. RNA sequencing data from nitrogen-replete and nitrogen-depleted growth conditions support a total of 11,973 genes, of which in addition to automatic annotation some were manually inspected to predict the biochemical repertoire for this organism. Among others, more than 100 genes putatively related to lipid metabolism, 114 predicted transcription factors, and 109 transcriptional regulators were annotated. Comparison of the N. oceanica CCMP1779 gene repertoire with the recently published N. gaditana genome identified 2,649 genes likely specific to N. oceanica CCMP1779. Many of these N. oceanica–specific genes have putative orthologs in other species or are supported by transcriptional evidence. However, because similarity-based annotations are limited, functions of most of these species-specific genes remain unknown. Aside from the genome sequence and its analysis, protocols for the transformation of N. oceanica CCMP1779 are provided. The availability of genomic and transcriptomic data for Nannochloropsis oceanica CCMP1779, along with efficient transformation protocols, provides a blueprint for future detailed gene functional analysis and genetic engineering of Nannochloropsis species by a growing

  7. An Accurate Method for Prediction of Protein-Ligand Binding Site on Protein Surface Using SVM and Statistical Depth Function

    PubMed Central

    Wang, Kui; Gao, Jianzhao; Shen, Shiyi; Tuszynski, Jack A.; Ruan, Jishou

    2013-01-01

    Since proteins carry out their functions through interactions with other molecules, accurately identifying the protein-ligand binding site plays an important role in protein functional annotation and rational drug discovery. In the past two decades, a lot of algorithms were present to predict the protein-ligand binding site. In this paper, we introduce statistical depth function to define negative samples and propose an SVM-based method which integrates sequence and structural information to predict binding site. The results show that the present method performs better than the existent ones. The accuracy, sensitivity, and specificity on training set are 77.55%, 56.15%, and 87.96%, respectively; on the independent test set, the accuracy, sensitivity, and specificity are 80.36%, 53.53%, and 92.38%, respectively. PMID:24195070

  8. Annotation extension through protein family annotation coherence metrics

    PubMed Central

    Bastos, Hugo P.; Clarke, Luka A.; Couto, Francisco M.

    2013-01-01

    Protein functional annotation consists in associating proteins with textual descriptors elucidating their biological roles. The bulk of annotation is done via automated procedures that ultimately rely on annotation transfer. Despite a large number of existing protein annotation procedures the ever growing protein space is never completely annotated. One of the facets of annotation incompleteness derives from annotation uncertainty. Often when protein function cannot be predicted with enough specificity it is instead conservatively annotated with more generic terms. In a scenario of protein families or functionally related (or even dissimilar) sets this leads to a more difficult task of using annotations to compare the extent of functional relatedness among all family or set members. However, we postulate that identifying sub-sets of functionally coherent proteins annotated at a very specific level, can help the annotation extension of other incompletely annotated proteins within the same family or functionally related set. As an example we analyse the status of annotation of a set of CAZy families belonging to the Polysaccharide Lyase class. We show that through the use of visualization methods and semantic similarity based metrics it is possible to identify families and respective annotation terms within them that are suitable for possible annotation extension. Based on our analysis we then propose a semi-automatic methodology leading to the extension of single annotation terms within these partially annotated protein sets or families. PMID:24130572

  9. Integrating biological knowledge based on functional annotations for biclustering of gene expression data.

    PubMed

    Nepomuceno, Juan A; Troncoso, Alicia; Nepomuceno-Chamorro, Isabel A; Aguilar-Ruiz, Jesús S

    2015-05-01

    Gene expression data analysis is based on the assumption that co-expressed genes imply co-regulated genes. This assumption is being reformulated because the co-expression of a group of genes may be the result of an independent activation with respect to the same experimental condition and not due to the same regulatory regime. For this reason, traditional techniques are recently being improved with the use of prior biological knowledge from open-access repositories together with gene expression data. Biclustering is an unsupervised machine learning technique that searches patterns in gene expression data matrices. A scatter search-based biclustering algorithm that integrates biological information is proposed in this paper. In addition to the gene expression data matrix, the input of the algorithm is only a direct annotation file that relates each gene to a set of terms from a biological repository where genes are annotated. Two different biological measures, FracGO and SimNTO, are proposed to integrate this information by means of its addition to-be-optimized fitness function in the scatter search scheme. The measure FracGO is based on the biological enrichment and SimNTO is based on the overlapping among GO annotations of pairs of genes. Experimental results evaluate the proposed algorithm for two datasets and show the algorithm performs better when biological knowledge is integrated. Moreover, the analysis and comparison between the two different biological measures is presented and it is concluded that the differences depend on both the data source and how the annotation file has been built in the case GO is used. It is also shown that the proposed algorithm obtains a greater number of enriched biclusters than other classical biclustering algorithms typically used as benchmark and an analysis of the overlapping among biclusters reveals that the biclusters obtained present a low overlapping. The proposed methodology is a general-purpose algorithm which allows

  10. Insyght: navigating amongst abundant homologues, syntenies and gene functional annotations in bacteria, it's that symbol!

    PubMed Central

    Lacroix, Thomas; Loux, Valentin; Gendrault, Annie; Hoebeke, Mark; Gibrat, Jean-François

    2014-01-01

    High-throughput techniques have considerably increased the potential of comparative genomics whilst simultaneously posing many new challenges. One of those challenges involves efficiently mining the large amount of data produced and exploring the landscape of both conserved and idiosyncratic genomic regions across multiple genomes. Domains of application of these analyses are diverse: identification of evolutionary events, inference of gene functions, detection of niche-specific genes or phylogenetic profiling. Insyght is a comparative genomic visualization tool that combines three complementary displays: (i) a table for thoroughly browsing amongst homologues, (ii) a comparator of orthologue functional annotations and (iii) a genomic organization view designed to improve the legibility of rearrangements and distinctive loci. The latter display combines symbolic and proportional graphical paradigms. Synchronized navigation across multiple species and interoperability between the views are core features of Insyght. A gene filter mechanism is provided that helps the user to build a biologically relevant gene set according to multiple criteria such as presence/absence of homologues and/or various annotations. We illustrate the use of Insyght with scenarios. Currently, only Bacteria and Archaea are supported. A public instance is available at http://genome.jouy.inra.fr/Insyght. The tool is freely downloadable for private data set analysis. PMID:25249626

  11. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation

    PubMed Central

    O'Leary, Nuala A.; Wright, Mathew W.; Brister, J. Rodney; Ciufo, Stacy; Haddad, Diana; McVeigh, Rich; Rajput, Bhanu; Robbertse, Barbara; Smith-White, Brian; Ako-Adjei, Danso; Astashyn, Alexander; Badretdin, Azat; Bao, Yiming; Blinkova, Olga; Brover, Vyacheslav; Chetvernin, Vyacheslav; Choi, Jinna; Cox, Eric; Ermolaeva, Olga; Farrell, Catherine M.; Goldfarb, Tamara; Gupta, Tripti; Haft, Daniel; Hatcher, Eneida; Hlavina, Wratko; Joardar, Vinita S.; Kodali, Vamsi K.; Li, Wenjun; Maglott, Donna; Masterson, Patrick; McGarvey, Kelly M.; Murphy, Michael R.; O'Neill, Kathleen; Pujar, Shashikant; Rangwala, Sanjida H.; Rausch, Daniel; Riddick, Lillian D.; Schoch, Conrad; Shkeda, Andrei; Storz, Susan S.; Sun, Hanzhen; Thibaud-Nissen, Francoise; Tolstoy, Igor; Tully, Raymond E.; Vatsan, Anjana R.; Wallin, Craig; Webb, David; Wu, Wendy; Landrum, Melissa J.; Kimchi, Avi; Tatusova, Tatiana; DiCuccio, Michael; Kitts, Paul; Murphy, Terence D.; Pruitt, Kim D.

    2016-01-01

    The RefSeq project at the National Center for Biotechnology Information (NCBI) maintains and curates a publicly available database of annotated genomic, transcript, and protein sequence records (http://www.ncbi.nlm.nih.gov/refseq/). The RefSeq project leverages the data submitted to the International Nucleotide Sequence Database Collaboration (INSDC) against a combination of computation, manual curation, and collaboration to produce a standard set of stable, non-redundant reference sequences. The RefSeq project augments these reference sequences with current knowledge including publications, functional features and informative nomenclature. The database currently represents sequences from more than 55 000 organisms (>4800 viruses, >40 000 prokaryotes and >10 000 eukaryotes; RefSeq release 71), ranging from a single record to complete genomes. This paper summarizes the current status of the viral, prokaryotic, and eukaryotic branches of the RefSeq project, reports on improvements to data access and details efforts to further expand the taxonomic representation of the collection. We also highlight diverse functional curation initiatives that support multiple uses of RefSeq data including taxonomic validation, genome annotation, comparative genomics, and clinical testing. We summarize our approach to utilizing available RNA-Seq and other data types in our manual curation process for vertebrate, plant, and other species, and describe a new direction for prokaryotic genomes and protein name management. PMID:26553804

  12. Taxonomic and functional annotation of gut bacterial communities of Eisenia foetida and Perionyx excavatus.

    PubMed

    Singh, Arjun; Singh, Dushyant P; Tiwari, Rameshwar; Kumar, Kanika; Singh, Ran Vir; Singh, Surender; Prasanna, Radha; Saxena, Anil K; Nain, Lata

    2015-06-01

    Epigeic earthworms can significantly hasten the decomposition of organic matter, which is known to be mediated by gut associated microflora. However, there is scanty information on the abundance and diversity of the gut bacterial flora in different earthworm genera fed with a similar diet, particularly Eisenia foetida and Perionyx excavatus. In this context, 16S rDNA based clonal survey of gut metagenomic DNA was assessed after growth of these two earthworms on lignocellulosic biomass. A set of 67 clonal sequences belonging to E. foetida and 75 to P. excavatus were taxonomically annotated using MG-RAST and RDP pipeline servers. Highest number of sequences were annotated to Proteobacteria (38-44%), followed by unclassified bacteria (14-18%) and Firmicutes (9.3-11%). Comparative analyses revealed significantly higher abundance of Actinobacteria and Firmicutes in the gut of P. excavatus. The functional annotation for the 16S rDNA clonal libraries of both the metagenomes revealed a high abundance of xylan degraders (12.1-24.1%). However, chitin degraders (16.7%), ammonia oxidizers (24.1%) and nitrogen fixers (7.4%) were relatively higher in E. foetida, while in P. excavatus; sulphate reducers and sulphate oxidizers (12.1-29.6%) were more abundant. Lignin degradation was detected in 3.7% clones of E. foetida, while cellulose degraders represented 1.7%. The gut microbiomes showed relative abundance of dehalogenators (17.2-22.2%) and aromatic hydrocarbon degraders (1.7-5.6%), illustrating their role in bioremediation. This study highlights the significance of differences in the inherent microbiome of these two earthworms in shaping the metagenome for effective degradation of different types of biomass under tropical conditions. PMID:25813857

  13. Cloning, analysis and functional annotation of expressed sequence tags from the Earthworm Eisenia fetida

    PubMed Central

    Pirooznia, Mehdi; Gong, Ping; Guan, Xin; Inouye, Laura S; Yang, Kuan; Perkins, Edward J; Deng, Youping

    2007-01-01

    Background Eisenia fetida, commonly known as red wiggler or compost worm, belongs to the Lumbricidae family of the Annelida phylum. Little is known about its genome sequence although it has been extensively used as a test organism in terrestrial ecotoxicology. In order to understand its gene expression response to environmental contaminants, we cloned 4032 cDNAs or expressed sequence tags (ESTs) from two E. fetida libraries enriched with genes responsive to ten ordnance related compounds using suppressive subtractive hybridization-PCR. Results A total of 3144 good quality ESTs (GenBank dbEST accession number EH669363–EH672369 and EL515444–EL515580) were obtained from the raw clone sequences after cleaning. Clustering analysis yielded 2231 unique sequences including 448 contigs (from 1361 ESTs) and 1783 singletons. Comparative genomic analysis showed that 743 or 33% of the unique sequences shared high similarity with existing genes in the GenBank nr database. Provisional function annotation assigned 830 Gene Ontology terms to 517 unique sequences based on their homology with the annotated genomes of four model organisms Drosophila melanogaster, Mus musculus, Saccharomyces cerevisiae, and Caenorhabditis elegans. Seven percent of the unique sequences were further mapped to 99 Kyoto Encyclopedia of Genes and Genomes pathways based on their matching Enzyme Commission numbers. All the information is stored and retrievable at a highly performed, web-based and user-friendly relational database called EST model database or ESTMD version 2. Conclusion The ESTMD containing the sequence and annotation information of 4032 E. fetida ESTs is publicly accessible at . PMID:18047730

  14. GO-FAANG meeting: a Gathering On Functional Annotation of Animal Genomes.

    PubMed

    Tuggle, Christopher K; Giuffra, Elisabetta; White, Stephen N; Clarke, Laura; Zhou, Huaijun; Ross, Pablo J; Acloque, Hervé; Reecy, James M; Archibald, Alan; Bellone, Rebecca R; Boichard, Michèle; Chamberlain, Amanda; Cheng, Hans; Crooijmans, Richard P M A; Delany, Mary E; Finno, Carrie J; Groenen, Martien A M; Hayes, Ben; Lunney, Joan K; Petersen, Jessica L; Plastow, Graham S; Schmidt, Carl J; Song, Jiuzhou; Watson, Mick

    2016-10-01

    The Functional Annotation of Animal Genomes (FAANG) Consortium recently held a Gathering On FAANG (GO-FAANG) Workshop in Washington, DC on October 7-8, 2015. This consortium is a grass-roots organization formed to advance the annotation of newly assembled genomes of domesticated and non-model organisms (www.faang.org). The workshop gathered together from around the world a group of 100+ genome scientists, administrators, representatives of funding agencies and commodity groups to discuss the latest advancements of the consortium, new perspectives, next steps and implementation plans. The workshop was streamed live and recorded, and all talks, along with speaker slide presentations, are available at www.faang.org. In this report, we describe the major activities and outcomes of this meeting. We also provide updates on ongoing efforts to implement discussions and decisions taken at GO-FAANG to guide future FAANG activities. In summary, reference datasets are being established under pilot projects; plans for tissue sets, morphological classification and methods of sample collection for different tissues were organized; and core assays and data and meta-data analysis standards were established.

  15. Expression profiling of hypothetical genes in Desulfovibrio vulgaris leads to improved functional annotation

    SciTech Connect

    Elias, Dwayne A.; Mukhopadhyay, Aindrila; Joachimiak, Marcine P.; Drury, Elliott C.; Redding, Alyssa M.; Yen, Huei-Che B.; Fields, Matthew; Hazen, Terry C.; Arkin, Adam P.; Keasling, Jay D.; Wall, Judy D.

    2009-03-17

    Hypothetical (HyP) and conserved HyP genes account for >30% of sequenced bacterial genomes. For the sulfate-reducing bacterium Desulfovibrio vulgaris Hildenborough, 347 of the 3634 genes were annotated as conserved HyP (9.5%) along with 887 HyP genes (24.4%). Given the large fraction of the genome, it is plausible that some of these genes serve critical cellular roles. The study goals were to determine which genes were expressed and provide a more functionally based annotation. To accomplish this, expression profiles of 1234 HyP and conserved genes were used from transcriptomic datasets of 11 environmental stresses, complemented with shotgun LC–MS/MS and AMT tag proteomic data. Genes were divided into putatively polycistronic operons and those predicted to be monocistronic, then classified by basal expression levels and grouped according to changes in expression for one or multiple stresses. One thousand two hundred and twelve of these genes were transcribed with 786 producing detectable proteins. There was no evidence for expression of 17 predicted genes.

  16. GO-FAANG meeting: a Gathering On Functional Annotation of Animal Genomes.

    PubMed

    Tuggle, Christopher K; Giuffra, Elisabetta; White, Stephen N; Clarke, Laura; Zhou, Huaijun; Ross, Pablo J; Acloque, Hervé; Reecy, James M; Archibald, Alan; Bellone, Rebecca R; Boichard, Michèle; Chamberlain, Amanda; Cheng, Hans; Crooijmans, Richard P M A; Delany, Mary E; Finno, Carrie J; Groenen, Martien A M; Hayes, Ben; Lunney, Joan K; Petersen, Jessica L; Plastow, Graham S; Schmidt, Carl J; Song, Jiuzhou; Watson, Mick

    2016-10-01

    The Functional Annotation of Animal Genomes (FAANG) Consortium recently held a Gathering On FAANG (GO-FAANG) Workshop in Washington, DC on October 7-8, 2015. This consortium is a grass-roots organization formed to advance the annotation of newly assembled genomes of domesticated and non-model organisms (www.faang.org). The workshop gathered together from around the world a group of 100+ genome scientists, administrators, representatives of funding agencies and commodity groups to discuss the latest advancements of the consortium, new perspectives, next steps and implementation plans. The workshop was streamed live and recorded, and all talks, along with speaker slide presentations, are available at www.faang.org. In this report, we describe the major activities and outcomes of this meeting. We also provide updates on ongoing efforts to implement discussions and decisions taken at GO-FAANG to guide future FAANG activities. In summary, reference datasets are being established under pilot projects; plans for tissue sets, morphological classification and methods of sample collection for different tissues were organized; and core assays and data and meta-data analysis standards were established. PMID:27453069

  17. Functional annotation of the vlinc class of non-coding RNAs using systems biology approach

    PubMed Central

    Laurent, Georges St.; Vyatkin, Yuri; Antonets, Denis; Ri, Maxim; Qi, Yao; Saik, Olga; Shtokalo, Dmitry; de Hoon, Michiel J.L.; Kawaji, Hideya; Itoh, Masayoshi; Lassmann, Timo; Arner, Erik; Forrest, Alistair R.R.; Nicolas, Estelle; McCaffrey, Timothy A.; Carninci, Piero; Hayashizaki, Yoshihide; Wahlestedt, Claes; Kapranov, Philipp

    2016-01-01

    Functionality of the non-coding transcripts encoded by the human genome is the coveted goal of the modern genomics research. While commonly relied on the classical methods of forward genetics, integration of different genomics datasets in a global Systems Biology fashion presents a more productive avenue of achieving this very complex aim. Here we report application of a Systems Biology-based approach to dissect functionality of a newly identified vast class of very long intergenic non-coding (vlinc) RNAs. Using highly quantitative FANTOM5 CAGE dataset, we show that these RNAs could be grouped into 1542 novel human genes based on analysis of insulators that we show here indeed function as genomic barrier elements. We show that vlincRNAs genes likely function in cis to activate nearby genes. This effect while most pronounced in closely spaced vlincRNA–gene pairs can be detected over relatively large genomic distances. Furthermore, we identified 101 vlincRNA genes likely involved in early embryogenesis based on patterns of their expression and regulation. We also found another 109 such genes potentially involved in cellular functions also happening at early stages of development such as proliferation, migration and apoptosis. Overall, we show that Systems Biology-based methods have great promise for functional annotation of non-coding RNAs. PMID:27001520

  18. Functional annotation of the vlinc class of non-coding RNAs using systems biology approach.

    PubMed

    St Laurent, Georges; Vyatkin, Yuri; Antonets, Denis; Ri, Maxim; Qi, Yao; Saik, Olga; Shtokalo, Dmitry; de Hoon, Michiel J L; Kawaji, Hideya; Itoh, Masayoshi; Lassmann, Timo; Arner, Erik; Forrest, Alistair R R; Nicolas, Estelle; McCaffrey, Timothy A; Carninci, Piero; Hayashizaki, Yoshihide; Wahlestedt, Claes; Kapranov, Philipp

    2016-04-20

    Functionality of the non-coding transcripts encoded by the human genome is the coveted goal of the modern genomics research. While commonly relied on the classical methods of forward genetics, integration of different genomics datasets in a global Systems Biology fashion presents a more productive avenue of achieving this very complex aim. Here we report application of a Systems Biology-based approach to dissect functionality of a newly identified vast class of very long intergenic non-coding (vlinc) RNAs. Using highly quantitative FANTOM5 CAGE dataset, we show that these RNAs could be grouped into 1542 novel human genes based on analysis of insulators that we show here indeed function as genomic barrier elements. We show that vlinc RNAs genes likely function in cisto activate nearby genes. This effect while most pronounced in closely spaced vlinc RNA-gene pairs can be detected over relatively large genomic distances. Furthermore, we identified 101 vlinc RNA genes likely involved in early embryogenesis based on patterns of their expression and regulation. We also found another 109 such genes potentially involved in cellular functions also happening at early stages of development such as proliferation, migration and apoptosis. Overall, we show that Systems Biology-based methods have great promise for functional annotation of non-coding RNAs.

  19. Exponentially accurate approximations to piece-wise smooth periodic functions

    NASA Technical Reports Server (NTRS)

    Greer, James; Banerjee, Saheb

    1995-01-01

    A family of simple, periodic basis functions with 'built-in' discontinuities are introduced, and their properties are analyzed and discussed. Some of their potential usefulness is illustrated in conjunction with the Fourier series representations of functions with discontinuities. In particular, it is demonstrated how they can be used to construct a sequence of approximations which converges exponentially in the maximum norm to a piece-wise smooth function. The theory is illustrated with several examples and the results are discussed in the context of other sequences of functions which can be used to approximate discontinuous functions.

  20. MassNet: a functional annotation service for protein mass spectrometry data

    PubMed Central

    Park, Daeui; Kim, Byoung-Chul; Cho, Seong-Woong; Park, Seong-Jin; Choi, Jong-Soon; Kim, Seung Il; Lee, Sunghoon

    2008-01-01

    Although mass spectrometry has been frequently used to identify proteins, there are no web servers that provide comprehensive functional annotation of those identified proteins. It is necessary to provide such web service due to a rapid increase in the data. We, therefore, introduce MassNet, which provides (i) physico-chemical analysis information, (ii) KEGG pathway assignment (iii) Gene Ontology mapping and (iv) protein–protein interaction (PPI) prediction for the data from MASCOT, Prospector and Profound. MassNet provides the prediction information for PPIs using both 3D structural interaction and experimental interaction deposited in PSIMAP, BIND, DIP, HPRD, IntAct, MINT, CYGD and BioGrid. The web service is freely available at http://massnet.kr or http://sequenceome.kobic.re.kr/MassNet/. PMID:18448467

  1. Improving Functional Annotation in the DRE-TIM Metallolyase Superfamily through Identification of Active Site Fingerprints.

    PubMed

    Kumar, Garima; Johnson, Jordyn L; Frantom, Patrick A

    2016-03-29

    Within the DRE-TIM metallolyase superfamily, members of the Claisen-like condensation (CC-like) subgroup catalyze C-C bond-forming reactions between various α-ketoacids and acetyl-coenzyme A. These reactions are important in the metabolic pathways of many bacterial pathogens and serve as engineering scaffolds for the production of long-chain alcohol biofuels. To improve functional annotation and identify sequences that might use novel substrates in the CC-like subgroup, a combination of structural modeling and multiple-sequence alignments identified active site residues on the third, fourth, and fifth β-strands of the TIM-barrel catalytic domain that are differentially conserved within the substrate-diverse enzyme families. Using α-isopropylmalate synthase and citramalate synthase from Methanococcus jannaschii (MjIPMS and MjCMS), site-directed mutagenesis was used to test the role of each identified position in substrate selectivity. Kinetic data suggest that residues at the β3-5 and β4-7 positions play a significant role in the selection of α-ketoisovalerate over pyruvate in MjIPMS. However, complementary substitutions in MjCMS fail to alter substrate specificity, suggesting residues in these positions do not contribute to substrate selectivity in this enzyme. Analysis of the kinetic data with respect to a protein similarity network for the CC-like subgroup suggests that evolutionarily distinct forms of IPMS utilize residues at the β3-5 and β4-7 positions to affect substrate selectivity while the different versions of CMS use unique architectures. Importantly, mapping the identities of residues at the β3-5 and β4-7 positions onto the protein similarity network allows for rapid annotation of probable IPMS enzymes as well as several outlier sequences that may represent novel functions in the subgroup. PMID:26935545

  2. Human cell adhesion molecules: annotated functional subtypes and overrepresentation of addiction-associated genes.

    PubMed

    Zhong, Xiaoming; Drgonova, Jana; Li, Chuan-Yun; Uhl, George R

    2015-09-01

    Human cell adhesion molecules (CAMs) are essential for proper development, modulation, and maintenance of interactions between cells and cell-to-cell (and matrix-to-cell) communication about these interactions. Despite the differential functional significance of these roles, there have been surprisingly few systematic studies to enumerate the universe of CAMs and identify specific CAMs in distinct functions. In this paper, we update and review the set of human genes likely to encode CAMs with searches of databases, literature reviews, and annotations. We describe likely CAMs and functional subclasses, including CAMs that have a primary function in information exchange (iCAMs), CAMs involved in focal adhesions, CAM gene products that are preferentially involved with stereotyped and morphologically identifiable connections between cells (e.g., adherens junctions, gap junctions), and smaller numbers of CAM genes in other classes. We discuss a novel proposed mechanism involving selective anchoring of the constituents of iCAM-containing lipid rafts in zones of close neuronal apposition to membranes expressing iCAM binding partners. We also discuss data from genetic and genomic studies of addiction in humans and mouse models to highlight the ways in which CAM variation may contribute to a specific brain-based disorder such as addiction. Specific examples include changes in CAM mRNA splicing mediated by differences in the addiction-associated splicing regulator RBFOX1/A2BP1 and CAM expression in dopamine neurons. PMID:25988664

  3. Revealing complex function, process and pathway interactions with high-throughput expression and biological annotation data.

    PubMed

    Singh, Nitesh Kumar; Ernst, Mathias; Liebscher, Volkmar; Fuellen, Georg; Taher, Leila

    2016-10-20

    The biological relationships both between and within the functions, processes and pathways that operate within complex biological systems are only poorly characterized, making the interpretation of large scale gene expression datasets extremely challenging. Here, we present an approach that integrates gene expression and biological annotation data to identify and describe the interactions between biological functions, processes and pathways that govern a phenotype of interest. The product is a global, interconnected network, not of genes but of functions, processes and pathways, that represents the biological relationships within the system. We validated our approach on two high-throughput expression datasets describing organismal and organ development. Our findings are well supported by the available literature, confirming that developmental processes and apoptosis play key roles in cell differentiation. Furthermore, our results suggest that processes related to pluripotency and lineage commitment, which are known to be critical for development, interact mainly indirectly, through genes implicated in more general biological processes. Moreover, we provide evidence that supports the relevance of cell spatial organization in the developing liver for proper liver function. Our strategy can be viewed as an abstraction that is useful to interpret high-throughput data and devise further experiments.

  4. Exploratory Analysis of Biological Networks through Visualization, Clustering, and Functional Annotation in Cytoscape.

    PubMed

    Baryshnikova, Anastasia

    2016-01-01

    Biological networks define how genes, proteins, and other cellular components interact with one another to carry out specific functions, providing a scaffold for understanding cellular organization. Although in-depth network analysis requires advanced mathematical and computational knowledge, a preliminary visual exploration of biological networks is accessible to anyone with basic computer skills. Visualization of biological networks is used primarily to examine network topology, identify functional modules, and predict gene functions based on gene connectivity within the network. Networks are excellent at providing a bird's-eye view of data sets and have the power of illustrating complex ideas in simple and intuitive terms. In addition, they enable exploratory analysis and generation of new hypotheses, which can then be tested using rigorous statistical and experimental tools. This protocol describes a simple procedure for visualizing a biological network using the genetic interaction similarity network for Saccharomyces cerevisiae as an example. The visualization procedure described here relies on the open-source network visualization software Cytoscape and includes detailed instructions on formatting and loading the data, clustering networks, and overlaying functional annotations. PMID:26988373

  5. Validating Annotations for Uncharacterized Proteins in Shewanella oneidensis

    PubMed Central

    Louie, Brenton; Tarczy-Hornoch, Peter; Higdon, Roger

    2008-01-01

    Abstract Proteins of unknown function are a barrier to our understanding of molecular biology. Assigning function to these “uncharacterized” proteins is imperative, but challenging. The usual approach is similarity searches using annotation databases, which are useful for predicting function. However, since the performance of these databases on uncharacterized proteins is basically unknown, the accuracy of their predictions is suspect, making annotation difficult. To address this challenge, we developed a benchmark annotation dataset of 30 proteins in Shewanella oneidensis. The proteins in the dataset were originally uncharacterized after the initial annotation of the S. oneidensis proteome in 2002. In the intervening 5 years, the accumulation of new experimental evidence has enabled specific functions to be predicted. We utilized this benchmark dataset to evaluate several commonly utilized annotation databases. According to our criteria, six annotation databases accurately predicted functions for at least 60% of proteins in our dataset. Two of these six even had a “conditional accuracy” of 90%. Conditional accuracy is another evaluation metric we developed which excludes results from databases where no function was predicted. Also, 27 of the 30 proteins' functions were correctly predicted by at least one database. These represent one of the first performance evaluations of annotation databases on uncharacterized proteins. Our evaluation indicates that these databases readily incorporate new information and are accurate in predicting functions for uncharacterized proteins, provided that experimental function evidence exists. PMID:18687039

  6. Accurate estimators of correlation functions in Fourier space

    NASA Astrophysics Data System (ADS)

    Sefusatti, E.; Crocce, M.; Scoccimarro, R.; Couchman, H. M. P.

    2016-08-01

    Efficient estimators of Fourier-space statistics for large number of objects rely on fast Fourier transforms (FFTs), which are affected by aliasing from unresolved small-scale modes due to the finite FFT grid. Aliasing takes the form of a sum over images, each of them corresponding to the Fourier content displaced by increasing multiples of the sampling frequency of the grid. These spurious contributions limit the accuracy in the estimation of Fourier-space statistics, and are typically ameliorated by simultaneously increasing grid size and discarding high-frequency modes. This results in inefficient estimates for e.g. the power spectrum when desired systematic biases are well under per cent level. We show that using interlaced grids removes odd images, which include the dominant contribution to aliasing. In addition, we discuss the choice of interpolation kernel used to define density perturbations on the FFT grid and demonstrate that using higher order interpolation kernels than the standard Cloud-In-Cell algorithm results in significant reduction of the remaining images. We show that combining fourth-order interpolation with interlacing gives very accurate Fourier amplitudes and phases of density perturbations. This results in power spectrum and bispectrum estimates that have systematic biases below 0.01 per cent all the way to the Nyquist frequency of the grid, thus maximizing the use of unbiased Fourier coefficients for a given grid size and greatly reducing systematics for applications to large cosmological data sets.

  7. Leveraging Functional-Annotation Data in Trans-ethnic Fine-Mapping Studies.

    PubMed

    Kichaev, Gleb; Pasaniuc, Bogdan

    2015-08-01

    Localization of causal variants underlying known risk loci is one of the main research challenges following genome-wide association studies. Risk loci are typically dissected through fine-mapping experiments in trans-ethnic cohorts for leveraging the variability in the local genetic structure across populations. More recent works have shown that genomic functional annotations (i.e., localization of tissue-specific regulatory marks) can be integrated for increasing fine-mapping performance within single-population studies. Here, we introduce methods that integrate the strength of association between genotype and phenotype, the variability in the genetic backgrounds across populations, and the genomic map of tissue-specific functional elements to increase trans-ethnic fine-mapping accuracy. Through extensive simulations and empirical data, we have demonstrated that our approach increases fine-mapping resolution over existing methods. We analyzed empirical data from a large-scale trans-ethnic rheumatoid arthritis (RA) study and showed that the functional genetic architecture of RA is consistent across European and Asian ancestries. In these data, we used our proposed methods to reduce the average size of the 90% credible set from 29 variants per locus for standard non-integrative approaches to 22 variants.

  8. Discovering gene functional relationships using FAUN (Feature Annotation Using Nonnegative matrix factorization)

    PubMed Central

    2010-01-01

    Background Searching the enormous amount of information available in biomedical literature to extract novel functional relationships among genes remains a challenge in the field of bioinformatics. While numerous (software) tools have been developed to extract and identify gene relationships from biological databases, few effectively deal with extracting new (or implied) gene relationships, a process which is useful in interpretation of discovery-oriented genome-wide experiments. Results In this study, we develop a Web-based bioinformatics software environment called FAUN or Feature Annotation Using Nonnegative matrix factorization (NMF) to facilitate both the discovery and classification of functional relationships among genes. Both the computational complexity and parameterization of NMF for processing gene sets are discussed. FAUN is tested on three manually constructed gene document collections. Its utility and performance as a knowledge discovery tool is demonstrated using a set of genes associated with Autism. Conclusions FAUN not only assists researchers to use biomedical literature efficiently, but also provides utilities for knowledge discovery. This Web-based software environment may be useful for the validation and analysis of functional associations in gene subsets identified by high-throughput experiments. PMID:20946597

  9. Sequence- and Structure-Based Functional Annotation and Assessment of Metabolic Transporters in Aspergillus oryzae: A Representative Case Study

    PubMed Central

    Raethong, Nachon; Wong-ekkabut, Jirasak; Laoteng, Kobkul; Vongsangnak, Wanwipa

    2016-01-01

    Aspergillus oryzae is widely used for the industrial production of enzymes. In A. oryzae metabolism, transporters appear to play crucial roles in controlling the flux of molecules for energy generation, nutrients delivery, and waste elimination in the cell. While the A. oryzae genome sequence is available, transporter annotation remains limited and thus the connectivity of metabolic networks is incomplete. In this study, we developed a metabolic annotation strategy to understand the relationship between the sequence, structure, and function for annotation of A. oryzae metabolic transporters. Sequence-based analysis with manual curation showed that 58 genes of 12,096 total genes in the A. oryzae genome encoded metabolic transporters. Under consensus integrative databases, 55 unambiguous metabolic transporter genes were distributed into channels and pores (7 genes), electrochemical potential-driven transporters (33 genes), and primary active transporters (15 genes). To reveal the transporter functional role, a combination of homology modeling and molecular dynamics simulation was implemented to assess the relationship between sequence to structure and structure to function. As in the energy metabolism of A. oryzae, the H+-ATPase encoded by the AO090005000842 gene was selected as a representative case study of multilevel linkage annotation. Our developed strategy can be used for enhancing metabolic network reconstruction. PMID:27274991

  10. eggNOG 4.5: a hierarchical orthology framework with improved functional annotations for eukaryotic, prokaryotic and viral sequences.

    PubMed

    Huerta-Cepas, Jaime; Szklarczyk, Damian; Forslund, Kristoffer; Cook, Helen; Heller, Davide; Walter, Mathias C; Rattei, Thomas; Mende, Daniel R; Sunagawa, Shinichi; Kuhn, Michael; Jensen, Lars Juhl; von Mering, Christian; Bork, Peer

    2016-01-01

    eggNOG is a public resource that provides Orthologous Groups (OGs) of proteins at different taxonomic levels, each with integrated and summarized functional annotations. Developments since the latest public release include changes to the algorithm for creating OGs across taxonomic levels, making nested groups hierarchically consistent. This allows for a better propagation of functional terms across nested OGs and led to the novel annotation of 95 890 previously uncharacterized OGs, increasing overall annotation coverage from 67% to 72%. The functional annotations of OGs have been expanded to also provide Gene Ontology terms, KEGG pathways and SMART/Pfam domains for each group. Moreover, eggNOG now provides pairwise orthology relationships within OGs based on analysis of phylogenetic trees. We have also incorporated a framework for quickly mapping novel sequences to OGs based on precomputed HMM profiles. Finally, eggNOG version 4.5 incorporates a novel data set spanning 2605 viral OGs, covering 5228 proteins from 352 viral proteomes. All data are accessible for bulk downloading, as a web-service, and through a completely redesigned web interface. The new access points provide faster searches and a number of new browsing and visualization capabilities, facilitating the needs of both experts and less experienced users. eggNOG v4.5 is available at http://eggnog.embl.de. PMID:26582926

  11. eggNOG 4.5: a hierarchical orthology framework with improved functional annotations for eukaryotic, prokaryotic and viral sequences

    PubMed Central

    Huerta-Cepas, Jaime; Szklarczyk, Damian; Forslund, Kristoffer; Cook, Helen; Heller, Davide; Walter, Mathias C.; Rattei, Thomas; Mende, Daniel R.; Sunagawa, Shinichi; Kuhn, Michael; Jensen, Lars Juhl; von Mering, Christian; Bork, Peer

    2016-01-01

    eggNOG is a public resource that provides Orthologous Groups (OGs) of proteins at different taxonomic levels, each with integrated and summarized functional annotations. Developments since the latest public release include changes to the algorithm for creating OGs across taxonomic levels, making nested groups hierarchically consistent. This allows for a better propagation of functional terms across nested OGs and led to the novel annotation of 95 890 previously uncharacterized OGs, increasing overall annotation coverage from 67% to 72%. The functional annotations of OGs have been expanded to also provide Gene Ontology terms, KEGG pathways and SMART/Pfam domains for each group. Moreover, eggNOG now provides pairwise orthology relationships within OGs based on analysis of phylogenetic trees. We have also incorporated a framework for quickly mapping novel sequences to OGs based on precomputed HMM profiles. Finally, eggNOG version 4.5 incorporates a novel data set spanning 2605 viral OGs, covering 5228 proteins from 352 viral proteomes. All data are accessible for bulk downloading, as a web-service, and through a completely redesigned web interface. The new access points provide faster searches and a number of new browsing and visualization capabilities, facilitating the needs of both experts and less experienced users. eggNOG v4.5 is available at http://eggnog.embl.de. PMID:26582926

  12. eggNOG 4.5: a hierarchical orthology framework with improved functional annotations for eukaryotic, prokaryotic and viral sequences.

    PubMed

    Huerta-Cepas, Jaime; Szklarczyk, Damian; Forslund, Kristoffer; Cook, Helen; Heller, Davide; Walter, Mathias C; Rattei, Thomas; Mende, Daniel R; Sunagawa, Shinichi; Kuhn, Michael; Jensen, Lars Juhl; von Mering, Christian; Bork, Peer

    2016-01-01

    eggNOG is a public resource that provides Orthologous Groups (OGs) of proteins at different taxonomic levels, each with integrated and summarized functional annotations. Developments since the latest public release include changes to the algorithm for creating OGs across taxonomic levels, making nested groups hierarchically consistent. This allows for a better propagation of functional terms across nested OGs and led to the novel annotation of 95 890 previously uncharacterized OGs, increasing overall annotation coverage from 67% to 72%. The functional annotations of OGs have been expanded to also provide Gene Ontology terms, KEGG pathways and SMART/Pfam domains for each group. Moreover, eggNOG now provides pairwise orthology relationships within OGs based on analysis of phylogenetic trees. We have also incorporated a framework for quickly mapping novel sequences to OGs based on precomputed HMM profiles. Finally, eggNOG version 4.5 incorporates a novel data set spanning 2605 viral OGs, covering 5228 proteins from 352 viral proteomes. All data are accessible for bulk downloading, as a web-service, and through a completely redesigned web interface. The new access points provide faster searches and a number of new browsing and visualization capabilities, facilitating the needs of both experts and less experienced users. eggNOG v4.5 is available at http://eggnog.embl.de.

  13. Comprehensive functional annotation of 18 missense mutations found in suspected hemochromatosis type 4 patients.

    PubMed

    Callebaut, Isabelle; Joubrel, Rozenn; Pissard, Serge; Kannengiesser, Caroline; Gérolami, Victoria; Ged, Cécile; Cadet, Estelle; Cartault, François; Ka, Chandran; Gourlaouen, Isabelle; Gourhant, Lénaick; Oudin, Claire; Goossens, Michel; Grandchamp, Bernard; De Verneuil, Hubert; Rochette, Jacques; Férec, Claude; Le Gac, Gérald

    2014-09-01

    Hemochromatosis type 4 is a rare form of primary iron overload transmitted as an autosomal dominant trait caused by mutations in the gene encoding the iron transport protein ferroportin 1 (SLC40A1). SLC40A1 mutations fall into two functional categories (loss- versus gain-of-function) underlying two distinct clinical entities (hemochromatosis type 4A versus type 4B). However, the vast majority of SLC40A1 mutations are rare missense variations, with only a few showing strong evidence of causality. The present study reports the results of an integrated approach collecting genetic and phenotypic data from 44 suspected hemochromatosis type 4 patients, with comprehensive structural and functional annotations. Causality was demonstrated for 10 missense variants, showing a clear dichotomy between the two hemochromatosis type 4 subtypes. Two subgroups of loss-of-function mutations were distinguished: one impairing cell-surface expression and one altering only iron egress. Additionally, a new gain-of-function mutation was identified, and the degradation of ferroportin on hepcidin binding was shown to probably depend on the integrity of a large extracellular loop outside of the hepcidin-binding domain. Eight further missense variations, on the other hand, were shown to have no discernible effects at either protein or RNA level; these were found in apparently isolated patients and were associated with a less severe phenotype. The present findings illustrate the importance of combining in silico and biochemical approaches to fully distinguish pathogenic SLC40A1 mutations from benign variants. This has profound implications for patient management.

  14. Annotation of Protein Domains Reveals Remarkable Conservation in the Functional Make up of Proteomes Across Superkingdoms

    PubMed Central

    Nasir, Arshan; Naeem, Aisha; Khan, Muhammad Jawad; Lopez-Nicora, Horacio D.; Caetano-Anollés, Gustavo

    2011-01-01

    The functional repertoire of a cell is largely embodied in its proteome, the collection of proteins encoded in the genome of an organism. The molecular functions of proteins are the direct consequence of their structure and structure can be inferred from sequence using hidden Markov models of structural recognition. Here we analyze the functional annotation of protein domain structures in almost a thousand sequenced genomes, exploring the functional and structural diversity of proteomes. We find there is a remarkable conservation in the distribution of domains with respect to the molecular functions they perform in the three superkingdoms of life. In general, most of the protein repertoire is spent in functions related to metabolic processes but there are significant differences in the usage of domains for regulatory and extra-cellular processes both within and between superkingdoms. Our results support the hypotheses that the proteomes of superkingdom Eukarya evolved via genome expansion mechanisms that were directed towards innovating new domain architectures for regulatory and extra/intracellular process functions needed for example to maintain the integrity of multicellular structure or to interact with environmental biotic and abiotic factors (e.g., cell signaling and adhesion, immune responses, and toxin production). Proteomes of microbial superkingdoms Archaea and Bacteria retained fewer numbers of domains and maintained simple and smaller protein repertoires. Viruses appear to play an important role in the evolution of superkingdoms. We finally identify few genomic outliers that deviate significantly from the conserved functional design. These include Nanoarchaeum equitans, proteobacterial symbionts of insects with extremely reduced genomes, Tenericutes and Guillardia theta. These organisms spend most of their domains on information functions, including translation and transcription, rather than on metabolism and harbor a domain repertoire characteristic of

  15. GOsummaries: an R Package for Visual Functional Annotation of Experimental Data.

    PubMed

    Kolde, Raivo; Vilo, Jaak

    2015-01-01

    Functional characterisation of gene lists using Gene Ontology (GO) enrichment analysis is a common approach in computational biology, since many analysis methods end up with a list of genes as a result. Often there can be hundreds of functional terms that are significantly associated with a single list of genes and proper interpretation of such results can be a challenging endeavour. There are methods to visualise and aid the interpretation of these results, but most of them are limited to the results associated with one list of genes. However, in practice the number of gene lists can be considerably higher and common tools are not effective in such situations. We introduce a novel R package, 'GOsummaries' that visualises the GO enrichment results as concise word clouds that can be combined together if the number of gene lists is larger. By also adding the graphs of corresponding raw experimental data, GOsummaries can create informative summary plots for various analyses such as differential expression or clustering. The case studies show that the GOsummaries plots allow rapid functional characterisation of complex sets of gene lists. The GOsummaries approach is particularly effective for Principal Component Analysis (PCA). By adding functional annotation to the principal components, GOsummaries improves  significantly the interpretability of PCA results. The GOsummaries layout for PCA can be effective even in situations where we cannot directly apply the GO analysis. For example, in case of metabolomics or metagenomics data it is possible to show the features with significant associations to the components instead of GO terms.   The GOsummaries package is available under GPL-2 licence at Bioconductor (http://www.bioconductor.org/packages/release/bioc/html/GOsummaries.html).

  16. In Silico Functional Pathway Annotation of 86 Established Prostate Cancer Risk Variants

    PubMed Central

    Loo, Lenora W. M.; Fong, Aaron Y. W.; Cheng, Iona; Le Marchand, Loïc

    2015-01-01

    Heritability is one of the strongest risk factors of prostate cancer, emphasizing the importance of the genetic contribution towards prostate cancer risk. To date, 86 established prostate cancer risk variants have been identified by genome-wide association studies (GWAS). To determine if these risk variants are located near genes that interact together in biological networks or pathways contributing to prostate cancer initiation or progression, we generated gene sets based on proximity to the 86 prostate cancer risk variants. We took two approaches to generate gene lists. The first strategy included all immediate flanking genes, up- and downstream of the risk variant, regardless of distance from the index variant, and the second strategy included genes closest to the index GWAS marker and to variants in high LD (r2 ≥0.8 in Europeans) with the index variant, within a 100 kb window up- and downstream. Pathway mapping of the two gene sets supported the importance of the androgen receptor-mediated signaling in prostate cancer biology. In addition, the hedgehog and Wnt/β-catenin signaling pathways were identified in pathway mapping for the flanking gene set. We also used the HaploReg resource to examine the 86 risk loci and variants high LD (r2 ≥0.8) for functional elements. We found that there was a 12.8 fold (p = 2.9 x 10-4) enrichment for enhancer motifs in a stem cell line and a 4.4 fold (p = 1.1 x 10-3) enrichment of DNase hypersensitivity in a prostate adenocarcinoma cell line, indicating that the risk and correlated variants are enriched for transcriptional regulatory motifs. Our pathway-based functional annotation of the prostate cancer risk variants highlights the potential regulatory function that GWAS risk markers, and their highly correlated variants, exert on genes. Our study also shows that these genes may function cooperatively in key signaling pathways in prostate cancer biology. PMID:25658610

  17. RNA-seq analysis of Quercus pubescens Leaves: de novo transcriptome assembly, annotation and functional markers development.

    PubMed

    Torre, Sara; Tattini, Massimiliano; Brunetti, Cecilia; Fineschi, Silvia; Fini, Alessio; Ferrini, Francesco; Sebastiani, Federico

    2014-01-01

    Quercus pubescens Willd., a species distributed from Spain to southwest Asia, ranks high for drought tolerance among European oaks. Q. pubescens performs a role of outstanding significance in most Mediterranean forest ecosystems, but few mechanistic studies have been conducted to explore its response to environmental constrains, due to the lack of genomic resources. In our study, we performed a deep transcriptomic sequencing in Q. pubescens leaves, including de novo assembly, functional annotation and the identification of new molecular markers. Our results are a pre-requisite for undertaking molecular functional studies, and may give support in population and association genetic studies. 254,265,700 clean reads were generated by the Illumina HiSeq 2000 platform, with an average length of 98 bp. De novo assembly, using CLC Genomics, produced 96,006 contigs, having a mean length of 618 bp. Sequence similarity analyses against seven public databases (Uniprot, NR, RefSeq and KOGs at NCBI, Pfam, InterPro and KEGG) resulted in 83,065 transcripts annotated with gene descriptions, conserved protein domains, or gene ontology terms. These annotations and local BLAST allowed identify genes specifically associated with mechanisms of drought avoidance. Finally, 14,202 microsatellite markers and 18,425 single nucleotide polymorphisms (SNPs) were, in silico, discovered in assembled and annotated sequences. We completed a successful global analysis of the Q. pubescens leaf transcriptome using RNA-seq. The assembled and annotated sequences together with newly discovered molecular markers provide genomic information for functional genomic studies in Q. pubescens, with special emphasis to response mechanisms to severe constrain of the Mediterranean climate. Our tools enable comparative genomics studies on other Quercus species taking advantage of large intra-specific ecophysiological differences.

  18. RNA-Seq Analysis of Quercus pubescens Leaves: De Novo Transcriptome Assembly, Annotation and Functional Markers Development

    PubMed Central

    Torre, Sara; Tattini, Massimiliano; Brunetti, Cecilia; Fineschi, Silvia; Fini, Alessio; Ferrini, Francesco; Sebastiani, Federico

    2014-01-01

    Quercus pubescens Willd., a species distributed from Spain to southwest Asia, ranks high for drought tolerance among European oaks. Q. pubescens performs a role of outstanding significance in most Mediterranean forest ecosystems, but few mechanistic studies have been conducted to explore its response to environmental constrains, due to the lack of genomic resources. In our study, we performed a deep transcriptomic sequencing in Q. pubescens leaves, including de novo assembly, functional annotation and the identification of new molecular markers. Our results are a pre-requisite for undertaking molecular functional studies, and may give support in population and association genetic studies. 254,265,700 clean reads were generated by the Illumina HiSeq 2000 platform, with an average length of 98 bp. De novo assembly, using CLC Genomics, produced 96,006 contigs, having a mean length of 618 bp. Sequence similarity analyses against seven public databases (Uniprot, NR, RefSeq and KOGs at NCBI, Pfam, InterPro and KEGG) resulted in 83,065 transcripts annotated with gene descriptions, conserved protein domains, or gene ontology terms. These annotations and local BLAST allowed identify genes specifically associated with mechanisms of drought avoidance. Finally, 14,202 microsatellite markers and 18,425 single nucleotide polymorphisms (SNPs) were, in silico, discovered in assembled and annotated sequences. We completed a successful global analysis of the Q. pubescens leaf transcriptome using RNA-seq. The assembled and annotated sequences together with newly discovered molecular markers provide genomic information for functional genomic studies in Q. pubescens, with special emphasis to response mechanisms to severe constrain of the Mediterranean climate. Our tools enable comparative genomics studies on other Quercus species taking advantage of large intra-specific ecophysiological differences. PMID:25393112

  19. De novo assembly and functional annotation of the olive (Olea europaea) transcriptome.

    PubMed

    Muñoz-Mérida, Antonio; González-Plaza, Juan José; Cañada, Andrés; Blanco, Ana María; García-López, Maria del Carmen; Rodríguez, José Manuel; Pedrola, Laia; Sicardo, M Dolores; Hernández, M Luisa; De la Rosa, Raúl; Belaj, Angjelina; Gil-Borja, Mayte; Luque, Francisco; Martínez-Rivas, José Manuel; Pisano, David G; Trelles, Oswaldo; Valpuesta, Victoriano; Beuzón, Carmen R

    2013-02-01

    Olive breeding programmes are focused on selecting for traits as short juvenile period, plant architecture suited for mechanical harvest, or oil characteristics, including fatty acid composition, phenolic, and volatile compounds to suit new markets. Understanding the molecular basis of these characteristics and improving the efficiency of such breeding programmes require the development of genomic information and tools. However, despite its economic relevance, genomic information on olive or closely related species is still scarce. We have applied Sanger and 454 pyrosequencing technologies to generate close to 2 million reads from 12 cDNA libraries obtained from the Picual, Arbequina, and Lechin de Sevilla cultivars and seedlings from a segregating progeny of a Picual × Arbequina cross. The libraries include fruit mesocarp and seeds at three relevant developmental stages, young stems and leaves, active juvenile and adult buds as well as dormant buds, and juvenile and adult roots. The reads were assembled by library or tissue and then assembled together into 81 020 unigenes with an average size of 496 bases. Here, we report their assembly and their functional annotation.

  20. Cross-Population Joint Analysis of eQTLs: Fine Mapping and Functional Annotation

    PubMed Central

    Wen, Xiaoquan; Luca, Francesca; Pique-Regi, Roger

    2015-01-01

    Mapping expression quantitative trait loci (eQTLs) has been shown as a powerful tool to uncover the genetic underpinnings of many complex traits at molecular level. In this paper, we present an integrative analysis approach that leverages eQTL data collected from multiple population groups. In particular, our approach effectively identifies multiple independent cis-eQTL signals that are consistent across populations, accounting for population heterogeneity in allele frequencies and linkage disequilibrium patterns. Furthermore, by integrating genomic annotations, our analysis framework enables high-resolution functional analysis of eQTLs. We applied our statistical approach to analyze the GEUVADIS data consisting of samples from five population groups. From this analysis, we concluded that i) jointly analysis across population groups greatly improves the power of eQTL discovery and the resolution of fine mapping of causal eQTL ii) many genes harbor multiple independent eQTLs in their cis regions iii) genetic variants that disrupt transcription factor binding are significantly enriched in eQTLs (p-value = 4.93 × 10-22). PMID:25906321

  1. Comparative annotation of functional regions in the human genome using epigenomic data.

    PubMed

    Won, Kyoung-Jae; Zhang, Xian; Wang, Tao; Ding, Bo; Raha, Debasish; Snyder, Michael; Ren, Bing; Wang, Wei

    2013-04-01

    Epigenetic regulation is dynamic and cell-type dependent. The recently available epigenomic data in multiple cell types provide an unprecedented opportunity for a comparative study of epigenetic landscape. We developed a machine-learning method called ChroModule to annotate the epigenetic states in eight ENCyclopedia Of DNA Elements cell types. The trained model successfully captured the characteristic histone-modification patterns associated with regulatory elements, such as promoters and enhancers, and showed superior performance on identifying enhancers compared with the state-of-art methods. In addition, given the fixed number of epigenetic states in the model, ChroModule allows straightforward illustration of epigenetic variability in multiple cell types. Using this feature, we found that invariable and variable epigenetic states across cell types correspond to housekeeping functions and stimulus response, respectively. Especially, we observed that enhancers, but not the other regulatory elements, dictate cell specificity, as similar cell types share common enhancers, and cell-type-specific enhancers are often bound by transcription factors playing critical roles in that cell type. More interestingly, we found some genomic regions are dormant in cell type but primed to become active in other cell types. These observations highlight the usefulness of ChroModule in comparative analysis and interpretation of multiple epigenomes.

  2. De Novo Assembly and Functional Annotation of the Olive (Olea europaea) Transcriptome

    PubMed Central

    Muñoz-Mérida, Antonio; González-Plaza, Juan José; Cañada, Andrés; Blanco, Ana María; García-López, Maria del Carmen; Rodríguez, José Manuel; Pedrola, Laia; Sicardo, M. Dolores; Hernández, M. Luisa; De la Rosa, Raúl; Belaj, Angjelina; Gil-Borja, Mayte; Luque, Francisco; Martínez-Rivas, José Manuel; Pisano, David G.; Trelles, Oswaldo; Valpuesta, Victoriano; Beuzón, Carmen R.

    2013-01-01

    Olive breeding programmes are focused on selecting for traits as short juvenile period, plant architecture suited for mechanical harvest, or oil characteristics, including fatty acid composition, phenolic, and volatile compounds to suit new markets. Understanding the molecular basis of these characteristics and improving the efficiency of such breeding programmes require the development of genomic information and tools. However, despite its economic relevance, genomic information on olive or closely related species is still scarce. We have applied Sanger and 454 pyrosequencing technologies to generate close to 2 million reads from 12 cDNA libraries obtained from the Picual, Arbequina, and Lechin de Sevilla cultivars and seedlings from a segregating progeny of a Picual × Arbequina cross. The libraries include fruit mesocarp and seeds at three relevant developmental stages, young stems and leaves, active juvenile and adult buds as well as dormant buds, and juvenile and adult roots. The reads were assembled by library or tissue and then assembled together into 81 020 unigenes with an average size of 496 bases. Here, we report their assembly and their functional annotation. PMID:23297299

  3. Automated annotation of functional imaging experiments via multi-label classification

    PubMed Central

    Turner, Matthew D.; Chakrabarti, Chayan; Jones, Thomas B.; Xu, Jiawei F.; Fox, Peter T.; Luger, George F.; Laird, Angela R.; Turner, Jessica A.

    2013-01-01

    Identifying the experimental methods in human neuroimaging papers is important for grouping meaningfully similar experiments for meta-analyses. Currently, this can only be done by human readers. We present the performance of common machine learning (text mining) methods applied to the problem of automatically classifying or labeling this literature. Labeling terms are from the Cognitive Paradigm Ontology (CogPO), the text corpora are abstracts of published functional neuroimaging papers, and the methods use the performance of a human expert as training data. We aim to replicate the expert's annotation of multiple labels per abstract identifying the experimental stimuli, cognitive paradigms, response types, and other relevant dimensions of the experiments. We use several standard machine learning methods: naive Bayes (NB), k-nearest neighbor, and support vector machines (specifically SMO or sequential minimal optimization). Exact match performance ranged from only 15% in the worst cases to 78% in the best cases. NB methods combined with binary relevance transformations performed strongly and were robust to overfitting. This collection of results demonstrates what can be achieved with off-the-shelf software components and little to no pre-processing of raw text. PMID:24409112

  4. Automated annotation of functional imaging experiments via multi-label classification.

    PubMed

    Turner, Matthew D; Chakrabarti, Chayan; Jones, Thomas B; Xu, Jiawei F; Fox, Peter T; Luger, George F; Laird, Angela R; Turner, Jessica A

    2013-01-01

    Identifying the experimental methods in human neuroimaging papers is important for grouping meaningfully similar experiments for meta-analyses. Currently, this can only be done by human readers. We present the performance of common machine learning (text mining) methods applied to the problem of automatically classifying or labeling this literature. Labeling terms are from the Cognitive Paradigm Ontology (CogPO), the text corpora are abstracts of published functional neuroimaging papers, and the methods use the performance of a human expert as training data. We aim to replicate the expert's annotation of multiple labels per abstract identifying the experimental stimuli, cognitive paradigms, response types, and other relevant dimensions of the experiments. We use several standard machine learning methods: naive Bayes (NB), k-nearest neighbor, and support vector machines (specifically SMO or sequential minimal optimization). Exact match performance ranged from only 15% in the worst cases to 78% in the best cases. NB methods combined with binary relevance transformations performed strongly and were robust to overfitting. This collection of results demonstrates what can be achieved with off-the-shelf software components and little to no pre-processing of raw text.

  5. Insect genome content phylogeny and functional annotation of core insect genomes.

    PubMed

    Rosenfeld, Jeffrey A; Foox, Jonathan; DeSalle, Rob

    2016-04-01

    Twenty-one fully sequenced and well annotated insect genomes were examined for genome content in a phylogenetic context. Gene presence/absence matrices and phylogenetic trees were constructed using several phylogenetic criteria. The role of e-value on phylogenetic analysis and genome content characterization is examined using scaled e-value cutoffs and a single linkage clustering approach to orthology determination. Previous studies have focused on the role of gene loss in terminals in the insect tree of life. The present study examines several common ancestral nodes in the insect tree. We suggest that the common ancestors of major insect groups like Diptera, Hymenoptera, Hemiptera and Holometabola experience more gene gain than gene loss. This suggests that as major insect groups arose, their genomic repertoire expanded through gene duplication (segmental duplications), followed by contraction by gene loss in specific terminal lineages. In addition, we examine the functional significance of the loss and gain of genes in the divergence of some of the major insect groups. PMID:26549428

  6. Cellular functions of genetically imprinted genes in human and mouse as annotated in the gene ontology.

    PubMed

    Hamed, Mohamed; Ismael, Siba; Paulsen, Martina; Helms, Volkhard

    2012-01-01

    By analyzing the cellular functions of genetically imprinted genes as annotated in the Gene Ontology for human and mouse, we found that imprinted genes are often involved in developmental, transport and regulatory processes. In the human, paternally expressed genes are enriched in GO terms related to the development of organs and of anatomical structures. In the mouse, maternally expressed genes regulate cation transport as well as G-protein signaling processes. Furthermore, we investigated if imprinted genes are regulated by common transcription factors. We identified 25 TF families that showed an enrichment of binding sites in the set of imprinted genes in human and 40 TF families in mouse. In general, maternally and paternally expressed genes are not regulated by different transcription factors. The genes Nnat, Klf14, Blcap, Gnas and Ube3a contribute most to the enrichment of TF families. In the mouse, genes that are maternally expressed in placenta are enriched for AP1 binding sites. In the human, we found that these genes possessed binding sites for both, AP1 and SP1. PMID:23226257

  7. Rice DB: an Oryza Information Portal linking annotation, subcellular location, function, expression, regulation, and evolutionary information for rice and Arabidopsis.

    PubMed

    Narsai, Reena; Devenish, James; Castleden, Ian; Narsai, Kabir; Xu, Lin; Shou, Huixia; Whelan, James

    2013-12-01

    Omics research in Oryza sativa (rice) relies on the use of multiple databases to obtain different types of information to define gene function. We present Rice DB, an Oryza information portal that is a functional genomics database, linking gene loci to comprehensive annotations, expression data and the subcellular location of encoded proteins. Rice DB has been designed to integrate the direct comparison of rice with Arabidopsis (Arabidopsis thaliana), based on orthology or 'expressology', thus using and combining available information from two pre-eminent plant models. To establish Rice DB, gene identifiers (more than 40 types) and annotations from a variety of sources were compiled, functional information based on large-scale and individual studies was manually collated, hundreds of microarrays were analysed to generate expression annotations, and the occurrences of potential functional regulatory motifs in promoter regions were calculated. A range of computational subcellular localization predictions were also run for all putative proteins encoded in the rice genome, and experimentally confirmed protein localizations have been collated, curated and linked to functional studies in rice. A single search box allows anything from gene identifiers (for rice and/or Arabidopsis), motif sequences, subcellular location, to keyword searches to be entered, with the capability of Boolean searches (such as AND/OR). To demonstrate the utility of Rice DB, several examples are presented including a rice mitochondrial proteome, which draws on a variety of sources for subcellular location data within Rice DB. Comparisons of subcellular location, functional annotations, as well as transcript expression in parallel with Arabidopsis reveals examples of conservation between rice and Arabidopsis, using Rice DB (http://ricedb.plantenergy.uwa.edu.au).

  8. Rice DB: an Oryza Information Portal linking annotation, subcellular location, function, expression, regulation, and evolutionary information for rice and Arabidopsis

    PubMed Central

    Narsai, Reena; Devenish, James; Castleden, Ian; Narsai, Kabir; Xu, Lin; Shou, Huixia; Whelan, James

    2013-01-01

    Omics research in Oryza sativa (rice) relies on the use of multiple databases to obtain different types of information to define gene function. We present Rice DB, an Oryza information portal that is a functional genomics database, linking gene loci to comprehensive annotations, expression data and the subcellular location of encoded proteins. Rice DB has been designed to integrate the direct comparison of rice with Arabidopsis (Arabidopsis thaliana), based on orthology or ‘expressology’, thus using and combining available information from two pre-eminent plant models. To establish Rice DB, gene identifiers (more than 40 types) and annotations from a variety of sources were compiled, functional information based on large-scale and individual studies was manually collated, hundreds of microarrays were analysed to generate expression annotations, and the occurrences of potential functional regulatory motifs in promoter regions were calculated. A range of computational subcellular localization predictions were also run for all putative proteins encoded in the rice genome, and experimentally confirmed protein localizations have been collated, curated and linked to functional studies in rice. A single search box allows anything from gene identifiers (for rice and/or Arabidopsis), motif sequences, subcellular location, to keyword searches to be entered, with the capability of Boolean searches (such as AND/OR). To demonstrate the utility of Rice DB, several examples are presented including a rice mitochondrial proteome, which draws on a variety of sources for subcellular location data within Rice DB. Comparisons of subcellular location, functional annotations, as well as transcript expression in parallel with Arabidopsis reveals examples of conservation between rice and Arabidopsis, using Rice DB (http://ricedb.plantenergy.uwa.edu.au). PMID:24147765

  9. Re-Annotator: Annotation Pipeline for Microarray Probe Sequences.

    PubMed

    Arloth, Janine; Bader, Daniel M; Röh, Simone; Altmann, Andre

    2015-01-01

    Microarray technologies are established approaches for high throughput gene expression, methylation and genotyping analysis. An accurate mapping of the array probes is essential to generate reliable biological findings. However, manufacturers of the microarray platforms typically provide incomplete and outdated annotation tables, which often rely on older genome and transcriptome versions that differ substantially from up-to-date sequence databases. Here, we present the Re-Annotator, a re-annotation pipeline for microarray probe sequences. It is primarily designed for gene expression microarrays but can also be adapted to other types of microarrays. The Re-Annotator uses a custom-built mRNA reference database to identify the positions of gene expression array probe sequences. We applied Re-Annotator to the Illumina Human-HT12 v4 microarray platform and found that about one quarter (25%) of the probes differed from the manufacturer's annotation. In further computational experiments on experimental gene expression data, we compared Re-Annotator to another probe re-annotation tool, ReMOAT, and found that Re-Annotator provided an improved re-annotation of microarray probes. A thorough re-annotation of probe information is crucial to any microarray analysis. The Re-Annotator pipeline is freely available at http://sourceforge.net/projects/reannotator along with re-annotated files for Illumina microarrays HumanHT-12 v3/v4 and MouseRef-8 v2.

  10. Human cell adhesion molecules: annotated functional subtypes and overrepresentation of addiction-associated genes

    PubMed Central

    Zhong, Xiaoming; Drgonova, Jana; Li, Chuan-Yun; Uhl, George R.

    2015-01-01

    Human cell adhesion molecules (CAMs) are essential both for a) proper development, modulation and maintenance of interactions between cells and for b) cell-to-cell (and matrix-to-cell) communication about these interactions. CAMs are thus key to proper development and plasticity of organs and tissues that include the brain. Despite recognition of the existence of these dual CAM roles and appreciation of the differential functional significance of these roles, there have been surprisingly few systematic studies that have carefully enumerated the universe of CAMs, identified the preferred roles for specific CAMs in distinct types of cellular connections and communication, or related these issues to specific brain disorders or brain circuits. In this paper, we substantially update and review the set of human genes that are likely to encode CAMs based on searches of databases, literature reviews and annotations. We describe the likely CAMs and the functional CAM subclasses into which they fall. These include “iCAMs”, whose contacts largely mediate cell to cell communication, those involved in focal adhesions, CAM genes whose products are preferentially involved with stereotyped and morphologically-identifiable connections between cells (adherens junctions, gap junctions) and smaller numbers of genes in other classes. We discuss a novel proposed mechanism involving selective anchoring of the constituents of iCAM-containing lipid rafts in zones of close neuronal apposition to membranes expressing binding partners of these iCAMs. CAM data from genetic and genomic studies of addiction in humans and mouse models provide examples of the ways in which CAM variation is likely to contribute to a specific brain-based disorder. We discuss how differences in CAM splicing mediated by differences in the addiction-associated splicing regulator RBFOX1/A2BP1 could enrich this picture. CAM expression in dopamine neurons provides one of the ways in which variations in cell adhesion

  11. Functional annotation of the transcriptome of Sorghum bicolor in response to osmotic stress and abscisic acid

    PubMed Central

    2011-01-01

    Background Higher plants exhibit remarkable phenotypic plasticity allowing them to adapt to an extensive range of environmental conditions. Sorghum is a cereal crop that exhibits exceptional tolerance to adverse conditions, in particular, water-limiting environments. This study utilized next generation sequencing (NGS) technology to examine the transcriptome of sorghum plants challenged with osmotic stress and exogenous abscisic acid (ABA) in order to elucidate genes and gene networks that contribute to sorghum's tolerance to water-limiting environments with a long-term aim of developing strategies to improve plant productivity under drought. Results RNA-Seq results revealed transcriptional activity of 28,335 unique genes from sorghum root and shoot tissues subjected to polyethylene glycol (PEG)-induced osmotic stress or exogenous ABA. Differential gene expression analyses in response to osmotic stress and ABA revealed a strong interplay among various metabolic pathways including abscisic acid and 13-lipoxygenase, salicylic acid, jasmonic acid, and plant defense pathways. Transcription factor analysis indicated that groups of genes may be co-regulated by similar regulatory sequences to which the expressed transcription factors bind. We successfully exploited the data presented here in conjunction with published transcriptome analyses for rice, maize, and Arabidopsis to discover more than 50 differentially expressed, drought-responsive gene orthologs for which no function had been previously ascribed. Conclusions The present study provides an initial assemblage of sorghum genes and gene networks regulated by osmotic stress and hormonal treatment. We are providing an RNA-Seq data set and an initial collection of transcription factors, which offer a preliminary look into the cascade of global gene expression patterns that arise in a drought tolerant crop subjected to abiotic stress. These resources will allow scientists to query gene expression and functional

  12. Quantifying structure-function uncertainty: a graph theoretical exploration into the origins and limitations of protein annotation.

    PubMed

    Shakhnovich, Boris E; Max Harvey, J

    2004-04-01

    Since the advent of investigations into structural genomics, research has focused on correctly identifying domain boundaries, as well as domain similarities and differences in the context of their evolutionary relationships. As the science of structural genomics ramps up adding more and more information into the databanks, questions about the accuracy and completeness of our classification and annotation systems appear on the forefront of this research. A central question of paramount importance is how structural similarity relates to functional similarity. Here, we begin to rigorously and quantitatively answer these questions by first exploring the consensus between the most common protein domain structure annotation databases CATH, SCOP and FSSP. Each of these databases explores the evolutionary relationships between protein domains using a combination of automatic and manual, structural and functional, continuous and discrete similarity measures. In order to examine the issue of consensus thoroughly, we build a generalized graph out of each of these databases and hierarchically cluster these graphs at interval thresholds. We then employ a distance measure to find regions of greatest overlap. Using this procedure we were able not only to enumerate the level of consensus between the different annotation systems, but also to define the graph-theoretical origins behind the annotation schema of class, family and superfamily by observing that the same thresholds that define the best consensus regions between FSSP, SCOP and CATH correspond to distinct, non-random phase-transitions in the structure comparison graph itself. To investigate the correspondence in divergence between structure and function further, we introduce a measure of functional entropy that calculates divergence in function space. First, we use this measure to calculate the general correlation between structural homology and functional proximity. We extend this analysis further by quantitatively

  13. The De Novo Transcriptome and Its Functional Annotation in the Seed Beetle Callosobruchus maculatus.

    PubMed

    Sayadi, Ahmed; Immonen, Elina; Bayram, Helen; Arnqvist, Göran

    2016-01-01

    Despite their unparalleled biodiversity, the genomic resources available for beetles (Coleoptera) remain relatively scarce. We present an integrative and high quality annotated transcriptome of the beetle Callosobruchus maculatus, an important and cosmopolitan agricultural pest as well as an emerging model species in ecology and evolutionary biology. Using Illumina sequencing technology, we sequenced 492 million read pairs generated from 51 samples of different developmental stages (larvae, pupae and adults) of C. maculatus. Reads were de novo assembled using the Trinity software, into a single combined assembly as well as into three separate assemblies based on data from the different developmental stages. The combined assembly generated 218,192 transcripts and 145,883 putative genes. Putative genes were annotated with the Blast2GO software and the Trinotate pipeline. In total, 33,216 putative genes were successfully annotated using Blastx against the Nr (non-redundant) database and 13,382 were assigned to 34,100 Gene Ontology (GO) terms. We classified 5,475 putative genes into Clusters of Orthologous Groups (COG) and 116 metabolic pathways maps were predicted based on the annotation. Our analyses suggested that the transcriptional specificity increases with ontogeny. For example, out of 33,216 annotated putative genes, 51 were only expressed in larvae, 63 only in pupae and 171 only in adults. Our study illustrates the importance of including samples from several developmental stages when the aim is to provide an integrative and high quality annotated transcriptome. Our results will represent an invaluable resource for those working with the ecology, evolution and pest control of C. maculatus, as well for comparative studies of the transcriptomics and genomics of beetles more generally. PMID:27442123

  14. The De Novo Transcriptome and Its Functional Annotation in the Seed Beetle Callosobruchus maculatus

    PubMed Central

    Sayadi, Ahmed; Immonen, Elina; Bayram, Helen

    2016-01-01

    Despite their unparalleled biodiversity, the genomic resources available for beetles (Coleoptera) remain relatively scarce. We present an integrative and high quality annotated transcriptome of the beetle Callosobruchus maculatus, an important and cosmopolitan agricultural pest as well as an emerging model species in ecology and evolutionary biology. Using Illumina sequencing technology, we sequenced 492 million read pairs generated from 51 samples of different developmental stages (larvae, pupae and adults) of C. maculatus. Reads were de novo assembled using the Trinity software, into a single combined assembly as well as into three separate assemblies based on data from the different developmental stages. The combined assembly generated 218,192 transcripts and 145,883 putative genes. Putative genes were annotated with the Blast2GO software and the Trinotate pipeline. In total, 33,216 putative genes were successfully annotated using Blastx against the Nr (non-redundant) database and 13,382 were assigned to 34,100 Gene Ontology (GO) terms. We classified 5,475 putative genes into Clusters of Orthologous Groups (COG) and 116 metabolic pathways maps were predicted based on the annotation. Our analyses suggested that the transcriptional specificity increases with ontogeny. For example, out of 33,216 annotated putative genes, 51 were only expressed in larvae, 63 only in pupae and 171 only in adults. Our study illustrates the importance of including samples from several developmental stages when the aim is to provide an integrative and high quality annotated transcriptome. Our results will represent an invaluable resource for those working with the ecology, evolution and pest control of C. maculatus, as well for comparative studies of the transcriptomics and genomics of beetles more generally. PMID:27442123

  15. Culturable diversity and functional annotation of psychrotrophic bacteria from cold desert of Leh Ladakh (India).

    PubMed

    Yadav, Ajar Nath; Sachan, Shashwati Ghosh; Verma, Priyanka; Tyagi, Satya Prakash; Kaushik, Rajeev; Saxena, Anil K

    2015-01-01

    To study culturable bacterial diversity under subzero temperature conditions and their possible functional annotation, soil and water samples from Leh Ladakh region were analysed. Ten different nutrient combinations were used to isolate the maximum possible culturable morphotypes. A total of 325 bacterial isolates were characterized employing 16S rDNA-Amplified Ribosomal DNA Restriction Analysis with three restriction endonucleases AluI, MspI and HaeIII, which led to formation of 23-40 groups for the different sites at 75 % similarity index, adding up to 175 groups. Phylogenetic analysis based on 16S rRNA gene sequencing led to the identification of 175 bacteria, grouped in four phyla, Firmicutes (54 %), Proteobacteria (28 %), Actinobacteria (16 %) and Bacteroidetes (3 %), and included 29 different genera with 57 distinct species. Overall 39 % of the total morphotypes belonged to the Bacillus and Bacillus derived genera (BBDG) followed by Pseudomonas (14 %), Arthrobacter (9 %), Exiguobacterium (8 %), Alishewanella (4 %), Brachybacterium, Providencia, Planococcus (3 %), Janthinobacterium, Sphingobacterium, Kocuria (2 %) and Aurantimonas, Citricoccus, Cellulosimicrobium, Brevundimonas, Desemzia, Flavobacterium, Klebsiella, Paracoccus, Psychrobacter, Sporosarcina, Staphylococcus, Sinobaca, Stenotrophomonas, Sanguibacter, Vibrio (1 %). The representative isolates from each cluster were screened for their plant growth promoting characteristics at low temperature (5-15 °C). Variations were observed among strains for production of ammonia, hydrogen cyanide, indole-3-acetic acid and siderophore, solubilisation of phosphate, 1-aminocyclopropane-1-carboxylate deaminase activity and biocontrol activity against Rhizoctonia solani and Macrophomina phaseolina. Cold adapted microbes may have application as inoculants and biocontrol agents in crops growing at high altitudes under cold climate condition.

  16. Integrative Tissue-Specific Functional Annotations in the Human Genome Provide Novel Insights on Many Complex Traits and Improve Signal Prioritization in Genome Wide Association Studies

    PubMed Central

    Wang, Qian; He, Beixin Julie; Zhao, Hongyu

    2016-01-01

    Extensive efforts have been made to understand genomic function through both experimental and computational approaches, yet proper annotation still remains challenging, especially in non-coding regions. In this manuscript, we introduce GenoSkyline, an unsupervised learning framework to predict tissue-specific functional regions through integrating high-throughput epigenetic annotations. GenoSkyline successfully identified a variety of non-coding regulatory machinery including enhancers, regulatory miRNA, and hypomethylated transposable elements in extensive case studies. Integrative analysis of GenoSkyline annotations and results from genome-wide association studies (GWAS) led to novel biological insights on the etiologies of a number of human complex traits. We also explored using tissue-specific functional annotations to prioritize GWAS signals and predict relevant tissue types for each risk locus. Brain and blood-specific annotations led to better prioritization performance for schizophrenia than standard GWAS p-values and non-tissue-specific annotations. As for coronary artery disease, heart-specific functional regions was highly enriched of GWAS signals, but previously identified risk loci were found to be most functional in other tissues, suggesting a substantial proportion of still undetected heart-related loci. In summary, GenoSkyline annotations can guide genetic studies at multiple resolutions and provide valuable insights in understanding complex diseases. GenoSkyline is available at http://genocanyon.med.yale.edu/GenoSkyline. PMID:27058395

  17. The power of EST sequence data: Relation to Acyrthosiphon pisum genome annotation and functional genomics initiatives

    Technology Transfer Automated Retrieval System (TEKTRAN)

    Genes important to aphid biology, survival and reproduction were successfully identified by use of a genomics approach. We created and described the Sequencing, compilation, and annotation of the approxiamtely 525Mb nuclear genome of the pea aphid, Acyrthosiphon pisum, which represents an important ...

  18. Trans-ethnic Meta-analysis and Functional Annotation Illuminates the Genetic Architecture of Fasting Glucose and Insulin.

    PubMed

    Liu, Ching-Ti; Raghavan, Sridharan; Maruthur, Nisa; Kabagambe, Edmond Kato; Hong, Jaeyoung; Ng, Maggie C Y; Hivert, Marie-France; Lu, Yingchang; An, Ping; Bentley, Amy R; Drolet, Anne M; Gaulton, Kyle J; Guo, Xiuqing; Armstrong, Loren L; Irvin, Marguerite R; Li, Man; Lipovich, Leonard; Rybin, Denis V; Taylor, Kent D; Agyemang, Charles; Palmer, Nicholette D; Cade, Brian E; Chen, Wei-Min; Dauriz, Marco; Delaney, Joseph A C; Edwards, Todd L; Evans, Daniel S; Evans, Michele K; Lange, Leslie A; Leong, Aaron; Liu, Jingmin; Liu, Yongmei; Nayak, Uma; Patel, Sanjay R; Porneala, Bianca C; Rasmussen-Torvik, Laura J; Snijder, Marieke B; Stallings, Sarah C; Tanaka, Toshiko; Yanek, Lisa R; Zhao, Wei; Becker, Diane M; Bielak, Lawrence F; Biggs, Mary L; Bottinger, Erwin P; Bowden, Donald W; Chen, Guanjie; Correa, Adolfo; Couper, David J; Crawford, Dana C; Cushman, Mary; Eicher, John D; Fornage, Myriam; Franceschini, Nora; Fu, Yi-Ping; Goodarzi, Mark O; Gottesman, Omri; Hara, Kazuo; Harris, Tamara B; Jensen, Richard A; Johnson, Andrew D; Jhun, Min A; Karter, Andrew J; Keller, Margaux F; Kho, Abel N; Kizer, Jorge R; Krauss, Ronald M; Langefeld, Carl D; Li, Xiaohui; Liang, Jingling; Liu, Simin; Lowe, William L; Mosley, Thomas H; North, Kari E; Pacheco, Jennifer A; Peyser, Patricia A; Patrick, Alan L; Rice, Kenneth M; Selvin, Elizabeth; Sims, Mario; Smith, Jennifer A; Tajuddin, Salman M; Vaidya, Dhananjay; Wren, Mary P; Yao, Jie; Zhu, Xiaofeng; Ziegler, Julie T; Zmuda, Joseph M; Zonderman, Alan B; Zwinderman, Aeilko H; Adeyemo, Adebowale; Boerwinkle, Eric; Ferrucci, Luigi; Hayes, M Geoffrey; Kardia, Sharon L R; Miljkovic, Iva; Pankow, James S; Rotimi, Charles N; Sale, Michele M; Wagenknecht, Lynne E; Arnett, Donna K; Chen, Yii-Der Ida; Nalls, Michael A; Province, Michael A; Kao, W H Linda; Siscovick, David S; Psaty, Bruce M; Wilson, James G; Loos, Ruth J F; Dupuis, Josée; Rich, Stephen S; Florez, Jose C; Rotter, Jerome I; Morris, Andrew P; Meigs, James B

    2016-07-01

    Knowledge of the genetic basis of the type 2 diabetes (T2D)-related quantitative traits fasting glucose (FG) and insulin (FI) in African ancestry (AA) individuals has been limited. In non-diabetic subjects of AA (n = 20,209) and European ancestry (EA; n = 57,292), we performed trans-ethnic (AA+EA) fine-mapping of 54 established EA FG or FI loci with detailed functional annotation, assessed their relevance in AA individuals, and sought previously undescribed loci through trans-ethnic (AA+EA) meta-analysis. We narrowed credible sets of variants driving association signals for 22/54 EA-associated loci; 18/22 credible sets overlapped with active islet-specific enhancers or transcription factor (TF) binding sites, and 21/22 contained at least one TF motif. Of the 54 EA-associated loci, 23 were shared between EA and AA. Replication with an additional 10,096 AA individuals identified two previously undescribed FI loci, chrX FAM133A (rs213676) and chr5 PELO (rs6450057). Trans-ethnic analyses with regulatory annotation illuminate the genetic architecture of glycemic traits and suggest gene regulation as a target to advance precision medicine for T2D. Our approach to utilize state-of-the-art functional annotation and implement trans-ethnic association analysis for discovery and fine-mapping offers a framework for further follow-up and characterization of GWAS signals of complex trait loci. PMID:27321945

  19. Trans-ethnic Meta-analysis and Functional Annotation Illuminates the Genetic Architecture of Fasting Glucose and Insulin.

    PubMed

    Liu, Ching-Ti; Raghavan, Sridharan; Maruthur, Nisa; Kabagambe, Edmond Kato; Hong, Jaeyoung; Ng, Maggie C Y; Hivert, Marie-France; Lu, Yingchang; An, Ping; Bentley, Amy R; Drolet, Anne M; Gaulton, Kyle J; Guo, Xiuqing; Armstrong, Loren L; Irvin, Marguerite R; Li, Man; Lipovich, Leonard; Rybin, Denis V; Taylor, Kent D; Agyemang, Charles; Palmer, Nicholette D; Cade, Brian E; Chen, Wei-Min; Dauriz, Marco; Delaney, Joseph A C; Edwards, Todd L; Evans, Daniel S; Evans, Michele K; Lange, Leslie A; Leong, Aaron; Liu, Jingmin; Liu, Yongmei; Nayak, Uma; Patel, Sanjay R; Porneala, Bianca C; Rasmussen-Torvik, Laura J; Snijder, Marieke B; Stallings, Sarah C; Tanaka, Toshiko; Yanek, Lisa R; Zhao, Wei; Becker, Diane M; Bielak, Lawrence F; Biggs, Mary L; Bottinger, Erwin P; Bowden, Donald W; Chen, Guanjie; Correa, Adolfo; Couper, David J; Crawford, Dana C; Cushman, Mary; Eicher, John D; Fornage, Myriam; Franceschini, Nora; Fu, Yi-Ping; Goodarzi, Mark O; Gottesman, Omri; Hara, Kazuo; Harris, Tamara B; Jensen, Richard A; Johnson, Andrew D; Jhun, Min A; Karter, Andrew J; Keller, Margaux F; Kho, Abel N; Kizer, Jorge R; Krauss, Ronald M; Langefeld, Carl D; Li, Xiaohui; Liang, Jingling; Liu, Simin; Lowe, William L; Mosley, Thomas H; North, Kari E; Pacheco, Jennifer A; Peyser, Patricia A; Patrick, Alan L; Rice, Kenneth M; Selvin, Elizabeth; Sims, Mario; Smith, Jennifer A; Tajuddin, Salman M; Vaidya, Dhananjay; Wren, Mary P; Yao, Jie; Zhu, Xiaofeng; Ziegler, Julie T; Zmuda, Joseph M; Zonderman, Alan B; Zwinderman, Aeilko H; Adeyemo, Adebowale; Boerwinkle, Eric; Ferrucci, Luigi; Hayes, M Geoffrey; Kardia, Sharon L R; Miljkovic, Iva; Pankow, James S; Rotimi, Charles N; Sale, Michele M; Wagenknecht, Lynne E; Arnett, Donna K; Chen, Yii-Der Ida; Nalls, Michael A; Province, Michael A; Kao, W H Linda; Siscovick, David S; Psaty, Bruce M; Wilson, James G; Loos, Ruth J F; Dupuis, Josée; Rich, Stephen S; Florez, Jose C; Rotter, Jerome I; Morris, Andrew P; Meigs, James B

    2016-07-01

    Knowledge of the genetic basis of the type 2 diabetes (T2D)-related quantitative traits fasting glucose (FG) and insulin (FI) in African ancestry (AA) individuals has been limited. In non-diabetic subjects of AA (n = 20,209) and European ancestry (EA; n = 57,292), we performed trans-ethnic (AA+EA) fine-mapping of 54 established EA FG or FI loci with detailed functional annotation, assessed their relevance in AA individuals, and sought previously undescribed loci through trans-ethnic (AA+EA) meta-analysis. We narrowed credible sets of variants driving association signals for 22/54 EA-associated loci; 18/22 credible sets overlapped with active islet-specific enhancers or transcription factor (TF) binding sites, and 21/22 contained at least one TF motif. Of the 54 EA-associated loci, 23 were shared between EA and AA. Replication with an additional 10,096 AA individuals identified two previously undescribed FI loci, chrX FAM133A (rs213676) and chr5 PELO (rs6450057). Trans-ethnic analyses with regulatory annotation illuminate the genetic architecture of glycemic traits and suggest gene regulation as a target to advance precision medicine for T2D. Our approach to utilize state-of-the-art functional annotation and implement trans-ethnic association analysis for discovery and fine-mapping offers a framework for further follow-up and characterization of GWAS signals of complex trait loci.

  20. Use of Modern Chemical Protein Synthesis and Advanced Fluorescent Assay Techniques to Experimentally Validate the Functional Annotation of Microbial Genomes

    SciTech Connect

    Kent, Stephen

    2012-07-20

    The objective of this research program was to prototype methods for the chemical synthesis of predicted protein molecules in annotated microbial genomes. High throughput chemical methods were to be used to make large numbers of predicted proteins and protein domains, based on microbial genome sequences. Microscale chemical synthesis methods for the parallel preparation of peptide-thioester building blocks were developed; these peptide segments are used for the parallel chemical synthesis of proteins and protein domains. Ultimately, it is envisaged that these synthetic molecules would be ‘printed’ in spatially addressable arrays. The unique ability of total synthesis to precision label protein molecules with dyes and with chemical or biochemical ‘tags’ can be used to facilitate novel assay technologies adapted from state-of-the art single molecule fluorescence detection techniques. In the future, in conjunction with modern laboratory automation this integrated set of techniques will enable high throughput experimental validation of the functional annotation of microbial genomes.

  1. Enriching the annotation of Mycobacterium tuberculosis H37Rv proteome using remote homology detection approaches: insights into structure and function.

    PubMed

    Ramakrishnan, Gayatri; Ochoa-Montaño, Bernardo; Raghavender, Upadhyayula S; Mudgal, Richa; Joshi, Adwait G; Chandra, Nagasuma R; Sowdhamini, Ramanathan; Blundell, Tom L; Srinivasan, Narayanaswamy

    2015-01-01

    The availability of the genome sequence of Mycobacterium tuberculosis H37Rv has encouraged determination of large numbers of protein structures and detailed definition of the biological information encoded therein; yet, the functions of many proteins in M. tuberculosis remain unknown. The emergence of multidrug resistant strains makes it a priority to exploit recent advances in homology recognition and structure prediction to re-analyse its gene products. Here we report the structural and functional characterization of gene products encoded in the M. tuberculosis genome, with the help of sensitive profile-based remote homology search and fold recognition algorithms resulting in an enhanced annotation of the proteome where 95% of the M. tuberculosis proteins were identified wholly or partly with information on structure or function. New information includes association of 244 proteins with 205 domain families and a separate set of new association of folds to 64 proteins. Extending structural information across uncharacterized protein families represented in the M. tuberculosis proteome, by determining superfamily relationships between families of known and unknown structures, has contributed to an enhancement in the knowledge of structural content. In retrospect, such superfamily relationships have facilitated recognition of probable structure and/or function for several uncharacterized protein families, eventually aiding recognition of probable functions for homologous proteins corresponding to such families. Gene products unique to mycobacteria for which no functions could be identified are 183. Of these 18 were determined to be M. tuberculosis specific. Such pathogen-specific proteins are speculated to harbour virulence factors required for pathogenesis. A re-annotated proteome of M. tuberculosis, with greater completeness of annotated proteins and domain assigned regions, provides a valuable basis for experimental endeavours designed to obtain a better

  2. Smoking Gun or Circumstantial Evidence? Comparison of Statistical Learning Methods using Functional Annotations for Prioritizing Risk Variants.

    PubMed

    Gagliano, Sarah A; Ravji, Reena; Barnes, Michael R; Weale, Michael E; Knight, Jo

    2015-08-24

    Although technology has triumphed in facilitating routine genome sequencing, new challenges have been created for the data-analyst. Genome-scale surveys of human variation generate volumes of data that far exceed capabilities for laboratory characterization. By incorporating functional annotations as predictors, statistical learning has been widely investigated for prioritizing genetic variants likely to be associated with complex disease. We compared three published prioritization procedures, which use different statistical learning algorithms and different predictors with regard to the quantity, type and coding. We also explored different combinations of algorithm and annotation set. As an application, we tested which methodology performed best for prioritizing variants using data from a large schizophrenia meta-analysis by the Psychiatric Genomics Consortium. Results suggest that all methods have considerable (and similar) predictive accuracies (AUCs 0.64-0.71) in test set data, but there is more variability in the application to the schizophrenia GWAS. In conclusion, a variety of algorithms and annotations seem to have a similar potential to effectively enrich true risk variants in genome-scale datasets, however none offer more than incremental improvement in prediction. We discuss how methods might be evolved for risk variant prediction to address the impending bottleneck of the new generation of genome re-sequencing studies.

  3. Smoking Gun or Circumstantial Evidence? Comparison of Statistical Learning Methods using Functional Annotations for Prioritizing Risk Variants

    PubMed Central

    Gagliano, Sarah A.; Ravji, Reena; Barnes, Michael R.; Weale, Michael E.; Knight, Jo

    2015-01-01

    Although technology has triumphed in facilitating routine genome sequencing, new challenges have been created for the data-analyst. Genome-scale surveys of human variation generate volumes of data that far exceed capabilities for laboratory characterization. By incorporating functional annotations as predictors, statistical learning has been widely investigated for prioritizing genetic variants likely to be associated with complex disease. We compared three published prioritization procedures, which use different statistical learning algorithms and different predictors with regard to the quantity, type and coding. We also explored different combinations of algorithm and annotation set. As an application, we tested which methodology performed best for prioritizing variants using data from a large schizophrenia meta-analysis by the Psychiatric Genomics Consortium. Results suggest that all methods have considerable (and similar) predictive accuracies (AUCs 0.64–0.71) in test set data, but there is more variability in the application to the schizophrenia GWAS. In conclusion, a variety of algorithms and annotations seem to have a similar potential to effectively enrich true risk variants in genome-scale datasets, however none offer more than incremental improvement in prediction. We discuss how methods might be evolved for risk variant prediction to address the impending bottleneck of the new generation of genome re-sequencing studies. PMID:26300220

  4. Identification of novel biomass-degrading enzymes from genomic dark matter: Populating genomic sequence space with functional annotation.

    PubMed

    Piao, Hailan; Froula, Jeff; Du, Changbin; Kim, Tae-Wan; Hawley, Erik R; Bauer, Stefan; Wang, Zhong; Ivanova, Nathalia; Clark, Douglas S; Klenk, Hans-Peter; Hess, Matthias

    2014-08-01

    Although recent nucleotide sequencing technologies have significantly enhanced our understanding of microbial genomes, the function of ∼35% of genes identified in a genome currently remains unknown. To improve the understanding of microbial genomes and consequently of microbial processes it will be crucial to assign a function to this "genomic dark matter." Due to the urgent need for additional carbohydrate-active enzymes for improved production of transportation fuels from lignocellulosic biomass, we screened the genomes of more than 5,500 microorganisms for hypothetical proteins that are located in the proximity of already known cellulases. We identified, synthesized and expressed a total of 17 putative cellulase genes with insufficient sequence similarity to currently known cellulases to be identified as such using traditional sequence annotation techniques that rely on significant sequence similarity. The recombinant proteins of the newly identified putative cellulases were subjected to enzymatic activity assays to verify their hydrolytic activity towards cellulose and lignocellulosic biomass. Eleven (65%) of the tested enzymes had significant activity towards at least one of the substrates. This high success rate highlights that a gene context-based approach can be used to assign function to genes that are otherwise categorized as "genomic dark matter" and to identify biomass-degrading enzymes that have little sequence similarity to already known cellulases. The ability to assign function to genes that have no related sequence representatives with functional annotation will be important to enhance our understanding of microbial processes and to identify microbial proteins for a wide range of applications.

  5. Generation, functional annotation and comparative analysis of black spruce (Picea mariana) ESTs: an important conifer genomic resource

    PubMed Central

    2013-01-01

    Background EST (expressed sequence tag) sequences and their annotation provide a highly valuable resource for gene discovery, genome sequence annotation, and other genomics studies that can be applied in genetics, breeding and conservation programs for non-model organisms. Conifers are long-lived plants that are ecologically and economically important globally, and have a large genome size. Black spruce (Picea mariana), is a transcontinental species of the North American boreal and temperate forests. However, there are limited transcriptomic and genomic resources for this species. The primary objective of our study was to develop a black spruce transcriptomic resource to facilitate on-going functional genomics projects related to growth and adaptation to climate change. Results We conducted bidirectional sequencing of cDNA clones from a standard cDNA library constructed from black spruce needle tissues. We obtained 4,594 high quality (2,455 5' end and 2,139 3' end) sequence reads, with an average read-length of 532 bp. Clustering and assembly of ESTs resulted in 2,731 unique sequences, consisting of 2,234 singletons and 497 contigs. Approximately two-thirds (63%) of unique sequences were functionally annotated. Genes involved in 36 molecular functions and 90 biological processes were discovered, including 24 putative transcription factors and 232 genes involved in photosynthesis. Most abundantly expressed transcripts were associated with photosynthesis, growth factors, stress and disease response, and transcription factors. A total of 216 full-length genes were identified. About 18% (493) of the transcripts were novel, representing an important addition to the Genbank EST database (dbEST). Fifty-seven di-, tri-, tetra- and penta-nucleotide simple sequence repeats were identified. Conclusions We have developed the first high quality EST resource for black spruce and identified 493 novel transcripts, which may be species-specific related to life history and

  6. BIOFILTER AS A FUNCTIONAL ANNOTATION PIPELINE FOR COMMON AND RARE COPY NUMBER BURDEN.

    PubMed

    Kim, Dokyoon; Lucas, Anastasia; Glessner, Joseph; Verma, Shefali S; Bradford, Yuki; Li, Ruowang; Frase, Alex T; Hakonarson, Hakon; Peissig, Peggy; Brilliant, Murray; Ritchie, Marylyn D

    2016-01-01

    Recent studies on copy number variation (CNV) have suggested that an increasing burden of CNVs is associated with susceptibility or resistance to disease. A large number of genes or genomic loci contribute to complex diseases such as autism. Thus, total genomic copy number burden, as an accumulation of copy number change, is a meaningful measure of genomic instability to identify the association between global genetic effects and phenotypes of interest. However, no systematic annotation pipeline has been developed to interpret biological meaning based on the accumulation of copy number change across the genome associated with a phenotype of interest. In this study, we develop a comprehensive and systematic pipeline for annotating copy number variants into genes/genomic regions and subsequently pathways and other gene groups using Biofilter - a bioinformatics tool that aggregates over a dozen publicly available databases of prior biological knowledge. Next we conduct enrichment tests of biologically defined groupings of CNVs including genes, pathways, Gene Ontology, or protein families. We applied the proposed pipeline to a CNV dataset from the Marshfield Clinic Personalized Medicine Research Project (PMRP) in a quantitative trait phenotype derived from the electronic health record - total cholesterol. We identified several significant pathways such as toll-like receptor signaling pathway and hepatitis C pathway, gene ontologies (GOs) of nucleoside triphosphatase activity (NTPase) and response to virus, and protein families such as cell morphogenesis that are associated with the total cholesterol phenotype based on CNV profiles (permutation p-value < 0.01). Based on the copy number burden analysis, it follows that the more and larger the copy number changes, the more likely that one or more target genes that influence disease risk and phenotypic severity will be affected. Thus, our study suggests the proposed enrichment pipeline could improve the interpretability of

  7. Accurate Semilocal Density Functional for Condensed-Matter Physics and Quantum Chemistry

    NASA Astrophysics Data System (ADS)

    Tao, Jianmin; Mo, Yuxiang

    2016-08-01

    Most density functionals have been developed by imposing the known exact constraints on the exchange-correlation energy, or by a fit to a set of properties of selected systems, or by both. However, accurate modeling of the conventional exchange hole presents a great challenge, due to the delocalization of the hole. Making use of the property that the hole can be made localized under a general coordinate transformation, here we derive an exchange hole from the density matrix expansion, while the correlation part is obtained by imposing the low-density limit constraint. From the hole, a semilocal exchange-correlation functional is calculated. Our comprehensive test shows that this functional can achieve remarkable accuracy for diverse properties of molecules, solids, and solid surfaces, substantially improving upon the nonempirical functionals proposed in recent years. Accurate semilocal functionals based on their associated holes are physically appealing and practically useful for developing nonlocal functionals.

  8. Accurate Semilocal Density Functional for Condensed-Matter Physics and Quantum Chemistry.

    PubMed

    Tao, Jianmin; Mo, Yuxiang

    2016-08-12

    Most density functionals have been developed by imposing the known exact constraints on the exchange-correlation energy, or by a fit to a set of properties of selected systems, or by both. However, accurate modeling of the conventional exchange hole presents a great challenge, due to the delocalization of the hole. Making use of the property that the hole can be made localized under a general coordinate transformation, here we derive an exchange hole from the density matrix expansion, while the correlation part is obtained by imposing the low-density limit constraint. From the hole, a semilocal exchange-correlation functional is calculated. Our comprehensive test shows that this functional can achieve remarkable accuracy for diverse properties of molecules, solids, and solid surfaces, substantially improving upon the nonempirical functionals proposed in recent years. Accurate semilocal functionals based on their associated holes are physically appealing and practically useful for developing nonlocal functionals.

  9. Accurate Semilocal Density Functional for Condensed-Matter Physics and Quantum Chemistry.

    PubMed

    Tao, Jianmin; Mo, Yuxiang

    2016-08-12

    Most density functionals have been developed by imposing the known exact constraints on the exchange-correlation energy, or by a fit to a set of properties of selected systems, or by both. However, accurate modeling of the conventional exchange hole presents a great challenge, due to the delocalization of the hole. Making use of the property that the hole can be made localized under a general coordinate transformation, here we derive an exchange hole from the density matrix expansion, while the correlation part is obtained by imposing the low-density limit constraint. From the hole, a semilocal exchange-correlation functional is calculated. Our comprehensive test shows that this functional can achieve remarkable accuracy for diverse properties of molecules, solids, and solid surfaces, substantially improving upon the nonempirical functionals proposed in recent years. Accurate semilocal functionals based on their associated holes are physically appealing and practically useful for developing nonlocal functionals. PMID:27563956

  10. Accurate FDTD modelling for dispersive media using rational function and particle swarm optimisation

    NASA Astrophysics Data System (ADS)

    Chung, Haejun; Ha, Sang-Gyu; Choi, Jaehoon; Jung, Kyung-Young

    2015-07-01

    This article presents an accurate finite-difference time domain (FDTD) dispersive modelling suitable for complex dispersive media. A quadratic complex rational function (QCRF) is used to characterise their dispersive relations. To obtain accurate coefficients of QCRF, in this work, we use an analytical approach and a particle swarm optimisation (PSO) simultaneously. In specific, an analytical approach is used to obtain the QCRF matrix-solving equation and PSO is applied to adjust a weighting function of this equation. Numerical examples are used to illustrate the validity of the proposed FDTD dispersion model.

  11. Proteomics and transcriptomics of the BABA-induced resistance response in potato using a novel functional annotation approach

    PubMed Central

    2014-01-01

    Background Induced resistance (IR) can be part of a sustainable plant protection strategy against important plant diseases. β-aminobutyric acid (BABA) can induce resistance in a wide range of plants against several types of pathogens, including potato infected with Phytophthora infestans. However, the molecular mechanisms behind this are unclear and seem to be dependent on the system studied. To elucidate the defence responses activated by BABA in potato, a genome-wide transcript microarray analysis in combination with label-free quantitative proteomics analysis of the apoplast secretome were performed two days after treatment of the leaf canopy with BABA at two concentrations, 1 and 10 mM. Results Over 5000 transcripts were differentially expressed and over 90 secretome proteins changed in abundance indicating a massive activation of defence mechanisms with 10 mM BABA, the concentration effective against late blight disease. To aid analysis, we present a more comprehensive functional annotation of the microarray probes and gene models by retrieving information from orthologous gene families across 26 sequenced plant genomes. The new annotation provided GO terms to 8616 previously un-annotated probes. Conclusions BABA at 10 mM affected several processes related to plant hormones and amino acid metabolism. A major accumulation of PR proteins was also evident, and in the mevalonate pathway, genes involved in sterol biosynthesis were down-regulated, whereas several enzymes involved in the sesquiterpene phytoalexin biosynthesis were up-regulated. Interestingly, abscisic acid (ABA) responsive genes were not as clearly regulated by BABA in potato as previously reported in Arabidopsis. Together these findings provide candidates and markers for improved resistance in potato, one of the most important crops in the world. PMID:24773703

  12. More accurate fitting of {sup 125}I and {sup 103}Pd radial dose functions

    SciTech Connect

    Taylor, R. E. P.; Rogers, D. W. O.

    2008-09-15

    In this study an improved functional form for fitting the radial dose functions, g(r), of {sup 125}I and {sup 103}Pd brachytherapy seeds is presented. The new function is capable of accurately fitting radial dose functions over ranges as large as 0.05 cm{<=}r{<=}10 cm for {sup 125}I seeds and 0.10 cm{<=}r{<=}10 cm for {sup 103}Pd seeds. The average discrepancies between fit and calculated data are less than 0.5% over the full range of fit and maximum discrepancies are 2% or less. The fitting function is also capable of accounting for the sharp increase in g(r) (upturn) seen for some sources for r<0.1 cm. This upturn has previously been attributed to the breakdown of the approximation of the sources as a line, however, in this study we demonstrate that another contributing factor is the 4.5 keV characteristic x-rays emitted from the Ti seed casing. Radial dose functions are calculated for 18 {sup 125}I seeds and 9 {sup 103}Pd seeds using the EGSnrc Monte Carlo user-code BrachyDose. Fitting coefficients of the new function are tabulated for all 27 seeds. Extrapolation characteristics of the function are also investigated. The new functional form is an improvement over currently used fitting functions with its main strength being the ability to accurately fit the rapidly varying radial dose function at small distances. The new function is an excellent candidate for fitting the radial dose function of all {sup 103}Pd and {sup 125}I brachytherapy seeds and will increase the accuracy of dose distributions calculated around brachytherapy seeds using the TG-43 protocol over a wider range of data. More accurate values of g(r) for r<0.5 cm may be particularly important in the treatment of ocular melanoma.

  13. A guide to best practices for Gene Ontology (GO) manual annotation.

    PubMed

    Balakrishnan, Rama; Harris, Midori A; Huntley, Rachael; Van Auken, Kimberly; Cherry, J Michael

    2013-01-01

    The Gene Ontology Consortium (GOC) is a community-based bioinformatics project that classifies gene product function through the use of structured controlled vocabularies. A fundamental application of the Gene Ontology (GO) is in the creation of gene product annotations, evidence-based associations between GO definitions and experimental or sequence-based analysis. Currently, the GOC disseminates 126 million annotations covering >374,000 species including all the kingdoms of life. This number includes two classes of GO annotations: those created manually by experienced biocurators reviewing the literature or by examination of biological data (1.1 million annotations covering 2226 species) and those generated computationally via automated methods. As manual annotations are often used to propagate functional predictions between related proteins within and between genomes, it is critical to provide accurate consistent manual annotations. Toward this goal, we present here the conventions defined by the GOC for the creation of manual annotation. This guide represents the best practices for manual annotation as established by the GOC project over the past 12 years. We hope this guide will encourage research communities to annotate gene products of their interest to enhance the corpus of GO annotations available to all. DATABASE URL: http://www.geneontology.org.

  14. Experimental Strategies for Functional Annotation and Metabolism Discovery: Targeted Screening of Solute Binding Proteins and Unbiased Panning of Metabolomes

    PubMed Central

    2015-01-01

    The rate at which genome sequencing data is accruing demands enhanced methods for functional annotation and metabolism discovery. Solute binding proteins (SBPs) facilitate the transport of the first reactant in a metabolic pathway, thereby constraining the regions of chemical space and the chemistries that must be considered for pathway reconstruction. We describe high-throughput protein production and differential scanning fluorimetry platforms, which enabled the screening of 158 SBPs against a 189 component library specifically tailored for this class of proteins. Like all screening efforts, this approach is limited by the practical constraints imposed by construction of the library, i.e., we can study only those metabolites that are known to exist and which can be made in sufficient quantities for experimentation. To move beyond these inherent limitations, we illustrate the promise of crystallographic- and mass spectrometric-based approaches for the unbiased use of entire metabolomes as screening libraries. Together, our approaches identified 40 new SBP ligands, generated experiment-based annotations for 2084 SBPs in 71 isofunctional clusters, and defined numerous metabolic pathways, including novel catabolic pathways for the utilization of ethanolamine as sole nitrogen source and the use of d-Ala-d-Ala as sole carbon source. These efforts begin to define an integrated strategy for realizing the full value of amassing genome sequence data. PMID:25540822

  15. Experimental strategies for functional annotation and metabolism discovery: targeted screening of solute binding proteins and unbiased panning of metabolomes.

    PubMed

    Vetting, Matthew W; Al-Obaidi, Nawar; Zhao, Suwen; San Francisco, Brian; Kim, Jungwook; Wichelecki, Daniel J; Bouvier, Jason T; Solbiati, Jose O; Vu, Hoan; Zhang, Xinshuai; Rodionov, Dmitry A; Love, James D; Hillerich, Brandan S; Seidel, Ronald D; Quinn, Ronald J; Osterman, Andrei L; Cronan, John E; Jacobson, Matthew P; Gerlt, John A; Almo, Steven C

    2015-01-27

    The rate at which genome sequencing data is accruing demands enhanced methods for functional annotation and metabolism discovery. Solute binding proteins (SBPs) facilitate the transport of the first reactant in a metabolic pathway, thereby constraining the regions of chemical space and the chemistries that must be considered for pathway reconstruction. We describe high-throughput protein production and differential scanning fluorimetry platforms, which enabled the screening of 158 SBPs against a 189 component library specifically tailored for this class of proteins. Like all screening efforts, this approach is limited by the practical constraints imposed by construction of the library, i.e., we can study only those metabolites that are known to exist and which can be made in sufficient quantities for experimentation. To move beyond these inherent limitations, we illustrate the promise of crystallographic- and mass spectrometric-based approaches for the unbiased use of entire metabolomes as screening libraries. Together, our approaches identified 40 new SBP ligands, generated experiment-based annotations for 2084 SBPs in 71 isofunctional clusters, and defined numerous metabolic pathways, including novel catabolic pathways for the utilization of ethanolamine as sole nitrogen source and the use of d-Ala-d-Ala as sole carbon source. These efforts begin to define an integrated strategy for realizing the full value of amassing genome sequence data.

  16. Integration of multiethnic fine-mapping and genomic annotation to prioritize candidate functional SNPs at prostate cancer susceptibility regions

    PubMed Central

    Han, Ying; Hazelett, Dennis J.; Wiklund, Fredrik; Schumacher, Fredrick R.; Stram, Daniel O.; Berndt, Sonja I.; Wang, Zhaoming; Rand, Kristin A.; Hoover, Robert N.; Machiela, Mitchell J.; Yeager, Merideth; Burdette, Laurie; Chung, Charles C.; Hutchinson, Amy; Yu, Kai; Xu, Jianfeng; Travis, Ruth C.; Key, Timothy J.; Siddiq, Afshan; Canzian, Federico; Takahashi, Atsushi; Kubo, Michiaki; Stanford, Janet L.; Kolb, Suzanne; Gapstur, Susan M.; Diver, W. Ryan; Stevens, Victoria L.; Strom, Sara S.; Pettaway, Curtis A.; Al Olama, Ali Amin; Kote-Jarai, Zsofia; Eeles, Rosalind A.; Yeboah, Edward D.; Tettey, Yao; Biritwum, Richard B.; Adjei, Andrew A.; Tay, Evelyn; Truelove, Ann; Niwa, Shelley; Chokkalingam, Anand P.; Isaacs, William B.; Chen, Constance; Lindstrom, Sara; Le Marchand, Loic; Giovannucci, Edward L.; Pomerantz, Mark; Long, Henry; Li, Fugen; Ma, Jing; Stampfer, Meir; John, Esther M.; Ingles, Sue A.; Kittles, Rick A.; Murphy, Adam B.; Blot, William J.; Signorello, Lisa B.; Zheng, Wei; Albanes, Demetrius; Virtamo, Jarmo; Weinstein, Stephanie; Nemesure, Barbara; Carpten, John; Leske, M. Cristina; Wu, Suh-Yuh; Hennis, Anselm J. M.; Rybicki, Benjamin A.; Neslund-Dudas, Christine; Hsing, Ann W.; Chu, Lisa; Goodman, Phyllis J.; Klein, Eric A.; Zheng, S. Lilly; Witte, John S.; Casey, Graham; Riboli, Elio; Li, Qiyuan; Freedman, Matthew L.; Hunter, David J.; Gronberg, Henrik; Cook, Michael B.; Nakagawa, Hidewaki; Kraft, Peter; Chanock, Stephen J.; Easton, Douglas F.; Henderson, Brian E.; Coetzee, Gerhard A.; Conti, David V.; Haiman, Christopher A.

    2015-01-01

    Interpretation of biological mechanisms underlying genetic risk associations for prostate cancer is complicated by the relatively large number of risk variants (n = 100) and the thousands of surrogate SNPs in linkage disequilibrium. Here, we combined three distinct approaches: multiethnic fine-mapping, putative functional annotation (based upon epigenetic data and genome-encoded features), and expression quantitative trait loci (eQTL) analyses, in an attempt to reduce this complexity. We examined 67 risk regions using genotyping and imputation-based fine-mapping in populations of European (cases/controls: 8600/6946), African (cases/controls: 5327/5136), Japanese (cases/controls: 2563/4391) and Latino (cases/controls: 1034/1046) ancestry. Markers at 55 regions passed a region-specific significance threshold (P-value cutoff range: 3.9 × 10−4–5.6 × 10−3) and in 30 regions we identified markers that were more significantly associated with risk than the previously reported variants in the multiethnic sample. Novel secondary signals (P < 5.0 × 10−6) were also detected in two regions (rs13062436/3q21 and rs17181170/3p12). Among 666 variants in the 55 regions with P-values within one order of magnitude of the most-associated marker, 193 variants (29%) in 48 regions overlapped with epigenetic or other putative functional marks. In 11 of the 55 regions, cis-eQTLs were detected with nearby genes. For 12 of the 55 regions (22%), the most significant region-specific, prostate-cancer associated variant represented the strongest candidate functional variant based on our annotations; the number of regions increased to 20 (36%) and 27 (49%) when examining the 2 and 3 most significantly associated variants in each region, respectively. These results have prioritized subsets of candidate variants for downstream functional evaluation. PMID:26162851

  17. Integration of multiethnic fine-mapping and genomic annotation to prioritize candidate functional SNPs at prostate cancer susceptibility regions.

    PubMed

    Han, Ying; Hazelett, Dennis J; Wiklund, Fredrik; Schumacher, Fredrick R; Stram, Daniel O; Berndt, Sonja I; Wang, Zhaoming; Rand, Kristin A; Hoover, Robert N; Machiela, Mitchell J; Yeager, Merideth; Burdette, Laurie; Chung, Charles C; Hutchinson, Amy; Yu, Kai; Xu, Jianfeng; Travis, Ruth C; Key, Timothy J; Siddiq, Afshan; Canzian, Federico; Takahashi, Atsushi; Kubo, Michiaki; Stanford, Janet L; Kolb, Suzanne; Gapstur, Susan M; Diver, W Ryan; Stevens, Victoria L; Strom, Sara S; Pettaway, Curtis A; Al Olama, Ali Amin; Kote-Jarai, Zsofia; Eeles, Rosalind A; Yeboah, Edward D; Tettey, Yao; Biritwum, Richard B; Adjei, Andrew A; Tay, Evelyn; Truelove, Ann; Niwa, Shelley; Chokkalingam, Anand P; Isaacs, William B; Chen, Constance; Lindstrom, Sara; Le Marchand, Loic; Giovannucci, Edward L; Pomerantz, Mark; Long, Henry; Li, Fugen; Ma, Jing; Stampfer, Meir; John, Esther M; Ingles, Sue A; Kittles, Rick A; Murphy, Adam B; Blot, William J; Signorello, Lisa B; Zheng, Wei; Albanes, Demetrius; Virtamo, Jarmo; Weinstein, Stephanie; Nemesure, Barbara; Carpten, John; Leske, M Cristina; Wu, Suh-Yuh; Hennis, Anselm J M; Rybicki, Benjamin A; Neslund-Dudas, Christine; Hsing, Ann W; Chu, Lisa; Goodman, Phyllis J; Klein, Eric A; Zheng, S Lilly; Witte, John S; Casey, Graham; Riboli, Elio; Li, Qiyuan; Freedman, Matthew L; Hunter, David J; Gronberg, Henrik; Cook, Michael B; Nakagawa, Hidewaki; Kraft, Peter; Chanock, Stephen J; Easton, Douglas F; Henderson, Brian E; Coetzee, Gerhard A; Conti, David V; Haiman, Christopher A

    2015-10-01

    Interpretation of biological mechanisms underlying genetic risk associations for prostate cancer is complicated by the relatively large number of risk variants (n = 100) and the thousands of surrogate SNPs in linkage disequilibrium. Here, we combined three distinct approaches: multiethnic fine-mapping, putative functional annotation (based upon epigenetic data and genome-encoded features), and expression quantitative trait loci (eQTL) analyses, in an attempt to reduce this complexity. We examined 67 risk regions using genotyping and imputation-based fine-mapping in populations of European (cases/controls: 8600/6946), African (cases/controls: 5327/5136), Japanese (cases/controls: 2563/4391) and Latino (cases/controls: 1034/1046) ancestry. Markers at 55 regions passed a region-specific significance threshold (P-value cutoff range: 3.9 × 10(-4)-5.6 × 10(-3)) and in 30 regions we identified markers that were more significantly associated with risk than the previously reported variants in the multiethnic sample. Novel secondary signals (P < 5.0 × 10(-6)) were also detected in two regions (rs13062436/3q21 and rs17181170/3p12). Among 666 variants in the 55 regions with P-values within one order of magnitude of the most-associated marker, 193 variants (29%) in 48 regions overlapped with epigenetic or other putative functional marks. In 11 of the 55 regions, cis-eQTLs were detected with nearby genes. For 12 of the 55 regions (22%), the most significant region-specific, prostate-cancer associated variant represented the strongest candidate functional variant based on our annotations; the number of regions increased to 20 (36%) and 27 (49%) when examining the 2 and 3 most significantly associated variants in each region, respectively. These results have prioritized subsets of candidate variants for downstream functional evaluation.

  18. Dielectric-dependent Density Functionals for Accurate Electronic Structure Calculations of Molecules and Solids

    NASA Astrophysics Data System (ADS)

    Skone, Jonathan; Govoni, Marco; Galli, Giulia

    Dielectric-dependent hybrid [DDH] functionals have recently been shown to yield highly accurate energy gaps and dielectric constants for a wide variety of solids, at a computational cost considerably less than standard GW calculations. The fraction of exact exchange included in the definition of DDH functionals depends (self-consistently) on the dielectric constant of the material. In the present talk we introduce a range-separated (RS) version of DDH functionals where short and long-range components are matched using material dependent, non-empirical parameters. Comparing with state of the art GW calculations and experiment, we show that such RS hybrids yield accurate electronic properties of both molecules and solids, including energy gaps, photoelectron spectra and absolute ionization potentials. This work was supported by NSF-CCI Grant Number NSF-CHE-0802907 and DOE-BES.

  19. The Recipe for Protein Sequence-Based Function Prediction and Its Implementation in the ANNOTATOR Software Environment.

    PubMed

    Eisenhaber, Birgit; Kuchibhatla, Durga; Sherman, Westley; Sirota, Fernanda L; Berezovsky, Igor N; Wong, Wing-Cheong; Eisenhaber, Frank

    2016-01-01

    As biomolecular sequencing is becoming the main technique in life sciences, functional interpretation of sequences in terms of biomolecular mechanisms with in silico approaches is getting increasingly significant. Function prediction tools are most powerful for protein-coding sequences; yet, the concepts and technologies used for this purpose are not well reflected in bioinformatics textbooks. Notably, protein sequences typically consist of globular domains and non-globular segments. The two types of regions require cardinally different approaches for function prediction. Whereas the former are classic targets for homology-inspired function transfer based on remnant, yet statistically significant sequence similarity to other, characterized sequences, the latter type of regions are characterized by compositional bias or simple, repetitive patterns and require lexical analysis and/or empirical sequence pattern-function correlations. The recipe for function prediction recommends first to find all types of non-globular segments and, then, to subject the remaining query sequence to sequence similarity searches. We provide an updated description of the ANNOTATOR software environment as an advanced example of a software platform that facilitates protein sequence-based function prediction. PMID:27115649

  20. The Recipe for Protein Sequence-Based Function Prediction and Its Implementation in the ANNOTATOR Software Environment.

    PubMed

    Eisenhaber, Birgit; Kuchibhatla, Durga; Sherman, Westley; Sirota, Fernanda L; Berezovsky, Igor N; Wong, Wing-Cheong; Eisenhaber, Frank

    2016-01-01

    As biomolecular sequencing is becoming the main technique in life sciences, functional interpretation of sequences in terms of biomolecular mechanisms with in silico approaches is getting increasingly significant. Function prediction tools are most powerful for protein-coding sequences; yet, the concepts and technologies used for this purpose are not well reflected in bioinformatics textbooks. Notably, protein sequences typically consist of globular domains and non-globular segments. The two types of regions require cardinally different approaches for function prediction. Whereas the former are classic targets for homology-inspired function transfer based on remnant, yet statistically significant sequence similarity to other, characterized sequences, the latter type of regions are characterized by compositional bias or simple, repetitive patterns and require lexical analysis and/or empirical sequence pattern-function correlations. The recipe for function prediction recommends first to find all types of non-globular segments and, then, to subject the remaining query sequence to sequence similarity searches. We provide an updated description of the ANNOTATOR software environment as an advanced example of a software platform that facilitates protein sequence-based function prediction.

  1. WS-SNPs&GO: a web server for predicting the deleterious effect of human protein variants using functional annotation

    PubMed Central

    2013-01-01

    Background SNPs&GO is a method for the prediction of deleterious Single Amino acid Polymorphisms (SAPs) using protein functional annotation. In this work, we present the web server implementation of SNPs&GO (WS-SNPs&GO). The server is based on Support Vector Machines (SVM) and for a given protein, its input comprises: the sequence and/or its three-dimensional structure (when available), a set of target variations and its functional Gene Ontology (GO) terms. The output of the server provides, for each protein variation, the probabilities to be associated to human diseases. Results The server consists of two main components, including updated versions of the sequence-based SNPs&GO (recently scored as one of the best algorithms for predicting deleterious SAPs) and of the structure-based SNPs&GO3d programs. Sequence and structure based algorithms are extensively tested on a large set of annotated variations extracted from the SwissVar database. Selecting a balanced dataset with more than 38,000 SAPs, the sequence-based approach achieves 81% overall accuracy, 0.61 correlation coefficient and an Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC) curve of 0.88. For the subset of ~6,600 variations mapped on protein structures available at the Protein Data Bank (PDB), the structure-based method scores with 84% overall accuracy, 0.68 correlation coefficient, and 0.91 AUC. When tested on a new blind set of variations, the results of the server are 79% and 83% overall accuracy for the sequence-based and structure-based inputs, respectively. Conclusions WS-SNPs&GO is a valuable tool that includes in a unique framework information derived from protein sequence, structure, evolutionary profile, and protein function. WS-SNPs&GO is freely available at http://snps.biofold.org/snps-and-go. PMID:23819482

  2. Analysis and Functional Annotation of an Expressed Sequence Tag Collection for Tropical Crop Sugarcane

    PubMed Central

    Vettore, André L.; da Silva, Felipe R.; Kemper, Edson L.; Souza, Glaucia M.; da Silva, Aline M.; Ferro, Maria Inês T.; Henrique-Silva, Flavio; Giglioti, Éder A.; Lemos, Manoel V.F.; Coutinho, Luiz L.; Nobrega, Marina P.; Carrer, Helaine; França, Suzelei C.; Bacci, Maurício; Goldman, Maria Helena S.; Gomes, Suely L.; Nunes, Luiz R.; Camargo, Luis E.A.; Siqueira, Walter J.; Van Sluys, Marie-Anne; Thiemann, Otavio H.; Kuramae, Eiko E.; Santelli, Roberto V.; Marino, Celso L.; Targon, Maria L.P.N.; Ferro, Jesus A.; Silveira, Henrique C.S.; Marini, Danyelle C.; Lemos, Eliana G.M.; Monteiro-Vitorello, Claudia B.; Tambor, José H.M.; Carraro, Dirce M.; Roberto, Patrícia G.; Martins, Vanderlei G.; Goldman, Gustavo H.; de Oliveira, Regina C.; Truffi, Daniela; Colombo, Carlos A.; Rossi, Magdalena; de Araujo, Paula G.; Sculaccio, Susana A.; Angella, Aline; Lima, Marleide M.A.; de Rosa, Vicente E.; Siviero, Fábio; Coscrato, Virginia E.; Machado, Marcos A.; Grivet, Laurent; Di Mauro, Sonia M.Z.; Nobrega, Francisco G.; Menck, Carlos F.M.; Braga, Marilia D.V.; Telles, Guilherme P.; Cara, Frank A.A.; Pedrosa, Guilherme; Meidanis, João; Arruda, Paulo

    2003-01-01

    To contribute to our understanding of the genome complexity of sugarcane, we undertook a large-scale expressed sequence tag (EST) program. More than 260,000 cDNA clones were partially sequenced from 26 standard cDNA libraries generated from different sugarcane tissues. After the processing of the sequences, 237,954 high-quality ESTs were identified. These ESTs were assembled into 43,141 putative transcripts. Of the assembled sequences, 35.6% presented no matches with existing sequences in public databases. A global analysis of the whole SUCEST data set indicated that 14,409 assembled sequences (33% of the total) contained at least one cDNA clone with a full-length insert. Annotation of the 43,141 assembled sequences associated almost 50% of the putative identified sugarcane genes with protein metabolism, cellular communication/signal transduction, bioenergetics, and stress responses. Inspection of the translated assembled sequences for conserved protein domains revealed 40,821 amino acid sequences with 1415 Pfam domains. Reassembling the consensus sequences of the 43,141 transcripts revealed a 22% redundancy in the first assembling. This indicated that possibly 33,620 unique genes had been identified and indicated that >90% of the sugarcane expressed genes were tagged. PMID:14613979

  3. The genome sequence of Leishmania (Leishmania) amazonensis: functional annotation and extended analysis of gene models.

    PubMed

    Real, Fernando; Vidal, Ramon Oliveira; Carazzolle, Marcelo Falsarella; Mondego, Jorge Maurício Costa; Costa, Gustavo Gilson Lacerda; Herai, Roberto Hirochi; Würtele, Martin; de Carvalho, Lucas Miguel; Carmona e Ferreira, Renata; Mortara, Renato Arruda; Barbiéri, Clara Lucia; Mieczkowski, Piotr; da Silveira, José Franco; Briones, Marcelo Ribeiro da Silva; Pereira, Gonçalo Amarante Guimarães; Bahia, Diana

    2013-12-01

    We present the sequencing and annotation of the Leishmania (Leishmania) amazonensis genome, an etiological agent of human cutaneous leishmaniasis in the Amazon region of Brazil. L. (L.) amazonensis shares features with Leishmania (L.) mexicana but also exhibits unique characteristics regarding geographical distribution and clinical manifestations of cutaneous lesions (e.g. borderline disseminated cutaneous leishmaniasis). Predicted genes were scored for orthologous gene families and conserved domains in comparison with other human pathogenic Leishmania spp. Carboxypeptidase, aminotransferase, and 3'-nucleotidase genes and ATPase, thioredoxin, and chaperone-related domains were represented more abundantly in L. (L.) amazonensis and L. (L.) mexicana species. Phylogenetic analysis revealed that these two species share groups of amastin surface proteins unique to the genus that could be related to specific features of disease outcomes and host cell interactions. Additionally, we describe a hypothetical hybrid interactome of potentially secreted L. (L.) amazonensis proteins and host proteins under the assumption that parasite factors mimic their mammalian counterparts. The model predicts an interaction between an L. (L.) amazonensis heat-shock protein and mammalian Toll-like receptor 9, which is implicated in important immune responses such as cytokine and nitric oxide production. The analysis presented here represents valuable information for future studies of leishmaniasis pathogenicity and treatment. PMID:23857904

  4. The Genome Sequence of Leishmania (Leishmania) amazonensis: Functional Annotation and Extended Analysis of Gene Models

    PubMed Central

    Real, Fernando; Vidal, Ramon Oliveira; Carazzolle, Marcelo Falsarella; Mondego, Jorge Maurício Costa; Costa, Gustavo Gilson Lacerda; Herai, Roberto Hirochi; Würtele, Martin; de Carvalho, Lucas Miguel; e Ferreira, Renata Carmona; Mortara, Renato Arruda; Barbiéri, Clara Lucia; Mieczkowski, Piotr; da Silveira, José Franco; Briones, Marcelo Ribeiro da Silva; Pereira, Gonçalo Amarante Guimarães; Bahia, Diana

    2013-01-01

    We present the sequencing and annotation of the Leishmania (Leishmania) amazonensis genome, an etiological agent of human cutaneous leishmaniasis in the Amazon region of Brazil. L. (L.) amazonensis shares features with Leishmania (L.) mexicana but also exhibits unique characteristics regarding geographical distribution and clinical manifestations of cutaneous lesions (e.g. borderline disseminated cutaneous leishmaniasis). Predicted genes were scored for orthologous gene families and conserved domains in comparison with other human pathogenic Leishmania spp. Carboxypeptidase, aminotransferase, and 3′-nucleotidase genes and ATPase, thioredoxin, and chaperone-related domains were represented more abundantly in L. (L.) amazonensis and L. (L.) mexicana species. Phylogenetic analysis revealed that these two species share groups of amastin surface proteins unique to the genus that could be related to specific features of disease outcomes and host cell interactions. Additionally, we describe a hypothetical hybrid interactome of potentially secreted L. (L.) amazonensis proteins and host proteins under the assumption that parasite factors mimic their mammalian counterparts. The model predicts an interaction between an L. (L.) amazonensis heat-shock protein and mammalian Toll-like receptor 9, which is implicated in important immune responses such as cytokine and nitric oxide production. The analysis presented here represents valuable information for future studies of leishmaniasis pathogenicity and treatment. PMID:23857904

  5. An accurate Fortran code for computing hydrogenic continuum wave functions at a wide range of parameters

    NASA Astrophysics Data System (ADS)

    Peng, Liang-You; Gong, Qihuang

    2010-12-01

    The accurate computations of hydrogenic continuum wave functions are very important in many branches of physics such as electron-atom collisions, cold atom physics, and atomic ionization in strong laser fields, etc. Although there already exist various algorithms and codes, most of them are only reliable in a certain ranges of parameters. In some practical applications, accurate continuum wave functions need to be calculated at extremely low energies, large radial distances and/or large angular momentum number. Here we provide such a code, which can generate accurate hydrogenic continuum wave functions and corresponding Coulomb phase shifts at a wide range of parameters. Without any essential restrict to angular momentum number, the present code is able to give reliable results at the electron energy range [10,10] eV for radial distances of [10,10] a.u. We also find the present code is very efficient, which should find numerous applications in many fields such as strong field physics. Program summaryProgram title: HContinuumGautchi Catalogue identifier: AEHD_v1_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AEHD_v1_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html No. of lines in distributed program, including test data, etc.: 1233 No. of bytes in distributed program, including test data, etc.: 7405 Distribution format: tar.gz Programming language: Fortran90 in fixed format Computer: AMD Processors Operating system: Linux RAM: 20 MBytes Classification: 2.7, 4.5 Nature of problem: The accurate computation of atomic continuum wave functions is very important in many research fields such as strong field physics and cold atom physics. Although there have already existed various algorithms and codes, most of them can only be applicable and reliable in a certain range of parameters. We present here an accurate FORTRAN program for

  6. Doubly hybrid density functional for accurate descriptions of nonbond interactions, thermochemistry, and thermochemical kinetics

    PubMed Central

    Zhang, Ying; Xu, Xin; Goddard, William A.

    2009-01-01

    We develop and validate a density functional, XYG3, based on the adiabatic connection formalism and the Görling–Levy coupling-constant perturbation expansion to the second order (PT2). XYG3 is a doubly hybrid functional, containing 3 mixing parameters. It has a nonlocal orbital-dependent component in the exchange term (exact exchange) plus information about the unoccupied Kohn–Sham orbitals in the correlation part (PT2 double excitation). XYG3 is remarkably accurate for thermochemistry, reaction barrier heights, and nonbond interactions of main group molecules. In addition, the accuracy remains nearly constant with system size. PMID:19276116

  7. High-throughput comparison, functional annotation, and metabolic modeling of plant genomes using the PlantSEED resource

    Technology Transfer Automated Retrieval System (TEKTRAN)

    The increasing number of sequenced plant genomes is placing new demands on the methods applied to analyze, annotate, and model these genomes. Today's annotation pipelines result in inconsistent gene assignments that complicate comparative analyses and prevent efficient construction of metabolic mode...

  8. Functional Annotation Analytics of Bacillus Genomes Reveals Stress Responsive Acetate Utilization and Sulfate Uptake in the Biotechnologically Relevant Bacillus megaterium.

    PubMed

    Williams, Baraka S; Isokpehi, Raphael D; Mbah, Andreas N; Hollman, Antoinesha L; Bernard, Christina O; Simmons, Shaneka S; Ayensu, Wellington K; Garner, Bianca L

    2012-01-01

    Bacillus species form an heterogeneous group of Gram-positive bacteria that include members that are disease-causing, biotechnologically-relevant, and can serve as biological research tools. A common feature of Bacillus species is their ability to survive in harsh environmental conditions by formation of resistant endospores. Genes encoding the universal stress protein (USP) domain confer cellular and organismal survival during unfavorable conditions such as nutrient depletion. As of February 2012, the genome sequences and a variety of functional annotations for at least 123 Bacillus isolates including 45 Bacillus cereus isolates were available in public domain bioinformatics resources. Additionally, the genome sequencing status of 10 of the B. cereus isolates were annotated as finished with each genome encoded 3 USP genes. The conservation of gene neighborhood of the 140 aa universal stress protein in the B. cereus genomes led to the identification of a predicted plasmid-encoded transcriptional unit that includes a USP gene and a sulfate uptake gene in the soil-inhabiting Bacillus megaterium. Gene neighborhood analysis combined with visual analytics of chemical ligand binding sites data provided knowledge-building biological insights on possible cellular functions of B. megaterium universal stress proteins. These functions include sulfate and potassium uptake, acid extrusion, cellular energy-level sensing, survival in high oxygen conditions and acetate utilization. Of particular interest was a two-gene transcriptional unit that consisted of genes for a universal stress protein and a sirtuin Sir2 (deacetylase enzyme for NAD+-dependent acetate utilization). The predicted transcriptional units for stress responsive inorganic sulfate uptake and acetate utilization could explain biological mechanisms for survival of soil-inhabiting Bacillus species in sulfate and acetate limiting conditions. Considering the key role of sirtuins in mammalian physiology additional

  9. Functional Annotation Analytics of Bacillus Genomes Reveals Stress Responsive Acetate Utilization and Sulfate Uptake in the Biotechnologically Relevant Bacillus megaterium

    PubMed Central

    Williams, Baraka S.; Isokpehi, Raphael D.; Mbah, Andreas N.; Hollman, Antoinesha L.; Bernard, Christina O.; Simmons, Shaneka S.; Ayensu, Wellington K.; Garner, Bianca L.

    2012-01-01

    Bacillus species form an heterogeneous group of Gram-positive bacteria that include members that are disease-causing, biotechnologically-relevant, and can serve as biological research tools. A common feature of Bacillus species is their ability to survive in harsh environmental conditions by formation of resistant endospores. Genes encoding the universal stress protein (USP) domain confer cellular and organismal survival during unfavorable conditions such as nutrient depletion. As of February 2012, the genome sequences and a variety of functional annotations for at least 123 Bacillus isolates including 45 Bacillus cereus isolates were available in public domain bioinformatics resources. Additionally, the genome sequencing status of 10 of the B. cereus isolates were annotated as finished with each genome encoded 3 USP genes. The conservation of gene neighborhood of the 140 aa universal stress protein in the B. cereus genomes led to the identification of a predicted plasmid-encoded transcriptional unit that includes a USP gene and a sulfate uptake gene in the soil-inhabiting Bacillus megaterium. Gene neighborhood analysis combined with visual analytics of chemical ligand binding sites data provided knowledge-building biological insights on possible cellular functions of B. megaterium universal stress proteins. These functions include sulfate and potassium uptake, acid extrusion, cellular energy-level sensing, survival in high oxygen conditions and acetate utilization. Of particular interest was a two-gene transcriptional unit that consisted of genes for a universal stress protein and a sirtuin Sir2 (deacetylase enzyme for NAD+-dependent acetate utilization). The predicted transcriptional units for stress responsive inorganic sulfate uptake and acetate utilization could explain biological mechanisms for survival of soil-inhabiting Bacillus species in sulfate and acetate limiting conditions. Considering the key role of sirtuins in mammalian physiology additional

  10. Functional annotation of native enhancers with a Cas9 -histone demethylase fusion

    PubMed Central

    Tabak, Barbara; Genga, Ryan M; Silverstein, Noah J; Garber, Manuel; Maehr, René

    2015-01-01

    Understanding of mammalian enhancer function is limited by the lack of a technology to rapidly and thoroughly test their cell type-specific function. Here, we use a nuclease-deficient (d)Cas9 histone demethylase fusion to functionally characterize previously described and novel enhancer elements for their roles in the embryonic stem cell state. Further, we distinguish the mechanism of action of dCas9-LSD1 at enhancers from previous dCas9-effectors. PMID:25775043

  11. Next-Generation High-Throughput Functional Annotation of Microbial Genomes

    PubMed Central

    Baric, Ralph S.; Damania, Blossom; Miller, Samuel I.; Rubin, Eric J.

    2016-01-01

    ABSTRACT Host infection by microbial pathogens cues global changes in microbial and host cell biology that facilitate microbial replication and disease. The complete maps of thousands of bacterial and viral genomes have recently been defined; however, the rate at which physiological or biochemical functions have been assigned to genes has greatly lagged. The National Institute of Allergy and Infectious Diseases (NIAID) addressed this gap by creating functional genomics centers dedicated to developing high-throughput approaches to assign gene function. These centers require broad-based and collaborative research programs to generate and integrate diverse data to achieve a comprehensive understanding of microbial pathogenesis. High-throughput functional genomics can lead to new therapeutics and better understanding of the next generation of emerging pathogens by rapidly defining new general mechanisms by which organisms cause disease and replicate in host tissues and by facilitating the rate at which functional data reach the scientific community. PMID:27703071

  12. Integrative bioinformatics for functional genome annotation: trawling for G protein-coupled receptors.

    PubMed

    Flower, Darren R; Attwood, Teresa K

    2004-12-01

    G protein-coupled receptors (GPCR) are amongst the best studied and most functionally diverse types of cell-surface protein. The importance of GPCRs as mediates or cell function and organismal developmental underlies their involvement in key physiological roles and their prominence as targets for pharmacological therapeutics. In this review, we highlight the requirement for integrated protocols which underline the different perspectives offered by different sequence analysis methods. BLAST and FastA offer broad brush strokes. Motif-based search methods add the fine detail. Structural modelling offers another perspective which allows us to elucidate the physicochemical properties that underlie ligand binding. Together, these different views provide a more informative and a more detailed picture of GPCR structure and function. Many GPCRs remain orphan receptors with no identified ligand, yet as computer-driven functional genomics starts to elaborate their functions, a new understanding of their roles in cell and developmental biology will follow. PMID:15561589

  13. Multiple-Resonance Local Wave Functions for Accurate Excited States in Quantum Monte Carlo.

    PubMed

    Zulfikri, Habiburrahman; Amovilli, Claudio; Filippi, Claudia

    2016-03-01

    We introduce a novel class of local multideterminant Jastrow-Slater wave functions for the efficient and accurate treatment of excited states in quantum Monte Carlo. The wave function is expanded as a linear combination of excitations built from multiple sets of localized orbitals that correspond to the bonding patterns of the different Lewis resonance structures of the molecule. We capitalize on the concept of orbital domains of local coupled-cluster methods, which is here applied to the active space to select the orbitals to correlate and construct the important transitions. The excitations are further grouped into classes, which are ordered in importance and can be systematically included in the Jastrow-Slater wave function to ensure a balanced description of all states of interest. We assess the performance of the proposed wave function in the calculation of vertical excitation energies and excited-state geometry optimization of retinal models whose π → π* state has a strong intramolecular charge-transfer character. We find that our multiresonance wave functions recover the reference values of the total energies of the ground and excited states with only a small number of excitations and that the same expansion can be flexibly used at very different geometries. Furthermore, significant computational saving can also be gained in the orbital optimization step by selectively mixing occupied and virtual orbitals based on spatial considerations without loss of accuracy on the excitation energy. Our multiresonance wave functions are therefore compact, accurate, and very promising for the calculation of multiple excited states of different character in large molecules.

  14. Accurate evaluation of the angular-dependent direct correlation function of water

    NASA Astrophysics Data System (ADS)

    Zhao, Shuangliang; Liu, Honglai; Ramirez, Rosa; Borgis, Daniel

    2013-07-01

    The direct correlation function (DCF) plays a pivotal role in addressing the thermodynamic properties with non-mean-field statistical theories of liquid state. This work provides an accurate yet efficient calculation procedure for evaluating the angular-dependent DCF of bulk SPC/E water. The DCF here represented in a discrete angles basis is computed with two typical steps: the first step involves solving the molecular Ornstein-Zernike equation with the input of total correlation function extracted from simulation; the resultant DCF is then polished in second step at small wavelength for all orientations in order to match correct thermodynamic properties. This function is also discussed in terms of its rotational invariant components. In particular, we show that the component c112(r) that accounts for dipolar symmetry reaches already its long-range asymptotic behavior at a short distance of 4 Å. With the knowledge of DCF, the angular-dependent bridge function of bulk water is thereafter computed and discussed in comparison with referenced hard-sphere bridge functions. We conclude that, even though such hard-sphere bridge functions may be relevant to improve the calculation of Helmholtz free energies in integral equations or density functional theory, they are doomed to fail at a structural level.

  15. Accurate first-principles structures and energies of diversely bonded systems from an efficient density functional

    NASA Astrophysics Data System (ADS)

    Sun, Jianwei; Remsing, Richard C.; Zhang, Yubo; Sun, Zhaoru; Ruzsinszky, Adrienn; Peng, Haowei; Yang, Zenghui; Paul, Arpita; Waghmare, Umesh; Wu, Xifan; Klein, Michael L.; Perdew, John P.

    2016-09-01

    One atom or molecule binds to another through various types of bond, the strengths of which range from several meV to several eV. Although some computational methods can provide accurate descriptions of all bond types, those methods are not efficient enough for many studies (for example, large systems, ab initio molecular dynamics and high-throughput searches for functional materials). Here, we show that the recently developed non-empirical strongly constrained and appropriately normed (SCAN) meta-generalized gradient approximation (meta-GGA) within the density functional theory framework predicts accurate geometries and energies of diversely bonded molecules and materials (including covalent, metallic, ionic, hydrogen and van der Waals bonds). This represents a significant improvement at comparable efficiency over its predecessors, the GGAs that currently dominate materials computation. Often, SCAN matches or improves on the accuracy of a computationally expensive hybrid functional, at almost-GGA cost. SCAN is therefore expected to have a broad impact on chemistry and materials science.

  16. Do Bond Functions Help for the Calculation of Accurate Bond Energies?

    NASA Technical Reports Server (NTRS)

    Bauschlicher, Charles W., Jr.; Arnold, James (Technical Monitor)

    1998-01-01

    The bond energies of 8 chemically bound diatomics are computed using several basis sets with and without bond functions (BF). The bond energies obtained using the aug-pVnZ+BF basis sets (with a correction for basis set superposition error, BSSE) tend to be slightly smaller that the results obtained using the aug-pV(n+I)Z basis sets, but slightly larger than the BSSE corrected aug-pV(n+I)Z results. The aug-cc-pVDZ+BF and aug-cc-pVTZ+BF basis sets yield reasonable estimates of bond energies, but, in most cases, these results cannot be considered highly accurate. Extrapolation of the results obtained with basis sets including bond functions appears to be inferior to the results obtained by extrapolation using atom-centered basis sets. Therefore bond functions do not appear to offer a path for obtaining highly accurate results for chemically bound systems at a lower computational cost than atom centered basis sets.

  17. Accurate first-principles structures and energies of diversely bonded systems from an efficient density functional.

    PubMed

    Sun, Jianwei; Remsing, Richard C; Zhang, Yubo; Sun, Zhaoru; Ruzsinszky, Adrienn; Peng, Haowei; Yang, Zenghui; Paul, Arpita; Waghmare, Umesh; Wu, Xifan; Klein, Michael L; Perdew, John P

    2016-09-01

    One atom or molecule binds to another through various types of bond, the strengths of which range from several meV to several eV. Although some computational methods can provide accurate descriptions of all bond types, those methods are not efficient enough for many studies (for example, large systems, ab initio molecular dynamics and high-throughput searches for functional materials). Here, we show that the recently developed non-empirical strongly constrained and appropriately normed (SCAN) meta-generalized gradient approximation (meta-GGA) within the density functional theory framework predicts accurate geometries and energies of diversely bonded molecules and materials (including covalent, metallic, ionic, hydrogen and van der Waals bonds). This represents a significant improvement at comparable efficiency over its predecessors, the GGAs that currently dominate materials computation. Often, SCAN matches or improves on the accuracy of a computationally expensive hybrid functional, at almost-GGA cost. SCAN is therefore expected to have a broad impact on chemistry and materials science. PMID:27554409

  18. Functional annotation of novel lineage-specific genes using co-expression and promoter analysis

    PubMed Central

    2010-01-01

    Background The diversity of placental architectures within and among mammalian orders is believed to be the result of adaptive evolution. Although, the genetic basis for these differences is unknown, some may arise from rapidly diverging and lineage-specific genes. Previously, we identified 91 novel lineage-specific transcripts (LSTs) from a cow term-placenta cDNA library, which are excellent candidates for adaptive placental functions acquired by the ruminant lineage. The aim of the present study was to infer functions of previously uncharacterized lineage-specific genes (LSGs) using co-expression, promoter, pathway and network analysis. Results Clusters of co-expressed genes preferentially expressed in liver, placenta and thymus were found using 49 previously uncharacterized LSTs as seeds. Over-represented composite transcription factor binding sites (TFBS) in promoters of clustered LSGs and known genes were then identified computationally. Functions were inferred for nine previously uncharacterized LSGs using co-expression analysis and pathway analysis tools. Our results predict that these LSGs may function in cell signaling, glycerophospholipid/fatty acid metabolism, protein trafficking, regulatory processes in the nucleus, and processes that initiate parturition and immune system development. Conclusions The placenta is a rich source of lineage-specific genes that function in the adaptive evolution of placental architecture and functions. We have shown that co-expression, promoter, and gene network analyses are useful methods to infer functions of LSGs with heretofore unknown functions. Our results indicate that many LSGs are involved in cellular recognition and developmental processes. Furthermore, they provide guidance for experimental approaches to validate the functions of LSGs and to study their evolution. PMID:20214810

  19. Transcriptomic Analysis of the Endangered Neritid Species Clithon retropictus: De Novo Assembly, Functional Annotation, and Marker Discovery

    PubMed Central

    Park, So Young; Patnaik, Bharat Bhusan; Kang, Se Won; Hwang, Hee-Ju; Chung, Jong Min; Song, Dae Kwon; Sang, Min Kyu; Patnaik, Hongray Howrelia; Lee, Jae Bong; Noh, Mi Young; Kim, Changmu; Kim, Soonok; Park, Hong Seog; Lee, Jun Sang; Han, Yeon Soo; Lee, Yong Seok

    2016-01-01

    An aquatic gastropod belonging to the family Neritidae, Clithon retropictus is listed as an endangered class II species in South Korea. The lack of information on its genomic background limits the ability to obtain functional data resources and inhibits informed conservation planning for this species. In the present study, the transcriptomic sequencing and de novo assembly of C. retropictus generated a total of 241,696,750 high-quality reads. These assembled to 282,838 unigenes with mean and N50 lengths of 736.9 and 1201 base pairs, respectively. Of these, 125,616 unigenes were subjected to annotation analysis with known proteins in Protostome DB, COG, GO, and KEGG protein databases (BLASTX; E ≤ 0.00001) and with known nucleotides in the Unigene database (BLASTN; E ≤ 0.00001). The GO analysis indicated that cellular process, cell, and catalytic activity are the predominant GO terms in the biological process, cellular component, and molecular function categories, respectively. In addition, 2093 unigenes were distributed in 107 different KEGG pathways. Furthermore, 49,280 simple sequence repeats were identified in the unigenes (>1 kilobase sequences). This is the first report on the identification of transcriptomic and microsatellite resources for C. retropictus, which opens up the possibility of exploring traits related to the adaptation and acclimatization of this species. PMID:27455329

  20. Transcriptomic Analysis of the Endangered Neritid Species Clithon retropictus: De Novo Assembly, Functional Annotation, and Marker Discovery.

    PubMed

    Park, So Young; Patnaik, Bharat Bhusan; Kang, Se Won; Hwang, Hee-Ju; Chung, Jong Min; Song, Dae Kwon; Sang, Min Kyu; Patnaik, Hongray Howrelia; Lee, Jae Bong; Noh, Mi Young; Kim, Changmu; Kim, Soonok; Park, Hong Seog; Lee, Jun Sang; Han, Yeon Soo; Lee, Yong Seok

    2016-01-01

    An aquatic gastropod belonging to the family Neritidae, Clithon retropictus is listed as an endangered class II species in South Korea. The lack of information on its genomic background limits the ability to obtain functional data resources and inhibits informed conservation planning for this species. In the present study, the transcriptomic sequencing and de novo assembly of C. retropictus generated a total of 241,696,750 high-quality reads. These assembled to 282,838 unigenes with mean and N50 lengths of 736.9 and 1201 base pairs, respectively. Of these, 125,616 unigenes were subjected to annotation analysis with known proteins in Protostome DB, COG, GO, and KEGG protein databases (BLASTX; E ≤ 0.00001) and with known nucleotides in the Unigene database (BLASTN; E ≤ 0.00001). The GO analysis indicated that cellular process, cell, and catalytic activity are the predominant GO terms in the biological process, cellular component, and molecular function categories, respectively. In addition, 2093 unigenes were distributed in 107 different KEGG pathways. Furthermore, 49,280 simple sequence repeats were identified in the unigenes (>1 kilobase sequences). This is the first report on the identification of transcriptomic and microsatellite resources for C. retropictus, which opens up the possibility of exploring traits related to the adaptation and acclimatization of this species. PMID:27455329

  1. Transcriptomic Analysis of the Endangered Neritid Species Clithon retropictus: De Novo Assembly, Functional Annotation, and Marker Discovery.

    PubMed

    Park, So Young; Patnaik, Bharat Bhusan; Kang, Se Won; Hwang, Hee-Ju; Chung, Jong Min; Song, Dae Kwon; Sang, Min Kyu; Patnaik, Hongray Howrelia; Lee, Jae Bong; Noh, Mi Young; Kim, Changmu; Kim, Soonok; Park, Hong Seog; Lee, Jun Sang; Han, Yeon Soo; Lee, Yong Seok

    2016-07-22

    An aquatic gastropod belonging to the family Neritidae, Clithon retropictus is listed as an endangered class II species in South Korea. The lack of information on its genomic background limits the ability to obtain functional data resources and inhibits informed conservation planning for this species. In the present study, the transcriptomic sequencing and de novo assembly of C. retropictus generated a total of 241,696,750 high-quality reads. These assembled to 282,838 unigenes with mean and N50 lengths of 736.9 and 1201 base pairs, respectively. Of these, 125,616 unigenes were subjected to annotation analysis with known proteins in Protostome DB, COG, GO, and KEGG protein databases (BLASTX; E ≤ 0.00001) and with known nucleotides in the Unigene database (BLASTN; E ≤ 0.00001). The GO analysis indicated that cellular process, cell, and catalytic activity are the predominant GO terms in the biological process, cellular component, and molecular function categories, respectively. In addition, 2093 unigenes were distributed in 107 different KEGG pathways. Furthermore, 49,280 simple sequence repeats were identified in the unigenes (>1 kilobase sequences). This is the first report on the identification of transcriptomic and microsatellite resources for C. retropictus, which opens up the possibility of exploring traits related to the adaptation and acclimatization of this species.

  2. mBEEF: An accurate semi-local Bayesian error estimation density functional

    NASA Astrophysics Data System (ADS)

    Wellendorff, Jess; Lundgaard, Keld T.; Jacobsen, Karsten W.; Bligaard, Thomas

    2014-04-01

    We present a general-purpose meta-generalized gradient approximation (MGGA) exchange-correlation functional generated within the Bayesian error estimation functional framework [J. Wellendorff, K. T. Lundgaard, A. Møgelhøj, V. Petzold, D. D. Landis, J. K. Nørskov, T. Bligaard, and K. W. Jacobsen, Phys. Rev. B 85, 235149 (2012)]. The functional is designed to give reasonably accurate density functional theory (DFT) predictions of a broad range of properties in materials physics and chemistry, while exhibiting a high degree of transferability. Particularly, it improves upon solid cohesive energies and lattice constants over the BEEF-vdW functional without compromising high performance on adsorption and reaction energies. We thus expect it to be particularly well-suited for studies in surface science and catalysis. An ensemble of functionals for error estimation in DFT is an intrinsic feature of exchange-correlation models designed this way, and we show how the Bayesian ensemble may provide a systematic analysis of the reliability of DFT based simulations.

  3. Reliable Spectroscopic Constants for CCH-, NH2- and Their Isotopomers from an Accurate Potential Energy Function

    NASA Technical Reports Server (NTRS)

    Lee, Timothy J.; Dateo, Christopher E.; Schwenke, David W.; Chaban, Galina M.

    2005-01-01

    Accurate quartic force fields have been determined for the CCH- and NH2- molecular anions using the singles and doubles coupled-cluster method that includes a perturbational estimate of the effects of connected triple excitations, CCSD(T). Very large one-particle basis sets have been used including diffuse functions and up through g-type functions. Correlation of the nitrogen and carbon core electrons has been included, as well as other "small" effects, such as the diagonal Born-Oppenheimer correction, and basis set extrapolation, and corrections for higher-order correlation effects and scalar relativistic effects. Fundamental vibrational frequencies have been computed using standard second-order perturbation theory as well as variational methods. Comparison with the available experimental data is presented and discussed. The implications of our research for the astronomical observation of molecular anions will be discussed.

  4. Screened exchange hybrid density functional for accurate and efficient structures and interaction energies.

    PubMed

    Brandenburg, Jan Gerit; Caldeweyher, Eike; Grimme, Stefan

    2016-06-21

    We extend the recently introduced PBEh-3c global hybrid density functional [S. Grimme et al., J. Chem. Phys., 2015, 143, 054107] by a screened Fock exchange variant based on the Henderson-Janesko-Scuseria exchange hole model. While the excellent performance of the global hybrid is maintained for small covalently bound molecules, its performance for computed condensed phase mass densities is further improved. Most importantly, a speed up of 30 to 50% can be achieved and especially for small orbital energy gap cases, the method is numerically much more robust. The latter point is important for many applications, e.g., for metal-organic frameworks, organic semiconductors, or protein structures. This enables an accurate density functional based electronic structure calculation of a full DNA helix structure on a single core desktop computer which is presented as an example in addition to comprehensive benchmark results. PMID:27240749

  5. Functional annotation of rare gene aberration drivers of pancreatic cancer | Office of Cancer Genomics

    Cancer.gov

    As we enter the era of precision medicine, characterization of cancer genomes will directly influence therapeutic decisions in the clinic. Here we describe a platform enabling functionalization of rare gene mutations through their high-throughput construction, molecular barcoding and delivery to cancer models for in vivo tumour driver screens. We apply these technologies to identify oncogenic drivers of pancreatic ductal adenocarcinoma (PDAC).

  6. Morgan’s Legacy: Fruit Flies and the Functional Annotation of Conserved Genes

    PubMed Central

    Bellen, Hugo J.; Yamamoto, Shinya

    2016-01-01

    In 1915, “The Mechanism of Mendelian Heredity” was published by four prominent Drosophila geneticists. They discovered that genes form linkage groups on chromosomes inherited in a Mendelian fashion and laid the genetic foundation that promoted Drosophila as a model organism. Flies continue to offer great opportunities, including studies in the field of functional genomics. PMID:26406362

  7. Accurate determination of renal function in patients with intestinal urinary diversions

    SciTech Connect

    McDougal, W.S.; Koch, M.O.

    1986-06-01

    The regular determination of renal function is a critical part of the management of patients who have had the urinary tract reconstructed with intestinal segments. These intestinal segments reabsorb urinary solutes and, thereby, complicate the determination of renal function by conventional methods. Urinary clearances of urea, creatinine and inulin were performed in patients with intestinal segments in the urinary tract and controls under varying diuretic conditions. Patients with intestinal diversions also underwent radioisotopic determination of renal function. The urinary clearances of urea, creatinine and inulin are highly dependent on the rate of urine flow in patients with intestinal segments in the urinary tract. Diuresis maximizes the urinary clearances of these solutes by minimizing intestinal reabsorption. Creatinine clearance prediction from the serum creatinine underestimates true glomerular filtration rate. Radioisotopic determination of renal function correlates poorly with true glomerular filtration rate. Only creatinine clearance measured under diuretic conditions correlates well with true renal function. Urine concentrating ability cannot be assessed accurately in patients with intestinal segments in the urinary tract, since osmolality rapidly equilibrates across the segments.

  8. Annotation of gene function in citrus using gene expression information and co-expression networks

    PubMed Central

    2014-01-01

    Background The genus Citrus encompasses major cultivated plants such as sweet orange, mandarin, lemon and grapefruit, among the world’s most economically important fruit crops. With increasing volumes of transcriptomics data available for these species, Gene Co-expression Network (GCN) analysis is a viable option for predicting gene function at a genome-wide scale. GCN analysis is based on a “guilt-by-association” principle whereby genes encoding proteins involved in similar and/or related biological processes may exhibit similar expression patterns across diverse sets of experimental conditions. While bioinformatics resources such as GCN analysis are widely available for efficient gene function prediction in model plant species including Arabidopsis, soybean and rice, in citrus these tools are not yet developed. Results We have constructed a comprehensive GCN for citrus inferred from 297 publicly available Affymetrix Genechip Citrus Genome microarray datasets, providing gene co-expression relationships at a genome-wide scale (33,000 transcripts). The comprehensive citrus GCN consists of a global GCN (condition-independent) and four condition-dependent GCNs that survey the sweet orange species only, all citrus fruit tissues, all citrus leaf tissues, or stress-exposed plants. All of these GCNs are clustered using genome-wide, gene-centric (guide) and graph clustering algorithms for flexibility of gene function prediction. For each putative cluster, gene ontology (GO) enrichment and gene expression specificity analyses were performed to enhance gene function, expression and regulation pattern prediction. The guide-gene approach was used to infer novel roles of genes involved in disease susceptibility and vitamin C metabolism, and graph-clustering approaches were used to investigate isoprenoid/phenylpropanoid metabolism in citrus peel, and citric acid catabolism via the GABA shunt in citrus fruit. Conclusions Integration of citrus gene co-expression networks

  9. An atlas of tissue-specific conserved coexpression for functional annotation and disease gene prediction

    PubMed Central

    Piro, Rosario Michael; Ala, Ugo; Molineris, Ivan; Grassi, Elena; Bracco, Chiara; Perego, Gian Paolo; Provero, Paolo; Di Cunto, Ferdinando

    2011-01-01

    Gene coexpression relationships that are phylogenetically conserved between human and mouse have been shown to provide important clues about gene function that can be efficiently used to identify promising candidate genes for human hereditary disorders. In the past, such approaches have considered mostly generic gene expression profiles that cover multiple tissues and organs. The individual genes of multicellular organisms, however, can participate in different transcriptional programs, operating at scales as different as single-cell types, tissues, organs, body regions or the entire organism. Therefore, systematic analysis of tissue-specific coexpression could be, in principle, a very powerful strategy to dissect those functional relationships among genes that emerge only in particular tissues or organs. In this report, we show that, in fact, conserved coexpression as determined from tissue-specific and condition-specific data sets can predict many functional relationships that are not detected by analyzing heterogeneous microarray data sets. More importantly, we find that, when combined with disease networks, the simultaneous use of both generic (multi-tissue) and tissue-specific conserved coexpression allows a more efficient prediction of human disease genes than the use of generic conserved coexpression alone. Using this strategy, we were able to identify high-probability candidates for 238 orphan disease loci. We provide proof of concept that this combined use of generic and tissue-specific conserved coexpression can be very useful to prioritize the mutational candidates obtained from deep-sequencing projects, even in the case of genetic disorders as heterogeneous as XLMR. PMID:21654723

  10. TreeQ-VISTA: An Interactive Tree Visualization Tool withFunctional Annotation Query Capabilities

    SciTech Connect

    Gu, Shengyin; Anderson, Iain; Kunin, Victor; Cipriano, Michael; Minovitsky, Simon; Weber, Gunther; Amenta, Nina; Hamann, Bernd; Dubchak,Inna

    2007-05-07

    Summary: We describe a general multiplatform exploratorytool called TreeQ-Vista, designed for presenting functional annotationsin a phylogenetic context. Traits, such as phenotypic and genomicproperties, are interactively queried from a relational database with auser-friendly interface which provides a set of tools for users with orwithout SQL knowledge. The query results are projected onto aphylogenetic tree and can be displayed in multiple color groups. A richset of browsing, grouping and query tools are provided to facilitatetrait exploration, comparison and analysis.Availability: The program,detailed tutorial and examples are available online athttp://genome-test.lbl.gov/vista/TreeQVista.

  11. IsoSeq analysis and functional annotation of the infratentorial ependymoma tumor tissue on PacBio RSII platform

    PubMed Central

    Singh, Neetu; Sahu, Dinesh Kumar; Chowdhry, Rebecca; Mishra, Archana; Goel, Madhu Mati; Faheem, Mohd; Srivastava, Chhitij; Ojha, Bal Krishna; Gupta, Devendra Kumar; Kant, Ravi

    2015-01-01

    Here, we sequenced and functionally annotated the long reads (1–2 kb) cDNAs library of an infratentorial ependymoma tumor tissue on PacBio RSII by Iso-Seq protocol using SMRT technology. 577 MB, data was generated from the brain tissues of ependymoma tumor patient, producing 1,19,313 high-quality reads assembled into 19,878 contigs using Celera assembler followed by Quiver pipelines, which produced 2952 unique protein accessions in the nr protein database and 307 KEGG pathways. Additionally, when we compared GO terms of second and third level with alternative splicing data obtained through HTA Array2.0. We identified four and twelve transcript cluster IDs in Level-2 and Level-3 scores respectively with alternative splicing index predicting mainly the major pathways of hallmarks of cancer. Out of these transcript cluster IDs only transcript cluster IDs of gene PNMT, SNN and LAMB1 showed Reads Per Kilobase of exon model per Million mapped reads (RPKM) values at gene-level expression (GE) and transcript-level (TE) track. Most importantly, brain-specific genes–—PNMT, SNN and LAMB1 show their involvement in Ependymoma. PMID:26862483

  12. IsoSeq analysis and functional annotation of the infratentorial ependymoma tumor tissue on PacBio RSII platform.

    PubMed

    Singh, Neetu; Sahu, Dinesh Kumar; Chowdhry, Rebecca; Mishra, Archana; Goel, Madhu Mati; Faheem, Mohd; Srivastava, Chhitij; Ojha, Bal Krishna; Gupta, Devendra Kumar; Kant, Ravi

    2016-02-01

    Here, we sequenced and functionally annotated the long reads (1-2 kb) cDNAs library of an infratentorial ependymoma tumor tissue on PacBio RSII by Iso-Seq protocol using SMRT technology. 577 MB, data was generated from the brain tissues of ependymoma tumor patient, producing 1,19,313 high-quality reads assembled into 19,878 contigs using Celera assembler followed by Quiver pipelines, which produced 2952 unique protein accessions in the nr protein database and 307 KEGG pathways. Additionally, when we compared GO terms of second and third level with alternative splicing data obtained through HTA Array2.0. We identified four and twelve transcript cluster IDs in Level-2 and Level-3 scores respectively with alternative splicing index predicting mainly the major pathways of hallmarks of cancer. Out of these transcript cluster IDs only transcript cluster IDs of gene PNMT, SNN and LAMB1 showed Reads Per Kilobase of exon model per Million mapped reads (RPKM) values at gene-level expression (GE) and transcript-level (TE) track. Most importantly, brain-specific genes--PNMT, SNN and LAMB1 show their involvement in Ependymoma. PMID:26862483

  13. IsoSeq analysis and functional annotation of the infratentorial ependymoma tumor tissue on PacBio RSII platform.

    PubMed

    Singh, Neetu; Sahu, Dinesh Kumar; Chowdhry, Rebecca; Mishra, Archana; Goel, Madhu Mati; Faheem, Mohd; Srivastava, Chhitij; Ojha, Bal Krishna; Gupta, Devendra Kumar; Kant, Ravi

    2016-02-01

    Here, we sequenced and functionally annotated the long reads (1-2 kb) cDNAs library of an infratentorial ependymoma tumor tissue on PacBio RSII by Iso-Seq protocol using SMRT technology. 577 MB, data was generated from the brain tissues of ependymoma tumor patient, producing 1,19,313 high-quality reads assembled into 19,878 contigs using Celera assembler followed by Quiver pipelines, which produced 2952 unique protein accessions in the nr protein database and 307 KEGG pathways. Additionally, when we compared GO terms of second and third level with alternative splicing data obtained through HTA Array2.0. We identified four and twelve transcript cluster IDs in Level-2 and Level-3 scores respectively with alternative splicing index predicting mainly the major pathways of hallmarks of cancer. Out of these transcript cluster IDs only transcript cluster IDs of gene PNMT, SNN and LAMB1 showed Reads Per Kilobase of exon model per Million mapped reads (RPKM) values at gene-level expression (GE) and transcript-level (TE) track. Most importantly, brain-specific genes--PNMT, SNN and LAMB1 show their involvement in Ependymoma.

  14. Partitioning heritability by functional annotation using genome-wide association summary statistics.

    PubMed

    Finucane, Hilary K; Bulik-Sullivan, Brendan; Gusev, Alexander; Trynka, Gosia; Reshef, Yakir; Loh, Po-Ru; Anttila, Verneri; Xu, Han; Zang, Chongzhi; Farh, Kyle; Ripke, Stephan; Day, Felix R; Purcell, Shaun; Stahl, Eli; Lindstrom, Sara; Perry, John R B; Okada, Yukinori; Raychaudhuri, Soumya; Daly, Mark J; Patterson, Nick; Neale, Benjamin M; Price, Alkes L

    2015-11-01

    Recent work has demonstrated that some functional categories of the genome contribute disproportionately to the heritability of complex diseases. Here we analyze a broad set of functional elements, including cell type-specific elements, to estimate their polygenic contributions to heritability in genome-wide association studies (GWAS) of 17 complex diseases and traits with an average sample size of 73,599. To enable this analysis, we introduce a new method, stratified LD score regression, for partitioning heritability from GWAS summary statistics while accounting for linked markers. This new method is computationally tractable at very large sample sizes and leverages genome-wide information. Our findings include a large enrichment of heritability in conserved regions across many traits, a very large immunological disease-specific enrichment of heritability in FANTOM5 enhancers and many cell type-specific enrichments, including significant enrichment of central nervous system cell types in the heritability of body mass index, age at menarche, educational attainment and smoking behavior. PMID:26414678

  15. Evidence-Based Annotation of Gene Function in Shewanella oneidensis MR-1 Using Genome-Wide Fitness Profiling across 121 Conditions

    PubMed Central

    Deutschbauer, Adam; Price, Morgan N.; Wetmore, Kelly M.; Shao, Wenjun; Baumohl, Jason K.; Xu, Zhuchen; Nguyen, Michelle; Tamse, Raquel; Davis, Ronald W.; Arkin, Adam P.

    2011-01-01

    Most genes in bacteria are experimentally uncharacterized and cannot be annotated with a specific function. Given the great diversity of bacteria and the ease of genome sequencing, high-throughput approaches to identify gene function experimentally are needed. Here, we use pools of tagged transposon mutants in the metal-reducing bacterium Shewanella oneidensis MR-1 to probe the mutant fitness of 3,355 genes in 121 diverse conditions including different growth substrates, alternative electron acceptors, stresses, and motility. We find that 2,350 genes have a pattern of fitness that is significantly different from random and 1,230 of these genes (37% of our total assayed genes) have enough signal to show strong biological correlations. We find that genes in all functional categories have phenotypes, including hundreds of hypotheticals, and that potentially redundant genes (over 50% amino acid identity to another gene in the genome) are also likely to have distinct phenotypes. Using fitness patterns, we were able to propose specific molecular functions for 40 genes or operons that lacked specific annotations or had incomplete annotations. In one example, we demonstrate that the previously hypothetical gene SO_3749 encodes a functional acetylornithine deacetylase, thus filling a missing step in S. oneidensis metabolism. Additionally, we demonstrate that the orphan histidine kinase SO_2742 and orphan response regulator SO_2648 form a signal transduction pathway that activates expression of acetyl-CoA synthase and is required for S. oneidensis to grow on acetate as a carbon source. Lastly, we demonstrate that gene expression and mutant fitness are poorly correlated and that mutant fitness generates more confident predictions of gene function than does gene expression. The approach described here can be applied generally to create large-scale gene-phenotype maps for evidence-based annotation of gene function in prokaryotes. PMID:22125499

  16. Modelling the Constraints of Spatial Environment in Fauna Movement Simulations: Comparison of a Boundaries Accurate Function and a Cost Function

    NASA Astrophysics Data System (ADS)

    Jolivet, L.; Cohen, M.; Ruas, A.

    2015-08-01

    Landscape influences fauna movement at different levels, from habitat selection to choices of movements' direction. Our goal is to provide a development frame in order to test simulation functions for animal's movement. We describe our approach for such simulations and we compare two types of functions to calculate trajectories. To do so, we first modelled the role of landscape elements to differentiate between elements that facilitate movements and the ones being hindrances. Different influences are identified depending on landscape elements and on animal species. Knowledge were gathered from ecologists, literature and observation datasets. Second, we analysed the description of animal movement recorded with GPS at fine scale, corresponding to high temporal frequency and good location accuracy. Analysing this type of data provides information on the relation between landscape features and movements. We implemented an agent-based simulation approach to calculate potential trajectories constrained by the spatial environment and individual's behaviour. We tested two functions that consider space differently: one function takes into account the geometry and the types of landscape elements and one cost function sums up the spatial surroundings of an individual. Results highlight the fact that the cost function exaggerates the distances travelled by an individual and simplifies movement patterns. The geometry accurate function represents a good bottom-up approach for discovering interesting areas or obstacles for movements.

  17. Relieving the cardiometabolic disease burden: a perspective on phytometabolite functional and chemical annotation for diabetes management.

    PubMed

    Janero, David R

    2014-01-01

    Type 2 diabetes (T2D) is both a complex, multifactorial disease state and an unsolved, intensifying public-health problem. To help reduce disease burden, some T2D patients have embraced plant-derived substances for use with - if not in place of - prescription medicines, a trend based mainly upon historical precedent and anecdotal observations of human health benefit. Preclinical research has emphasized phytometabolite interactions with purported T2D pathogenic targets and the effects of botanical preparations on experimental T2D symptomology as induced in laboratory animals. More holistic, systems-oriented profiling of phytochemicals with functional-biology, omics, and chemical-fingerprinting tools now appears necessary to increase our appreciation of phytometabolite actions potentially beneficial to the T2D patient. The resultant, multidimensional view of phytometabolite pharmacology should help provide a more rational basis for evaluating the potential of natural plant products as T2D pharmacotherapy. Such information may also help substantiate and legitimize (pre)clinical demonstrations of phytochemical health benefits, advance our understanding of T2D pathogenesis, and offer scope for better T2D medicines. Public-private partnerships are invoked for conducting this research with the ultimate aim of improving the global cardiometabolic profile. PMID:24156826

  18. The SEED and the Rapid Annotation of microbial genomes using Subsystems Technology (RAST)

    PubMed Central

    Overbeek, Ross; Olson, Robert; Pusch, Gordon D.; Olsen, Gary J.; Davis, James J.; Disz, Terry; Edwards, Robert A.; Gerdes, Svetlana; Parrello, Bruce; Shukla, Maulik; Vonstein, Veronika; Wattam, Alice R.; Xia, Fangfang; Stevens, Rick

    2014-01-01

    In 2004, the SEED (http://pubseed.theseed.org/) was created to provide consistent and accurate genome annotations across thousands of genomes and as a platform for discovering and developing de novo annotations. The SEED is a constantly updated integration of genomic data with a genome database, web front end, API and server scripts. It is used by many scientists for predicting gene functions and discovering new pathways. In addition to being a powerful database for bioinformatics research, the SEED also houses subsystems (collections of functionally related protein families) and their derived FIGfams (protein families), which represent the core of the RAST annotation engine (http://rast.nmpdr.org/). When a new genome is submitted to RAST, genes are called and their annotations are made by comparison to the FIGfam collection. If the genome is made public, it is then housed within the SEED and its proteins populate the FIGfam collection. This annotation cycle has proven to be a robust and scalable solution to the problem of annotating the exponentially increasing number of genomes. To date, >12 000 users worldwide have annotated >60 000 distinct genomes using RAST. Here we describe the interconnectedness of the SEED database and RAST, the RAST annotation pipeline and updates to both resources. PMID:24293654

  19. Functional Annotation of Metastasis-associated MicroRNAs of Melanoma: A Meta-analysis of Expression Profiles

    PubMed Central

    Li, Jing-Yi; Zheng, Li-Li; Wang, Ting-Ting; Hu, Min

    2016-01-01

    Background: Melanoma is a type of cancer that develops from the pigment-containing cells. Until now, its pathological mechanisms remain largely unknown. The aim of this study was to identify metastasis-related microRNA (miRNAs) and gain an understanding of the biological functions in the metastasis of melanoma. Methods: We searched the PubMed and Gene Expression Omnibus database to collect miRNA expression profiling datasets about melanoma, with key words of “melanoma”, “miRNA”, “microarray”, and “gene expression profiling”. Only the original experimental works published before June 2016 for analyzing the metastasis of melanoma were retained, other nonhuman studies, reviews, and meta-analyses were removed. We performed a meta-analysis to explore the differentially expressed miRNA between metastatic and nonmetastatic samples. Moreover, we predicted target genes of the miRNAs to study their biological roles for these miRNAs. Results: We identified a total of 63 significantly differentially expressed miRNAs by meta-analysis of the melanoma expression profiling data. The regulatory network constructed by using these miRNAs and the predicted targets identified several key genes involved in the metastasis of melanoma. Functional annotation of these genes indicated that they are mainly enriched in some biological pathways such as mitogen-activated protein kinase signaling pathway, cell junction, and focal adhesion. Conclusions: By collecting the miRNA expression datasets from different platforms, multiple biological markers were identified for the metastasis of melanoma. This study provided novel insights into the molecular mechanisms underlying this disease, thereby aiding the diagnosis and treatment of the disease. PMID:27748342

  20. Functional Annotation of Two New Carboxypeptidases from the Amidohydrolase Superfamily of Enzymes

    SciTech Connect

    Xiang, D.; Xu, C; Kumaran, D; Brown, A; Sauder, M; Burley, S; Swaminathan, S; Raushel, F

    2009-01-01

    Two proteins from the amidohydrolase superfamily of enzymes were cloned, expressed, and purified to homogeneity. The first protein, Cc0300, was from Caulobacter crescentus CB-15 (Cc0300), while the second one (Sgx9355e) was derived from an environmental DNA sequence originally isolated from the Sargasso Sea (gi|44371129). The catalytic functions and the substrate profiles for the two enzymes were determined with the aid of combinatorial dipeptide libraries. Both enzymes were shown to catalyze the hydrolysis of l-Xaa-l-Xaa dipeptides in which the amino acid at the N-terminus was relatively unimportant. These enzymes were specific for hydrophobic amino acids at the C-terminus. With Cc0300, substrates terminating in isoleucine, leucine, phenylalanine, tyrosine, valine, methionine, and tryptophan were hydrolyzed. The same specificity was observed with Sgx9355e, but this protein was also able to hydrolyze peptides terminating in threonine. Both enzymes were able to hydrolyze N-acetyl and N-formyl derivatives of the hydrophobic amino acids and tripeptides. The best substrates identified for Cc0300 were l-Ala-l-Leu with kcat and kcat/Km values of 37 s-1 and 1.1 x 105 M-1 s-1, respectively, and N-formyl-l-Tyr with kcat and kcat/Km values of 33 s-1 and 3.9 x 105 M-1 s-1, respectively. The best substrate identified for Sgx9355e was l-Ala-l-Phe with kcat and kcat/Km values of 0.41 s-1 and 5.8 x 103 M-1 s-1. The three-dimensional structure of Sgx9355e was determined to a resolution of 2.33 Angstroms with l-methionine bound in the active site. The a-carboxylate of the methionine is ion-paired to His-237 and also hydrogen bonded to the backbone amide groups of Val-201 and Leu-202. The a-amino group of the bound methionine interacts with Asp-328. The structural determinants for substrate recognition were identified and compared with other enzymes in this superfamily that hydrolyze dipeptides with different specificities.

  1. Ranking biomedical annotations with annotator's semantic relevancy.

    PubMed

    Wu, Aihua

    2014-01-01

    Biomedical annotation is a common and affective artifact for researchers to discuss, show opinion, and share discoveries. It becomes increasing popular in many online research communities, and implies much useful information. Ranking biomedical annotations is a critical problem for data user to efficiently get information. As the annotator's knowledge about the annotated entity normally determines quality of the annotations, we evaluate the knowledge, that is, semantic relationship between them, in two ways. The first is extracting relational information from credible websites by mining association rules between an annotator and a biomedical entity. The second way is frequent pattern mining from historical annotations, which reveals common features of biomedical entities that an annotator can annotate with high quality. We propose a weighted and concept-extended RDF model to represent an annotator, a biomedical entity, and their background attributes and merge information from the two ways as the context of an annotator. Based on that, we present a method to rank the annotations by evaluating their correctness according to user's vote and the semantic relevancy between the annotator and the annotated entity. The experimental results show that the approach is applicable and efficient even when data set is large. PMID:24899918

  2. Ranking Biomedical Annotations with Annotator's Semantic Relevancy

    PubMed Central

    2014-01-01

    Biomedical annotation is a common and affective artifact for researchers to discuss, show opinion, and share discoveries. It becomes increasing popular in many online research communities, and implies much useful information. Ranking biomedical annotations is a critical problem for data user to efficiently get information. As the annotator's knowledge about the annotated entity normally determines quality of the annotations, we evaluate the knowledge, that is, semantic relationship between them, in two ways. The first is extracting relational information from credible websites by mining association rules between an annotator and a biomedical entity. The second way is frequent pattern mining from historical annotations, which reveals common features of biomedical entities that an annotator can annotate with high quality. We propose a weighted and concept-extended RDF model to represent an annotator, a biomedical entity, and their background attributes and merge information from the two ways as the context of an annotator. Based on that, we present a method to rank the annotations by evaluating their correctness according to user's vote and the semantic relevancy between the annotator and the annotated entity. The experimental results show that the approach is applicable and efficient even when data set is large. PMID:24899918

  3. Muscle function during brief maximal exercise: accurate measurements on a friction-loaded cycle ergometer.

    PubMed

    Arsac, L M; Belli, A; Lacour, J R

    1996-01-01

    A friction loaded cycle ergometer was instrumented with a strain gauge and an incremental encoder to obtain accurate measurement of human mechanical work output during the acceleration phase of a cycling sprint. This device was used to characterise muscle function in a group of 15 well-trained male subjects, asked to perform six short maximal sprints on the cycle against a constant friction load. Friction loads were successively set at 0.25, 0.35, 0.45, 0.55, 0.65 and 0.75 N.kg-1 body mass. Since the sprints were performed from a standing start, and since the acceleration was not restricted, the greatest attention was paid to the measurement of the acceleration balancing load due to flywheel inertia. Instantaneous pedalling velocity (v) and power output (P) were calculated each 5 ms and then averaged over each downstroke period so that each pedal downstroke provided a combination of v, force and P. Since an 8-s acceleration phase was composed of about 21 to 34 pedal downstrokes, this many v-P combinations were obtained amounting to 137-180 v-P combinations for all six friction loads in one individual, over the widest functional range of pedalling velocities (17-214 rpm). Thus, the individual's muscle function was characterised by the v-P relationships obtained during the six acceleration phases of the six sprints. An important finding of the present study was a strong linear relationship between individual optimal velocity (vopt) and individual maximal power output (Pmax) (n = 15, r = 0.95, P < 0.001) which has never been observed before. Since vopt has been demonstrated to be related to human fibre type composition both vopt, Pmax and their inter-relationship could represent a major feature in characterising muscle function in maximal unrestricted exercise. It is suggested that the present method is well suited to such analyses.

  4. Accurate and systematically improvable density functional theory embedding for correlated wavefunctions

    SciTech Connect

    Goodpaster, Jason D.; Barnes, Taylor A.; Miller, Thomas F.; Manby, Frederick R.

    2014-05-14

    We analyze the sources of error in quantum embedding calculations in which an active subsystem is treated using wavefunction methods, and the remainder using density functional theory. We show that the embedding potential felt by the electrons in the active subsystem makes only a small contribution to the error of the method, whereas the error in the nonadditive exchange-correlation energy dominates. We test an MP2 correction for this term and demonstrate that the corrected embedding scheme accurately reproduces wavefunction calculations for a series of chemical reactions. Our projector-based embedding method uses localized occupied orbitals to partition the system; as with other local correlation methods, abrupt changes in the character of the localized orbitals along a reaction coordinate can lead to discontinuities in the embedded energy, but we show that these discontinuities are small and can be systematically reduced by increasing the size of the active region. Convergence of reaction energies with respect to the size of the active subsystem is shown to be rapid for all cases where the density functional treatment is able to capture the polarization of the environment, even in conjugated systems, and even when the partition cuts across a double bond.

  5. SCAN: An Efficient Density Functional Yielding Accurate Structures and Energies of Diversely-Bonded Materials

    NASA Astrophysics Data System (ADS)

    Sun, Jianwei

    The accuracy and computational efficiency of the widely used Kohn-Sham density functional theory (DFT) are limited by the approximation to its exchange-correlation energy Exc. The earliest local density approximation (LDA) overestimates the strengths of all bonds near equilibrium (even the vdW bonds). By adding the electron density gradient to model Exc, generalized gradient approximations (GGAs) generally soften the bonds to give robust and overall more accurate descriptions, except for the vdW interaction which is largely lost. Further improvement for covalent, ionic, and hydrogen bonds can be obtained by the computationally more expensive hybrid GGAs, which mix GGAs with the nonlocal exact exchange. Meta-GGAs are still semilocal in computation and thus efficient. Compared to GGAs, they add the kinetic energy density that enables them to recognize and accordingly treat different bonds, which no LDA or GGA can. We show here that the recently developed non-empirical strongly constrained and appropriately normed (SCAN) meta-GGA improves significantly over LDA and the standard Perdew-Burke-Ernzerhof GGA for geometries and energies of diversely-bonded materials (including covalent, metallic, ionic, hydrogen, and vdW bonds) at comparable efficiency. Often SCAN matches or improves upon the accuracy of a hybrid functional, at almost-GGA cost. This work has been supported by NSF under DMR-1305135 and CNS-09-58854, and by DOE BES EFRC CCDM under DE-SC0012575.

  6. Managing the data deluge: data-driven GO category assignment improves while complexity of functional annotation increases.

    PubMed

    Gobeill, Julien; Pasche, Emilie; Vishnyakova, Dina; Ruch, Patrick

    2013-01-01

    The available curated data lag behind current biological knowledge contained in the literature. Text mining can assist biologists and curators to locate and access this knowledge, for instance by characterizing the functional profile of publications. Gene Ontology (GO) category assignment in free text already supports various applications, such as powering ontology-based search engines, finding curation-relevant articles (triage) or helping the curator to identify and encode functions. Popular text mining tools for GO classification are based on so called thesaurus-based--or dictionary-based--approaches, which exploit similarities between the input text and GO terms themselves. But their effectiveness remains limited owing to the complex nature of GO terms, which rarely occur in text. In contrast, machine learning approaches exploit similarities between the input text and already curated instances contained in a knowledge base to infer a functional profile. GO Annotations (GOA) and MEDLINE make possible to exploit a growing amount of curated abstracts (97 000 in November 2012) for populating this knowledge base. Our study compares a state-of-the-art thesaurus-based system with a machine learning system (based on a k-Nearest Neighbours algorithm) for the task of proposing a functional profile for unseen MEDLINE abstracts, and shows how resources and performances have evolved. Systems are evaluated on their ability to propose for a given abstract the GO terms (2.8 on average) used for curation in GOA. We show that since 2006, although a massive effort was put into adding synonyms in GO (+300%), our thesaurus-based system effectiveness is rather constant, reaching from 0.28 to 0.31 for Recall at 20 (R20). In contrast, thanks to its knowledge base growth, our machine learning system has steadily improved, reaching from 0.38 in 2006 to 0.56 for R20 in 2012. Integrated in semi-automatic workflows or in fully automatic pipelines, such systems are more and more efficient

  7. Managing the data deluge: data-driven GO category assignment improves while complexity of functional annotation increases.

    PubMed

    Gobeill, Julien; Pasche, Emilie; Vishnyakova, Dina; Ruch, Patrick

    2013-01-01

    The available curated data lag behind current biological knowledge contained in the literature. Text mining can assist biologists and curators to locate and access this knowledge, for instance by characterizing the functional profile of publications. Gene Ontology (GO) category assignment in free text already supports various applications, such as powering ontology-based search engines, finding curation-relevant articles (triage) or helping the curator to identify and encode functions. Popular text mining tools for GO classification are based on so called thesaurus-based--or dictionary-based--approaches, which exploit similarities between the input text and GO terms themselves. But their effectiveness remains limited owing to the complex nature of GO terms, which rarely occur in text. In contrast, machine learning approaches exploit similarities between the input text and already curated instances contained in a knowledge base to infer a functional profile. GO Annotations (GOA) and MEDLINE make possible to exploit a growing amount of curated abstracts (97 000 in November 2012) for populating this knowledge base. Our study compares a state-of-the-art thesaurus-based system with a machine learning system (based on a k-Nearest Neighbours algorithm) for the task of proposing a functional profile for unseen MEDLINE abstracts, and shows how resources and performances have evolved. Systems are evaluated on their ability to propose for a given abstract the GO terms (2.8 on average) used for curation in GOA. We show that since 2006, although a massive effort was put into adding synonyms in GO (+300%), our thesaurus-based system effectiveness is rather constant, reaching from 0.28 to 0.31 for Recall at 20 (R20). In contrast, thanks to its knowledge base growth, our machine learning system has steadily improved, reaching from 0.38 in 2006 to 0.56 for R20 in 2012. Integrated in semi-automatic workflows or in fully automatic pipelines, such systems are more and more efficient

  8. Sequencing, De novo Assembly, Functional Annotation and Analysis of Phyllanthus amarus Leaf Transcriptome Using the Illumina Platform

    PubMed Central

    Bose Mazumdar, Aparupa; Chattopadhyay, Sharmila

    2016-01-01

    Phyllanthus amarus Schum. and Thonn., a widely distributed annual medicinal herb has a long history of use in the traditional system of medicine for over 2000 years. However, the lack of genomic data for P. amarus, a non-model organism hinders research at the molecular level. In the present study, high-throughput sequencing technology has been employed to enhance better understanding of this herb and provide comprehensive genomic information for future work. Here P. amarus leaf transcriptome was sequenced using the Illumina Miseq platform. We assembled 85,927 non-redundant (nr) “unitranscript” sequences with an average length of 1548 bp, from 18,060,997 raw reads. Sequence similarity analyses and annotation of these unitranscripts were performed against databases like green plants nr protein database, Gene Ontology (GO), Clusters of Orthologous Groups (COG), PlnTFDB, KEGG databases. As a result, 69,394 GO terms, 583 enzyme codes (EC), 134 KEGG maps, and 59 Transcription Factor (TF) families were generated. Functional and comparative analyses of assembled unitranscripts were also performed with the most closely related species like Populus trichocarpa and Ricinus communis using TRAPID. KEGG analysis showed that a number of assembled unitranscripts were involved in secondary metabolites, mainly phenylpropanoid, flavonoid, terpenoids, alkaloids, and lignan biosynthetic pathways that have significant medicinal attributes. Further, Fragments Per Kilobase of transcript per Million mapped reads (FPKM) values of the identified secondary metabolite pathway genes were determined and Reverse Transcription PCR (RT-PCR) of a few of these genes were performed to validate the de novo assembled leaf transcriptome dataset. In addition 65,273 simple sequence repeats (SSRs) were also identified. To the best of our knowledge, this is the first transcriptomic dataset of P. amarus till date. Our study provides the largest genetic resource that will lead to drug development and pave

  9. Transcriptome Analysis of the Emerald Ash Borer (EAB), Agrilus planipennis: De Novo Assembly, Functional Annotation and Comparative Analysis

    PubMed Central

    Duan, Jun; Ladd, Tim; Doucet, Daniel; Cusson, Michel; vanFrankenhuyzen, Kees; Mittapalli, Omprakash; Krell, Peter J.; Quan, Guoxing

    2015-01-01

    Background The Emerald ash borer (EAB), Agrilus planipennis, is an invasive phloem-feeding insect pest of ash trees. Since its initial discovery near the Detroit, US- Windsor, Canada area in 2002, the spread of EAB has had strong negative economic, social and environmental impacts in both countries. Several transcriptomes from specific tissues including midgut, fat body and antenna have recently been generated. However, the relatively low sequence depth, gene coverage and completeness limited the usefulness of these EAB databases. Methodology and Principal Findings High-throughput deep RNA-Sequencing (RNA-Seq) was used to obtain 473.9 million pairs of 100 bp length paired-end reads from various life stages and tissues. These reads were assembled into 88,907 contigs using the Trinity strategy and integrated into 38,160 unigenes after redundant sequences were removed. We annotated 11,229 unigenes by searching against the public nr, Swiss-Prot and COG. The EAB transcriptome assembly was compared with 13 other sequenced insect species, resulting in the prediction of 536 unigenes that are Coleoptera-specific. Differential gene expression revealed that 290 unigenes are expressed during larval molting and 3,911 unigenes during metamorphosis from larvae to pupae, respectively (FDR< 0.01 and log2 FC>2). In addition, 1,167 differentially expressed unigenes were identified from larval and adult midguts, 435 unigenes were up-regulated in larval midgut and 732 unigenes were up-regulated in adult midgut. Most of the genes involved in RNA interference (RNAi) pathways were identified, which implies the existence of a system RNAi in EAB. Conclusions and Significance This study provides one of the most fundamental and comprehensive transcriptome resources available for EAB to date. Identification of the tissue- stage- or species- specific unigenes will benefit the further study of gene functions during growth and metamorphosis processes in EAB and other pest insects. PMID:26244979

  10. Accurate Automatic Delineation of Heterogeneous Functional Volumes in Positron Emission Tomography for Oncology Applications

    SciTech Connect

    Hatt, Mathieu; Cheze le Rest, Catherine; Descourt, Patrice; Dekker, Andre; De Ruysscher, Dirk; Oellers, Michel; Lambin, Philippe; Pradier, Olivier; Visvikis, Dimitris

    2010-05-01

    Purpose: Accurate contouring of positron emission tomography (PET) functional volumes is now considered crucial in image-guided radiotherapy and other oncology applications because the use of functional imaging allows for biological target definition. In addition, the definition of variable uptake regions within the tumor itself may facilitate dose painting for dosimetry optimization. Methods and Materials: Current state-of-the-art algorithms for functional volume segmentation use adaptive thresholding. We developed an approach called fuzzy locally adaptive Bayesian (FLAB), validated on homogeneous objects, and then improved it by allowing the use of up to three tumor classes for the delineation of inhomogeneous tumors (3-FLAB). Simulated and real tumors with histology data containing homogeneous and heterogeneous activity distributions were used to assess the algorithm's accuracy. Results: The new 3-FLAB algorithm is able to extract the overall tumor from the background tissues and delineate variable uptake regions within the tumors, with higher accuracy and robustness compared with adaptive threshold (T{sub bckg}) and fuzzy C-means (FCM). 3-FLAB performed with a mean classification error of less than 9% +- 8% on the simulated tumors, whereas binary-only implementation led to errors of 15% +- 11%. T{sub bckg} and FCM led to mean errors of 20% +- 12% and 17% +- 14%, respectively. 3-FLAB also led to more robust estimation of the maximum diameters of tumors with histology measurements, with <6% standard deviation, whereas binary FLAB, T{sub bckg} and FCM lead to 10%, 12%, and 13%, respectively. Conclusion: These encouraging results warrant further investigation in future studies that will investigate the impact of 3-FLAB in radiotherapy treatment planning, diagnosis, and therapy response evaluation.

  11. MaGe: a microbial genome annotation system supported by synteny results.

    PubMed

    Vallenet, David; Labarre, Laurent; Rouy, Zoé; Barbe, Valérie; Bocs, Stéphanie; Cruveiller, Stéphane; Lajus, Aurélie; Pascal, Géraldine; Scarpelli, Claude; Médigue, Claudine

    2006-01-01

    Magnifying Genomes (MaGe) is a microbial genome annotation system based on a relational database containing information on bacterial genomes, as well as a web interface to achieve genome annotation projects. Our system allows one to initiate the annotation of a genome at the early stage of the finishing phase. MaGe's main features are (i) integration of annotation data from bacterial genomes enhanced by a gene coding re-annotation process using accurate gene models, (ii) integration of results obtained with a wide range of bioinformatics methods, among which exploration of gene context by searching for conserved synteny and reconstruction of metabolic pathways, (iii) an advanced web interface allowing multiple users to refine the automatic assignment of gene product functions. MaGe is also linked to numerous well-known biological databases and systems. Our system has been thoroughly tested during the annotation of complete bacterial genomes (Acinetobacter baylyi ADP1, Pseudoalteromonas haloplanktis, Frankia alni) and is currently used in the context of several new microbial genome annotation projects. In addition, MaGe allows for annotation curation and exploration of already published genomes from various genera (e.g. Yersinia, Bacillus and Neisseria). MaGe can be accessed at http://www.genoscope.cns.fr/agc/mage. PMID:16407324

  12. Functional genomics tools applied to plant metabolism: a survey on plant respiration, its connections and the annotation of complex gene functions

    PubMed Central

    Araújo, Wagner L.; Nunes-Nesi, Adriano; Williams, Thomas C. R.

    2012-01-01

    The application of post-genomic techniques in plant respiration studies has greatly improved our ability to assign functions to gene products. In addition it has also revealed previously unappreciated interactions between distal elements of metabolism. Such results have reinforced the need to consider plant respiratory metabolism as part of a complex network and making sense of such interactions will ultimately require the construction of predictive and mechanistic models. Transcriptomics, proteomics, metabolomics, and the quantification of metabolic flux will be of great value in creating such models both by facilitating the annotation of complex gene function, determining their structure and by furnishing the quantitative data required to test them. In this review, we highlight how these experimental approaches have contributed to our current understanding of plant respiratory metabolism and its interplay with associated process (e.g., photosynthesis, photorespiration, and nitrogen metabolism). We also discuss how data from these techniques may be integrated, with the ultimate aim of identifying mechanisms that control and regulate plant respiration and discovering novel gene functions with potential biotechnological implications. PMID:22973288

  13. Accurate Astrometry and Photometry of Saturated and Coronagraphic Point Spread Functions

    SciTech Connect

    Marois, C; Lafreniere, D; Macintosh, B; Doyon, R

    2006-02-07

    For ground-based adaptive optics point source imaging, differential atmospheric refraction and flexure introduce a small drift of the point spread function (PSF) with time, and seeing and sky transmission variations modify the PSF flux. These effects need to be corrected to properly combine the images and obtain optimal signal-to-noise ratios, accurate relative astrometry and photometry of detected companions as well as precise detection limits. Usually, one can easily correct for these effects by using the PSF core, but this is impossible when high dynamic range observing techniques are used, like coronagraphy with a non-transmissive occulting mask, or if the stellar PSF core is saturated. We present a new technique that can solve these issues by using off-axis satellite PSFs produced by a periodic amplitude or phase mask conjugated to a pupil plane. It will be shown that these satellite PSFs track precisely the PSF position, its Strehl ratio and its intensity and can thus be used to register and to flux normalize the PSF. This approach can be easily implemented in existing adaptive optics instruments and should be considered for future extreme adaptive optics coronagraph instruments and in high-contrast imaging space observatories.

  14. Predicting accurate fluorescent spectra for high molecular weight polycyclic aromatic hydrocarbons using density functional theory

    NASA Astrophysics Data System (ADS)

    Powell, Jacob; Heider, Emily C.; Campiglia, Andres; Harper, James K.

    2016-10-01

    The ability of density functional theory (DFT) methods to predict accurate fluorescence spectra for polycyclic aromatic hydrocarbons (PAHs) is explored. Two methods, PBE0 and CAM-B3LYP, are evaluated both in the gas phase and in solution. Spectra for several of the most toxic PAHs are predicted and compared to experiment, including three isomers of C24H14 and a PAH containing heteroatoms. Unusually high-resolution experimental spectra are obtained for comparison by analyzing each PAH at 4.2 K in an n-alkane matrix. All theoretical spectra visually conform to the profiles of the experimental data but are systematically offset by a small amount. Specifically, when solvent is included the PBE0 functional overestimates peaks by 16.1 ± 6.6 nm while CAM-B3LYP underestimates the same transitions by 14.5 ± 7.6 nm. These calculated spectra can be empirically corrected to decrease the uncertainties to 6.5 ± 5.1 and 5.7 ± 5.1 nm for the PBE0 and CAM-B3LYP methods, respectively. A comparison of computed spectra in the gas phase indicates that the inclusion of n-octane shifts peaks by +11 nm on average and this change is roughly equivalent for PBE0 and CAM-B3LYP. An automated approach for comparing spectra is also described that minimizes residuals between a given theoretical spectrum and all available experimental spectra. This approach identifies the correct spectrum in all cases and excludes approximately 80% of the incorrect spectra, demonstrating that an automated search of theoretical libraries of spectra may eventually become feasible.

  15. Accurate description of van der Waals complexes by density functional theory including empirical corrections.

    PubMed

    Grimme, Stefan

    2004-09-01

    An empirical method to account for van der Waals interactions in practical calculations with the density functional theory (termed DFT-D) is tested for a wide variety of molecular complexes. As in previous schemes, the dispersive energy is described by damped interatomic potentials of the form C6R(-6). The use of pure, gradient-corrected density functionals (BLYP and PBE), together with the resolution-of-the-identity (RI) approximation for the Coulomb operator, allows very efficient computations for large systems. Opposed to previous work, extended AO basis sets of polarized TZV or QZV quality are employed, which reduces the basis set superposition error to a negligible extend. By using a global scaling factor for the atomic C6 coefficients, the functional dependence of the results could be strongly reduced. The "double counting" of correlation effects for strongly bound complexes is found to be insignificant if steep damping functions are employed. The method is applied to a total of 29 complexes of atoms and small molecules (Ne, CH4, NH3, H2O, CH3F, N2, F2, formic acid, ethene, and ethine) with each other and with benzene, to benzene, naphthalene, pyrene, and coronene dimers, the naphthalene trimer, coronene. H2O and four H-bonded and stacked DNA base pairs (AT and GC). In almost all cases, very good agreement with reliable theoretical or experimental results for binding energies and intermolecular distances is obtained. For stacked aromatic systems and the important base pairs, the DFT-D-BLYP model seems to be even superior to standard MP2 treatments that systematically overbind. The good results obtained suggest the approach as a practical tool to describe the properties of many important van der Waals systems in chemistry. Furthermore, the DFT-D data may either be used to calibrate much simpler (e.g., force-field) potentials or the optimized structures can be used as input for more accurate ab initio calculations of the interaction energies.

  16. Gene Ontology annotations and resources.

    PubMed

    Blake, J A; Dolan, M; Drabkin, H; Hill, D P; Li, Ni; Sitnikov, D; Bridges, S; Burgess, S; Buza, T; McCarthy, F; Peddinti, D; Pillai, L; Carbon, S; Dietze, H; Ireland, A; Lewis, S E; Mungall, C J; Gaudet, P; Chrisholm, R L; Fey, P; Kibbe, W A; Basu, S; Siegele, D A; McIntosh, B K; Renfro, D P; Zweifel, A E; Hu, J C; Brown, N H; Tweedie, S; Alam-Faruque, Y; Apweiler, R; Auchinchloss, A; Axelsen, K; Bely, B; Blatter, M -C; Bonilla, C; Bouguerleret, L; Boutet, E; Breuza, L; Bridge, A; Chan, W M; Chavali, G; Coudert, E; Dimmer, E; Estreicher, A; Famiglietti, L; Feuermann, M; Gos, A; Gruaz-Gumowski, N; Hieta, R; Hinz, C; Hulo, C; Huntley, R; James, J; Jungo, F; Keller, G; Laiho, K; Legge, D; Lemercier, P; Lieberherr, D; Magrane, M; Martin, M J; Masson, P; Mutowo-Muellenet, P; O'Donovan, C; Pedruzzi, I; Pichler, K; Poggioli, D; Porras Millán, P; Poux, S; Rivoire, C; Roechert, B; Sawford, T; Schneider, M; Stutz, A; Sundaram, S; Tognolli, M; Xenarios, I; Foulgar, R; Lomax, J; Roncaglia, P; Khodiyar, V K; Lovering, R C; Talmud, P J; Chibucos, M; Giglio, M Gwinn; Chang, H -Y; Hunter, S; McAnulla, C; Mitchell, A; Sangrador, A; Stephan, R; Harris, M A; Oliver, S G; Rutherford, K; Wood, V; Bahler, J; Lock, A; Kersey, P J; McDowall, D M; Staines, D M; Dwinell, M; Shimoyama, M; Laulederkind, S; Hayman, T; Wang, S -J; Petri, V; Lowry, T; D'Eustachio, P; Matthews, L; Balakrishnan, R; Binkley, G; Cherry, J M; Costanzo, M C; Dwight, S S; Engel, S R; Fisk, D G; Hitz, B C; Hong, E L; Karra, K; Miyasato, S R; Nash, R S; Park, J; Skrzypek, M S; Weng, S; Wong, E D; Berardini, T Z; Huala, E; Mi, H; Thomas, P D; Chan, J; Kishore, R; Sternberg, P; Van Auken, K; Howe, D; Westerfield, M

    2013-01-01

    The Gene Ontology (GO) Consortium (GOC, http://www.geneontology.org) is a community-based bioinformatics resource that classifies gene product function through the use of structured, controlled vocabularies. Over the past year, the GOC has implemented several processes to increase the quantity, quality and specificity of GO annotations. First, the number of manual, literature-based annotations has grown at an increasing rate. Second, as a result of a new 'phylogenetic annotation' process, manually reviewed, homology-based annotations are becoming available for a broad range of species. Third, the quality of GO annotations has been improved through a streamlined process for, and automated quality checks of, GO annotations deposited by different annotation groups. Fourth, the consistency and correctness of the ontology itself has increased by using automated reasoning tools. Finally, the GO has been expanded not only to cover new areas of biology through focused interaction with experts, but also to capture greater specificity in all areas of the ontology using tools for adding new combinatorial terms. The GOC works closely with other ontology developers to support integrated use of terminologies. The GOC supports its user community through the use of e-mail lists, social media and web-based resources.

  17. Gene Ontology annotations and resources.

    PubMed

    Blake, J A; Dolan, M; Drabkin, H; Hill, D P; Li, Ni; Sitnikov, D; Bridges, S; Burgess, S; Buza, T; McCarthy, F; Peddinti, D; Pillai, L; Carbon, S; Dietze, H; Ireland, A; Lewis, S E; Mungall, C J; Gaudet, P; Chrisholm, R L; Fey, P; Kibbe, W A; Basu, S; Siegele, D A; McIntosh, B K; Renfro, D P; Zweifel, A E; Hu, J C; Brown, N H; Tweedie, S; Alam-Faruque, Y; Apweiler, R; Auchinchloss, A; Axelsen, K; Bely, B; Blatter, M -C; Bonilla, C; Bouguerleret, L; Boutet, E; Breuza, L; Bridge, A; Chan, W M; Chavali, G; Coudert, E; Dimmer, E; Estreicher, A; Famiglietti, L; Feuermann, M; Gos, A; Gruaz-Gumowski, N; Hieta, R; Hinz, C; Hulo, C; Huntley, R; James, J; Jungo, F; Keller, G; Laiho, K; Legge, D; Lemercier, P; Lieberherr, D; Magrane, M; Martin, M J; Masson, P; Mutowo-Muellenet, P; O'Donovan, C; Pedruzzi, I; Pichler, K; Poggioli, D; Porras Millán, P; Poux, S; Rivoire, C; Roechert, B; Sawford, T; Schneider, M; Stutz, A; Sundaram, S; Tognolli, M; Xenarios, I; Foulgar, R; Lomax, J; Roncaglia, P; Khodiyar, V K; Lovering, R C; Talmud, P J; Chibucos, M; Giglio, M Gwinn; Chang, H -Y; Hunter, S; McAnulla, C; Mitchell, A; Sangrador, A; Stephan, R; Harris, M A; Oliver, S G; Rutherford, K; Wood, V; Bahler, J; Lock, A; Kersey, P J; McDowall, D M; Staines, D M; Dwinell, M; Shimoyama, M; Laulederkind, S; Hayman, T; Wang, S -J; Petri, V; Lowry, T; D'Eustachio, P; Matthews, L; Balakrishnan, R; Binkley, G; Cherry, J M; Costanzo, M C; Dwight, S S; Engel, S R; Fisk, D G; Hitz, B C; Hong, E L; Karra, K; Miyasato, S R; Nash, R S; Park, J; Skrzypek, M S; Weng, S; Wong, E D; Berardini, T Z; Huala, E; Mi, H; Thomas, P D; Chan, J; Kishore, R; Sternberg, P; Van Auken, K; Howe, D; Westerfield, M

    2013-01-01

    The Gene Ontology (GO) Consortium (GOC, http://www.geneontology.org) is a community-based bioinformatics resource that classifies gene product function through the use of structured, controlled vocabularies. Over the past year, the GOC has implemented several processes to increase the quantity, quality and specificity of GO annotations. First, the number of manual, literature-based annotations has grown at an increasing rate. Second, as a result of a new 'phylogenetic annotation' process, manually reviewed, homology-based annotations are becoming available for a broad range of species. Third, the quality of GO annotations has been improved through a streamlined process for, and automated quality checks of, GO annotations deposited by different annotation groups. Fourth, the consistency and correctness of the ontology itself has increased by using automated reasoning tools. Finally, the GO has been expanded not only to cover new areas of biology through focused interaction with experts, but also to capture greater specificity in all areas of the ontology using tools for adding new combinatorial terms. The GOC works closely with other ontology developers to support integrated use of terminologies. The GOC supports its user community through the use of e-mail lists, social media and web-based resources. PMID:23161678

  18. Accurate bond energies of biodiesel methyl esters from multireference averaged coupled-pair functional calculations.

    PubMed

    Oyeyemi, Victor B; Keith, John A; Carter, Emily A

    2014-09-01

    Accurate bond dissociation energies (BDEs) are important for characterizing combustion chemistry, particularly the initial stages of pyrolysis. Here we contribute to evaluating the thermochemistry of biodiesel methyl ester molecules using ab initio BDEs derived from a multireference averaged coupled-pair functional (MRACPF2)-based scheme. Having previously validated this approach for hydrocarbons and a variety of oxygenates, herein we provide further validation for bonds within carboxylic acids and methyl esters, finding our scheme predicts BDEs within chemical accuracy (i.e., within 1 kcal/mol) for these molecules. Insights into BDE trends with ester size are then analyzed for methyl formate through methyl crotonate. We find that the carbonyl group in the ester moiety has only a local effect on BDEs. C═C double bonds in ester alkyl chains are found to increase the strengths of bonds adjacent to the double bond. An important exception are bonds beta to C═C or C═O bonds, which produce allylic-like radicals upon dissociation. The observed trends arise from different degrees of geometric relaxation and resonance stabilization in the radicals produced. We also compute BDEs in various small alkanes and alkenes as models for the long hydrocarbon chain of actual biodiesel methyl esters. We again show that allylic bonds in the alkenes are much weaker than those in the small methyl esters, indicating that hydrogen abstractions are more likely at the allylic site and even more likely at bis-allylic sites of alkyl chains due to more electrons involved in π-resonance in the latter. Lastly, we use the BDEs in small surrogates to estimate heretofore unknown BDEs in large methyl esters of biodiesel fuels.

  19. Accurate bond energies of biodiesel methyl esters from multireference averaged coupled-pair functional calculations.

    PubMed

    Oyeyemi, Victor B; Keith, John A; Carter, Emily A

    2014-09-01

    Accurate bond dissociation energies (BDEs) are important for characterizing combustion chemistry, particularly the initial stages of pyrolysis. Here we contribute to evaluating the thermochemistry of biodiesel methyl ester molecules using ab initio BDEs derived from a multireference averaged coupled-pair functional (MRACPF2)-based scheme. Having previously validated this approach for hydrocarbons and a variety of oxygenates, herein we provide further validation for bonds within carboxylic acids and methyl esters, finding our scheme predicts BDEs within chemical accuracy (i.e., within 1 kcal/mol) for these molecules. Insights into BDE trends with ester size are then analyzed for methyl formate through methyl crotonate. We find that the carbonyl group in the ester moiety has only a local effect on BDEs. C═C double bonds in ester alkyl chains are found to increase the strengths of bonds adjacent to the double bond. An important exception are bonds beta to C═C or C═O bonds, which produce allylic-like radicals upon dissociation. The observed trends arise from different degrees of geometric relaxation and resonance stabilization in the radicals produced. We also compute BDEs in various small alkanes and alkenes as models for the long hydrocarbon chain of actual biodiesel methyl esters. We again show that allylic bonds in the alkenes are much weaker than those in the small methyl esters, indicating that hydrogen abstractions are more likely at the allylic site and even more likely at bis-allylic sites of alkyl chains due to more electrons involved in π-resonance in the latter. Lastly, we use the BDEs in small surrogates to estimate heretofore unknown BDEs in large methyl esters of biodiesel fuels. PMID:24621192

  20. On a more accurate half-discrete Hardy-Hilbert-type inequality related to the kernel of arc tangent function.

    PubMed

    Chen, Qiang; Yang, Bicheng

    2016-01-01

    By means of weight functions and Hermite-Hadamard's inequality, and introducing a discrete interval variable, a more accurate half-discrete Hardy-Hilbert-type inequality related to the kernel of arc tangent function and a best possible constant factor is given, which is an extension of a published result. The equivalent forms and the operator expressions are also considered. PMID:27563512

  1. On a more accurate half-discrete Hardy-Hilbert-type inequality related to the kernel of arc tangent function.

    PubMed

    Chen, Qiang; Yang, Bicheng

    2016-01-01

    By means of weight functions and Hermite-Hadamard's inequality, and introducing a discrete interval variable, a more accurate half-discrete Hardy-Hilbert-type inequality related to the kernel of arc tangent function and a best possible constant factor is given, which is an extension of a published result. The equivalent forms and the operator expressions are also considered.

  2. A Coding System with Independent Annotations of Gesture Forms and Functions during Verbal Communication: Development of a Database of Speech and GEsture (DoSaGE)

    PubMed Central

    Kong, Anthony Pak-Hin; Law, Sam-Po; Kwan, Connie Ching-Yin; Lai, Christy; Lam, Vivian

    2014-01-01

    Gestures are commonly used together with spoken language in human communication. One major limitation of gesture investigations in the existing literature lies in the fact that the coding of forms and functions of gestures has not been clearly differentiated. This paper first described a recently developed Database of Speech and GEsture (DoSaGE) based on independent annotation of gesture forms and functions among 119 neurologically unimpaired right-handed native speakers of Cantonese (divided into three age and two education levels), and presented findings of an investigation examining how gesture use was related to age and linguistic performance. Consideration of these two factors, for which normative data are currently very limited or lacking in the literature, is relevant and necessary when one evaluates gesture employment among individuals with and without language impairment. Three speech tasks, including monologue of a personally important event, sequential description, and story-telling, were used for elicitation. The EUDICO Linguistic ANnotator (ELAN) software was used to independently annotate each participant’s linguistic information of the transcript, forms of gestures used, and the function for each gesture. About one-third of the subjects did not use any co-verbal gestures. While the majority of gestures were non-content-carrying, which functioned mainly for reinforcing speech intonation or controlling speech flow, the content-carrying ones were used to enhance speech content. Furthermore, individuals who are younger or linguistically more proficient tended to use fewer gestures, suggesting that normal speakers gesture differently as a function of age and linguistic performance. PMID:25667563

  3. MM-ISMSA: An Ultrafast and Accurate Scoring Function for Protein-Protein Docking.

    PubMed

    Klett, Javier; Núñez-Salgado, Alfonso; Dos Santos, Helena G; Cortés-Cabrera, Álvaro; Perona, Almudena; Gil-Redondo, Rubén; Abia, David; Gago, Federico; Morreale, Antonio

    2012-09-11

    An ultrafast and accurate scoring function for protein-protein docking is presented. It includes (1) a molecular mechanics (MM) part based on a 12-6 Lennard-Jones potential; (2) an electrostatic component based on an implicit solvent model (ISM) with individual desolvation penalties for each partner in the protein-protein complex plus a hydrogen bonding term; and (3) a surface area (SA) contribution to account for the loss of water contacts upon protein-protein complex formation. The accuracy and performance of the scoring function, termed MM-ISMSA, have been assessed by (1) comparing the total binding energies, the electrostatic term, and its components (charge-charge and individual desolvation energies), as well as the per residue contributions, to results obtained with well-established methods such as APBSA or MM-PB(GB)SA for a set of 1242 decoy protein-protein complexes and (2) testing its ability to recognize the docking solution closest to the experimental structure as that providing the most favorable total binding energy. For this purpose, a test set consisting of 15 protein-protein complexes with known 3D structure mixed with 10 decoys for each complex was used. The correlation between the values afforded by MM-ISMSA and those from the other methods is quite remarkable (r(2) ∼ 0.9), and only 0.2-5.0 s (depending on the number of residues) are spent on a single calculation including an all vs all pairwise energy decomposition. On the other hand, MM-ISMSA correctly identifies the best docking solution as that closest to the experimental structure in 80% of the cases. Finally, MM-ISMSA can process molecular dynamics trajectories and reports the results as averaged values with their standard deviations. MM-ISMSA has been implemented as a plugin to the widely used molecular graphics program PyMOL, although it can also be executed in command-line mode. MM-ISMSA is distributed free of charge to nonprofit organizations.

  4. Comparative validation of the D. melanogaster modENCODE transcriptome annotation.

    PubMed

    Chen, Zhen-Xia; Sturgill, David; Qu, Jiaxin; Jiang, Huaiyang; Park, Soo; Boley, Nathan; Suzuki, Ana Maria; Fletcher, Anthony R; Plachetzki, David C; FitzGerald, Peter C; Artieri, Carlo G; Atallah, Joel; Barmina, Olga; Brown, James B; Blankenburg, Kerstin P; Clough, Emily; Dasgupta, Abhijit; Gubbala, Sai; Han, Yi; Jayaseelan, Joy C; Kalra, Divya; Kim, Yoo-Ah; Kovar, Christie L; Lee, Sandra L; Li, Mingmei; Malley, James D; Malone, John H; Mathew, Tittu; Mattiuzzo, Nicolas R; Munidasa, Mala; Muzny, Donna M; Ongeri, Fiona; Perales, Lora; Przytycka, Teresa M; Pu, Ling-Ling; Robinson, Garrett; Thornton, Rebecca L; Saada, Nehad; Scherer, Steven E; Smith, Harold E; Vinson, Charles; Warner, Crystal B; Worley, Kim C; Wu, Yuan-Qing; Zou, Xiaoyan; Cherbas, Peter; Kellis, Manolis; Eisen, Michael B; Piano, Fabio; Kionte, Karin; Fitch, David H; Sternberg, Paul W; Cutter, Asher D; Duff, Michael O; Hoskins, Roger A; Graveley, Brenton R; Gibbs, Richard A; Bickel, Peter J; Kopp, Artyom; Carninci, Piero; Celniker, Susan E; Oliver, Brian; Richards, Stephen

    2014-07-01

    Accurate gene model annotation of reference genomes is critical for making them useful. The modENCODE project has improved the D. melanogaster genome annotation by using deep and diverse high-throughput data. Since transcriptional activity that has been evolutionarily conserved is likely to have an advantageous function, we have performed large-scale interspecific comparisons to increase confidence in predicted annotations. To support comparative genomics, we filled in divergence gaps in the Drosophila phylogeny by generating draft genomes for eight new species. For comparative transcriptome analysis, we generated mRNA expression profiles on 81 samples from multiple tissues and developmental stages of 15 Drosophila species, and we performed cap analysis of gene expression in D. melanogaster and D. pseudoobscura. We also describe conservation of four distinct core promoter structures composed of combinations of elements at three positions. Overall, each type of genomic feature shows a characteristic divergence rate relative to neutral models, highlighting the value of multispecies alignment in annotating a target genome that should prove useful in the annotation of other high priority genomes, especially human and other mammalian genomes that are rich in noncoding sequences. We report that the vast majority of elements in the annotation are evolutionarily conserved, indicating that the annotation will be an important springboard for functional genetic testing by the Drosophila community.

  5. dbNSFP v3.0: A One-Stop Database of Functional Predictions and Annotations for Human Nonsynonymous and Splice-Site SNVs.

    PubMed

    Liu, Xiaoming; Wu, Chunlei; Li, Chang; Boerwinkle, Eric

    2016-03-01

    The purpose of the dbNSFP is to provide a one-stop resource for functional predictions and annotations for human nonsynonymous single-nucleotide variants (nsSNVs) and splice-site variants (ssSNVs), and to facilitate the steps of filtering and prioritizing SNVs from a large list of SNVs discovered in an exome-sequencing study. A list of all potential nsSNVs and ssSNVs based on the human reference sequence were created and functional predictions and annotations were curated and compiled for each SNV. Here, we report a recent major update of the database to version 3.0. The SNV list has been rebuilt based on GENCODE 22 and currently the database includes 82,832,027 nsSNVs and ssSNVs. An attached database dbscSNV, which compiled all potential human SNVs within splicing consensus regions and their deleteriousness predictions, add another 15,030,459 potentially functional SNVs. Eleven prediction scores (MetaSVM, MetaLR, CADD, VEST3, PROVEAN, 4× fitCons, fathmm-MKL, and DANN) and allele frequencies from the UK10K cohorts and the Exome Aggregation Consortium (ExAC), among others, have been added. The original seven prediction scores in v2.0 (SIFT, 2× Polyphen2, LRT, MutationTaster, MutationAssessor, and FATHMM) as well as many SNV and gene functional annotations have been updated. dbNSFP v3.0 is freely available at http://sites.google.com/site/jpopgen/dbNSFP. PMID:26555599

  6. dbNSFP v3.0: A One-Stop Database of Functional Predictions and Annotations for Human Nonsynonymous and Splice-Site SNVs.

    PubMed

    Liu, Xiaoming; Wu, Chunlei; Li, Chang; Boerwinkle, Eric

    2016-03-01

    The purpose of the dbNSFP is to provide a one-stop resource for functional predictions and annotations for human nonsynonymous single-nucleotide variants (nsSNVs) and splice-site variants (ssSNVs), and to facilitate the steps of filtering and prioritizing SNVs from a large list of SNVs discovered in an exome-sequencing study. A list of all potential nsSNVs and ssSNVs based on the human reference sequence were created and functional predictions and annotations were curated and compiled for each SNV. Here, we report a recent major update of the database to version 3.0. The SNV list has been rebuilt based on GENCODE 22 and currently the database includes 82,832,027 nsSNVs and ssSNVs. An attached database dbscSNV, which compiled all potential human SNVs within splicing consensus regions and their deleteriousness predictions, add another 15,030,459 potentially functional SNVs. Eleven prediction scores (MetaSVM, MetaLR, CADD, VEST3, PROVEAN, 4× fitCons, fathmm-MKL, and DANN) and allele frequencies from the UK10K cohorts and the Exome Aggregation Consortium (ExAC), among others, have been added. The original seven prediction scores in v2.0 (SIFT, 2× Polyphen2, LRT, MutationTaster, MutationAssessor, and FATHMM) as well as many SNV and gene functional annotations have been updated. dbNSFP v3.0 is freely available at http://sites.google.com/site/jpopgen/dbNSFP.

  7. NCBI prokaryotic genome annotation pipeline.

    PubMed

    Tatusova, Tatiana; DiCuccio, Michael; Badretdin, Azat; Chetvernin, Vyacheslav; Nawrocki, Eric P; Zaslavsky, Leonid; Lomsadze, Alexandre; Pruitt, Kim D; Borodovsky, Mark; Ostell, James

    2016-08-19

    Recent technological advances have opened unprecedented opportunities for large-scale sequencing and analysis of populations of pathogenic species in disease outbreaks, as well as for large-scale diversity studies aimed at expanding our knowledge across the whole domain of prokaryotes. To meet the challenge of timely interpretation of structure, function and meaning of this vast genetic information, a comprehensive approach to automatic genome annotation is critically needed. In collaboration with Georgia Tech, NCBI has developed a new approach to genome annotation that combines alignment based methods with methods of predicting protein-coding and RNA genes and other functional elements directly from sequence. A new gene finding tool, GeneMarkS+, uses the combined evidence of protein and RNA placement by homology as an initial map of annotation to generate and modify ab initio gene predictions across the whole genome. Thus, the new NCBI's Prokaryotic Genome Annotation Pipeline (PGAP) relies more on sequence similarity when confident comparative data are available, while it relies more on statistical predictions in the absence of external evidence. The pipeline provides a framework for generation and analysis of annotation on the full breadth of prokaryotic taxonomy. For additional information on PGAP see https://www.ncbi.nlm.nih.gov/genome/annotation_prok/ and the NCBI Handbook, https://www.ncbi.nlm.nih.gov/books/NBK174280/. PMID:27342282

  8. NCBI prokaryotic genome annotation pipeline.

    PubMed

    Tatusova, Tatiana; DiCuccio, Michael; Badretdin, Azat; Chetvernin, Vyacheslav; Nawrocki, Eric P; Zaslavsky, Leonid; Lomsadze, Alexandre; Pruitt, Kim D; Borodovsky, Mark; Ostell, James

    2016-08-19

    Recent technological advances have opened unprecedented opportunities for large-scale sequencing and analysis of populations of pathogenic species in disease outbreaks, as well as for large-scale diversity studies aimed at expanding our knowledge across the whole domain of prokaryotes. To meet the challenge of timely interpretation of structure, function and meaning of this vast genetic information, a comprehensive approach to automatic genome annotation is critically needed. In collaboration with Georgia Tech, NCBI has developed a new approach to genome annotation that combines alignment based methods with methods of predicting protein-coding and RNA genes and other functional elements directly from sequence. A new gene finding tool, GeneMarkS+, uses the combined evidence of protein and RNA placement by homology as an initial map of annotation to generate and modify ab initio gene predictions across the whole genome. Thus, the new NCBI's Prokaryotic Genome Annotation Pipeline (PGAP) relies more on sequence similarity when confident comparative data are available, while it relies more on statistical predictions in the absence of external evidence. The pipeline provides a framework for generation and analysis of annotation on the full breadth of prokaryotic taxonomy. For additional information on PGAP see https://www.ncbi.nlm.nih.gov/genome/annotation_prok/ and the NCBI Handbook, https://www.ncbi.nlm.nih.gov/books/NBK174280/.

  9. TriAnnot: A Versatile and High Performance Pipeline for the Automated Annotation of Plant Genomes

    PubMed Central

    Leroy, Philippe; Guilhot, Nicolas; Sakai, Hiroaki; Bernard, Aurélien; Choulet, Frédéric; Theil, Sébastien; Reboux, Sébastien; Amano, Naoki; Flutre, Timothée; Pelegrin, Céline; Ohyanagi, Hajime; Seidel, Michael; Giacomoni, Franck; Reichstadt, Mathieu; Alaux, Michael; Gicquello, Emmanuelle; Legeai, Fabrice; Cerutti, Lorenzo; Numa, Hisataka; Tanaka, Tsuyoshi; Mayer, Klaus; Itoh, Takeshi; Quesneville, Hadi; Feuillet, Catherine

    2012-01-01

    In support of the international effort to obtain a reference sequence of the bread wheat genome and to provide plant communities dealing with large and complex genomes with a versatile, easy-to-use online automated tool for annotation, we have developed the TriAnnot pipeline. Its modular architecture allows for the annotation and masking of transposable elements, the structural, and functional annotation of protein-coding genes with an evidence-based quality indexing, and the identification of conserved non-coding sequences and molecular markers. The TriAnnot pipeline is parallelized on a 712 CPU computing cluster that can run a 1-Gb sequence annotation in less than 5 days. It is accessible through a web interface for small scale analyses or through a server for large scale annotations. The performance of TriAnnot was evaluated in terms of sensitivity, specificity, and general fitness using curated reference sequence sets from rice and wheat. In less than 8 h, TriAnnot was able to predict more than 83% of the 3,748 CDS from rice chromosome 1 with a fitness of 67.4%. On a set of 12 reference Mb-sized contigs from wheat chromosome 3B, TriAnnot predicted and annotated 93.3% of the genes among which 54% were perfectly identified in accordance with the reference annotation. It also allowed the curation of 12 genes based on new biological evidences, increasing the percentage of perfect gene prediction to 63%. TriAnnot systematically showed a higher fitness than other annotation pipelines that are not improved for wheat. As it is easily adaptable to the annotation of other plant genomes, TriAnnot should become a useful resource for the annotation of large and complex genomes in the future. PMID:22645565

  10. Arterial Input Function Placement for Accurate CT Perfusion Map Construction in Acute Stroke

    PubMed Central

    Ferreira, Rafael M.; Lev, Michael H.; Goldmakher, Gregory V.; Kamalian, Shahmir; Schaefer, Pamela W.; Furie, Karen L.; Gonzalez, R. Gilberto; Sanelli, Pina C.

    2013-01-01

    OBJECTIVE The objective of our study was to evaluate the effect of varying arterial input function (AIF) placement on the qualitative and quantitative CT perfusion parameters. MATERIALS AND METHODS Retrospective analysis of CT perfusion data was performed on 14 acute stroke patients with a proximal middle cerebral artery (MCA) clot. Cerebral blood flow (CBF), cerebral blood volume (CBV), and mean transit time (MTT) maps were constructed using a systematic method by varying only the AIF placement in four positions relative to the MCA clot including proximal and distal to the clot in the ipsilateral and contralateral hemispheres. Two postprocessing software programs were used to evaluate the effect of AIF placement on perfusion parameters using a delay-insensitive deconvolution method compared with a standard deconvolution method. RESULTS One hundred sixty-eight CT perfusion maps were constructed for each software package. Both software programs generated a mean CBF at the infarct core of < 12 mL/100 g/min and a mean CBV of < 2 mL/100 g for AIF placement proximal to the clot in the ipsilateral hemisphere and proximal and distal to the clot in the contralateral hemisphere. For AIF placement distal to the clot in the ipsilateral hemisphere, the mean CBF significantly increased to 17.3 mL/100 g/min with delay-insensitive software and to 19.4 mL/100 g/min with standard software (p < 0.05). The mean MTT was significantly decreased for this AIF position. Furthermore, this AIF position yielded qualitatively different parametric maps, being most pronounced with MTT and CBF. Overall, CBV was least affected by AIF location. CONCLUSION For postprocessing of accurate quantitative CT perfusion maps, laterality of the AIF location is less important than avoiding AIF placement distal to the clot as detected on CT angiography. This pitfall is less severe with deconvolution-based software programs using a delay-insensitive technique than with those using a standard deconvolution

  11. Computational algorithms to predict Gene Ontology annotations

    PubMed Central

    2015-01-01

    Background Gene function annotations, which are associations between a gene and a term of a controlled vocabulary describing gene functional features, are of paramount importance in modern biology. Datasets of these annotations, such as the ones provided by the Gene Ontology Consortium, are used to design novel biological experiments and interpret their results. Despite their importance, these sources of information have some known issues. They are incomplete, since biological knowledge is far from being definitive and it rapidly evolves, and some erroneous annotations may be present. Since the curation process of novel annotations is a costly procedure, both in economical and time terms, computational tools that can reliably predict likely annotations, and thus quicken the discovery of new gene annotations, are very useful. Methods We used a set of computational algorithms and weighting schemes to infer novel gene annotations from a set of known ones. We used the latent semantic analysis approach, implementing two popular algorithms (Latent Semantic Indexing and Probabilistic Latent Semantic Analysis) and propose a novel method, the Semantic IMproved Latent Semantic Analysis, which adds a clustering step on the set of considered genes. Furthermore, we propose the improvement of these algorithms by weighting the annotations in the input set. Results We tested our methods and their weighted variants on the Gene Ontology annotation sets of three model organism genes (Bos taurus, Danio rerio and Drosophila melanogaster ). The methods showed their ability in predicting novel gene annotations and the weighting procedures demonstrated to lead to a valuable improvement, although the obtained results vary according to the dimension of the input annotation set and the considered algorithm. Conclusions Out of the three considered methods, the Semantic IMproved Latent Semantic Analysis is the one that provides better results. In particular, when coupled with a proper

  12. A non-empirical, parameter-free, hybrid functional for accurate calculations of optoelectronic properties of finite systems

    NASA Astrophysics Data System (ADS)

    Brawand, Nicholas; Vörös, Márton; Govoni, Marco; Galli, Giulia

    The accurate prediction of optoelectronic properties of molecules and solids is a persisting challenge for current density functional theory (DFT) based methods. We propose a hybrid functional where the mixing fraction of exact and local exchange is determined by a non-empirical, system dependent function. This functional yields ionization potentials, fundamental and optical gaps of many, diverse systems in excellent agreement with experiments, including organic and inorganic molecules and nanocrystals. We further demonstrate that the newly defined hybrid functional gives the correct alignment between the energy level of the exemplary TTF-TCNQ donor-acceptor system. DOE-BES: DE-FG02-06ER46262.

  13. Accurate localization of supernumerary mediastinal parathyroid adenomas by a combination of structural and functional imaging.

    PubMed

    Mackie, G C; Schlicht, S M

    2004-09-01

    Reoperation for refractory or recurrent hyperparathyroidism following parathyroidectomy carries the potential for increased morbidity and the possibility of failure to localize and remove the lesion intraoperatively. Reported herein are three cases demonstrating the combined use of sestamibi scintigraphy, CT and MR for accurate localization of mediastinal parathyroid adenomas.

  14. Objective-guided image annotation.

    PubMed

    Mao, Qi; Tsang, Ivor Wai-Hung; Gao, Shenghua

    2013-04-01

    Automatic image annotation, which is usually formulated as a multi-label classification problem, is one of the major tools used to enhance the semantic understanding of web images. Many multimedia applications (e.g., tag-based image retrieval) can greatly benefit from image annotation. However, the insufficient performance of image annotation methods prevents these applications from being practical. On the other hand, specific measures are usually designed to evaluate how well one annotation method performs for a specific objective or application, but most image annotation methods do not consider optimization of these measures, so that they are inevitably trapped into suboptimal performance of these objective-specific measures. To address this issue, we first summarize a variety of objective-guided performance measures under a unified representation. Our analysis reveals that macro-averaging measures are very sensitive to infrequent keywords, and hamming measure is easily affected by skewed distributions. We then propose a unified multi-label learning framework, which directly optimizes a variety of objective-specific measures of multi-label learning tasks. Specifically, we first present a multilayer hierarchical structure of learning hypotheses for multi-label problems based on which a variety of loss functions with respect to objective-guided measures are defined. And then, we formulate these loss functions as relaxed surrogate functions and optimize them by structural SVMs. According to the analysis of various measures and the high time complexity of optimizing micro-averaging measures, in this paper, we focus on example-based measures that are tailor-made for image annotation tasks but are seldom explored in the literature. Experiments show consistency with the formal analysis on two widely used multi-label datasets, and demonstrate the superior performance of our proposed method over state-of-the-art baseline methods in terms of example-based measures on four

  15. Experimental annotation of post-translational features and translated coding regions in the pathogen Salmonella Typhimurium

    SciTech Connect

    Ansong, Charles; Tolic, Nikola; Purvine, Samuel O.; Porwollik, Steffen; Jones, Marcus B.; Yoon, Hyunjin; Payne, Samuel H.; Martin, Jessica L.; Burnet, Meagan C.; Monroe, Matthew E.; Venepally, Pratap; Smith, Richard D.; Peterson, Scott; Heffron, Fred; Mcclelland, Michael; Adkins, Joshua N.

    2011-08-25

    Complete and accurate genome annotation is crucial for comprehensive and systematic studies of biological systems. For example systems biology-oriented genome scale modeling efforts greatly benefit from accurate annotation of protein-coding genes to develop proper functioning models. However, determining protein-coding genes for most new genomes is almost completely performed by inference, using computational predictions with significant documented error rates (> 15%). Furthermore, gene prediction programs provide no information on biologically important post-translational processing events critical for protein function. With the ability to directly measure peptides arising from expressed proteins, mass spectrometry-based proteomics approaches can be used to augment and verify coding regions of a genomic sequence and importantly detect post-translational processing events. In this study we utilized “shotgun” proteomics to guide accurate primary genome annotation of the bacterial pathogen Salmonella Typhimurium 14028 to facilitate a systems-level understanding of Salmonella biology. The data provides protein-level experimental confirmation for 44% of predicted protein-coding genes, suggests revisions to 48 genes assigned incorrect translational start sites, and uncovers 13 non-annotated genes missed by gene prediction programs. We also present a comprehensive analysis of post-translational processing events in Salmonella, revealing a wide range of complex chemical modifications (70 distinct modifications) and confirming more than 130 signal peptide and N-terminal methionine cleavage events in Salmonella. This study highlights several ways in which proteomics data applied during the primary stages of annotation can improve the quality of genome annotations, especially with regards to the annotation of mature protein products.

  16. The DNA sequence and biological annotation of human chromosome 1.

    PubMed

    Gregory, S G; Barlow, K F; McLay, K E; Kaul, R; Swarbreck, D; Dunham, A; Scott, C E; Howe, K L; Woodfine, K; Spencer, C C A; Jones, M C; Gillson, C; Searle, S; Zhou, Y; Kokocinski, F; McDonald, L; Evans, R; Phillips, K; Atkinson, A; Cooper, R; Jones, C; Hall, R E; Andrews, T D; Lloyd, C; Ainscough, R; Almeida, J P; Ambrose, K D; Anderson, F; Andrew, R W; Ashwell, R I S; Aubin, K; Babbage, A K; Bagguley, C L; Bailey, J; Beasley, H; Bethel, G; Bird, C P; Bray-Allen, S; Brown, J Y; Brown, A J; Buckley, D; Burton, J; Bye, J; Carder, C; Chapman, J C; Clark, S Y; Clarke, G; Clee, C; Cobley, V; Collier, R E; Corby, N; Coville, G J; Davies, J; Deadman, R; Dunn, M; Earthrowl, M; Ellington, A G; Errington, H; Frankish, A; Frankland, J; French, L; Garner, P; Garnett, J; Gay, L; Ghori, M R J; Gibson, R; Gilby, L M; Gillett, W; Glithero, R J; Grafham, D V; Griffiths, C; Griffiths-Jones, S; Grocock, R; Hammond, S; Harrison, E S I; Hart, E; Haugen, E; Heath, P D; Holmes, S; Holt, K; Howden, P J; Hunt, A R; Hunt, S E; Hunter, G; Isherwood, J; James, R; Johnson, C; Johnson, D; Joy, A; Kay, M; Kershaw, J K; Kibukawa, M; Kimberley, A M; King, A; Knights, A J; Lad, H; Laird, G; Lawlor, S; Leongamornlert, D A; Lloyd, D M; Loveland, J; Lovell, J; Lush, M J; Lyne, R; Martin, S; Mashreghi-Mohammadi, M; Matthews, L; Matthews, N S W; McLaren, S; Milne, S; Mistry, S; Moore, M J F; Nickerson, T; O'Dell, C N; Oliver, K; Palmeiri, A; Palmer, S A; Parker, A; Patel, D; Pearce, A V; Peck, A I; Pelan, S; Phelps, K; Phillimore, B J; Plumb, R; Rajan, J; Raymond, C; Rouse, G; Saenphimmachak, C; Sehra, H K; Sheridan, E; Shownkeen, R; Sims, S; Skuce, C D; Smith, M; Steward, C; Subramanian, S; Sycamore, N; Tracey, A; Tromans, A; Van Helmond, Z; Wall, M; Wallis, J M; White, S; Whitehead, S L; Wilkinson, J E; Willey, D L; Williams, H; Wilming, L; Wray, P W; Wu, Z; Coulson, A; Vaudin, M; Sulston, J E; Durbin, R; Hubbard, T; Wooster, R; Dunham, I; Carter, N P; McVean, G; Ross, M T; Harrow, J; Olson, M V; Beck, S; Rogers, J; Bentley, D R; Banerjee, R; Bryant, S P; Burford, D C; Burrill, W D H; Clegg, S M; Dhami, P; Dovey, O; Faulkner, L M; Gribble, S M; Langford, C F; Pandian, R D; Porter, K M; Prigmore, E

    2006-05-18

    The reference sequence for each human chromosome provides the framework for understanding genome function, variation and evolution. Here we report the finished sequence and biological annotation of human chromosome 1. Chromosome 1 is gene-dense, with 3,141 genes and 991 pseudogenes, and many coding sequences overlap. Rearrangements and mutations of chromosome 1 are prevalent in cancer and many other diseases. Patterns of sequence variation reveal signals of recent selection in specific genes that may contribute to human fitness, and also in regions where no function is evident. Fine-scale recombination occurs in hotspots of varying intensity along the sequence, and is enriched near genes. These and other studies of human biology and disease encoded within chromosome 1 are made possible with the highly accurate annotated sequence, as part of the completed set of chromosome sequences that comprise the reference human genome.

  17. Multiconfiguration Pair-Density Functional Theory Is as Accurate as CASPT2 for Electronic Excitation.

    PubMed

    Hoyer, Chad E; Ghosh, Soumen; Truhlar, Donald G; Gagliardi, Laura

    2016-02-01

    A correct description of electronically excited states is critical to the interpretation of visible-ultraviolet spectra, photochemical reactions, and excited-state charge-transfer processes in chemical systems. We have recently proposed a theory called multiconfiguration pair-density functional theory (MC-PDFT), which is based on a combination of multiconfiguration wave function theory and a new kind of density functional called an on-top density functional. Here, we show that MC-PDFT with a first-generation on-top density functional performs as well as CASPT2 for an organic chemistry database including valence, Rydberg, and charge-transfer excitations. The results are very encouraging for practical applications. PMID:26794241

  18. An automated protein annotation filter for integrating web-based annotation tools

    PubMed Central

    Saravanan, Vijayakumar; Shanmughavel, Primanayagam

    2007-01-01

    A wide range of web based prediction and annotation tools are frequently used for determining protein function from sequence. However, parallel processing of sequences for annotation through web tools is not possible due to several constraints in functional programming for multiple queries. Here, we propose the development of APAF as an automated protein annotation filter to overcome some of these difficulties through an integrated approach. PMID:18188426

  19. An automated protein annotation filter for integrating web-based annotation tools.

    PubMed

    Saravanan, Vijayakumar; Shanmughavel, Primanayagam

    2007-12-15

    A wide range of web based prediction and annotation tools are frequently used for determining protein function from sequence. However, parallel processing of sequences for annotation through web tools is not possible due to several constraints in functional programming for multiple queries. Here, we propose the development of APAF as an automated protein annotation filter to overcome some of these difficulties through an integrated approach.

  20. Applicability of density functional theory in reproducing accurate vibrational spectra of surface bound species.

    PubMed

    Matanović, Ivana; Atanassov, Plamen; Kiefer, Boris; Garzon, Fernando H; Henson, Neil J

    2014-10-01

    The structural equilibrium parameters, the adsorption energies, and the vibrational frequencies of the nitrogen molecule and the hydrogen atom adsorbed on the (111) surface of rhodium have been investigated using different generalized-gradient approximation (GGA), nonlocal correlation, meta-GGA, and hybrid functionals, namely, Perdew, Burke, and Ernzerhof (PBE), Revised-RPBE, vdW-DF, Tao, Perdew, Staroverov, and Scuseria functional (TPSS), and Heyd, Scuseria, and Ernzerhof (HSE06) functional in the plane wave formalism. Among the five tested functionals, nonlocal vdW-DF and meta-GGA TPSS functionals are most successful in describing energetics of dinitrogen physisorption to the Rh(111) surface, while the PBE functional provides the correct chemisorption energy for the hydrogen atom. It was also found that TPSS functional produces the best vibrational spectra of the nitrogen molecule and the hydrogen atom on rhodium within the harmonic formalism with the error of -2.62 and -1.1% for the N-N stretching and Rh-H stretching frequency. Thus, TPSS functional was proposed as a method of choice for obtaining vibrational spectra of low weight adsorbates on metallic surfaces within the harmonic approximation. At the anharmonic level, by decoupling the Rh-H and N-N stretching modes from the bulk phonons and by solving one- and two-dimensional Schrödinger equation associated with the Rh-H, Rh-N, and N-N potential energy we calculated the anharmonic correction for N-N and Rh-H stretching modes as -31 cm(-1) and -77 cm(-1) at PBE level. Anharmonic vibrational frequencies calculated with the use of the hybrid HSE06 function are in best agreement with available experiments. PMID:25164265

  1. Applicability of Density Functional Theory in Reproducing Accurate Vibrational Spectra of Surface Bound Species

    SciTech Connect

    Matanovic, Ivana; Atanassov, Plamen; Kiefer, Boris; Garzon, Fernando; Henson, Neil J.

    2014-10-05

    The structural equilibrium parameters, the adsorption energies, and the vibrational frequencies of the nitrogen molecule and the hydrogen atom adsorbed on the (111) surface of rhodium have been investigated using different generalized-gradient approximation (GGA), nonlocal correlation, meta-GGA, and hybrid functionals, namely, Perdew, Burke, and Ernzerhof (PBE), Revised-RPBE, vdW-DF, Tao, Perdew, Staroverov, and Scuseria functional (TPSS), and Heyd, Scuseria, and Ernzerhof (HSE06) functional in the plane wave formalism. Among the five tested functionals, nonlocal vdW-DF and meta-GGA TPSS functionals are most successful in describing energetics of dinitrogen physisorption to the Rh(111) surface, while the PBE functional provides the correct chemisorption energy for the hydrogen atom. It was also found that TPSS functional produces the best vibrational spectra of the nitrogen molecule and the hydrogen atom on rhodium within the harmonic formalism with the error of 22.62 and 21.1% for the NAN stretching and RhAH stretching frequency. Thus, TPSS functional was proposed as a method of choice for obtaining vibrational spectra of low weight adsorbates on metallic surfaces within the harmonic approximation. At the anharmonic level, by decoupling the RhAH and NAN stretching modes from the bulk phonons and by solving one- and two-dimensional Schr€odinger equation associated with the RhAH, RhAN, and NAN potential energy we calculated the anharmonic correction for NAN and RhAH stretching modes as 231 cm21 and 277 cm21 at PBE level. Anharmonic vibrational frequencies calculated with the use of the hybrid HSE06 function are in best agreement with available experiments.

  2. Conservation and function of Rab small GTPases in Entamoeba: annotation of E. invadens Rab and its use for the understanding of Entamoeba biology.

    PubMed

    Nakada-Tsukui, Kumiko; Saito-Nakano, Yumiko; Husain, Afzal; Nozaki, Tomoyoshi

    2010-11-01

    Entamoeba invadens is a reptilian enteric protozoan parasite closely related to the human pathogen Entamoeba histolytica and a good model organism of encystation. To understand the molecular mechanism of vesicular trafficking involved in the encystation of Entamoeba, we examined the conservation of Rab small GTPases between the two species. E. invadens has over 100 Rab genes, similar to E. histolytica. Most of the Rab subfamilies are conserved between the two species, while a number of species-specific Rabs are also present. We annotated all E. invadens Rabs according to the previous nomenclature [Saito-Nakano, Y., Loftus, B.J., Hall, N., Nozaki, T., 2005. The diversity of Rab GTPases in Entamoeba histolytica. Experimental Parasitology 110, 244-252]. Comparative genomic analysis suggested that the fundamental vesicular traffic machinery is well conserved, while there are species-specific protein transport mechanisms. We also reviewed the function of Rabs in Entamoeba, and proposed the use of the annotation of E. invadens Rab genes to understand the ubiquitous importance of Rab-mediated membrane trafficking during important biological processes including differentiation in Entamoeba.

  3. New density functional parameterizations to accurate calculations of electric field gradient variations among compounds.

    PubMed

    Santiago, Régis Tadeu; Haiduke, Roberto Luiz Andrade

    2015-10-30

    This research provides a performance investigation of density functional theory and also proposes new functional parameterizations to deal with electric field gradient (EFG) calculations at nuclear positions. The entire procedure is conducted within the four-component formalism. First, we noticed that traditional hybrid and long-range corrected functionals are more efficient in the description of EFG variations for a set of elements (indium, antimony, iodine, lutetium, and hafnium) among linear molecules. Thus, we selected the PBE0, B3LYP, and CAM-B3LYP functionals and promoted a reoptimization of their parameters for a better description of these EFG changes. The PBE0q variant developed here showed an overall promising performance in a validation test conducted with potassium, iodine, copper, and gold. In general, the correlation coefficients found in linear regressions between experimental nuclear quadrupole coupling constants and calculated EFGs are improved while the systematic EFG errors also decrease as a result of this reparameterization. PMID:26284820

  4. Improving DOE-2's RESYS routine: User defined functions to provide more accurate part load energy use and humidity predictions

    SciTech Connect

    Henderson, Hugh I.; Parker, Danny; Huang, Yu J.

    2000-08-04

    In hourly energy simulations, it is important to properly predict the performance of air conditioning systems over a range of full and part load operating conditions. An important component of these calculations is to properly consider the performance of the cycling air conditioner and how it interacts with the building. This paper presents improved approaches to properly account for the part load performance of residential and light commercial air conditioning systems in DOE-2. First, more accurate correlations are given to predict the degradation of system efficiency at part load conditions. In addition, a user-defined function for RESYS is developed that provides improved predictions of air conditioner sensible and latent capacity at part load conditions. The user function also provides more accurate predictions of space humidity by adding ''lumped'' moisture capacitance into the calculations. The improved cooling coil model and the addition of moisture capacitance predicts humidity swings that are more representative of the performance observed in real buildings.

  5. Accurate and efficient calculation of discrete correlation functions and power spectra

    NASA Astrophysics Data System (ADS)

    Xu, Y. F.; Liu, J. M.; Zhu, W. D.

    2015-07-01

    Operational modal analysis (OMA), or output-only modal analysis, has been widely conducted especially when excitation applied on a structure is unknown or difficult to measure. Discrete cross-correlation functions and cross-power spectra between a reference data series and measured response data series are bases for OMA to identify modal properties of a structure. Such functions and spectra can be efficiently transformed from each other using the discrete Fourier transform (DFT) and inverse DFT (IDFT) based on the cross-correlation theorem. However, a direct application of the theorem and transforms, including the DFT and IDFT, can yield physically erroneous results due to periodic extension of the DFT on a function of a finite length to be transformed, which is false most of the time. Padding zero series to ends of data series before applying the theorem and transforms can reduce the errors, but the results are still physically erroneous. A new methodology is developed in this work to calculate discrete cross-correlation functions of non-negative time delays and associated cross-power spectra, referred to as half spectra, for OMA. The methodology can be extended to cross-correlation functions of any time delays and associated cross-power spectra, referred to as full spectra. The new methodology is computationally efficient due to use of the transforms. Data series are properly processed to avoid the errors caused by the periodic extension, and the resulting cross-correlation functions and associated cross-power spectra perfectly comply with their definitions. A coherence function, a convergence function, and a convergence index are introduced to evaluate qualities of measured cross-correlation functions and associated cross-power spectra. The new methodology was numerically and experimentally applied to an ideal two-degree-of-freedom (2-DOF) mass-spring-damper system and a damaged aluminum beam, respectively, and OMA was conducted using half spectra to estimate

  6. A simple and accurate grading system for orthoiodohippurate renal scans in the assessment of post-transplant renal function

    SciTech Connect

    Zaki, S.K.; Bretan, P.N.; Go, R.T.; Rehm, P.K.; Streem, S.B.; Novick, A.C. )

    1990-06-01

    Orthoiodohippurate renal scanning has proved to be a reliable, noninvasive method for the evaluation and followup of renal allograft function. However, a standardized system for grading renal function with this test is not available. We propose a simple grading system to distinguish the different functional phases of hippurate scanning in renal transplant recipients. This grading system was studied in 138 patients who were evaluated 1 week after renal transplantation. There was a significant correlation between the isotope renographic functional grade and clinical correlates of allograft function such as the serum creatinine level (p = 0.0001), blood urea nitrogen level (p = 0.0001), urine output (p = 0.005) and need for hemodialysis (p = 0.007). We recommend this grading system as a simple and accurate method to interpret orthoiodohippurate renal scans in the evaluation and followup of renal allograft recipients.

  7. An extended set of yeast-based functional assays accurately identifies human disease mutations

    PubMed Central

    Sun, Song; Yang, Fan; Tan, Guihong; Costanzo, Michael; Oughtred, Rose; Hirschman, Jodi; Theesfeld, Chandra L.; Bansal, Pritpal; Sahni, Nidhi; Yi, Song; Yu, Analyn; Tyagi, Tanya; Tie, Cathy; Hill, David E.; Vidal, Marc; Andrews, Brenda J.; Boone, Charles; Dolinski, Kara; Roth, Frederick P.

    2016-01-01

    We can now routinely identify coding variants within individual human genomes. A pressing challenge is to determine which variants disrupt the function of disease-associated genes. Both experimental and computational methods exist to predict pathogenicity of human genetic variation. However, a systematic performance comparison between them has been lacking. Therefore, we developed and exploited a panel of 26 yeast-based functional complementation assays to measure the impact of 179 variants (101 disease- and 78 non-disease-associated variants) from 22 human disease genes. Using the resulting reference standard, we show that experimental functional assays in a 1-billion-year diverged model organism can identify pathogenic alleles with significantly higher precision and specificity than current computational methods. PMID:26975778

  8. An extended set of yeast-based functional assays accurately identifies human disease mutations.

    PubMed

    Sun, Song; Yang, Fan; Tan, Guihong; Costanzo, Michael; Oughtred, Rose; Hirschman, Jodi; Theesfeld, Chandra L; Bansal, Pritpal; Sahni, Nidhi; Yi, Song; Yu, Analyn; Tyagi, Tanya; Tie, Cathy; Hill, David E; Vidal, Marc; Andrews, Brenda J; Boone, Charles; Dolinski, Kara; Roth, Frederick P

    2016-05-01

    We can now routinely identify coding variants within individual human genomes. A pressing challenge is to determine which variants disrupt the function of disease-associated genes. Both experimental and computational methods exist to predict pathogenicity of human genetic variation. However, a systematic performance comparison between them has been lacking. Therefore, we developed and exploited a panel of 26 yeast-based functional complementation assays to measure the impact of 179 variants (101 disease- and 78 non-disease-associated variants) from 22 human disease genes. Using the resulting reference standard, we show that experimental functional assays in a 1-billion-year diverged model organism can identify pathogenic alleles with significantly higher precision and specificity than current computational methods. PMID:26975778

  9. An extended set of yeast-based functional assays accurately identifies human disease mutations.

    PubMed

    Sun, Song; Yang, Fan; Tan, Guihong; Costanzo, Michael; Oughtred, Rose; Hirschman, Jodi; Theesfeld, Chandra L; Bansal, Pritpal; Sahni, Nidhi; Yi, Song; Yu, Analyn; Tyagi, Tanya; Tie, Cathy; Hill, David E; Vidal, Marc; Andrews, Brenda J; Boone, Charles; Dolinski, Kara; Roth, Frederick P

    2016-05-01

    We can now routinely identify coding variants within individual human genomes. A pressing challenge is to determine which variants disrupt the function of disease-associated genes. Both experimental and computational methods exist to predict pathogenicity of human genetic variation. However, a systematic performance comparison between them has been lacking. Therefore, we developed and exploited a panel of 26 yeast-based functional complementation assays to measure the impact of 179 variants (101 disease- and 78 non-disease-associated variants) from 22 human disease genes. Using the resulting reference standard, we show that experimental functional assays in a 1-billion-year diverged model organism can identify pathogenic alleles with significantly higher precision and specificity than current computational methods.

  10. Annotating Enzymes of Uncertain Function: The Deacylation of d-Amino Acids by Members of the Amidohydrolase Superfamily

    SciTech Connect

    Cummings, J.; Fedorov, A; Xu, C; Brown, S; Fedorov, E; Babbitt, P; Almo, S; Raushel, F

    2009-01-01

    The catalytic activities of three members of the amidohydrolase superfamily were discovered using amino acid substrate libraries. Bb3285 from Bordetella bronchiseptica, Gox1177 from Gluconobacter oxidans, and Sco4986 from Streptomyces coelicolor are currently annotated as d-aminoacylases or N-acetyl-d-glutamate deacetylases. These three enzymes are 22-34% identical to one another in amino acid sequence. Substrate libraries containing nearly all combinations of N-formyl-d-Xaa, N-acetyl-d-Xaa, N-succinyl-d-Xaa, and l-Xaa-d-Xaa were used to establish the substrate profiles for these enzymes. It was demonstrated that Bb3285 is restricted to the hydrolysis of N-acyl-substituted derivatives of d-glutamate. The best substrates for this enzyme are N-formyl-d-glutamate (k{sub cat}/K{sub m} = 5.8 x 10{sup 6} M{sup -1} s{sup -1}), N-acetyl-d-glutamate (k{sub cat}/K{sub m} = 5.2 x 10{sup 6} M{sup -1} s{sup -1}), and l-methionine-d-glutamate (k{sub cat}/K{sub m} = 3.4 x 10{sup 5} M{sup -1} s{sup -1}). Gox1177 and Sco4986 preferentially hydrolyze N-acyl-substituted derivatives of hydrophobic d-amino acids. The best substrates for Gox1177 are N-acetyl-d-leucine (k{sub cat}/K{sub m} = 3.2 x 104 M{sup -1} s-1), N-acetyl-d-tryptophan (kcat/Km = 4.1 x 104 M-1 s-1), and l-tyrosine-d-leucine (kcat/Km = 1.5 x 104 M-1 s-1). A fourth protein, Bb2785 from B. bronchiseptica, did not have d-aminoacylase activity. The best substrates for Sco4986 are N-acetyl-d-phenylalanine and N-acetyl-d-tryptophan. The three-dimensional structures of Bb3285 in the presence of the product acetate or a potent mimic of the tetrahedral intermediate were determined by X-ray diffraction methods. The side chain of the d-glutamate moiety of the inhibitor is ion-paired to Arg-295, while the {alpha}-carboxylate is ion-paired with Lys-250 and Arg-376. These results have revealed the chemical and structural determinants for substrate specificity in this protein. Bioinformatic analyses of an additional {approx}250

  11. Accumulation, functional annotation, and comparative analysis of expressed sequence tags in eggplant (Solanum melongena L.), the third pole of the genus Solanum species after tomato and potato.

    PubMed

    Fukuoka, Hiroyuki; Yamaguchi, Hirotaka; Nunome, Tsukasa; Negoro, Satomi; Miyatake, Koji; Ohyama, Akio

    2010-01-15

    Eggplant (Solanum melongena L.) is a widely grown vegetable crop that belongs to the genus Solanum, which is comprised of more than 1000 species of wide genetic and phenotypic variation. Unlike tomato and potato, Solanum crops that belong to subgenus Potatoe and have been targets for comprehensive genomic studies, eggplant is endemic to the Old World and belongs to a different subgenus, Leptostemonum, and therefore, would be a unique member for comparative molecular biology in Solanum. In this study, more than 60,000 eggplant cDNA clones from various tissues and treatments were sequenced from both the 5'- and 3'-ends, and a unigene set consisting of 16,245 unique sequences was constructed. Functional annotations based on sequence similarity to known plant reference datasets revealed a distribution of functional categories almost similar to that of tomato, while 1316 unigenes were suggested to be eggplant-specific. Sequence-based comparative analysis using putative orthologous gene groups setup by reciprocal sequence comparison among six solanaceous species suggested that eggplant and its wild ally Solanum torvum were clustered separately from subgenus Potatoe species, and then, all Solanum species were clustered separately from the genus Capsicum. Microsatellite motif distribution was different among species and likely to be coincident with the phylogenetic relationships. Furthermore, the eggplant unigene dataset exhibited its utility in transcriptome analysis by the SAGE strategy where a considerable number of short tag sequences of interest were successfully assigned to unigenes and their functional annotations. The eggplant ESTs and 16k unigene set developed in this study would be a useful resource not only for molecular genetics and breeding in eggplant itself, but for expanding the scope of comparative biology in Solanum species.

  12. Communication: a density functional with accurate fractional-charge and fractional-spin behaviour for s-electrons.

    PubMed

    Johnson, Erin R; Contreras-García, Julia

    2011-08-28

    We develop a new density-functional approach combining physical insight from chemical structure with treatment of multi-reference character by real-space modeling of the exchange-correlation hole. We are able to recover, for the first time, correct fractional-charge and fractional-spin behaviour for atoms of groups 1 and 2. Based on Becke's non-dynamical correlation functional [A. D. Becke, J. Chem. Phys. 119, 2972 (2003)] and explicitly accounting for core-valence separation and pairing effects, this method is able to accurately describe dissociation and strong correlation in s-shell many-electron systems.

  13. A method for the accurate and smooth approximation of standard thermodynamic functions

    NASA Astrophysics Data System (ADS)

    Coufal, O.

    2013-01-01

    A method is proposed for the calculation of approximations of standard thermodynamic functions. The method is consistent with the physical properties of standard thermodynamic functions. This means that the approximation functions are, in contrast to the hitherto used approximations, continuous and smooth in every temperature interval in which no phase transformations take place. The calculation algorithm was implemented by the SmoothSTF program in the C++ language which is part of this paper. Program summaryProgram title:SmoothSTF Catalogue identifier: AENH_v1_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AENH_v1_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html No. of lines in distributed program, including test data, etc.: 3807 No. of bytes in distributed program, including test data, etc.: 131965 Distribution format: tar.gz Programming language: C++. Computer: Any computer with gcc version 4.3.2 compiler. Operating system: Debian GNU Linux 6.0. The program can be run in operating systems in which the gcc compiler can be installed, see http://gcc.gnu.org/install/specific.html. RAM: 256 MB are sufficient for the table of standard thermodynamic functions with 500 lines Classification: 4.9. Nature of problem: Standard thermodynamic functions (STF) of individual substances are given by thermal capacity at constant pressure, entropy and enthalpy. STF are continuous and smooth in every temperature interval in which no phase transformations take place. The temperature dependence of STF as expressed by the table of its values is for further application approximated by temperature functions. In the paper, a method is proposed for calculating approximation functions which, in contrast to the hitherto used approximations, are continuous and smooth in every temperature interval. Solution method: The approximation functions are

  14. Neural network approach to quantum-chemistry data: Accurate prediction of density functional theory energies

    NASA Astrophysics Data System (ADS)

    Balabin, Roman M.; Lomakina, Ekaterina I.

    2009-08-01

    Artificial neural network (ANN) approach has been applied to estimate the density functional theory (DFT) energy with large basis set using lower-level energy values and molecular descriptors. A total of 208 different molecules were used for the ANN training, cross validation, and testing by applying BLYP, B3LYP, and BMK density functionals. Hartree-Fock results were reported for comparison. Furthermore, constitutional molecular descriptor (CD) and quantum-chemical molecular descriptor (QD) were used for building the calibration model. The neural network structure optimization, leading to four to five hidden neurons, was also carried out. The usage of several low-level energy values was found to greatly reduce the prediction error. An expected error, mean absolute deviation, for ANN approximation to DFT energies was 0.6±0.2 kcal mol-1. In addition, the comparison of the different density functionals with the basis sets and the comparison of multiple linear regression results were also provided. The CDs were found to overcome limitation of the QD. Furthermore, the effective ANN model for DFT/6-311G(3df,3pd) and DFT/6-311G(2df,2pd) energy estimation was developed, and the benchmark results were provided.

  15. Accurate Zero Parameter Correlation Energy Functional Obtained from the Homogeneous Electron Gas with an Energy Gap

    NASA Astrophysics Data System (ADS)

    Krieger, J. B.; Chen, Jiqiang; Iafrate, G. J.; Savin, A.

    1998-03-01

    We have obtained an analytic approximation to E_c(r_g, ζ,G) where G is an energy gap separating the occupied and unoccupied states of a homogeneous electron gas for ζ=3D0 and ξ=3D1. When G=3D0, E_c(r_g, ζ) reduces to the usual LSD result. This functional is employed in calculating correlation energies for unpolarized atoms and ions for Z <= 18 by taking G[n]=3D1/8|nabla ln n|^2, which reduces to the ionization energy in the large r limit in an exact Kohn-Sham (KS) theory. The resulting functional is self-interaction-corrected employing a method which is invariant under a unitary transformation. We find that the application of this approach to the calculation of the Ec functional reduces the error in the LSD result by more than 95%. When the value of G is approximately corrected to include the effect of higher lying unoccupied localized states, the resulting values of Ec are within a few percent of the exact results.

  16. Genomic Data and Annotation from the SEED

    DOE Data Explorer

    Fonstein, Michael; Kogan, Yakov; Osterman, Andrei; Overbeek, Ross; Vonstein, Veronika The Fellowship for Interpretation of Genomes (FIG)

    The SEED Project is a cooperative effort to annotate ever-expanding genomic data so researchers can conduct effective comparative analyses of genomes. Launched in 2003 by the Fellowship for Interpretation of Genomes (FIG), the project is one of several initiatives in ongoing development of data curation systems. SEED is designed to be used by scientists from numerous centers and with varied research objectives. As such, several institutions have since joined FIG in a consortium, including the University of Chicago, DOE’s Argonne National Laboratory (ANL), the University of Illinois at Urbana-Champaign, and others. As one example, ANL has used SEED to develop the National Microbial Pathogen Data Resource. Other agencies and institutions have used the project to discover genome components and clarify gene functions such as metabolism. SEED also has enabled researchers to conduct comparative analyses of closely related genomes and has supported derivation of stoichiometric models to understand metabolic processes. The SEED Project has been extended to support metagenomic samples and concomitant analytical tools. Moreover, the number of genomes being introduced into SEED is growing very rapidly. Building a framework to support this growth while providing highly accurate annotations is centrally important to SEED. The project’s subsystem-based annotation strategy has become the technological foundation for addressing these challenges.(copied from Appendix 7 of Systems Biology Knowledgebase for a New Era in Biology, A Genomics:GTL Report from the May 2008 Workshop, DOE/SC-0113, Grequrick, S; Fredrickson, J.K.; Stevens, R., Pub March 1, 2009.)

  17. Accurate Diels-Alder reaction energies from efficient density functional calculations.

    PubMed

    Mezei, Pál D; Csonka, Gábor I; Kállay, Mihály

    2015-06-01

    We assess the performance of the semilocal PBE functional; its global hybrid variants; the highly parametrized empirical M06-2X and M08-SO; the range separated rCAM-B3LYP and MCY3; the atom-pairwise or nonlocal dispersion corrected semilocal PBE and TPSS; the dispersion corrected range-separated ωB97X-D; the dispersion corrected double hybrids such as PWPB95-D3; the direct random phase approximation, dRPA, with Hartree-Fock, Perdew-Burke-Ernzerhof, and Perdew-Burke-Ernzerhof hybrid reference orbitals and the RPAX2 method based on a Perdew-Burke-Ernzerhof exchange reference orbitals for the Diels-Alder, DARC; and self-interaction error sensitive, SIE11, reaction energy test sets with large, augmented correlation consistent valence basis sets. The dRPA energies for the DARC test set are extrapolated to the complete basis set limit. CCSD(T)/CBS energies were used as a reference. The standard global hybrid functionals show general improvements over the typical endothermic energy error of semilocal functionals, but despite the increased accuracy the precision of the methods increases only slightly, and thus all reaction energies are simply shifted into the exothermic direction. Dispersion corrections give mixed results for the DARC test set. Vydrov-Van Voorhis 10 correction to the reaction energies gives superior quality results compared to the too-small D3 correction. Functionals parametrized for energies of noncovalent interactions like M08-SO give reasonable results without any dispersion correction. The dRPA method that seamlessly and theoretically correctly includes noncovalent interaction energies gives excellent results with properly chosen reference orbitals. As the results for the SIE11 test set and H2(+) dissociation show that the dRPA methods suffer from delocalization error, good reaction energies for the DARC test set from a given method do not prove that the method is free from delocalization error. The RPAX2 method shows good performance for the DARC

  18. Accurate vibrational frequencies using the self-consistent-charge density-functional tight-binding method

    NASA Astrophysics Data System (ADS)

    Małolepsza, Edyta; Witek, Henryk A.; Morokuma, Keiji

    2005-09-01

    An optimization technique for enhancing the quality of repulsive two-body potentials of the self-consistent-charge density-functional tight-binding (SCC-DFTB) method is presented and tested. The new, optimized potentials allow for significant improvement of calculated harmonic vibrational frequencies. Mean absolute deviation from experiment computed for a group of 14 hydrocarbons is reduced from 59.0 to 33.2 cm -1 and maximal absolute deviation, from 436.2 to 140.4 cm -1. A drawback of the new family of potentials is a lower quality of reproduced geometrical and energetic parameters.

  19. Two functionally distinct kinetochore pools of BubR1 ensure accurate chromosome segregation

    PubMed Central

    Zhang, Gang; Mendez, Blanca Lopez; Sedgwick, Garry G.; Nilsson, Jakob

    2016-01-01

    The BubR1/Bub3 complex is an important regulator of chromosome segregation as it facilitates proper kinetochore–microtubule interactions and is also an essential component of the spindle assembly checkpoint (SAC). Whether BubR1/Bub3 localization to kinetochores in human cells stimulates SAC signalling or only contributes to kinetochore–microtubule interactions is debated. Here we show that two distinct pools of BubR1/Bub3 exist at kinetochores and we uncouple these with defined BubR1/Bub3 mutants to address their function. The major kinetochore pool of BubR1/Bub3 is dependent on direct Bub1/Bub3 binding and is required for chromosome alignment but not for the SAC. A distinct pool of BubR1/Bub3 localizes by directly binding to phosphorylated MELT repeats on the outer kinetochore protein KNL1. When we prevent the direct binding of BubR1/Bub3 to KNL1 the checkpoint is weakened because BubR1/Bub3 is not incorporated into checkpoint complexes efficiently. In conclusion, kinetochore localization supports both known functions of BubR1/Bub3. PMID:27457023

  20. Compact and accurate linear and nonlinear autoregressive moving average model parameter estimation using laguerre functions.

    PubMed

    Chon, K H; Cohen, R J; Holstein-Rathlou, N H

    1997-01-01

    A linear and nonlinear autoregressive moving average (ARMA) identification algorithm is developed for modeling time series data. The algorithm uses Laguerre expansion of kernals (LEK) to estimate Volterra-Wiener kernals. However, instead of estimating linear and nonlinear system dynamics via moving average models, as is the case for the Volterra-Wiener analysis, we propose an ARMA model-based approach. The proposed algorithm is essentially the same as LEK, but this algorithm is extended to include past values of the output as well. Thus, all of the advantages associated with using the Laguerre function remain with our algorithm; but, by extending the algorithm to the linear and nonlinear ARMA model, a significant reduction in the number of Laguerre functions can be made, compared with the Volterra-Wiener approach. This translates into a more compact system representation and makes the physiological interpretation of higher order kernels easier. Furthermore, simulation results show better performance of the proposed approach in estimating the system dynamics than LEK in certain cases, and it remains effective in the presence of significant additive measurement noise. PMID:9236985

  1. Estimation method of point spread function based on Kalman filter for accurately evaluating real optical properties of photonic crystal fibers.

    PubMed

    Shen, Yan; Lou, Shuqin; Wang, Xin

    2014-03-20

    The evaluation accuracy of real optical properties of photonic crystal fibers (PCFs) is determined by the accurate extraction of air hole edges from microscope images of cross sections of practical PCFs. A novel estimation method of point spread function (PSF) based on Kalman filter is presented to rebuild the micrograph image of the PCF cross-section and thus evaluate real optical properties for practical PCFs. Through tests on both artificially degraded images and microscope images of cross sections of practical PCFs, we prove that the proposed method can achieve more accurate PSF estimation and lower PSF variance than the traditional Bayesian estimation method, and thus also reduce the defocus effect. With this method, we rebuild the microscope images of two kinds of commercial PCFs produced by Crystal Fiber and analyze the real optical properties of these PCFs. Numerical results are in accord with the product parameters.

  2. Functional Annotation, Genome Organization and Phylogeny of the Grapevine (Vitis vinifera) Terpene Synthase Gene Family Based on Genome Assembly, FLcDNA Cloning, and Enzyme Assays

    PubMed Central

    2010-01-01

    Background Terpenoids are among the most important constituents of grape flavour and wine bouquet, and serve as useful metabolite markers in viticulture and enology. Based on the initial 8-fold sequencing of a nearly homozygous Pinot noir inbred line, 89 putative terpenoid synthase genes (VvTPS) were predicted by in silico analysis of the grapevine (Vitis vinifera) genome assembly [1]. The finding of this very large VvTPS family, combined with the importance of terpenoid metabolism for the organoleptic properties of grapevine berries and finished wines, prompted a detailed examination of this gene family at the genomic level as well as an investigation into VvTPS biochemical functions. Results We present findings from the analysis of the up-dated 12-fold sequencing and assembly of the grapevine genome that place the number of predicted VvTPS genes at 69 putatively functional VvTPS, 20 partial VvTPS, and 63 VvTPS probable pseudogenes. Gene discovery and annotation included information about gene architecture and chromosomal location. A dense cluster of 45 VvTPS is localized on chromosome 18. Extensive FLcDNA cloning, gene synthesis, and protein expression enabled functional characterization of 39 VvTPS; this is the largest number of functionally characterized TPS for any species reported to date. Of these enzymes, 23 have unique functions and/or phylogenetic locations within the plant TPS gene family. Phylogenetic analyses of the TPS gene family showed that while most VvTPS form species-specific gene clusters, there are several examples of gene orthology with TPS of other plant species, representing perhaps more ancient VvTPS, which have maintained functions independent of speciation. Conclusions The highly expanded VvTPS gene family underpins the prominence of terpenoid metabolism in grapevine. We provide a detailed experimental functional annotation of 39 members of this important gene family in grapevine and comprehensive information about gene structure and

  3. Toward an Accurate Density-Functional Tight-Binding Description of Zinc-Containing Compounds.

    PubMed

    Moreira, Ney H; Dolgonos, Grygoriy; Aradi, Bálint; da Rosa, Andreia L; Frauenheim, Thomas

    2009-03-10

    An extended self-consistent charge density-functional tight-binding (SCC-DFTB) parametrization for Zn-X (X = H, C, N, O, S, and Zn) interactions has been derived. The performance of this new parametrization has been validated by calculating the structural and energetic properties of zinc solid phases such as bulk Zn, ZnO, and ZnS; ZnO surfaces and nanostructures; adsorption of small species (H, CO2, and NH3) on ZnO surfaces; and zinc-containing complexes mimicking the biological environment. Our results show that the derived parameters are universal and fully transferable, describing all the above-mentioned systems with accuracies comparable to those of first-principles DFT results. PMID:26610226

  4. Characterization of Thin Film Materials using SCAN MetaGGA, an Accurate Nonempirical Density Functional

    NASA Astrophysics Data System (ADS)

    Buda, Ioana-Gianina; Lane, Christopher; Barbiellini, Bernardo; Ruzsinszky, Adrienn; Sun, Jianwei; Perdew, John P.; Bansil, Arun

    The exact ground-state properties of a material can be derived from the single-particle Kohn-Sham equations within the framework of the Density Functional Theory (DFT), provided the exact exchange-correlation potential is known. The simplest approximation is the local density approximation (LDA), but it usually leads to overbinding in molecules and solids. On the other hand, the generalized gradient approximation (GGA) introduces corrections that expand and soften bonds. The newly developed nonempirical SCAN (strongly-constrained and appropriately-normed) MetaGGA [Phys. Rev. Lett. 115, 036402] has been shown to be comparable in efficiency to LDA and GGA, and to significantly improve LDA and the Perdew-Burke-Ernzerhof version of the GGA for ground-state properties such as equilibrium geometry and lattice constants for a number of standard datasets for molecules and solids. Here we discuss the performance of SCAN MetaGGA for thin films and monolayers and demonstrate improvements of predicted ground-state properties. Examples include graphene, phosphorene and MoS2.

  5. Efficient Approximation of Head-Related Transfer Functions in Subbands for Accurate Sound Localization

    PubMed Central

    Marelli, Damián; Baumgartner, Robert; Majdak, Piotr

    2015-01-01

    Head-related transfer functions (HRTFs) describe the acoustic filtering of incoming sounds by the human morphology and are essential for listeners to localize sound sources in virtual auditory displays. Since rendering complex virtual scenes is computationally demanding, we propose four algorithms for efficiently representing HRTFs in subbands, i.e., as an analysis filterbank (FB) followed by a transfer matrix and a synthesis FB. All four algorithms use sparse approximation procedures to minimize the computational complexity while maintaining perceptually relevant HRTF properties. The first two algorithms separately optimize the complexity of the transfer matrix associated to each HRTF for fixed FBs. The other two algorithms jointly optimize the FBs and transfer matrices for complete HRTF sets by two variants. The first variant aims at minimizing the complexity of the transfer matrices, while the second one does it for the FBs. Numerical experiments investigate the latency-complexity trade-off and show that the proposed methods offer significant computational savings when compared with other available approaches. Psychoacoustic localization experiments were modeled and conducted to find a reasonable approximation tolerance so that no significant localization performance degradation was introduced by the subband representation. PMID:26681930

  6. Accurate alignment of functional EPI data to anatomical MRI using a physics-based distortion model.

    PubMed

    Studholme, C; Constable, R T; Duncan, J S

    2000-11-01

    Mapping of functional magnetic resonance imaging (fMRI) to conventional anatomical MRI is a valuable step in the interpretation of fMRI activations. One of the main limits on the accuracy of this alignment arises from differences in the geometric distortion induced by magnetic field inhomogeneity. This paper describes an approach to the registration of echo planar image (EPI) data to conventional anatomical images which takes into account this difference in geometric distortion. We make use of an additional spin echo EPI image and use the known signal conservation in spin echo distortion to derive a specialized multimodality nonrigid registration algorithm. We also examine a plausible modification using log-intensity evaluation of the criterion to provide increased sensitivity in areas of low EPI signal. A phantom-based imaging experiment is used to evaluate the behavior of the different criteria, comparing nonrigid displacement estimates to those provided by a imagnetic field mapping acquisition. The algorithm is then applied to a range of nine brain imaging studies illustrating global and local improvement in the anatomical alignment and localization of fMRI activations.

  7. Accurate predictions of C-SO2R bond dissociation enthalpies using density functional theory methods.

    PubMed

    Yu, Hai-Zhu; Fu, Fang; Zhang, Liang; Fu, Yao; Dang, Zhi-Min; Shi, Jing

    2014-10-14

    The dissociation of the C-SO2R bond is frequently involved in organic and bio-organic reactions, and the C-SO2R bond dissociation enthalpies (BDEs) are potentially important for understanding the related mechanisms. The primary goal of the present study is to provide a reliable calculation method to predict the different C-SO2R bond dissociation enthalpies (BDEs). Comparing the accuracies of 13 different density functional theory (DFT) methods (such as B3LYP, TPSS, and M05 etc.), and different basis sets (such as 6-31G(d) and 6-311++G(2df,2p)), we found that M06-2X/6-31G(d) gives the best performance in reproducing the various C-S BDEs (and especially the C-SO2R BDEs). As an example for understanding the mechanisms with the aid of C-SO2R BDEs, some primary mechanistic studies were carried out on the chemoselective coupling (in the presence of a Cu-catalyst) or desulfinative coupling reactions (in the presence of a Pd-catalyst) between sulfinic acid salts and boryl/sulfinic acid salts.

  8. Towards an accurate specific reaction parameter density functional for water dissociation on Ni(111): RPBE versus PW91.

    PubMed

    Jiang, Bin; Guo, Hua

    2016-08-01

    In search for an accurate description of the dissociative chemisorption of water on the Ni(111) surface, we report a new nine-dimensional potential energy surface (PES) based on a large number of density functional theory points using the RPBE functional. Seven-dimensional quantum dynamical calculations have been carried out on the RPBE PES, followed by site averaging and lattice effect corrections, yielding sticking probabilities that are compared with both the previous theoretical results based on a PW91 PES and experiment. It is shown that the RPBE functional increases the reaction barrier, but has otherwise a minor impact on the PES topography. Better agreement with experimental results is obtained with the new PES, but the agreement is still not quantitative. Possible sources of the remaining discrepancies are discussed.

  9. Galileo Reader and Annotator

    NASA Astrophysics Data System (ADS)

    Besomi, O.

    2011-06-01

    In his readings, Galileo made frequent use of annotations. Here, I will offer a general glance at them by discussing the case of the annotations to the Libra astronomica published in 1619 by Orazio Grassi, a Jesuit mathematician of the Collegio Romano. The annotations directly reflect Galileo's reaction to Grassi's book in a heated debate between the two astronomers. Galileo and Grassi had opposite ideas about the nature of the comets, which resulted in different scientific and theological implications. The annotations represent the starting point for Galileo's reply to the Libra, namely Il Saggiatore, which was published four years later and dedicated to the new pope Urban VIII.

  10. Annotation of primate miRNAs by high throughput sequencing of small RNA libraries

    PubMed Central

    2012-01-01

    Background In addition to genome sequencing, accurate functional annotation of genomes is required in order to carry out comparative and evolutionary analyses between species. Among primates, the human genome is the most extensively annotated. Human miRNA gene annotation is based on multiple lines of evidence including evidence for expression as well as prediction of the characteristic hairpin structure. In contrast, most miRNA genes in non-human primates are annotated based on homology without any expression evidence. We have sequenced small-RNA libraries from chimpanzee, gorilla, orangutan and rhesus macaque from multiple individuals and tissues. Using patterns of miRNA expression in conjunction with a model of miRNA biogenesis we used these high-throughput sequencing data to identify novel miRNAs in non-human primates. Results We predicted 47 new miRNAs in chimpanzee, 240 in gorilla, 55 in orangutan and 47 in rhesus macaque. The algorithm we used was able to predict 64% of the previously known miRNAs in chimpanzee, 94% in gorilla, 61% in orangutan and 71% in rhesus macaque. We therefore added evidence for expression in between one and five tissues to miRNAs that were previously annotated based only on homology to human miRNAs. We increased from 60 to 175 the number miRNAs that are located in orthologous regions in humans and the four non-human primate species studied here. Conclusions In this study we provide expression evidence for homology-based annotated miRNAs and predict de novo miRNAs in four non-human primate species. We increased the number of annotated miRNA genes and provided evidence for their expression in four non-human primates. Similar approaches using different individuals and tissues would improve annotation in non-human primates and allow for further comparative studies in the future. PMID:22453055

  11. RASTtk: A modular and extensible implementation of the RAST algorithm for building custom annotation pipelines and annotating batches of genomes

    PubMed Central

    Brettin, Thomas; Davis, James J.; Disz, Terry; Edwards, Robert A.; Gerdes, Svetlana; Olsen, Gary J.; Olson, Robert; Overbeek, Ross; Parrello, Bruce; Pusch, Gordon D.; Shukla, Maulik; Thomason, James A.; Stevens, Rick; Vonstein, Veronika; Wattam, Alice R.; Xia, Fangfang

    2015-01-01

    The RAST (Rapid Annotation using Subsystem Technology) annotation engine was built in 2008 to annotate bacterial and archaeal genomes. It works by offering a standard software pipeline for identifying genomic features (i.e., protein-encoding genes and RNA) and annotating their functions. Recently, in order to make RAST a more useful research tool and to keep pace with advancements in bioinformatics, it has become desirable to build a version of RAST that is both customizable and extensible. In this paper, we describe the RAST tool kit (RASTtk), a modular version of RAST that enables researchers to build custom annotation pipelines. RASTtk offers a choice of software for identifying and annotating genomic features as well as the ability to add custom features to an annotation job. RASTtk also accommodates the batch submission of genomes and the ability to customize annotation protocols for batch submissions. This is the first major software restructuring of RAST since its inception. PMID:25666585

  12. RASTtk: A modular and extensible implementation of the RAST algorithm for building custom annotation pipelines and annotating batches of genomes

    SciTech Connect

    Brettin, Thomas; Davis, James J.; Disz, Terry; Edwards, Robert A.; Gerdes, Svetlana; Olsen, Gary J.; Olson, Robert; Overbeek, Ross; Parrello, Bruce; Pusch, Gordon D.; Shukla, Maulik; Thomason, III, James A.; Stevens, Rick; Vonstein, Veronika; Wattam, Alice R.; Xia, Fangfang

    2015-02-10

    The RAST (Rapid Annotation using Subsystem Technology) annotation engine was built in 2008 to annotate bacterial and archaeal genomes. It works by offering a standard software pipeline for identifying genomic features (i.e., protein-encoding genes and RNA) and annotating their functions. Recently, in order to make RAST a more useful research tool and to keep pace with advancements in bioinformatics, it has become desirable to build a version of RAST that is both customizable and extensible. In this paper, we describe the RAST tool kit (RASTtk), a modular version of RAST that enables researchers to build custom annotation pipelines. RASTtk offers a choice of software for identifying and annotating genomic features as well as the ability to add custom features to an annotation job. RASTtk also accommodates the batch submission of genomes and the ability to customize annotation protocols for batch submissions. This is the first major software restructuring of RAST since its inception.

  13. Dictionary-driven protein annotation.

    PubMed

    Rigoutsos, Isidore; Huynh, Tien; Floratos, Aris; Parida, Laxmi; Platt, Daniel

    2002-09-01

    Computational methods seeking to automatically determine the properties (functional, structural, physicochemical, etc.) of a protein directly from the sequence have long been the focus of numerous research groups. With the advent of advanced sequencing methods and systems, the number of amino acid sequences that are being deposited in the public databases has been increasing steadily. This has in turn generated a renewed demand for automated approaches that can annotate individual sequences and complete genomes quickly, exhaustively and objectively. In this paper, we present one such approach that is centered around and exploits the Bio-Dictionary, a collection of amino acid patterns that completely covers the natural sequence space and can capture functional and structural signals that have been reused during evolution, within and across protein families. Our annotation approach also makes use of a weighted, position-specific scoring scheme that is unaffected by the over-representation of well-conserved proteins and protein fragments in the databases used. For a given query sequence, the method permits one to determine, in a single pass, the following: local and global similarities between the query and any protein already present in a public database; the likeness of the query to all available archaeal/ bacterial/eukaryotic/viral sequences in the database as a function of amino acid position within the query; the character of secondary structure of the query as a function of amino acid position within the query; the cytoplasmic, transmembrane or extracellular behavior of the query; the nature and position of binding domains, active sites, post-translationally modified sites, signal peptides, etc. In terms of performance, the proposed method is exhaustive, objective and allows for the rapid annotation of individual sequences and full genomes. Annotation examples are presented and discussed in Results, including individual queries and complete genomes that were

  14. Sexually Transmitted Diseases: A Selective, Annotated Bibliography.

    ERIC Educational Resources Information Center

    Planned Parenthood Federation of America, Inc., New York, NY. Education Dept.

    This document contains a reference sheet and an annotated bibliography concerned with sexually transmitted diseases (STD). The reference sheet provides a brief, accurate overview of STDs which includes both statistical and background information. The bibliography contains 83 entries, listed alphabetically, that deal with STDs. Books and articles…

  15. Function Annotation of an SBP-box Gene in Arabidopsis Based on Analysis of Co-expression Networks and Promoters

    PubMed Central

    Wang, Yi; Hu, Zongli; Yang, Yuxin; Chen, Xuqing; Chen, Guoping

    2009-01-01

    The SQUAMOSA PROMOTER BINDING PROTEIN–LIKE (SPL) gene family is an SBP-box transcription family in Arabidopsis. While several physiological responses to SPL genes have been reported, their biological role remains elusive. Here, we use a combined analysis of expression correlation, the interactome, and promoter content to infer the biological role of the SPL genes in Arabidopsis thaliana. Analysis of the SPL-correlated gene network reveals multiple functions for SPL genes. Network analysis shows that SPL genes function by controlling other transcription factor families and have relatives with membrane protein transport activity. The interactome analysis of the correlation genes suggests that SPL genes also take part in metabolism of glucose, inorganic salts, and ATP production. Furthermore, the promoters of the correlated genes contain a core binding cis-element (GTAC). All of these analyses suggest that SPL genes have varied functions in Arabidopsis. PMID:19333437

  16. Annotating user-defined abstractions for optimization

    SciTech Connect

    Quinlan, D; Schordan, M; Vuduc, R; Yi, Q

    2005-12-05

    This paper discusses the features of an annotation language that we believe to be essential for optimizing user-defined abstractions. These features should capture semantics of function, data, and object-oriented abstractions, express abstraction equivalence (e.g., a class represents an array abstraction), and permit extension of traditional compiler optimizations to user-defined abstractions. Our future work will include developing a comprehensive annotation language for describing the semantics of general object-oriented abstractions, as well as automatically verifying and inferring the annotated semantics.

  17. An accurate metalloprotein-specific scoring function and molecular docking program devised by a dynamic sampling and iteration optimization strategy.

    PubMed

    Bai, Fang; Liao, Sha; Gu, Junfeng; Jiang, Hualiang; Wang, Xicheng; Li, Honglin

    2015-04-27

    Metalloproteins, particularly zinc metalloproteins, are promising therapeutic targets, and recent efforts have focused on the identification of potent and selective inhibitors of these proteins. However, the ability of current drug discovery and design technologies, such as molecular docking and molecular dynamics simulations, to probe metal-ligand interactions remains limited because of their complicated coordination geometries and rough treatment in current force fields. Herein we introduce a robust, multiobjective optimization algorithm-driven metalloprotein-specific docking program named MpSDock, which runs on a scheme similar to consensus scoring consisting of a force-field-based scoring function and a knowledge-based scoring function. For this purpose, in this study, an effective knowledge-based zinc metalloprotein-specific scoring function based on the inverse Boltzmann law was designed and optimized using a dynamic sampling and iteration optimization strategy. This optimization strategy can dynamically sample and regenerate decoy poses used in each iteration step of refining the scoring function, thus dramatically improving both the effectiveness of the exploration of the binding conformational space and the sensitivity of the ranking of the native binding poses. To validate the zinc metalloprotein-specific scoring function and its special built-in docking program, denoted MpSDockZn, an extensive comparison was performed against six universal, popular docking programs: Glide XP mode, Glide SP mode, Gold, AutoDock, AutoDock4Zn, and EADock DSS. The zinc metalloprotein-specific knowledge-based scoring function exhibited prominent performance in accurately describing the geometries and interactions of the coordination bonds between the zinc ions and chelating agents of the ligands. In addition, MpSDockZn had a competitive ability to sample and identify native binding poses with a higher success rate than the other six docking programs.

  18. Improved image quality in pinhole SPECT by accurate modeling of the point spread function in low magnification systems

    SciTech Connect

    Pino, Francisco; Roé, Nuria; Aguiar, Pablo; Falcon, Carles; Ros, Domènec; Pavía, Javier

    2015-02-15

    Purpose: Single photon emission computed tomography (SPECT) has become an important noninvasive imaging technique in small-animal research. Due to the high resolution required in small-animal SPECT systems, the spatially variant system response needs to be included in the reconstruction algorithm. Accurate modeling of the system response should result in a major improvement in the quality of reconstructed images. The aim of this study was to quantitatively assess the impact that an accurate modeling of spatially variant collimator/detector response has on image-quality parameters, using a low magnification SPECT system equipped with a pinhole collimator and a small gamma camera. Methods: Three methods were used to model the point spread function (PSF). For the first, only the geometrical pinhole aperture was included in the PSF. For the second, the septal penetration through the pinhole collimator was added. In the third method, the measured intrinsic detector response was incorporated. Tomographic spatial resolution was evaluated and contrast, recovery coefficients, contrast-to-noise ratio, and noise were quantified using a custom-built NEMA NU 4–2008 image-quality phantom. Results: A high correlation was found between the experimental data corresponding to intrinsic detector response and the fitted values obtained by means of an asymmetric Gaussian distribution. For all PSF models, resolution improved as the distance from the point source to the center of the field of view increased and when the acquisition radius diminished. An improvement of resolution was observed after a minimum of five iterations when the PSF modeling included more corrections. Contrast, recovery coefficients, and contrast-to-noise ratio were better for the same level of noise in the image when more accurate models were included. Ring-type artifacts were observed when the number of iterations exceeded 12. Conclusions: Accurate modeling of the PSF improves resolution, contrast, and recovery

  19. A novel analytical brain block tool to enable functional annotation of discriminatory transcript biomarkers among discrete regions of the fronto-limbic circuit in primate brain.

    PubMed

    Dalgard, Clifton L; Jacobowitz, David M; Singh, Vijay K; Saleem, Kadharbatcha S; Ursano, Robert J; Starr, Joshua M; Pollard, Harvey B

    2015-03-10

    Fronto-limbic circuits in the primate brain are responsible for executive function, learning and memory, and emotions, including fear. Consequently, changes in gene expression in cortical and subcortical brain regions housing these circuits are associated with many important psychiatric and neurological disorders. While high quality gene expression profiles can be identified in brains from model organisms, primate brains have unique features such as Brodmann Area 25, which is absent in rodents, yet profoundly important in primates, including humans. The potential insights to be gained from studying the human brain are complicated by the fact that the post-mortem interval (PMI) is variable, and most repositories keep solid tissue in the deep frozen state. Consequently, sampling the important medial and internal regions of these brains is difficult. Here we describe a novel method for obtaining discrete regions from the fronto-limbic circuits of a 4 year old and a 5 year old, male, intact, frozen non-human primate (NHP) brain, for which the PMI is exactly known. The method also preserves high quality RNA, from which we use transcriptional profiling and a new algorithm to identify region-exclusive RNA signatures for Area 25 (NFκB and dopamine receptor signaling), the anterior cingulate cortex (LXR/RXR signaling), the amygdala (semaphorin signaling), and the hippocampus (Ca(++) and retinoic acid signaling). The RNA signatures not only reflect function of the different regions, but also include highly expressed RNAs for which function is either poorly understood, or which generate proteins presently lacking annotated functions. We suggest that this new approach will provide a useful strategy for identifying changes in fronto-limbic system biology underlying normal development, aging and disease in the human brain. PMID:25529630

  20. Accurate potential energy curve of the LiH{sup +} molecule calculated with explicitly correlated Gaussian functions

    SciTech Connect

    Tung, Wei-Cheng; Adamowicz, Ludwik

    2014-03-28

    Very accurate calculations of the ground-state potential energy curve (PEC) of the LiH{sup +} ion performed with all-electron explicitly correlated Gaussian functions with shifted centers are presented. The variational method is employed. The calculations involve optimization of nonlinear exponential parameters of the Gaussians performed with the aid of the analytical first derivatives of the energy determined with respect to the parameters. The diagonal adiabatic correction is also calculated for each PEC point. The PEC is then used to calculate the vibrational energies of the system. In that calculation, the non-adiabatic effects are accounted for by using an effective vibrational mass obtained by the minimization of the difference between the vibrational energies obtained from the calculations where the Born-Oppenheimer approximation was not assumed and the results of the present calculations.

  1. Gene3D: a domain-based resource for comparative genomics, functional annotation and protein network analysis.

    PubMed

    Lees, Jonathan; Yeats, Corin; Perkins, James; Sillitoe, Ian; Rentzsch, Robert; Dessailly, Benoit H; Orengo, Christine

    2012-01-01

    Gene3D http://gene3d.biochem.ucl.ac.uk is a comprehensive database of protein domain assignments for sequences from the major sequence databases. Domains are directly mapped from structures in the CATH database or predicted using a library of representative profile HMMs derived from CATH superfamilies. As previously described, Gene3D integrates many other protein family and function databases. These facilitate complex associations of molecular function, structure and evolution. Gene3D now includes a domain functional family (FunFam) level below the homologous superfamily level assignments. Additions have also been made to the interaction data. More significantly, to help with the visualization and interpretation of multi-genome scale data sets, we have developed a new, revamped website. Searching has been simplified with more sophisticated filtering of results, along with new tools based on Cytoscape Web, for visualizing protein-protein interaction networks, differences in domain composition between genomes and the taxonomic distribution of individual superfamilies.

  2. The Use of Annotations in Examination Marking: Opening a Window into Markers' Minds

    ERIC Educational Resources Information Center

    Crisp, Victoria; Johnson, Martin

    2007-01-01

    This study investigated the functions of annotations, the role of annotations in markers' decision-making processes, whether annotations conform to conventions, and whether these vary according to subject area. Across subjects a number of scripts were analysed to survey which annotations are subject specific and which are more general. Twelve…

  3. Sequencing, De Novo Assembly, and Annotation of the Transcriptome of the Endangered Freshwater Pearl Bivalve, Cristaria plicata, Provides Novel Insights into Functional Genes and Marker Discovery

    PubMed Central

    Kang, Se Won; Hwang, Hee-Ju; Park, So Young; Park, Eun Bi; Chung, Jong Min; Song, Dae Kwon; Kim, Changmu; Kim, Soonok; Lee, Jun Sang; Han, Yeon Soo; Park, Hong Seog; Lee, Yong Seok

    2016-01-01

    Background The freshwater mussel Cristaria plicata (Bivalvia: Eulamellibranchia: Unionidae), is an economically important species in molluscan aquaculture due to its use in pearl farming. The species have been listed as endangered in South Korea due to the loss of natural habitats caused by anthropogenic activities. The decreasing population and a lack of genomic information on the species is concerning for environmentalists and conservationists. In this study, we conducted a de novo transcriptome sequencing and annotation analysis of C. plicata using Illumina HiSeq 2500 next-generation sequencing (NGS) technology, the Trinity assembler, and bioinformatics databases to prepare a sustainable resource for the identification of candidate genes involved in immunity, defense, and reproduction. Results The C. plicata transcriptome analysis included a total of 286,152,584 raw reads and 281,322,837 clean reads. The de novo assembly identified a total of 453,931 contigs and 374,794 non-redundant unigenes with average lengths of 731.2 and 737.1 bp, respectively. Furthermore, 100% coverage of C. plicata mitochondrial genes within two unigenes supported the quality of the assembler. In total, 84,274 unigenes showed homology to entries in at least one database, and 23,246 unigenes were allocated to one or more Gene Ontology (GO) terms. The most prominent GO biological process, cellular component, and molecular function categories (level 2) were cellular process, membrane, and binding, respectively. A total of 4,776 unigenes were mapped to 123 biological pathways in the KEGG database. Based on the GO terms and KEGG annotation, the unigenes were suggested to be involved in immunity, stress responses, sex-determination, and reproduction. A total of 17,251 cDNA simple sequence repeats (cSSRs) were identified from 61,141 unigenes (size of >1 kb) with the most abundant being dinucleotide repeats. Conclusions This dataset represents the first transcriptome analysis of the endangered

  4. Functional Annotation of the Ophiostoma novo-ulmi Genome: Insights into the Phytopathogenicity of the Fungal Agent of Dutch Elm Disease

    PubMed Central

    Comeau, André M.; Dufour, Josée; Bouvet, Guillaume F.; Jacobi, Volker; Nigg, Martha; Henrissat, Bernard; Laroche, Jérôme; Levesque, Roger C.; Bernier, Louis

    2015-01-01

    The ascomycete fungus Ophiostoma novo-ulmi is responsible for the pandemic of Dutch elm disease that has been ravaging Europe and North America for 50 years. We proceeded to annotate the genome of the O. novo-ulmi strain H327 that was sequenced in 2012. The 31.784-Mb nuclear genome (50.1% GC) is organized into 8 chromosomes containing a total of 8,640 protein-coding genes that we validated with RNA sequencing analysis. Approximately 53% of these genes have their closest match to Grosmannia clavigera kw1407, followed by 36% in other close Sordariomycetes, 5% in other Pezizomycotina, and surprisingly few (5%) orphans. A relatively small portion (∼3.4%) of the genome is occupied by repeat sequences; however, the mechanism of repeat-induced point mutation appears active in this genome. Approximately 76% of the proteins could be assigned functions using Gene Ontology analysis; we identified 311 carbohydrate-active enzymes, 48 cytochrome P450s, and 1,731 proteins potentially involved in pathogen–host interaction, along with 7 clusters of fungal secondary metabolites. Complementary mating-type locus sequencing, mating tests, and culturing in the presence of elm terpenes were conducted. Our analysis identified a specific genetic arsenal impacting the sexual and vegetative growth, phytopathogenicity, and signaling/plant–defense–degradation relationship between O. novo-ulmi and its elm host and insect vectors. PMID:25539722

  5. Making web annotations persistent over time

    SciTech Connect

    Sanderson, Robert; Van De Sompel, Herbert

    2010-01-01

    As Digital Libraries (DL) become more aligned with the web architecture, their functional components need to be fundamentally rethought in terms of URIs and HTTP. Annotation, a core scholarly activity enabled by many DL solutions, exhibits a clearly unacceptable characteristic when existing models are applied to the web: due to the representations of web resources changing over time, an annotation made about a web resource today may no longer be relevant to the representation that is served from that same resource tomorrow. We assume the existence of archived versions of resources, and combine the temporal features of the emerging Open Annotation data model with the capability offered by the Memento framework that allows seamless navigation from the URI of a resource to archived versions of that resource, and arrive at a solution that provides guarantees regarding the persistence of web annotations over time. More specifically, we provide theoretical solutions and proof-of-concept experimental evaluations for two problems: reconstructing an existing annotation so that the correct archived version is displayed for all resources involved in the annotation, and retrieving all annotations that involve a given archived version of a web resource.

  6. BABELOMICS: a suite of web tools for functional annotation and analysis of groups of genes in high-throughput experiments.

    PubMed

    Al-Shahrour, Fátima; Minguez, Pablo; Vaquerizas, Juan M; Conde, Lucía; Dopazo, Joaquín

    2005-07-01

    We present Babelomics, a complete suite of web tools for the functional analysis of groups of genes in high-throughput experiments, which includes the use of information on Gene Ontology terms, interpro motifs, KEGG pathways, Swiss-Prot keywords, analysis of predicted transcription factor binding sites, chromosomal positions and presence in tissues with determined histological characteristics, through five integrated modules: FatiGO (fast assignment and transference of information), FatiWise, transcription factor association test, GenomeGO and tissues mining tool, respectively. Additionally, another module, FatiScan, provides a new procedure that integrates biological information in combination with experimental results in order to find groups of genes with modest but coordinate significant differential behaviour. FatiScan is highly sensitive and is capable of finding significant asymmetries in the distribution of genes of common function across a list of ordered genes even if these asymmetries were not extreme. The strong multiple-testing nature of the contrasts made by the tools is taken into account. All the tools are integrated in the gene expression analysis package GEPAS. Babelomics is the natural evolution of our tool FatiGO (which analysed almost 22,000 experiments during the last year) to include more sources on information and new modes of using it. Babelomics can be found at http://www.babelomics.org.

  7. Oncotator: cancer variant annotation tool.

    PubMed

    Ramos, Alex H; Lichtenstein, Lee; Gupta, Manaswi; Lawrence, Michael S; Pugh, Trevor J; Saksena, Gordon; Meyerson, Matthew; Getz, Gad

    2015-04-01

    Oncotator is a tool for annotating genomic point mutations and short nucleotide insertions/deletions (indels) with variant- and gene-centric information relevant to cancer researchers. This information is drawn from 14 different publicly available resources that have been pooled and indexed, and we provide an extensible framework to add additional data sources. Annotations linked to variants range from basic information, such as gene names and functional classification (e.g. missense), to cancer-specific data from resources such as the Catalogue of Somatic Mutations in Cancer (COSMIC), the Cancer Gene Census, and The Cancer Genome Atlas (TCGA). For local use, Oncotator is freely available as a python module hosted on Github (https://github.com/broadinstitute/oncotator). Furthermore, Oncotator is also available as a web service and web application at http://www.broadinstitute.org/oncotator/.

  8. The index of orthognathic functional treatment need accurately prioritises those patients already selected for orthognathic surgery within the NHS.

    PubMed

    Shah, Rupal; Breeze, John; Chand, Mohit; Stockton, Peter

    2016-06-01

    The index of orthognathic functional treatment need (IOFTN) is a newly-proposed system to help to prioritise patients for orthognathic treatment. The five categories are similar to those used in orthodontics, but include additional parameters such as sleep apnoea and facial asymmetry. The aim of this audit was to validate the index and find out the potential future implications, should such a system ever be adopted by commissioners. We calculated the IOFTN category of 100 consecutive patients who had orthognathic surgery between 2010-14 using clinical notes, photographs, study models, and radiographs, and determined the number in categories 4 or 5, analogous to the current indications for orthodontic treatment within the NHS. Sufficient clinical information was available to categorise 59/100 patients, and 56 of the 59 (95%) were in either category 4 or 5. All three of the remaining patients (in categories 1-3) who were operated on were treated because of the anticipated favourable impact on their quality of life. The IOFTN has been proposed for use in future commissioning of orthognathic services within the NHS, and this study has confirmed its efficacy in prioritising treatment accurately, with 95% of patients being in categories 4 or 5. We recommend that the orthognathic treatment index be adapted to include additional psychosocial assessment so that patients who fall into the lower functional categories are not automatically excluded from this potentially life-changing treatment.

  9. Integration of high-content screening and untargeted metabolomics for comprehensive functional annotation of natural product libraries

    PubMed Central

    Kurita, Kenji L.; Glassey, Emerson; Linington, Roger G.

    2015-01-01

    Traditional natural products discovery using a combination of live/dead screening followed by iterative bioassay-guided fractionation affords no information about compound structure or mode of action until late in the discovery process. This leads to high rates of rediscovery and low probabilities of finding compounds with unique biological and/or chemical properties. By integrating image-based phenotypic screening in HeLa cells with high-resolution untargeted metabolomics analysis, we have developed a new platform, termed Compound Activity Mapping, that is capable of directly predicting the identities and modes of action of bioactive constituents for any complex natural product extract library. This new tool can be used to rapidly identify novel bioactive constituents and provide predictions of compound modes of action directly from primary screening data. This approach inverts the natural products discovery process from the existing ‟grind and find” model to a targeted, hypothesis-driven discovery model where the chemical features and biological function of bioactive metabolites are known early in the screening workflow, and lead compounds can be rationally selected based on biological and/or chemical novelty. We demonstrate the utility of the Compound Activity Mapping platform by combining 10,977 mass spectral features and 58,032 biological measurements from a library of 234 natural products extracts and integrating these two datasets to identify 13 clusters of fractions containing 11 known compound families and four new compounds. Using Compound Activity Mapping we discovered the quinocinnolinomycins, a new family of natural products with a unique carbon skeleton that cause endoplasmic reticulum stress. PMID:26371303

  10. Conservation, sex-biased expression and functional annotation of microRNAs in the gonad of Amur sturgeon (Acipenser schrenckii).

    PubMed

    Zhang, Xiujuan; Yuan, Lihong; Li, Linmiao; Jiang, Haiying; Chen, Jinping

    2016-06-01

    MicroRNAs (miRNAs) are involved in post-transcriptional gene regulation and have crucial roles in regulating the expression of gametogenesis-related genes in animals. However, the mechanism of sex determination and differentiation in the sturgeon has remained unclear. Identifying miRNAs and characterizing sex-biased miRNA expression is therefore critical for understanding the role of miRNAs during sexual differentiation in sturgeon. In this study, five different tissues from sturgeon before sex differentiation and the gonads were used for miRNA expression profiling. We screened 1037 miRNAs in miRBase 20.0 and an additional 103 sturgeon miRNAs using microarray and real-time PCR. We found that the sequences of 477 miRNAs out of a total of 1140 miRNAs were highly conserved (100%) among different fish species. From a total of 663 non-redundant miRNA probes, 481 miRNAs were detected in the gonads of both sexes. Of the 148 miRNAs that were identified to have sex-biased expression patterns between the testis and the ovary (P<0.01), 21 miRNAs (14.19%) were relatively highly expressed in the testis or the ovary with fold-changes >2. The microarray expression patterns of 13 randomly selected sex-biased miRNAs were validated using real-time PCR. Target gene prediction revealed a significant enrichment of functional groups (88 GO terms) and 18 KEGG pathways (P<0.05) and suggested that there are interactions between sex-biased miRNAs and 25 putative gametogenesis-related targets. Therefore, our miRNA expression analysis in juvenile A. schrenckii establishes a foundation for understanding and further investigating the role of miRNAs in sturgeon sex differentiation. PMID:27089517

  11. Conservation, sex-biased expression and functional annotation of microRNAs in the gonad of Amur sturgeon (Acipenser schrenckii).

    PubMed

    Zhang, Xiujuan; Yuan, Lihong; Li, Linmiao; Jiang, Haiying; Chen, Jinping

    2016-06-01

    MicroRNAs (miRNAs) are involved in post-transcriptional gene regulation and have crucial roles in regulating the expression of gametogenesis-related genes in animals. However, the mechanism of sex determination and differentiation in the sturgeon has remained unclear. Identifying miRNAs and characterizing sex-biased miRNA expression is therefore critical for understanding the role of miRNAs during sexual differentiation in sturgeon. In this study, five different tissues from sturgeon before sex differentiation and the gonads were used for miRNA expression profiling. We screened 1037 miRNAs in miRBase 20.0 and an additional 103 sturgeon miRNAs using microarray and real-time PCR. We found that the sequences of 477 miRNAs out of a total of 1140 miRNAs were highly conserved (100%) among different fish species. From a total of 663 non-redundant miRNA probes, 481 miRNAs were detected in the gonads of both sexes. Of the 148 miRNAs that were identified to have sex-biased expression patterns between the testis and the ovary (P<0.01), 21 miRNAs (14.19%) were relatively highly expressed in the testis or the ovary with fold-changes >2. The microarray expression patterns of 13 randomly selected sex-biased miRNAs were validated using real-time PCR. Target gene prediction revealed a significant enrichment of functional groups (88 GO terms) and 18 KEGG pathways (P<0.05) and suggested that there are interactions between sex-biased miRNAs and 25 putative gametogenesis-related targets. Therefore, our miRNA expression analysis in juvenile A. schrenckii establishes a foundation for understanding and further investigating the role of miRNAs in sturgeon sex differentiation.

  12. Accurate PSF-matched photometry and photometric redshifts for the extreme deep field with the Chebyshev-Fourier functions

    NASA Astrophysics Data System (ADS)

    Jiménez-Teja, Y.; Benítez, N.; Molino, A.; Fernandes, C. A. C.

    2015-10-01

    Photometric redshifts, which have become the cornerstone of several of the largest astronomical surveys like PanStarrs, DES, J-PAS and LSST, require precise measurements of galaxy photometry in different bands using a consistent physical aperture. This is not trivial, due to the variation in the shape and width of the point spread function (PSF) introduced by wavelength differences, instrument positions and atmospheric conditions. Current methods to correct for this effect rely on a detailed knowledge of PSF characteristics as a function of the survey coordinates, which can be difficult due to the relative paucity of stars tracking the PSF behaviour. Here we show that it is possible to measure accurate, consistent multicolour photometry without knowing the shape of the PSF. The Chebyshev-Fourier functions (CHEFs) can fit the observed profile of each object and produce high signal-to-noise integrated flux measurements unaffected by the PSF. These total fluxes, which encompass all the galaxy populations, are much more useful for galaxy evolution studies than aperture photometry. We compare the total magnitudes and colours obtained using our software to traditional photometry with SEXTRACTOR, using real data from the COSMOS survey and the Hubble Ultra-Deep Field (HUDF). We also apply the CHEF technique to the recently published eXtreme Deep Field (XDF) and compare the results to those from COLORPRO on the HUDF. We produce a photometric catalogue with 35 732 sources (10 823 with signal-to-noise ratio ≥5), reaching a photometric redshift precision of 2 per cent due to the extraordinary depth and wavelength coverage of the eXtreme Deep Field images.

  13. MixtureTree annotator: a program for automatic colorization and visual annotation of MixtureTree.

    PubMed

    Chen, Shu-Chuan; Ogata, Aaron

    2015-01-01

    The MixtureTree Annotator, written in JAVA, allows the user to automatically color any phylogenetic tree in Newick format generated from any phylogeny reconstruction program and output the Nexus file. By providing the ability to automatically color the tree by sequence name, the MixtureTree Annotator provides a unique advantage over any other programs which perform a similar function. In addition, the MixtureTree Annotator is the only package that can efficiently annotate the output produced by MixtureTree with mutation information and coalescent time information. In order to visualize the resulting output file, a modified version of FigTree is used. Certain popular methods, which lack good built-in visualization tools, for example, MEGA, Mesquite, PHY-FI, TreeView, treeGraph and Geneious, may give results with human errors due to either manually adding colors to each node or with other limitations, for example only using color based on a number, such as branch length, or by taxonomy. In addition to allowing the user to automatically color any given Newick tree by sequence name, the MixtureTree Annotator is the only method that allows the user to automatically annotate the resulting tree created by the MixtureTree program. The MixtureTree Annotator is fast and easy-to-use, while still allowing the user full control over the coloring and annotating process. PMID:25826378

  14. MixtureTree annotator: a program for automatic colorization and visual annotation of MixtureTree.

    PubMed

    Chen, Shu-Chuan; Ogata, Aaron

    2015-01-01

    The MixtureTree Annotator, written in JAVA, allows the user to automatically color any phylogenetic tree in Newick format generated from any phylogeny reconstruction program and output the Nexus file. By providing the ability to automatically color the tree by sequence name, the MixtureTree Annotator provides a unique advantage over any other programs which perform a similar function. In addition, the MixtureTree Annotator is the only package that can efficiently annotate the output produced by MixtureTree with mutation information and coalescent time information. In order to visualize the resulting output file, a modified version of FigTree is used. Certain popular methods, which lack good built-in visualization tools, for example, MEGA, Mesquite, PHY-FI, TreeView, treeGraph and Geneious, may give results with human errors due to either manually adding colors to each node or with other limitations, for example only using color based on a number, such as branch length, or by taxonomy. In addition to allowing the user to automatically color any given Newick tree by sequence name, the MixtureTree Annotator is the only method that allows the user to automatically annotate the resulting tree created by the MixtureTree program. The MixtureTree Annotator is fast and easy-to-use, while still allowing the user full control over the coloring and annotating process.

  15. Wiki-pi: a web-server of annotated human protein-protein interactions to aid in discovery of protein function.

    PubMed

    Orii, Naoki; Ganapathiraju, Madhavi K

    2012-01-01

    Protein-protein interactions (PPIs) are the basis of biological functions. Knowledge of the interactions of a protein can help understand its molecular function and its association with different biological processes and pathways. Several publicly available databases provide comprehensive information about individual proteins, such as their sequence, structure, and function. There also exist databases that are built exclusively to provide PPIs by curating them from published literature. The information provided in these web resources is protein-centric, and not PPI-centric. The PPIs are typically provided as lists of interactions of a given gene with links to interacting partners; they do not present a comprehensive view of the nature of both the proteins involved in the interactions. A web database that allows search and retrieval based on biomedical characteristics of PPIs is lacking, and is needed. We present Wiki-Pi (read Wiki-π), a web-based interface to a database of human PPIs, which allows users to retrieve interactions by their biomedical attributes such as their association to diseases, pathways, drugs and biological functions. Each retrieved PPI is shown with annotations of both of the participant proteins side-by-side, creating a basis to hypothesize the biological function facilitated by the interaction. Conceptually, it is a search engine for PPIs analogous to PubMed for scientific literature. Its usefulness in generating novel scientific hypotheses is demonstrated through the study of IGSF21, a little-known gene that was recently identified to be associated with diabetic retinopathy. Using Wiki-Pi, we infer that its association to diabetic retinopathy may be mediated through its interactions with the genes HSPB1, KRAS, TMSB4X and DGKD, and that it may be involved in cellular response to external stimuli, cytoskeletal organization and regulation of molecular activity. The website also provides a wiki-like capability allowing users to describe or

  16. Wiki-Pi: A Web-Server of Annotated Human Protein-Protein Interactions to Aid in Discovery of Protein Function

    PubMed Central

    Orii, Naoki; Ganapathiraju, Madhavi K.

    2012-01-01

    Protein-protein interactions (PPIs) are the basis of biological functions. Knowledge of the interactions of a protein can help understand its molecular function and its association with different biological processes and pathways. Several publicly available databases provide comprehensive information about individual proteins, such as their sequence, structure, and function. There also exist databases that are built exclusively to provide PPIs by curating them from published literature. The information provided in these web resources is protein-centric, and not PPI-centric. The PPIs are typically provided as lists of interactions of a given gene with links to interacting partners; they do not present a comprehensive view of the nature of both the proteins involved in the interactions. A web database that allows search and retrieval based on biomedical characteristics of PPIs is lacking, and is needed. We present Wiki-Pi (read Wiki-π), a web-based interface to a database of human PPIs, which allows users to retrieve interactions by their biomedical attributes such as their association to diseases, pathways, drugs and biological functions. Each retrieved PPI is shown with annotations of both of the participant proteins side-by-side, creating a basis to hypothesize the biological function facilitated by the interaction. Conceptually, it is a search engine for PPIs analogous to PubMed for scientific literature. Its usefulness in generating novel scientific hypotheses is demonstrated through the study of IGSF21, a little-known gene that was recently identified to be associated with diabetic retinopathy. Using Wiki-Pi, we infer that its association to diabetic retinopathy may be mediated through its interactions with the genes HSPB1, KRAS, TMSB4X and DGKD, and that it may be involved in cellular response to external stimuli, cytoskeletal organization and regulation of molecular activity. The website also provides a wiki-like capability allowing users to describe or

  17. Accurate and Efficient Calculation of van der Waals Interactions Within Density Functional Theory by Local Atomic Potential Approach

    SciTech Connect

    Sun, Y. Y.; Kim, Y. H.; Lee, K.; Zhang, S. B.

    2008-01-01

    Density functional theory (DFT) in the commonly used local density or generalized gradient approximation fails to describe van der Waals (vdW) interactions that are vital to organic, biological, and other molecular systems. Here, we propose a simple, efficient, yet accurate local atomic potential (LAP) approach, named DFT+LAP, for including vdW interactions in the framework of DFT. The LAPs for H, C, N, and O are generated by fitting the DFT+LAP potential energy curves of small molecule dimers to those obtained from coupled cluster calculations with single, double, and perturbatively treated triple excitations, CCSD(T). Excellent transferability of the LAPs is demonstrated by remarkable agreement with the JSCH-2005 benchmark database [P. Jurecka et al. Phys. Chem. Chem. Phys. 8, 1985 (2006)], which provides the interaction energies of CCSD(T) quality for 165 vdW and hydrogen-bonded complexes. For over 100 vdW dominant complexes in this database, our DFT+LAP calculations give a mean absolute deviation from the benchmark results less than 0.5 kcal/mol. The DFT+LAP approach involves no extra computational cost other than standard DFT calculations and no modification of existing DFT codes, which enables straightforward quantum simulations, such as ab initio molecular dynamics, on biomolecular systems, as well as on other organic systems.

  18. Protein Sequence Annotation Tool (PSAT): A centralized web-based meta-server for high-throughput sequence annotations

    DOE PAGES

    Leung, Elo; Huang, Amy; Cadag, Eithon; Montana, Aldrin; Soliman, Jan Lorenz; Zhou, Carol L. Ecale

    2016-01-20

    In this study, we introduce the Protein Sequence Annotation Tool (PSAT), a web-based, sequence annotation meta-server for performing integrated, high-throughput, genome-wide sequence analyses. Our goals in building PSAT were to (1) create an extensible platform for integration of multiple sequence-based bioinformatics tools, (2) enable functional annotations and enzyme predictions over large input protein fasta data sets, and (3) provide a web interface for convenient execution of the tools. In this paper, we demonstrate the utility of PSAT by annotating the predicted peptide gene products of Herbaspirillum sp. strain RV1423, importing the results of PSAT into EC2KEGG, and using the resultingmore » functional comparisons to identify a putative catabolic pathway, thereby distinguishing RV1423 from a well annotated Herbaspirillum species. This analysis demonstrates that high-throughput enzyme predictions, provided by PSAT processing, can be used to identify metabolic potential in an otherwise poorly annotated genome. Lastly, PSAT is a meta server that combines the results from several sequence-based annotation and function prediction codes, and is available at http://psat.llnl.gov/psat/. PSAT stands apart from other sequencebased genome annotation systems in providing a high-throughput platform for rapid de novo enzyme predictions and sequence annotations over large input protein sequence data sets in FASTA. PSAT is most appropriately applied in annotation of large protein FASTA sets that may or may not be associated with a single genome.« less

  19. VARIANT: Command Line, Web service and Web interface for fast and accurate functional characterization of variants found by Next-Generation Sequencing

    PubMed Central

    Medina, Ignacio; De Maria, Alejandro; Bleda, Marta; Salavert, Francisco; Alonso, Roberto; Gonzalez, Cristina Y.; Dopazo, Joaquin

    2012-01-01

    The massive use of Next-Generation Sequencing (NGS) technologies is uncovering an unexpected amount of variability. The functional characterization of such variability, particularly in the most common form of variation found, the Single Nucleotide Variants (SNVs), has become a priority that needs to be addressed in a systematic way. VARIANT (VARIant ANalyis Tool) reports information on the variants found that include consequence type and annotations taken from different databases and repositories (SNPs and variants from dbSNP and 1000 genomes, and disease-related variants from the Genome-Wide Association Study (GWAS) catalog, Online Mendelian Inheritance in Man (OMIM), Catalog of Somatic Mutations in Cancer (COSMIC) mutations, etc). VARIANT also produces a rich variety of annotations that include information on the regulatory (transcription factor or miRNA-binding sites, etc.) or structural roles, or on the selective pressures on the sites affected by the variation. This information allows extending the conventional reports beyond the coding regions and expands the knowledge on the contribution of non-coding or synonymous variants to the phenotype studied. Contrarily to other tools, VARIANT uses a remote database and operates through efficient RESTful Web Services that optimize search and transaction operations. In this way, local problems of installation, update or disk size limitations are overcome without the need of sacrifice speed (thousands of variants are processed per minute). VARIANT is available at: http://variant.bioinfo.cipf.es. PMID:22693211

  20. Correction of the Caulobacter crescentus NA1000 genome annotation.

    PubMed

    Ely, Bert; Scott, LaTia Etheredge

    2014-01-01

    Bacterial genome annotations are accumulating rapidly in the GenBank database and the use of automated annotation technologies to create these annotations has become the norm. However, these automated methods commonly result in a small, but significant percentage of genome annotation errors. To improve accuracy and reliability, we analyzed the Caulobacter crescentus NA1000 genome utilizing computer programs Artemis and MICheck to manually examine the third codon position GC content, alignment to a third codon position GC frame plot peak, and matches in the GenBank database. We identified 11 new genes, modified the start site of 113 genes, and changed the reading frame of 38 genes that had been incorrectly annotated. Furthermore, our manual method of identifying protein-coding genes allowed us to remove 112 non-coding regions that had been designated as coding regions. The improved NA1000 genome annotation resulted in a reduction in the use of rare codons since noncoding regions with atypical codon usage were removed from the annotation and 49 new coding regions were added to the annotation. Thus, a more accurate codon usage table was generated as well. These results demonstrate that a comparison of the location of peaks third codon position GC content to the location of protein coding regions could be used to verify the annotation of any genome that has a GC content that is greater than 60%.

  1. Human Genome Annotation

    NASA Astrophysics Data System (ADS)

    Gerstein, Mark

    A central problem for 21st century science is annotating the human genome and making this annotation useful for the interpretation of personal genomes. My talk will focus on annotating the 99% of the genome that does not code for canonical genes, concentrating on intergenic features such as structural variants (SVs), pseudogenes (protein fossils), binding sites, and novel transcribed RNAs (ncRNAs). In particular, I will describe how we identify regulatory sites and variable blocks (SVs) based on processing next-generation sequencing experiments. I will further explain how we cluster together groups of sites to create larger annotations. Next, I will discuss a comprehensive pseudogene identification pipeline, which has enabled us to identify >10K pseudogenes in the genome and analyze their distribution with respect to age, protein family, and chromosomal location. Throughout, I will try to introduce some of the computational algorithms and approaches that are required for genome annotation. Much of this work has been carried out in the framework of the ENCODE, modENCODE, and 1000 genomes projects.

  2. DAS Writeback: A Collaborative Annotation System

    PubMed Central

    2011-01-01

    Background Centralised resources such as GenBank and UniProt are perfect examples of the major international efforts that have been made to integrate and share biological information. However, additional data that adds value to these resources needs a simple and rapid route to public access. The Distributed Annotation System (DAS) provides an adequate environment to integrate genomic and proteomic information from multiple sources, making this information accessible to the community. DAS offers a way to distribute and access information but it does not provide domain experts with the mechanisms to participate in the curation process of the available biological entities and their annotations. Results We designed and developed a Collaborative Annotation System for proteins called DAS Writeback. DAS writeback is a protocol extension of DAS to provide the functionalities of adding, editing and deleting annotations. We implemented this new specification as extensions of both a DAS server and a DAS client. The architecture was designed with the involvement of the DAS community and it was improved after performing usability experiments emulating a real annotation task. Conclusions We demonstrate that DAS Writeback is effective, usable and will provide the appropriate environment for the creation and evolution of community protein annotation. PMID:21569281

  3. The SeqFEATURE library of 3D functional site models: comparison to existing methods and applications to protein function annotation

    PubMed Central

    Wu, Shirley; Liang, Mike P; Altman, Russ B

    2008-01-01

    Structural genomics efforts have led to increasing numbers of novel, uncharacterized protein structures with low sequence identity to known proteins, resulting in a growing need for structure-based function recognition tools. Our method, SeqFEATURE, robustly models protein functions described by sequence motifs using a structural representation. We built a library of models that shows good performance compared to other methods. In particular, SeqFEATURE demonstrates significant improvement over other methods when sequence and structural similarity are low. PMID:18197987

  4. Accurate dipole polarizabilities for water clusters n=2-12 at the coupled-cluster level of theory and benchmarking of various density functionals.

    SciTech Connect

    Hammond, J.; Govind, N.; Kowalski, K.; Autschbach, J.; Xantheas, S.; PNNL; Univ. of Buffalo

    2009-12-07

    The static dipole polarizabilities of water clusters (2 {le} N {le} 12) are determined at the coupled-cluster level of theory (CCSD). For the dipole polarizability of the water monomer it was determined that the role of the basis set is more important than that of electron correlation and that the basis set augmentation converges with two sets of diffuse functions. The CCSD results are used to benchmark a variety of density functionals while the performance of several families of basis sets (Dunning, Pople, and Sadlej) in producing accurate values for the polarizabilities was also examined. The Sadlej family of basis sets was found to produce accurate results when compared to the ones obtained with the much larger Dunning basis sets. It was furthermore determined that the PBE0 density functional with the aug-cc-pVDZ basis set produces overall remarkably accurate polarizabilities at a moderate computational cost.

  5. Annotation of Ehux ESTs

    SciTech Connect

    Kuo, Alan; Grigoriev, Igor

    2009-06-12

    22 percent ESTs do no align with scaffolds. EST Pipeleine assembles 17126 consensi from the noaligned ESTs. Annotation Pipeline predicts 8564 ORFS on the consensi. Domain analysis of ORFs reveals missing genes. Cluster analysis reveals missing genes. Expression analysis reveals potential strain specific genes.

  6. Annotation: The Savant Syndrome

    ERIC Educational Resources Information Center

    Heaton, Pamela; Wallace, Gregory L.

    2004-01-01

    Background: Whilst interest has focused on the origin and nature of the savant syndrome for over a century, it is only within the past two decades that empirical group studies have been carried out. Methods: The following annotation briefly reviews relevant research and also attempts to address outstanding issues in this research area.…

  7. Intellectuals in China: Annotations.

    ERIC Educational Resources Information Center

    Parker, Franklin

    This annotated bibliography of 72 books, journal articles, government reports, and newspaper feature stories focuses on the changing role of intellectuals in China, primarily since the 1949 Chinese Revolution. Particular attention is given to the Hundred Flowers Movement of 1957 and the Cultural Revolution. Most of the cited works are in English,…

  8. Collaborative Movie Annotation

    NASA Astrophysics Data System (ADS)

    Zad, Damon Daylamani; Agius, Harry

    In this paper, we focus on metadata for self-created movies like those found on YouTube and Google Video, the duration of which are increasing in line with falling upload restrictions. While simple tags may have been sufficient for most purposes for traditionally very short video footage that contains a relatively small amount of semantic content, this is not the case for movies of longer duration which embody more intricate semantics. Creating metadata is a time-consuming process that takes a great deal of individual effort; however, this effort can be greatly reduced by harnessing the power of Web 2.0 communities to create, update and maintain it. Consequently, we consider the annotation of movies within Web 2.0 environments, such that users create and share that metadata collaboratively and propose an architecture for collaborative movie annotation. This architecture arises from the results of an empirical experiment where metadata creation tools, YouTube and an MPEG-7 modelling tool, were used by users to create movie metadata. The next section discusses related work in the areas of collaborative retrieval and tagging. Then, we describe the experiments that were undertaken on a sample of 50 users. Next, the results are presented which provide some insight into how users interact with existing tools and systems for annotating movies. Based on these results, the paper then develops an architecture for collaborative movie annotation.

  9. Annotated Bibliography. First Edition.

    ERIC Educational Resources Information Center

    Haring, Norris G.

    An annotated bibliography which presents approximately 300 references from 1951 to 1973 on the education of severely/profoundly handicapped persons. Citations are grouped alphabetically by author's name within the following categories: characteristics and treatment, gross motor development, sensory and motor development, physical therapy for the…

  10. Ghostwriting: An Annotated Bibliography.

    ERIC Educational Resources Information Center

    Simmons, Donald B.

    Drawn from communication journals, historical and news magazines, business and industrial magazines, political science and world affairs journals, general interest periodicals, and literary and political review magazines, the approximately 90 entries in this annotated bibliography discuss ghostwriting as practiced through the ages and reveal the…

  11. Polymorphism Identification and Improved Genome Annotation of Brassica rapa Through Deep RNA Sequencing

    PubMed Central

    Devisetty, Upendra Kumar; Covington, Michael F.; Tat, An V.; Lekkala, Saradadevi; Maloof, Julin N.

    2014-01-01

    The mapping and functional analysis of quantitative traits in Brassica rapa can be greatly improved with the availability of physically positioned, gene-based genetic markers and accurate genome annotation. In this study, deep transcriptome RNA sequencing (RNA-Seq) of Brassica rapa was undertaken with two objectives: SNP detection and improved transcriptome annotation. We performed SNP detection on two varieties that are parents of a mapping population to aid in development of a marker system for this population and subsequent development of high-resolution genetic map. An improved Brassica rapa transcriptome was constructed to detect novel transcripts and to improve the current genome annotation. This is useful for accurate mRNA abundance and detection of expression QTL (eQTLs) in mapping populations. Deep RNA-Seq of two Brassica rapa genotypes—R500 (var. trilocularis, Yellow Sarson) and IMB211 (a rapid cycling variety)—using eight different tissues (root, internode, leaf, petiole, apical meristem, floral meristem, silique, and seedling) grown across three different environments (growth chamber, greenhouse and field) and under two different treatments (simulated sun and simulated shade) generated 2.3 billion high-quality Illumina reads. A total of 330,995 SNPs were identified in transcribed regions between the two genotypes with an average frequency of one SNP in every 200 bases. The deep RNA-Seq reassembled Brassica rapa transcriptome identified 44,239 protein-coding genes. Compared with current gene models of B. rapa, we detected 3537 novel transcripts, 23,754 gene models had structural modifications, and 3655 annotated proteins changed. Gaps in the current genome assembly of B. rapa are highlighted by our identification of 780 unmapped transcripts. All the SNPs, annotations, and predicted transcripts can be viewed at http://phytonetworks.ucdavis.edu/. PMID:25122667

  12. Polymorphism identification and improved genome annotation of Brassica rapa through Deep RNA sequencing.

    PubMed

    Devisetty, Upendra Kumar; Covington, Michael F; Tat, An V; Lekkala, Saradadevi; Maloof, Julin N

    2014-08-12

    The mapping and functional analysis of quantitative traits in Brassica rapa can be greatly improved with the availability of physically positioned, gene-based genetic markers and accurate genome annotation. In this study, deep transcriptome RNA sequencing (RNA-Seq) of Brassica rapa was undertaken with two objectives: SNP detection and improved transcriptome annotation. We performed SNP detection on two varieties that are parents of a mapping population to aid in development of a marker system for this population and subsequent development of high-resolution genetic map. An improved Brassica rapa transcriptome was constructed to detect novel transcripts and to improve the current genome annotation. This is useful for accurate mRNA abundance and detection of expression QTL (eQTLs) in mapping populations. Deep RNA-Seq of two Brassica rapa genotypes-R500 (var. trilocularis, Yellow Sarson) and IMB211 (a rapid cycling variety)-using eight different tissues (root, internode, leaf, petiole, apical meristem, floral meristem, silique, and seedling) grown across three different environments (growth chamber, greenhouse and field) and under two different treatments (simulated sun and simulated shade) generated 2.3 billion high-quality Illumina reads. A total of 330,995 SNPs were identified in transcribed regions between the two genotypes with an average frequency of one SNP in every 200 bases. The deep RNA-Seq reassembled Brassica rapa transcriptome identified 44,239 protein-coding genes. Compared with current gene models of B. rapa, we detected 3537 novel transcripts, 23,754 gene models had structural modifications, and 3655 annotated proteins changed. Gaps in the current genome assembly of B. rapa are highlighted by our identification of 780 unmapped transcripts. All the SNPs, annotations, and predicted transcripts can be viewed at http://phytonetworks.ucdavis.edu/.

  13. Annotate-it: a Swiss-knife approach to annotation, analysis and interpretation of single nucleotide variation in human disease.

    PubMed

    Sifrim, Alejandro; Van Houdt, Jeroen Kj; Tranchevent, Leon-Charles; Nowakowska, Beata; Sakai, Ryo; Pavlopoulos, Georgios A; Devriendt, Koen; Vermeesch, Joris R; Moreau, Yves; Aerts, Jan

    2012-01-01

    The increasing size and complexity of exome/genome sequencing data requires new tools for clinical geneticists to discover disease-causing variants. Bottlenecks in identifying the causative variation include poor cross-sample querying, constantly changing functional annotation and not considering existing knowledge concerning the phenotype. We describe a methodology that facilitates exploration of patient sequencing data towards identification of causal variants under different genetic hypotheses. Annotate-it facilitates handling, analysis and interpretation of high-throughput single nucleotide variant data. We demonstrate our strategy using three case studies. Annotate-it is freely available and test data are accessible to all users at http://www.annotate-it.org.

  14. Protein Annotators' Assistant: A Novel Application of Information Retrieval Techniques.

    ERIC Educational Resources Information Center

    Wise, Michael J.

    2000-01-01

    Protein Annotators' Assistant (PAA) is a software system which assists protein annotators in assigning functions to newly sequenced proteins. PAA employs a number of information retrieval techniques in a novel setting and is thus related to text categorization, where multiple categories may be suggested, except that in this case none of the…

  15. Maize - GO annotation methods, evaluation, and review (Maize-GAMER)

    Technology Transfer Automated Retrieval System (TEKTRAN)

    Making a genome sequence accessible and useful involves three basic steps: genome assembly, structural annotation, and functional annotation. The quality of data generated at each step influences the accuracy of inferences that can be made, with high-quality analyses produce better datasets resultin...

  16. Ontological Annotation with WordNet

    SciTech Connect

    Sanfilippo, Antonio P.; Tratz, Stephen C.; Gregory, Michelle L.; Chappell, Alan R.; Whitney, Paul D.; Posse, Christian; Paulson, Patrick R.; Baddeley, Bob; Hohimer, Ryan E.; White, Amanda M.

    2006-06-06

    Semantic Web applications require robust and accurate annotation tools that are capable of automating the assignment of ontological classes to words in naturally occurring text (ontological annotation). Most current ontologies do not include rich lexical databases and are therefore not easily integrated with word sense disambiguation algorithms that are needed to automate ontological annotation. WordNet provides a potentially ideal solution to this problem as it offers a highly structured lexical conceptual representation that has been extensively used to develop word sense disambiguation algorithms. However, WordNet has not been designed as an ontology, and while it can be easily turned into one, the result of doing this would present users with serious practical limitations due to the great number of concepts (synonym sets) it contains. Moreover, mapping WordNet to an existing ontology may be difficult and requires substantial labor. We propose to overcome these limitations by developing an analytical platform that (1) provides a WordNet-based ontology offering a manageable and yet comprehensive set of concept classes, (2) leverages the lexical richness of WordNet to give an extensive characterization of concept class in terms of lexical instances, and (3) integrates a class recognition algorithm that automates the assignment of concept classes to words in naturally occurring text. The ensuing framework makes available an ontological annotation platform that can be effectively integrated with intelligence analysis systems to facilitate evidence marshaling and sustain the creation and validation of inference models.

  17. Automating Ontological Annotation with WordNet

    SciTech Connect

    Sanfilippo, Antonio P.; Tratz, Stephen C.; Gregory, Michelle L.; Chappell, Alan R.; Whitney, Paul D.; Posse, Christian; Paulson, Patrick R.; Baddeley, Bob L.; Hohimer, Ryan E.; White, Amanda M.

    2006-01-22

    Semantic Web applications require robust and accurate annotation tools that are capable of automating the assignment of ontological classes to words in naturally occurring text (ontological annotation). Most current ontologies do not include rich lexical databases and are therefore not easily integrated with word sense disambiguation algorithms that are needed to automate ontological annotation. WordNet provides a potentially ideal solution to this problem as it offers a highly structured lexical conceptual representation that has been extensively used to develop word sense disambiguation algorithms. However, WordNet has not been designed as an ontology, and while it can be easily turned into one, the result of doing this would present users with serious practical limitations due to the great number of concepts (synonym sets) it contains. Moreover, mapping WordNet to an existing ontology may be difficult and requires substantial labor. We propose to overcome these limitations by developing an analytical platform that (1) provides a WordNet-based ontology offering a manageable and yet comprehensive set of concept classes, (2) leverages the lexical richness of WordNet to give an extensive characterization of concept class in terms of lexical instances, and (3) integrates a class recognition algorithm that automates the assignment of concept classes to words in naturally occurring text. The ensuing framework makes available an ontological annotation platform that can be effectively integrated with intelligence analysis systems to facilitate evidence marshaling and sustain the creation and validation of inference models.

  18. Genome Wide Re-Annotation of Caldicellulosiruptor saccharolyticus with New Insights into Genes Involved in Biomass Degradation and Hydrogen Production

    PubMed Central

    Chowdhary, Nupoor; Selvaraj, Ashok; KrishnaKumaar, Lakshmi; Kumar, Gopal Ramesh

    2015-01-01

    Caldicellulosiruptor saccharolyticus has proven itself to be an excellent candidate for biological hydrogen (H2) production, but still it has major drawbacks like sensitivity to high osmotic pressure and low volumetric H2 productivity, which should be considered before it can be used industrially. A whole genome re-annotation work has been carried out as an attempt to update the incomplete genome information that causes gap in the knowledge especially in the area of metabolic engineering, to improve the H2 producing capabilities of C. saccharolyticus. Whole genome re-annotation was performed through manual means for 2,682 Coding Sequences (CDSs). Bioinformatics tools based on sequence similarity, motif search, phylogenetic analysis and fold recognition were employed for re-annotation. Our methodology could successfully add functions for 409 hypothetical proteins (HPs), 46 proteins previously annotated as putative and assigned more accurate functions for the known protein sequences. Homology based gene annotation has been used as a standard method for assigning function to novel proteins, but over the past few years many non-homology based methods such as genomic context approaches for protein function prediction have been developed. Using non-homology based functional prediction methods, we were able to assign cellular processes or physical complexes for 249 hypothetical sequences. Our re-annotation pipeline highlights the addition of 231 new CDSs generated from MicroScope Platform, to the original genome with functional prediction for 49 of them. The re-annotation of HPs and new CDSs is stored in the relational database that is available on the MicroScope web-based platform. In parallel, a comparative genome analyses were performed among the members of genus Caldicellulosiruptor to understand the function and evolutionary processes. Further, with results from integrated re-annotation studies (homology and genomic context approach), we strongly suggest that Csac

  19. Genome Wide Re-Annotation of Caldicellulosiruptor saccharolyticus with New Insights into Genes Involved in Biomass Degradation and Hydrogen Production.

    PubMed

    Chowdhary, Nupoor; Selvaraj, Ashok; KrishnaKumaar, Lakshmi; Kumar, Gopal Ramesh

    2015-01-01

    Caldicellulosiruptor saccharolyticus has proven itself to be an excellent candidate for biological hydrogen (H2) production, but still it has major drawbacks like sensitivity to high osmotic pressure and low volumetric H2 productivity, which should be considered before it can be used industrially. A whole genome re-annotation work has been carried out as an attempt to update the incomplete genome information that causes gap in the knowledge especially in the area of metabolic engineering, to improve the H2 producing capabilities of C. saccharolyticus. Whole genome re-annotation was performed through manual means for 2,682 Coding Sequences (CDSs). Bioinformatics tools based on sequence similarity, motif search, phylogenetic analysis and fold recognition were employed for re-annotation. Our methodology could successfully add functions for 409 hypothetical proteins (HPs), 46 proteins previously annotated as putative and assigned more accurate functions for the known protein sequences. Homology based gene annotation has been used as a standard method for assigning function to novel proteins, but over the past few years many non-homology based methods such as genomic context approaches for protein function prediction have been developed. Using non-homology based functional prediction methods, we were able to assign cellular processes or physical complexes for 249 hypothetical sequences. Our re-annotation pipeline highlights the addition of 231 new CDSs generated from MicroScope Platform, to the original genome with functional prediction for 49 of them. The re-annotation of HPs and new CDSs is stored in the relational database that is available on the MicroScope web-based platform. In parallel, a comparative genome analyses were performed among the members of genus Caldicellulosiruptor to understand the function and evolutionary processes. Further, with results from integrated re-annotation studies (homology and genomic context approach), we strongly suggest that Csac

  20. Semantator: semantic annotator for converting biomedical text to linked data.

    PubMed

    Tao, Cui; Song, Dezhao; Sharma, Deepak; Chute, Christopher G

    2013-10-01

    More than 80% of biomedical data is embedded in plain text. The unstructured nature of these text-based documents makes it challenging to easily browse and query the data of interest in them. One approach to facilitate browsing and querying biomedical text is to convert the plain text to a linked web of data, i.e., converting data originally in free text to structured formats with defined meta-level semantics. In this paper, we introduce Semantator (Semantic Annotator), a semantic-web-based environment for annotating data of interest in biomedical documents, browsing and querying the annotated data, and interactively refining annotation results if needed. Through Semantator, information of interest can be either annotated manually or semi-automatically using plug-in information extraction tools. The annotated results will be stored in RDF and can be queried using the SPARQL query language. In addition, semantic reasoners can be directly applied to the annotated data for consistency checking and knowledge inference. Semantator has been released online and was used by the biomedical ontology community who provided positive feedbacks. Our evaluation results indicated that (1) Semantator can perform the annotation functionalities as designed; (2) Semantator can be adopted in real applications in clinical and transactional research; and (3) the annotated results using Semantator can be easily used in Semantic-web-based reasoning tools for further inference.

  1. A Factor Graph Approach to Automated GO Annotation.

    PubMed

    Spetale, Flavio E; Tapia, Elizabeth; Krsticevic, Flavia; Roda, Fernando; Bulacio, Pilar

    2016-01-01

    As volume of genomic data grows, computational methods become essential for providing a first glimpse onto gene annotations. Automated Gene Ontology (GO) annotation methods based on hierarchical ensemble classification techniques are particularly interesting when interpretability of annotation results is a main concern. In these methods, raw GO-term predictions computed by base binary classifiers are leveraged by checking the consistency of predefined GO relationships. Both formal leveraging strategies, with main focus on annotation precision, and heuristic alternatives, with main focus on scalability issues, have been described in literature. In this contribution, a factor graph approach to the hierarchical ensemble formulation of the automated GO annotation problem is presented. In this formal framework, a core factor graph is first built based on the GO structure and then enriched to take into account the noisy nature of GO-term predictions. Hence, starting from raw GO-term predictions, an iterative message passing algorithm between nodes of the factor graph is used to compute marginal probabilities of target GO-terms. Evaluations on Saccharomyces cerevisiae, Arabidopsis thaliana and Drosophila melanogaster protein sequences from the GO Molecular Function domain showed significant improvements over competing approaches, even when protein sequences were naively characterized by their physicochemical and secondary structure properties or when loose noisy annotation datasets were considered. Based on these promising results and using Arabidopsis thaliana annotation data, we extend our approach to the identification of most promising molecular function annotations for a set of proteins of unknown function in Solanum lycopersicum. PMID:26771463

  2. A Factor Graph Approach to Automated GO Annotation

    PubMed Central

    Spetale, Flavio E.; Tapia, Elizabeth; Krsticevic, Flavia; Roda, Fernando; Bulacio, Pilar

    2016-01-01

    As volume of genomic data grows, computational methods become essential for providing a first glimpse onto gene annotations. Automated Gene Ontology (GO) annotation methods based on hierarchical ensemble classification techniques are particularly interesting when interpretability of annotation results is a main concern. In these methods, raw GO-term predictions computed by base binary classifiers are leveraged by checking the consistency of predefined GO relationships. Both formal leveraging strategies, with main focus on annotation precision, and heuristic alternatives, with main focus on scalability issues, have been described in literature. In this contribution, a factor graph approach to the hierarchical ensemble formulation of the automated GO annotation problem is presented. In this formal framework, a core factor graph is first built based on the GO structure and then enriched to take into account the noisy nature of GO-term predictions. Hence, starting from raw GO-term predictions, an iterative message passing algorithm between nodes of the factor graph is used to compute marginal probabilities of target GO-terms. Evaluations on Saccharomyces cerevisiae, Arabidopsis thaliana and Drosophila melanogaster protein sequences from the GO Molecular Function domain showed significant improvements over competing approaches, even when protein sequences were naively characterized by their physicochemical and secondary structure properties or when loose noisy annotation datasets were considered. Based on these promising results and using Arabidopsis thaliana annotation data, we extend our approach to the identification of most promising molecular function annotations for a set of proteins of unknown function in Solanum lycopersicum. PMID:26771463

  3. A Factor Graph Approach to Automated GO Annotation.

    PubMed

    Spetale, Flavio E; Tapia, Elizabeth; Krsticevic, Flavia; Roda, Fernando; Bulacio, Pilar

    2016-01-01

    As volume of genomic data grows, computational methods become essential for providing a first glimpse onto gene annotations. Automated Gene Ontology (GO) annotation methods based on hierarchical ensemble classification techniques are particularly interesting when interpretability of annotation results is a main concern. In these methods, raw GO-term predictions computed by base binary classifiers are leveraged by checking the consistency of predefined GO relationships. Both formal leveraging strategies, with main focus on annotation precision, and heuristic alternatives, with main focus on scalability issues, have been described in literature. In this contribution, a factor graph approach to the hierarchical ensemble formulation of the automated GO annotation problem is presented. In this formal framework, a core factor graph is first built based on the GO structure and then enriched to take into account the noisy nature of GO-term predictions. Hence, starting from raw GO-term predictions, an iterative message passing algorithm between nodes of the factor graph is used to compute marginal probabilities of target GO-terms. Evaluations on Saccharomyces cerevisiae, Arabidopsis thaliana and Drosophila melanogaster protein sequences from the GO Molecular Function domain showed significant improvements over competing approaches, even when protein sequences were naively characterized by their physicochemical and secondary structure properties or when loose noisy annotation datasets were considered. Based on these promising results and using Arabidopsis thaliana annotation data, we extend our approach to the identification of most promising molecular function annotations for a set of proteins of unknown function in Solanum lycopersicum.

  4. Enzyme reaction annotation using cloud techniques.

    PubMed

    Huang, Chuan-Ching; Lin, Chun-Yuan; Chang, Cheng-Wen; Tang, Chuan Yi

    2013-01-01

    An understanding of the activities of enzymes could help to elucidate the metabolic pathways of thousands of chemical reactions that are catalyzed by enzymes in living systems. Sophisticated applications such as drug design and metabolic reconstruction could be developed using accurate enzyme reaction annotation. Because accurate enzyme reaction annotation methods create potential for enhanced production capacity in these applications, they have received greater attention in the global market. We propose the enzyme reaction prediction (ERP) method as a novel tool to deduce enzyme reactions from domain architecture. We used several frequency relationships between architectures and reactions to enhance the annotation rates for single and multiple catalyzed reactions. The deluge of information which arose from high-throughput techniques in the postgenomic era has improved our understanding of biological data, although it presents obstacles in the data-processing stage. The high computational capacity provided by cloud computing has resulted in an exponential growth in the volume of incoming data. Cloud services also relieve the requirement for large-scale memory space required by this approach to analyze enzyme kinetic data. Our tool is designed as a single execution file; thus, it could be applied to any cloud platform in which multiple queries are supported.

  5. XomAnnotate: Analysis of Heterogeneous and Complex Exome- A Step towards Translational Medicine

    PubMed Central

    Talukder, Asoke K.; Ravishankar, Shashidhar; Sasmal, Krittika; Gandham, Santhosh; Prabhukumar, Jyothsna; Achutharao, Prahalad H.; Barh, Debmalya; Blasi, Francesco

    2015-01-01

    In translational cancer medicine, implicated pathways and the relevant master genes are of focus. Exome's specificity, processing-time, and cost advantage makes it a compelling tool for this purpose. However, analysis of exome lacks reliable combinatory analysis tools and techniques. In this paper we present XomAnnotate – a meta- and functional-analysis software for exome. We compared UnifiedGenotyper, Freebayes, Delly, and Lumpy algorithms that were designed for whole-genome and combined their strengths in XomAnnotate for exome data through meta-analysis to identify comprehensive mutation profile (SNPs/SNVs, short inserts/deletes, and SVs) of patients. The mutation profile is annotated followed by functional analysis through pathway enrichment and network analysis to identify most critical genes and pathways implicated in the disease genesis. The efficacy of the software is verified through MDS and clustering and tested with available 11 familial non-BRCA1/BRCA2 breast cancer exome data. The results showed that the most significantly affected pathways across all samples are cell communication and antigen processing and presentation. ESCO1, HYAL1, RAF1 and PRKCA emerged as the key genes. Network analysis further showed the purine and propanotate metabolism pathways along with RAF1 and PRKCA genes to be master regulators in these patients. Therefore, XomAnnotate is able to use exome data to identify entire mutation landscape, pathways, and the master genes accurately with wide concordance from earlier microarray and whole-genome studies -- making it a suitable biomedical software for using exome in next-generation translational medicine. Availability http://www.iomics.in/research/XomAnnotate PMID:25905921

  6. Comparison of GENCODE and RefSeq gene annotation and the impact of reference geneset on variant effect prediction

    PubMed Central

    2015-01-01

    Background A vast amount of DNA variation is being identified by increasingly large-scale exome and genome sequencing projects. To be useful, variants require accurate functional annotation and a wide range of tools are available to this end. McCarthy et al recently demonstrated the large differences in prediction of loss-of-function (LoF) variation when RefSeq and Ensembl transcripts are used for annotation, highlighting the importance of the reference transcripts on which variant functional annotation is based. Results We describe a detailed analysis of the similarities and differences between the gene and transcript annotation in the GENCODE and RefSeq genesets. We demonstrate that the GENCODE Comprehensive set is richer in alternative splicing, novel CDSs, novel exons and has higher genomic coverage than RefSeq, while the GENCODE Basic set is very similar to RefSeq. Using RNAseq data we show that exons and introns unique to one geneset are expressed at a similar level to those common to both. We present evidence that the differences in gene annotation lead to large differences in variant annotation where GENCODE and RefSeq are used as reference transcripts, although this is predominantly confined to non-coding transcripts and UTR sequence, with at most ~30% of LoF variants annotated discordantly. We also describe an investigation of dominant transcript expression, showing that it both supports the utility of the GENCODE Basic set in providing a smaller set of more highly expressed transcripts and provides a useful, biologically-relevant filter for further reducing the complexity of the transcriptome. Conclusions The reference transcripts selected for variant functional annotation do have a large effect on the outcome. The GENCODE Comprehensive transcripts contain more exons, have greater genomic coverage and capture many more variants than RefSeq in both genome and exome datasets, while the GENCODE Basic set shows a higher degree of concordance with RefSeq and

  7. Evaluating techniques for metagenome annotation using simulated sequence data.

    PubMed

    Randle-Boggis, Richard J; Helgason, Thorunn; Sapp, Melanie; Ashton, Peter D

    2016-07-01

    The advent of next-generation sequencing has allowed huge amounts of DNA sequence data to be produced, advancing the capabilities of microbial ecosystem studies. The current challenge is to identify from which microorganisms and genes the DNA originated. Several tools and databases are available for annotating DNA sequences. The tools, databases and parameters used can have a significant impact on the results: naïve choice of these factors can result in a false representation of community composition and function. We use a simulated metagenome to show how different parameters affect annotation accuracy by evaluating the sequence annotation performances of MEGAN, MG-RAST, One Codex and Megablast. This simulated metagenome allowed the recovery of known organism and function abundances to be quantitatively evaluated, which is not possible for environmental metagenomes. The performance of each program and database varied, e.g. One Codex correctly annotated many sequences at the genus level, whereas MG-RAST RefSeq produced many false positive annotations. This effect decreased as the taxonomic level investigated increased. Selecting more stringent parameters decreases the annotation sensitivity, but increases precision. Ultimately, there is a trade-off between taxonomic resolution and annotation accuracy. These results should be considered when annotating metagenomes and interpreting results from previous studies. PMID:27162180

  8. Evaluating techniques for metagenome annotation using simulated sequence data

    PubMed Central

    Randle-Boggis, Richard J.; Helgason, Thorunn; Sapp, Melanie; Ashton, Peter D.

    2016-01-01

    The advent of next-generation sequencing has allowed huge amounts of DNA sequence data to be produced, advancing the capabilities of microbial ecosystem studies. The current challenge is to identify from which microorganisms and genes the DNA originated. Several tools and databases are available for annotating DNA sequences. The tools, databases and parameters used can have a significant impact on the results: naïve choice of these factors can result in a false representation of community composition and function. We use a simulated metagenome to show how different parameters affect annotation accuracy by evaluating the sequence annotation performances of MEGAN, MG-RAST, One Codex and Megablast. This simulated metagenome allowed the recovery of known organism and function abundances to be quantitatively evaluated, which is not possible for environmental metagenomes. The performance of each program and database varied, e.g. One Codex correctly annotated many sequences at the genus level, whereas MG-RAST RefSeq produced many false positive annotations. This effect decreased as the taxonomic level investigated increased. Selecting more stringent parameters decreases the annotation sensitivity, but increases precision. Ultimately, there is a trade-off between taxonomic resolution and annotation accuracy. These results should be considered when annotating metagenomes and interpreting results from previous studies. PMID:27162180

  9. All SNPs are not created equal: genome-wide association studies reveal a consistent pattern of enrichment among functionally annotated SNPs.

    PubMed

    Schork, Andrew J; Thompson, Wesley K; Pham, Phillip; Torkamani, Ali; Roddey, J Cooper; Sullivan, Patrick F; Kelsoe, John R; O'Donovan, Michael C; Furberg, Helena; Schork, Nicholas J; Andreassen, Ole A; Dale, Anders M

    2013-04-01

    Recent results indicate that genome-wide association studies (GWAS) have the potential to explain much of the heritability of common complex phenotypes, but methods are lacking to reliably identify the remaining associated single nucleotide polymorphisms (SNPs). We applied stratified False Discovery Rate (sFDR) methods to leverage genic enrichment in GWAS summary statistics data to uncover new loci likely to replicate in independent samples. Specifically, we use linkage disequilibrium-weighted annotations for each SNP in combination with nominal p-values to estimate the True Discovery Rate (TDR = 1-FDR) for strata determined by different genic categories. We show a consistent pattern of enrichment of polygenic effects in specific annotation categories across diverse phenotypes, with the greatest enrichment for SNPs tagging regulatory and coding genic elements, little enrichment in introns, and negative enrichment for intergenic SNPs. Stratified enrichment directly leads to increased TDR for a given p-value, mirrored by increased replication rates in independent samples. We show this in independent Crohn's disease GWAS, where we find a hundredfold variation in replication rate across genic categories. Applying a well-established sFDR methodology we demonstrate the utility of stratification for improving power of GWAS in complex phenotypes, with increased rejection rates from 20% in height to 300% in schizophrenia with traditional FDR and sFDR both fixed at 0.05. Our analyses demonstrate an inherent stratification among GWAS SNPs with important conceptual implications that can be leveraged by statistical methods to improve the discovery of loci.

  10. All SNPs are not created equal: genome-wide association studies reveal a consistent pattern of enrichment among functionally annotated SNPs.

    PubMed

    Schork, Andrew J; Thompson, Wesley K; Pham, Phillip; Torkamani, Ali; Roddey, J Cooper; Sullivan, Patrick F; Kelsoe, John R; O'Donovan, Michael C; Furberg, Helena; Schork, Nicholas J; Andreassen, Ole A; Dale, Anders M

    2013-04-01

    Recent results indicate that genome-wide association studies (GWAS) have the potential to explain much of the heritability of common complex phenotypes, but methods are lacking to reliably identify the remaining associated single nucleotide polymorphisms (SNPs). We applied stratified False Discovery Rate (sFDR) methods to leverage genic enrichment in GWAS summary statistics data to uncover new loci likely to replicate in independent samples. Specifically, we use linkage disequilibrium-weighted annotations for each SNP in combination with nominal p-values to estimate the True Discovery Rate (TDR = 1-FDR) for strata determined by different genic categories. We show a consistent pattern of enrichment of polygenic effects in specific annotation categories across diverse phenotypes, with the greatest enrichment for SNPs tagging regulatory and coding genic elements, little enrichment in introns, and negative enrichment for intergenic SNPs. Stratified enrichment directly leads to increased TDR for a given p-value, mirrored by increased replication rates in independent samples. We show this in independent Crohn's disease GWAS, where we find a hundredfold variation in replication rate across genic categories. Applying a well-established sFDR methodology we demonstrate the utility of stratification for improving power of GWAS in complex phenotypes, with increased rejection rates from 20% in height to 300% in schizophrenia with traditional FDR and sFDR both fixed at 0.05. Our analyses demonstrate an inherent stratification among GWAS SNPs with important conceptual implications that can be leveraged by statistical methods to improve the discovery of loci. PMID:23637621

  11. LETTER TO THE EDITOR: Accurate Hylleraas-like functions for the He atom with correct cusp conditions

    NASA Astrophysics Data System (ADS)

    Rodriguez, K. V.; Gasaneo, G.

    2005-08-01

    In this letter, a set of ground state wavefunctions for the He atom is given. The functions are constructed in terms of exponential and power series as similar as possible to the Hylleraas functions of Chandrasekhar and Herzberg (1955 Phys. Rev. 98 1050). The accuracy of the calculated energies is found to be about 10-4 au and all the cusp conditions at the Coulomb singularities are satisfied. The nine-parameter functions proposed here are found to have better local energy than those given by the 6 and 14 terms Hylleraas functions of Chandrasekhar. The mean value of various functions evaluated with the different proposals shows their good quality. These properties highly qualify the function to be used as an alternative to the Chandrasekhar functions in collisional problems. The whole set of functions given here can be considered as an alternative to the proposals of Chandrasekhar (1955 Phys. Rev. 98 1050), Bonham and Kohl (1966 J. Chem. Phys. 45 2471) and Le Sech (1997 J. Phys. B: At. Mol. Opt. Phys. 30 L47).

  12. An ontology-based annotation of cardiac implantable electronic devices to detect therapy changes in a national registry.

    PubMed

    Rosier, Arnaud; Mabo, Philippe; Chauvin, Michel; Burgun, Anita

    2015-05-01

    The patient population benefitting from cardiac implantable electronic devices (CIEDs) is increasing. This study introduces a device annotation method that supports the consistent description of the functional attributes of cardiac devices and evaluates how this method can detect device changes from a CIED registry. We designed the Cardiac Device Ontology, an ontology of CIEDs and device functions. We annotated 146 cardiac devices with this ontology and used it to detect therapy changes with respect to atrioventricular pacing, cardiac resynchronization therapy, and defibrillation capability in a French national registry of patients with implants (STIDEFIX). We then analyzed a set of 6905 device replacements from the STIDEFIX registry. Ontology-based identification of therapy changes (upgraded, downgraded, or similar) was accurate (6905 cases) and performed better than straightforward analysis of the registry codes (F-measure 1.00 versus 0.75 to 0.97). This study demonstrates the feasibility and effectiveness of ontology-based functional annotation of devices in the cardiac domain. Such annotation allowed a better description and in-depth analysis of STIDEFIX. This method was useful for the automatic detection of therapy changes and may be reused for analyzing data from other device registries.

  13. The Ensembl gene annotation system.

    PubMed

    Aken, Bronwen L; Ayling, Sarah; Barrell, Daniel; Clarke, Laura; Curwen, Valery; Fairley, Susan; Fernandez Banet, Julio; Billis, Konstantinos; García Girón, Carlos; Hourlier, Thibaut; Howe, Kevin; Kähäri, Andreas; Kokocinski, Felix; Martin, Fergal J; Murphy, Daniel N; Nag, Rishi; Ruffier, Magali; Schuster, Michael; Tang, Y Amy; Vogel, Jan-Hinnerk; White, Simon; Zadissa, Amonida; Flicek, Paul; Searle, Stephen M J

    2016-01-01

    The Ensembl gene annotation system has been used to annotate over 70 different vertebrate species across a wide range of genome projects. Furthermore, it generates the automatic alignment-based annotation for the human and mouse GENCODE gene sets. The system is based on the alignment of biological sequences, including cDNAs, proteins and RNA-seq reads, to the target genome in order to construct candidate transcript models. Careful assessment and filtering of these candidate transcripts ultimately leads to the final gene set, which is made available on the Ensembl website. Here, we describe the annotation process in detail.Database URL: http://www.ensembl.org/index.html. PMID:27337980

  14. The Ensembl gene annotation system

    PubMed Central

    Aken, Bronwen L.; Ayling, Sarah; Barrell, Daniel; Clarke, Laura; Curwen, Valery; Fairley, Susan; Fernandez Banet, Julio; Billis, Konstantinos; García Girón, Carlos; Hourlier, Thibaut; Howe, Kevin; Kähäri, Andreas; Kokocinski, Felix; Martin, Fergal J.; Murphy, Daniel N.; Nag, Rishi; Ruffier, Magali; Schuster, Michael; Tang, Y. Amy; Vogel, Jan-Hinnerk; White, Simon; Zadissa, Amonida; Flicek, Paul

    2016-01-01

    The Ensembl gene annotation system has been used to annotate over 70 different vertebrate species across a wide range of genome projects. Furthermore, it generates the automatic alignment-based annotation for the human and mouse GENCODE gene sets. The system is based on the alignment of biological sequences, including cDNAs, proteins and RNA-seq reads, to the target genome in order to construct candidate transcript models. Careful assessment and filtering of these candidate transcripts ultimately leads to the final gene set, which is made available on the Ensembl website. Here, we describe the annotation process in detail. Database URL: http://www.ensembl.org/index.html PMID:27337980

  15. Toward Accurate Reaction Energetics for Molecular Line Growth at Surface: Quantum Monte Carlo and Density Functional Theory Calculations

    SciTech Connect

    Kanai, Y; Takeuchi, N

    2009-10-14

    We revisit the molecular line growth mechanism of styrene on the hydrogenated Si(001) 2x1 surface. In particular, we investigate the energetics of the radical chain reaction mechanism by means of diffusion quantum Monte Carlo (QMC) and density functional theory (DFT) calculations. For the exchange correlation (XC) functional we use the non-empirical generalized-gradient approximation (GGA) and meta-GGA. We find that the QMC result also predicts the intra dimer-row growth of the molecular line over the inter dimer-row growth, supporting the conclusion based on DFT results. However, the absolute magnitudes of the adsorption and reaction energies, and the heights of the energy barriers differ considerably between the QMC and DFT with the GGA/meta-GGA XC functionals.

  16. On the use of spring baseflow recession for a more accurate parameterization of aquifer transit time distribution functions

    NASA Astrophysics Data System (ADS)

    Farlin, J.; Maloszewski, P.

    2013-05-01

    Baseflow recession analysis and groundwater dating have up to now developed as two distinct branches of hydrogeology and have been used to solve entirely different problems. We show that by combining two classical models, namely the Boussinesq equation describing spring baseflow recession, and the exponential piston-flow model used in groundwater dating studies, the parameters describing the transit time distribution of an aquifer can be in some cases estimated to a far more accurate degree than with the latter alone. Under the assumption that the aquifer basis is sub-horizontal, the mean transit time of water in the saturated zone can be estimated from spring baseflow recession. This provides an independent estimate of groundwater transit time that can refine those obtained from tritium measurements. The approach is illustrated in a case study predicting atrazine concentration trend in a series of springs draining the fractured-rock aquifer known as the Luxembourg Sandstone. A transport model calibrated on tritium measurements alone predicted different times to trend reversal following the nationwide ban on atrazine in 2005 with different rates of decrease. For some of the springs, the actual time of trend reversal and the rate of change agreed extremely well with the model calibrated using both tritium measurements and the recession of spring discharge during the dry season. The agreement between predicted and observed values was however poorer for the springs displaying the most gentle recessions, possibly indicating a stronger influence of continuous groundwater recharge during the summer months.

  17. On the use of spring baseflow recession for a more accurate parameterization of aquifer transit time distribution functions

    NASA Astrophysics Data System (ADS)

    Farlin, J.; Maloszewski, P.

    2012-12-01

    Baseflow recession analysis and groundwater dating have up to now developed as two distinct branches of hydrogeology and were used to solve entirely different problems. We show that by combining two classical models, namely Boussinesq's Equation describing spring baseflow recession and the exponential piston-flow model used in groundwater dating studies, the parameters describing the transit time distribution of an aquifer can be in some cases estimated to a far more accurate degree than with the latter alone. Under the assumption that the aquifer basis is sub-horizontal, the mean residence time of water in the saturated zone can be estimated from spring baseflow recession. This provides an independent estimate of groundwater residence time that can refine those obtained from tritium measurements. This approach is demonstrated in a case study predicting atrazine concentration trend in a series of springs draining the fractured-rock aquifer known as the Luxembourg Sandstone. A transport model calibrated on tritium measurements alone predicted different times to trend reversal following the nationwide ban on atrazine in 2005 with different rates of decrease. For some of the springs, the best agreement between observed and predicted time of trend reversal was reached for the model calibrated using both tritium measurements and the recession of spring discharge during the dry season. The agreement between predicted and observed values was however poorer for the springs displaying the most gentle recessions, possibly indicating the stronger influence of continuous groundwater recharge during the dry period.

  18. Functional connectivity and structural covariance between regions of interest can be measured more accurately using multivariate distance correlation.

    PubMed

    Geerligs, Linda; Cam-Can; Henson, Richard N

    2016-07-15

    Studies of brain-wide functional connectivity or structural covariance typically use measures like the Pearson correlation coefficient, applied to data that have been averaged across voxels within regions of interest (ROIs). However, averaging across voxels may result in biased connectivity estimates when there is inhomogeneity within those ROIs, e.g., sub-regions that exhibit different patterns of functional connectivity or structural covariance. Here, we propose a new measure based on "distance correlation"; a test of multivariate dependence of high dimensional vectors, which allows for both linear and non-linear dependencies. We used simulations to show how distance correlation out-performs Pearson correlation in the face of inhomogeneous ROIs. To evaluate this new measure on real data, we use resting-state fMRI scans and T1 structural scans from 2 sessions on each of 214 participants from the Cambridge Centre for Ageing & Neuroscience (Cam-CAN) project. Pearson correlation and distance correlation showed similar average connectivity patterns, for both functional connectivity and structural covariance. Nevertheless, distance correlation was shown to be 1) more reliable across sessions, 2) more similar across participants, and 3) more robust to different sets of ROIs. Moreover, we found that the similarity between functional connectivity and structural covariance estimates was higher for distance correlation compared to Pearson correlation. We also explored the relative effects of different preprocessing options and motion artefacts on functional connectivity. Because distance correlation is easy to implement and fast to compute, it is a promising alternative to Pearson correlations for investigating ROI-based brain-wide connectivity patterns, for functional as well as structural data.

  19. The GATO gene annotation tool for research laboratories.

    PubMed

    Fujita, A; Massirer, K B; Durham, A M; Ferreira, C E; Sogayar, M C

    2005-11-01

    Large-scale genome projects have generated a rapidly increasing number of DNA sequences. Therefore, development of computational methods to rapidly analyze these sequences is essential for progress in genomic research. Here we present an automatic annotation system for preliminary analysis of DNA sequences. The gene annotation tool (GATO) is a Bioinformatics pipeline designed to facilitate routine functional annotation and easy access to annotated genes. It was designed in view of the frequent need of genomic researchers to access data pertaining to a common set of genes. In the GATO system, annotation is generated by querying some of the Web-accessible resources and the information is stored in a local database, which keeps a record of all previous annotation results. GATO may be accessed from everywhere through the internet or may be run locally if a large number of sequences are going to be annotated. It is implemented in PHP and Perl and may be run on any suitable Web server. Usually, installation and application of annotation systems require experience and are time consuming, but GATO is simple and practical, allowing anyone with basic skills in informatics to access it without any special training. GATO can be downloaded at [http://mariwork.iq.usp.br/gato/]. Minimum computer free space required is 2 MB. PMID:16258624

  20. All SNPs Are Not Created Equal: Genome-Wide Association Studies Reveal a Consistent Pattern of Enrichment among Functionally Annotated SNPs

    PubMed Central

    Schork, Andrew J.; Thompson, Wesley K.; Pham, Phillip; Torkamani, Ali; Roddey, J. Cooper; Sullivan, Patrick F.; Kelsoe, John R.; O'Donovan, Michael C.; Furberg, Helena; Schork, Nicholas J.; Andreassen, Ole A.; Dale, Anders M.

    2013-01-01

    Recent results indicate that genome-wide association studies (GWAS) have the potential to explain much of the heritability of common complex phenotypes, but methods are lacking to reliably identify the remaining associated single nucleotide polymorphisms (SNPs). We applied stratified False Discovery Rate (sFDR) methods to leverage genic enrichment in GWAS summary statistics data to uncover new loci likely to replicate in independent samples. Specifically, we use linkage disequilibrium-weighted annotations for each SNP in combination with nominal p-values to estimate the True Discovery Rate (TDR = 1−FDR) for strata determined by different genic categories. We show a consistent pattern of enrichment of polygenic effects in specific annotation categories across diverse phenotypes, with the greatest enrichment for SNPs tagging regulatory and coding genic elements, little enrichment in introns, and negative enrichment for intergenic SNPs. Stratified enrichment directly leads to increased TDR for a given p-value, mirrored by increased replication rates in independent samples. We show this in independent Crohn's disease GWAS, where we find a hundredfold variation in replication rate across genic categories. Applying a well-established sFDR methodology we demonstrate the utility of stratification for improving power of GWAS in complex phenotypes, with increased rejection rates from 20% in height to 300% in schizophrenia with traditional FDR and sFDR both fixed at 0.05. Our analyses demonstrate an inherent stratification among GWAS SNPs with important conceptual implications that can be leveraged by statistical methods to improve the discovery of loci. PMID:23637621

  1. Three dimensional neuronal cell cultures more accurately model voltage gated calcium channel functionality in freshly dissected nerve tissue.

    PubMed

    Lai, Yinzhi; Cheng, Ke; Kisaalita, William

    2012-01-01

    It has been demonstrated that neuronal cells cultured on traditional flat surfaces may exhibit exaggerated voltage gated calcium channel (VGCC) functionality. To gain a better understanding of this phenomenon, primary neuronal cells harvested from mice superior cervical ganglion (SCG) were cultured on two dimensional (2D) flat surfaces and in three dimensional (3D) synthetic poly-L-lactic acid (PLLA) and polystyrene (PS) polymer scaffolds. These 2D- and 3D-cultured cells were compared to cells in freshly dissected SCG tissues, with respect to intracellular calcium increase in response to high K(+) depolarization. The calcium increases were identical for 3D-cultured and freshly dissected, but significantly higher for 2D-cultured cells. This finding established the physiological relevance of 3D-cultured cells. To shed light on the mechanism behind the exaggerated 2D-cultured cells' functionality, transcriptase expression and related membrane protein distributions (caveolin-1) were obtained. Our results support the view that exaggerated VGCC functionality from 2D cultured SCG cells is possibly due to differences in membrane architecture, characterized by uniquely organized caveolar lipid rafts. The practical implication of use of 3D-cultured cells in preclinical drug discovery studies is that such platforms would be more effective in eliminating false positive hits and as such improve the overall yield from screening campaigns.

  2. RASTtk: A modular and extensible implementation of the RAST algorithm for building custom annotation pipelines and annotating batches of genomes

    DOE PAGES

    Brettin, Thomas; Davis, James J.; Disz, Terry; Edwards, Robert A.; Gerdes, Svetlana; Olsen, Gary J.; Olson, Robert; Overbeek, Ross; Parrello, Bruce; Pusch, Gordon D.; et al

    2015-02-10

    The RAST (Rapid Annotation using Subsystem Technology) annotation engine was built in 2008 to annotate bacterial and archaeal genomes. It works by offering a standard software pipeline for identifying genomic features (i.e., protein-encoding genes and RNA) and annotating their functions. Recently, in order to make RAST a more useful research tool and to keep pace with advancements in bioinformatics, it has become desirable to build a version of RAST that is both customizable and extensible. In this paper, we describe the RAST tool kit (RASTtk), a modular version of RAST that enables researchers to build custom annotation pipelines. RASTtk offersmore » a choice of software for identifying and annotating genomic features as well as the ability to add custom features to an annotation job. RASTtk also accommodates the batch submission of genomes and the ability to customize annotation protocols for batch submissions. This is the first major software restructuring of RAST since its inception.« less

  3. Re-annotation of the genome sequence of Helicobacter pylori 26695.

    PubMed

    Resende, Tiago; Correia, Daniela M; Rocha, Miguel; Rocha, Isabel

    2013-11-15

    Helicobacter pylori is a pathogenic bacterium that colonizes the human epithelia, causing duodenal and gastric ulcers, and gastric cancer. The genome of H. pylori 26695 has been previously sequenced and annotated. In addition, two genome-scale metabolic models have been developed. In order to maintain accurate and relevant information on coding sequences (CDS) and to retrieve new information, the assignment of new functions to Helicobacter pylori 26695s genes was performed in this work. The use of software tools, on-line databases and an annotation pipeline for inspecting each gene allowed the attribution of validated EC numbers and TC numbers to metabolic genes encoding enzymes and transport proteins, respectively. 1212 genes encoding proteins were identified in this annotation, being 712 metabolic genes and 500 non-metabolic, while 191 new functions were assignment to the CDS of this bacterium. This information provides relevant biological information for the scientific community dealing with this organism and can be used as the basis for a new metabolic model reconstruction.

  4. Drug Education: An Annotated Bibliography.

    ERIC Educational Resources Information Center

    Mathieson, Moira B.

    This bibliography consists of a total of 215 entries dealing with drug education, including curriculum guides, and drawn from documents in the ERIC system. There are two sections, the first containing 130 annotated citations of documents and journal articles, and the second containing 85 citations of journal articles without annotations, but with…

  5. Adult Basic Education Annotated Bibliography.

    ERIC Educational Resources Information Center

    Carter, Nancy B.

    This annotated bibliography contains sections divided according to area of study, and within each category materials are listed alphabetically by publisher. Publishers and mailing addresses are listed at the end of the bibliography. Throughout the annotations, whenever specific grade level divisions are not named, the regular Adult Basic Education…

  6. Morphosyntactic Annotation of CHILDES Transcripts

    ERIC Educational Resources Information Center

    Sagae, Kenji; Davis, Eric; Lavie, Alon; MacWhinney, Brian; Wintner, Shuly

    2010-01-01

    Corpora of child language are essential for research in child language acquisition and psycholinguistics. Linguistic annotation of the corpora provides researchers with better means for exploring the development of grammatical constructions and their usage. We describe a project whose goal is to annotate the English section of the CHILDES database…

  7. Accurate in silico identification of species-specific acetylation sites by integrating protein sequence-derived and functional features

    NASA Astrophysics Data System (ADS)

    Li, Yuan; Wang, Mingjun; Wang, Huilin; Tan, Hao; Zhang, Ziding; Webb, Geoffrey I.; Song, Jiangning

    2014-07-01

    Lysine acetylation is a reversible post-translational modification, playing an important role in cytokine signaling, transcriptional regulation, and apoptosis. To fully understand acetylation mechanisms, identification of substrates and specific acetylation sites is crucial. Experimental identification is often time-consuming and expensive. Alternative bioinformatics methods are cost-effective and can be used in a high-throughput manner to generate relatively precise predictions. Here we develop a method termed as SSPKA for species-specific lysine acetylation prediction, using random forest classifiers that combine sequence-derived and functional features with two-step feature selection. Feature importance analysis indicates functional features, applied for lysine acetylation site prediction for the first time, significantly improve the predictive performance. We apply the SSPKA model to screen the entire human proteome and identify many high-confidence putative substrates that are not previously identified. The results along with the implemented Java tool, serve as useful resources to elucidate the mechanism of lysine acetylation and facilitate hypothesis-driven experimental design and validation.

  8. An efficient and accurate approximation to time-dependent density functional theory for systems of weakly coupled monomers

    NASA Astrophysics Data System (ADS)

    Liu, Jie; Herbert, John M.

    2015-07-01

    A novel formulation of time-dependent density functional theory (TDDFT) is derived, based on non-orthogonal, absolutely-localized molecular orbitals (ALMOs). We call this approach TDDFT(MI), in reference to ALMO-based methods for describing molecular interactions (MI) that have been developed for ground-state applications. TDDFT(MI) is intended for efficient excited-state calculations in systems composed of multiple, weakly interacting chromophores. The efficiency is based upon (1) a local excitation approximation; (2) monomer-based, singly-excited basis states; (3) an efficient localization procedure; and (4) a one-step Davidson method to solve the TDDFT(MI) working equation. We apply this methodology to study molecular dimers, water clusters, solvated chromophores, and aggregates of naphthalene diimide that form the building blocks of self-assembling organic nanotubes. Absolute errors of 0.1-0.3 eV with respect to supersystem methods are achievable for these systems, especially for cases involving an excited chromophore that is weakly coupled to several explicit solvent molecules. Excited-state calculations in an aggregate of nine naphthalene diimide monomers are ˜40 times faster than traditional TDDFT calculations.

  9. De Novo Assembly, Functional Annotation and Comparative Analysis of Withania somnifera Leaf and Root Transcriptomes to Identify Putative Genes Involved in the Withanolides Biosynthesis

    PubMed Central

    Gupta, Parul; Goel, Ridhi; Pathak, Sumya; Srivastava, Apeksha; Singh, Surya Pratap; Sangwan, Rajender Singh; Asif, Mehar Hasan; Trivedi, Prabodh Kumar

    2013-01-01

    Withania somnifera is one of the most valuable medicinal plants used in Ayurvedic and other indigenous medicine systems due to bioactive molecules known as withanolides. As genomic information regarding this plant is very limited, little information is available about biosynthesis of withanolides. To facilitate the basic understanding about the withanolide biosynthesis pathways, we performed transcriptome sequencing for Withania leaf (101L) and root (101R) which specifically synthesize withaferin A and withanolide A, respectively. Pyrosequencing yielded 8,34,068 and 7,21,755 reads which got assembled into 89,548 and 1,14,814 unique sequences from 101L and 101R, respectively. A total of 47,885 (101L) and 54,123 (101R) could be annotated using TAIR10, NR, tomato and potato databases. Gene Ontology and KEGG analyses provided a detailed view of all the enzymes involved in withanolide backbone synthesis. Our analysis identified members of cytochrome P450, glycosyltransferase and methyltransferase gene families with unique presence or differential expression in leaf and root and might be involved in synthesis of tissue-specific withanolides. We also detected simple sequence repeats (SSRs) in transcriptome data for use in future genetic studies. Comprehensive sequence resource developed for Withania, in this study, will help to elucidate biosynthetic pathway for tissue-specific synthesis of secondary plant products in non-model plant organisms as well as will be helpful in developing strategies for enhanced biosynthesis of withanolides through biotechnological approaches. PMID:23667511

  10. De novo assembly, functional annotation and comparative analysis of Withania somnifera leaf and root transcriptomes to identify putative genes involved in the withanolides biosynthesis.

    PubMed

    Gupta, Parul; Goel, Ridhi; Pathak, Sumya; Srivastava, Apeksha; Singh, Surya Pratap; Sangwan, Rajender Singh; Asif, Mehar Hasan; Trivedi, Prabodh Kumar

    2013-01-01

    Withania somnifera is one of the most valuable medicinal plants used in Ayurvedic and other indigenous medicine systems due to bioactive molecules known as withanolides. As genomic information regarding this plant is very limited, little information is available about biosynthesis of withanolides. To facilitate the basic understanding about the withanolide biosynthesis pathways, we performed transcriptome sequencing for Withania leaf (101L) and root (101R) which specifically synthesize withaferin A and withanolide A, respectively. Pyrosequencing yielded 8,34,068 and 7,21,755 reads which got assembled into 89,548 and 1,14,814 unique sequences from 101L and 101R, respectively. A total of 47,885 (101L) and 54,123 (101R) could be annotated using TAIR10, NR, tomato and potato databases. Gene Ontology and KEGG analyses provided a detailed view of all the enzymes involved in withanolide backbone synthesis. Our analysis identified members of cytochrome P450, glycosyltransferase and methyltransferase gene families with unique presence or differential expression in leaf and root and might be involved in synthesis of tissue-specific withanolides. We also detected simple sequence repeats (SSRs) in transcriptome data for use in future genetic studies. Comprehensive sequence resource developed for Withania, in this study, will help to elucidate biosynthetic pathway for tissue-specific synthesis of secondary plant products in non-model plant organisms as well as will be helpful in developing strategies for enhanced biosynthesis of withanolides through biotechnological approaches.

  11. KSHV 2.0: A Comprehensive Annotation of the Kaposi's Sarcoma-Associated Herpesvirus Genome Using Next-Generation Sequencing Reveals Novel Genomic and Functional Features

    PubMed Central

    Arias, Carolina; Weisburd, Ben; Stern-Ginossar, Noam; Mercier, Alexandre; Madrid, Alexis S.; Bellare, Priya; Holdorf, Meghan; Weissman, Jonathan S.; Ganem, Don

    2014-01-01

    Productive herpesvirus infection requires a profound, time-controlled remodeling of the viral transcriptome and proteome. To gain insights into the genomic architecture and gene expression control in Kaposi's sarcoma-associated herpesvirus (KSHV), we performed a systematic genome-wide survey of viral transcriptional and translational activity throughout the lytic cycle. Using mRNA-sequencing and ribosome profiling, we found that transcripts encoding lytic genes are promptly bound by ribosomes upon lytic reactivation, suggesting their regulation is mainly transcriptional. Our approach also uncovered new genomic features such as ribosome occupancy of viral non-coding RNAs, numerous upstream and small open reading frames (ORFs), and unusual strategies to expand the virus coding repertoire that include alternative splicing, dynamic viral mRNA editing, and the use of alternative translation initiation codons. Furthermore, we provide a refined and expanded annotation of transcription start sites, polyadenylation sites, splice junctions, and initiation/termination codons of known and new viral features in the KSHV genomic space which we have termed KSHV 2.0. Our results represent a comprehensive genome-scale image of gene regulation during lytic KSHV infection that substantially expands our understanding of the genomic architecture and coding capacity of the virus. PMID:24453964

  12. A semi-automatic annotation tool for cooking video

    NASA Astrophysics Data System (ADS)

    Bianco, Simone; Ciocca, Gianluigi; Napoletano, Paolo; Schettini, Raimondo; Margherita, Roberto; Marini, Gianluca; Gianforme, Giorgio; Pantaleo, Giuseppe

    2013-03-01

    In order to create a cooking assistant application to guide the users in the preparation of the dishes relevant to their profile diets and food preferences, it is necessary to accurately annotate the video recipes, identifying and tracking the foods of the cook. These videos present particular annotation challenges such as frequent occlusions, food appearance changes, etc. Manually annotate the videos is a time-consuming, tedious and error-prone task. Fully automatic tools that integrate computer vision algorithms to extract and identify the elements of interest are not error free, and false positive and false negative detections need to be corrected in a post-processing stage. We present an interactive, semi-automatic tool for the annotation of cooking videos that integrates computer vision techniques under the supervision of the user. The annotation accuracy is increased with respect to completely automatic tools and the human effort is reduced with respect to completely manual ones. The performance and usability of the proposed tool are evaluated on the basis of the time and effort required to annotate the same video sequences.

  13. An accurate cluster selection function for the J-PAS narrow-band wide-field survey

    NASA Astrophysics Data System (ADS)

    Ascaso, B.; Benítez, N.; Dupke, R.; Cypriano, E.; Lima-Neto, G.; López-Sanjuan, C.; Varela, J.; Alcaniz, J. S.; Broadhurst, T.; Cenarro, A. J.; Devi, N. Chandrachani; Díaz-García, L. A.; Fernandes, C. A. C.; Hernández-Monteagudo, C.; Mei, S.; Mendes de Oliveira, C.; Molino, A.; Oteo, I.; Schoenell, W.; Sodré, L.; Viironen, K.; Marín-Franch, A.

    2016-03-01

    The impending Javalambre Physics of the accelerating Universe Astrophysical Survey (J-PAS) will be the first wide-field survey of ≳ 8500 deg2 to reach the `stage IV' category. Because of the redshift resolution afforded by 54 narrow-band filters, J-PAS is particularly suitable for cluster detection in the range z<1. The photometric redshift dispersion is estimated to be only ˜0.003 with few outliers ≲4 per cent for galaxies brighter than i ˜ 23 AB, because of the sensitivity of narrow band imaging to absorption and emission lines. Here, we evaluate the cluster selection function for J-PAS using N-body+semi-analytical realistic mock catalogues. We optimally detect clusters from this simulation with the Bayesian Cluster Finder, and we assess the completeness and purity of cluster detection against the mock data. The minimum halo mass threshold we find for detections of galaxy clusters and groups with both >80 per cent completeness and purity is Mh ˜ 5 × 1013 M⊙ up to z ˜ 0.7. We also model the optical observable, M^{*}_CL-halo mass relation, finding a non-evolution with redshift and main scatter of σ _{M^{*}_CL | M_h}˜ 0.14 dex down to a factor 2 lower in mass than other planned broad-band stage IV surveys, at least. For the Mh ˜ 1 × 1014 M⊙ Planck mass limit, J-PAS will arrive up to z ˜ 0.85 with a σ _{M^{*}_CL | M_h}˜ 0.12 dex. Therefore, J-PAS will provide the largest sample of clusters and groups up to z ˜ 0.8 with a mass calibration accuracy comparable to X-ray data.

  14. The route to MBxNyCz molecular wheels: II. Results using accurate functionals and basis sets

    NASA Astrophysics Data System (ADS)

    Güthler, A.; Mukhopadhyay, S.; Pandey, R.; Boustani, I.

    2014-04-01

    Applying ab initio quantum chemical methods, molecular wheels composed of metal and light atoms were investigated. High quality basis sets 6-31G*, TZPV, and cc-pVTZ as well as exchange and non-local correlation functionals B3LYP, BP86 and B3P86 were used. The ground-state energy and structures of cyclic planar and pyramidal clusters TiBn (for n = 3-10) were computed. In addition, the relative stability and electronic structures of molecular wheels TiBxNyCz (for x, y, z = 0-10) and MBnC10-n (for n = 2 to 5 and M = Sc to Zn) were determined. This paper sustains a follow-up study to the previous one of Boustani and Pandey [Solid State Sci. 14 (2012) 1591], in which the calculations were carried out at the HF-SCF/STO3G/6-31G level of theory to determine the initial stability and properties. The results show that there is a competition between the 2D planar and the 3D pyramidal TiBn clusters (for n = 3-8). Different isomers of TiB10 clusters were also studied and a structural transition of 3D-isomer into 2D-wheel is presented. Substitution boron in TiB10 by carbon or/and nitrogen atoms enhances the stability and leads toward the most stable wheel TiB3C7. Furthermore, the computations show that Sc, Ti and V at the center of the molecular wheels are energetically favored over other transition metal atoms of the first row.

  15. Accurate Analytic Potential Functions for the a ^3Π_1 and X ^1Σ^+ States of {IBr}

    NASA Astrophysics Data System (ADS)

    Yukiya, Tokio; Nishimiya, Nobuo; Suzuki, Masao; Le Roy, Robert

    2014-06-01

    Spectra of IBr in various wavelength regions have been measured by a number of researchers using traditional diffraction grating and microwave methods, as well as using high-resolution laser techniques combined with a Fourier transform spectrometer. In a previous paper at this meeting, we reported a preliminary determination of analytic potential energy functions for the A ^3Π_1 and X ^1Σ^+ states of IBr from a direct-potential-fit (DPF) analysis of all of the data available at that time. That study also confirmed the presence of anomalous fluctuations in the v--dependence of the first differences of the inertial rotational constant, Δ Bv=Bv+1-Bv in the A ^3Π_1 state for vibrational levels with v'(A) in the mid 20's. However, our previous experience in a recent study of the analogous A ^3Π_1-X ^1Σ_g^+ system of Br_2 suggested that the effect of such fluctuations may be overcome if sufficient data are available. The present work therefore reports new measurements of transitions to levels in the v'(A)=23-26 region, together with a new global DPF analysis that uses ``robust" least-squares fits to average properly over the effect of such fluctuations in order to provide an optimum delineation of the underlying potential energy curve(s). L.E.Selin,Ark. Fys. 21,479(1962) E. Tiemann and Th. Moeller, Z. Naturforsch. A 30,986 (1975) E.M. Weinstock and A. Preston, J. Mol. Spectrosc. 70, 188 (1978) D.R.T. Appadoo, P.F. Bernath, and R.J. Le Roy, Can. J. Phys. 72, 1265 (1994) N. Nishimiya, T. Yukiya and M. Suzuki, J. Mol. Spectrosc. 173, 8 (1995). T. Yukiya, N. Nishimiya, and R.J. Le Roy, Paper MF12 at the 65th Ohio State University International Symposium on Molecular Spectroscopy, Columbus, Ohio, June 20-24, 2011. T. Yukiya, N. Nishimiya, Y. Samajima, K. Yamaguchi, M. Suzuki, C.D. Boone, I. Ozier and R.J. Le Roy, J. Mol. Spectrosc. 283, 32 (2013) J.K.G. Watson, J. Mol. Spectrosc. 219, 326 (2003).

  16. EXTRACT: Interactive extraction of environment metadata and term suggestion for metagenomic sample annotation

    SciTech Connect

    Pafilis, Evangelos; Buttigieg, Pier Luigi; Ferrell, Barbra; Pereira, Emiliano; Schnetzer, Julia; Arvanitidis, Christos; Jensen, Lars Juhl

    2016-01-01

    The microbial and molecular ecology research communities have made substantial progress on developing standards for annotating samples with environment metadata. However, sample manual annotation is a highly labor intensive process and requires familiarity with the terminologies used. We have therefore developed an interactive annotation tool, EXTRACT, which helps curators identify and extract standard-compliant terms for annotation of metagenomic records and other samples. Behind its web-based user interface, the system combines published methods for named entity recognition of environment, organism, tissue and disease terms. The evaluators in the BioCreative V Interactive Annotation Task found the system to be intuitive, useful, well documented and sufficiently accurate to be helpful in spotting relevant text passages and extracting organism and environment terms. Here the comparison of fully manual and text-mining-assisted curation revealed that EXTRACT speeds up annotation by 15–25% and helps curators to detect terms that would otherwise have been missed.

  17. The accurate calculation of the band gap of liquid water by means of GW corrections applied to plane-wave density functional theory molecular dynamics simulations.

    PubMed

    Fang, Changming; Li, Wun-Fan; Koster, Rik S; Klimeš, Jiří; van Blaaderen, Alfons; van Huis, Marijn A

    2015-01-01

    Knowledge about the intrinsic electronic properties of water is imperative for understanding the behaviour of aqueous solutions that are used throughout biology, chemistry, physics, and industry. The calculation of the electronic band gap of liquids is challenging, because the most accurate ab initio approaches can be applied only to small numbers of atoms, while large numbers of atoms are required for having configurations that are representative of a liquid. Here we show that a high-accuracy value for the electronic band gap of water can be obtained by combining beyond-DFT methods and statistical time-averaging. Liquid water is simulated at 300 K using a plane-wave density functional theory molecular dynamics (PW-DFT-MD) simulation and a van der Waals density functional (optB88-vdW). After applying a self-consistent GW correction the band gap of liquid water at 300 K is calculated as 7.3 eV, in good agreement with recent experimental observations in the literature (6.9 eV). For simulations of phase transformations and chemical reactions in water or aqueous solutions whereby an accurate description of the electronic structure is required, we suggest to use these advanced GW corrections in combination with the statistical analysis of quantum mechanical MD simulations.

  18. GFam: a platform for automatic annotation of gene families

    PubMed Central

    Sasidharan, Rajkumar; Nepusz, Tamás; Swarbreck, David; Huala, Eva; Paccanaro, Alberto

    2012-01-01

    We have developed GFam, a platform for automatic annotation of gene/protein families. GFam provides a framework for genome initiatives and model organism resources to build domain-based families, derive meaningful functional labels and offers a seamless approach to propagate functional annotation across periodic genome updates. GFam is a hybrid approach that uses a greedy algorithm to chain component domains from InterPro annotation provided by its 12 member resources followed by a sequence-based connected component analysis of un-annotated sequence regions to derive consensus domain architecture for each sequence and subsequently generate families based on common architectures. Our integrated approach increases sequence coverage by 7.2 percentage points and residue coverage by 14.6 percentage points higher than the coverage relative to the best single-constituent database within InterPro for the proteome of Arabidopsis. The true power of GFam lies in maximizing annotation provided by the different InterPro data sources that offer resource-specific coverage for different regions of a sequence. GFam’s capability to capture higher sequence and residue coverage can be useful for genome annotation, comparative genomics and functional studies. GFam is a general-purpose software and can be used for any collection of protein sequences. The software is open source and can be obtained from http://www.paccanarolab.org/software/gfam/. PMID:22790981

  19. MetaStorm: A Public Resource for Customizable Metagenomics Annotation.

    PubMed

    Arango-Argoty, Gustavo; Singh, Gargi; Heath, Lenwood S; Pruden, Amy; Xiao, Weidong; Zhang, Liqing

    2016-01-01

    Metagenomics is a trending research area, calling for the need to analyze large quantities of data generated from next generation DNA sequencing technologies. The need to store, retrieve, analyze, share, and visualize such data challenges current online computational systems. Interpretation and annotation of specific information is especially a challenge for metagenomic data sets derived from environmental samples, because current annotation systems only offer broad classification of microbial diversity and function. Moreover, existing resources are not configured to readily address common questions relevant to environmental systems. Here we developed a new online user-friendly metagenomic analysis server called MetaStorm (http://bench.cs.vt.edu/MetaStorm/), which facilitates customization of computational analysis for metagenomic data sets. Users can upload their own reference databases to tailor the metagenomics annotation to focus on various taxonomic and functional gene markers of interest. MetaStorm offers two major analysis pipelines: an assembly-based annotation pipeline and the standard read annotation pipeline used by existing web servers. These pipelines can be selected individually or together. Overall, MetaStorm provides enhanced interactive visualization to allow researchers to explore and manipulate taxonomy and functional annotation at various levels of resolution. PMID:27632579

  20. MetaStorm: A Public Resource for Customizable Metagenomics Annotation

    PubMed Central

    Arango-Argoty, Gustavo; Singh, Gargi; Heath, Lenwood S.; Pruden, Amy; Xiao, Weidong; Zhang, Liqing

    2016-01-01

    Metagenomics is a trending research area, calling for the need to analyze large quantities of data generated from next generation DNA sequencing technologies. The need to store, retrieve, analyze, share, and visualize such data challenges current online computational systems. Interpretation and annotation of specific information is especially a challenge for metagenomic data sets derived from environmental samples, because current annotation systems only offer broad classification of microbial diversity and function. Moreover, existing resources are not configured to readily address common questions relevant to environmental systems. Here we developed a new online user-friendly metagenomic analysis server called MetaStorm (http://bench.cs.vt.edu/MetaStorm/), which facilitates customization of computational analysis for metagenomic data sets. Users can upload their own reference databases to tailor the metagenomics annotation to focus on various taxonomic and functional gene markers of interest. MetaStorm offers two major analysis pipelines: an assembly-based annotation pipeline and the standard read annotation pipeline used by existing web servers. These pipelines can be selected individually or together. Overall, MetaStorm provides enhanced interactive visualization to allow researchers to explore and manipulate taxonomy and functional annotation at various levels of resolution. PMID:27632579

  1. The near-equivalence of five species of spectrally-accurate radial basis functions (RBFs): Asymptotic approximations to the RBF cardinal functions on a uniform, unbounded grid

    NASA Astrophysics Data System (ADS)

    Boyd, John P.

    2011-02-01

    Radial basis function (RBF) interpolants have become popular in computer graphics, neural networks and for solving partial differential equations in many fields of science and engineering. In this article, we compare five different species of RBFs: Gaussians, hyperbolic secant (sech's), inverse quadratics, multiquadrics and inverse multiquadrics. We show that the corresponding cardinal functions for a uniform, unbounded grid are all approximated by the same function: C(X) ∼ (1/(ρ)) sin (πX)/sinh (πX/ρ) for some constant ρ(α) which depends on the inverse width parameter (“shape parameter”) α of the RBF and also on the RBF species. The error in this approximation is exponentially small in 1/α for sech's and inverse quadratics and exponentially small in 1/α2 for Gaussians; the error is proportional to α4 for multiquadrics and inverse multiquadrics. The error in all cases is small even for α ∼ O(1). These results generalize to higher dimensions. The Gaussian RBF cardinal functions in any number of dimensions d are, without approximation, the tensor product of one dimensional Gaussian cardinal functions: Cd(x1,x2…,xd)=∏j=1dC(xj). For other RBF species, we show that the two-dimensional cardinal functions are well approximated by the products of one-dimensional cardinal functions; again the error goes to zero as α → 0. The near-identity of the cardinal functions implies that all five species of RBF interpolants are (almost) the same, despite the great differences in the RBF ϕ's themselves.

  2. Annotated Bibliography on Religious Development.

    ERIC Educational Resources Information Center

    Bucher, Anton A.; Reich, K. Helmut

    1991-01-01

    Presents an annotated bibliography on religious development that covers the areas of psychology and religion, measurement of religiousness, religious development during the life cycle, religious experiences, conversion, religion and morality, and images of God. (Author/BB)

  3. Patient Education: An Annotated Bibliography.

    ERIC Educational Resources Information Center

    Simmons, Jeannette

    Topics included in this annotated bibliography on patient education are (1) background on development of patient education programs, (2) patient education interventions, (3) references for health professionals, and (4) research and evaluation in patient education. (TA)

  4. Accurate monotone cubic interpolation

    NASA Technical Reports Server (NTRS)

    Huynh, Hung T.

    1991-01-01

    Monotone piecewise cubic interpolants are simple and effective. They are generally third-order accurate, except near strict local extrema where accuracy degenerates to second-order due to the monotonicity constraint. Algorithms for piecewise cubic interpolants, which preserve monotonicity as well as uniform third and fourth-order accuracy are presented. The gain of accuracy is obtained by relaxing the monotonicity constraint in a geometric framework in which the median function plays a crucial role.

  5. The identification of complete domains within protein sequences using accurate E-values for semi-global alignment

    PubMed Central

    Kann, Maricel G.; Sheetlin, Sergey L.; Park, Yonil; Bryant, Stephen H.; Spouge, John L.

    2007-01-01

    The sequencing of complete genomes has created a pressing need for automated annotation of gene function. Because domains are the basic units of protein function and evolution, a gene can be annotated from a domain database by aligning domains to the corresponding protein sequence. Ideally, complete domains are aligned to protein subsequences, in a ‘semi-global alignment’. Local alignment, which aligns pieces of domains to subsequences, is common in high-throughput annotation applications, however. It is a mature technique, with the heuristics and accurate E-values required for screening large databases and evaluating the screening results. Hidden Markov models (HMMs) provide an alternative theoretical framework for semi-global alignment, but their use is limited because they lack heuristic acceleration and accurate E-values. Our new tool, GLOBAL, overcomes some limitations of previous semi-global HMMs: it has accurate E-values and the possibility of the heuristic acceleration required for high-throughput applications. Moreover, according to a standard of truth based on protein structure, two semi-global HMM alignment tools (GLOBAL and HMMer) had comparable performance in identifying complete domains, but distinctly outperformed two tools based on local alignment. When searching for complete protein domains, therefore, GLOBAL avoids disadvantages commonly associated with HMMs, yet maintains their superior retrieval performance. PMID:17596268

  6. Comparative Omics-Driven Genome Annotation Refinement: Application across Yersiniae

    SciTech Connect

    Rutledge, Alexandra C.; Jones, Marcus B.; Chauhan, Sadhana; Purvine, Samuel O.; Sanford, James; Monroe, Matthew E.; Brewer, Heather M.; Payne, Samuel H.; Ansong, Charles; Frank, Bryan C.; Smith, Richard D.; Peterson, Scott; Motin, Vladimir L.; Adkins, Joshua N.

    2012-03-27

    the discovery of many translated pseudogenes underscores a need for functional analyses to investigate hypotheses related to divergence. Refinements included the discovery of a seemingly essential ribosomal protein, several virulence-associated factors, and a transcriptional regulator, among other proteins, most of which are annotated as hypothetical, that were missed during annotation.

  7. CRISPRdigger: detecting CRISPRs with better direct repeat annotations.

    PubMed

    Ge, Ruiquan; Mai, Guoqin; Wang, Pu; Zhou, Manli; Luo, Youxi; Cai, Yunpeng; Zhou, Fengfeng

    2016-01-01

    Clustered regularly interspaced short palindromic repeats (CRISPRs) are important genetic elements in many bacterial and archaeal genomes, and play a key role in prokaryote immune systems' fight against invasive foreign elements. The CRISPR system has also been engineered to facilitate target gene editing in eukaryotic genomes. Using the common features of mis-annotated CRISPRs in prokaryotic genomes, this study proposed an accurate de novo CRISPR annotation program CRISPRdigger, which can take a partially assembled genome as its input. A comprehensive comparison with the three existing programs demonstrated that CRISPRdigger can recover more Direct Repeats (DRs) for CRISPRs and achieve a higher accuracy for a query genome. The program was implemented by Perl and all the parameters had default values, so that a user could annotate CRISPRs in a query genome by supplying only a genome sequence in the FASTA format. All the supplementary data are available at http://www.healthinformaticslab.org/supp/. PMID:27596864

  8. CRISPRdigger: detecting CRISPRs with better direct repeat annotations

    PubMed Central

    Ge, Ruiquan; Mai, Guoqin; Wang, Pu; Zhou, Manli; Luo, Youxi; Cai, Yunpeng; Zhou, Fengfeng

    2016-01-01

    Clustered regularly interspaced short palindromic repeats (CRISPRs) are important genetic elements in many bacterial and archaeal genomes, and play a key role in prokaryote immune systems’ fight against invasive foreign elements. The CRISPR system has also been engineered to facilitate target gene editing in eukaryotic genomes. Using the common features of mis-annotated CRISPRs in prokaryotic genomes, this study proposed an accurate de novo CRISPR annotation program CRISPRdigger, which can take a partially assembled genome as its input. A comprehensive comparison with the three existing programs demonstrated that CRISPRdigger can recover more Direct Repeats (DRs) for CRISPRs and achieve a higher accuracy for a query genome. The program was implemented by Perl and all the parameters had default values, so that a user could annotate CRISPRs in a query genome by supplying only a genome sequence in the FASTA format. All the supplementary data are available at http://www.healthinformaticslab.org/supp/. PMID:27596864

  9. Semantic annotation of biological concepts interplaying microbial cellular responses

    PubMed Central

    2011-01-01

    Background Automated extraction systems have become a time saving necessity in Systems Biology. Considerable human effort is needed to model, analyse and simulate biological networks. Thus, one of the challenges posed to Biomedical Text Mining tools is that of learning to recognise a wide variety of biological concepts with different functional roles to assist in these processes. Results Here, we present a novel corpus concerning the integrated cellular responses to nutrient starvation in the model-organism Escherichia coli. Our corpus is a unique resource in that it annotates biomedical concepts that play a functional role in expression, regulation and metabolism. Namely, it includes annotations for genetic information carriers (genes and DNA, RNA molecules), proteins (transcription factors, enzymes and transporters), small metabolites, physiological states and laboratory techniques. The corpus consists of 130 full-text papers with a total of 59043 annotations for 3649 different biomedical concepts; the two dominant classes are genes (highest number of unique concepts) and compounds (most frequently annotated concepts), whereas other important cellular concepts such as proteins account for no more than 10% of the annotated concepts. Conclusions To the best of our knowledge, a corpus that details such a wide range of biological concepts has never been presented to the text mining community. The inter-annotator agreement statistics provide evidence of the importance of a consolidated background when dealing with such complex descriptions, the ambiguities naturally arising from the terminology and their impact for modelling purposes. Availability is granted for the full-text corpora of 130 freely accessible documents, the annotation scheme and the annotation guidelines. Also, we include a corpus of 340 abstracts. PMID:22122862

  10. Accurate structure, thermodynamics and spectroscopy of medium-sized radicals by hybrid Coupled Cluster/Density Functional Theory approaches: the case of phenyl radical

    PubMed Central

    Barone, Vincenzo; Biczysko, Malgorzata; Bloino, Julien; Egidi, Franco; Puzzarini, Cristina

    2015-01-01

    The CCSD(T) model coupled with extrapolation to the complete basis-set limit and additive approaches represents the “golden standard” for the structural and spectroscopic characterization of building blocks of biomolecules and nanosystems. However, when open-shell systems are considered, additional problems related to both specific computational difficulties and the need of obtaining spin-dependent properties appear. In this contribution, we present a comprehensive study of the molecular structure and spectroscopic (IR, Raman, EPR) properties of the phenyl radical with the aim of validating an accurate computational protocol able to deal with conjugated open-shell species. We succeeded in obtaining reliable and accurate results, thus confirming and, partly, extending the available experimental data. The main issue to be pointed out is the need of going beyond the CCSD(T) level by including a full treatment of triple excitations in order to fulfil the accuracy requirements. On the other hand, the reliability of density functional theory in properly treating open-shell systems has been further confirmed. PMID:23802956

  11. Two-component density functional theory within the projector augmented-wave approach: Accurate and self-consistent computations of positron lifetimes and momentum distributions

    NASA Astrophysics Data System (ADS)

    Wiktor, Julia; Jomard, Gérald; Torrent, Marc

    2015-09-01

    Many techniques have been developed in the past in order to compute positron lifetimes in materials from first principles. However, there is still a lack of a fast and accurate self-consistent scheme that could handle accurately the forces acting on the ions induced by the presence of the positron. We will show in this paper that we have reached this goal by developing the two-component density functional theory within the projector augmented-wave (PAW) method in the open-source code abinit. This tool offers the accuracy of the all-electron methods with the computational efficiency of the plane-wave ones. We can thus deal with supercells that contain few hundreds to thousands of atoms to study point defects as well as more extended defects clusters. Moreover, using the PAW basis set allows us to use techniques able to, for instance, treat strongly correlated systems or spin-orbit coupling, which are necessary to study heavy elements, such as the actinides or their compounds.

  12. CART—a chemical annotation retrieval toolkit

    PubMed Central

    Deghou, Samy; Zeller, Georg; Iskar, Murat; Driessen, Marja; Castillo, Mercedes; van Noort, Vera; Bork, Peer

    2016-01-01

    Motivation: Data on bioactivities of drug-like chemicals are rapidly accumulating in public repositories, creating new opportunities for research in computational systems pharmacology. However, integrative analysis of these data sets is difficult due to prevailing ambiguity between chemical names and identifiers and a lack of cross-references between databases. Results: To address this challenge, we have developed CART, a Chemical Annotation Retrieval Toolkit. As a key functionality, it matches an input list of chemical names into a comprehensive reference space to assign unambiguous chemical identifiers. In this unified space, bioactivity annotations can be easily retrieved from databases covering a wide variety of chemical effects on biological systems. Subsequently, CART can determine annotations enriched in the input set of chemicals and display these in tabular format and interactive network visualizations, thereby facilitating integrative analysis of chemical bioactivity data. Availability and Implementation: CART is available as a Galaxy web service (cart.embl.de). Source code and an easy-to-install command line tool can also be obtained from the web site. Contact: bork@embl.de Supplementary information: Supplementary data are available at Bioinformatics online. PMID:27256313

  13. UCSC Data Integrator and Variant Annotation Integrator

    PubMed Central

    Hinrichs, Angie S.; Raney, Brian J.; Speir, Matthew L.; Rhead, Brooke; Casper, Jonathan; Karolchik, Donna; Kuhn, Robert M.; Rosenbloom, Kate R.; Zweig, Ann S.; Haussler, David; Kent, W. James

    2016-01-01

    Summary: Two new tools on the UCSC Genome Browser web site provide improved ways of combining information from multiple datasets, optionally including the user's own custom track data and/or data from track hubs. The Data Integrator combines columns from multiple data tracks, showing all items from the first track along with overlapping items from the other tracks. The Variant Annotation Integrator is tailored to adding functional annotations to variant calls; it offers a more restricted set of underlying data tracks but adds predictions of each variant's consequences for any overlapping or nearby gene transcript. When available, it optionally adds additional annotations including effect prediction scores from dbNSFP for missense mutations, ENCODE regulatory summary tracks and conservation scores. Availability and implementation: The web tools are freely available at http://genome.ucsc.edu/ and the underlying database is available for download at http://hgdownload.cse.ucsc.edu/. The software (written in C and Javascript) is available from https://genome-store.ucsc.edu/ and is freely available for academic and non-profit usage; commercial users must obtain a license. Contact: angie@soe.ucsc.edu Supplementary information: Supplementary data are available at Bioinformatics online. PMID:26740527

  14. A DPF Analysis Yields Quantum Mechanically Accurate Analytic Potential Energy Functions for the a ^1Σ^+ and X ^1Σ^+ States of NaH

    NASA Astrophysics Data System (ADS)

    Le Roy, Robert J.; Walji, Sadru; Sentjens, Katherine

    2013-06-01

    Alkali hydride diatomic molecules have long been the object of spectroscopic studies. However, their small reduced mass makes them species for which the conventional semiclassical-based methods of analysis tend to have the largest errors. To date, the only quantum-mechanically accurate direct-potential-fit (DPF) analysis for one of these molecules was the one for LiH reported by Coxon and Dickinson. The present paper extends this level of analysis to NaH, and reports a DPF analysis of all available spectroscopic data for the A ^1Σ^+-X ^1Σ^+ system of NaH which yields analytic potential energy functions for these two states that account for those data (on average) to within the experimental uncertainties. W.C. Stwalley, W.T. Zemke and S.C. Yang, J. Phys. Chem. Ref. Data {20}, 153-187 (1991). J.A. Coxon and C.S. Dickinson, J. Chem. Phys. {121}, 8378 (2004).

  15. Mouse genome annotation by the RefSeq project.

    PubMed

    McGarvey, Kelly M; Goldfarb, Tamara; Cox, Eric; Farrell, Catherine M; Gupta, Tripti; Joardar, Vinita S; Kodali, Vamsi K; Murphy, Michael R; O'Leary, Nuala A; Pujar, Shashikant; Rajput, Bhanu; Rangwala, Sanjida H; Riddick, Lillian D; Webb, David; Wright, Mathew W; Murphy, Terence D; Pruitt, Kim D

    2015-10-01

    Complete and accurate annotation of the mouse genome is critical to the advancement of research conducted on this important model organism. The National Center for Biotechnology Information (NCBI) develops and maintains many useful resources to assist the mouse research community. In particular, the reference sequence (RefSeq) database provides high-quality annotation of multiple mouse genome assemblies using a combinatorial approach that leverages computation, manual curation, and collaboration. Implementation of this conservative and rigorous approach, which focuses on representation of only full-length and non-redundant data, produces high-quality annotation products. RefSeq records explicitly link sequences to current knowledge in a timely manner, updating public records regularly and rapidly in response to nomenclature updates, addition of new relevant publications, collaborator discussion, and user feedback. Whole genome re-annotation is also conducted at least every 12-18 months, and often more frequently in response to assembly updates or availability of informative data. This article highlights key features and advantages of RefSeq genome annotation products and presents an overview of NCBI processes to generate these data. Further discussion of NCBI's resources highlights useful features and the best methods for accessing our data.

  16. Structural and functional screening in human induced-pluripotent stem cell-derived cardiomyocytes accurately identifies cardiotoxicity of multiple drug types

    SciTech Connect

    Doherty, Kimberly R. Talbert, Dominique R.; Trusk, Patricia B.; Moran, Diarmuid M.; Shell, Scott A.; Bacus, Sarah

    2015-05-15

    Safety pharmacology studies that evaluate new drug entities for potential cardiac liability remain a critical component of drug development. Current studies have shown that in vitro tests utilizing human induced pluripotent stem cell-derived cardiomyocytes (hiPS-CM) may be beneficial for preclinical risk evaluation. We recently demonstrated that an in vitro multi-parameter test panel assessing overall cardiac health and function could accurately reflect the associated clinical cardiotoxicity of 4 FDA-approved targeted oncology agents using hiPS-CM. The present studies expand upon this initial observation to assess whether this in vitro screen could detect cardiotoxicity across multiple drug classes with known clinical cardiac risks. Thus, 24 drugs were examined for their effect on both structural (viability, reactive oxygen species generation, lipid formation, troponin secretion) and functional (beating activity) endpoints in hiPS-CM. Using this screen, the cardiac-safe drugs showed no effects on any of the tests in our panel. However, 16 of 18 compounds with known clinical cardiac risk showed drug-induced changes in hiPS-CM by at least one method. Moreover, when taking into account the Cmax values, these 16 compounds could be further classified depending on whether the effects were structural, functional, or both. Overall, the most sensitive test assessed cardiac beating using the xCELLigence platform (88.9%) while the structural endpoints provided additional insight into the mechanism of cardiotoxicity for several drugs. These studies show that a multi-parameter approach examining both cardiac cell health and function in hiPS-CM provides a comprehensive and robust assessment that can aid in the determination of potential cardiac liability. - Highlights: • 24 drugs were tested for cardiac liability using an in vitro multi-parameter screen. • Changes in beating activity were the most sensitive in predicting cardiac risk. • Structural effects add in

  17. Sma3s: a three-step modular annotator for large sequence datasets.

    PubMed

    Muñoz-Mérida, Antonio; Viguera, Enrique; Claros, M Gonzalo; Trelles, Oswaldo; Pérez-Pulido, Antonio J

    2014-08-01

    Automatic sequence annotation is an essential component of modern 'omics' studies, which aim to extract information from large collections of sequence data. Most existing tools use sequence homology to establish evolutionary relationships and assign putative functions to sequences. However, it can be difficult to define a similarity threshold that achieves sufficient coverage without sacrificing annotation quality. Defining the correct configuration is critical and can be challenging for non-specialist users. Thus, the development of robust automatic annotation techniques that generate high-quality annotations without needing expert knowledge would be very valuable for the research community. We present Sma3s, a tool for automatically annotating very large collections of biological sequences from any kind of gene library or genome. Sma3s is composed of three modules that progressively annotate query sequences using either: (i) very similar homologues, (ii) orthologous sequences or (iii) terms enriched in groups of homologous sequences. We trained the system using several random sets of known sequences, demonstrating average sensitivity and specificity values of ~85%. In conclusion, Sma3s is a versatile tool for high-throughput annotation of a wide variety of sequence datasets that outperforms the accuracy of other well-established annotation algorithms, and it can enrich existing database annotations and uncover previously hidden features. Importantly, Sma3s has already been used in the functional annotation of two published transcriptomes. PMID:24501397

  18. The standard operating procedure of the DOE-JGI Microbial Genome Annotation Pipeline (MGAP v.4)

    DOE PAGES

    Huntemann, Marcel; Ivanova, Natalia N.; Mavromatis, Konstantinos; Tripp, H. James; Paez-Espino, David; Palaniappan, Krishnaveni; Szeto, Ernest; Pillay, Manoj; Chen, I-Min A.; Pati, Amrita; et al

    2015-10-26

    The DOE-JGI Microbial Genome Annotation Pipeline performs structural and functional annotation of microbial genomes that are further included into the Integrated Microbial Genome comparative analysis system. MGAP is applied to assembled nucleotide sequence datasets that are provided via the IMG submission site. Dataset submission for annotation first requires project and associated metadata description in GOLD. The MGAP sequence data processing consists of feature prediction including identification of protein-coding genes, non-coding RNAs and regulatory RNA features, as well as CRISPR elements. In conclusion, structural annotation is followed by assignment of protein product names and functions.

  19. Collective dynamics of social annotation.

    PubMed

    Cattuto, Ciro; Barrat, Alain; Baldassarri, Andrea; Schehr, Gregory; Loreto, Vittorio

    2009-06-30

    The enormous increase of popularity and use of the worldwide web has led in the recent years to important changes in the ways people communicate. An interesting example of this fact is provided by the now very popular social annotation systems, through which users annotate resources (such as web pages or digital photographs) with keywords known as "tags." Understanding the rich emergent structures resulting from the uncoordinated actions of users calls for an interdisciplinary effort. In particular concepts borrowed from statistical physics, such as random walks (RWs), and complex networks theory, can effectively contribute to the mathematical modeling of social annotation systems. Here, we show that the process of social annotation can be seen as a collective but uncoordinated exploration of an underlying semantic space, pictured as a graph, through a series of RWs. This modeling framework reproduces several aspects, thus far unexplained, of social annotation, among which are the peculiar growth of the size of the vocabulary used by the community and its complex network structure that represents an externalization of semantic structures grounded in cognition and that are typically hard to access. PMID:19506244

  20. Metagenomic gene annotation by a homology-independent approach

    SciTech Connect

    Froula, Jeff; Zhang, Tao; Salmeen, Annette; Hess, Matthias; Kerfeld, Cheryl A.; Wang, Zhong; Du, Changbin

    2011-06-02

    Fully understanding the genetic potential of a microbial community requires functional annotation of all the genes it encodes. The recently developed deep metagenome sequencing approach has enabled rapid identification of millions of genes from a complex microbial community without cultivation. Current homology-based gene annotation fails to detect distantly-related or structural homologs. Furthermore, homology searches with millions of genes are very computational intensive. To overcome these limitations, we developed rhModeller, a homology-independent software pipeline to efficiently annotate genes from metagenomic sequencing projects. Using cellulases and carbonic anhydrases as two independent test cases, we demonstrated that rhModeller is much faster than HMMER but with comparable accuracy, at 94.5percent and 99.9percent accuracy, respectively. More importantly, rhModeller has the ability to detect novel proteins that do not share significant homology to any known protein families. As {approx}50percent of the 2 million genes derived from the cow rumen metagenome failed to be annotated based on sequence homology, we tested whether rhModeller could be used to annotate these genes. Preliminary results suggest that rhModeller is robust in the presence of missense and frameshift mutations, two common errors in metagenomic genes. Applying the pipeline to the cow rumen genes identified 4,990 novel cellulases candidates and 8,196 novel carbonic anhydrase candidates.In summary, we expect rhModeller to dramatically increase the speed and quality of metagnomic gene annotation.

  1. Automatic annotation of eukaryotic genes, pseudogenes and promoters

    PubMed Central

    Solovyev, Victor; Kosarev, Peter; Seledsov, Igor; Vorobyev, Denis

    2006-01-01

    Background The ENCODE gene prediction workshop (EGASP) has been organized to evaluate how well state-of-the-art automatic gene finding methods are able to reproduce the manual and experimental gene annotation of the human genome. We have used Softberry gene finding software to predict genes, pseudogenes and promoters in 44 selected ENCODE sequences representing approximately 1% (30 Mb) of the human genome. Predictions of gene finding programs were evaluated in terms of their ability to reproduce the ENCODE-HAVANA annotation. Results The Fgenesh++ gene prediction pipeline can identify 91% of coding nucleotides with a specificity of 90%. Our automatic pseudogene finder (PSF program) found 90% of the manually annotated pseudogenes and some new ones. The Fprom promoter prediction program identifies 80% of TATA promoters sequences with one false positive prediction per 2,000 base-pairs (bp) and 50% of TATA-less promoters with one false positive prediction per 650 bp. It can be used to identify transcription start sites upstream of annotated coding parts of genes found by gene prediction software. Conclusion We review our software and underlying methods for identifying these three important structural and functional genome components and discuss the accuracy of predictions, recent advances and open problems in annotating genomic sequences. We have demonstrated that our methods can be effectively used for initial automatic annotation of the eukaryotic genome. PMID:16925832

  2. A survey on annotation tools for the biomedical literature.

    PubMed

    Neves, Mariana; Leser, Ulf

    2014-03-01

    New approaches to biomedical text mining crucially depend on the existence of comprehensive annotated corpora. Such corpora, commonly called gold standards, are important for learning patterns or models during the training phase, for evaluating and comparing the performance of algorithms and also for better understanding the information sought for by means of examples. Gold standards depend on human understanding and manual annotation of natural language text. This process is very time-consuming and expensive because it requires high intellectual effort from domain experts. Accordingly, the lack of gold standards is considered as one of the main bottlenecks for developing novel text mining methods. This situation led the development of tools that support humans in annotating texts. Such tools should be intuitive to use, should support a range of different input formats, should include visualization of annotated texts and should generate an easy-to-parse output format. Today, a range of tools which implement some of these functionalities are available. In this survey, we present a comprehensive survey of tools for supporting annotation of biomedical texts. Altogether, we considered almost 30 tools, 13 of which were selected for an in-depth comparison. The comparison was performed using predefined criteria and was accompanied by hands-on experiences whenever possible. Our survey shows that current tools can support many of the tasks in biomedical text annotation in a satisfying manner, but also that no tool can be considered as a true comprehensive solution.

  3. RefSeq curation and annotation of antizyme and antizyme inhibitor genes in vertebrates

    PubMed Central

    Rajput, Bhanu; Murphy, Terence D.; Pruitt, Kim D.

    2015-01-01

    Polyamines are ubiquitous cations that are involved in regulating fundamental cellular processes such as cell growth and proliferation; hence, their intracellular concentration is tightly regulated. Antizyme and antizyme inhibitor have a central role in maintaining cellular polyamine levels. Antizyme is unique in that it is expressed via a novel programmed ribosomal frameshifting mechanism. Conventional computational tools are unable to predict a programmed frameshift, resulting in misannotation of antizyme transcripts and proteins on transcript and genomic sequences. Correct annotation of a programmed frameshifting event requires manual evaluation. Our goal was to provide an accurately curated and annotated Reference Sequence (RefSeq) data set of antizyme transcript and protein records across a broad taxonomic scope that would serve as standards for accurate representation of these gene products. As antizyme and antizyme inhibitor proteins are functionally connected, we also curated antizyme inhibitor genes to more fully represent the elegant biology of polyamine regulation. Manual review of genes for three members of the antizyme family and two members of the antizyme inhibitor family in 91 vertebrate organisms resulted in a total of 461 curated RefSeq records. PMID:26170238

  4. RefSeq curation and annotation of antizyme and antizyme inhibitor genes in vertebrates.

    PubMed

    Rajput, Bhanu; Murphy, Terence D; Pruitt, Kim D

    2015-09-01

    Polyamines are ubiquitous cations that are involved in regulating fundamental cellular processes such as cell growth and proliferation; hence, their intracellular concentration is tightly regulated. Antizyme and antizyme inhibitor have a central role in maintaining cellular polyamine levels. Antizyme is unique in that it is expressed via a novel programmed ribosomal frameshifting mechanism. Conventional computational tools are unable to predict a programmed frameshift, resulting in misannotation of antizyme transcripts and proteins on transcript and genomic sequences. Correct annotation of a programmed frameshifting event requires manual evaluation. Our goal was to provide an accurately curated and annotated Reference Sequence (RefSeq) data set of antizyme transcript and protein records across a broad taxonomic scope that would serve as standards for accurate representation of these gene products. As antizyme and antizyme inhibitor proteins are functionally connected, we also curated antizyme inhibitor genes to more fully represent the elegant biology of polyamine regulation. Manual review of genes for three members of the antizyme family and two members of the antizyme inhibitor family in 91 vertebrate organisms resulted in a total of 461 curated RefSeq records. PMID:26170238

  5. Prediction of organ toxicity endpoints by QSAR modeling based on precise chemical-histopathology annotations.

    PubMed

    Myshkin, Eugene; Brennan, Richard; Khasanova, Tatiana; Sitnik, Tatiana; Serebriyskaya, Tatiana; Litvinova, Elena; Guryanov, Alexey; Nikolsky, Yuri; Nikolskaya, Tatiana; Bureeva, Svetlana

    2012-09-01

    The ability to accurately predict the toxicity of drug candidates from their chemical structure is critical for guiding experimental drug discovery toward safer medicines. Under the guidance of the MetaTox consortium (Thomson Reuters, CA, USA), which comprised toxicologists from the pharmaceutical industry and government agencies, we created a comprehensive ontology of toxic pathologies for 19 organs, classifying pathology terms by pathology type and functional organ substructure. By manual annotation of full-text research articles, the ontology was populated with chemical compounds causing specific histopathologies. Annotated compound-toxicity associations defined histologically from rat and mouse experiments were used to build quantitative structure-activity relationship models predicting subcategories of liver and kidney toxicity: liver necrosis, liver relative weight gain, liver lipid accumulation, nephron injury, kidney relative weight gain, and kidney necrosis. All models were validated using two independent test sets and demonstrated overall good performance: initial validation showed 0.80-0.96 sensitivity (correctly predicted toxic compounds) and 0.85-1.00 specificity (correctly predicted non-toxic compounds). Later validation against a test set of compounds newly added to the database in the 2 years following initial model generation showed 75-87% sensitivity and 60-78% specificity. General hepatotoxicity and nephrotoxicity models were less accurate, as expected for more complex endpoints.

  6. Draft Genome Sequence and Gene Annotation of the Entomopathogenic Fungus Verticillium hemipterigenum

    PubMed Central

    Horn, Fabian; Habel, Andreas; Scharf, Daniel H.; Dworschak, Jan; Brakhage, Axel A.; Guthke, Reinhard

    2015-01-01

    Verticillium hemipterigenum (anamorph Torrubiella hemipterigena) is an entomopathogenic fungus and produces a broad range of secondary metabolites. Here, we present the draft genome sequence of the fungus, including gene structure and functional annotation. Genes were predicted incorporating RNA-Seq data and functionally annotated to provide the basis for further genome studies. PMID:25614560

  7. Structural and functional screening in human induced-pluripotent stem cell-derived cardiomyocytes accurately identifies cardiotoxicity of multiple drug types.

    PubMed

    Doherty, Kimberly R; Talbert, Dominique R; Trusk, Patricia B; Moran, Diarmuid M; Shell, Scott A; Bacus, Sarah

    2015-05-15

    Safety pharmacology studies that evaluate new drug entities for potential cardiac liability remain a critical component of drug development. Current studies have shown that in vitro tests utilizing human induced pluripotent stem cell-derived cardiomyocytes (hiPS-CM) may be beneficial for preclinical risk evaluation. We recently demonstrated that an in vitro multi-parameter test panel assessing overall cardiac health and function could accurately reflect the associated clinical cardiotoxicity of 4 FDA-approved targeted oncology agents using hiPS-CM. The present studies expand upon this initial observation to assess whether this in vitro screen could detect cardiotoxicity across multiple drug classes with known clinical cardiac risks. Thus, 24 drugs were examined for their effect on both structural (viability, reactive oxygen species generation, lipid formation, troponin secretion) and functional (beating activity) endpoints in hiPS-CM. Using this screen, the cardiac-safe drugs showed no effects on any of the tests in our panel. However, 16 of 18 compounds with known clinical cardiac risk showed drug-induced changes in hiPS-CM by at least one method. Moreover, when taking into account the Cmax values, these 16 compounds could be further classified depending on whether the effects were structural, functional, or both. Overall, the most sensitive test assessed cardiac beating using the xCELLigence platform (88.9%) while the structural endpoints provided additional insight into the mechanism of cardiotoxicity for several drugs. These studies show that a multi-parameter approach examining both cardiac cell health and function in hiPS-CM provides a comprehensive and robust assessment that can aid in the determination of potential cardiac liability.

  8. AGeS: A Software System for Microbial Genome Sequence Annotation

    PubMed Central

    Kumar, Kamal; Desai, Valmik; Cheng, Li; Khitrov, Maxim; Grover, Deepak; Satya, Ravi Vijaya; Yu, Chenggang; Zavaljevski, Nela; Reifman, Jaques

    2011-01-01

    Background The annotation of genomes from next-generation sequencing platforms needs to be rapid, high-throughput, and fully integrated and automated. Although a few Web-based annotation services have recently become available, they may not be the best solution for researchers that need to annotate a large number of genomes, possibly including proprietary data, and store them locally for further analysis. To address this need, we developed a standalone software application, the Annotation of microbial Genome Sequences (AGeS) system, which incorporates publicly available and in-house-developed bioinformatics tools and databases, many of which are parallelized for high-throughput performance. Methodology The AGeS system supports three main capabilities. The first is the storage of input contig sequences and the resulting annotation data in a central, customized database. The second is the annotation of microbial genomes using an integrated software pipeline, which first analyzes contigs from high-throughput sequencing by locating genomic regions that code for proteins, RNA, and other genomic elements through the Do-It-Yourself Annotation (DIYA) framework. The identified protein-coding regions are then functionally annotated using the in-house-developed Pipeline for Protein Annotation (PIPA). The third capability is the visualization of annotated sequences using GBrowse. To date, we have implemented these capabilities for bacterial genomes. AGeS was evaluated by comparing its genome annotations with those provided by three other methods. Our results indicate that the software tools integrated into AGeS provide annotations that are in general agreement with those provided by the compared methods. This is demonstrated by a >94% overlap in the number of identified genes, a significant number of identical annotated features, and a >90% agreement in enzyme function predictions. PMID:21408217

  9. Functional annotation of the mesophilic-like character of mutants in a cold-adapted enzyme by self-organising map analysis of their molecular dynamics.

    PubMed

    Fraccalvieri, Domenico; Tiberti, Matteo; Pandini, Alessandro; Bonati, Laura; Papaleo, Elena

    2012-10-01

    Multiple comparison of the Molecular Dynamics (MD) trajectories of mutants in a cold-adapted α-amylase (AHA) could be used to elucidate functional features required to restore mesophilic-like activity. Unfortunately it is challenging to identify the different dynamic behaviors and correctly relate them to functional activity by routine analysis. We here employed a previously developed and robust two-stage approach that combines Self-Organising Maps (SOMs) and hierarchical clustering to compare conformational ensembles of proteins. Moreover, we designed a novel strategy to identify the specific mutations that more efficiently convert the dynamic signature of the psychrophilic enzyme (AHA) to that of the mesophilic counterpart (PPA). The SOM trained on AHA and its variants was used to classify a PPA MD ensemble and successfully highlighted the relationships between the flexibilities of the target enzyme and of the different mutants. Moreover the local features of the mutants that mostly influence their global flexibility in a mesophilic-like direction were detected. It turns out that mutations of the cold-adapted enzyme to hydrophobic and aromatic residues are the most effective in restoring the PPA dynamic features and could guide the design of more mesophilic-like mutants. In conclusion, our strategy can efficiently extract specific dynamic signatures related to function from multiple comparisons of MD conformational ensembles. Therefore, it can be a promising tool for protein engineering.

  10. Assessing the Drosophila melanogaster and Anopheles gambiae Genome Annotations Using Genome-Wide Sequence Comparisons

    PubMed Central

    Jaillon, Olivier; Dossat, Carole; Eckenberg, Ralph; Eiglmeier, Karin; Segurens, Béatrice; Aury, Jean-Marc; Roth, Charles W.; Scarpelli, Claude; Brey, Paul T.; Weissenbach, Jean; Wincker, Patrick

    2003-01-01

    We performed genome-wide sequence comparisons at the protein coding level between the genome sequences of Drosophila melanogaster and Anopheles gambiae. Such comparisons detect evolutionarily conserved regions (ecores) that can be used for a qualitative and quantitative evaluation of the available annotations of both genomes. They also provide novel candidate features for annotation. The percentage of ecores mapping outside annotations in the A. gambiae genome is about fourfold higher than in D. melanogaster. The A. gambiae genome assembly also contains a high proportion of duplicated ecores, possibly resulting from artefactual sequence duplications in the genome assembly. The occurrence of 4063 ecores in the D. melanogaster genome outside annotations suggests that some genes are not yet or only partially annotated. The present work illustrates the power of comparative genomics approaches towards an exhaustive and accurate establishment of gene models and gene catalogues in insect genomes. PMID:12840038

  11. Metabolomics as a Hypothesis-Generating Functional Genomics Tool for the Annotation of Arabidopsis thaliana Genes of “Unknown Function”

    PubMed Central

    Quanbeck, Stephanie M.; Brachova, Libuse; Campbell, Alexis A.; Guan, Xin; Perera, Ann; He, Kun; Rhee, Seung Y.; Bais, Preeti; Dickerson, Julie A.; Dixon, Philip; Wohlgemuth, Gert; Fiehn, Oliver; Barkan, Lenore; Lange, Iris; Lange, B. Markus; Lee, Insuk; Cortes, Diego; Salazar, Carolina; Shuman, Joel; Shulaev, Vladimir; Huhman, David V.; Sumner, Lloyd W.; Roth, Mary R.; Welti, Ruth; Ilarslan, Hilal; Wurtele, Eve S.; Nikolau, Basil J.

    2012-01-01

    Metabolomics is the methodology that identifies and measures global pools of small molecules (of less than about 1,000 Da) of a biological sample, which are collectively called the metabolome. Metabolomics can therefore reveal the metabolic outcome of a genetic or environmental perturbation of a metabolic regulatory network, and thus provide insights into the structure and regulation of that network. Because of the chemical complexity of the metabolome and limitations associated with individual analytical platforms for determining the metabolome, it is currently difficult to capture the complete metabolome of an organism or tissue, which is in contrast to genomics and transcriptomics. This paper describes the analysis of Arabidopsis metabolomics data sets acquired by a consortium that includes five analytical laboratories, bioinformaticists, and biostatisticians, which aims to develop and validate metabolomics as a hypothesis-generating functional genomics tool. The consortium is determining the metabolomes of Arabidopsis T-DNA mutant stocks, grown in standardized controlled environment optimized to minimize environmental impacts on the metabolomes. Metabolomics data were generated with seven analytical platforms, and the combined data is being provided to the research community to formulate initial hypotheses about genes of unknown function (GUFs). A public database (www.PlantMetabolomics.org) has been developed to provide the scientific community with access to the data along with tools to allow for its interactive analysis. Exemplary datasets are discussed to validate the approach, which illustrate how initial hypotheses can be generated from the consortium-produced metabolomics data, integrated with prior knowledge to provide a testable hypothesis concerning the functionality of GUFs. PMID:22645570

  12. Preserving sequence annotations across reference sequences

    PubMed Central

    2014-01-01

    Background Matching and comparing sequence annotations of different reference sequences is vital to genomics research, yet many annotation formats do not specify the reference sequence types or versions used. This makes the integration of annotations from different sources difficult and error prone. Results As part of our effort to create linked data for interoperable sequence annotations, we present an RDF data model for sequence annotation using the ontological framework established by the OBO Foundry ontologies and the Basic Formal Ontology (BFO). We defined reference sequences as the common domain of integration for sequence annotations, and identified three semantic relationships between sequence annotations. In doing so, we created the Reference Sequence Annotation to compensate for gaps in the SO and in its mapping to BFO, particularly for annotations that refer to versions of consensus reference sequences. Moreover, we present three integration models for sequence annotations using different reference assemblies. Conclusions We demonstrated a working example of a sequence annotation instance, and how this instance can be linked to other annotations on different reference sequences. Sequence annotations in this format are semantically rich and can be integrated easily with different assemblies. We also identify other challenges of modeling reference sequences with the BFO. PMID:25093075

  13. Suppression subtractive hybridization (SSH) combined with bioinformatics method: an integrated functional annotation approach for analysis of differentially expressed immune-genes in insects.

    PubMed

    Badapanda, Chandan

    2013-01-01

    The suppression subtractive hybridization (SSH) approach, a PCR based approach which amplifies differentially expressed cDNAs (complementary DNAs), while simultaneously suppressing amplification of common cDNAs, was employed to identify immuneinducible genes in insects. This technique has been used as a suitable tool for experimental identification of novel genes in eukaryotes as well as prokaryotes; whose genomes have been sequenced, or the species whose genomes have yet to be sequenced. In this article, I have proposed a method for in silico functional characterization of immune-inducible genes from insects. Apart from immune-inducible genes from insects, this method can be applied for the analysis of genes from other species, starting from bacteria to plants and animals. This article is provided with a background of SSH-based method taking specific examples from innate immune-inducible genes in insects, and subsequently a bioinformatics pipeline is proposed for functional characterization of newly sequenced genes. The proposed workflow presented here, can also be applied for any newly sequenced species generated from Next Generation Sequencing (NGS) platforms.

  14. Infant Feeding: An Annotated Bibliography.

    ERIC Educational Resources Information Center

    Crowhurst, Christine Marie, Comp.; Kumer, Bonnie Lee, Comp.

    Intended for parents, health professionals and allied health workers, and others involved in caring for infants and young children, this annotated bibliography brings together in one selective listing a review of over 700 current publications related to infant feeding. Reflecting current knowledge in infant feeding, the bibliography has as its…

  15. English Language Learners: Annotated Bibliography

    ERIC Educational Resources Information Center

    Hector-Mason, Anestine; Bardack, Sarah

    2010-01-01

    This annotated bibliography represents a first step toward compiling a comprehensive overview of current research on issues related to English language learners (ELLs). It is intended to be a resource for researchers, policymakers, administrators, and educators who are engaged in efforts to bridge the divide between research, policy, and practice…

  16. Appalachian Women. An Annotated Bibliography.

    ERIC Educational Resources Information Center

    Hamm, Mary Margo

    This bibliography compiles annotations of 178 books, journal articles, ERIC documents, and dissertations on Appalachian women and their social, cultural, and economic environment. Entries were published 1966-93 and are listed in the following categories: (1) authors and literary criticism; (2) bibliographies and resource guides; (3) economics,…

  17. Radiocarbon Dating: An Annotated Bibliography.

    ERIC Educational Resources Information Center

    Fortine, Suellen

    This selective annotated bibliography covers various sources of information on the radiocarbon dating method, including journal articles, conference proceedings, and reports, reflecting the most important and useful sources of the last 25 years. The bibliography is divided into five parts--general background on radiocarbon, radiocarbon dating,…

  18. Hispanic Heritage. An Annotated Bibliography.

    ERIC Educational Resources Information Center

    Denver Univ., CO. School of Education.

    This annotated bibliography of a wide range of materials for the social studies teacher is concerned with the Hispano heritage. The sections are introduced by a brief description. The sections are: 1) general materials, 2) the land and the people, 3) the European background, 4) Spain's colonial system, 5) the Spanish borderlands, 6) the Anglo…

  19. Rural Education: An Annotated Bibliography.

    ERIC Educational Resources Information Center

    Massey, Sara

    The 120-item annotated bibliography was compiled to facilitate the development of a recently approved course entitled "Topics in Rural Education" at the University of Maine at Machias. Although the dates range from 1964 to 1982, most of the materials were prepared in the 1970s and 1980s. The interrelatedness of the issues makes categorization…

  20. Workforce Reductions. An Annotated Bibliography.

    ERIC Educational Resources Information Center

    Hickok, Thomas A.; Hickok, Thomas A.

    This report, which is based on a review of practitioner-oriented sources and scholarly journals, uses a three-part framework to organize annotated bibliographies that, together, list a total of 104 sources that provide the following three perspectives on work force reduction issues: organizational, organizational-individual relationship, and…

  1. Vietnamese Amerasians: An Annotated Bibliography.

    ERIC Educational Resources Information Center

    Johnson, Mark C.; And Others

    This annotated bibliography on Vietnamese Amerasians includes primary and secondary sources as well as reviews of three documentary films. Sources were selected in order to provide an overview of the historical and political context of Amerasian resettlement and a review of the scant available research on coping and adaptation with this…

  2. Instructional Materials Centers; Annotated Bibliography.

    ERIC Educational Resources Information Center

    Poli, Rosario, Comp.

    An annotated bibliography lists 74 articles and reports on instructional materials centers (IMC) which appeared from 1967-70. The articles deal with such topics as the purposes of an IMC, guidelines for setting up an IMC, and the relationship of an IMC to technology. Most articles deal with use of an IMC on an elementary or secondary level, but…

  3. Nikos Kazantzakis: An Annotated Bibliography.

    ERIC Educational Resources Information Center

    Qiu, Kui

    This research paper consists of an annotated bibliography about Nikos Kazantzakis, one of the major modern Greek writers and author of "The Last Temptation of Christ,""Zorba the Greek," and many other works. Because of Kazantzakis' position in world literature there are many critical works about him; however, bibliographical control of these works…

  4. An Annotated Bibliography for Art.

    ERIC Educational Resources Information Center

    Minnesota State Dept. of Education, St. Paul. Div. of Instruction.

    The annotated bibliography presents approximately 450 references about art for elementary, secondary, and professional levels. It is presented in three sections. Section one identifies 19 resources about art from a professional or teaching perspective. Included are books explaining how to teach various techniques to students of beginning or…

  5. Annotated Bibliography on Humanistic Education

    ERIC Educational Resources Information Center

    Ganung, Cynthia

    1975-01-01

    Part I of this annotated bibliography deals with books and articles on such topics as achievement motivation, process education, transactional analysis, discipline without punishment, role-playing, interpersonal skills, self-acceptance, moral education, self-awareness, values clarification, and non-verbal communication. Part II focuses on…

  6. MSDAC Resource Library Annotated Bibliography.

    ERIC Educational Resources Information Center

    Watson, Cristel; And Others

    This annotated bibliography lists books, films, filmstrips, recordings, and booklets on sex equity. Entries are arranged according to the following topics: career resources, curriculum resources, management, sex equity, sex roles, women's studies, student activities, and sex-fair fiction. Included in each entry are name of author, editor or…

  7. Multicultural Education. An Annotated Bibliography.

    ERIC Educational Resources Information Center

    Narang, H. L.

    This annotated bibliography contains references to books, journal articles, ERIC documents, doctoral dissertations, and audio-visual materials on the subject of multicultural education. Topics include integrating multiculturalism in school subjects, prejudice and discrimination, intercultural communication, ethnic identity and ethnic bias.…

  8. Systems Theory and Communication. Annotated Bibliography.

    ERIC Educational Resources Information Center

    Covington, William G., Jr.

    This annotated bibliography presents annotations of 31 books and journal articles dealing with systems theory and its relation to organizational communication, marketing, information theory, and cybernetics. Materials were published between 1963 and 1992 and are listed alphabetically by author. (RS)

  9. The DOE-JGI Standard Operating Procedure for the Annotations of the Microbial Genomes

    SciTech Connect

    Mavromatis, Konstantinos; Ivanova, Natalia; Chen, I-Min A.; Szeto, Ernest; Markowitz, Victor; Kyrpides, Nikos C.

    2009-05-20

    The DOE-JGI Microbial Annotation Pipeline (DOE-JGI MAP) supports gene prediction and/or functional annotation of microbial genomes towards comparative analysis with the Integrated Microbial Genome (IMG) system. DOE-JGI MAP annotation is applied on nucleotide sequence datasets included in the IMG-ER (Expert Review) version of IMG via the IMG ER submission site. Users can submit the sequence datasets consisting of one or more contigs in a multi-fasta file. DOE-JGI MAP annotation includes prediction of protein coding and RNA genes, as well as repeats and assignment of product names to these genes.

  10. MEGANTE: a web-based system for integrated plant genome annotation.

    PubMed

    Numa, Hisataka; Itoh, Takeshi

    2014-01-01

    The recent advancement of high-throughput genome sequencing technologies has resulted in a considerable increase in demands for large-scale genome annotation. While annotation is a crucial step for downstream data analyses and experimental studies, this process requires substantial expertise and knowledge of bioinformatics. Here we present MEGANTE, a web-based annotation system that makes plant genome annotation easy for researchers unfamiliar with bioinformatics. Without any complicated configuration, users can perform genomic sequence annotations simply by uploading a sequence and selecting the species to query. MEGANTE automatically runs several analysis programs and integrates the results to select the appropriate consensus exon-intron structures and to predict open reading frames (ORFs) at each locus. Functional annotation, including a similarity search against known proteins and a functional domain search, are also performed for the predicted ORFs. The resultant annotation information is visualized with a widely used genome browser, GBrowse. For ease of analysis, the results can be downloaded in Microsoft Excel format. All of the query sequences and annotation results are stored on the server side so that users can access their own data from virtually anywhere on the web. The current release of MEGANTE targets 24 plant species from the Brassicaceae, Fabaceae, Musaceae, Poaceae, Salicaceae, Solanaceae, Rosaceae and Vitaceae families, and it allows users to submit a sequence up to 10 Mb in length and to save up to 100 sequences with the annotation information on the server. The MEGANTE web service is available at https://megante.dna.affrc.go.jp/.

  11. Using deep RNA sequencing for the structural annotation of the laccaria bicolor mycorrhizal transcriptome.

    SciTech Connect

    Larsen, P. E.; Trivedi, G.; Sreedasyam, A.; Lu, V.; Podila, G. K.; Collart, F. R.; Biosciences Division; Univ. of Alabama

    2010-07-06

    Accurate structural annotation is important for prediction of function and required for in vitro approaches to characterize or validate the gene expression products. Despite significant efforts in the field, determination of the gene structure from genomic data alone is a challenging and inaccurate process. The ease of acquisition of transcriptomic sequence provides a direct route to identify expressed sequences and determine the correct gene structure. We developed methods to utilize RNA-seq data to correct errors in the structural annotation and extend the boundaries of current gene models using assembly approaches. The methods were validated with a transcriptomic data set derived from the fungus Laccaria bicolor, which develops a mycorrhizal symbiotic association with the roots of many tree species. Our analysis focused on the subset of 1501 gene models that are differentially expressed in the free living vs. mycorrhizal transcriptome and are expected to be important elements related to carbon metabolism, membrane permeability and transport, and intracellular signaling. Of the set of 1501 gene models, 1439 (96%) successfully generated modified gene models in which all error flags were successfully resolved and the sequences aligned to the genomic sequence. The remaining 4% (62 gene models) either had deviations from transcriptomic data that could not be spanned or generated sequence that did not align to genomic sequence. The outcome of this process is a set of high confidence gene models that can be reliably used for experimental characterization of protein function. 69% of expressed mycorrhizal JGI 'best' gene models deviated from the transcript sequence derived by this method. The transcriptomic sequence enabled correction of a majority of the structural inconsistencies and resulted in a set of validated models for 96% of the mycorrhizal genes. The method described here can be applied to improve gene structural annotation in other species, provided that there

  12. Using Deep RNA Sequencing for the Structural Annotation of the Laccaria Bicolor Mycorrhizal Transcriptome

    PubMed Central

    Larsen, Peter E.; Trivedi, Geetika; Sreedasyam, Avinash; Lu, Vincent; Podila, Gopi K.; Collart, Frank R.

    2010-01-01

    Background Accurate structural annotation is important for prediction of function and required for in vitro approaches to characterize or validate the gene expression products. Despite significant efforts in the field, determination of the gene structure from genomic data alone is a challenging and inaccurate process. The ease of acquisition of transcriptomic sequence provides a direct route to identify expressed sequences and determine the correct gene structure. Methodology We developed methods to utilize RNA-seq data to correct errors in the structural annotation and extend the boundaries of current gene models using assembly approaches. The methods were validated with a transcriptomic data set derived from the fungus Laccaria bicolor, which develops a mycorrhizal symbiotic association with the roots of many tree species. Our analysis focused on the subset of 1501 gene models that are differentially expressed in the free living vs. mycorrhizal transcriptome and are expected to be important elements related to carbon metabolism, membrane permeability and transport, and intracellular signaling. Of the set of 1501 gene models, 1439 (96%) successfully generated modified gene models in which all error flags were successfully resolved and the sequences aligned to the genomic sequence. The remaining 4% (62 gene models) either had deviations from transcriptomic data that could not be spanned or generated sequence that did not align to genomic sequence. The outcome of this process is a set of high confidence gene models that can be reliably used for experimental characterization of protein function. Conclusions 69% of expressed mycorrhizal JGI “best” gene models deviated from the transcript sequence derived by this method. The transcriptomic sequence enabled correction of a majority of the structural inconsistencies and resulted in a set of validated models for 96% of the mycorrhizal genes. The method described here can be applied to improve gene structural

  13. MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects

    PubMed Central

    2011-01-01

    Background Second-generation sequencing technologies are precipitating major shifts with regards to what kinds of genomes are being sequenced and how they are annotated. While the first generation of genome projects focused on well-studied model organisms, many of today's projects involve exotic organisms whose genomes are largely terra incognita. This complicates their annotation, because unlike first-generation projects, there are no pre-existing 'gold-standard' gene-models with which to train gene-finders. Improvements in genome assembly and the wide availability of mRNA-seq data are also creating opportunities to update and re-annotate previously published genome annotations. Today's genome projects are thus in need of new genome annotation tools that can meet the challenges and opportunities presented by second-generation sequencing technologies. Results We present MAKER2, a genome annotation and data management tool designed for second-generation genome projects. MAKER2 is a multi-threaded, parallelized application that can process second-generation datasets of virtually any size. We show that MAKER2 can produce accurate annotations for novel genomes where training-data are limited, of low quality or even non-existent. MAKER2 also provides an easy means to use mRNA-seq data to improve annotation quality; and it can use these data to update legacy annotations, significantly improving their quality. We also show that MAKER2 can evaluate the quality of genome annotations, and identify and prioritize problematic annotations for manual review. Conclusions MAKER2 is the first annotation engine specifically designed for second-generation genome projects. MAKER2 scales to datasets of any size, requires little in the way of training data, and can use mRNA-seq data to improve annotation quality. It can also update and manage legacy genome annotation datasets. PMID:22192575

  14. ANNOTATED BIBLIOGRAPHY ON CREATIVITY AND GIFTEDNESS.

    ERIC Educational Resources Information Center

    GOWAN, JOHN CURTIS

    THIS ANNOTATED BIBLIOGRAPHY REPRESENTS A SAMPLING OF PUBLISHED WRITING ON CREATIVITY AND GIFTED CHILDREN SINCE 1960. THE LIST WAS COMPILED FOR EDUCATIONAL RESEARCHERS. IN A FEW INSTANCES THE ANNOTATIONS HAVE BEEN MODIFIED OR ABRIDGED FROM THOSE FOUND IN "PSYCHOLOGICAL ABSTRACTS" OR OTHER JOURNAL ABSTRACTS. SOME OF THE ANNOTATIONS HAVE PREVIOUSLY…

  15. Alcohol Education Materials; An Annotated Bibliography.

    ERIC Educational Resources Information Center

    Milgram, Gail Gleason

    This 873-item annotated bibliography cites books, pamphlets, leaflets, and other materials produced for education about alcohol from 1950 to May 1973. The major part of each annotation is a brief summary of the contents. The annotation also contains a statement of orientation or type of presentation and evaluative comments. Each item is classified…

  16. Annotation and Classification of Argumentative Writing Revisions

    ERIC Educational Resources Information Center

    Zhang, Fan; Litman, Diane

    2015-01-01

    This paper explores the annotation and classification of students' revision behaviors in argumentative writing. A sentence-level revision schema is proposed to capture why and how students make revisions. Based on the proposed schema, a small corpus of student essays and revisions was annotated. Studies show that manual annotation is reliable with…

  17. Protein Function Prediction: Problems and Pitfalls.

    PubMed

    Pearson, William R

    2015-01-01

    The characterization of new genomes based on their protein sets has been revolutionized by new sequencing technologies, but biologists seeking to exploit new sequence information are often frustrated by the challenges associated with accurately assigning biological functions to newly identified proteins. Here, we highlight some of the challenges in functional inference from sequence similarity. Investigators can improve the accuracy of function prediction by (1) being conservative about the evolutionary distance to a protein of known function; (2) considering the ambiguous meaning of "functional similarity," and (3) being aware of the limitations of annotations in functional databases. Protein function prediction does not offer "one-size-fits-all" solutions. Prediction strategies work better when the idiosyncrasies of function and functional annotation are better understood. PMID:26334923

  18. Protein Function Prediction: Problems and Pitfalls.

    PubMed

    Pearson, William R

    2015-01-01

    The characterization of new genomes based on their protein sets has been revolutionized by new sequencing technologies, but biologists seeking to exploit new sequence information are often frustrated by the challenges associated with accurately assigning biological functions to newly identified proteins. Here, we highlight some of the challenges in functional inference from sequence similarity. Investigators can improve the accuracy of function prediction by (1) being conservative about the evolutionary distance to a protein of known function; (2) considering the ambiguous meaning of "functional similarity," and (3) being aware of the limitations of annotations in functional databases. Protein function prediction does not offer "one-size-fits-all" solutions. Prediction strategies work better when the idiosyncrasies of function and functional annotation are better understood.

  19. Accurate Detection of Adenylation Domain Functions in Nonribosomal Peptide Synthetases by an Enzyme-linked Immunosorbent Assay System Using Active Site-directed Probes for Adenylation Domains.

    PubMed

    Ishikawa, Fumihiro; Miyamoto, Kengo; Konno, Sho; Kasai, Shota; Kakeya, Hideaki

    2015-12-18

    A significant gap exists between protein engineering and enzymes used for the biosynthesis of natural products, largely because there is a paucity of strategies that rapidly detect active-site phenotypes of the enzymes with desired activities. Herein, we describe a proof-of-concept study of an enzyme-linked immunosorbent assay (ELISA) system for the adenylation (A) domains in nonribosomal peptide synthetases (NRPSs) using a combination of active site-directed probes coupled to a 5'-O-N-(aminoacyl)sulfamoyladenosine scaffold with a biotin functionality that immobilizes probe molecules onto a streptavidin-coated solid support. The recombinant NRPSs have a C-terminal His-tag motif that is targeted by an anti-6×His mouse antibody as the primary antibody and a horseradish peroxidase-linked goat antimouse antibody as the secondary antibody. These probes can selectively capture the cognate A domains by ligand-directed targeting. In addition, the ELISA technique detected A domains in the crude cell-free homogenates from the Escherichia coli expression systems. When coupled with a chromogenic substrate, the antibody-based ELISA technique can visualize probe-protein binding interactions, which provides accurate readouts of the A-domain functions in NRPS enzymes. To assess the ELISA-based engineering of the A domains of NRPSs, we reprogramed 2,3-dihydroxybenzoic acid (DHB)-activating enzyme EntE toward salicylic acid (Sal)-activating enzymes and investigated a correlation between binding properties for probe molecules and enzyme catalysts. We generated a mutant of EntE that displayed negligible loss in the kcat/Km value with the noncognate substrate Sal and a corresponding 48-fold decrease in the kcat/Km value with the cognate substrate DHB. The resulting 26-fold switch in substrate specificity was achieved by the replacement of a Ser residue in the active site of EntE with a Cys toward the nonribosomal codes of Sal-activating enzymes. Bringing a laboratory ELISA technique

  20. A multi-objective optimization approach accurately resolves protein domain architectures

    PubMed Central

    Bernardes, J.S.; Vieira, F.R.J.; Zaverucha, G.; Carbone, A.

    2016-01-01

    Motivation: Given a protein sequence and a number of potential domains matching it, what are the domain content and the most likely domain architecture for the sequence? This problem is of fundamental importance in protein annotation, constituting one of the main steps of all predictive annotation strategies. On the other hand, when potential domains are several and in conflict because of overlapping domain boundaries, finding a solution for the problem might become difficult. An accurate prediction of the domain architecture of a multi-domain protein provides important information for function prediction, comparative genomics and molecular evolution. Results: We developed DAMA (Domain Annotation by a Multi-objective Approach), a novel approach that identifies architectures through a multi-objective optimization algorithm combining scores of domain matches, previously observed multi-domain co-occurrence and domain overlapping. DAMA has been validated on a known benchmark dataset based on CATH structural domain assignments and on the set of Plasmodium falciparum proteins. When compared with existing tools on both datasets, it outperforms all of them. Availability and implementation: DAMA software is implemented in C++ and the source code can be found at http://www.lcqb.upmc.fr/DAMA. Contact: juliana.silva_bernardes@upmc.fr or alessandra.carbone@lip6.fr Supplementary information: Supplementary data are available at Bioinformatics online. PMID:26458889

  1. Likelihood-based gene annotations for gap filling and quality assessment in genome-scale metabolic models

    SciTech Connect

    Benedict, Matthew N.; Mundy, Michael B.; Henry, Christopher S.; Chia, Nicholas; Price, Nathan D.; Maranas, Costas D.

    2014-10-16

    Genome-scale metabolic models provide a powerful means to harness information from genomes to deepen biological insights. With exponentially increasing sequencing capacity, there is an enormous need for automated reconstruction techniques that can provide more accurate models in a short time frame. Current methods for automated metabolic network reconstruction rely on gene and reaction annotations to build draft metabolic networks and algorithms to fill gaps in these networks. However, automated reconstruction is hampered by database inconsistencies, incorrect annotations, and gap filling largely without considering genomic information. Here we develop an approach for applying genomic information to predict alternative functions for genes and estimate their likelihoods from sequence homology. We show that computed likelihood values were significantly higher for annotations found in manually curated metabolic networks than those that were not. We then apply these alternative functional predictions to estimate reaction likelihoods, which are used in a new gap filling approach called likelihood-based gap filling to predict more genomically consistent solutions. To validate the likelihood-based gap filling approach, we applied it to models where essential pathways were removed, finding that likelihood-based gap filling identified more biologically relevant solutions than parsimony-based gap filling approaches. We also demonstrate that models gap filled using likelihood-based gap filling provide greater coverage and genomic consistency with metabolic gene functions compared to parsimony-based approaches. Interestingly, despite these findings, we found that likelihoods did not significantly affect consistency of gap filled models with Biolog and knockout lethality data. This indicates that the phenotype data alone cannot necessarily be used to discriminate between alternative solutions for gap filling and therefore, that the use of other information is necessary to

  2. Likelihood-based gene annotations for gap filling and quality assessment in genome-scale metabolic models

    DOE PAGES

    Benedict, Matthew N.; Mundy, Michael B.; Henry, Christopher S.; Chia, Nicholas; Price, Nathan D.; Maranas, Costas D.

    2014-10-16

    Genome-scale metabolic models provide a powerful means to harness information from genomes to deepen biological insights. With exponentially increasing sequencing capacity, there is an enormous need for automated reconstruction techniques that can provide more accurate models in a short time frame. Current methods for automated metabolic network reconstruction rely on gene and reaction annotations to build draft metabolic networks and algorithms to fill gaps in these networks. However, automated reconstruction is hampered by database inconsistencies, incorrect annotations, and gap filling largely without considering genomic information. Here we develop an approach for applying genomic information to predict alternative functions for genesmore » and estimate their likelihoods from sequence homology. We show that computed likelihood values were significantly higher for annotations found in manually curated metabolic networks than those that were not. We then apply these alternative functional predictions to estimate reaction likelihoods, which are used in a new gap filling approach called likelihood-based gap filling to predict more genomically consistent solutions. To validate the likelihood-based gap filling approach, we applied it to models where essential pathways were removed,