Science.gov

Sample records for annotation databases improves

  1. Considerations to improve functional annotations in biological databases.

    PubMed

    Benítez-Páez, Alfonso

    2009-12-01

    Despite the great effort to design efficient systems allowing the electronic indexation of information concerning genes, proteins, structures, and interactions published daily in scientific journals, some problems are still observed in specific tasks such as functional annotation. The annotation of function is a critical issue for bioinformatic routines, such as for instance, in functional genomics and the further prediction of unknown protein function, which are highly dependent of the quality of existing annotations. Some information management systems evolve to efficiently incorporate information from large-scale projects, but often, annotation of single records from the literature is difficult and slow. In this short report, functional characterizations of a representative sample of the entire set of uncharacterized proteins from Escherichia coli K12 was compiled from Swiss-Prot, PubMed, and EcoCyc and demonstrate a functional annotation deficit in biological databases. Some issues are postulated as causes of the lack of annotation, and different solutions are evaluated and proposed to avoid them. The hope is that as a consequence of these observations, there will be new impetus to improve the speed and quality of functional annotation and ultimately provide updated, reliable information to the scientific community.

  2. Incorporating the human gene annotations in different databases significantly improved transcriptomic and genetic analyses.

    PubMed

    Chen, Geng; Wang, Charles; Shi, Leming; Qu, Xiongfei; Chen, Jiwei; Yang, Jianmin; Shi, Caiping; Chen, Long; Zhou, Peiying; Ning, Baitang; Tong, Weida; Shi, Tieliu

    2013-04-01

    Human gene annotation is crucial for conducting transcriptomic and genetic studies; however, the impacts of human gene annotations in diverse databases on related studies have been less evaluated. To enable full use of various human annotation resources and better understand the human transcriptome, here we systematically compare the human annotations present in RefSeq, Ensembl (GENCODE), and AceView on diverse transcriptomic and genetic analyses. We found that the human gene annotations in the three databases are far from complete. Although Ensembl and AceView annotated more genes than RefSeq, more than 15,800 genes from Ensembl (or AceView) are within the intergenic and intronic regions of AceView (or Ensembl) annotation. The human transcriptome annotations in RefSeq, Ensembl, and AceView had distinct effects on short-read mapping, gene and isoform expression profiling, and differential expression calling. Furthermore, our findings indicate that the integrated annotation of these databases can obtain a more complete gene set and significantly enhance those transcriptomic analyses. We also observed that many more known SNPs were located within genes annotated in Ensembl and AceView than in RefSeq. In particular, 1033 of 3041 trait/disease-associated SNPs involved in about 200 human traits/diseases that were previously reported to be in RefSeq intergenic regions could be relocated within Ensembl and AceView genes. Our findings illustrate that a more complete transcriptome generated by incorporating human gene annotations in diverse databases can strikingly improve the overall results of transcriptomic and genetic studies.

  3. STANDARDIZATION AND STRUCTURAL ANNOTATION OF PUBLIC TOXICITY DATABASES: IMPROVING SAR CAPABILITIES AND LINKAGE TO 'OMICS DATA

    EPA Science Inventory

    Standardization and structural annotation of public toxicity databases: Improving SAR capabilities and linkage to 'omics data
    Ann M. Richard', ClarLynda Williams', Jamie Burch2
    'Nat Health & Environ Res Lab, US EPA, RTP, NC 27711; 2EPA/NC Central Univ Student COOP Trainee<...

  4. STANDARDIZATION AND STRUCTURAL ANNOTATION OF PUBLIC TOXICITY DATABASES: IMPROVING SAR CAPABILITIES AND LINKAGE TO 'OMICS DATA

    EPA Science Inventory

    Standardization and structural annotation of public toxicity databases: Improving SAR capabilities and linkage to 'omics data
    Ann M. Richard', ClarLynda Williams', Jamie Burch2
    'Nat Health & Environ Res Lab, US EPA, RTP, NC 27711; 2EPA/NC Central Univ Student COOP Trainee<...

  5. The annotation and the usage of scientific databases could be improved with public issue tracker software

    PubMed Central

    Dall'Olio, Giovanni Marco; Bertranpetit, Jaume; Laayouni, Hafid

    2010-01-01

    Since the publication of their longtime predecessor The Atlas of Protein Sequences and Structures in 1965 by Margaret Dayhoff, scientific databases have become a key factor in the organization of modern science. All the information and knowledge described in the novel scientific literature is translated into entries in many different scientific databases, making it possible to obtain very accurate information on a biological entity like genes or proteins without having to manually review the literature on it. However, even for the databases with the finest annotation procedures, errors or unclear parts sometimes appear in the publicly released version and influence the research of unaware scientists using them. The researcher that finds an error in a database is often left in a uncertain state, and often abandons the effort of reporting it because of a lack of a standard procedure to do so. In the present work, we propose that the simple adoption of a public error tracker application, as in many open software projects, could improve the quality of the annotations in many databases and encourage feedback from the scientific community on the data annotated publicly. In order to illustrate the situation, we describe a series of errors that we found and helped solve on the genes of a very well-known pathway in various biomedically relevant databases. We would like to show that, even if a majority of the most important scientific databases have procedures for reporting errors, these are usually not publicly visible, making the process of reporting errors time consuming and not useful. Also, the effort made by the user that reports the error often goes unacknowledged, putting him in a discouraging position. PMID:21186182

  6. Annotation and retrieval in protein interaction databases

    NASA Astrophysics Data System (ADS)

    Cannataro, Mario; Hiram Guzzi, Pietro; Veltri, Pierangelo

    2014-06-01

    Biological databases have been developed with a special focus on the efficient retrieval of single records or the efficient computation of specialized bioinformatics algorithms against the overall database, such as in sequence alignment. The continuos production of biological knowledge spread on several biological databases and ontologies, such as Gene Ontology, and the availability of efficient techniques to handle such knowledge, such as annotation and semantic similarity measures, enable the development on novel bioinformatics applications that explicitly use and integrate such knowledge. After introducing the annotation process and the main semantic similarity measures, this paper shows how annotations and semantic similarity can be exploited to improve the extraction and analysis of biologically relevant data from protein interaction databases. As case studies, the paper presents two novel software tools, OntoPIN and CytoSeVis, both based on the use of Gene Ontology annotations, for the advanced querying of protein interaction databases and for the enhanced visualization of protein interaction networks.

  7. FlyTF: improved annotation and enhanced functionality of the Drosophila transcription factor database

    PubMed Central

    Pfreundt, Ulrike; James, Daniel P.; Tweedie, Susan; Wilson, Derek; Teichmann, Sarah A.; Adryan, Boris

    2010-01-01

    FlyTF (http://www.flytf.org) is a database of computationally predicted and/or experimentally verified site-specific transcription factors (TFs) in the fruit fly Drosophila melanogaster. The manual classification of TFs in the initial version of FlyTF that concentrated primarily on the DNA-binding characteristics of the proteins has now been extended to a more fine-grained annotation of both DNA binding and regulatory properties in the new release. Furthermore, experimental evidence from the literature was classified into a defined vocabulary, and in collaboration with FlyBase, translated into Gene Ontology (GO) annotation. While our GO annotations will also be available through FlyBase as they will be incorporated into the genes’ official GO annotation in the future, the entire evidence used for classification including computational predictions and quotes from the literature can be accessed through FlyTF. The FlyTF website now builds upon the InterMine framework, which provides experimental and computational biologists with powerful search and filter functionality, list management tools and access to genomic information associated with the TFs. PMID:19884132

  8. FlyTF: improved annotation and enhanced functionality of the Drosophila transcription factor database.

    PubMed

    Pfreundt, Ulrike; James, Daniel P; Tweedie, Susan; Wilson, Derek; Teichmann, Sarah A; Adryan, Boris

    2010-01-01

    FlyTF (http://www.flytf.org) is a database of computationally predicted and/or experimentally verified site-specific transcription factors (TFs) in the fruit fly Drosophila melanogaster. The manual classification of TFs in the initial version of FlyTF that concentrated primarily on the DNA-binding characteristics of the proteins has now been extended to a more fine-grained annotation of both DNA binding and regulatory properties in the new release. Furthermore, experimental evidence from the literature was classified into a defined vocabulary, and in collaboration with FlyBase, translated into Gene Ontology (GO) annotation. While our GO annotations will also be available through FlyBase as they will be incorporated into the genes' official GO annotation in the future, the entire evidence used for classification including computational predictions and quotes from the literature can be accessed through FlyTF. The FlyTF website now builds upon the InterMine framework, which provides experimental and computational biologists with powerful search and filter functionality, list management tools and access to genomic information associated with the TFs.

  9. The Aspergillus Genome Database: multispecies curation and incorporation of RNA-Seq data to improve structural gene annotations

    PubMed Central

    Cerqueira, Gustavo C.; Arnaud, Martha B.; Inglis, Diane O.; Skrzypek, Marek S.; Binkley, Gail; Simison, Matt; Miyasato, Stuart R.; Binkley, Jonathan; Orvis, Joshua; Shah, Prachi; Wymore, Farrell; Sherlock, Gavin; Wortman, Jennifer R.

    2014-01-01

    The Aspergillus Genome Database (AspGD; http://www.aspgd.org) is a freely available web-based resource that was designed for Aspergillus researchers and is also a valuable source of information for the entire fungal research community. In addition to being a repository and central point of access to genome, transcriptome and polymorphism data, AspGD hosts a comprehensive comparative genomics toolbox that facilitates the exploration of precomputed orthologs among the 20 currently available Aspergillus genomes. AspGD curators perform gene product annotation based on review of the literature for four key Aspergillus species: Aspergillus nidulans, Aspergillus oryzae, Aspergillus fumigatus and Aspergillus niger. We have iteratively improved the structural annotation of Aspergillus genomes through the analysis of publicly available transcription data, mostly expressed sequenced tags, as described in a previous NAR Database article (Arnaud et al. 2012). In this update, we report substantive structural annotation improvements for A. nidulans, A. oryzae and A. fumigatus genomes based on recently available RNA-Seq data. Over 26 000 loci were updated across these species; although those primarily comprise the addition and extension of untranslated regions (UTRs), the new analysis also enabled over 1000 modifications affecting the coding sequence of genes in each target genome. PMID:24194595

  10. Mining GO annotations for improving annotation consistency.

    PubMed

    Faria, Daniel; Schlicker, Andreas; Pesquita, Catia; Bastos, Hugo; Ferreira, António E N; Albrecht, Mario; Falcão, André O

    2012-01-01

    Despite the structure and objectivity provided by the Gene Ontology (GO), the annotation of proteins is a complex task that is subject to errors and inconsistencies. Electronically inferred annotations in particular are widely considered unreliable. However, given that manual curation of all GO annotations is unfeasible, it is imperative to improve the quality of electronically inferred annotations. In this work, we analyze the full GO molecular function annotation of UniProtKB proteins, and discuss some of the issues that affect their quality, focusing particularly on the lack of annotation consistency. Based on our analysis, we estimate that 64% of the UniProtKB proteins are incompletely annotated, and that inconsistent annotations affect 83% of the protein functions and at least 23% of the proteins. Additionally, we present and evaluate a data mining algorithm, based on the association rule learning methodology, for identifying implicit relationships between molecular function terms. The goal of this algorithm is to assist GO curators in updating GO and correcting and preventing inconsistent annotations. Our algorithm predicted 501 relationships with an estimated precision of 94%, whereas the basic association rule learning methodology predicted 12,352 relationships with a precision below 9%.

  11. Bovine Genome Database: integrated tools for genome annotation and discovery.

    PubMed

    Childers, Christopher P; Reese, Justin T; Sundaram, Jaideep P; Vile, Donald C; Dickens, C Michael; Childs, Kevin L; Salih, Hanni; Bennett, Anna K; Hagen, Darren E; Adelson, David L; Elsik, Christine G

    2011-01-01

    The Bovine Genome Database (BGD; http://BovineGenome.org) strives to improve annotation of the bovine genome and to integrate the genome sequence with other genomics data. BGD includes GBrowse genome browsers, the Apollo Annotation Editor, a quantitative trait loci (QTL) viewer, BLAST databases and gene pages. Genome browsers, available for both scaffold and chromosome coordinate systems, display the bovine Official Gene Set (OGS), RefSeq and Ensembl gene models, non-coding RNA, repeats, pseudogenes, single-nucleotide polymorphism, markers, QTL and alignments to complementary DNAs, ESTs and protein homologs. The Bovine QTL viewer is connected to the BGD Chromosome GBrowse, allowing for the identification of candidate genes underlying QTL. The Apollo Annotation Editor connects directly to the BGD Chado database to provide researchers with remote access to gene evidence in a graphical interface that allows editing and creating new gene models. Researchers may upload their annotations to the BGD server for review and integration into the subsequent release of the OGS. Gene pages display information for individual OGS gene models, including gene structure, transcript variants, functional descriptions, gene symbols, Gene Ontology terms, annotator comments and links to National Center for Biotechnology Information and Ensembl. Each gene page is linked to a wiki page to allow input from the research community.

  12. The Annotation-enriched non-redundant patent sequence databases

    PubMed Central

    Li, Weizhong; Kondratowicz, Bartosz; McWilliam, Hamish; Nauche, Stephane; Lopez, Rodrigo

    2013-01-01

    The EMBL-European Bioinformatics Institute (EMBL-EBI) offers public access to patent sequence data, providing a valuable service to the intellectual property and scientific communities. The non-redundant (NR) patent sequence databases comprise two-level nucleotide and protein sequence clusters (NRNL1, NRNL2, NRPL1 and NRPL2) based on sequence identity (level-1) and patent family (level-2). Annotation from the source entries in these databases is merged and enhanced with additional information from the patent literature and biological context. Corrections in patent publication numbers, kind-codes and patent equivalents significantly improve the data quality. Data are available through various user interfaces including web browser, downloads via FTP, SRS, Dbfetch and EBI-Search. Sequence similarity/homology searches against the databases are available using BLAST, FASTA and PSI-Search. In this article, we describe the data collection and annotation and also outline major changes and improvements introduced since 2009. Apart from data growth, these changes include additional annotation for singleton clusters, the identifier versioning for tracking entry change and the entry mappings between the two-level databases. Database URL: http://www.ebi.ac.uk/patentdata/nr/ PMID:23396323

  13. Improving genome annotations using phylogenetic profile anomaly detection.

    PubMed

    Mikkelsen, Tarjei S; Galagan, James E; Mesirov, Jill P

    2005-02-15

    A promising strategy for refining genome annotations is to detect features that conflict with known functional or evolutionary relationships between groups of genes. Previous work in this area has been focused on investigating the absence of 'housekeeping' genes or components of well-studied pathways. We have sought to develop a method for improving new annotations that can automatically synthesize and use the information available in a database of other annotated genomes. We show that a probabilistic model of phylogenetic profiles, trained from a database of curated genome annotations, can be used to reliably detect errors in new annotations. We use our method to identify 22 genes that were missed in previously published annotations of prokaryotic genomes. The method was evaluated using MATLAB and open source software referenced in this work. Scripts and datasets are available from the authors upon request. tarjei@broad.mit.edu.

  14. MODBASE, a database of annotated comparative protein structure models.

    PubMed

    Pieper, Ursula; Eswar, Narayanan; Stuart, Ashley C; Ilyin, Valentin A; Sali, Andrej

    2002-01-01

    MODBASE (http://guitar.rockefeller.edu/modbase) is a relational database of annotated comparative protein structure models for all available protein sequences matched to at least one known protein structure. The models are calculated by MODPIPE, an automated modeling pipeline that relies on PSI-BLAST, IMPALA and MODELLER. MODBASE uses the MySQL relational database management system for flexible and efficient querying, and the MODVIEW Netscape plugin for viewing and manipulating multiple sequences and structures. It is updated regularly to reflect the growth of the protein sequence and structure databases, as well as improvements in the software for calculating the models. For ease of access, MODBASE is organized into different datasets. The largest dataset contains models for domains in 304 517 out of 539 171 unique protein sequences in the complete TrEMBL database (23 March 2001); only models based on significant alignments (PSI-BLAST E-value < 10(-4)) and models assessed to have the correct fold are included. Other datasets include models for target selection and structure-based annotation by the New York Structural Genomics Research Consortium, models for prediction of genes in the Drosophila melanogaster genome, models for structure determination of several ribosomal particles and models calculated by the MODWEB comparative modeling web server.

  15. A computational platform to maintain and migrate manual functional annotations for BioCyc databases.

    PubMed

    Walsh, Jesse R; Sen, Taner Z; Dickerson, Julie A

    2014-10-12

    BioCyc databases are an important resource for information on biological pathways and genomic data. Such databases represent the accumulation of biological data, some of which has been manually curated from literature. An essential feature of these databases is the continuing data integration as new knowledge is discovered. As functional annotations are improved, scalable methods are needed for curators to manage annotations without detailed knowledge of the specific design of the BioCyc database. We have developed CycTools, a software tool which allows curators to maintain functional annotations in a model organism database. This tool builds on existing software to improve and simplify annotation data imports of user provided data into BioCyc databases. Additionally, CycTools automatically resolves synonyms and alternate identifiers contained within the database into the appropriate internal identifiers. Automating steps in the manual data entry process can improve curation efforts for major biological databases. The functionality of CycTools is demonstrated by transferring GO term annotations from MaizeCyc to matching proteins in CornCyc, both maize metabolic pathway databases available at MaizeGDB, and by creating strain specific databases for metabolic engineering.

  16. The GOA database: Gene Ontology annotation updates for 2015

    PubMed Central

    Huntley, Rachael P.; Sawford, Tony; Mutowo-Meullenet, Prudence; Shypitsyna, Aleksandra; Bonilla, Carlos; Martin, Maria J.; O'Donovan, Claire

    2015-01-01

    The Gene Ontology Annotation (GOA) resource (http://www.ebi.ac.uk/GOA) provides evidence-based Gene Ontology (GO) annotations to proteins in the UniProt Knowledgebase (UniProtKB). Manual annotations provided by UniProt curators are supplemented by manual and automatic annotations from model organism databases and specialist annotation groups. GOA currently supplies 368 million GO annotations to almost 54 million proteins in more than 480 000 taxonomic groups. The resource now provides annotations to five times the number of proteins it did 4 years ago. As a member of the GO Consortium, we adhere to the most up-to-date Consortium-agreed annotation guidelines via the use of quality control checks that ensures that the GOA resource supplies high-quality functional information to proteins from a wide range of species. Annotations from GOA are freely available and are accessible through a powerful web browser as well as a variety of annotation file formats. PMID:25378336

  17. Improving gene annotation using peptide mass spectrometry

    PubMed Central

    Tanner, Stephen; Shen, Zhouxin; Ng, Julio; Florea, Liliana; Guigó, Roderic; Briggs, Steven P.; Bafna, Vineet

    2007-01-01

    Annotation of protein-coding genes is a key goal of genome sequencing projects. In spite of tremendous recent advances in computational gene finding, comprehensive annotation remains a challenge. Peptide mass spectrometry is a powerful tool for researching the dynamic proteome and suggests an attractive approach to discover and validate protein-coding genes. We present algorithms to construct and efficiently search spectra against a genomic database, with no prior knowledge of encoded proteins. By searching a corpus of 18.5 million tandem mass spectra (MS/MS) from human proteomic samples, we validate 39,000 exons and 11,000 introns at the level of translation. We present translation-level evidence for novel or extended exons in 16 genes, confirm translation of 224 hypothetical proteins, and discover or confirm over 40 alternative splicing events. Polymorphisms are efficiently encoded in our database, allowing us to observe variant alleles for 308 coding SNPs. Finally, we demonstrate the use of mass spectrometry to improve automated gene prediction, adding 800 correct exons to our predictions using a simple rescoring strategy. Our results demonstrate that proteomic profiling should play a role in any genome sequencing project. PMID:17189379

  18. Hymenoptera Genome Database: integrating genome annotations in HymenopteraMine

    PubMed Central

    Elsik, Christine G.; Tayal, Aditi; Diesh, Colin M.; Unni, Deepak R.; Emery, Marianne L.; Nguyen, Hung N.; Hagen, Darren E.

    2016-01-01

    We report an update of the Hymenoptera Genome Database (HGD) (http://HymenopteraGenome.org), a model organism database for insect species of the order Hymenoptera (ants, bees and wasps). HGD maintains genomic data for 9 bee species, 10 ant species and 1 wasp, including the versions of genome and annotation data sets published by the genome sequencing consortiums and those provided by NCBI. A new data-mining warehouse, HymenopteraMine, based on the InterMine data warehousing system, integrates the genome data with data from external sources and facilitates cross-species analyses based on orthology. New genome browsers and annotation tools based on JBrowse/WebApollo provide easy genome navigation, and viewing of high throughput sequence data sets and can be used for collaborative genome annotation. All of the genomes and annotation data sets are combined into a single BLAST server that allows users to select and combine sequence data sets to search. PMID:26578564

  19. ANEXdb: an integrated animal ANnotation and microarray EXpression database.

    PubMed

    Couture, Oliver; Callenberg, Keith; Koul, Neeraj; Pandit, Sushain; Younes, Remy; Hu, Zhi-Liang; Dekkers, Jack; Reecy, James; Honavar, Vasant; Tuggle, Christopher

    2009-01-01

    To determine annotations of the sequence elements on microarrays used for transcriptional profiling experiments in livestock species, currently researchers must either use the sparse direct annotations available for these species or create their own annotations. ANEXdb ( http://www.anexdb.org ) is an open-source web application that supports integrated access of two databases that house microarray expression (ExpressDB) and EST annotation (AnnotDB) data. The expression database currently supports storage and querying of Affymetrix-based expression data as well as retrieval of experiments in a form ready for NCBI-GEO submission; these services are available online. AnnotDB currently houses a novel assembly of approximately 1.6 million unique porcine-expressed sequence reads called the Iowa Porcine Assembly (IPA), which consists of 140,087 consensus sequences, the Iowa Tentative Consensus (ITC) sequences, and 103,888 singletons. The IPA has been annotated via transfer of information from homologs identified through sequence alignment to NCBI RefSeq. These annotated sequences have been mapped to the Affymetrix porcine array elements, providing annotation for 22,569 of the 23,937 (94%) porcine-specific probe sets, of which 19,253 (80%) are linked to an NCBI RefSeq entry. The ITC has also been mined for sequence variation, providing evidence for up to 202,383 SNPs, 62,048 deletions, and 958 insertions in porcine-expressed sequence. These results create a single location to obtain porcine annotation of and sequence variation in differently expressed genes in expression experiments, thus permitting possible identification of causal variants in such genes of interest. The ANEXdb application is open source and available from SourceForge.net.

  20. Alga-PrAS (Algal Protein Annotation Suite): A Database of Comprehensive Annotation in Algal Proteomes

    PubMed Central

    Kurotani, Atsushi; Yamada, Yutaka

    2017-01-01

    Algae are smaller organisms than land plants and offer clear advantages in research over terrestrial species in terms of rapid production, short generation time and varied commercial applications. Thus, studies investigating the practical development of effective algal production are important and will improve our understanding of both aquatic and terrestrial plants. In this study we estimated multiple physicochemical and secondary structural properties of protein sequences, the predicted presence of post-translational modification (PTM) sites, and subcellular localization using a total of 510,123 protein sequences from the proteomes of 31 algal and three plant species. Algal species were broadly selected from green and red algae, glaucophytes, oomycetes, diatoms and other microalgal groups. The results were deposited in the Algal Protein Annotation Suite database (Alga-PrAS; http://alga-pras.riken.jp/), which can be freely accessed online. PMID:28069893

  1. Alga-PrAS (Algal Protein Annotation Suite): A Database of Comprehensive Annotation in Algal Proteomes.

    PubMed

    Kurotani, Atsushi; Yamada, Yutaka; Sakurai, Tetsuya

    2017-01-01

    Algae are smaller organisms than land plants and offer clear advantages in research over terrestrial species in terms of rapid production, short generation time and varied commercial applications. Thus, studies investigating the practical development of effective algal production are important and will improve our understanding of both aquatic and terrestrial plants. In this study we estimated multiple physicochemical and secondary structural properties of protein sequences, the predicted presence of post-translational modification (PTM) sites, and subcellular localization using a total of 510,123 protein sequences from the proteomes of 31 algal and three plant species. Algal species were broadly selected from green and red algae, glaucophytes, oomycetes, diatoms and other microalgal groups. The results were deposited in the Algal Protein Annotation Suite database (Alga-PrAS; http://alga-pras.riken.jp/), which can be freely accessed online. © The Author 2017. Published by Oxford University Press on behalf of Japanese Society of Plant Physiologists.

  2. NECTAR: a database of codon-centric missense variant annotations.

    PubMed

    Gong, Sungsam; Ware, James S; Walsh, Roddy; Cook, Stuart A

    2014-01-01

    NECTAR (Non-synonymous Enriched Coding muTation ARchive; http://nectarmutation.org) is a database and web application to annotate disease-related and functionally important amino acids in human proteins. A number of tools are available to facilitate the interpretation of DNA variants identified in diagnostic or research sequencing. These typically identify previous reports of DNA variation at a given genomic location, predict its effects on transcript and protein sequence and may predict downstream functional consequences. Previous reports and functional annotations are typically linked by the genomic location of the variant observed. NECTAR collates disease-causing variants and functionally important amino acid residues from a number of sources. Importantly, rather than simply linking annotations by a shared genomic location, NECTAR annotates variants of interest with details of previously reported variation affecting the same codon. This provides a much richer data set for the interpretation of a novel DNA variant. NECTAR also identifies functionally equivalent amino acid residues in evolutionarily related proteins (paralogues) and, where appropriate, transfers annotations between them. As well as accessing these data through a web interface, users can upload batches of variants in variant call format (VCF) for annotation on-the-fly. The database is freely available to download from the ftp site: ftp://ftp.nectarmutation.org.

  3. Large-scale annotation of small-molecule libraries using public databases.

    PubMed

    Zhou, Yingyao; Zhou, Bin; Chen, Kaisheng; Yan, S Frank; King, Frederick J; Jiang, Shumei; Winzeler, Elizabeth A

    2007-01-01

    While many large publicly accessible databases provide excellent annotation for biological macromolecules, the same is not true for small chemical compounds. Commercial data sources also fail to encompass an annotation interface for large numbers of compounds and tend to be cost prohibitive to be widely available to biomedical researchers. Therefore, using annotation information for the selection of lead compounds from a modern day high-throughput screening (HTS) campaign presently occurs only under a very limited scale. The recent rapid expansion of the NIH PubChem database provides an opportunity to link existing biological databases with compound catalogs and provides relevant information that potentially could improve the information garnered from large-scale screening efforts. Using the 2.5 million compound collection at the Genomics Institute of the Novartis Research Foundation (GNF) as a model, we determined that approximately 4% of the library contained compounds with potential annotation in such databases as PubChem and the World Drug Index (WDI) as well as related databases such as the Kyoto Encyclopedia of Genes and Genomes (KEGG) and ChemIDplus. Furthermore, the exact structure match analysis showed 32% of GNF compounds can be linked to third party databases via PubChem. We also showed annotations such as MeSH (medical subject headings) terms can be applied to in-house HTS databases in identifying signature biological inhibition profiles of interest as well as expediting the assay validation process. The automated annotation of thousands of screening hits in batch is becoming feasible and has the potential to play an essential role in the hit-to-lead decision making process.

  4. Hymenoptera Genome Database: integrating genome annotations in HymenopteraMine.

    PubMed

    Elsik, Christine G; Tayal, Aditi; Diesh, Colin M; Unni, Deepak R; Emery, Marianne L; Nguyen, Hung N; Hagen, Darren E

    2016-01-04

    We report an update of the Hymenoptera Genome Database (HGD) (http://HymenopteraGenome.org), a model organism database for insect species of the order Hymenoptera (ants, bees and wasps). HGD maintains genomic data for 9 bee species, 10 ant species and 1 wasp, including the versions of genome and annotation data sets published by the genome sequencing consortiums and those provided by NCBI. A new data-mining warehouse, HymenopteraMine, based on the InterMine data warehousing system, integrates the genome data with data from external sources and facilitates cross-species analyses based on orthology. New genome browsers and annotation tools based on JBrowse/WebApollo provide easy genome navigation, and viewing of high throughput sequence data sets and can be used for collaborative genome annotation. All of the genomes and annotation data sets are combined into a single BLAST server that allows users to select and combine sequence data sets to search. © The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research.

  5. microRNAs Databases: Developmental Methodologies, Structural and Functional Annotations.

    PubMed

    Singh, Nagendra Kumar

    2017-09-01

    microRNA (miRNA) is an endogenous and evolutionary conserved non-coding RNA, involved in post-transcriptional process as gene repressor and mRNA cleavage through RNA-induced silencing complex (RISC) formation. In RISC, miRNA binds in complementary base pair with targeted mRNA along with Argonaut proteins complex, causes gene repression or endonucleolytic cleavage of mRNAs and results in many diseases and syndromes. After the discovery of miRNA lin-4 and let-7, subsequently large numbers of miRNAs were discovered by low-throughput and high-throughput experimental techniques along with computational process in various biological and metabolic processes. The miRNAs are important non-coding RNA for understanding the complex biological phenomena of organism because it controls the gene regulation. This paper reviews miRNA databases with structural and functional annotations developed by various researchers. These databases contain structural and functional information of animal, plant and virus miRNAs including miRNAs-associated diseases, stress resistance in plant, miRNAs take part in various biological processes, effect of miRNAs interaction on drugs and environment, effect of variance on miRNAs, miRNAs gene expression analysis, sequence of miRNAs, structure of miRNAs. This review focuses on the developmental methodology of miRNA databases such as computational tools and methods used for extraction of miRNAs annotation from different resources or through experiment. This study also discusses the efficiency of user interface design of every database along with current entry and annotations of miRNA (pathways, gene ontology, disease ontology, etc.). Here, an integrated schematic diagram of construction process for databases is also drawn along with tabular and graphical comparison of various types of entries in different databases. Aim of this paper is to present the importance of miRNAs-related resources at a single place.

  6. SATPdb: a database of structurally annotated therapeutic peptides.

    PubMed

    Singh, Sandeep; Chaudhary, Kumardeep; Dhanda, Sandeep Kumar; Bhalla, Sherry; Usmani, Salman Sadullah; Gautam, Ankur; Tuknait, Abhishek; Agrawal, Piyush; Mathur, Deepika; Raghava, Gajendra P S

    2016-01-04

    SATPdb (http://crdd.osdd.net/raghava/satpdb/) is a database of structurally annotated therapeutic peptides, curated from 22 public domain peptide databases/datasets including 9 of our own. The current version holds 19192 unique experimentally validated therapeutic peptide sequences having length between 2 and 50 amino acids. It covers peptides having natural, non-natural and modified residues. These peptides were systematically grouped into 10 categories based on their major function or therapeutic property like 1099 anticancer, 10585 antimicrobial, 1642 drug delivery and 1698 antihypertensive peptides. We assigned or annotated structure of these therapeutic peptides using structural databases (Protein Data Bank) and state-of-the-art structure prediction methods like I-TASSER, HHsearch and PEPstrMOD. In addition, SATPdb facilitates users in performing various tasks that include: (i) structure and sequence similarity search, (ii) peptide browsing based on their function and properties, (iii) identification of moonlighting peptides and (iv) searching of peptides having desired structure and therapeutic activities. We hope this database will be useful for researchers working in the field of peptide-based therapeutics.

  7. Enhanced annotations and features for comparing thousands of Pseudomonas genomes in the Pseudomonas genome database.

    PubMed

    Winsor, Geoffrey L; Griffiths, Emma J; Lo, Raymond; Dhillon, Bhavjinder K; Shay, Julie A; Brinkman, Fiona S L

    2016-01-04

    The Pseudomonas Genome Database (http://www.pseudomonas.com) is well known for the application of community-based annotation approaches for producing a high-quality Pseudomonas aeruginosa PAO1 genome annotation, and facilitating whole-genome comparative analyses with other Pseudomonas strains. To aid analysis of potentially thousands of complete and draft genome assemblies, this database and analysis platform was upgraded to integrate curated genome annotations and isolate metadata with enhanced tools for larger scale comparative analysis and visualization. Manually curated gene annotations are supplemented with improved computational analyses that help identify putative drug targets and vaccine candidates or assist with evolutionary studies by identifying orthologs, pathogen-associated genes and genomic islands. The database schema has been updated to integrate isolate metadata that will facilitate more powerful analysis of genomes across datasets in the future. We continue to place an emphasis on providing high-quality updates to gene annotations through regular review of the scientific literature and using community-based approaches including a major new Pseudomonas community initiative for the assignment of high-quality gene ontology terms to genes. As we further expand from thousands of genomes, we plan to provide enhancements that will aid data visualization and analysis arising from whole-genome comparative studies including more pan-genome and population-based approaches.

  8. Predicting active site residue annotations in the Pfam database.

    PubMed

    Mistry, Jaina; Bateman, Alex; Finn, Robert D

    2007-08-09

    Approximately 5% of Pfam families are enzymatic, but only a small fraction of the sequences within these families (<0.5%) have had the residues responsible for catalysis determined. To increase the active site annotations in the Pfam database, we have developed a strict set of rules, chosen to reduce the rate of false positives, which enable the transfer of experimentally determined active site residue data to other sequences within the same Pfam family. We have created a large database of predicted active site residues. On comparing our active site predictions to those found in UniProtKB, Catalytic Site Atlas, PROSITE and MEROPS we find that we make many novel predictions. On investigating the small subset of predictions made by these databases that are not predicted by us, we found these sequences did not meet our strict criteria for prediction. We assessed the sensitivity and specificity of our methodology and estimate that only 3% of our predicted sequences are false positives. We have predicted 606110 active site residues, of which 94% are not found in UniProtKB, and have increased the active site annotations in Pfam by more than 200 fold. Although implemented for Pfam, the tool we have developed for transferring the data can be applied to any alignment with associated experimental active site data and is available for download. Our active site predictions are re-calculated at each Pfam release to ensure they are comprehensive and up to date. They provide one of the largest available databases of active site annotation.

  9. Vespucci: a system for building annotated databases of nascent transcripts

    PubMed Central

    Allison, Karmel A.; Kaikkonen, Minna U.; Gaasterland, Terry; Glass, Christopher K.

    2014-01-01

    Global run-on sequencing (GRO-seq) is a recent addition to the series of high-throughput sequencing methods that enables new insights into transcriptional dynamics within a cell. However, GRO-sequencing presents new algorithmic challenges, as existing analysis platforms for ChIP-seq and RNA-seq do not address the unique problem of identifying transcriptional units de novo from short reads located all across the genome. Here, we present a novel algorithm for de novo transcript identification from GRO-sequencing data, along with a system that determines transcript regions, stores them in a relational database and associates them with known reference annotations. We use this method to analyze GRO-sequencing data from primary mouse macrophages and derive novel quantitative insights into the extent and characteristics of non-coding transcription in mammalian cells. In doing so, we demonstrate that Vespucci expands existing annotations for mRNAs and lincRNAs by defining the primary transcript beyond the polyadenylation site. In addition, Vespucci generates assemblies for un-annotated non-coding RNAs such as those transcribed from enhancer-like elements. Vespucci thereby provides a robust system for defining, storing and analyzing diverse classes of primary RNA transcripts that are of increasing biological interest. PMID:24304890

  10. An Improved microRNA Annotation of the Canine Genome.

    PubMed

    Penso-Dolfin, Luca; Swofford, Ross; Johnson, Jeremy; Alföldi, Jessica; Lindblad-Toh, Kerstin; Swarbreck, David; Moxon, Simon; Di Palma, Federica

    2016-01-01

    The domestic dog, Canis familiaris, is a valuable model for studying human diseases. The publication of the latest Canine genome build and annotation, CanFam3.1 provides an opportunity to enhance our understanding of gene regulation across tissues in the dog model system. In this study, we used the latest dog genome assembly and small RNA sequencing data from 9 different dog tissues to predict novel miRNAs in the dog genome, as well as to annotate conserved miRNAs from the miRBase database that were missing from the current dog annotation. We used both miRCat and miRDeep2 algorithms to computationally predict miRNA loci. The resulting, putative hairpin sequences were analysed in order to discard false positives, based on predicted secondary structures and patterns of small RNA read alignments. Results were further divided into high and low confidence miRNAs, using the same criteria. We generated tissue specific expression profiles for the resulting set of 811 loci: 720 conserved miRNAs, (207 of which had not been previously annotated in the dog genome) and 91 novel miRNA loci. Comparative analyses revealed 8 putative homologues of some novel miRNA in ferret, and one in microbat. All miRNAs were also classified into the genic and intergenic categories, based on the Ensembl RefSeq gene annotation for CanFam3.1. This additionally allowed us to identify four previously undescribed MiRtrons among our total set of miRNAs. We additionally annotated piRNAs, using proTRAC on the same input data. We thus identified 263 putative clusters, most of which (211 clusters) were found to be expressed in testis. Our results represent an important improvement of the dog genome annotation, paving the way to further research on the evolution of gene regulation, as well as on the contribution of post-transcriptional regulation to pathological conditions.

  11. Rice Annotation Project Database (RAP-DB): an integrative and interactive database for rice genomics.

    PubMed

    Sakai, Hiroaki; Lee, Sung Shin; Tanaka, Tsuyoshi; Numa, Hisataka; Kim, Jungsok; Kawahara, Yoshihiro; Wakimoto, Hironobu; Yang, Ching-chia; Iwamoto, Masao; Abe, Takashi; Yamada, Yuko; Muto, Akira; Inokuchi, Hachiro; Ikemura, Toshimichi; Matsumoto, Takashi; Sasaki, Takuji; Itoh, Takeshi

    2013-02-01

    The Rice Annotation Project Database (RAP-DB, http://rapdb.dna.affrc.go.jp/) has been providing a comprehensive set of gene annotations for the genome sequence of rice, Oryza sativa (japonica group) cv. Nipponbare. Since the first release in 2005, RAP-DB has been updated several times along with the genome assembly updates. Here, we present our newest RAP-DB based on the latest genome assembly, Os-Nipponbare-Reference-IRGSP-1.0 (IRGSP-1.0), which was released in 2011. We detected 37,869 loci by mapping transcript and protein sequences of 150 monocot species. To provide plant researchers with highly reliable and up to date rice gene annotations, we have been incorporating literature-based manually curated data, and 1,626 loci currently incorporate literature-based annotation data, including commonly used gene names or gene symbols. Transcriptional activities are shown at the nucleotide level by mapping RNA-Seq reads derived from 27 samples. We also mapped the Illumina reads of a Japanese leading japonica cultivar, Koshihikari, and a Chinese indica cultivar, Guangluai-4, to the genome and show alignments together with the single nucleotide polymorphisms (SNPs) and gene functional annotations through a newly developed browser, Short-Read Assembly Browser (S-RAB). We have developed two satellite databases, Plant Gene Family Database (PGFD) and Integrative Database of Cereal Gene Phylogeny (IDCGP), which display gene family and homologous gene relationships among diverse plant species. RAP-DB and the satellite databases offer simple and user-friendly web interfaces, enabling plant and genome researchers to access the data easily and facilitating a broad range of plant research topics.

  12. An annotated database of Arabidopsis mutants of acyl lipid metabolism

    DOE PAGES

    McGlew, Kathleen; Shaw, Vincent; Zhang, Meng; ...

    2014-12-10

    Mutants have played a fundamental role in gene discovery and in understanding the function of genes involved in plant acyl lipid metabolism. The first mutant in Arabidopsis lipid metabolism (fad4) was described in 1985. Since that time, characterization of mutants in more than 280 genes associated with acyl lipid metabolism has been reported. This review provides a brief background and history on identification of mutants in acyl lipid metabolism, an analysis of the distribution of mutants in different areas of acyl lipid metabolism and presents an annotated database (ARALIPmutantDB) of these mutants. The database provides information on the phenotypes ofmore » mutants, pathways and enzymes/proteins associated with the mutants, and allows rapid access via hyperlinks to summaries of information about each mutant and to literature that provides information on the lipid composition of the mutants. Mutants for at least 30 % of the genes in the database have multiple names, which have been compiled here to reduce ambiguities in searches for information. Furthermore, the database should also provide a tool for exploring the relationships between mutants in acyl lipid-related genes and their lipid phenotypes and point to opportunities for further research.« less

  13. An annotated database of Arabidopsis mutants of acyl lipid metabolism

    SciTech Connect

    McGlew, Kathleen; Shaw, Vincent; Zhang, Meng; Kim, Ryeo Jin; Yang, Weili; Shorrosh, Basil; Suh, Mi Chung; Ohlrogge, John

    2014-12-10

    Mutants have played a fundamental role in gene discovery and in understanding the function of genes involved in plant acyl lipid metabolism. The first mutant in Arabidopsis lipid metabolism (fad4) was described in 1985. Since that time, characterization of mutants in more than 280 genes associated with acyl lipid metabolism has been reported. This review provides a brief background and history on identification of mutants in acyl lipid metabolism, an analysis of the distribution of mutants in different areas of acyl lipid metabolism and presents an annotated database (ARALIPmutantDB) of these mutants. The database provides information on the phenotypes of mutants, pathways and enzymes/proteins associated with the mutants, and allows rapid access via hyperlinks to summaries of information about each mutant and to literature that provides information on the lipid composition of the mutants. Mutants for at least 30 % of the genes in the database have multiple names, which have been compiled here to reduce ambiguities in searches for information. Furthermore, the database should also provide a tool for exploring the relationships between mutants in acyl lipid-related genes and their lipid phenotypes and point to opportunities for further research.

  14. A manual curation strategy to improve genome annotation: application to a set of haloarchael genomes.

    PubMed

    Pfeiffer, Friedhelm; Oesterhelt, Dieter

    2015-06-02

    Genome annotation errors are a persistent problem that impede research in the biosciences. A manual curation effort is described that attempts to produce high-quality genome annotations for a set of haloarchaeal genomes (Halobacterium salinarum and Hbt. hubeiense, Haloferax volcanii and Hfx. mediterranei, Natronomonas pharaonis and Nmn. moolapensis, Haloquadratum walsbyi strains HBSQ001 and C23, Natrialba magadii, Haloarcula marismortui and Har. hispanica, and Halohasta litchfieldiae). Genomes are checked for missing genes, start codon misassignments, and disrupted genes. Assignments of a specific function are preferably based on experimentally characterized homologs (Gold Standard Proteins). To avoid overannotation, which is a major source of database errors, we restrict annotation to only general function assignments when support for a specific substrate assignment is insufficient. This strategy results in annotations that are resistant to the plethora of errors that compromise public databases. Annotation consistency is rigorously validated for ortholog pairs from the genomes surveyed. The annotation is regularly crosschecked against the UniProt database to further improve annotations and increase the level of standardization. Enhanced genome annotations are submitted to public databases (EMBL/GenBank, UniProt), to the benefit of the scientific community. The enhanced annotations are also publically available via HaloLex.

  15. MitoFish and MitoAnnotator: A Mitochondrial Genome Database of Fish with an Accurate and Automatic Annotation Pipeline

    PubMed Central

    Iwasaki, Wataru; Fukunaga, Tsukasa; Isagozawa, Ryota; Yamada, Koichiro; Maeda, Yasunobu; Satoh, Takashi P.; Sado, Tetsuya; Mabuchi, Kohji; Takeshima, Hirohiko; Miya, Masaki; Nishida, Mutsumi

    2013-01-01

    Mitofish is a database of fish mitochondrial genomes (mitogenomes) that includes powerful and precise de novo annotations for mitogenome sequences. Fish occupy an important position in the evolution of vertebrates and the ecology of the hydrosphere, and mitogenomic sequence data have served as a rich source of information for resolving fish phylogenies and identifying new fish species. The importance of a mitogenomic database continues to grow at a rapid pace as massive amounts of mitogenomic data are generated with the advent of new sequencing technologies. A severe bottleneck seems likely to occur with regard to mitogenome annotation because of the overwhelming pace of data accumulation and the intrinsic difficulties in annotating sequences with degenerating transfer RNA structures, divergent start/stop codons of the coding elements, and the overlapping of adjacent elements. To ease this data backlog, we developed an annotation pipeline named MitoAnnotator. MitoAnnotator automatically annotates a fish mitogenome with a high degree of accuracy in approximately 5 min; thus, it is readily applicable to data sets of dozens of sequences. MitoFish also contains re-annotations of previously sequenced fish mitogenomes, enabling researchers to refer to them when they find annotations that are likely to be erroneous or while conducting comparative mitogenomic analyses. For users who need more information on the taxonomy, habitats, phenotypes, or life cycles of fish, MitoFish provides links to related databases. MitoFish and MitoAnnotator are freely available at http://mitofish.aori.u-tokyo.ac.jp/ (last accessed August 28, 2013); all of the data can be batch downloaded, and the annotation pipeline can be used via a web interface. PMID:23955518

  16. MitoFish and MitoAnnotator: a mitochondrial genome database of fish with an accurate and automatic annotation pipeline.

    PubMed

    Iwasaki, Wataru; Fukunaga, Tsukasa; Isagozawa, Ryota; Yamada, Koichiro; Maeda, Yasunobu; Satoh, Takashi P; Sado, Tetsuya; Mabuchi, Kohji; Takeshima, Hirohiko; Miya, Masaki; Nishida, Mutsumi

    2013-11-01

    Mitofish is a database of fish mitochondrial genomes (mitogenomes) that includes powerful and precise de novo annotations for mitogenome sequences. Fish occupy an important position in the evolution of vertebrates and the ecology of the hydrosphere, and mitogenomic sequence data have served as a rich source of information for resolving fish phylogenies and identifying new fish species. The importance of a mitogenomic database continues to grow at a rapid pace as massive amounts of mitogenomic data are generated with the advent of new sequencing technologies. A severe bottleneck seems likely to occur with regard to mitogenome annotation because of the overwhelming pace of data accumulation and the intrinsic difficulties in annotating sequences with degenerating transfer RNA structures, divergent start/stop codons of the coding elements, and the overlapping of adjacent elements. To ease this data backlog, we developed an annotation pipeline named MitoAnnotator. MitoAnnotator automatically annotates a fish mitogenome with a high degree of accuracy in approximately 5 min; thus, it is readily applicable to data sets of dozens of sequences. MitoFish also contains re-annotations of previously sequenced fish mitogenomes, enabling researchers to refer to them when they find annotations that are likely to be erroneous or while conducting comparative mitogenomic analyses. For users who need more information on the taxonomy, habitats, phenotypes, or life cycles of fish, MitoFish provides links to related databases. MitoFish and MitoAnnotator are freely available at http://mitofish.aori.u-tokyo.ac.jp/ (last accessed August 28, 2013); all of the data can be batch downloaded, and the annotation pipeline can be used via a web interface.

  17. MannDB: A microbial annotation database for protein characterization

    SciTech Connect

    Zhou, C; Lam, M; Smith, J; Zemla, A; Dyer, M; Kuczmarski, T; Vitalis, E; Slezak, T

    2006-05-19

    MannDB was created to meet a need for rapid, comprehensive automated protein sequence analyses to support selection of proteins suitable as targets for driving the development of reagents for pathogen or protein toxin detection. Because a large number of open-source tools were needed, it was necessary to produce a software system to scale the computations for whole-proteome analysis. Thus, we built a fully automated system for executing software tools and for storage, integration, and display of automated protein sequence analysis and annotation data. MannDB is a relational database that organizes data resulting from fully automated, high-throughput protein-sequence analyses using open-source tools. Types of analyses provided include predictions of cleavage, chemical properties, classification, features, functional assignment, post-translational modifications, motifs, antigenicity, and secondary structure. Proteomes (lists of hypothetical and known proteins) are downloaded and parsed from Genbank and then inserted into MannDB, and annotations from SwissProt are downloaded when identifiers are found in the Genbank entry or when identical sequences are identified. Currently 36 open-source tools are run against MannDB protein sequences either on local systems or by means of batch submission to external servers. In addition, BLAST against protein entries in MvirDB, our database of microbial virulence factors, is performed. A web client browser enables viewing of computational results and downloaded annotations, and a query tool enables structured and free-text search capabilities. When available, links to external databases, including MvirDB, are provided. MannDB contains whole-proteome analyses for at least one representative organism from each category of biological threat organism listed by APHIS, CDC, HHS, NIAID, USDA, USFDA, and WHO. MannDB comprises a large number of genomes and comprehensive protein sequence analyses representing organisms listed as high

  18. An integrated computational pipeline and database to support whole-genome sequence annotation.

    PubMed

    Mungall, C J; Misra, S; Berman, B P; Carlson, J; Frise, E; Harris, N; Marshall, B; Shu, S; Kaminker, J S; Prochnik, S E; Smith, C D; Smith, E; Tupy, J L; Wiel, C; Rubin, G M; Lewis, S E

    2002-01-01

    We describe here our experience in annotating the Drosophila melanogaster genome sequence, in the course of which we developed several new open-source software tools and a database schema to support large-scale genome annotation. We have developed these into an integrated and reusable software system for whole-genome annotation. The key contributions to overall annotation quality are the marshalling of high-quality sequences for alignments and the design of a system with an adaptable and expandable flexible architecture.

  19. Improving functional annotation for industrial microbes: a case study with Pichia pastoris

    PubMed Central

    Dikicioglu, Duygu; Wood, Valerie; Rutherford, Kim M.; McDowall, Mark D.; Oliver, Stephen G.

    2014-01-01

    The research communities studying microbial model organisms, such as Escherichia coli or Saccharomyces cerevisiae, are well served by model organism databases that have extensive functional annotation. However, this is not true of many industrial microbes that are used widely in biotechnology. In this Opinion piece, we use Pichia (Komagataella) pastoris to illustrate the limitations of the available annotation. We consider the resources that can be implemented in the short term both to improve Gene Ontology (GO) annotation coverage based on annotation transfer, and to establish curation pipelines for the literature corpus of this organism. PMID:24929579

  20. Building and Querying RDF/OWL Database of Semantically Annotated Nuclear Medicine Images.

    PubMed

    Hwang, Kyung Hoon; Lee, Haejun; Koh, Geon; Willrett, Debra; Rubin, Daniel L

    2017-02-01

    As the use of positron emission tomography-computed tomography (PET-CT) has increased rapidly, there is a need to retrieve relevant medical images that can assist image interpretation. However, the images themselves lack the explicit information needed for query. We constructed a semantically structured database of nuclear medicine images using the Annotation and Image Markup (AIM) format and evaluated the ability the AIM annotations to improve image search. We created AIM annotation templates specific to the nuclear medicine domain and used them to annotate 100 nuclear medicine PET-CT studies in AIM format using controlled vocabulary. We evaluated image retrieval from 20 specific clinical queries. As the gold standard, two nuclear medicine physicians manually retrieved the relevant images from the image database using free text search of radiology reports for the same queries. We compared query results with the manually retrieved results obtained by the physicians. The query performance indicated a 98 % recall for simple queries and a 89 % recall for complex queries. In total, the queries provided 95 % (75 of 79 images) recall, 100 % precision, and an F1 score of 0.97 for the 20 clinical queries. Three of the four images missed by the queries required reasoning for successful retrieval. Nuclear medicine images augmented using semantic annotations in AIM enabled high recall and precision for simple queries, helping physicians to retrieve the relevant images. Further study using a larger data set and the implementation of an inference engine may improve query results for more complex queries.

  1. An Improved microRNA Annotation of the Canine Genome

    PubMed Central

    Swofford, Ross; Johnson, Jeremy; Alföldi, Jessica; Lindblad-Toh, Kerstin; Swarbreck, David; Moxon, Simon; Di Palma, Federica

    2016-01-01

    The domestic dog, Canis familiaris, is a valuable model for studying human diseases. The publication of the latest Canine genome build and annotation, CanFam3.1 provides an opportunity to enhance our understanding of gene regulation across tissues in the dog model system. In this study, we used the latest dog genome assembly and small RNA sequencing data from 9 different dog tissues to predict novel miRNAs in the dog genome, as well as to annotate conserved miRNAs from the miRBase database that were missing from the current dog annotation. We used both miRCat and miRDeep2 algorithms to computationally predict miRNA loci. The resulting, putative hairpin sequences were analysed in order to discard false positives, based on predicted secondary structures and patterns of small RNA read alignments. Results were further divided into high and low confidence miRNAs, using the same criteria. We generated tissue specific expression profiles for the resulting set of 811 loci: 720 conserved miRNAs, (207 of which had not been previously annotated in the dog genome) and 91 novel miRNA loci. Comparative analyses revealed 8 putative homologues of some novel miRNA in ferret, and one in microbat. All miRNAs were also classified into the genic and intergenic categories, based on the Ensembl RefSeq gene annotation for CanFam3.1. This additionally allowed us to identify four previously undescribed MiRtrons among our total set of miRNAs. We additionally annotated piRNAs, using proTRAC on the same input data. We thus identified 263 putative clusters, most of which (211 clusters) were found to be expressed in testis. Our results represent an important improvement of the dog genome annotation, paving the way to further research on the evolution of gene regulation, as well as on the contribution of post-transcriptional regulation to pathological conditions. PMID:27119849

  2. Bovine Genome Database: supporting community annotation and analysis of the Bos taurus genome

    PubMed Central

    2010-01-01

    Background A goal of the Bovine Genome Database (BGD; http://BovineGenome.org) has been to support the Bovine Genome Sequencing and Analysis Consortium (BGSAC) in the annotation and analysis of the bovine genome. We were faced with several challenges, including the need to maintain consistent quality despite diversity in annotation expertise in the research community, the need to maintain consistent data formats, and the need to minimize the potential duplication of annotation effort. With new sequencing technologies allowing many more eukaryotic genomes to be sequenced, the demand for collaborative annotation is likely to increase. Here we present our approach, challenges and solutions facilitating a large distributed annotation project. Results and Discussion BGD has provided annotation tools that supported 147 members of the BGSAC in contributing 3,871 gene models over a fifteen-week period, and these annotations have been integrated into the bovine Official Gene Set. Our approach has been to provide an annotation system, which includes a BLAST site, multiple genome browsers, an annotation portal, and the Apollo Annotation Editor configured to connect directly to our Chado database. In addition to implementing and integrating components of the annotation system, we have performed computational analyses to create gene evidence tracks and a consensus gene set, which can be viewed on individual gene pages at BGD. Conclusions We have provided annotation tools that alleviate challenges associated with distributed annotation. Our system provides a consistent set of data to all annotators and eliminates the need for annotators to format data. Involving the bovine research community in genome annotation has allowed us to leverage expertise in various areas of bovine biology to provide biological insight into the genome sequence. PMID:21092105

  3. Combining computational models, semantic annotations and simulation experiments in a graph database

    PubMed Central

    Henkel, Ron; Wolkenhauer, Olaf; Waltemath, Dagmar

    2015-01-01

    Model repositories such as the BioModels Database, the CellML Model Repository or JWS Online are frequently accessed to retrieve computational models of biological systems. However, their storage concepts support only restricted types of queries and not all data inside the repositories can be retrieved. In this article we present a storage concept that meets this challenge. It grounds on a graph database, reflects the models’ structure, incorporates semantic annotations and simulation descriptions and ultimately connects different types of model-related data. The connections between heterogeneous model-related data and bio-ontologies enable efficient search via biological facts and grant access to new model features. The introduced concept notably improves the access of computational models and associated simulations in a model repository. This has positive effects on tasks such as model search, retrieval, ranking, matching and filtering. Furthermore, our work for the first time enables CellML- and Systems Biology Markup Language-encoded models to be effectively maintained in one database. We show how these models can be linked via annotations and queried. Database URL: https://sems.uni-rostock.de/projects/masymos/ PMID:25754863

  4. Combining computational models, semantic annotations and simulation experiments in a graph database.

    PubMed

    Henkel, Ron; Wolkenhauer, Olaf; Waltemath, Dagmar

    2015-01-01

    Model repositories such as the BioModels Database, the CellML Model Repository or JWS Online are frequently accessed to retrieve computational models of biological systems. However, their storage concepts support only restricted types of queries and not all data inside the repositories can be retrieved. In this article we present a storage concept that meets this challenge. It grounds on a graph database, reflects the models' structure, incorporates semantic annotations and simulation descriptions and ultimately connects different types of model-related data. The connections between heterogeneous model-related data and bio-ontologies enable efficient search via biological facts and grant access to new model features. The introduced concept notably improves the access of computational models and associated simulations in a model repository. This has positive effects on tasks such as model search, retrieval, ranking, matching and filtering. Furthermore, our work for the first time enables CellML- and Systems Biology Markup Language-encoded models to be effectively maintained in one database. We show how these models can be linked via annotations and queried. Database URL: https://sems.uni-rostock.de/projects/masymos/

  5. Using computational predictions to improve literature-based Gene Ontology annotations: a feasibility study.

    PubMed

    Costanzo, Maria C; Park, Julie; Balakrishnan, Rama; Cherry, J Michael; Hong, Eurie L

    2011-01-01

    Annotation using Gene Ontology (GO) terms is one of the most important ways in which biological information about specific gene products can be expressed in a searchable, computable form that may be compared across genomes and organisms. Because literature-based GO annotations are often used to propagate functional predictions between related proteins, their accuracy is critically important. We present a strategy that employs a comparison of literature-based annotations with computational predictions to identify and prioritize genes whose annotations need review. Using this method, we show that comparison of manually assigned 'unknown' annotations in the Saccharomyces Genome Database (SGD) with InterPro-based predictions can identify annotations that need to be updated. A survey of literature-based annotations and computational predictions made by the Gene Ontology Annotation (GOA) project at the European Bioinformatics Institute (EBI) across several other databases shows that this comparison strategy could be used to maintain and improve the quality of GO annotations for other organisms besides yeast. The survey also shows that although GOA-assigned predictions are the most comprehensive source of functional information for many genomes, a large proportion of genes in a variety of different organisms entirely lack these predictions but do have manual annotations. This underscores the critical need for manually performed, literature-based curation to provide functional information about genes that are outside the scope of widely used computational methods. Thus, the combination of manual and computational methods is essential to provide the most accurate and complete functional annotation of a genome. Database URL: http://www.yeastgenome.org.

  6. CycADS: an annotation database system to ease the development and update of BioCyc databases

    PubMed Central

    Vellozo, Augusto F.; Véron, Amélie S.; Baa-Puyoulet, Patrice; Huerta-Cepas, Jaime; Cottret, Ludovic; Febvay, Gérard; Calevro, Federica; Rahbé, Yvan; Douglas, Angela E.; Gabaldón, Toni; Sagot, Marie-France; Charles, Hubert; Colella, Stefano

    2011-01-01

    In recent years, genomes from an increasing number of organisms have been sequenced, but their annotation remains a time-consuming process. The BioCyc databases offer a framework for the integrated analysis of metabolic networks. The Pathway tool software suite allows the automated construction of a database starting from an annotated genome, but it requires prior integration of all annotations into a specific summary file or into a GenBank file. To allow the easy creation and update of a BioCyc database starting from the multiple genome annotation resources available over time, we have developed an ad hoc data management system that we called Cyc Annotation Database System (CycADS). CycADS is centred on a specific database model and on a set of Java programs to import, filter and export relevant information. Data from GenBank and other annotation sources (including for example: KAAS, PRIAM, Blast2GO and PhylomeDB) are collected into a database to be subsequently filtered and extracted to generate a complete annotation file. This file is then used to build an enriched BioCyc database using the PathoLogic program of Pathway Tools. The CycADS pipeline for annotation management was used to build the AcypiCyc database for the pea aphid (Acyrthosiphon pisum) whose genome was recently sequenced. The AcypiCyc database webpage includes also, for comparative analyses, two other metabolic reconstruction BioCyc databases generated using CycADS: TricaCyc for Tribolium castaneum and DromeCyc for Drosophila melanogaster. Linked to its flexible design, CycADS offers a powerful software tool for the generation and regular updating of enriched BioCyc databases. The CycADS system is particularly suited for metabolic gene annotation and network reconstruction in newly sequenced genomes. Because of the uniform annotation used for metabolic network reconstruction, CycADS is particularly useful for comparative analysis of the metabolism of different organisms. Database URL: http

  7. CycADS: an annotation database system to ease the development and update of BioCyc databases.

    PubMed

    Vellozo, Augusto F; Véron, Amélie S; Baa-Puyoulet, Patrice; Huerta-Cepas, Jaime; Cottret, Ludovic; Febvay, Gérard; Calevro, Federica; Rahbé, Yvan; Douglas, Angela E; Gabaldón, Toni; Sagot, Marie-France; Charles, Hubert; Colella, Stefano

    2011-01-01

    In recent years, genomes from an increasing number of organisms have been sequenced, but their annotation remains a time-consuming process. The BioCyc databases offer a framework for the integrated analysis of metabolic networks. The Pathway tool software suite allows the automated construction of a database starting from an annotated genome, but it requires prior integration of all annotations into a specific summary file or into a GenBank file. To allow the easy creation and update of a BioCyc database starting from the multiple genome annotation resources available over time, we have developed an ad hoc data management system that we called Cyc Annotation Database System (CycADS). CycADS is centred on a specific database model and on a set of Java programs to import, filter and export relevant information. Data from GenBank and other annotation sources (including for example: KAAS, PRIAM, Blast2GO and PhylomeDB) are collected into a database to be subsequently filtered and extracted to generate a complete annotation file. This file is then used to build an enriched BioCyc database using the PathoLogic program of Pathway Tools. The CycADS pipeline for annotation management was used to build the AcypiCyc database for the pea aphid (Acyrthosiphon pisum) whose genome was recently sequenced. The AcypiCyc database webpage includes also, for comparative analyses, two other metabolic reconstruction BioCyc databases generated using CycADS: TricaCyc for Tribolium castaneum and DromeCyc for Drosophila melanogaster. Linked to its flexible design, CycADS offers a powerful software tool for the generation and regular updating of enriched BioCyc databases. The CycADS system is particularly suited for metabolic gene annotation and network reconstruction in newly sequenced genomes. Because of the uniform annotation used for metabolic network reconstruction, CycADS is particularly useful for comparative analysis of the metabolism of different organisms. Database URL: http://www.cycadsys.org.

  8. miRNAs target databases: developmental methods and target identification techniques with functional annotations.

    PubMed

    Singh, Nagendra Kumar

    2017-06-01

    microRNA (miRNA) regulates diverse biological mechanisms and metabolisms in plants and animals. Thus, the discoveries of miRNA has revolutionized the life sciences and medical research.The miRNA represses and cleaves the targeted mRNA by binding perfect or near perfect or imperfect complementary base pairs by RNA-induced silencing complex (RISC) formation during biogenesis process. One miRNA interacts with one or more mRNA genes and vice versa, hence takes part in causing various diseases. In this paper, the different microRNA target databases and their functional annotations developed by various researchers have been reviewed. The concurrent research review aims at comprehending the significance of miRNA and presenting the existing status of annotated miRNA target resources built by researchers henceforth discovering the knowledge for diagnosis and prognosis. This review discusses the applications and developmental methodologies for constructing target database as well as the utility of user interface design. An integrated architecture is drawn and a graphically comparative study of present status of miRNA targets in diverse diseases and various biological processes is performed. These databases comprise of information such as miRNA target-associated disease, transcription factor binding sites (TFBSs) in miRNA genomic locations, polymorphism in miRNA target, A-to-I edited target, Gene Ontology (GO), genome annotations, KEGG (Kyoto Encyclopedia of Genes and Genomes) pathways, target expression analysis, TF-miRNA and miRNA-mRNA interaction networks, drugs-targets interactions, etc. miRNA target databases contain diverse experimentally and computationally predicted target through various algorithms. The comparison of various miRNA target database has been performed on various parameters. The computationally predicted target databases suffer from false positive information as there is no common theory for prediction of miRNA targets. The review conclusion emphasizes

  9. Applying negative rule mining to improve genome annotation

    PubMed Central

    Artamonova, Irena I; Frishman, Goar; Frishman, Dmitrij

    2007-01-01

    Background Unsupervised annotation of proteins by software pipelines suffers from very high error rates. Spurious functional assignments are usually caused by unwarranted homology-based transfer of information from existing database entries to the new target sequences. We have previously demonstrated that data mining in large sequence annotation databanks can help identify annotation items that are strongly associated with each other, and that exceptions from strong positive association rules often point to potential annotation errors. Here we investigate the applicability of negative association rule mining to revealing erroneously assigned annotation items. Results Almost all exceptions from strong negative association rules are connected to at least one wrong attribute in the feature combination making up the rule. The fraction of annotation features flagged by this approach as suspicious is strongly enriched in errors and constitutes about 0.6% of the whole body of the similarity-transferred annotation in the PEDANT genome database. Positive rule mining does not identify two thirds of these errors. The approach based on exceptions from negative rules is much more specific than positive rule mining, but its coverage is significantly lower. Conclusion Mining of both negative and positive association rules is a potent tool for finding significant trends in protein annotation and flagging doubtful features for further inspection. PMID:17659089

  10. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation

    PubMed Central

    O'Leary, Nuala A.; Wright, Mathew W.; Brister, J. Rodney; Ciufo, Stacy; Haddad, Diana; McVeigh, Rich; Rajput, Bhanu; Robbertse, Barbara; Smith-White, Brian; Ako-Adjei, Danso; Astashyn, Alexander; Badretdin, Azat; Bao, Yiming; Blinkova, Olga; Brover, Vyacheslav; Chetvernin, Vyacheslav; Choi, Jinna; Cox, Eric; Ermolaeva, Olga; Farrell, Catherine M.; Goldfarb, Tamara; Gupta, Tripti; Haft, Daniel; Hatcher, Eneida; Hlavina, Wratko; Joardar, Vinita S.; Kodali, Vamsi K.; Li, Wenjun; Maglott, Donna; Masterson, Patrick; McGarvey, Kelly M.; Murphy, Michael R.; O'Neill, Kathleen; Pujar, Shashikant; Rangwala, Sanjida H.; Rausch, Daniel; Riddick, Lillian D.; Schoch, Conrad; Shkeda, Andrei; Storz, Susan S.; Sun, Hanzhen; Thibaud-Nissen, Francoise; Tolstoy, Igor; Tully, Raymond E.; Vatsan, Anjana R.; Wallin, Craig; Webb, David; Wu, Wendy; Landrum, Melissa J.; Kimchi, Avi; Tatusova, Tatiana; DiCuccio, Michael; Kitts, Paul; Murphy, Terence D.; Pruitt, Kim D.

    2016-01-01

    The RefSeq project at the National Center for Biotechnology Information (NCBI) maintains and curates a publicly available database of annotated genomic, transcript, and protein sequence records (http://www.ncbi.nlm.nih.gov/refseq/). The RefSeq project leverages the data submitted to the International Nucleotide Sequence Database Collaboration (INSDC) against a combination of computation, manual curation, and collaboration to produce a standard set of stable, non-redundant reference sequences. The RefSeq project augments these reference sequences with current knowledge including publications, functional features and informative nomenclature. The database currently represents sequences from more than 55 000 organisms (>4800 viruses, >40 000 prokaryotes and >10 000 eukaryotes; RefSeq release 71), ranging from a single record to complete genomes. This paper summarizes the current status of the viral, prokaryotic, and eukaryotic branches of the RefSeq project, reports on improvements to data access and details efforts to further expand the taxonomic representation of the collection. We also highlight diverse functional curation initiatives that support multiple uses of RefSeq data including taxonomic validation, genome annotation, comparative genomics, and clinical testing. We summarize our approach to utilizing available RNA-Seq and other data types in our manual curation process for vertebrate, plant, and other species, and describe a new direction for prokaryotic genomes and protein name management. PMID:26553804

  11. SALAD database: a motif-based database of protein annotations for plant comparative genomics.

    PubMed

    Mihara, Motohiro; Itoh, Takeshi; Izawa, Takeshi

    2010-01-01

    Proteins often have several motifs with distinct evolutionary histories. Proteins with similar motifs have similar biochemical properties and thus related biological functions. We constructed a unique comparative genomics database termed the SALAD database (http://salad.dna.affrc.go.jp/salad/) from plant-genome-based proteome data sets. We extracted evolutionarily conserved motifs by MEME software from 209,529 protein-sequence annotation groups selected by BLASTP from the proteome data sets of 10 species: rice, sorghum, Arabidopsis thaliana, grape, a lycophyte, a moss, 3 algae, and yeast. Similarity clustering of each protein group was performed by pairwise scoring of the motif patterns of the sequences. The SALAD database provides a user-friendly graphical viewer that displays a motif pattern diagram linked to the resulting bootstrapped dendrogram for each protein group. Amino-acid-sequence-based and nucleotide-sequence-based phylogenetic trees for motif combination alignment, a logo comparison diagram for each clade in the tree, and a Pfam-domain pattern diagram are also available. We also developed a viewer named 'SALAD on ARRAYs' to view arbitrary microarray data sets of paralogous genes linked to the same dendrogram in a window. The SALAD database is a powerful tool for comparing protein sequences and can provide valuable hints for biological analysis.

  12. SALAD database: a motif-based database of protein annotations for plant comparative genomics

    PubMed Central

    Mihara, Motohiro; Itoh, Takeshi; Izawa, Takeshi

    2010-01-01

    Proteins often have several motifs with distinct evolutionary histories. Proteins with similar motifs have similar biochemical properties and thus related biological functions. We constructed a unique comparative genomics database termed the SALAD database (http://salad.dna.affrc.go.jp/salad/) from plant-genome-based proteome data sets. We extracted evolutionarily conserved motifs by MEME software from 209 529 protein-sequence annotation groups selected by BLASTP from the proteome data sets of 10 species: rice, sorghum, Arabidopsis thaliana, grape, a lycophyte, a moss, 3 algae, and yeast. Similarity clustering of each protein group was performed by pairwise scoring of the motif patterns of the sequences. The SALAD database provides a user-friendly graphical viewer that displays a motif pattern diagram linked to the resulting bootstrapped dendrogram for each protein group. Amino-acid-sequence-based and nucleotide-sequence-based phylogenetic trees for motif combination alignment, a logo comparison diagram for each clade in the tree, and a Pfam-domain pattern diagram are also available. We also developed a viewer named ‘SALAD on ARRAYs’ to view arbitrary microarray data sets of paralogous genes linked to the same dendrogram in a window. The SALAD database is a powerful tool for comparing protein sequences and can provide valuable hints for biological analysis. PMID:19854933

  13. Evaluation of relational and NoSQL database architectures to manage genomic annotations.

    PubMed

    Schulz, Wade L; Nelson, Brent G; Felker, Donn K; Durant, Thomas J S; Torres, Richard

    2016-12-01

    While the adoption of next generation sequencing has rapidly expanded, the informatics infrastructure used to manage the data generated by this technology has not kept pace. Historically, relational databases have provided much of the framework for data storage and retrieval. Newer technologies based on NoSQL architectures may provide significant advantages in storage and query efficiency, thereby reducing the cost of data management. But their relative advantage when applied to biomedical data sets, such as genetic data, has not been characterized. To this end, we compared the storage, indexing, and query efficiency of a common relational database (MySQL), a document-oriented NoSQL database (MongoDB), and a relational database with NoSQL support (PostgreSQL). When used to store genomic annotations from the dbSNP database, we found the NoSQL architectures to outperform traditional, relational models for speed of data storage, indexing, and query retrieval in nearly every operation. These findings strongly support the use of novel database technologies to improve the efficiency of data management within the biological sciences.

  14. An integrated computational pipeline and database to support whole-genome sequence annotation

    PubMed Central

    Mungall, CJ; Misra, S; Berman, BP; Carlson, J; Frise, E; Harris, N; Marshall, B; Shu, S; Kaminker, JS; Prochnik, SE; Smith, CD; Smith, E; Tupy, JL; Wiel, C; Rubin, GM; Lewis, SE

    2002-01-01

    We describe here our experience in annotating the Drosophila melanogaster genome sequence, in the course of which we developed several new open-source software tools and a database schema to support large-scale genome annotation. We have developed these into an integrated and reusable software system for whole-genome annotation. The key contributions to overall annotation quality are the marshalling of high-quality sequences for alignments and the design of a system with an adaptable and expandable flexible architecture. PMID:12537570

  15. BioBuilder as a database development and functional annotation platform for proteins

    PubMed Central

    Navarro, J Daniel; Talreja, Naveen; Peri, Suraj; Vrushabendra, BM; Rashmi, BP; Padma, N; Surendranath, Vineeth; Jonnalagadda, Chandra Kiran; Kousthub, PS; Deshpande, Nandan; Shanker, K; Pandey, Akhilesh

    2004-01-01

    Background The explosion in biological information creates the need for databases that are easy to develop, easy to maintain and can be easily manipulated by annotators who are most likely to be biologists. However, deployment of scalable and extensible databases is not an easy task and generally requires substantial expertise in database development. Results BioBuilder is a Zope-based software tool that was developed to facilitate intuitive creation of protein databases. Protein data can be entered and annotated through web forms along with the flexibility to add customized annotation features to protein entries. A built-in review system permits a global team of scientists to coordinate their annotation efforts. We have already used BioBuilder to develop Human Protein Reference Database , a comprehensive annotated repository of the human proteome. The data can be exported in the extensible markup language (XML) format, which is rapidly becoming as the standard format for data exchange. Conclusions As the proteomic data for several organisms begins to accumulate, BioBuilder will prove to be an invaluable platform for functional annotation and development of customizable protein centric databases. BioBuilder is open source and is available under the terms of LGPL. PMID:15099404

  16. dbWFA: a web-based database for functional annotation of Triticum aestivum transcripts

    PubMed Central

    Vincent, Jonathan; Dai, Zhanwu; Ravel, Catherine; Choulet, Frédéric; Mouzeyar, Said; Bouzidi, M. Fouad; Agier, Marie; Martre, Pierre

    2013-01-01

    The functional annotation of genes based on sequence homology with genes from model species genomes is time-consuming because it is necessary to mine several unrelated databases. The aim of the present work was to develop a functional annotation database for common wheat Triticum aestivum (L.). The database, named dbWFA, is based on the reference NCBI UniGene set, an expressed gene catalogue built by expressed sequence tag clustering, and on full-length coding sequences retrieved from the TriFLDB database. Information from good-quality heterogeneous sources, including annotations for model plant species Arabidopsis thaliana (L.) Heynh. and Oryza sativa L., was gathered and linked to T. aestivum sequences through BLAST-based homology searches. Even though the complexity of the transcriptome cannot yet be fully appreciated, we developed a tool to easily and promptly obtain information from multiple functional annotation systems (Gene Ontology, MapMan bin codes, MIPS Functional Categories, PlantCyc pathway reactions and TAIR gene families). The use of dbWFA is illustrated here with several query examples. We were able to assign a putative function to 45% of the UniGenes and 81% of the full-length coding sequences from TriFLDB. Moreover, comparison of the annotation of the whole T. aestivum UniGene set along with curated annotations of the two model species assessed the accuracy of the annotation provided by dbWFA. To further illustrate the use of dbWFA, genes specifically expressed during the early cell division or late storage polymer accumulation phases of T. aestivum grain development were identified using a clustering analysis and then annotated using dbWFA. The annotation of these two sets of genes was consistent with previous analyses of T. aestivum grain transcriptomes and proteomes. Database URL: urgi.versailles.inra.fr/dbWFA/ PMID:23660284

  17. ASGARD: an open-access database of annotated transcriptomes for emerging model arthropod species

    PubMed Central

    Zeng, Victor; Extavour, Cassandra G.

    2012-01-01

    The increased throughput and decreased cost of next-generation sequencing (NGS) have shifted the bottleneck genomic research from sequencing to annotation, analysis and accessibility. This is particularly challenging for research communities working on organisms that lack the basic infrastructure of a sequenced genome, or an efficient way to utilize whatever sequence data may be available. Here we present a new database, the Assembled Searchable Giant Arthropod Read Database (ASGARD). This database is a repository and search engine for transcriptomic data from arthropods that are of high interest to multiple research communities but currently lack sequenced genomes. We demonstrate the functionality and utility of ASGARD using de novo assembled transcriptomes from the milkweed bug Oncopeltus fasciatus, the cricket Gryllus bimaculatus and the amphipod crustacean Parhyale hawaiensis. We have annotated these transcriptomes to assign putative orthology, coding region determination, protein domain identification and Gene Ontology (GO) term annotation to all possible assembly products. ASGARD allows users to search all assemblies by orthology annotation, GO term annotation or Basic Local Alignment Search Tool. User-friendly features of ASGARD include search term auto-completion suggestions based on database content, the ability to download assembly product sequences in FASTA format, direct links to NCBI data for predicted orthologs and graphical representation of the location of protein domains and matches to similar sequences from the NCBI non-redundant database. ASGARD will be a useful repository for transcriptome data from future NGS studies on these and other emerging model arthropods, regardless of sequencing platform, assembly or annotation status. This database thus provides easy, one-stop access to multi-species annotated transcriptome information. We anticipate that this database will be useful for members of multiple research communities, including developmental

  18. ASGARD: an open-access database of annotated transcriptomes for emerging model arthropod species.

    PubMed

    Zeng, Victor; Extavour, Cassandra G

    2012-01-01

    The increased throughput and decreased cost of next-generation sequencing (NGS) have shifted the bottleneck genomic research from sequencing to annotation, analysis and accessibility. This is particularly challenging for research communities working on organisms that lack the basic infrastructure of a sequenced genome, or an efficient way to utilize whatever sequence data may be available. Here we present a new database, the Assembled Searchable Giant Arthropod Read Database (ASGARD). This database is a repository and search engine for transcriptomic data from arthropods that are of high interest to multiple research communities but currently lack sequenced genomes. We demonstrate the functionality and utility of ASGARD using de novo assembled transcriptomes from the milkweed bug Oncopeltus fasciatus, the cricket Gryllus bimaculatus and the amphipod crustacean Parhyale hawaiensis. We have annotated these transcriptomes to assign putative orthology, coding region determination, protein domain identification and Gene Ontology (GO) term annotation to all possible assembly products. ASGARD allows users to search all assemblies by orthology annotation, GO term annotation or Basic Local Alignment Search Tool. User-friendly features of ASGARD include search term auto-completion suggestions based on database content, the ability to download assembly product sequences in FASTA format, direct links to NCBI data for predicted orthologs and graphical representation of the location of protein domains and matches to similar sequences from the NCBI non-redundant database. ASGARD will be a useful repository for transcriptome data from future NGS studies on these and other emerging model arthropods, regardless of sequencing platform, assembly or annotation status. This database thus provides easy, one-stop access to multi-species annotated transcriptome information. We anticipate that this database will be useful for members of multiple research communities, including developmental

  19. The H-Invitational Database (H-InvDB), a comprehensive annotation resource for human genes and transcripts.

    PubMed

    Yamasaki, Chisato; Murakami, Katsuhiko; Fujii, Yasuyuki; Sato, Yoshiharu; Harada, Erimi; Takeda, Jun-ichi; Taniya, Takayuki; Sakate, Ryuichi; Kikugawa, Shingo; Shimada, Makoto; Tanino, Motohiko; Koyanagi, Kanako O; Barrero, Roberto A; Gough, Craig; Chun, Hong-Woo; Habara, Takuya; Hanaoka, Hideki; Hayakawa, Yosuke; Hilton, Phillip B; Kaneko, Yayoi; Kanno, Masako; Kawahara, Yoshihiro; Kawamura, Toshiyuki; Matsuya, Akihiro; Nagata, Naoki; Nishikata, Kensaku; Noda, Akiko Ogura; Nurimoto, Shin; Saichi, Naomi; Sakai, Hiroaki; Sanbonmatsu, Ryoko; Shiba, Rie; Suzuki, Mami; Takabayashi, Kazuhiko; Takahashi, Aiko; Tamura, Takuro; Tanaka, Masayuki; Tanaka, Susumu; Todokoro, Fusano; Yamaguchi, Kaori; Yamamoto, Naoyuki; Okido, Toshihisa; Mashima, Jun; Hashizume, Aki; Jin, Lihua; Lee, Kyung-Bum; Lin, Yi-Chueh; Nozaki, Asami; Sakai, Katsunaga; Tada, Masahito; Miyazaki, Satoru; Makino, Takashi; Ohyanagi, Hajime; Osato, Naoki; Tanaka, Nobuhiko; Suzuki, Yoshiyuki; Ikeo, Kazuho; Saitou, Naruya; Sugawara, Hideaki; O'Donovan, Claire; Kulikova, Tamara; Whitfield, Eleanor; Halligan, Brian; Shimoyama, Mary; Twigger, Simon; Yura, Kei; Kimura, Kouichi; Yasuda, Tomohiro; Nishikawa, Tetsuo; Akiyama, Yutaka; Motono, Chie; Mukai, Yuri; Nagasaki, Hideki; Suwa, Makiko; Horton, Paul; Kikuno, Reiko; Ohara, Osamu; Lancet, Doron; Eveno, Eric; Graudens, Esther; Imbeaud, Sandrine; Debily, Marie Anne; Hayashizaki, Yoshihide; Amid, Clara; Han, Michael; Osanger, Andreas; Endo, Toshinori; Thomas, Michael A; Hirakawa, Mika; Makalowski, Wojciech; Nakao, Mitsuteru; Kim, Nam-Soon; Yoo, Hyang-Sook; De Souza, Sandro J; Bonaldo, Maria de Fatima; Niimura, Yoshihito; Kuryshev, Vladimir; Schupp, Ingo; Wiemann, Stefan; Bellgard, Matthew; Shionyu, Masafumi; Jia, Libin; Thierry-Mieg, Danielle; Thierry-Mieg, Jean; Wagner, Lukas; Zhang, Qinghua; Go, Mitiko; Minoshima, Shinsei; Ohtsubo, Masafumi; Hanada, Kousuke; Tonellato, Peter; Isogai, Takao; Zhang, Ji; Lenhard, Boris; Kim, Sangsoo; Chen, Zhu; Hinz, Ursula; Estreicher, Anne; Nakai, Kenta; Makalowska, Izabela; Hide, Winston; Tiffin, Nicola; Wilming, Laurens; Chakraborty, Ranajit; Soares, Marcelo Bento; Chiusano, Maria Luisa; Suzuki, Yutaka; Auffray, Charles; Yamaguchi-Kabata, Yumi; Itoh, Takeshi; Hishiki, Teruyoshi; Fukuchi, Satoshi; Nishikawa, Ken; Sugano, Sumio; Nomura, Nobuo; Tateno, Yoshio; Imanishi, Tadashi; Gojobori, Takashi

    2008-01-01

    Here we report the new features and improvements in our latest release of the H-Invitational Database (H-InvDB; http://www.h-invitational.jp/), a comprehensive annotation resource for human genes and transcripts. H-InvDB, originally developed as an integrated database of the human transcriptome based on extensive annotation of large sets of full-length cDNA (FLcDNA) clones, now provides annotation for 120 558 human mRNAs extracted from the International Nucleotide Sequence Databases (INSD), in addition to 54 978 human FLcDNAs, in the latest release H-InvDB_4.6. We mapped those human transcripts onto the human genome sequences (NCBI build 36.1) and determined 34 699 human gene clusters, which could define 34 057 (98.1%) protein-coding and 642 (1.9%) non-protein-coding loci; 858 (2.5%) transcribed loci overlapped with predicted pseudogenes. For all these transcripts and genes, we provide comprehensive annotation including gene structures, gene functions, alternative splicing variants, functional non-protein-coding RNAs, functional domains, predicted sub cellular localizations, metabolic pathways, predictions of protein 3D structure, mapping of SNPs and microsatellite repeat motifs, co-localization with orphan diseases, gene expression profiles, orthologous genes, protein-protein interactions (PPI) and annotation for gene families. The current H-InvDB annotation resources consist of two main views: Transcript view and Locus view and eight sub-databases: the DiseaseInfo Viewer, H-ANGEL, the Clustering Viewer, G-integra, the TOPO Viewer, Evola, the PPI view and the Gene family/group.

  20. The H-Invitational Database (H-InvDB), a comprehensive annotation resource for human genes and transcripts*

    PubMed Central

    2008-01-01

    Here we report the new features and improvements in our latest release of the H-Invitational Database (H-InvDB; http://www.h-invitational.jp/), a comprehensive annotation resource for human genes and transcripts. H-InvDB, originally developed as an integrated database of the human transcriptome based on extensive annotation of large sets of full-length cDNA (FLcDNA) clones, now provides annotation for 120 558 human mRNAs extracted from the International Nucleotide Sequence Databases (INSD), in addition to 54 978 human FLcDNAs, in the latest release H-InvDB_4.6. We mapped those human transcripts onto the human genome sequences (NCBI build 36.1) and determined 34 699 human gene clusters, which could define 34 057 (98.1%) protein-coding and 642 (1.9%) non-protein-coding loci; 858 (2.5%) transcribed loci overlapped with predicted pseudogenes. For all these transcripts and genes, we provide comprehensive annotation including gene structures, gene functions, alternative splicing variants, functional non-protein-coding RNAs, functional domains, predicted sub cellular localizations, metabolic pathways, predictions of protein 3D structure, mapping of SNPs and microsatellite repeat motifs, co-localization with orphan diseases, gene expression profiles, orthologous genes, protein–protein interactions (PPI) and annotation for gene families. The current H-InvDB annotation resources consist of two main views: Transcript view and Locus view and eight sub-databases: the DiseaseInfo Viewer, H-ANGEL, the Clustering Viewer, G-integra, the TOPO Viewer, Evola, the PPI view and the Gene family/group. PMID:18089548

  1. Assessment of community-submitted ontology annotations from a novel database-journal partnership

    PubMed Central

    Berardini, Tanya Z.; Li, Donghui; Muller, Robert; Chetty, Raymond; Ploetz, Larry; Singh, Shanker; Wensel, April; Huala, Eva

    2012-01-01

    As the scientific literature grows, leading to an increasing volume of published experimental data, so does the need to access and analyze this data using computational tools. The most commonly used method to convert published experimental data on gene function into controlled vocabulary annotations relies on a professional curator, employed by a model organism database or a more general resource such as UniProt, to read published articles and compose annotation statements based on the articles' contents. A more cost-effective and scalable approach capable of capturing gene function data across the whole range of biological research organisms in computable form is urgently needed. We have analyzed a set of ontology annotations generated through collaborations between the Arabidopsis Information Resource and several plant science journals. Analysis of the submissions entered using the online submission tool shows that most community annotations were well supported and the ontology terms chosen were at an appropriate level of specificity. Of the 503 individual annotations that were submitted, 97% were approved and community submissions captured 72% of all possible annotations. This new method for capturing experimental results in a computable form provides a cost-effective way to greatly increase the available body of annotations without sacrificing annotation quality. Database URL: www.arabidopsis.org PMID:22859749

  2. Assessment of community-submitted ontology annotations from a novel database-journal partnership.

    PubMed

    Berardini, Tanya Z; Li, Donghui; Muller, Robert; Chetty, Raymond; Ploetz, Larry; Singh, Shanker; Wensel, April; Huala, Eva

    2012-01-01

    As the scientific literature grows, leading to an increasing volume of published experimental data, so does the need to access and analyze this data using computational tools. The most commonly used method to convert published experimental data on gene function into controlled vocabulary annotations relies on a professional curator, employed by a model organism database or a more general resource such as UniProt, to read published articles and compose annotation statements based on the articles' contents. A more cost-effective and scalable approach capable of capturing gene function data across the whole range of biological research organisms in computable form is urgently needed. We have analyzed a set of ontology annotations generated through collaborations between the Arabidopsis Information Resource and several plant science journals. Analysis of the submissions entered using the online submission tool shows that most community annotations were well supported and the ontology terms chosen were at an appropriate level of specificity. Of the 503 individual annotations that were submitted, 97% were approved and community submissions captured 72% of all possible annotations. This new method for capturing experimental results in a computable form provides a cost-effective way to greatly increase the available body of annotations without sacrificing annotation quality. Database URL: www.arabidopsis.org.

  3. SUS-BAR: a database of pig proteins with statistically validated structural and functional annotation

    PubMed Central

    Piovesan, Damiano; Profiti, Giuseppe; Martelli, Pier Luigi; Fariselli, Piero; Fontanesi, Luca; Casadio, Rita

    2013-01-01

    Given the relevance of the pig proteome in different studies, including human complex maladies, a statistical validation of the annotation is required for a better understanding of the role of specific genes and proteins in the complex networks underlying biological processes in the animal. Presently, approximately 80% of the pig proteome is still poorly annotated, and the existence of protein sequences is routinely inferred automatically by sequence alignment towards preexisting sequences. In this article, we introduce SUS-BAR, a database that derives information mainly from UniProt Knowledgebase and that includes 26 206 pig protein sequences. In SUS-BAR, 16 675 of the pig protein sequences are endowed with statistically validated functional and structural annotation. Our statistical validation is determined by adopting a cluster-centric annotation procedure that allows transfer of different types of annotation, including structure and function. Each sequence in the database can be associated with a set of statistically validated Gene Ontologies (GOs) of the three main sub-ontologies (Molecular Function, Biological Process and Cellular Component), with Pfam functional domains, and when possible, with a cluster Hidden Markov Model that allows modelling the 3D structure of the protein. A database search allows some statistics demonstrating the enrichment in both GO and Pfam annotations of the pig proteins as compared with UniProt Knowledgebase annotation. Searching in SUS-BAR allows retrieval of the pig protein annotation for further analysis. The search is also possible on the basis of specific GO terms and this allows retrieval of all the pig sequences participating into a given biological process, after annotation with our system. Alternatively, the search is possible on the basis of structural information, allowing retrieval of all the pig sequences with the same structural characteristics. Database URL: http://bar.biocomp.unibo.it/pig/ PMID:24065691

  4. CHPVDB ‐ a sequence annotation database for Chandipura Virus

    PubMed Central

    Dikhit, Manas Ranjan; Rana, Sindhu Prava; Das, Pradeep; Sahoo, Ganesh Chandra

    2009-01-01

    Databases containing proteomic information have become indispensable for virology studies. As the gap between the amount of sequence information and functional characterization widens, increasing efforts are being directed to the development of databases. For virologist, it is therefore desirable to have a single data collection point which integrates research related data from different domains. CHPVDB is our effort to provide virologist such a one‐step information center. We describe herein the creation of CHPVDB, a new database that integrates information of different proteins in to a single resource. For basic curation of protein information, the database relies on features from other selected databases, servers and published reports. This database facilitates significant relationship between molecular analysis, cleavage sites, possible protein functional families assigned to different proteins of Chandipura virus (CHPV) by SVMProt and related tools. Availability The database is freely available at http://chpvdb.biomedinformri.org/. PMID:19293996

  5. BioModels Database: An enhanced, curated and annotated resource for published quantitative kinetic models

    PubMed Central

    2010-01-01

    Background Quantitative models of biochemical and cellular systems are used to answer a variety of questions in the biological sciences. The number of published quantitative models is growing steadily thanks to increasing interest in the use of models as well as the development of improved software systems and the availability of better, cheaper computer hardware. To maximise the benefits of this growing body of models, the field needs centralised model repositories that will encourage, facilitate and promote model dissemination and reuse. Ideally, the models stored in these repositories should be extensively tested and encoded in community-supported and standardised formats. In addition, the models and their components should be cross-referenced with other resources in order to allow their unambiguous identification. Description BioModels Database http://www.ebi.ac.uk/biomodels/ is aimed at addressing exactly these needs. It is a freely-accessible online resource for storing, viewing, retrieving, and analysing published, peer-reviewed quantitative models of biochemical and cellular systems. The structure and behaviour of each simulation model distributed by BioModels Database are thoroughly checked; in addition, model elements are annotated with terms from controlled vocabularies as well as linked to relevant data resources. Models can be examined online or downloaded in various formats. Reaction network diagrams generated from the models are also available in several formats. BioModels Database also provides features such as online simulation and the extraction of components from large scale models into smaller submodels. Finally, the system provides a range of web services that external software systems can use to access up-to-date data from the database. Conclusions BioModels Database has become a recognised reference resource for systems biology. It is being used by the community in a variety of ways; for example, it is used to benchmark different simulation

  6. Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies.

    PubMed

    Haas, Brian J; Delcher, Arthur L; Mount, Stephen M; Wortman, Jennifer R; Smith, Roger K; Hannick, Linda I; Maiti, Rama; Ronning, Catherine M; Rusch, Douglas B; Town, Christopher D; Salzberg, Steven L; White, Owen

    2003-10-01

    The spliced alignment of expressed sequence data to genomic sequence has proven a key tool in the comprehensive annotation of genes in eukaryotic genomes. A novel algorithm was developed to assemble clusters of overlapping transcript alignments (ESTs and full-length cDNAs) into maximal alignment assemblies, thereby comprehensively incorporating all available transcript data and capturing subtle splicing variations. Complete and partial gene structures identified by this method were used to improve The Institute for Genomic Research Arabidopsis genome annotation (TIGR release v.4.0). The alignment assemblies permitted the automated modeling of several novel genes and >1000 alternative splicing variations as well as updates (including UTR annotations) to nearly half of the approximately 27 000 annotated protein coding genes. The algorithm of the Program to Assemble Spliced Alignments (PASA) tool is described, as well as the results of automated updates to Arabidopsis gene annotations.

  7. SNPnexus: a web database for functional annotation of newly discovered and public domain single nucleotide polymorphisms

    PubMed Central

    Chelala, Claude; Khan, Arshad; Lemoine, Nicholas R

    2009-01-01

    Motivation: Design a new computational tool allowing scientists to functionally annotate newly discovered and public domain single nucleotide polymorphisms in order to help in prioritizing targets in further disease studies and large-scale genotyping projects. Summary: SNPnexus database provides functional annotation for both novel and public SNPs. Possible effects on the transcriptome and proteome levels are characterized and reported from five major annotation systems providing the most extensive information on alternative splicing. Additional information on HapMap genotype and allele frequency, overlaps with potential regulatory elements or structural variations as well as related genetic diseases can be also retrieved. The SNPnexus database has a user-friendly web interface, providing single or batch query options using SNP identifiers from dbSNP as well as genomic location on clones, contigs or chromosomes. Therefore, SNPnexus is the only database currently providing a complete set of functional annotations of SNPs in public databases and newly detected from sequencing projects. Hence, we describe SNPnexus, provide details of the query options, the annotation categories as well as biological examples of use. Availability: The SNPnexus database is freely available at http://www.snp-nexus.org. Contact: claude.chelala@cancer.org.uk PMID:19098027

  8. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation.

    PubMed

    O'Leary, Nuala A; Wright, Mathew W; Brister, J Rodney; Ciufo, Stacy; Haddad, Diana; McVeigh, Rich; Rajput, Bhanu; Robbertse, Barbara; Smith-White, Brian; Ako-Adjei, Danso; Astashyn, Alexander; Badretdin, Azat; Bao, Yiming; Blinkova, Olga; Brover, Vyacheslav; Chetvernin, Vyacheslav; Choi, Jinna; Cox, Eric; Ermolaeva, Olga; Farrell, Catherine M; Goldfarb, Tamara; Gupta, Tripti; Haft, Daniel; Hatcher, Eneida; Hlavina, Wratko; Joardar, Vinita S; Kodali, Vamsi K; Li, Wenjun; Maglott, Donna; Masterson, Patrick; McGarvey, Kelly M; Murphy, Michael R; O'Neill, Kathleen; Pujar, Shashikant; Rangwala, Sanjida H; Rausch, Daniel; Riddick, Lillian D; Schoch, Conrad; Shkeda, Andrei; Storz, Susan S; Sun, Hanzhen; Thibaud-Nissen, Francoise; Tolstoy, Igor; Tully, Raymond E; Vatsan, Anjana R; Wallin, Craig; Webb, David; Wu, Wendy; Landrum, Melissa J; Kimchi, Avi; Tatusova, Tatiana; DiCuccio, Michael; Kitts, Paul; Murphy, Terence D; Pruitt, Kim D

    2016-01-04

    The RefSeq project at the National Center for Biotechnology Information (NCBI) maintains and curates a publicly available database of annotated genomic, transcript, and protein sequence records (http://www.ncbi.nlm.nih.gov/refseq/). The RefSeq project leverages the data submitted to the International Nucleotide Sequence Database Collaboration (INSDC) against a combination of computation, manual curation, and collaboration to produce a standard set of stable, non-redundant reference sequences. The RefSeq project augments these reference sequences with current knowledge including publications, functional features and informative nomenclature. The database currently represents sequences from more than 55,000 organisms (>4800 viruses, >40,000 prokaryotes and >10,000 eukaryotes; RefSeq release 71), ranging from a single record to complete genomes. This paper summarizes the current status of the viral, prokaryotic, and eukaryotic branches of the RefSeq project, reports on improvements to data access and details efforts to further expand the taxonomic representation of the collection. We also highlight diverse functional curation initiatives that support multiple uses of RefSeq data including taxonomic validation, genome annotation, comparative genomics, and clinical testing. We summarize our approach to utilizing available RNA-Seq and other data types in our manual curation process for vertebrate, plant, and other species, and describe a new direction for prokaryotic genomes and protein name management. Published by Oxford University Press on behalf of Nucleic Acids Research 2015. This work is written by (a) US Government employee(s) and is in the public domain in the US.

  9. GBshape: a genome browser database for DNA shape annotations.

    PubMed

    Chiu, Tsu-Pei; Yang, Lin; Zhou, Tianyin; Main, Bradley J; Parker, Stephen C J; Nuzhdin, Sergey V; Tullius, Thomas D; Rohs, Remo

    2015-01-01

    Many regulatory mechanisms require a high degree of specificity in protein-DNA binding. Nucleotide sequence does not provide an answer to the question of why a protein binds only to a small subset of the many putative binding sites in the genome that share the same core motif. Whereas higher-order effects, such as chromatin accessibility, cooperativity and cofactors, have been described, DNA shape recently gained attention as another feature that fine-tunes the DNA binding specificities of some transcription factor families. Our Genome Browser for DNA shape annotations (GBshape; freely available at http://rohslab.cmb.usc.edu/GBshape/) provides minor groove width, propeller twist, roll, helix twist and hydroxyl radical cleavage predictions for the entire genomes of 94 organisms. Additional genomes can easily be added using the GBshape framework. GBshape can be used to visualize DNA shape annotations qualitatively in a genome browser track format, and to download quantitative values of DNA shape features as a function of genomic position at nucleotide resolution. As biological applications, we illustrate the periodicity of DNA shape features that are present in nucleosome-occupied sequences from human, fly and worm, and we demonstrate structural similarities between transcription start sites in the genomes of four Drosophila species.

  10. Structuring osteosarcoma knowledge: an osteosarcoma-gene association database based on literature mining and manual annotation

    PubMed Central

    Poos, Kathrin; Smida, Jan; Nathrath, Michaela; Maugg, Doris; Baumhoer, Daniel; Neumann, Anna; Korsching, Eberhard

    2014-01-01

    Osteosarcoma (OS) is the most common primary bone cancer exhibiting high genomic instability. This genomic instability affects multiple genes and microRNAs to a varying extent depending on patient and tumor subtype. Massive research is ongoing to identify genes including their gene products and microRNAs that correlate with disease progression and might be used as biomarkers for OS. However, the genomic complexity hampers the identification of reliable biomarkers. Up to now, clinico-pathological factors are the key determinants to guide prognosis and therapeutic treatments. Each day, new studies about OS are published and complicate the acquisition of information to support biomarker discovery and therapeutic improvements. Thus, it is necessary to provide a structured and annotated view on the current OS knowledge that is quick and easily accessible to researchers of the field. Therefore, we developed a publicly available database and Web interface that serves as resource for OS-associated genes and microRNAs. Genes and microRNAs were collected using an automated dictionary-based gene recognition procedure followed by manual review and annotation by experts of the field. In total, 911 genes and 81 microRNAs related to 1331 PubMed abstracts were collected (last update: 29 October 2013). Users can evaluate genes and microRNAs according to their potential prognostic and therapeutic impact, the experimental procedures, the sample types, the biological contexts and microRNA target gene interactions. Additionally, a pathway enrichment analysis of the collected genes highlights different aspects of OS progression. OS requires pathways commonly deregulated in cancer but also features OS-specific alterations like deregulated osteoclast differentiation. To our knowledge, this is the first effort of an OS database containing manual reviewed and annotated up-to-date OS knowledge. It might be a useful resource especially for the bone tumor research community, as specific

  11. Structuring osteosarcoma knowledge: an osteosarcoma-gene association database based on literature mining and manual annotation.

    PubMed

    Poos, Kathrin; Smida, Jan; Nathrath, Michaela; Maugg, Doris; Baumhoer, Daniel; Neumann, Anna; Korsching, Eberhard

    2014-01-01

    Osteosarcoma (OS) is the most common primary bone cancer exhibiting high genomic instability. This genomic instability affects multiple genes and microRNAs to a varying extent depending on patient and tumor subtype. Massive research is ongoing to identify genes including their gene products and microRNAs that correlate with disease progression and might be used as biomarkers for OS. However, the genomic complexity hampers the identification of reliable biomarkers. Up to now, clinico-pathological factors are the key determinants to guide prognosis and therapeutic treatments. Each day, new studies about OS are published and complicate the acquisition of information to support biomarker discovery and therapeutic improvements. Thus, it is necessary to provide a structured and annotated view on the current OS knowledge that is quick and easily accessible to researchers of the field. Therefore, we developed a publicly available database and Web interface that serves as resource for OS-associated genes and microRNAs. Genes and microRNAs were collected using an automated dictionary-based gene recognition procedure followed by manual review and annotation by experts of the field. In total, 911 genes and 81 microRNAs related to 1331 PubMed abstracts were collected (last update: 29 October 2013). Users can evaluate genes and microRNAs according to their potential prognostic and therapeutic impact, the experimental procedures, the sample types, the biological contexts and microRNA target gene interactions. Additionally, a pathway enrichment analysis of the collected genes highlights different aspects of OS progression. OS requires pathways commonly deregulated in cancer but also features OS-specific alterations like deregulated osteoclast differentiation. To our knowledge, this is the first effort of an OS database containing manual reviewed and annotated up-to-date OS knowledge. It might be a useful resource especially for the bone tumor research community, as specific

  12. CEBS: a comprehensive annotated database of toxicological data

    PubMed Central

    Lea, Isabel A.; Gong, Hui; Paleja, Anand; Rashid, Asif; Fostel, Jennifer

    2017-01-01

    The Chemical Effects in Biological Systems database (CEBS) is a comprehensive and unique toxicology resource that compiles individual and summary animal data from the National Toxicology Program (NTP) testing program and other depositors into a single electronic repository. CEBS has undergone significant updates in recent years and currently contains over 11 000 test articles (exposure agents) and over 8000 studies including all available NTP carcinogenicity, short-term toxicity and genetic toxicity studies. Study data provided to CEBS are manually curated, accessioned and subject to quality assurance review prior to release to ensure high quality. The CEBS database has two main components: data collection and data delivery. To accommodate the breadth of data produced by NTP, the CEBS data collection component is an integrated relational design that allows the flexibility to capture any type of electronic data (to date). The data delivery component of the database comprises a series of dedicated user interface tables containing pre-processed data that support each component of the user interface. The user interface has been updated to include a series of nine Guided Search tools that allow access to NTP summary and conclusion data and larger non-NTP datasets. The CEBS database can be accessed online at http://www.niehs.nih.gov/research/resources/databases/cebs/. PMID:27899660

  13. DAVID Knowledgebase: a gene-centered database integrating heterogeneous gene annotation resources to facilitate high-throughput gene functional analysis.

    PubMed

    Sherman, Brad T; Huang, Da Wei; Tan, Qina; Guo, Yongjian; Bour, Stephan; Liu, David; Stephens, Robert; Baseler, Michael W; Lane, H Clifford; Lempicki, Richard A

    2007-11-02

    Due to the complex and distributed nature of biological research, our current biological knowledge is spread over many redundant annotation databases maintained by many independent groups. Analysts usually need to visit many of these bioinformatics databases in order to integrate comprehensive annotation information for their genes, which becomes one of the bottlenecks, particularly for the analytic task associated with a large gene list. Thus, a highly centralized and ready-to-use gene-annotation knowledgebase is in demand for high throughput gene functional analysis. The DAVID Knowledgebase is built around the DAVID Gene Concept, a single-linkage method to agglomerate tens of millions of gene/protein identifiers from a variety of public genomic resources into DAVID gene clusters. The grouping of such identifiers improves the cross-reference capability, particularly across NCBI and UniProt systems, enabling more than 40 publicly available functional annotation sources to be comprehensively integrated and centralized by the DAVID gene clusters. The simple, pair-wise, text format files which make up the DAVID Knowledgebase are freely downloadable for various data analysis uses. In addition, a well organized web interface allows users to query different types of heterogeneous annotations in a high-throughput manner. The DAVID Knowledgebase is designed to facilitate high throughput gene functional analysis. For a given gene list, it not only provides the quick accessibility to a wide range of heterogeneous annotation data in a centralized location, but also enriches the level of biological information for an individual gene. Moreover, the entire DAVID Knowledgebase is freely downloadable or searchable at http://david.abcc.ncifcrf.gov/knowledgebase/.

  14. The Flux: creating a large annotated image database

    NASA Astrophysics Data System (ADS)

    Tamburrino, Daniel; Schönmann, Patrick; Vandewalle, Patrick; Süsstrunk, Sabine

    2008-01-01

    From image retrieval to image classification, all research shares one common requirement: a good image database to test or train the algorithms. In order to create a large database of images, we set up a project that allowed gathering a collection of more than 33000 photographs with keywords and tags from all over the world. This project was part of the "We Are All Photographers Now!" exhibition at the Musee de l'Elysee in Lausanne, Switzerland. The "Flux," as it was called, gave all photographers, professional or amateur, the opportunity to have their images shown in the museum. Anyone could upload pictures on a website. We required that some simple tags were filled in. Keywords were optional. The information was collected in a MySQL database along with the original photos. The pictures were projected at the museum in five second intervals. A webcam snapshot was taken and sent back to the photographers via email to show how and when their image was displayed at the museum. During the 14 weeks of the exhibition, we collected more than 33000 JPEG pictures with tags and keywords. These pictures come from 133 countries and were taken by 9042 different photographers. This database can be used for non-commercial research at EPFL. We present some preliminary analysis here.

  15. A Tunable Machine Vision-based Strategy for Automated Annotation of Chemical Databases

    PubMed Central

    Park, Jungkap; Rosania, Gus R.; Saitou, Kazuhiro

    2010-01-01

    We present a tunable, machine vision-based strategy for automated annotation of virtual small molecule databases. The proposed strategy is based on the use of a machine vision based tool for extracting structure diagrams in research articles and converting them into connection tables, a virtual “Chemical Expert” system for screening the converted structures based on the adjustable levels of estimated conversion accuracy, and a fragment-based measure for calculating intermolecular similarity. For annotation, calculated chemical similarity between the converted structures and entries in a virtual small molecule database is used to establish the links. The overall annotation performances can be tuned by adjusting the cutoff threshold of the estimated conversion accuracy. We performed an annotation test which attempts to link 121 journal articles registered in the PubMed to entries in the PubChem which is the largest, publicly accessible chemical database. Two cases of tests are performed and their results are compared to see how the overall annotation performances are affected by the different threshold levels of the estimated accuracy of the converted structure. Our work demonstrates that over 45% of articles could have true positive links to entries in the PubChem database with promising recall and precision rates in both tests. Furthermore, we illustrates that Chemical Expert system which can screen the converted structures based on the adjustable levels of estimated conversion accuracy is a key factor impacting the overall annotation performance. We propose that this machine vision based strategy can be incorporated with the text-mining approach to facilitate extraction of contextual scientific knowledge about a chemical structure, from the scientific literature. PMID:19621901

  16. SIMAP—the database of all-against-all protein sequence similarities and annotations with new interfaces and increased coverage

    PubMed Central

    Arnold, Roland; Goldenberg, Florian; Mewes, Hans-Werner; Rattei, Thomas

    2014-01-01

    The Similarity Matrix of Proteins (SIMAP, http://mips.gsf.de/simap/) database has been designed to massively accelerate computationally expensive protein sequence analysis tasks in bioinformatics. It provides pre-calculated sequence similarities interconnecting the entire known protein sequence universe, complemented by pre-calculated protein features and domains, similarity clusters and functional annotations. SIMAP covers all major public protein databases as well as many consistently re-annotated metagenomes from different repositories. As of September 2013, SIMAP contains >163 million proteins corresponding to ∼70 million non-redundant sequences. SIMAP uses the sensitive FASTA search heuristics, the Smith–Waterman alignment algorithm, the InterPro database of protein domain models and the BLAST2GO functional annotation algorithm. SIMAP assists biologists by facilitating the interactive exploration of the protein sequence universe. Web-Service and DAS interfaces allow connecting SIMAP with any other bioinformatic tool and resource. All-against-all protein sequence similarity matrices of project-specific protein collections are generated on request. Recent improvements allow SIMAP to cover the rapidly growing sequenced protein sequence universe. New Web-Service interfaces enhance the connectivity of SIMAP. Novel tools for interactive extraction of protein similarity networks have been added. Open access to SIMAP is provided through the web portal; the portal also contains instructions and links for software access and flat file downloads. PMID:24165881

  17. CvManGO, a method for leveraging computational predictions to improve literature-based Gene Ontology annotations.

    PubMed

    Park, Julie; Costanzo, Maria C; Balakrishnan, Rama; Cherry, J Michael; Hong, Eurie L

    2012-01-01

    The set of annotations at the Saccharomyces Genome Database (SGD) that classifies the cellular function of S. cerevisiae gene products using Gene Ontology (GO) terms has become an important resource for facilitating experimental analysis. In addition to capturing and summarizing experimental results, the structured nature of GO annotations allows for functional comparison across organisms as well as propagation of functional predictions between related gene products. Due to their relevance to many areas of research, ensuring the accuracy and quality of these annotations is a priority at SGD. GO annotations are assigned either manually, by biocurators extracting experimental evidence from the scientific literature, or through automated methods that leverage computational algorithms to predict functional information. Here, we discuss the relationship between literature-based and computationally predicted GO annotations in SGD and extend a strategy whereby comparison of these two types of annotation identifies genes whose annotations need review. Our method, CvManGO (Computational versus Manual GO annotations), pairs literature-based GO annotations with computational GO predictions and evaluates the relationship of the two terms within GO, looking for instances of discrepancy. We found that this method will identify genes that require annotation updates, taking an important step towards finding ways to prioritize literature review. Additionally, we explored factors that may influence the effectiveness of CvManGO in identifying relevant gene targets to find in particular those genes that are missing literature-supported annotations, but our survey found that there are no immediately identifiable criteria by which one could enrich for these under-annotated genes. Finally, we discuss possible ways to improve this strategy, and the applicability of this method to other projects that use the GO for curation. DATABASE URL: http://www.yeastgenome.org.

  18. Artemis and ACT: viewing, annotating and comparing sequences stored in a relational database.

    PubMed

    Carver, Tim; Berriman, Matthew; Tivey, Adrian; Patel, Chinmay; Böhme, Ulrike; Barrell, Barclay G; Parkhill, Julian; Rajandream, Marie-Adèle

    2008-12-01

    Artemis and Artemis Comparison Tool (ACT) have become mainstream tools for viewing and annotating sequence data, particularly for microbial genomes. Since its first release, Artemis has been continuously developed and supported with additional functionality for editing and analysing sequences based on feedback from an active user community of laboratory biologists and professional annotators. Nevertheless, its utility has been somewhat restricted by its limitation to reading and writing from flat files. Therefore, a new version of Artemis has been developed, which reads from and writes to a relational database schema, and allows users to annotate more complex, often large and fragmented, genome sequences. Artemis and ACT have now been extended to read and write directly to the Generic Model Organism Database (GMOD, http://www.gmod.org) Chado relational database schema. In addition, a Gene Builder tool has been developed to provide structured forms and tables to edit coordinates of gene models and edit functional annotation, based on standard ontologies, controlled vocabularies and free text. Artemis and ACT are freely available (under a GPL licence) for download (for MacOSX, UNIX and Windows) at the Wellcome Trust Sanger Institute web sites: http://www.sanger.ac.uk/Software/Artemis/ http://www.sanger.ac.uk/Software/ACT/

  19. Artemis and ACT: viewing, annotating and comparing sequences stored in a relational database

    PubMed Central

    Carver, Tim; Berriman, Matthew; Tivey, Adrian; Patel, Chinmay; Böhme, Ulrike; Barrell, Barclay G.; Parkhill, Julian; Rajandream, Marie-Adèle

    2008-01-01

    Motivation: Artemis and Artemis Comparison Tool (ACT) have become mainstream tools for viewing and annotating sequence data, particularly for microbial genomes. Since its first release, Artemis has been continuously developed and supported with additional functionality for editing and analysing sequences based on feedback from an active user community of laboratory biologists and professional annotators. Nevertheless, its utility has been somewhat restricted by its limitation to reading and writing from flat files. Therefore, a new version of Artemis has been developed, which reads from and writes to a relational database schema, and allows users to annotate more complex, often large and fragmented, genome sequences. Results: Artemis and ACT have now been extended to read and write directly to the Generic Model Organism Database (GMOD, http://www.gmod.org) Chado relational database schema. In addition, a Gene Builder tool has been developed to provide structured forms and tables to edit coordinates of gene models and edit functional annotation, based on standard ontologies, controlled vocabularies and free text. Availability: Artemis and ACT are freely available (under a GPL licence) for download (for MacOSX, UNIX and Windows) at the Wellcome Trust Sanger Institute web sites: http://www.sanger.ac.uk/Software/Artemis/ http://www.sanger.ac.uk/Software/ACT/ Contact: artemis@sanger.ac.uk Supplementary information: Supplementary data are available at Bioinformatics online. PMID:18845581

  20. Non-redundant patent sequence databases with value-added annotations at two levels

    PubMed Central

    Li, Weizhong; McWilliam, Hamish; de la Torre, Ana Richart; Grodowski, Adam; Benediktovich, Irina; Goujon, Mickael; Nauche, Stephane; Lopez, Rodrigo

    2010-01-01

    The European Bioinformatics Institute (EMBL-EBI) provides public access to patent data, including abstracts, chemical compounds and sequences. Sequences can appear multiple times due to the filing of the same invention with multiple patent offices, or the use of the same sequence by different inventors in different contexts. Information relating to the source invention may be incomplete, and biological information available in patent documents elsewhere may not be reflected in the annotation of the sequence. Search and analysis of these data have become increasingly challenging for both the scientific and intellectual-property communities. Here, we report a collection of non-redundant patent sequence databases, which cover the EMBL-Bank nucleotides patent class and the patent protein databases and contain value-added annotations from patent documents. The databases were created at two levels by the use of sequence MD5 checksums. Sequences within a level-1 cluster are 100% identical over their whole length. Level-2 clusters were defined by sub-grouping level-1 clusters based on patent family information. Value-added annotations, such as publication number corrections, earliest publication dates and feature collations, significantly enhance the quality of the data, allowing for better tracking and cross-referencing. The databases are available format: http://www.ebi.ac.uk/patentdata/nr/. PMID:19884134

  1. Non-redundant patent sequence databases with value-added annotations at two levels.

    PubMed

    Li, Weizhong; McWilliam, Hamish; de la Torre, Ana Richart; Grodowski, Adam; Benediktovich, Irina; Goujon, Mickael; Nauche, Stephane; Lopez, Rodrigo

    2010-01-01

    The European Bioinformatics Institute (EMBL-EBI) provides public access to patent data, including abstracts, chemical compounds and sequences. Sequences can appear multiple times due to the filing of the same invention with multiple patent offices, or the use of the same sequence by different inventors in different contexts. Information relating to the source invention may be incomplete, and biological information available in patent documents elsewhere may not be reflected in the annotation of the sequence. Search and analysis of these data have become increasingly challenging for both the scientific and intellectual-property communities. Here, we report a collection of non-redundant patent sequence databases, which cover the EMBL-Bank nucleotides patent class and the patent protein databases and contain value-added annotations from patent documents. The databases were created at two levels by the use of sequence MD5 checksums. Sequences within a level-1 cluster are 100% identical over their whole length. Level-2 clusters were defined by sub-grouping level-1 clusters based on patent family information. Value-added annotations, such as publication number corrections, earliest publication dates and feature collations, significantly enhance the quality of the data, allowing for better tracking and cross-referencing. The databases are available format: http://www.ebi.ac.uk/patentdata/nr/.

  2. Manual Gene Ontology annotation workflow at the Mouse Genome Informatics Database.

    PubMed

    Drabkin, Harold J; Blake, Judith A

    2012-01-01

    The Mouse Genome Database, the Gene Expression Database and the Mouse Tumor Biology database are integrated components of the Mouse Genome Informatics (MGI) resource (http://www.informatics.jax.org). The MGI system presents both a consensus view and an experimental view of the knowledge concerning the genetics and genomics of the laboratory mouse. From genotype to phenotype, this information resource integrates information about genes, sequences, maps, expression analyses, alleles, strains and mutant phenotypes. Comparative mammalian data are also presented particularly in regards to the use of the mouse as a model for the investigation of molecular and genetic components of human diseases. These data are collected from literature curation as well as downloads of large datasets (SwissProt, LocusLink, etc.). MGI is one of the founding members of the Gene Ontology (GO) and uses the GO for functional annotation of genes. Here, we discuss the workflow associated with manual GO annotation at MGI, from literature collection to display of the annotations. Peer-reviewed literature is collected mostly from a set of journals available electronically. Selected articles are entered into a master bibliography and indexed to one of eight areas of interest such as 'GO' or 'homology' or 'phenotype'. Each article is then either indexed to a gene already contained in the database or funneled through a separate nomenclature database to add genes. The master bibliography and associated indexing provide information for various curator-reports such as 'papers selected for GO that refer to genes with NO GO annotation'. Once indexed, curators who have expertise in appropriate disciplines enter pertinent information. MGI makes use of several controlled vocabularies that ensure uniform data encoding, enable robust analysis and support the construction of complex queries. These vocabularies range from pick-lists to structured vocabularies such as the GO. All data associations are supported

  3. The Co-regulation Data Harvester: Automating gene annotation starting from a transcriptome database

    NASA Astrophysics Data System (ADS)

    Tsypin, Lev M.; Turkewitz, Aaron P.

    Identifying co-regulated genes provides a useful approach for defining pathway-specific machinery in an organism. To be efficient, this approach relies on thorough genome annotation, a process much slower than genome sequencing per se. Tetrahymena thermophila, a unicellular eukaryote, has been a useful model organism and has a fully sequenced but sparsely annotated genome. One important resource for studying this organism has been an online transcriptomic database. We have developed an automated approach to gene annotation in the context of transcriptome data in T. thermophila, called the Co-regulation Data Harvester (CDH). Beginning with a gene of interest, the CDH identifies co-regulated genes by accessing the Tetrahymena transcriptome database. It then identifies their closely related genes (orthologs) in other organisms by using reciprocal BLAST searches. Finally, it collates the annotations of those orthologs' functions, which provides the user with information to help predict the cellular role of the initial query. The CDH, which is freely available, represents a powerful new tool for analyzing cell biological pathways in Tetrahymena. Moreover, to the extent that genes and pathways are conserved between organisms, the inferences obtained via the CDH should be relevant, and can be explored, in many other systems.

  4. CSE database: extended annotations and new recommendations for ECG software testing.

    PubMed

    Smíšek, Radovan; Maršánová, Lucie; Němcová, Andrea; Vítek, Martin; Kozumplík, Jiří; Nováková, Marie

    2016-12-31

    Nowadays, cardiovascular diseases represent the most common cause of death in western countries. Among various examination techniques, electrocardiography (ECG) is still a highly valuable tool used for the diagnosis of many cardiovascular disorders. In order to diagnose a person based on ECG, cardiologists can use automatic diagnostic algorithms. Research in this area is still necessary. In order to compare various algorithms correctly, it is necessary to test them on standard annotated databases, such as the Common Standards for Quantitative Electrocardiography (CSE) database. According to Scopus, the CSE database is the second most cited standard database. There were two main objectives in this work. First, new diagnoses were added to the CSE database, which extended its original annotations. Second, new recommendations for diagnostic software quality estimation were established. The ECG recordings were diagnosed by five new cardiologists independently, and in total, 59 different diagnoses were found. Such a large number of diagnoses is unique, even in terms of standard databases. Based on the cardiologists' diagnoses, a four-round consensus (4R consensus) was established. Such a 4R consensus means a correct final diagnosis, which should ideally be the output of any tested classification software. The accuracy of the cardiologists' diagnoses compared with the 4R consensus was the basis for the establishment of accuracy recommendations. The accuracy was determined in terms of sensitivity = 79.20-86.81%, positive predictive value = 79.10-87.11%, and the Jaccard coefficient = 72.21-81.14%, respectively. Within these ranges, the accuracy of the software is comparable with the accuracy of cardiologists. The accuracy quantification of the correct classification is unique. Diagnostic software developers can objectively evaluate the success of their algorithm and promote its further development. The annotations and recommendations proposed in this work will allow

  5. Annotated checklist and database for vascular plants of the Jemez Mountains

    SciTech Connect

    Foxx, T. S.; Pierce, L.; Tierney, G. D.; Hansen, L. A.

    1998-03-01

    Studies done in the last 40 years have provided information to construct a checklist of the Jemez Mountains. The present database and checklist builds on the basic list compiled by Teralene Foxx and Gail Tierney in the early 1980s. The checklist is annotated with taxonomic information, geographic and biological information, economic uses, wildlife cover, revegetation potential, and ethnographic uses. There are nearly 1000 species that have been noted for the Jemez Mountains. This list is cross-referenced with the US Department of Agriculture Natural Resource Conservation Service PLANTS database species names and acronyms. All information will soon be available on a Web Page.

  6. The Lung Image Database Consortium (LIDC) data collection process for nodule detection and annotation.

    PubMed

    McNitt-Gray, Michael F; Armato, Samuel G; Meyer, Charles R; Reeves, Anthony P; McLennan, Geoffrey; Pais, Richie C; Freymann, John; Brown, Matthew S; Engelmann, Roger M; Bland, Peyton H; Laderach, Gary E; Piker, Chris; Guo, Junfeng; Towfic, Zaid; Qing, David P-Y; Yankelevitz, David F; Aberle, Denise R; van Beek, Edwin J R; MacMahon, Heber; Kazerooni, Ella A; Croft, Barbara Y; Clarke, Laurence P

    2007-12-01

    The Lung Image Database Consortium (LIDC) is developing a publicly available database of thoracic computed tomography (CT) scans as a medical imaging research resource to promote the development of computer-aided detection or characterization of pulmonary nodules. To obtain the best estimate of the location and spatial extent of lung nodules, expert thoracic radiologists reviewed and annotated each scan. Because a consensus panel approach was neither feasible nor desirable, a unique two-phase, multicenter data collection process was developed to allow multiple radiologists at different centers to asynchronously review and annotate each CT scan. This data collection process was also intended to capture the variability among readers. Four radiologists reviewed each scan using the following process. In the first or "blinded" phase, each radiologist reviewed the CT scan independently. In the second or "unblinded" review phase, results from all four blinded reviews were compiled and presented to each radiologist for a second review, allowing the radiologists to review their own annotations together with the annotations of the other radiologists. The results of each radiologist's unblinded review were compiled to form the final unblinded review. An XML-based message system was developed to communicate the results of each reading. This two-phase data collection process was designed, tested, and implemented across the LIDC. More than 500 CT scans have been read and annotated using this method by four expert readers; these scans either are currently publicly available at http://ncia.nci.nih.gov or will be in the near future. A unique data collection process was developed, tested, and implemented that allowed multiple readers at distributed sites to asynchronously review CT scans multiple times. This process captured the opinions of each reader regarding the location and spatial extent of lung nodules.

  7. circRNADb: A comprehensive database for human circular RNAs with protein-coding annotations

    PubMed Central

    Chen, Xiaoping; Han, Ping; Zhou, Tao; Guo, Xuejiang; Song, Xiaofeng; Li, Yan

    2016-01-01

    It has been known that circular RNAs are widely expressed in human tissues and cells, and play important regulatory roles in physiological or pathological processes. However, there is lack of comprehensively annotated human circular RNAs database. In this study we established a circRNA database, named as circRNADb, containing 32,914 human exonic circRNAs carefully selected from diversified sources. The detailed information of the circRNA, including genomic information, exon splicing, genome sequence, internal ribosome entry site (IRES), open reading frame (ORF) and references were provided in circRNADb. In addition, circRNAs were found to be able to encode proteins, which have not been reported in any species. 16328 circRNAs were annotated to have ORF longer than 100 amino acids, of which 7170 have IRES elements. 46 circRNAs from 37 genes were found to have their corresponding proteins expressed according mass spectrometry. The database provides the function of data search, browse, download, submit and feedback for the user to study particular circular RNA of interest and update the database continually. circRNADb will be built to be a biological information platform for circRNA molecules and related biological functions in the future. The database can be freely available through the web server at http://reprod.njmu.edu.cn/circrnadb. PMID:27725737

  8. Annotation error in public databases: misannotation of molecular function in enzyme superfamilies.

    PubMed

    Schnoes, Alexandra M; Brown, Shoshana D; Dodevski, Igor; Babbitt, Patricia C

    2009-12-01

    Due to the rapid release of new data from genome sequencing projects, the majority of protein sequences in public databases have not been experimentally characterized; rather, sequences are annotated using computational analysis. The level of misannotation and the types of misannotation in large public databases are currently unknown and have not been analyzed in depth. We have investigated the misannotation levels for molecular function in four public protein sequence databases (UniProtKB/Swiss-Prot, GenBank NR, UniProtKB/TrEMBL, and KEGG) for a model set of 37 enzyme families for which extensive experimental information is available. The manually curated database Swiss-Prot shows the lowest annotation error levels (close to 0% for most families); the two other protein sequence databases (GenBank NR and TrEMBL) and the protein sequences in the KEGG pathways database exhibit similar and surprisingly high levels of misannotation that average 5%-63% across the six superfamilies studied. For 10 of the 37 families examined, the level of misannotation in one or more of these databases is >80%. Examination of the NR database over time shows that misannotation has increased from 1993 to 2005. The types of misannotation that were found fall into several categories, most associated with "overprediction" of molecular function. These results suggest that misannotation in enzyme superfamilies containing multiple families that catalyze different reactions is a larger problem than has been recognized. Strategies are suggested for addressing some of the systematic problems contributing to these high levels of misannotation.

  9. Manual Gene Ontology annotation workflow at the Mouse Genome Informatics Database

    PubMed Central

    Drabkin, Harold J.; Blake, Judith A.

    2012-01-01

    The Mouse Genome Database, the Gene Expression Database and the Mouse Tumor Biology database are integrated components of the Mouse Genome Informatics (MGI) resource (http://www.informatics.jax.org). The MGI system presents both a consensus view and an experimental view of the knowledge concerning the genetics and genomics of the laboratory mouse. From genotype to phenotype, this information resource integrates information about genes, sequences, maps, expression analyses, alleles, strains and mutant phenotypes. Comparative mammalian data are also presented particularly in regards to the use of the mouse as a model for the investigation of molecular and genetic components of human diseases. These data are collected from literature curation as well as downloads of large datasets (SwissProt, LocusLink, etc.). MGI is one of the founding members of the Gene Ontology (GO) and uses the GO for functional annotation of genes. Here, we discuss the workflow associated with manual GO annotation at MGI, from literature collection to display of the annotations. Peer-reviewed literature is collected mostly from a set of journals available electronically. Selected articles are entered into a master bibliography and indexed to one of eight areas of interest such as ‘GO’ or ‘homology’ or ‘phenotype’. Each article is then either indexed to a gene already contained in the database or funneled through a separate nomenclature database to add genes. The master bibliography and associated indexing provide information for various curator-reports such as ‘papers selected for GO that refer to genes with NO GO annotation’. Once indexed, curators who have expertise in appropriate disciplines enter pertinent information. MGI makes use of several controlled vocabularies that ensure uniform data encoding, enable robust analysis and support the construction of complex queries. These vocabularies range from pick-lists to structured vocabularies such as the GO. All data associations

  10. Biological Database of Images and Genomes: tools for community annotations linking image and genomic information

    PubMed Central

    Oberlin, Andrew T; Jurkovic, Dominika A; Balish, Mitchell F; Friedberg, Iddo

    2013-01-01

    Genomic data and biomedical imaging data are undergoing exponential growth. However, our understanding of the phenotype–genotype connection linking the two types of data is lagging behind. While there are many types of software that enable the manipulation and analysis of image data and genomic data as separate entities, there is no framework established for linking the two. We present a generic set of software tools, BioDIG, that allows linking of image data to genomic data. BioDIG tools can be applied to a wide range of research problems that require linking images to genomes. BioDIG features the following: rapid construction of web-based workbenches, community-based annotation, user management and web services. By using BioDIG to create websites, researchers and curators can rapidly annotate a large number of images with genomic information. Here we present the BioDIG software tools that include an image module, a genome module and a user management module. We also introduce a BioDIG-based website, MyDIG, which is being used to annotate images of mycoplasmas. Database URL: BioDIG website: http://biodig.org BioDIG source code repository: http://github.com/FriedbergLab/BioDIG The MyDIG database: http://mydig.biodig.org/ PMID:23550062

  11. The Saccharomyces Genome Database: Gene Product Annotation of Function, Process, and Component.

    PubMed

    Cherry, J Michael

    2015-12-02

    An ontology is a highly structured form of controlled vocabulary. Each entry in the ontology is commonly called a term. These terms are used when talking about an annotation. However, each term has a definition that, like the definition of a word found within a dictionary, provides the complete usage and detailed explanation of the term. It is critical to consult a term's definition because the distinction between terms can be subtle. The use of ontologies in biology started as a way of unifying communication between scientific communities and to provide a standard dictionary for different topics, including molecular functions, biological processes, mutant phenotypes, chemical properties and structures. The creation of ontology terms and their definitions often requires debate to reach agreement but the result has been a unified descriptive language used to communicate knowledge. In addition to terms and definitions, ontologies require a relationship used to define the type of connection between terms. In an ontology, a term can have more than one parent term, the term above it in an ontology, as well as more than one child, the term below it in the ontology. Many ontologies are used to construct annotations in the Saccharomyces Genome Database (SGD), as in all modern biological databases; however, Gene Ontology (GO), a descriptive system used to categorize gene function, is the most extensively used ontology in SGD annotations. Examples included in this protocol illustrate the structure and features of this ontology.

  12. DNApod: DNA polymorphism annotation database from next-generation sequence read archives

    PubMed Central

    Mochizuki, Takako; Tanizawa, Yasuhiro; Fujisawa, Takatomo; Ohta, Tazro; Nikoh, Naruo; Shimizu, Tokurou; Toyoda, Atsushi; Fujiyama, Asao; Kurata, Nori; Nagasaki, Hideki; Kaminuma, Eli; Nakamura, Yasukazu

    2017-01-01

    With the rapid advances in next-generation sequencing (NGS), datasets for DNA polymorphisms among various species and strains have been produced, stored, and distributed. However, reliability varies among these datasets because the experimental and analytical conditions used differ among assays. Furthermore, such datasets have been frequently distributed from the websites of individual sequencing projects. It is desirable to integrate DNA polymorphism data into one database featuring uniform quality control that is distributed from a single platform at a single place. DNA polymorphism annotation database (DNApod; http://tga.nig.ac.jp/dnapod/) is an integrated database that stores genome-wide DNA polymorphism datasets acquired under uniform analytical conditions, and this includes uniformity in the quality of the raw data, the reference genome version, and evaluation algorithms. DNApod genotypic data are re-analyzed whole-genome shotgun datasets extracted from sequence read archives, and DNApod distributes genome-wide DNA polymorphism datasets and known-gene annotations for each DNA polymorphism. This new database was developed for storing genome-wide DNA polymorphism datasets of plants, with crops being the first priority. Here, we describe our analyzed data for 679, 404, and 66 strains of rice, maize, and sorghum, respectively. The analytical methods are available as a DNApod workflow in an NGS annotation system of the DNA Data Bank of Japan and a virtual machine image. Furthermore, DNApod provides tables of links of identifiers between DNApod genotypic data and public phenotypic data. To advance the sharing of organism knowledge, DNApod offers basic and ubiquitous functions for multiple alignment and phylogenetic tree construction by using orthologous gene information. PMID:28234924

  13. BambooGDB: a bamboo genome database with functional annotation and an analysis platform

    PubMed Central

    Zhao, Hansheng; Peng, Zhenhua; Fei, Benhua; Li, Lubin; Hu, Tao; Gao, Zhimin; Jiang, Zehui

    2014-01-01

    Bamboo, as one of the most important non-timber forest products and fastest-growing plants in the world, represents the only major lineage of grasses that is native to forests. Recent success on the first high-quality draft genome sequence of moso bamboo (Phyllostachys edulis) provides new insights on bamboo genetics and evolution. To further extend our understanding on bamboo genome and facilitate future studies on the basis of previous achievements, here we have developed BambooGDB, a bamboo genome database with functional annotation and analysis platform. The de novo sequencing data, together with the full-length complementary DNA and RNA-seq data of moso bamboo composed the main contents of this database. Based on these sequence data, a comprehensively functional annotation for bamboo genome was made. Besides, an analytical platform composed of comparative genomic analysis, protein–protein interactions network, pathway analysis and visualization of genomic data was also constructed. As discovery tools to understand and identify biological mechanisms of bamboo, the platform can be used as a systematic framework for helping and designing experiments for further validation. Moreover, diverse and powerful search tools and a convenient browser were incorporated to facilitate the navigation of these data. As far as we know, this is the first genome database for bamboo. Through integrating high-throughput sequencing data, a full functional annotation and several analysis modules, BambooGDB aims to provide worldwide researchers with a central genomic resource and an extensible analysis platform for bamboo genome. BambooGDB is freely available at http://www.bamboogdb.org/. Database URL: http://www.bamboogdb.org PMID:24602877

  14. Design and implementation of a database for Brucella melitensis genome annotation.

    PubMed

    De Hertogh, Benoît; Lahlimi, Leïla; Lambert, Christophe; Letesson, Jean-Jacques; Depiereux, Eric

    2008-03-18

    The genome sequences of three Brucella biovars and of some species close to Brucella sp. have become available, leading to new relationship analysis. Moreover, the automatic genome annotation of the pathogenic bacteria Brucella melitensis has been manually corrected by a consortium of experts, leading to 899 modifications of start sites predictions among the 3198 open reading frames (ORFs) examined. This new annotation, coupled with the results of automatic annotation tools of the complete genome sequences of the B. melitensis genome (including BLASTs to 9 genomes close to Brucella), provides numerous data sets related to predicted functions, biochemical properties and phylogenic comparisons. To made these results available, alphaPAGe, a functional auto-updatable database of the corrected sequence genome of B. melitensis, has been built, using the entity-relationship (ER) approach and a multi-purpose database structure. A friendly graphical user interface has been designed, and users can carry out different kinds of information by three levels of queries: (1) the basic search use the classical keywords or sequence identifiers; (2) the original advanced search engine allows to combine (by using logical operators) numerous criteria: (a) keywords (textual comparison) related to the pCDS's function, family domains and cellular localization; (b) physico-chemical characteristics (numerical comparison) such as isoelectric point or molecular weight and structural criteria such as the nucleic length or the number of transmembrane helix (TMH); (c) similarity scores with Escherichia coli and 10 species phylogenetically close to B. melitensis; (3) complex queries can be performed by using a SQL field, which allows all queries respecting the database's structure. The database is publicly available through a Web server at the following url: http://www.fundp.ac.be/urbm/bioinfo/aPAGe.

  15. DNApod: DNA polymorphism annotation database from next-generation sequence read archives.

    PubMed

    Mochizuki, Takako; Tanizawa, Yasuhiro; Fujisawa, Takatomo; Ohta, Tazro; Nikoh, Naruo; Shimizu, Tokurou; Toyoda, Atsushi; Fujiyama, Asao; Kurata, Nori; Nagasaki, Hideki; Kaminuma, Eli; Nakamura, Yasukazu

    2017-01-01

    With the rapid advances in next-generation sequencing (NGS), datasets for DNA polymorphisms among various species and strains have been produced, stored, and distributed. However, reliability varies among these datasets because the experimental and analytical conditions used differ among assays. Furthermore, such datasets have been frequently distributed from the websites of individual sequencing projects. It is desirable to integrate DNA polymorphism data into one database featuring uniform quality control that is distributed from a single platform at a single place. DNA polymorphism annotation database (DNApod; http://tga.nig.ac.jp/dnapod/) is an integrated database that stores genome-wide DNA polymorphism datasets acquired under uniform analytical conditions, and this includes uniformity in the quality of the raw data, the reference genome version, and evaluation algorithms. DNApod genotypic data are re-analyzed whole-genome shotgun datasets extracted from sequence read archives, and DNApod distributes genome-wide DNA polymorphism datasets and known-gene annotations for each DNA polymorphism. This new database was developed for storing genome-wide DNA polymorphism datasets of plants, with crops being the first priority. Here, we describe our analyzed data for 679, 404, and 66 strains of rice, maize, and sorghum, respectively. The analytical methods are available as a DNApod workflow in an NGS annotation system of the DNA Data Bank of Japan and a virtual machine image. Furthermore, DNApod provides tables of links of identifiers between DNApod genotypic data and public phenotypic data. To advance the sharing of organism knowledge, DNApod offers basic and ubiquitous functions for multiple alignment and phylogenetic tree construction by using orthologous gene information.

  16. LNCipedia: a database for annotated human lncRNA transcript sequences and structures

    PubMed Central

    Volders, Pieter-Jan; Helsens, Kenny; Wang, Xiaowei; Menten, Björn; Martens, Lennart; Gevaert, Kris; Vandesompele, Jo; Mestdagh, Pieter

    2013-01-01

    Here, we present LNCipedia (http://www.lncipedia.org), a novel database for human long non-coding RNA (lncRNA) transcripts and genes. LncRNAs constitute a large and diverse class of non-coding RNA genes. Although several lncRNAs have been functionally annotated, the majority remains to be characterized. Different high-throughput methods to identify new lncRNAs (including RNA sequencing and annotation of chromatin-state maps) have been applied in various studies resulting in multiple unrelated lncRNA data sets. LNCipedia offers 21 488 annotated human lncRNA transcripts obtained from different sources. In addition to basic transcript information and gene structure, several statistics are determined for each entry in the database, such as secondary structure information, protein coding potential and microRNA binding sites. Our analyses suggest that, much like microRNAs, many lncRNAs have a significant secondary structure, in-line with their presumed association with proteins or protein complexes. Available literature on specific lncRNAs is linked, and users or authors can submit articles through a web interface. Protein coding potential is assessed by two different prediction algorithms: Coding Potential Calculator and HMMER. In addition, a novel strategy has been integrated for detecting potentially coding lncRNAs by automatically re-analysing the large body of publicly available mass spectrometry data in the PRIDE database. LNCipedia is publicly available and allows users to query and download lncRNA sequences and structures based on different search criteria. The database may serve as a resource to initiate small- and large-scale lncRNA studies. As an example, the LNCipedia content was used to develop a custom microarray for expression profiling of all available lncRNAs. PMID:23042674

  17. BambooGDB: a bamboo genome database with functional annotation and an analysis platform.

    PubMed

    Zhao, Hansheng; Peng, Zhenhua; Fei, Benhua; Li, Lubin; Hu, Tao; Gao, Zhimin; Jiang, Zehui

    2014-01-01

    Bamboo, as one of the most important non-timber forest products and fastest-growing plants in the world, represents the only major lineage of grasses that is native to forests. Recent success on the first high-quality draft genome sequence of moso bamboo (Phyllostachys edulis) provides new insights on bamboo genetics and evolution. To further extend our understanding on bamboo genome and facilitate future studies on the basis of previous achievements, here we have developed BambooGDB, a bamboo genome database with functional annotation and analysis platform. The de novo sequencing data, together with the full-length complementary DNA and RNA-seq data of moso bamboo composed the main contents of this database. Based on these sequence data, a comprehensively functional annotation for bamboo genome was made. Besides, an analytical platform composed of comparative genomic analysis, protein-protein interactions network, pathway analysis and visualization of genomic data was also constructed. As discovery tools to understand and identify biological mechanisms of bamboo, the platform can be used as a systematic framework for helping and designing experiments for further validation. Moreover, diverse and powerful search tools and a convenient browser were incorporated to facilitate the navigation of these data. As far as we know, this is the first genome database for bamboo. Through integrating high-throughput sequencing data, a full functional annotation and several analysis modules, BambooGDB aims to provide worldwide researchers with a central genomic resource and an extensible analysis platform for bamboo genome. BambooGDB is freely available at http://www.bamboogdb.org/. Database URL: http://www.bamboogdb.org.

  18. RefSeq microbial genomes database: new representation and annotation strategy

    PubMed Central

    Tatusova, Tatiana; Ciufo, Stacy; Fedorov, Boris; O’Neill, Kathleen; Tolstoy, Igor

    2014-01-01

    The source of the microbial genomic sequences in the RefSeq collection is the set of primary sequence records submitted to the International Nucleotide Sequence Database public archives. These can be accessed through the Entrez search and retrieval system at http://www.ncbi.nlm.nih.gov/genome. Next-generation sequencing has enabled researchers to perform genomic sequencing at rates that were unimaginable in the past. Microbial genomes can now be sequenced in a matter of hours, which has led to a significant increase in the number of assembled genomes deposited in the public archives. This huge increase in DNA sequence data presents new challenges for the annotation, analysis and visualization bioinformatics tools. New strategies have been developed for the annotation and representation of reference genomes and sequence variations derived from population studies and clinical outbreaks. PMID:24316578

  19. RefSeq microbial genomes database: new representation and annotation strategy.

    PubMed

    Tatusova, Tatiana; Ciufo, Stacy; Fedorov, Boris; O'Neill, Kathleen; Tolstoy, Igor

    2014-01-01

    The source of the microbial genomic sequences in the RefSeq collection is the set of primary sequence records submitted to the International Nucleotide Sequence Database public archives. These can be accessed through the Entrez search and retrieval system at http://www.ncbi.nlm.nih.gov/genome. Next-generation sequencing has enabled researchers to perform genomic sequencing at rates that were unimaginable in the past. Microbial genomes can now be sequenced in a matter of hours, which has led to a significant increase in the number of assembled genomes deposited in the public archives. This huge increase in DNA sequence data presents new challenges for the annotation, analysis and visualization bioinformatics tools. New strategies have been developed for the annotation and representation of reference genomes and sequence variations derived from population studies and clinical outbreaks.

  20. Construction of Customized Sub-Databases from NCBI-nr Database for Rapid Annotation of Huge Metagenomic Datasets Using a Combined BLAST and MEGAN Approach

    PubMed Central

    Yu, Ke; Zhang, Tong

    2013-01-01

    We developed a fast method to construct local sub-databases from the NCBI-nr database for the quick similarity search and annotation of huge metagenomic datasets based on BLAST-MEGAN approach. A three-step sub-database annotation pipeline (SAP) was further proposed to conduct the annotation in a much more time-efficient way which required far less computational capacity than the direct NCBI-nr database BLAST-MEGAN approach. The 1st BLAST of SAP was conducted using the original metagenomic dataset against the constructed sub-database for a quick screening of candidate target sequences. Then, the candidate target sequences identified in the 1st BLAST were subjected to the 2nd BLAST against the whole NCBI-nr database. The BLAST results were finally annotated using MEGAN to filter out those mistakenly selected sequences in the 1st BLAST to guarantee the accuracy of the results. Based on the tests conducted in this study, SAP achieved a speedup of ∼150–385 times at the BLAST e-value of 1e–5, compared to the direct BLAST against NCBI-nr database. The annotation results of SAP are exactly in agreement with those of the direct NCBI-nr database BLAST-MEGAN approach, which is very time-consuming and computationally intensive. Selecting rigorous thresholds (e.g. e-value of 1e–10) would further accelerate SAP process. The SAP pipeline may also be coupled with novel similarity search tools (e.g. RAPsearch) other than BLAST to achieve even faster annotation of huge metagenomic datasets. Above all, this sub-database construction method and SAP pipeline provides a new time-efficient and convenient annotation similarity search strategy for laboratories without access to high performance computing facilities. SAP also offers a solution to high performance computing facilities for the processing of more similarity search tasks. PMID:23573212

  1. Construction of customized sub-databases from NCBI-nr database for rapid annotation of huge metagenomic datasets using a combined BLAST and MEGAN approach.

    PubMed

    Yu, Ke; Zhang, Tong

    2013-01-01

    We developed a fast method to construct local sub-databases from the NCBI-nr database for the quick similarity search and annotation of huge metagenomic datasets based on BLAST-MEGAN approach. A three-step sub-database annotation pipeline (SAP) was further proposed to conduct the annotation in a much more time-efficient way which required far less computational capacity than the direct NCBI-nr database BLAST-MEGAN approach. The 1(st) BLAST of SAP was conducted using the original metagenomic dataset against the constructed sub-database for a quick screening of candidate target sequences. Then, the candidate target sequences identified in the 1(st) BLAST were subjected to the 2(nd) BLAST against the whole NCBI-nr database. The BLAST results were finally annotated using MEGAN to filter out those mistakenly selected sequences in the 1(st) BLAST to guarantee the accuracy of the results. Based on the tests conducted in this study, SAP achieved a speedup of ~150-385 times at the BLAST e-value of 1e-5, compared to the direct BLAST against NCBI-nr database. The annotation results of SAP are exactly in agreement with those of the direct NCBI-nr database BLAST-MEGAN approach, which is very time-consuming and computationally intensive. Selecting rigorous thresholds (e.g. e-value of 1e-10) would further accelerate SAP process. The SAP pipeline may also be coupled with novel similarity search tools (e.g. RAPsearch) other than BLAST to achieve even faster annotation of huge metagenomic datasets. Above all, this sub-database construction method and SAP pipeline provides a new time-efficient and convenient annotation similarity search strategy for laboratories without access to high performance computing facilities. SAP also offers a solution to high performance computing facilities for the processing of more similarity search tasks.

  2. MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects.

    PubMed

    Holt, Carson; Yandell, Mark

    2011-12-22

    Second-generation sequencing technologies are precipitating major shifts with regards to what kinds of genomes are being sequenced and how they are annotated. While the first generation of genome projects focused on well-studied model organisms, many of today's projects involve exotic organisms whose genomes are largely terra incognita. This complicates their annotation, because unlike first-generation projects, there are no pre-existing 'gold-standard' gene-models with which to train gene-finders. Improvements in genome assembly and the wide availability of mRNA-seq data are also creating opportunities to update and re-annotate previously published genome annotations. Today's genome projects are thus in need of new genome annotation tools that can meet the challenges and opportunities presented by second-generation sequencing technologies. We present MAKER2, a genome annotation and data management tool designed for second-generation genome projects. MAKER2 is a multi-threaded, parallelized application that can process second-generation datasets of virtually any size. We show that MAKER2 can produce accurate annotations for novel genomes where training-data are limited, of low quality or even non-existent. MAKER2 also provides an easy means to use mRNA-seq data to improve annotation quality; and it can use these data to update legacy annotations, significantly improving their quality. We also show that MAKER2 can evaluate the quality of genome annotations, and identify and prioritize problematic annotations for manual review. MAKER2 is the first annotation engine specifically designed for second-generation genome projects. MAKER2 scales to datasets of any size, requires little in the way of training data, and can use mRNA-seq data to improve annotation quality. It can also update and manage legacy genome annotation datasets.

  3. MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects

    PubMed Central

    2011-01-01

    Background Second-generation sequencing technologies are precipitating major shifts with regards to what kinds of genomes are being sequenced and how they are annotated. While the first generation of genome projects focused on well-studied model organisms, many of today's projects involve exotic organisms whose genomes are largely terra incognita. This complicates their annotation, because unlike first-generation projects, there are no pre-existing 'gold-standard' gene-models with which to train gene-finders. Improvements in genome assembly and the wide availability of mRNA-seq data are also creating opportunities to update and re-annotate previously published genome annotations. Today's genome projects are thus in need of new genome annotation tools that can meet the challenges and opportunities presented by second-generation sequencing technologies. Results We present MAKER2, a genome annotation and data management tool designed for second-generation genome projects. MAKER2 is a multi-threaded, parallelized application that can process second-generation datasets of virtually any size. We show that MAKER2 can produce accurate annotations for novel genomes where training-data are limited, of low quality or even non-existent. MAKER2 also provides an easy means to use mRNA-seq data to improve annotation quality; and it can use these data to update legacy annotations, significantly improving their quality. We also show that MAKER2 can evaluate the quality of genome annotations, and identify and prioritize problematic annotations for manual review. Conclusions MAKER2 is the first annotation engine specifically designed for second-generation genome projects. MAKER2 scales to datasets of any size, requires little in the way of training data, and can use mRNA-seq data to improve annotation quality. It can also update and manage legacy genome annotation datasets. PMID:22192575

  4. Analysis of Annotated Data Models for Improving Data Quality.

    PubMed

    Ulrich, Hannes; Kock-Schoppenhauer, Ann-Kristin; Andersen, Björn; Ingenerf, Josef

    2017-01-01

    The public Medical Data Models (MDM) portal with more than 9.000 annotated forms from clinical trials and other sources provides many research opportunities for the medical informatics community. It is mainly used to address the problem of heterogeneity by searching, mediating, reusing, and assessing data models, e. g. the semi-interactive curation of core data records in a special domain. Furthermore, it can be used as a benchmark for evaluating algorithms that create, transform, annotate, and analyse structured patient data. Using CDISC ODM for syntactically representing all data models in the MDM portal, there are semi-automatically added UMLS CUIs at several ODM levels like ItemGroupDef, ItemDef, or CodeList item. This can improve the interpretability and processability of the received information, but only if the coded information is correct and reliable. This raises the question how to assure that semantically similar datasets are also processed and classified similarly. In this work, a (semi-)automatic approach to analyse and assess items, questions, and data elements in clinical studies is described. The approach uses a hybrid evaluation process to rate and propose semantic annotations for under-specified trial items. The evaluation algorithm operates with the commonly used NLM MetaMap to provide UMLS support and corpus-based proposal algorithms to link datasets from the provided CDISC ODM item pool.

  5. Methods for eliciting, annotating, and analyzing databases for child speech development.

    PubMed

    Beckman, Mary E; Plummer, Andrew R; Munson, Benjamin; Reidy, Patrick F

    2017-09-01

    Methods from automatic speech recognition (ASR), such as segmentation and forced alignment, have facilitated the rapid annotation and analysis of very large adult speech databases and databases of caregiver-infant interaction, enabling advances in speech science that were unimaginable just a few decades ago. This paper centers on two main problems that must be addressed in order to have analogous resources for developing and exploiting databases of young children's speech. The first problem is to understand and appreciate the differences between adult and child speech that cause ASR models developed for adult speech to fail when applied to child speech. These differences include the fact that children's vocal tracts are smaller than those of adult males and also changing rapidly in size and shape over the course of development, leading to between-talker variability across age groups that dwarfs the between-talker differences between adult men and women. Moreover, children do not achieve fully adult-like speech motor control until they are young adults, and their vocabularies and phonological proficiency are developing as well, leading to considerably more within-talker variability as well as more between-talker variability. The second problem then is to determine what annotation schemas and analysis techniques can most usefully capture relevant aspects of this variability. Indeed, standard acoustic characterizations applied to child speech reveal that adult-centered annotation schemas fail to capture phenomena such as the emergence of covert contrasts in children's developing phonological systems, while also revealing children's nonuniform progression toward community speech norms as they acquire the phonological systems of their native languages. Both problems point to the need for more basic research into the growth and development of the articulatory system (as well as of the lexicon and phonological system) that is oriented explicitly toward the construction of

  6. Designing of Ground Truth Annotated DBT-TU-JU Breast Thermogram Database towards Early Abnormality Prediction.

    PubMed

    Bhowmik, Mrinal Kanti; Gogoi, Usha Rani; Majumdar, Gautam; Bhattacharjee, Debotosh; Datta, Dhritiman; Ghosh, Anjan Kumar

    2017-08-17

    The advancement of research in a specific area of clinical diagnosis crucially depends on the availability and quality of the radiology and other test related databases accompanied by ground truth and additional necessary medical findings. The paper describes the creation of the Department of Biotechnology-Tripura University-Jadavpur University (DBT-TU-JU) breast thermogram database. The objective of creating the DBT-TU-JU database is to provide a breast thermogram database that is annotated with the ground truth images of the suspicious regions. Along with the result of breast thermography, the database comprises of the results of other breast imaging methodologies. A standard breast thermogram acquisition protocol suite comprising of several critical factors has been designed for the collection of breast thermograms. Currently, the DBT-TU-JU database contains 1100 breast thermograms of 100 subjects. Due to the necessity of evaluating any breast abnormality detection system, this study emphasizes the generation of the ground truth images of the hotspot areas, whose presence in a breast thermogram signifies the presence of breast abnormality. With the generated ground truth images, we compared the results of six state-of-the-art image segmentation methods using five supervised evaluation metrics to identify the proficient segmentation methods for hotspot extraction. Based on the evaluation results, the Fractional-Order Darwinian particle swarm optimization, Region growing, Mean shift and Fuzzy c-means clustering are found to be more efficient in comparison to k-means clustering and Threshold based segmentation methods.

  7. ASAP: automated sequence annotation pipeline for web-based updating of sequence information with a local dynamic database.

    PubMed

    Kossenkov, Andrew; Manion, Frank J; Korotkov, Eugene; Moloshok, Thomas D; Ochs, Michael F

    2003-03-22

    The automated sequence annotation pipeline (ASAP) is designed to ease routine investigation of new functional annotations on unknown sequences, such as expressed sequence tags (ESTs), through querying of web-accessible resources and maintenance of a local database. The system allows easy use of the output from one search as the input for a new search, as well as the filtering of results. The database is used to store formats and parameters and information for parsing data from web sites. The database permits easy updating of format information should a site modify the format of a query or of a returned web page.

  8. modbase, a database of annotated comparative protein structure models and associated resources

    PubMed Central

    Pieper, Ursula; Eswar, Narayanan; Webb, Ben M.; Eramian, David; Kelly, Libusha; Barkan, David T.; Carter, Hannah; Mankoo, Parminder; Karchin, Rachel; Marti-Renom, Marc A.; Davis, Fred P.; Sali, Andrej

    2009-01-01

    MODBASE (http://salilab.org/modbase) is a database of annotated comparative protein structure models. The models are calculated by MODPIPE, an automated modeling pipeline that relies primarily on MODELLER for fold assignment, sequence–structure alignment, model building and model assessment (http:/salilab.org/modeller). MODBASE currently contains 5 152 695 reliable models for domains in 1 593 209 unique protein sequences; only models based on statistically significant alignments and/or models assessed to have the correct fold are included. MODBASE also allows users to calculate comparative models on demand, through an interface to the MODWEB modeling server (http://salilab.org/modweb). Other resources integrated with MODBASE include databases of multiple protein structure alignments (DBAli), structurally defined ligand binding sites (LIGBASE), predicted ligand binding sites (AnnoLyze), structurally defined binary domain interfaces (PIBASE) and annotated single nucleotide polymorphisms and somatic mutations found in human proteins (LS-SNP, LS-Mut). MODBASE models are also available through the Protein Model Portal (http://www.proteinmodelportal.org/). PMID:18948282

  9. Improving tRNAscan-SE Annotation Results via Ensemble Classifiers.

    PubMed

    Zou, Quan; Guo, Jiasheng; Ju, Ying; Wu, Meihong; Zeng, Xiangxiang; Hong, Zhiling

    2015-11-01

    tRNAScan-SE is a tRNA detection program that is widely used for tRNA annotation; however, the false positive rate of tRNAScan-SE is unacceptable for large sequences. Here, we used a machine learning method to try to improve the tRNAScan-SE results. A new predictor, tRNA-Predict, was designed. We obtained real and pseudo-tRNA sequences as training data sets using tRNAScan-SE and constructed three different tRNA feature sets. We then set up an ensemble classifier, LibMutil, to predict tRNAs from the training data. The positive data set of 623 tRNA sequences was obtained from tRNAdb 2009 and the negative data set was the false positive tRNAs predicted by tRNAscan-SE. Our in silico experiments revealed a prediction accuracy rate of 95.1 % for tRNA-Predict using 10-fold cross-validation. tRNA-Predict was developed to distinguish functional tRNAs from pseudo-tRNAs rather than to predict tRNAs from a genome-wide scan. However, tRNA-Predict can work with the output of tRNAscan-SE, which is a genome-wide scanning method, to improve the tRNAscan-SE annotation results. The tRNA-Predict web server is accessible at http://datamining.xmu.edu.cn/∼gjs/tRNA-Predict.

  10. A statistical framework for improving genomic annotations of prokaryotic essential genes.

    PubMed

    Deng, Jingyuan; Su, Shengchang; Lin, Xiaodong; Hassett, Daniel J; Lu, Long Jason

    2013-01-01

    Large-scale systematic analysis of gene essentiality is an important step closer toward unraveling the complex relationship between genotypes and phenotypes. Such analysis cannot be accomplished without unbiased and accurate annotations of essential genes. In current genomic databases, most of the essential gene annotations are derived from whole-genome transposon mutagenesis (TM), the most frequently used experimental approach for determining essential genes in microorganisms under defined conditions. However, there are substantial systematic biases associated with TM experiments. In this study, we developed a novel Poisson model-based statistical framework to simulate the TM insertion process and subsequently correct the experimental biases. We first quantitatively assessed the effects of major factors that potentially influence the accuracy of TM and subsequently incorporated relevant factors into the framework. Through iteratively optimizing parameters, we inferred the actual insertion events occurred and described each gene's essentiality on probability measure. Evaluated by the definite mapping of essential gene profile in Escherichia coli, our model significantly improved the accuracy of original TM datasets, resulting in more accurate annotations of essential genes. Our method also showed encouraging results in improving subsaturation level TM datasets. To test our model's broad applicability to other bacteria, we applied it to Pseudomonas aeruginosa PAO1 and Francisella tularensis novicida TM datasets. We validated our predictions by literature as well as allelic exchange experiments in PAO1. Our model was correct on six of the seven tested genes. Remarkably, among all three cases that our predictions contradicted the TM assignments, experimental validations supported our predictions. In summary, our method will be a promising tool in improving genomic annotations of essential genes and enabling large-scale explorations of gene essentiality. Our

  11. CyanoBase and RhizoBase: databases of manually curated annotations for cyanobacterial and rhizobial genomes.

    PubMed

    Fujisawa, Takatomo; Okamoto, Shinobu; Katayama, Toshiaki; Nakao, Mitsuteru; Yoshimura, Hidehisa; Kajiya-Kanegae, Hiromi; Yamamoto, Sumiko; Yano, Chiyoko; Yanaka, Yuka; Maita, Hiroko; Kaneko, Takakazu; Tabata, Satoshi; Nakamura, Yasukazu

    2014-01-01

    To understand newly sequenced genomes of closely related species, comprehensively curated reference genome databases are becoming increasingly important. We have extended CyanoBase (http://genome.microbedb.jp/cyanobase), a genome database for cyanobacteria, and newly developed RhizoBase (http://genome.microbedb.jp/rhizobase), a genome database for rhizobia, nitrogen-fixing bacteria associated with leguminous plants. Both databases focus on the representation and reusability of reference genome annotations, which are continuously updated by manual curation. Domain experts have extracted names, products and functions of each gene reported in the literature. To ensure effectiveness of this procedure, we developed the TogoAnnotation system offering a web-based user interface and a uniform storage of annotations for the curators of the CyanoBase and RhizoBase databases. The number of references investigated for CyanoBase increased from 2260 in our previous report to 5285, and for RhizoBase, we perused 1216 references. The results of these intensive annotations are displayed on the GeneView pages of each database. Advanced users can also retrieve this information through the representational state transfer-based web application programming interface in an automated manner.

  12. An Improved Database System for Program Assessment

    ERIC Educational Resources Information Center

    Haga, Wayne; Morris, Gerard; Morrell, Joseph S.

    2011-01-01

    This research paper presents a database management system for tracking course assessment data and reporting related outcomes for program assessment. It improves on a database system previously presented by the authors and in use for two years. The database system presented is specific to assessment for ABET (Accreditation Board for Engineering and…

  13. eDGAR: a database of Disease-Gene Associations with annotated Relationships among genes.

    PubMed

    Babbi, Giulia; Martelli, Pier Luigi; Profiti, Giuseppe; Bovo, Samuele; Savojardo, Castrense; Casadio, Rita

    2017-08-11

    Genetic investigations, boosted by modern sequencing techniques, allow dissecting the genetic component of different phenotypic traits. These efforts result in the compilation of lists of genes related to diseases and show that an increasing number of diseases is associated with multiple genes. Investigating functional relations among genes associated with the same disease contributes to highlighting molecular mechanisms of the pathogenesis. We present eDGAR, a database collecting and organizing the data on gene/disease associations as derived from OMIM, Humsavar and ClinVar. For each disease-associated gene, eDGAR collects information on its annotation. Specifically, for lists of genes, eDGAR provides information on: i) interactions retrieved from PDB, BIOGRID and STRING; ii) co-occurrence in stable and functional structural complexes; iii) shared Gene Ontology annotations; iv) shared KEGG and REACTOME pathways; v) enriched functional annotations computed with NET-GE; vi) regulatory interactions derived from TRRUST; vii) localization on chromosomes and/or co-localisation in neighboring loci. The present release of eDGAR includes 2672 diseases, related to 3658 different genes, for a total number of 5729 gene-disease associations. 71% of the genes are linked to 621 multigenic diseases and eDGAR highlights their common GO terms, KEGG/REACTOME pathways, physical and regulatory interactions. eDGAR includes a network based enrichment method for detecting statistically significant functional terms associated to groups of genes. eDGAR offers a resource to analyze disease-gene associations. In multigenic diseases genes can share physical interactions and/or co-occurrence in the same functional processes. eDGAR is freely available at: edgar.biocomp.unibo.it.

  14. RepeatsDB 2.0: improved annotation, classification, search and visualization of repeat protein structures.

    PubMed

    Paladin, Lisanna; Hirsh, Layla; Piovesan, Damiano; Andrade-Navarro, Miguel A; Kajava, Andrey V; Tosatto, Silvio C E

    2017-01-04

    RepeatsDB 2.0 (URL: http://repeatsdb.bio.unipd.it/) is an update of the database of annotated tandem repeat protein structures. Repeat proteins are a widespread class of non-globular proteins carrying heterogeneous functions involved in several diseases. Here we provide a new version of RepeatsDB with an improved classification schema including high quality annotations for ∼5400 protein structures. RepeatsDB 2.0 features information on start and end positions for the repeat regions and units for all entries. The extensive growth of repeat unit characterization was possible by applying the novel ReUPred annotation method over the entire Protein Data Bank, with data quality is guaranteed by an extensive manual validation for >60% of the entries. The updated web interface includes a new search engine for complex queries and a fully re-designed entry page for a better overview of structural data. It is now possible to compare unit positions, together with secondary structure, fold information and Pfam domains. Moreover, a new classification level has been introduced on top of the existing scheme as an independent layer for sequence similarity relationships at 40%, 60% and 90% identity. © The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research.

  15. RepeatsDB 2.0: improved annotation, classification, search and visualization of repeat protein structures

    PubMed Central

    Paladin, Lisanna; Hirsh, Layla; Piovesan, Damiano; Andrade-Navarro, Miguel A.; Kajava, Andrey V.; Tosatto, Silvio C.E.

    2017-01-01

    RepeatsDB 2.0 (URL: http://repeatsdb.bio.unipd.it/) is an update of the database of annotated tandem repeat protein structures. Repeat proteins are a widespread class of non-globular proteins carrying heterogeneous functions involved in several diseases. Here we provide a new version of RepeatsDB with an improved classification schema including high quality annotations for ∼5400 protein structures. RepeatsDB 2.0 features information on start and end positions for the repeat regions and units for all entries. The extensive growth of repeat unit characterization was possible by applying the novel ReUPred annotation method over the entire Protein Data Bank, with data quality is guaranteed by an extensive manual validation for >60% of the entries. The updated web interface includes a new search engine for complex queries and a fully re-designed entry page for a better overview of structural data. It is now possible to compare unit positions, together with secondary structure, fold information and Pfam domains. Moreover, a new classification level has been introduced on top of the existing scheme as an independent layer for sequence similarity relationships at 40%, 60% and 90% identity. PMID:27899671

  16. deepBase: a database for deeply annotating and mining deep sequencing data

    PubMed Central

    Yang, Jian-Hua; Shao, Peng; Zhou, Hui; Chen, Yue-Qin; Qu, Liang-Hu

    2010-01-01

    Advances in high-throughput next-generation sequencing technology have reshaped the transcriptomic research landscape. However, exploration of these massive data remains a daunting challenge. In this study, we describe a novel database, deepBase, which we have developed to facilitate the comprehensive annotation and discovery of small RNAs from transcriptomic data. The current release of deepBase contains deep sequencing data from 185 small RNA libraries from diverse tissues and cell lines of seven organisms: human, mouse, chicken, Ciona intestinalis, Drosophila melanogaster, Caenhorhabditis elegans and Arabidopsis thaliana. By analyzing ∼14.6 million unique reads that perfectly mapped to more than 284 million genomic loci, we annotated and identified ∼380 000 unique ncRNA-associated small RNAs (nasRNAs), ∼1.5 million unique promoter-associated small RNAs (pasRNAs), ∼4.0 million unique exon-associated small RNAs (easRNAs) and ∼6 million unique repeat-associated small RNAs (rasRNAs). Furthermore, 2038 miRNA and 1889 snoRNA candidates were predicted by miRDeep and snoSeeker. All of the mapped reads can be grouped into about 1.2 million RNA clusters. For the purpose of comparative analysis, deepBase provides an integrative, interactive and versatile display. A convenient search option, related publications and other useful information are also provided for further investigation. deepBase is available at: http://deepbase.sysu.edu.cn/. PMID:19966272

  17. EUCLID: automatic classification of proteins in functional classes by their database annotations.

    PubMed

    Tamames, J; Ouzounis, C; Casari, G; Sander, C; Valencia, A

    1998-01-01

    A tool is described for the automatic classification of sequences in functional classes using their database annotations. The Euclid system is based on a simple learning procedure from examples provided by human experts. Euclid is freely available for academics at http://www.gredos.cnb.uam.es/EUCLID, with the corresponding dictionaries for the generation of three, eight and 14 functional classes. E-mail: valencia@cnb.uam.es The results of the EUCLID classification of different genomes are available at http://www.sander.ebi.ac. uk/genequiz/. A detailed description of the different applications mentioned in the text is available at http://www.gredos.cnb.uam. es/EUCLID/Full_Paper

  18. SureChEMBL: a large-scale, chemically annotated patent document database

    PubMed Central

    Papadatos, George; Davies, Mark; Dedman, Nathan; Chambers, Jon; Gaulton, Anna; Siddle, James; Koks, Richard; Irvine, Sean A.; Pettersson, Joe; Goncharoff, Nicko; Hersey, Anne; Overington, John P.

    2016-01-01

    SureChEMBL is a publicly available large-scale resource containing compounds extracted from the full text, images and attachments of patent documents. The data are extracted from the patent literature according to an automated text and image-mining pipeline on a daily basis. SureChEMBL provides access to a previously unavailable, open and timely set of annotated compound-patent associations, complemented with sophisticated combined structure and keyword-based search capabilities against the compound repository and patent document corpus; given the wealth of knowledge hidden in patent documents, analysis of SureChEMBL data has immediate applications in drug discovery, medicinal chemistry and other commercial areas of chemical science. Currently, the database contains 17 million compounds extracted from 14 million patent documents. Access is available through a dedicated web-based interface and data downloads at: https://www.surechembl.org/. PMID:26582922

  19. Re-Annotator: Annotation Pipeline for Microarray Probe Sequences.

    PubMed

    Arloth, Janine; Bader, Daniel M; Röh, Simone; Altmann, Andre

    2015-01-01

    Microarray technologies are established approaches for high throughput gene expression, methylation and genotyping analysis. An accurate mapping of the array probes is essential to generate reliable biological findings. However, manufacturers of the microarray platforms typically provide incomplete and outdated annotation tables, which often rely on older genome and transcriptome versions that differ substantially from up-to-date sequence databases. Here, we present the Re-Annotator, a re-annotation pipeline for microarray probe sequences. It is primarily designed for gene expression microarrays but can also be adapted to other types of microarrays. The Re-Annotator uses a custom-built mRNA reference database to identify the positions of gene expression array probe sequences. We applied Re-Annotator to the Illumina Human-HT12 v4 microarray platform and found that about one quarter (25%) of the probes differed from the manufacturer's annotation. In further computational experiments on experimental gene expression data, we compared Re-Annotator to another probe re-annotation tool, ReMOAT, and found that Re-Annotator provided an improved re-annotation of microarray probes. A thorough re-annotation of probe information is crucial to any microarray analysis. The Re-Annotator pipeline is freely available at http://sourceforge.net/projects/reannotator along with re-annotated files for Illumina microarrays HumanHT-12 v3/v4 and MouseRef-8 v2.

  20. Novel transcriptome assembly and improved annotation of the whiteleg shrimp (Litopenaeus vannamei), a dominant crustacean in global seafood mariculture

    PubMed Central

    Ghaffari, Noushin; Sanchez-Flores, Alejandro; Doan, Ryan; Garcia-Orozco, Karina D.; Chen, Patricia L.; Ochoa-Leyva, Adrian; Lopez-Zavala, Alonso A.; Carrasco, J. Salvador; Hong, Chris; Brieba, Luis G.; Rudiño-Piñera, Enrique; Blood, Philip D.; Sawyer, Jason E.; Johnson, Charles D.; Dindot, Scott V.; Sotelo-Mundo, Rogerio R.; Criscitiello, Michael F.

    2014-01-01

    We present a new transcriptome assembly of the Pacific whiteleg shrimp (Litopenaeus vannamei), the species most farmed for human consumption. Its functional annotation, a substantial improvement over previous ones, is provided freely. RNA-Seq with Illumina HiSeq technology was used to analyze samples extracted from shrimp abdominal muscle, hepatopancreas, gills and pleopods. We used the Trinity and Trinotate software suites for transcriptome assembly and annotation, respectively. The quality of this assembly and the affiliated targeted homology searches greatly enrich the curated transcripts currently available in public databases for this species. Comparison with the model arthropod Daphnia allows some insights into defining characteristics of decapod crustaceans. This large-scale gene discovery gives the broadest depth yet to the annotated transcriptome of this important species and should be of value to ongoing genomics and immunogenetic resistance studies in this shrimp of paramount global economic importance. PMID:25420880

  1. Novel transcriptome assembly and improved annotation of the whiteleg shrimp (Litopenaeus vannamei), a dominant crustacean in global seafood mariculture.

    PubMed

    Ghaffari, Noushin; Sanchez-Flores, Alejandro; Doan, Ryan; Garcia-Orozco, Karina D; Chen, Patricia L; Ochoa-Leyva, Adrian; Lopez-Zavala, Alonso A; Carrasco, J Salvador; Hong, Chris; Brieba, Luis G; Rudiño-Piñera, Enrique; Blood, Philip D; Sawyer, Jason E; Johnson, Charles D; Dindot, Scott V; Sotelo-Mundo, Rogerio R; Criscitiello, Michael F

    2014-11-25

    We present a new transcriptome assembly of the Pacific whiteleg shrimp (Litopenaeus vannamei), the species most farmed for human consumption. Its functional annotation, a substantial improvement over previous ones, is provided freely. RNA-Seq with Illumina HiSeq technology was used to analyze samples extracted from shrimp abdominal muscle, hepatopancreas, gills and pleopods. We used the Trinity and Trinotate software suites for transcriptome assembly and annotation, respectively. The quality of this assembly and the affiliated targeted homology searches greatly enrich the curated transcripts currently available in public databases for this species. Comparison with the model arthropod Daphnia allows some insights into defining characteristics of decapod crustaceans. This large-scale gene discovery gives the broadest depth yet to the annotated transcriptome of this important species and should be of value to ongoing genomics and immunogenetic resistance studies in this shrimp of paramount global economic importance.

  2. Missing semantic annotation in databases. The root cause for data integration and migration problems in information systems.

    PubMed

    Dugas, M

    2014-01-01

    Data integration is a well-known grand challenge in information systems. It is highly relevant in medicine because of the multitude of patient data sources. Semantic annotations of data items regarding concept and value domain, based on comprehensive terminologies can facilitate data integration and migration. Therefore it should be implemented in databases from the very beginning.

  3. Automated system for gene annotation and metabolic pathway reconstruction using general sequence databases.

    PubMed

    Alves, João M P; Buck, Gregory A

    2007-11-01

    Despite the growing number of genomes published or currently being sequenced, there is a relative paucity of software for functional classification of newly discovered genes and their assignment to metabolic pathways. Available software for such analyses has a very steep learning curve and requires the installation, configuration, and maintenance of large amounts of complex infrastructure, including complementary software and databases. Many such tools are restricted to one or a few data sources and classification schemes. In this work, we report an automated system for gene annotation and metabolic pathway reconstruction (ASGARD), which was designed to be powerful and generalizable, yet simple for the biologist to install and run on centralized, commonly available computers. It avoids the requirement for complex resources such as relational databases and web servers, as well as the need for administrator access to the operating system. Our methodology contributes to a more rapid investigation of the potential biochemical capabilities of genes and genomes by the biological researcher, and is useful in biochemical as well as comparative and evolutionary studies of pathways and networks.

  4. A database for coconut crop improvement

    PubMed Central

    Rajagopal, Velamoor; Manimekalai, Ramaswamy; Devakumar, Krishnamurthy; Rajesh; Karun, Anitha; Niral, Vittal; Gopal, Murali; Aziz, Shamina; Gunasekaran, Marimuthu; Kumar, Mundappurathe Ramesh; Chandrasekar, Arumugam

    2005-01-01

    Coconut crop improvement requires a number of biotechnology and bioinformatics tools. A database containing information on CG (coconut germplasm), CCI (coconut cultivar identification), CD (coconut disease), MIFSPC (microbial information systems in plantation crops) and VO (vegetable oils) is described. The database was developed using MySQL and PostgreSQL running in Linux operating system. The database interface is developed in PHP, HTML and JAVA. Availability http://www.bioinfcpcri.org PMID:17597858

  5. Patome: a database server for biological sequence annotation and analysis in issued patents and published patent applications

    PubMed Central

    Lee, Byungwook; Kim, Taehyung; Kim, Seon-Kyu; Lee, Kwang H.; Lee, Doheon

    2007-01-01

    With the advent of automated and high-throughput techniques, the number of patent applications containing biological sequences has been increasing rapidly. However, they have attracted relatively little attention compared to other sequence resources. We have built a database server called Patome, which contains biological sequence data disclosed in patents and published applications, as well as their analysis information. The analysis is divided into two steps. The first is an annotation step in which the disclosed sequences were annotated with RefSeq database. The second is an association step where the sequences were linked to Entrez Gene, OMIM and GO databases, and their results were saved as a gene–patent table. From the analysis, we found that 55% of human genes were associated with patenting. The gene–patent table can be used to identify whether a particular gene or disease is related to patenting. Patome is available at ; the information is updated bimonthly. PMID:17085479

  6. MITOS: improved de novo metazoan mitochondrial genome annotation.

    PubMed

    Bernt, Matthias; Donath, Alexander; Jühling, Frank; Externbrink, Fabian; Florentz, Catherine; Fritzsch, Guido; Pütz, Joern; Middendorf, Martin; Stadler, Peter F

    2013-11-01

    About 2000 completely sequenced mitochondrial genomes are available from the NCBI RefSeq data base together with manually curated annotations of their protein-coding genes, rRNAs, and tRNAs. This annotation information, which has accumulated over two decades, has been obtained with a diverse set of computational tools and annotation strategies. Despite all efforts of manual curation it is still plagued by misassignments of reading directions, erroneous gene names, and missing as well as false positive annotations in particular for the RNA genes. Taken together, this causes substantial problems for fully automatic pipelines that aim to use these data comprehensively for studies of animal phylogenetics and the molecular evolution of mitogenomes. The MITOS pipeline is designed to compute a consistent de novo annotation of the mitogenomic sequences. We show that the results of MITOS match RefSeq and MitoZoa in terms of annotation coverage and quality. At the same time we avoid biases, inconsistencies of nomenclature, and typos originating from manual curation strategies. The MITOS pipeline is accessible online at http://mitos.bioinf.uni-leipzig.de. Copyright © 2012 Elsevier Inc. All rights reserved.

  7. SureChEMBL: a large-scale, chemically annotated patent document database.

    PubMed

    Papadatos, George; Davies, Mark; Dedman, Nathan; Chambers, Jon; Gaulton, Anna; Siddle, James; Koks, Richard; Irvine, Sean A; Pettersson, Joe; Goncharoff, Nicko; Hersey, Anne; Overington, John P

    2016-01-04

    SureChEMBL is a publicly available large-scale resource containing compounds extracted from the full text, images and attachments of patent documents. The data are extracted from the patent literature according to an automated text and image-mining pipeline on a daily basis. SureChEMBL provides access to a previously unavailable, open and timely set of annotated compound-patent associations, complemented with sophisticated combined structure and keyword-based search capabilities against the compound repository and patent document corpus; given the wealth of knowledge hidden in patent documents, analysis of SureChEMBL data has immediate applications in drug discovery, medicinal chemistry and other commercial areas of chemical science. Currently, the database contains 17 million compounds extracted from 14 million patent documents. Access is available through a dedicated web-based interface and data downloads at: https://www.surechembl.org/. © The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research.

  8. Quality of computationally inferred gene ontology annotations.

    PubMed

    Skunca, Nives; Altenhoff, Adrian; Dessimoz, Christophe

    2012-05-01

    Gene Ontology (GO) has established itself as the undisputed standard for protein function annotation. Most annotations are inferred electronically, i.e. without individual curator supervision, but they are widely considered unreliable. At the same time, we crucially depend on those automated annotations, as most newly sequenced genomes are non-model organisms. Here, we introduce a methodology to systematically and quantitatively evaluate electronic annotations. By exploiting changes in successive releases of the UniProt Gene Ontology Annotation database, we assessed the quality of electronic annotations in terms of specificity, reliability, and coverage. Overall, we not only found that electronic annotations have significantly improved in recent years, but also that their reliability now rivals that of annotations inferred by curators when they use evidence other than experiments from primary literature. This work provides the means to identify the subset of electronic annotations that can be relied upon-an important outcome given that >98% of all annotations are inferred without direct curation.

  9. Improved annotation of the insect vector of citrus greening disease: biocuration by a diverse genomics community

    PubMed Central

    Hosmani, Prashant S.; Villalobos-Ayala, Krystal; Miller, Sherry; Shippy, Teresa; Flores, Mirella; Rosendale, Andrew; Cordola, Chris; Bell, Tracey; Mann, Hannah; DeAvila, Gabe; DeAvila, Daniel; Moore, Zachary; Buller, Kyle; Ciolkevich, Kathryn; Nandyal, Samantha; Mahoney, Robert; Van Voorhis, Joshua; Dunlevy, Megan; Farrow, David; Hunter, David; Morgan, Taylar; Shore, Kayla; Guzman, Victoria; Izsak, Allison; Dixon, Danielle E.; Cridge, Andrew; Cano, Liliana; Cao, Xiaolong; Jiang, Haobo; Leng, Nan; Johnson, Shannon; Cantarel, Brandi L.; Richards, Stephen; English, Adam; Shatters, Robert G.; Childers, Chris; Chen, Mei-Ju; Hunter, Wayne; Cilia, Michelle; Mueller, Lukas A.; Munoz-Torres, Monica; Nelson, David; Poelchau, Monica F.; Benoit, Joshua B.; Wiersma-Koch, Helen; D’Elia, Tom; Brown, Susan J.

    2017-01-01

    Abstract The Asian citrus psyllid (Diaphorina citri Kuwayama) is the insect vector of the bacterium Candidatus Liberibacter asiaticus (CLas), the pathogen associated with citrus Huanglongbing (HLB, citrus greening). HLB threatens citrus production worldwide. Suppression or reduction of the insect vector using chemical insecticides has been the primary method to inhibit the spread of citrus greening disease. Accurate structural and functional annotation of the Asian citrus psyllid genome, as well as a clear understanding of the interactions between the insect and CLas, are required for development of new molecular-based HLB control methods. A draft assembly of the D. citri genome has been generated and annotated with automated pipelines. However, knowledge transfer from well-curated reference genomes such as that of Drosophila melanogaster to newly sequenced ones is challenging due to the complexity and diversity of insect genomes. To identify and improve gene models as potential targets for pest control, we manually curated several gene families with a focus on genes that have key functional roles in D. citri biology and CLas interactions. This community effort produced 530 manually curated gene models across developmental, physiological, RNAi regulatory and immunity-related pathways. As previously shown in the pea aphid, RNAi machinery genes putatively involved in the microRNA pathway have been specifically duplicated. A comprehensive transcriptome enabled us to identify a number of gene families that are either missing or misassembled in the draft genome. In order to develop biocuration as a training experience, we included undergraduate and graduate students from multiple institutions, as well as experienced annotators from the insect genomics research community. The resulting gene set (OGS v1.0) combines both automatically predicted and manually curated gene models. Database URL: https://citrusgreening.org/

  10. Clever generation of rich SPARQL queries from annotated relational schema: application to Semantic Web Service creation for biological databases.

    PubMed

    Wollbrett, Julien; Larmande, Pierre; de Lamotte, Frédéric; Ruiz, Manuel

    2013-04-15

    In recent years, a large amount of "-omics" data have been produced. However, these data are stored in many different species-specific databases that are managed by different institutes and laboratories. Biologists often need to find and assemble data from disparate sources to perform certain analyses. Searching for these data and assembling them is a time-consuming task. The Semantic Web helps to facilitate interoperability across databases. A common approach involves the development of wrapper systems that map a relational database schema onto existing domain ontologies. However, few attempts have been made to automate the creation of such wrappers. We developed a framework, named BioSemantic, for the creation of Semantic Web Services that are applicable to relational biological databases. This framework makes use of both Semantic Web and Web Services technologies and can be divided into two main parts: (i) the generation and semi-automatic annotation of an RDF view; and (ii) the automatic generation of SPARQL queries and their integration into Semantic Web Services backbones. We have used our framework to integrate genomic data from different plant databases. BioSemantic is a framework that was designed to speed integration of relational databases. We present how it can be used to speed the development of Semantic Web Services for existing relational biological databases. Currently, it creates and annotates RDF views that enable the automatic generation of SPARQL queries. Web Services are also created and deployed automatically, and the semantic annotations of our Web Services are added automatically using SAWSDL attributes. BioSemantic is downloadable at http://southgreen.cirad.fr/?q=content/Biosemantic.

  11. Clever generation of rich SPARQL queries from annotated relational schema: application to Semantic Web Service creation for biological databases

    PubMed Central

    2013-01-01

    Background In recent years, a large amount of “-omics” data have been produced. However, these data are stored in many different species-specific databases that are managed by different institutes and laboratories. Biologists often need to find and assemble data from disparate sources to perform certain analyses. Searching for these data and assembling them is a time-consuming task. The Semantic Web helps to facilitate interoperability across databases. A common approach involves the development of wrapper systems that map a relational database schema onto existing domain ontologies. However, few attempts have been made to automate the creation of such wrappers. Results We developed a framework, named BioSemantic, for the creation of Semantic Web Services that are applicable to relational biological databases. This framework makes use of both Semantic Web and Web Services technologies and can be divided into two main parts: (i) the generation and semi-automatic annotation of an RDF view; and (ii) the automatic generation of SPARQL queries and their integration into Semantic Web Services backbones. We have used our framework to integrate genomic data from different plant databases. Conclusions BioSemantic is a framework that was designed to speed integration of relational databases. We present how it can be used to speed the development of Semantic Web Services for existing relational biological databases. Currently, it creates and annotates RDF views that enable the automatic generation of SPARQL queries. Web Services are also created and deployed automatically, and the semantic annotations of our Web Services are added automatically using SAWSDL attributes. BioSemantic is downloadable at http://southgreen.cirad.fr/?q=content/Biosemantic. PMID:23586394

  12. A Linked Data-Based Collaborative Annotation System for Increasing Learning Achievements

    ERIC Educational Resources Information Center

    Zarzour, Hafed; Sellami, Mokhtar

    2017-01-01

    With the emergence of the Web 2.0, collaborative annotation practices have become more mature in the field of learning. In this context, several recent studies have shown the powerful effects of the integration of annotation mechanism in learning process. However, most of these studies provide poor support for semantically structured resources,…

  13. Plant Proteogenomics: Improvements to the Grapevine Genome Annotation.

    PubMed

    Chapman, Brett; Bellgard, Matthew

    2017-09-12

    Grapevine is an important perennial fruit to the wine industry, and has implications for the health industry with some causative agents proven to reduce heart disease. Since the sequencing and assembly of grapevine cultivar Pinot Noir, several studies have contributed to its genome annotation. This new study further contributes towards genome annotation efforts by conducting a proteogenomics analysis using the latest genome annotation from CRIBI, legacy proteomics dataset from cultivar Cabernet Sauvignon and a large RNA-seq dataset. A total of 341 novel annotation events were identified consisting of 5 frame-shifts, 37 translated UTRs, 15 exon boundaries, 1 novel splice, 9 novel exons, 159 gene boundaries, 112 reverse strands and 1 novel gene event in 213 genes and 323 proteins. From this proteogenomics evidence, the Augustus gene prediction tool predicted 52 novel and revised genes (54 protein isoforms), 11 genes of which were associated with key traits such as stress tolerance and floral and fruity wine characteristics. This study also highlighted a likely over-assembly with the genome, particularly on chromosome 7. This article is protected by copyright. All rights reserved. This article is protected by copyright. All rights reserved.

  14. ModBase, a database of annotated comparative protein structure models and associated resources

    PubMed Central

    Pieper, Ursula; Webb, Benjamin M.; Dong, Guang Qiang; Schneidman-Duhovny, Dina; Fan, Hao; Kim, Seung Joong; Khuri, Natalia; Spill, Yannick G.; Weinkam, Patrick; Hammel, Michal; Tainer, John A.; Nilges, Michael; Sali, Andrej

    2014-01-01

    ModBase (http://salilab.org/modbase) is a database of annotated comparative protein structure models. The models are calculated by ModPipe, an automated modeling pipeline that relies primarily on Modeller for fold assignment, sequence-structure alignment, model building and model assessment (http://salilab.org/modeller/). ModBase currently contains almost 30 million reliable models for domains in 4.7 million unique protein sequences. ModBase allows users to compute or update comparative models on demand, through an interface to the ModWeb modeling server (http://salilab.org/modweb). ModBase models are also available through the Protein Model Portal (http://www.proteinmodelportal.org/). Recently developed associated resources include the AllosMod server for modeling ligand-induced protein dynamics (http://salilab.org/allosmod), the AllosMod-FoXS server for predicting a structural ensemble that fits an SAXS profile (http://salilab.org/allosmod-foxs), the FoXSDock server for protein–protein docking filtered by an SAXS profile (http://salilab.org/foxsdock), the SAXS Merge server for automatic merging of SAXS profiles (http://salilab.org/saxsmerge) and the Pose & Rank server for scoring protein–ligand complexes (http://salilab.org/poseandrank). In this update, we also highlight two applications of ModBase: a PSI:Biology initiative to maximize the structural coverage of the human alpha-helical transmembrane proteome and a determination of structural determinants of human immunodeficiency virus-1 protease specificity. PMID:24271400

  15. Modeling loosely annotated images using both given and imagined annotations

    NASA Astrophysics Data System (ADS)

    Tang, Hong; Boujemaa, Nozha; Chen, Yunhao; Deng, Lei

    2011-12-01

    In this paper, we present an approach to learn latent semantic analysis models from loosely annotated images for automatic image annotation and indexing. The given annotation in training images is loose due to: 1. ambiguous correspondences between visual features and annotated keywords; 2. incomplete lists of annotated keywords. The second reason motivates us to enrich the incomplete annotation in a simple way before learning a topic model. In particular, some ``imagined'' keywords are poured into the incomplete annotation through measuring similarity between keywords in terms of their co-occurrence. Then, both given and imagined annotations are employed to learn probabilistic topic models for automatically annotating new images. We conduct experiments on two image databases (i.e., Corel and ESP) coupled with their loose annotations, and compare the proposed method with state-of-the-art discrete annotation methods. The proposed method improves word-driven probability latent semantic analysis (PLSA-words) up to a comparable performance with the best discrete annotation method, while a merit of PLSA-words is still kept, i.e., a wider semantic range.

  16. Improved orthologous databases to ease protozoan targets inference.

    PubMed

    Kotowski, Nelson; Jardim, Rodrigo; Dávila, Alberto M R

    2015-09-29

    Homology inference helps on identifying similarities, as well as differences among organisms, which provides a better insight on how closely related one might be to another. In addition, comparative genomics pipelines are widely adopted tools designed using different bioinformatics applications and algorithms. In this article, we propose a methodology to build improved orthologous databases with the potential to aid on protozoan target identification, one of the many tasks which benefit from comparative genomics tools. Our analyses are based on OrthoSearch, a comparative genomics pipeline originally designed to infer orthologs through protein-profile comparison, supported by an HMM, reciprocal best hits based approach. Our methodology allows OrthoSearch to confront two orthologous databases and to generate an improved new one. Such can be later used to infer potential protozoan targets through a similarity analysis against the human genome. The protein sequences of Cryptosporidium hominis, Entamoeba histolytica and Leishmania infantum genomes were comparatively analyzed against three orthologous databases: (i) EggNOG KOG, (ii) ProtozoaDB and (iii) Kegg Orthology (KO). That allowed us to create two new orthologous databases, "KO + EggNOG KOG" and "KO + EggNOG KOG + ProtozoaDB", with 16,938 and 27,701 orthologous groups, respectively. Such new orthologous databases were used for a regular OrthoSearch run. By confronting "KO + EggNOG KOG" and "KO + EggNOG KOG + ProtozoaDB" databases and protozoan species we were able to detect the following total of orthologous groups and coverage (relation between the inferred orthologous groups and the species total number of proteins): Cryptosporidium hominis: 1,821 (11 %) and 3,254 (12 %); Entamoeba histolytica: 2,245 (13 %) and 5,305 (19 %); Leishmania infantum: 2,702 (16 %) and 4,760 (17 %). Using our HMM-based methodology and the largest created orthologous database, it was possible to infer 13

  17. Mendel-GFDb and Mendel-ESTS: databases of plant gene families and ESTs annotated with gene family numbers and gene family names

    PubMed Central

    Lonsdale, David; Crowe, Mark; Arnold, Benjamin; Arnold, Benedict C.

    2001-01-01

    There is no control over the information provided with sequences when they are deposited in the sequence databases. Consequently mistakes can seed the incorrect annotation of other sequences. Grouping genes into families and applying controlled annotation overcomes the problems of incorrect annotation associated with individual sequences. Two databases (http://www.mendel.ac.uk) were created to apply controlled annotation to plant genes and plant ESTs: Mendel-GFDb is a database of plant protein (gene) families based on gapped-BLAST analysis of all sequences in the SWISS-PROT family of databases. Sequences are aligned (ClustalW) and identical and similar residues shaded. The families are visually curated to ensure that one or more criteria, for example overall relatedness and/or domain similarity relate all sequences within a family. Sequence families are assigned a ‘Gene Family Number’ and a unified description is developed which best describes the family and its members. If authority exists the gene family is assigned a ‘Gene Family Name’. This information is placed in Mendel-GFDb. Mendel-ESTS is primarily a database of plant ESTs, which have been compared to Mendel-GFDb, completely sequenced genomes and domain databases. This approach associated ESTs with individual sequences and the controlled annotation of gene families and protein domains; the information being placed in Mendel-ESTS. The controlled annotation applied to genes and ESTs provides a basis from which a plant transcription database can be developed. PMID:11125066

  18. Genome-Wide Enzyme Annotation with Precision Control: Catalytic Families (CatFam) Databases

    DTIC Science & Technology

    2008-01-01

    classification, decision trees , association rules, neural networks,18 and support vector machines,19,20 to classify protein catalytic func- tions using various...genomes To evaluate the performance of CatFam for whole ge- nome annotation, we select two Yersinia genomes [Y. pes- tis mediaevails (ypm) and Y...pestis and F. tularensis Organism Annotation Number of enzyme-catalyzed reactions Number of predicted pathways Number of pathways with holes Yersinia

  19. Strategies to improve reference databases for soil microbiomes

    SciTech Connect

    Choi, Jinlyung; Yang, Fan; Stepanauskas, Ramunas; Cardenas, Erick; Garoutte, Aaron; Williams, Ryan; Flater, Jared; Tiedje, James M.; Hofmockel, Kirsten S.; Gelder, Brian; Howe, Adina

    2016-12-09

    A database of curated genomes is needed to better assess soil microbial communities and their processes associated with differing land management and environmental impacts. Interpreting soil metagenomic datasets with existing sequence databases is challenging because these datasets are biased towards medical and biotechnology research and can result in misleading annotations. We have curated a database of 928 genomes of soil-associated organisms (888 bacteria, 34 archaea, and 6 fungi). Using this database as a representation of the current state of knowledge of soil microbes that are well-characterized, we evaluated its composition and compared it to broader microbial databases, specifically NCBI’s RefSeq, as well as 3,035 publicly available soil amplicon datasets. These comparisons identified phyla and functions that are enriched in soils as well as those that may be underrepresented in RefSoil. For example, RefSoil was observed to have increased representation of Firmicutes despite its low abundance in soil environments and also lacked representation of Acidobacteria and Verrucomicrobia, which are abundant in soils. Our comparison of RefSoil to soil amplicon datasets allowed us to identify targets that if cultured or sequenced would significantly increase the biodiversity represented within RefSoil. To demonstrate the opportunities to access these underrepresented targets, we employed single cell genomics in a pilot experiment to recover 14 genomes from the "most wanted" list, which improved RefSoil's representation of EMP sequences by 7% by abundance. This effort demonstrates the value of RefSoil in the guidance of future research efforts and the capability of single cell genomics as a practical means to fill the existing genomic data gaps.

  20. Strategies to improve reference databases for soil microbiomes

    DOE PAGES

    Choi, Jinlyung; Yang, Fan; Stepanauskas, Ramunas; ...

    2016-12-09

    A database of curated genomes is needed to better assess soil microbial communities and their processes associated with differing land management and environmental impacts. Interpreting soil metagenomic datasets with existing sequence databases is challenging because these datasets are biased towards medical and biotechnology research and can result in misleading annotations. We have curated a database of 928 genomes of soil-associated organisms (888 bacteria, 34 archaea, and 6 fungi). Using this database as a representation of the current state of knowledge of soil microbes that are well-characterized, we evaluated its composition and compared it to broader microbial databases, specifically NCBI’s RefSeq,more » as well as 3,035 publicly available soil amplicon datasets. These comparisons identified phyla and functions that are enriched in soils as well as those that may be underrepresented in RefSoil. For example, RefSoil was observed to have increased representation of Firmicutes despite its low abundance in soil environments and also lacked representation of Acidobacteria and Verrucomicrobia, which are abundant in soils. Our comparison of RefSoil to soil amplicon datasets allowed us to identify targets that if cultured or sequenced would significantly increase the biodiversity represented within RefSoil. To demonstrate the opportunities to access these underrepresented targets, we employed single cell genomics in a pilot experiment to recover 14 genomes from the "most wanted" list, which improved RefSoil's representation of EMP sequences by 7% by abundance. This effort demonstrates the value of RefSoil in the guidance of future research efforts and the capability of single cell genomics as a practical means to fill the existing genomic data gaps.« less

  1. The PRINTS database: a fine-grained protein sequence annotation and analysis resource—its status in 2012

    PubMed Central

    Attwood, Teresa K.; Coletta, Alain; Muirhead, Gareth; Pavlopoulou, Athanasia; Philippou, Peter B.; Popov, Ivan; Romá-Mateo, Carlos; Theodosiou, Athina; Mitchell, Alex L.

    2012-01-01

    The PRINTS database, now in its 21st year, houses a collection of diagnostic protein family ‘fingerprints’. Fingerprints are groups of conserved motifs, evident in multiple sequence alignments, whose unique inter-relationships provide distinctive signatures for particular protein families and structural/functional domains. As such, they may be used to assign uncharacterized sequences to known families, and hence to infer tentative functional, structural and/or evolutionary relationships. The February 2012 release (version 42.0) includes 2156 fingerprints, encoding 12 444 individual motifs, covering a range of globular and membrane proteins, modular polypeptides and so on. Here, we report the current status of the database, and introduce a number of recent developments that help both to render a variety of our annotation and analysis tools easier to use and to make them more widely available. Database URL: www.bioinf.manchester.ac.uk/dbbrowser/PRINTS/ PMID:22508994

  2. Improving HIV proteome annotation: new features of BioAfrica HIV Proteomics Resource

    PubMed Central

    Druce, Megan; Hulo, Chantal; Masson, Patrick; Sommer, Paula; Xenarios, Ioannis; Le Mercier, Philippe; De Oliveira, Tulio

    2016-01-01

    The Human Immunodeficiency Virus (HIV) is one of the pathogens that cause the greatest global concern, with approximately 35 million people currently infected with HIV. Extensive HIV research has been performed, generating a large amount of HIV and host genomic data. However, no effective vaccine that protects the host from HIV infection is available and HIV is still spreading at an alarming rate, despite effective antiretroviral (ARV) treatment. In order to develop effective therapies, we need to expand our knowledge of the interaction between HIV and host proteins. In contrast to virus proteins, which often rapidly evolve drug resistance mutations, the host proteins are essentially invariant within all humans. Thus, if we can identify the host proteins needed for virus replication, such as those involved in transporting viral proteins to the cell surface, we have a chance of interrupting viral replication. There is no proteome resource that summarizes this interaction, making research on this subject a difficult enterprise. In order to fill this gap in knowledge, we curated a resource presents detailed annotation on the interaction between the HIV proteome and host proteins. Our resource was produced in collaboration with ViralZone and used manual curation techniques developed by UniProtKB/Swiss-Prot. Our new website also used previous annotations of the BioAfrica HIV-1 Proteome Resource, which has been accessed by approximately 10 000 unique users a year since its inception in 2005. The novel features include a dedicated new page for each HIV protein, a graphic display of its function and a section on its interaction with host proteins. Our new webpages also add information on the genomic location of each HIV protein and the position of ARV drug resistance mutations. Our improved BioAfrica HIV-1 Proteome Resource fills a gap in the current knowledge of biocuration. Database URL: http://www.bioafrica.net/proteomics/HIVproteome.html PMID:27087306

  3. The Disease Portals, disease-gene annotation and the RGD disease ontology at the Rat Genome Database.

    PubMed

    Hayman, G Thomas; Laulederkind, Stanley J F; Smith, Jennifer R; Wang, Shur-Jen; Petri, Victoria; Nigam, Rajni; Tutaj, Marek; De Pons, Jeff; Dwinell, Melinda R; Shimoyama, Mary

    2016-01-01

    The Rat Genome Database (RGD;http://rgd.mcw.edu/) provides critical datasets and software tools to a diverse community of rat and non-rat researchers worldwide. To meet the needs of the many users whose research is disease oriented, RGD has created a series of Disease Portals and has prioritized its curation efforts on the datasets important to understanding the mechanisms of various diseases. Gene-disease relationships for three species, rat, human and mouse, are annotated to capture biomarkers, genetic associations, molecular mechanisms and therapeutic targets. To generate gene-disease annotations more effectively and in greater detail, RGD initially adopted the MEDIC disease vocabulary from the Comparative Toxicogenomics Database and adapted it for use by expanding this framework with the addition of over 1000 terms to create the RGD Disease Ontology (RDO). The RDO provides the foundation for, at present, 10 comprehensive disease area-related dataset and analysis platforms at RGD, the Disease Portals. Two major disease areas are the focus of data acquisition and curation efforts each year, leading to the release of the related Disease Portals. Collaborative efforts to realize a more robust disease ontology are underway. Database URL:http://rgd.mcw.edu.

  4. The Disease Portals, disease–gene annotation and the RGD disease ontology at the Rat Genome Database

    PubMed Central

    Hayman, G. Thomas; Laulederkind, Stanley J. F.; Smith, Jennifer R.; Wang, Shur-Jen; Petri, Victoria; Nigam, Rajni; Tutaj, Marek; De Pons, Jeff; Dwinell, Melinda R.; Shimoyama, Mary

    2016-01-01

    The Rat Genome Database (RGD; http://rgd.mcw.edu/) provides critical datasets and software tools to a diverse community of rat and non-rat researchers worldwide. To meet the needs of the many users whose research is disease oriented, RGD has created a series of Disease Portals and has prioritized its curation efforts on the datasets important to understanding the mechanisms of various diseases. Gene-disease relationships for three species, rat, human and mouse, are annotated to capture biomarkers, genetic associations, molecular mechanisms and therapeutic targets. To generate gene–disease annotations more effectively and in greater detail, RGD initially adopted the MEDIC disease vocabulary from the Comparative Toxicogenomics Database and adapted it for use by expanding this framework with the addition of over 1000 terms to create the RGD Disease Ontology (RDO). The RDO provides the foundation for, at present, 10 comprehensive disease area-related dataset and analysis platforms at RGD, the Disease Portals. Two major disease areas are the focus of data acquisition and curation efforts each year, leading to the release of the related Disease Portals. Collaborative efforts to realize a more robust disease ontology are underway. Database URL: http://rgd.mcw.edu PMID:27009807

  5. Improving interoperability between microbial information and sequence databases

    PubMed Central

    Romano, Paolo; Dawyndt, Peter; Piersigilli, Francesca; Swings, Jean

    2005-01-01

    Background Biological resources are essential tools for biomedical research. Their availability is promoted through on-line catalogues. Common Access to Biological Resources and Information (CABRI) is a service for distribution of biological resources and related data collected by 28 European culture collections. Linking this information to bioinformatics databanks can make the collections' holdings more visible after a search in molecular biology databanks and vice-versa. Identification of links to sequence databases can be useful, but annotation and indexing problems, together with compilation errors, immediately arise. In this paper, we present our efforts for the identification of cross-references between CABRI catalogues and the EMBL Data Library and related results. Results An SRS site with both EMBL and CABRI catalogues has been set up. Ad-hoc changes in indexing scripts allowed to achieve homogeneous index keys and SRS link features have been used to identify links between databases. After manual checking and comparison with an alternative procedure, about 67,500 valid cross-references were identified, added to the EMBL Data Library and are now distributed with it. HTML links can be established from EMBL to CABRI network service. Procedures can be executed whenever needed. Conclusion Links between EMBL and CABRI catalogues constitute an improved access to micro-organisms of certified quality and can produce positive effects on biomedical research. Further links between CABRI catalogues and other bioinformatics databases can now easily be defined by using these cross-references. Linking genetic information onto natural resources information may stand model for the integration of other databases containing empirical data on these materials. PMID:16351750

  6. ASPicDB: a database of annotated transcript and protein variants generated by alternative splicing.

    PubMed

    Martelli, Pier L; D'Antonio, Mattia; Bonizzoni, Paola; Castrignanò, Tiziana; D'Erchia, Anna M; D'Onorio De Meo, Paolo; Fariselli, Piero; Finelli, Michele; Licciulli, Flavio; Mangiulli, Marina; Mignone, Flavio; Pavesi, Giulio; Picardi, Ernesto; Rizzi, Raffaella; Rossi, Ivan; Valletti, Alessio; Zauli, Andrea; Zambelli, Federico; Casadio, Rita; Pesole, Graziano

    2011-01-01

    Alternative splicing is emerging as a major mechanism for the expansion of the transcriptome and proteome diversity, particularly in human and other vertebrates. However, the proportion of alternative transcripts and proteins actually endowed with functional activity is currently highly debated. We present here a new release of ASPicDB which now provides a unique annotation resource of human protein variants generated by alternative splicing. A total of 256,939 protein variants from 17,191 multi-exon genes have been extensively annotated through state of the art machine learning tools providing information of the protein type (globular and transmembrane), localization, presence of PFAM domains, signal peptides, GPI-anchor propeptides, transmembrane and coiled-coil segments. Furthermore, full-length variants can be now specifically selected based on the annotation of CAGE-tags and polyA signal and/or polyA sites, marking transcription initiation and termination sites, respectively. The retrieval can be carried out at gene, transcript, exon, protein or splice site level allowing the selection of data sets fulfilling one or more features settled by the user. The retrieval interface also enables the selection of protein variants showing specific differences in the annotated features. ASPicDB is available at http://www.caspur.it/ASPicDB/.

  7. Strategies for annotation and curation of translational databases: the eTUMOUR project.

    PubMed

    Julià-Sapé, Margarida; Lurgi, Miguel; Mier, Mariola; Estanyol, Francesc; Rafael, Xavier; Candiota, Ana Paula; Barceló, Anna; García, Alina; Martínez-Bisbal, M Carmen; Ferrer-Luna, Rubén; Moreno-Torres, Ángel; Celda, Bernardo; Arús, Carles

    2012-01-01

    The eTUMOUR (eT) multi-centre project gathered in vivo and ex vivo magnetic resonance (MR) data, as well as transcriptomic and clinical information from brain tumour patients, with the purpose of improving the diagnostic and prognostic evaluation of future patients. In order to carry this out, among other work, a database--the eTDB--was developed. In addition to complex permission rules and software and management quality control (QC), it was necessary to develop anonymization, processing and data visualization tools for the data uploaded. It was also necessary to develop sophisticated curation strategies that involved on one hand, dedicated fields for QC-generated meta-data and specialized queries and global permissions for senior curators and on the other, to establish a set of metrics to quantify its contents. The indispensable dataset (ID), completeness and pairedness indices were set. The database contains 1317 cases created as a result of the eT project and 304 from a previous project, INTERPRET. The number of cases fulfilling the ID was 656. Completeness and pairedness were heterogeneous, depending on the data type involved.

  8. A computational platform to maintain and migrate manual functional annotations for BioCyc databases

    USDA-ARS?s Scientific Manuscript database

    Model organism databases are an important resource for information on biological pathways and genomic data. Such databases represent the accumulation of biological data, some of which has been manually curated from literature. An essential feature of these databases is the continuing data integratio...

  9. Citrus sinensis annotation project (CAP): a comprehensive database for sweet orange genome.

    PubMed

    Wang, Jia; Chen, Dijun; Lei, Yang; Chang, Ji-Wei; Hao, Bao-Hai; Xing, Feng; Li, Sen; Xu, Qiang; Deng, Xiu-Xin; Chen, Ling-Ling

    2014-01-01

    Citrus is one of the most important and widely grown fruit crop with global production ranking firstly among all the fruit crops in the world. Sweet orange accounts for more than half of the Citrus production both in fresh fruit and processed juice. We have sequenced the draft genome of a double-haploid sweet orange (C. sinensis cv. Valencia), and constructed the Citrus sinensis annotation project (CAP) to store and visualize the sequenced genomic and transcriptome data. CAP provides GBrowse-based organization of sweet orange genomic data, which integrates ab initio gene prediction, EST, RNA-seq and RNA-paired end tag (RNA-PET) evidence-based gene annotation. Furthermore, we provide a user-friendly web interface to show the predicted protein-protein interactions (PPIs) and metabolic pathways in sweet orange. CAP provides comprehensive information beneficial to the researchers of sweet orange and other woody plants, which is freely available at http://citrus.hzau.edu.cn/.

  10. Recent improvements to the SMART domain-based sequence annotation resource.

    PubMed

    Letunic, Ivica; Goodstadt, Leo; Dickens, Nicholas J; Doerks, Tobias; Schultz, Joerg; Mott, Richard; Ciccarelli, Francesca; Copley, Richard R; Ponting, Chris P; Bork, Peer

    2002-01-01

    SMART (Simple Modular Architecture Research Tool, http://smart.embl-heidelberg.de) is a web-based resource used for the annotation of protein domains and the analysis of domain architectures, with particular emphasis on mobile eukaryotic domains. Extensive annotation for each domain family is available, providing information relating to function, subcellular localization, phyletic distribution and tertiary structure. The January 2002 release has added more than 200 hand-curated domain models. This brings the total to over 600 domain families that are widely represented among nuclear, signalling and extracellular proteins. Annotation now includes links to the Online Mendelian Inheritance in Man (OMIM) database in cases where a human disease is associated with one or more mutations in a particular domain. We have implemented new analysis methods and updated others. New advanced queries provide direct access to the SMART relational database using SQL. This database now contains information on intrinsic sequence features such as transmembrane regions, coiled-coils, signal peptides and internal repeats. SMART output can now be easily included in users' documents. A SMART mirror has been created at http://smart.ox.ac.uk.

  11. Recent improvements to the SMART domain-based sequence annotation resource

    PubMed Central

    Letunic, Ivica; Goodstadt, Leo; Dickens, Nicholas J.; Doerks, Tobias; Schultz, Joerg; Mott, Richard; Ciccarelli, Francesca; Copley, Richard R.; Ponting, Chris P.; Bork, Peer

    2002-01-01

    SMART (Simple Modular Architecture Research Tool, http://smart.embl-heidelberg.de) is a web-based resource used for the annotation of protein domains and the analysis of domain architectures, with particular emphasis on mobile eukaryotic domains. Extensive annotation for each domain family is available, providing information relating to function, subcellular localization, phyletic distribution and tertiary structure. The January 2002 release has added more than 200 hand-curated domain models. This brings the total to over 600 domain families that are widely represented among nuclear, signalling and extracellular proteins. Annotation now includes links to the Online Mendelian Inheritance in Man (OMIM) database in cases where a human disease is associated with one or more mutations in a particular domain. We have implemented new analysis methods and updated others. New advanced queries provide direct access to the SMART relational database using SQL. This database now contains information on intrinsic sequence features such as transmembrane regions, coiled-coils, signal peptides and internal repeats. SMART output can now be easily included in users’ documents. A SMART mirror has been created at http://smart.ox.ac.uk. PMID:11752305

  12. Glycoconjugate Data Bank: Structures--an annotated glycan structure database and N-glycan primary structure verification service.

    PubMed

    Nakahara, Taku; Hashimoto, Ryo; Nakagawa, Hiroaki; Monde, Kenji; Miura, Nobuaki; Nishimura, Shin-Ichiro

    2008-01-01

    Glycobiology has been brought to public attention as a frontier in the post-genomic era. Structural information about glycans has been accumulating in the Protein Data Bank (PDB) for years. It has been recognized, however, that there are many questionable glycan models in the PDB. A tool for verifying the primary structures of glycan 3D structures is evidently required, yet there have been no such publicly available tools. The Glycoconjugate Data Bank:Structures (GDB:Structures, http://www.glycostructures.jp) is an annotated glycan structure database, which also provides an N-glycan primary structure (or glycoform) verification service. All the glycan 3D structures are detected and annotated by an in-house program named 'getCARBO'. When an N-glycan is detected in a query coordinate by getCARBO, the primary structure of the glycan is compared with the most similar entry in the glycan primary structure database (KEGG GLYCAN), and unmatched substructure(s) are indicated if observed. The results of getCARBO are stored and presented in GDB:Structures.

  13. Strategies for annotation and curation of translational databases: the eTUMOUR project

    PubMed Central

    Julià-Sapé, Margarida; Lurgi, Miguel; Mier, Mariola; Estanyol, Francesc; Rafael, Xavier; Candiota, Ana Paula; Barceló, Anna; García, Alina; Martínez-Bisbal, M. Carmen; Ferrer-Luna, Rubén; Moreno-Torres, Àngel; Celda, Bernardo; Arús, Carles

    2012-01-01

    The eTUMOUR (eT) multi-centre project gathered in vivo and ex vivo magnetic resonance (MR) data, as well as transcriptomic and clinical information from brain tumour patients, with the purpose of improving the diagnostic and prognostic evaluation of future patients. In order to carry this out, among other work, a database—the eTDB—was developed. In addition to complex permission rules and software and management quality control (QC), it was necessary to develop anonymization, processing and data visualization tools for the data uploaded. It was also necessary to develop sophisticated curation strategies that involved on one hand, dedicated fields for QC-generated meta-data and specialized queries and global permissions for senior curators and on the other, to establish a set of metrics to quantify its contents. The indispensable dataset (ID), completeness and pairedness indices were set. The database contains 1317 cases created as a result of the eT project and 304 from a previous project, INTERPRET. The number of cases fulfilling the ID was 656. Completeness and pairedness were heterogeneous, depending on the data type involved. PMID:23180768

  14. A kingdom-specific protein domain HMM library for improved annotation of fungal genomes

    PubMed Central

    Alam, Intikhab; Hubbard, Simon J; Oliver, Stephen G; Rattray, Magnus

    2007-01-01

    Background Pfam is a general-purpose database of protein domain alignments and profile Hidden Markov Models (HMMs), which is very popular for the annotation of sequence data produced by genome sequencing projects. Pfam provides models that are often very general in terms of the taxa that they cover and it has previously been suggested that such general models may lack some of the specificity or selectivity that would be provided by kingdom-specific models. Results Here we present a general approach to create domain libraries of HMMs for sub-taxa of a kingdom. Taking fungal species as an example, we construct a domain library of HMMs (called Fungal Pfam or FPfam) using sequences from 30 genomes, consisting of 24 species from the ascomycetes group and two basidiomycetes, Ustilago maydis, a fungal pathogen of maize, and the white rot fungus Phanerochaete chrysosporium. In addition, we include the Microsporidion Encephalitozoon cuniculi, an obligate intracellular parasite, and two non-fungal species, the oomycetes Phytophthora sojae and Phytophthora ramorum, both plant pathogens. We evaluate the performance in terms of coverage against the original 30 genomes used in training FPfam and against five more recently sequenced fungal genomes that can be considered as an independent test set. We show that kingdom-specific models such as FPfam can find instances of both novel and well characterized domains, increases overall coverage and detects more domains per sequence with typically higher bitscores than Pfam for the same domain families. An evaluation of the effect of changing E-values on the coverage shows that the performance of FPfam is consistent over the range of E-values applied. Conclusion Kingdom-specific models are shown to provide improved coverage. However, as the models become more specific, some sequences found by Pfam may be missed by the models in FPfam and some of the families represented in the test set are not present in FPfam. Therefore, we recommend

  15. Parallel proteomics to improve coverage and confidence in the partially annotated Oryctolagus cuniculus mitochondrial proteome.

    PubMed

    White, Melanie Y; Brown, David A; Sheng, Simon; Cole, Robert N; O'Rourke, Brian; Van Eyk, Jennifer E

    2011-02-01

    The ability to decipher the dynamic protein component of any system is determined by the inherent limitations of the technologies used, the complexity of the sample, and the existence of an annotated genome. In the absence of an annotated genome, large-scale proteomic investigations can be technically difficult. Yet the functional and biological species differences across animal models can lead to selection of partially or nonannotated organisms over those with an annotated genome. The outweighing of biology over technology leads us to investigate the degree to which a parallel approach can facilitate proteome coverage in the absence of complete genome annotation. When studying species without complete genome annotation, a particular challenge is how to ensure high proteome coverage while meeting the bioinformatic stringencies of high-throughput proteomics. A protein inventory of Oryctolagus cuniculus mitochondria was created by overlapping "protein-centric" and "peptide-centric" one-dimensional and two-dimensional liquid chromatography strategies; with additional partitioning into membrane-enriched and soluble fractions. With the use of these five parallel approaches, 2934 unique peptides were identified, corresponding to 558 nonredundant protein groups. 230 of these proteins (41%) were identified by only a single technical approach, confirming the need for parallel techniques to improve annotation. To determine the extent of coverage, a side-by-side comparison with human and mouse cardiomyocyte mitochondrial studies was performed. A nonredundant list of 995 discrete proteins was compiled, of which 244 (25%) were common across species. The current investigation identified 142 unique protein groups, the majority of which were detected here by only one technical approach, in particular peptide- and protein-centric two-dimensional liquid chromatography. Although no single approach achieved more than 40% coverage, the combination of three approaches (protein- and

  16. OligoRAP – an Oligo Re-Annotation Pipeline to improve annotation and estimate target specificity

    PubMed Central

    Neerincx, Pieter BT; Rauwerda, Han; Nie, Haisheng; Groenen, Martien AM; Breit, Timo M; Leunissen, Jack AM

    2009-01-01

    Background High throughput gene expression studies using oligonucleotide microarrays depend on the specificity of each oligonucleotide (oligo or probe) for its target gene. However, target specific probes can only be designed when a reference genome of the species at hand were completely sequenced, when this genome were completely annotated and when the genetic variation of the sampled individuals were completely known. Unfortunately there is not a single species for which such a complete data set is available. Therefore, it is important that probe annotation can be updated frequently for optimal interpretation of microarray experiments. Results In this paper we present OligoRAP, a pipeline to automatically update the annotation of oligo libraries and estimate oligo target specificity. OligoRAP uses a reference genome assembly with Ensembl and Entrez Gene annotation supplemented with a set of unmapped transcripts derived from RefSeq and UniGene to handle assembly gaps. OligoRAP produces alignments of each oligo with the reference assembly as well as with unmapped transcripts. These alignments are re-mapped to the annotation sources, which results in a concise, as complete as possible and up-to-date annotation of the oligo library. The building blocks of this pipeline are BioMoby web services creating a highly modular and distributed system with a robust, remote programmatic interface. OligoRAP was used to update the annotation for a subset of 791 oligos from the ARK-Genomics 20 K chicken array, which were selected as starting material for the oligo annotation session of the EADGENE/SABRE Post-analysis workshop. Based on the updated annotation about one third of these oligos is problematic with regard to target specificity. In addition, the accession numbers or ids the oligos were originally designed for no longer exist in the updated annotation for almost half of the oligos. Conclusion As microarrays are designed on incomplete data, it is important to update probe

  17. A database of annotated tentative orthologs from crop abiotic stress transcripts.

    PubMed

    Balaji, Jayashree; Crouch, Jonathan H; Petite, Prasad V N S; Hoisington, David A

    2006-10-07

    A minimal requirement to initiate a comparative genomics study on plant responses to abiotic stresses is a dataset of orthologous sequences. The availability of a large amount of sequence information, including those derived from stress cDNA libraries allow for the identification of stress related genes and orthologs associated with the stress response. Orthologous sequences serve as tools to explore genes and their relationships across species. For this purpose, ESTs from stress cDNA libraries across 16 crop species including 6 important cereal crops and 10 dicots were systematically collated and subjected to bioinformatics analysis such as clustering, grouping of tentative orthologous sets, identification of protein motifs/patterns in the predicted protein sequence, and annotation with stress conditions, tissue/library source and putative function. All data are available to the scientific community at http://intranet.icrisat.org/gt1/tog/homepage.htm. We believe that the availability of annotated plant abiotic stress ortholog sets will be a valuable resource for researchers studying the biology of environmental stresses in plant systems, molecular evolution and genomics.

  18. Citrus sinensis Annotation Project (CAP): A Comprehensive Database for Sweet Orange Genome

    PubMed Central

    Chang, Ji-Wei; Hao, Bao-Hai; Xing, Feng; Li, Sen; Xu, Qiang; Deng, Xiu-Xin; Chen, Ling-Ling

    2014-01-01

    Citrus is one of the most important and widely grown fruit crop with global production ranking firstly among all the fruit crops in the world. Sweet orange accounts for more than half of the Citrus production both in fresh fruit and processed juice. We have sequenced the draft genome of a double-haploid sweet orange (C. sinensis cv. Valencia), and constructed the Citrus sinensis annotation project (CAP) to store and visualize the sequenced genomic and transcriptome data. CAP provides GBrowse-based organization of sweet orange genomic data, which integrates ab initio gene prediction, EST, RNA-seq and RNA-paired end tag (RNA-PET) evidence-based gene annotation. Furthermore, we provide a user-friendly web interface to show the predicted protein-protein interactions (PPIs) and metabolic pathways in sweet orange. CAP provides comprehensive information beneficial to the researchers of sweet orange and other woody plants, which is freely available at http://citrus.hzau.edu.cn/. PMID:24489955

  19. Comparative high-throughput transcriptome sequencing and development of SiESTa, the Silene EST annotation database

    PubMed Central

    2011-01-01

    Background The genus Silene is widely used as a model system for addressing ecological and evolutionary questions in plants, but advances in using the genus as a model system are impeded by the lack of available resources for studying its genome. Massively parallel sequencing cDNA has recently developed into an efficient method for characterizing the transcriptomes of non-model organisms, generating massive amounts of data that enable the study of multiple species in a comparative framework. The sequences generated provide an excellent resource for identifying expressed genes, characterizing functional variation and developing molecular markers, thereby laying the foundations for future studies on gene sequence and gene expression divergence. Here, we report the results of a comparative transcriptome sequencing study of eight individuals representing four Silene and one Dianthus species as outgroup. All sequences and annotations have been deposited in a newly developed and publicly available database called SiESTa, the Silene EST annotation database. Results A total of 1,041,122 EST reads were generated in two runs on a Roche GS-FLX 454 pyrosequencing platform. EST reads were analyzed separately for all eight individuals sequenced and were assembled into contigs using TGICL. These were annotated with results from BLASTX searches and Gene Ontology (GO) terms, and thousands of single-nucleotide polymorphisms (SNPs) were characterized. Unassembled reads were kept as singletons and together with the contigs contributed to the unigenes characterized in each individual. The high quality of unigenes is evidenced by the proportion (49%) that have significant hits in similarity searches with the A. thaliana proteome. The SiESTa database is accessible at http://www.siesta.ethz.ch. Conclusion The sequence collections established in the present study provide an important genomic resource for four Silene and one Dianthus species and will help to further develop Silene as a

  20. RNA-Seq improves annotation of protein-coding genes in the cucumber genome

    PubMed Central

    2011-01-01

    Background As more and more genomes are sequenced, genome annotation becomes increasingly important in bridging the gap between sequence and biology. Gene prediction, which is at the center of genome annotation, usually integrates various resources to compute consensus gene structures. However, many newly sequenced genomes have limited resources for gene predictions. In an effort to create high-quality gene models of the cucumber genome (Cucumis sativus var. sativus), based on the EVidenceModeler gene prediction pipeline, we incorporated the massively parallel complementary DNA sequencing (RNA-Seq) reads of 10 cucumber tissues into EVidenceModeler. We applied the new pipeline to the reassembled cucumber genome and included a comparison between our predicted protein-coding gene sets and a published set. Results The reassembled cucumber genome, annotated with RNA-Seq reads from 10 tissues, has 23, 248 identified protein-coding genes. Compared with the published prediction in 2009, approximately 8, 700 genes reveal structural modifications and 5, 285 genes only appear in the reassembled cucumber genome. All the related results, including genome sequence and annotations, are available at http://cmb.bnu.edu.cn/Cucumis_sativus_v20/. Conclusions We conclude that RNA-Seq greatly improves the accuracy of prediction of protein-coding genes in the reassembled cucumber genome. The comparison between the two gene sets also suggests that it is feasible to use RNA-Seq reads to annotate newly sequenced or less-studied genomes. PMID:22047402

  1. Towards Viral Genome Annotation Standards, Report from the 2010 NCBI Annotation Workshop.

    PubMed

    Brister, James Rodney; Bao, Yiming; Kuiken, Carla; Lefkowitz, Elliot J; Le Mercier, Philippe; Leplae, Raphael; Madupu, Ramana; Scheuermann, Richard H; Schobel, Seth; Seto, Donald; Shrivastava, Susmita; Sterk, Peter; Zeng, Qiandong; Klimke, William; Tatusova, Tatiana

    2010-10-01

    Improvements in DNA sequencing technologies portend a new era in virology and could possibly lead to a giant leap in our understanding of viral evolution and ecology. Yet, as viral genome sequences begin to fill the world's biological databases, it is critically important to recognize that the scientific promise of this era is dependent on consistent and comprehensive genome annotation. With this in mind, the NCBI Genome Annotation Workshop recently hosted a study group tasked with developing sequence, function, and metadata annotation standards for viral genomes. This report describes the issues involved in viral genome annotation and reviews policy recommendations presented at the NCBI Annotation Workshop.

  2. Towards Viral Genome Annotation Standards, Report from the 2010 NCBI Annotation Workshop

    PubMed Central

    Brister, James Rodney; Bao, Yiming; Kuiken, Carla; Lefkowitz, Elliot J.; Le Mercier, Philippe; Leplae, Raphael; Madupu, Ramana; Scheuermann, Richard H.; Schobel, Seth; Seto, Donald; Shrivastava, Susmita; Sterk, Peter; Zeng, Qiandong; Klimke, William; Tatusova, Tatiana

    2010-01-01

    Improvements in DNA sequencing technologies portend a new era in virology and could possibly lead to a giant leap in our understanding of viral evolution and ecology. Yet, as viral genome sequences begin to fill the world’s biological databases, it is critically important to recognize that the scientific promise of this era is dependent on consistent and comprehensive genome annotation. With this in mind, the NCBI Genome Annotation Workshop recently hosted a study group tasked with developing sequence, function, and metadata annotation standards for viral genomes. This report describes the issues involved in viral genome annotation and reviews policy recommendations presented at the NCBI Annotation Workshop. PMID:21994619

  3. Eight Elements of High School Improvement: An Annotated Bibliography

    ERIC Educational Resources Information Center

    Jones, Wehmah

    2009-01-01

    This paper provides a table that contains a summary of resources available that support the National High School Center's "Eight Elements of High School Improvement: A Mapping Framework". This framework is based on eight common elements of systemic school reform and provides a structure for understanding, aligning, and prioritizing…

  4. Improving Automated Annotation of Benthic Survey Images Using Wide-band Fluorescence

    PubMed Central

    Beijbom, Oscar; Treibitz, Tali; Kline, David I.; Eyal, Gal; Khen, Adi; Neal, Benjamin; Loya, Yossi; Mitchell, B. Greg; Kriegman, David

    2016-01-01

    Large-scale imaging techniques are used increasingly for ecological surveys. However, manual analysis can be prohibitively expensive, creating a bottleneck between collected images and desired data-products. This bottleneck is particularly severe for benthic surveys, where millions of images are obtained each year. Recent automated annotation methods may provide a solution, but reflectance images do not always contain sufficient information for adequate classification accuracy. In this work, the FluorIS, a low-cost modified consumer camera, was used to capture wide-band wide-field-of-view fluorescence images during a field deployment in Eilat, Israel. The fluorescence images were registered with standard reflectance images, and an automated annotation method based on convolutional neural networks was developed. Our results demonstrate a 22% reduction of classification error-rate when using both images types compared to only using reflectance images. The improvements were large, in particular, for coral reef genera Platygyra, Acropora and Millepora, where classification recall improved by 38%, 33%, and 41%, respectively. We conclude that convolutional neural networks can be used to combine reflectance and fluorescence imagery in order to significantly improve automated annotation accuracy and reduce the manual annotation bottleneck. PMID:27021133

  5. Polymorphism identification and improved genome annotation of Brassica rapa through Deep RNA sequencing.

    PubMed

    Devisetty, Upendra Kumar; Covington, Michael F; Tat, An V; Lekkala, Saradadevi; Maloof, Julin N

    2014-08-12

    The mapping and functional analysis of quantitative traits in Brassica rapa can be greatly improved with the availability of physically positioned, gene-based genetic markers and accurate genome annotation. In this study, deep transcriptome RNA sequencing (RNA-Seq) of Brassica rapa was undertaken with two objectives: SNP detection and improved transcriptome annotation. We performed SNP detection on two varieties that are parents of a mapping population to aid in development of a marker system for this population and subsequent development of high-resolution genetic map. An improved Brassica rapa transcriptome was constructed to detect novel transcripts and to improve the current genome annotation. This is useful for accurate mRNA abundance and detection of expression QTL (eQTLs) in mapping populations. Deep RNA-Seq of two Brassica rapa genotypes-R500 (var. trilocularis, Yellow Sarson) and IMB211 (a rapid cycling variety)-using eight different tissues (root, internode, leaf, petiole, apical meristem, floral meristem, silique, and seedling) grown across three different environments (growth chamber, greenhouse and field) and under two different treatments (simulated sun and simulated shade) generated 2.3 billion high-quality Illumina reads. A total of 330,995 SNPs were identified in transcribed regions between the two genotypes with an average frequency of one SNP in every 200 bases. The deep RNA-Seq reassembled Brassica rapa transcriptome identified 44,239 protein-coding genes. Compared with current gene models of B. rapa, we detected 3537 novel transcripts, 23,754 gene models had structural modifications, and 3655 annotated proteins changed. Gaps in the current genome assembly of B. rapa are highlighted by our identification of 780 unmapped transcripts. All the SNPs, annotations, and predicted transcripts can be viewed at http://phytonetworks.ucdavis.edu/.

  6. Polymorphism Identification and Improved Genome Annotation of Brassica rapa Through Deep RNA Sequencing

    PubMed Central

    Devisetty, Upendra Kumar; Covington, Michael F.; Tat, An V.; Lekkala, Saradadevi; Maloof, Julin N.

    2014-01-01

    The mapping and functional analysis of quantitative traits in Brassica rapa can be greatly improved with the availability of physically positioned, gene-based genetic markers and accurate genome annotation. In this study, deep transcriptome RNA sequencing (RNA-Seq) of Brassica rapa was undertaken with two objectives: SNP detection and improved transcriptome annotation. We performed SNP detection on two varieties that are parents of a mapping population to aid in development of a marker system for this population and subsequent development of high-resolution genetic map. An improved Brassica rapa transcriptome was constructed to detect novel transcripts and to improve the current genome annotation. This is useful for accurate mRNA abundance and detection of expression QTL (eQTLs) in mapping populations. Deep RNA-Seq of two Brassica rapa genotypes—R500 (var. trilocularis, Yellow Sarson) and IMB211 (a rapid cycling variety)—using eight different tissues (root, internode, leaf, petiole, apical meristem, floral meristem, silique, and seedling) grown across three different environments (growth chamber, greenhouse and field) and under two different treatments (simulated sun and simulated shade) generated 2.3 billion high-quality Illumina reads. A total of 330,995 SNPs were identified in transcribed regions between the two genotypes with an average frequency of one SNP in every 200 bases. The deep RNA-Seq reassembled Brassica rapa transcriptome identified 44,239 protein-coding genes. Compared with current gene models of B. rapa, we detected 3537 novel transcripts, 23,754 gene models had structural modifications, and 3655 annotated proteins changed. Gaps in the current genome assembly of B. rapa are highlighted by our identification of 780 unmapped transcripts. All the SNPs, annotations, and predicted transcripts can be viewed at http://phytonetworks.ucdavis.edu/. PMID:25122667

  7. Digital pathology annotation data for improved deep neural network classification

    NASA Astrophysics Data System (ADS)

    Kim, Edward; Mente, Sai Lakshmi Deepika; Keenan, Andrew; Gehlot, Vijay

    2017-03-01

    In the field of digital pathology, there is an explosive amount of imaging data being generated. Thus, there is an ever growing need to create assistive or automatic methods to analyze collections of images for screening and classification. Machine learning, specifically deep learning algorithms, developed for digital pathology have the potential to assist in this way. Deep learning architectures have demonstrated great success over existing classification models but require massive amounts of labeled training data that either doesn't exist or are cost and time prohibitive to obtain. In this project, we present a framework for representing, collecting, validating, and utilizing cytopathology features for improved neural network classification.

  8. A semi-quantitative, synteny-based method to improve functional predictions for hypothetical and poorly annotated bacterial and archaeal genes.

    PubMed

    Yelton, Alexis P; Thomas, Brian C; Simmons, Sheri L; Wilmes, Paul; Zemla, Adam; Thelen, Michael P; Justice, Nicholas; Banfield, Jillian F

    2011-10-01

    During microbial evolution, genome rearrangement increases with increasing sequence divergence. If the relationship between synteny and sequence divergence can be modeled, gene clusters in genomes of distantly related organisms exhibiting anomalous synteny can be identified and used to infer functional conservation. We applied the phylogenetic pairwise comparison method to establish and model a strong correlation between synteny and sequence divergence in all 634 available Archaeal and Bacterial genomes from the NCBI database and four newly assembled genomes of uncultivated Archaea from an acid mine drainage (AMD) community. In parallel, we established and modeled the trend between synteny and functional relatedness in the 118 genomes available in the STRING database. By combining these models, we developed a gene functional annotation method that weights evolutionary distance to estimate the probability of functional associations of syntenous proteins between genome pairs. The method was applied to the hypothetical proteins and poorly annotated genes in newly assembled acid mine drainage Archaeal genomes to add or improve gene annotations. This is the first method to assign possible functions to poorly annotated genes through quantification of the probability of gene functional relationships based on synteny at a significant evolutionary distance, and has the potential for broad application.

  9. CruzDB: software for annotation of genomic intervals with UCSC genome-browser database

    PubMed Central

    Pedersen, Brent S.; Yang, Ivana V.; De, Subhajyoti

    2013-01-01

    Motivation: The biological significance of genomic features is often context dependent. Annotating a particular dataset with existing external data can provide insight into function. Results: We present CruzDB, a fast and intuitive programmatic interface to the University of California, Santa Cruz (UCSC) genome browser that facilitates integrative analyses of diverse local and remotely hosted datasets. We showcase the syntax of CruzDB using microRNA binding sites as examples, and further demonstrate its utility with three biological discoveries. First, DNA replication timing is stratified in gene regions—exons tend to replicate early and introns late during S phase. Second, several non-coding variants associated with cognitive functions map to lincRNA transcripts of relevant function, suggesting potential function of these regulatory RNAs in neuronal diseases. Third, lamina-associated genomic regions are highly enriched in olfaction-related genes, indicating a role of nuclear organization in their regulation. Availability: CruzDB is available at https://github.com/brentp/cruzdb under the MIT open-source license. Contact: bpederse@gmail.com or subhajyoti.de@ucdenver.edu Supplementary information: Supplementary data are available at Bioinformatics online. PMID:24037212

  10. The National Microbial Pathogen Database Resource (NMPDR): a genomics platform based on subsystem annotation

    PubMed Central

    McNeil, Leslie Klis; Reich, Claudia; Aziz, Ramy K.; Bartels, Daniela; Cohoon, Matthew; Disz, Terry; Edwards, Robert A.; Gerdes, Svetlana; Hwang, Kaitlyn; Kubal, Michael; Margaryan, Gohar Rem; Meyer, Folker; Mihalo, William; Olsen, Gary J.; Olson, Robert; Osterman, Andrei; Paarmann, Daniel; Paczian, Tobias; Parrello, Bruce; Pusch, Gordon D.; Rodionov, Dmitry A.; Shi, Xinghua; Vassieva, Olga; Vonstein, Veronika; Zagnitko, Olga; Xia, Fangfang; Zinner, Jenifer; Overbeek, Ross; Stevens, Rick

    2007-01-01

    The National Microbial Pathogen Data Resource (NMPDR) () is a National Institute of Allergy and Infections Disease (NIAID)-funded Bioinformatics Resource Center that supports research in selected Category B pathogens. NMPDR contains the complete genomes of ∼50 strains of pathogenic bacteria that are the focus of our curators, as well as >400 other genomes that provide a broad context for comparative analysis across the three phylogenetic Domains. NMPDR integrates complete, public genomes with expertly curated biological subsystems to provide the most consistent genome annotations. Subsystems are sets of functional roles related by a biologically meaningful organizing principle, which are built over large collections of genomes; they provide researchers with consistent functional assignments in a biologically structured context. Investigators can browse subsystems and reactions to develop accurate reconstructions of the metabolic networks of any sequenced organism. NMPDR provides a comprehensive bioinformatics platform, with tools and viewers for genome analysis. Results of precomputed gene clustering analyses can be retrieved in tabular or graphic format with one-click tools. NMPDR tools include Signature Genes, which finds the set of genes in common or that differentiates two groups of organisms. Essentiality data collated from genome-wide studies have been curated. Drug target identification and high-throughput, in silico, compound screening are in development. PMID:17145713

  11. Improved methods and resources for paramecium genomics: transcription units, gene annotation and gene expression.

    PubMed

    Arnaiz, Olivier; Van Dijk, Erwin; Bétermier, Mireille; Lhuillier-Akakpo, Maoussi; de Vanssay, Augustin; Duharcourt, Sandra; Sallet, Erika; Gouzy, Jérôme; Sperling, Linda

    2017-06-26

    The 15 sibling species of the Paramecium aurelia cryptic species complex emerged after a whole genome duplication that occurred tens of millions of years ago. Given extensive knowledge of the genetics and epigenetics of Paramecium acquired over the last century, this species complex offers a uniquely powerful system to investigate the consequences of whole genome duplication in a unicellular eukaryote as well as the genetic and epigenetic mechanisms that drive speciation. High quality Paramecium gene models are important for research using this system. The major aim of the work reported here was to build an improved gene annotation pipeline for the Paramecium lineage. We generated oriented RNA-Seq transcriptome data across the sexual process of autogamy for the model species Paramecium tetraurelia. We determined, for the first time in a ciliate, candidate P. tetraurelia transcription start sites using an adapted Cap-Seq protocol. We developed TrUC, multi-threaded Perl software that in conjunction with TopHat mapping of RNA-Seq data to a reference genome, predicts transcription units for the annotation pipeline. We used EuGene software to combine annotation evidence. The high quality gene structural annotations obtained for P. tetraurelia were used as evidence to improve published annotations for 3 other Paramecium species. The RNA-Seq data were also used for differential gene expression analysis, providing a gene expression atlas that is more sensitive than the previously established microarray resource. We have developed a gene annotation pipeline tailored for the compact genomes and tiny introns of Paramecium species. A novel component of this pipeline, TrUC, predicts transcription units using Cap-Seq and oriented RNA-Seq data. TrUC could prove useful beyond Paramecium, especially in the case of high gene density. Accurate predictions of 3' and 5' UTR will be particularly valuable for studies of gene expression (e.g. nucleosome positioning, identification of cis

  12. A Post-assembly genome-improvement toolkit (PAGIT) to obtain annotated genomes from contigs

    PubMed Central

    Swain, Martin T.; Tsai, Isheng J.; Assefa, Samual A.; Newbold, Chris; Berriman, Matthew; Otto, Thomas D.

    2013-01-01

    Genome projects now produce draft assemblies within weeks thanks to advanced high-throughput sequencing technologies. For milestone projects like E. coli or H. sapiens, teams of scientists were employed to manually curate and finish these genomes to a high standard. Nowadays, this is not feasible for most projects and the quality of genomes is generally of a much lower standard. This protocol describes software (PAGIT, post-assembly genome-improvement toolkit) to improve the quality of draft genomes. It offers flexible functionality to close gaps in scaffolds, correct base errors in the consensus sequence, and to exploit reference genomes (if available) for improving scaffolding and generating annotations. The protocol is most accessible for bacterial and small Eukaryotic genomes (up to 300 Mb), such as pathogenic bacteria, malaria and parasitic worms. Applying PAGIT to an E. coli assembly takes approximately 24 hours: it doubles the average contig size and annotates over 4300 gene models. PMID:22678431

  13. An annotated bibliography of selected guides for stream habitat improvement in the Pacific Northwest

    USGS Publications Warehouse

    Keim, R.F.; Price, A.B.; Hardin, T. S.; Skaugset, Arne E.; Bateman, D.S.; Gresswell, R.E.; Tesch, S. D.

    2004-01-01

    This annotated bibliography is a response to widespread interest in stream habitat improvement in the Pacific Northwest by land managers, governmental and nongovernmental organizations, and the lay public. Several guides to stream habitat improvement have been written in the past, but may not be easily accessible to people from diverse backgrounds. This annotated bibliography reviews 11 guides to stream habitat improvement so that readers can find literature appropriate to their needs. All reviews begin with summaries of the contents, stated audiences, and goals of each guide. Reviews also include subjective comments on the strengths and weaknesses of each guide. Finally, this bibliography includes recommendations of guides and combinations of guides judged most useful for a range of purposes. 

  14. Key2Ann: a tool to process sequence sets by replacing database identifiers with a human-readable annotation.

    PubMed

    Pürzer, Andreas; Grassmann, Felix; Birzer, Dietmar; Merkl, Rainer

    2011-03-04

    Deducing common properties or degrees of phylogenetic relationship by analyzing a grouping or clustering of sequence sets is a frequently used technique in computational biology. If interpreted by means of visual inspection, the conclusions depend for many of these applications on meaningful names for the input data. In accordance with the aim of the analysis, the sequences should be provided with names indicating the function of the genes or gene-products, the phylogenetic position or other properties characterizing the contributing species. However, sequences extracted from databases are most often annotated with identifiers which only implicitly contain the desired information. To solve this problem, we have designed and implemented a tool named Key2Ann, which replaces in multiple fasta files the database keys with short terms indicating the taxonomic position or other features like the gene name or the EC-number. In addition, properties like habitat, growth temperature or the degree of pathogenicity can be coded for microbial species. To allow for highest flexibility, the user can control the composition of the names by means of command line parameters. Key2Ann is written in Java and can be downloaded via http://www-bioinf.uni-regensburg.de/downl/Key2Ann.zip. We demonstrate the usage of Key2Ann by discussing three typical examples of phylogenetic analysis.

  15. GELBANK : A database of annotated two-dimensional gel electrophoresis patterns of biological systems with completed genomes.

    SciTech Connect

    Babnigg, G.; Giometti, C. S.; Biosciences Division

    2004-01-01

    GELBANK is a publicly available database of two-dimensional gel electrophoresis (2DE) gel patterns of proteomes from organisms with known genome information (available at and ftp://bioinformatics.anl.gov/gelbank/). Currently it includes 131 completed, mostly microbial proteomes available from the National Center for Biotechnology Information. A web interface allows the upload of 2D gel patterns and their annotation for registered users. The images are organized by species, tissue type, separation method, sample type and staining method. The database can be queried based on protein or 2DE-pattern attributes. A web interface allows registered users to assign molecular weight and pH gradient profiles to their own 2D gel patterns as well as to link protein identifications to a given spot on the pattern. The website presents all of the submitted 2D gel patterns where the end-user can dynamically display the images or parts of images along with molecular weight, pH profile information and linked protein identification. A collection of images can be selected for the creation of animations from which the user can select sub-regions of interest and unlimited 2D gel patterns for visualization. The website currently presents 233 identifications for 81 gel patterns for Homo sapiens, Methanococcus jannaschii, Pyro coccus furiosus, Shewanella oneidensis, Escherichia coli and Deinococcus radiodurans.

  16. IsoSCM: improved and alternative 3′ UTR annotation using multiple change-point inference

    PubMed Central

    Shenker, Sol; Miura, Pedro; Sanfilippo, Piero

    2015-01-01

    Major applications of RNA-seq data include studies of how the transcriptome is modulated at the levels of gene expression and RNA processing, and how these events are related to cellular identity, environmental condition, and/or disease status. While many excellent tools have been developed to analyze RNA-seq data, these generally have limited efficacy for annotating 3′ UTRs. Existing assembly strategies often fragment long 3′ UTRs, and importantly, none of the algorithms in popular use can apportion data into tandem 3′ UTR isoforms, which are frequently generated by alternative cleavage and polyadenylation (APA). Consequently, it is often not possible to identify patterns of differential APA using existing assembly tools. To address these limitations, we present a new method for transcript assembly, Isoform Structural Change Model (IsoSCM) that incorporates change-point analysis to improve the 3′ UTR annotation process. Through evaluation on simulated and genuine data sets, we demonstrate that IsoSCM annotates 3′ termini with higher sensitivity and specificity than can be achieved with existing methods. We highlight the utility of IsoSCM by demonstrating its ability to recover known patterns of tissue-regulated APA. IsoSCM will facilitate future efforts for 3′ UTR annotation and genome-wide studies of the breadth, regulation, and roles of APA leveraging RNA-seq data. The IsoSCM software and source code are available from our website https://github.com/shenkers/isoscm. PMID:25406361

  17. Enchytraeus albidus Microarray: Enrichment, Design, Annotation and Database (EnchyBASE)

    PubMed Central

    Novais, Sara C.; Arrais, Joel; Lopes, Pedro; Vandenbrouck, Tine; De Coen, Wim; Roelofs, Dick; Soares, Amadeu M. V. M.; Amorim, Mónica J. B.

    2012-01-01

    Enchytraeus albidus (Oligochaeta) is an ecologically relevant species used as standard test organisms for risk assessment. Effects of stressors in this species are commonly determined at the population level using reproduction and survival as endpoints. The assessment of transcriptomic responses can be very useful e.g. to understand underlying mechanisms of toxicity with gene expression fingerprinting. In the present paper the following is being addressed: 1) development of suppressive subtractive hybridization (SSH) libraries enriched for differentially expressed genes after metal and pesticide exposures; 2) sequencing and characterization of all generated cDNA inserts; 3) development of a publicly available genomic database on E. albidus. A total of 2100 Expressed Sequence Tags (ESTs) were isolated, sequenced and assembled into 1124 clusters (947 singletons and 177 contigs). From these sequences, 41% matched known proteins in GenBank (BLASTX, e-value≤10-5) and 37% had at least one Gene Ontology (GO) term assigned. In total, 5.5% of the sequences were assigned to a metabolic pathway, based on KEGG. With this new sequencing information, an Agilent custom oligonucleotide microarray was designed, representing a potential tool for transcriptomic studies. EnchyBASE (http://bioinformatics.ua.pt/enchybase/) was developed as a web freely available database containing genomic information on E. albidus and will be further extended in the near future for other enchytraeid species. The database so far includes all ESTs generated for E. albidus from three cDNA libraries. This information can be downloaded and applied in functional genomics and transcription studies. PMID:22558086

  18. Involving Undergraduates in the Annotation and Analysis of Global Gene Expression Studies: Creation of a Maize Shoot Apical Meristem Expression Database

    PubMed Central

    Buckner, Brent; Beck, Jon; Browning, Kate; Fritz, Ashleigh; Grantham, Lisa; Hoxha, Eneda; Kamvar, Zhian; Lough, Ashley; Nikolova, Olga; Schnable, Patrick S.; Scanlon, Michael J.; Janick-Buckner, Diane

    2007-01-01

    Through a multi-university and interdisciplinary project we have involved undergraduate biology and computer science research students in the functional annotation of maize genes and the analysis of their microarray expression patterns. We have created a database to house the results of our functional annotation of >4400 genes identified as being differentially regulated in the maize shoot apical meristem (SAM). This database is located at http://sam.truman.edu and is now available for public use. The undergraduate students involved in constructing this unique SAM database received hands-on training in an intellectually challenging environment, which has prepared them for graduate and professional careers in biological sciences. We describe our experiences with this project as a model for effective research-based teaching of undergraduate biology and computer science students, as well as for a rich professional development experience for faculty at predominantly undergraduate institutions. PMID:17409087

  19. NDER: A novel web application using annotated whole slide images for rapid improvements in human pattern recognition.

    PubMed

    Reder, Nicholas P; Glasser, Daniel; Dintzis, Suzanne M; Rendi, Mara H; Garcia, Rochelle L; Henriksen, Jonathan C; Kilgore, Mark R

    2016-01-01

    Whole-slide images (WSIs) present a rich source of information for education, training, and quality assurance. However, they are often used in a fashion similar to glass slides rather than in novel ways that leverage the advantages of WSI. We have created a pipeline to transform annotated WSI into pattern recognition training, and quality assurance web application called novel diagnostic electronic resource (NDER). Create an efficient workflow for extracting annotated WSI for use by NDER, an attractive web application that provides high-throughput training. WSI were annotated by a resident and classified into five categories. Two methods of extracting images and creating image databases were compared. Extraction Method 1: Manual extraction of still images and validation of each image by four breast pathologists. Extraction Method 2: Validation of annotated regions on the WSI by a single experienced breast pathologist and automated extraction of still images tagged by diagnosis. The extracted still images were used by NDER. NDER briefly displays an image, requires users to classify the image after time has expired, then gives users immediate feedback. The NDER workflow is efficient: annotation of a WSI requires 5 min and validation by an expert pathologist requires An additional one to 2 min. The pipeline is highly automated, with only annotation and validation requiring human input. NDER effectively displays hundreds of high-quality, high-resolution images and provides immediate feedback to users during a 30 min session. NDER efficiently uses annotated WSI to rapidly increase pattern recognition and evaluate for diagnostic proficiency.

  20. Molecular Dynamics Information Improves cis-Peptide-Based Function Annotation of Proteins.

    PubMed

    Das, Sreetama; Bhadra, Pratiti; Ramakumar, Suryanarayanarao; Pal, Debnath

    2017-08-04

    cis-Peptide bonds, whose occurrence in proteins is rare but evolutionarily conserved, are implicated to play an important role in protein function. This has led to their previous use in a homology-independent, fragment-match-based protein function annotation method. However, proteins are not static molecules; dynamics is integral to their activity. This is nicely epitomized by the geometric isomerization of cis-peptide to trans form for molecular activity. Hence we have incorporated both static (cis-peptide) and dynamics information to improve the prediction of protein molecular function. Our results show that cis-peptide information alone cannot detect functional matches in cases where cis-trans isomerization exists but 3D coordinates have been obtained for only the trans isomer or when the cis-peptide bond is incorrectly assigned as trans. On the contrary, use of dynamics information alone includes false-positive matches for cases where fragments with similar secondary structure show similar dynamics, but the proteins do not share a common function. Combining the two methods reduces errors while detecting the true matches, thereby enhancing the utility of our method in function annotation. A combined approach, therefore, opens up new avenues of improving existing automated function annotation methodologies.

  1. Annotating nonspecific SAGE tags with microarray data.

    PubMed

    Ge, Xijin; Jung, Yong-Chul; Wu, Qingfa; Kibbe, Warren A; Wang, San Ming

    2006-01-01

    SAGE (serial analysis of gene expression) detects transcripts by extracting short tags from the transcripts. Because of the limited length, many SAGE tags are shared by transcripts from different genes. Relying on sequence information in the general gene expression database has limited power to solve this problem due to the highly heterogeneous nature of the deposited sequences. Considering that the complexity of gene expression at a single tissue level should be much simpler than that in the general expression database, we reasoned that by restricting gene expression to tissue level, the accuracy of gene annotation for the nonspecific SAGE tags should be significantly improved. To test the idea, we developed a tissue-specific SAGE annotation database based on microarray data (). This database contains microarray expression information represented as UniGene clusters for 73 normal human tissues and 18 cancer tissues and cell lines. The nonspecific SAGE tag is first matched to the database by the same tissue type used by both SAGE and microarray analysis; then the multiple UniGene clusters assigned to the nonspecific SAGE tag are searched in the database under the matched tissue type. The UniGene cluster presented solely or at higher expression levels in the database is annotated to represent the specific gene for the nonspecific SAGE tags. The accuracy of gene annotation by this database was largely confirmed by experimental data. Our study shows that microarray data provide a useful source for annotating the nonspecific SAGE tags.

  2. The new modern era of yeast genomics: community sequencing and the resulting annotation of multiple Saccharomyces cerevisiae strains at the Saccharomyces Genome Database.

    PubMed

    Engel, Stacia R; Cherry, J Michael

    2013-01-01

    The first completed eukaryotic genome sequence was that of the yeast Saccharomyces cerevisiae, and the Saccharomyces Genome Database (SGD; http://www.yeastgenome.org/) is the original model organism database. SGD remains the authoritative community resource for the S. cerevisiae reference genome sequence and its annotation, and continues to provide comprehensive biological information correlated with S. cerevisiae genes and their products. A diverse set of yeast strains have been sequenced to explore commercial and laboratory applications, and a brief history of those strains is provided. The publication of these new genomes has motivated the creation of new tools, and SGD will annotate and provide comparative analyses of these sequences, correlating changes with variations in strain phenotypes and protein function. We are entering a new era at SGD, as we incorporate these new sequences and make them accessible to the scientific community, all in an effort to continue in our mission of educating researchers and facilitating discovery.

  3. EnzDP: improved enzyme annotation for metabolic network reconstruction based on domain composition profiles.

    PubMed

    Nguyen, Nam-Ninh; Srihari, Sriganesh; Leong, Hon Wai; Chong, Ket-Fah

    2015-10-01

    Determining the entire complement of enzymes and their enzymatic functions is a fundamental step for reconstructing the metabolic network of cells. High quality enzyme annotation helps in enhancing metabolic networks reconstructed from the genome, especially by reducing gaps and increasing the enzyme coverage. Currently, structure-based and network-based approaches can only cover a limited number of enzyme families, and the accuracy of homology-based approaches can be further improved. Bottom-up homology-based approach improves the coverage by rebuilding Hidden Markov Model (HMM) profiles for all known enzymes. However, its clustering procedure relies firmly on BLAST similarity score, ignoring protein domains/patterns, and is sensitive to changes in cut-off thresholds. Here, we use functional domain architecture to score the association between domain families and enzyme families (Domain-Enzyme Association Scoring, DEAS). The DEAS score is used to calculate the similarity between proteins, which is then used in clustering procedure, instead of using sequence similarity score. We improve the enzyme annotation protocol using a stringent classification procedure, and by choosing optimal threshold settings and checking for active sites. Our analysis shows that our stringent protocol EnzDP can cover up to 90% of enzyme families available in Swiss-Prot. It achieves a high accuracy of 94.5% based on five-fold cross-validation. EnzDP outperforms existing methods across several testing scenarios. Thus, EnzDP serves as a reliable automated tool for enzyme annotation and metabolic network reconstruction. Available at: www.comp.nus.edu.sg/~nguyennn/EnzDP .

  4. Quality of Computationally Inferred Gene Ontology Annotations

    PubMed Central

    Škunca, Nives; Altenhoff, Adrian; Dessimoz, Christophe

    2012-01-01

    Gene Ontology (GO) has established itself as the undisputed standard for protein function annotation. Most annotations are inferred electronically, i.e. without individual curator supervision, but they are widely considered unreliable. At the same time, we crucially depend on those automated annotations, as most newly sequenced genomes are non-model organisms. Here, we introduce a methodology to systematically and quantitatively evaluate electronic annotations. By exploiting changes in successive releases of the UniProt Gene Ontology Annotation database, we assessed the quality of electronic annotations in terms of specificity, reliability, and coverage. Overall, we not only found that electronic annotations have significantly improved in recent years, but also that their reliability now rivals that of annotations inferred by curators when they use evidence other than experiments from primary literature. This work provides the means to identify the subset of electronic annotations that can be relied upon—an important outcome given that >98% of all annotations are inferred without direct curation. PMID:22693439

  5. KEGG orthology-based annotation of the predicted proteome of Acropora digitifera: ZoophyteBase - an open access and searchable database of a coral genome

    PubMed Central

    2013-01-01

    Background Contemporary coral reef research has firmly established that a genomic approach is urgently needed to better understand the effects of anthropogenic environmental stress and global climate change on coral holobiont interactions. Here we present KEGG orthology-based annotation of the complete genome sequence of the scleractinian coral Acropora digitifera and provide the first comprehensive view of the genome of a reef-building coral by applying advanced bioinformatics. Description Sequences from the KEGG database of protein function were used to construct hidden Markov models. These models were used to search the predicted proteome of A. digitifera to establish complete genomic annotation. The annotated dataset is published in ZoophyteBase, an open access format with different options for searching the data. A particularly useful feature is the ability to use a Google-like search engine that links query words to protein attributes. We present features of the annotation that underpin the molecular structure of key processes of coral physiology that include (1) regulatory proteins of symbiosis, (2) planula and early developmental proteins, (3) neural messengers, receptors and sensory proteins, (4) calcification and Ca2+-signalling proteins, (5) plant-derived proteins, (6) proteins of nitrogen metabolism, (7) DNA repair proteins, (8) stress response proteins, (9) antioxidant and redox-protective proteins, (10) proteins of cellular apoptosis, (11) microbial symbioses and pathogenicity proteins, (12) proteins of viral pathogenicity, (13) toxins and venom, (14) proteins of the chemical defensome and (15) coral epigenetics. Conclusions We advocate that providing annotation in an open-access searchable database available to the public domain will give an unprecedented foundation to interrogate the fundamental molecular structure and interactions of coral symbiosis and allow critical questions to be addressed at the genomic level based on combined aspects of

  6. A multi-label image annotation scheme based on improved SVM multiple kernel learning

    NASA Astrophysics Data System (ADS)

    Jin, Cong; Jin, Shu-Wei

    2017-02-01

    Multi-label image annotation (MIA) has been widely studied during recent years and many MIA schemes have been proposed. However, the most existing schemes are not satisfactory. In this paper, an improved multiple kernel learning (IMKL) method of support vector machine (SVM) is proposed to improve the classification accuracy of SVM, then a novel MIA scheme based on IMKL is presented, which uses the discriminant loss to control the number of top semantic labels, and the feature selection approach is also used for improving the performance of MIA. The experiment results show that proposed MIA scheme achieves higher the performance than the existing other MIA schemes, its performance is satisfactory for large image dataset.

  7. cnvScan: a CNV screening and annotation tool to improve the clinical utility of computational CNV prediction from exome sequencing data.

    PubMed

    Samarakoon, Pubudu Saneth; Sorte, Hanne Sørmo; Stray-Pedersen, Asbjørg; Rødningen, Olaug Kristin; Rognes, Torbjørn; Lyle, Robert

    2016-01-14

    With advances in next generation sequencing technology and analysis methods, single nucleotide variants (SNVs) and indels can be detected with high sensitivity and specificity in exome sequencing data. Recent studies have demonstrated the ability to detect disease-causing copy number variants (CNVs) in exome sequencing data. However, exonic CNV prediction programs have shown high false positive CNV counts, which is the major limiting factor for the applicability of these programs in clinical studies. We have developed a tool (cnvScan) to improve the clinical utility of computational CNV prediction in exome data. cnvScan can accept input from any CNV prediction program. cnvScan consists of two steps: CNV screening and CNV annotation. CNV screening evaluates CNV prediction using quality scores and refines this using an in-house CNV database, which greatly reduces the false positive rate. The annotation step provides functionally and clinically relevant information using multiple source datasets. We assessed the performance of cnvScan on CNV predictions from five different prediction programs using 64 exomes from Primary Immunodeficiency (PIDD) patients, and identified PIDD-causing CNVs in three individuals from two different families. In summary, cnvScan reduces the time and effort required to detect disease-causing CNVs by reducing the false positive count and providing annotation. This improves the clinical utility of CNV detection in exome data.

  8. NDER: A novel web application using annotated whole slide images for rapid improvements in human pattern recognition

    PubMed Central

    Reder, Nicholas P.; Glasser, Daniel; Dintzis, Suzanne M.; Rendi, Mara H.; Garcia, Rochelle L.; Henriksen, Jonathan C.; Kilgore, Mark R.

    2016-01-01

    Context: Whole-slide images (WSIs) present a rich source of information for education, training, and quality assurance. However, they are often used in a fashion similar to glass slides rather than in novel ways that leverage the advantages of WSI. We have created a pipeline to transform annotated WSI into pattern recognition training, and quality assurance web application called novel diagnostic electronic resource (NDER). Aims: Create an efficient workflow for extracting annotated WSI for use by NDER, an attractive web application that provides high-throughput training. Materials and Methods: WSI were annotated by a resident and classified into five categories. Two methods of extracting images and creating image databases were compared. Extraction Method 1: Manual extraction of still images and validation of each image by four breast pathologists. Extraction Method 2: Validation of annotated regions on the WSI by a single experienced breast pathologist and automated extraction of still images tagged by diagnosis. The extracted still images were used by NDER. NDER briefly displays an image, requires users to classify the image after time has expired, then gives users immediate feedback. Results: The NDER workflow is efficient: annotation of a WSI requires 5 min and validation by an expert pathologist requires An additional one to 2 min. The pipeline is highly automated, with only annotation and validation requiring human input. NDER effectively displays hundreds of high-quality, high-resolution images and provides immediate feedback to users during a 30 min session. Conclusions: NDER efficiently uses annotated WSI to rapidly increase pattern recognition and evaluate for diagnostic proficiency. PMID:27563490

  9. SigmoID: a user-friendly tool for improving bacterial genome annotation through analysis of transcription control signals.

    PubMed

    Nikolaichik, Yevgeny; Damienikan, Aliaksandr U

    2016-01-01

    The majority of bacterial genome annotations are currently automated and based on a 'gene by gene' approach. Regulatory signals and operon structures are rarely taken into account which often results in incomplete and even incorrect gene function assignments. Here we present SigmoID, a cross-platform (OS X, Linux and Windows) open-source application aiming at simplifying the identification of transcription regulatory sites (promoters, transcription factor binding sites and terminators) in bacterial genomes and providing assistance in correcting annotations in accordance with regulatory information. SigmoID combines a user-friendly graphical interface to well known command line tools with a genome browser for visualising regulatory elements in genomic context. Integrated access to online databases with regulatory information (RegPrecise and RegulonDB) and web-based search engines speeds up genome analysis and simplifies correction of genome annotation. We demonstrate some features of SigmoID by constructing a series of regulatory protein binding site profiles for two groups of bacteria: Soft Rot Enterobacteriaceae (Pectobacterium and Dickeya spp.) and Pseudomonas spp. Furthermore, we inferred over 900 transcription factor binding sites and alternative sigma factor promoters in the annotated genome of Pectobacterium atrosepticum. These regulatory signals control putative transcription units covering about 40% of the P. atrosepticum chromosome. Reviewing the annotation in cases where it didn't fit with regulatory information allowed us to correct product and gene names for over 300 loci.

  10. SigmoID: a user-friendly tool for improving bacterial genome annotation through analysis of transcription control signals

    PubMed Central

    Damienikan, Aliaksandr U.

    2016-01-01

    The majority of bacterial genome annotations are currently automated and based on a ‘gene by gene’ approach. Regulatory signals and operon structures are rarely taken into account which often results in incomplete and even incorrect gene function assignments. Here we present SigmoID, a cross-platform (OS X, Linux and Windows) open-source application aiming at simplifying the identification of transcription regulatory sites (promoters, transcription factor binding sites and terminators) in bacterial genomes and providing assistance in correcting annotations in accordance with regulatory information. SigmoID combines a user-friendly graphical interface to well known command line tools with a genome browser for visualising regulatory elements in genomic context. Integrated access to online databases with regulatory information (RegPrecise and RegulonDB) and web-based search engines speeds up genome analysis and simplifies correction of genome annotation. We demonstrate some features of SigmoID by constructing a series of regulatory protein binding site profiles for two groups of bacteria: Soft Rot Enterobacteriaceae (Pectobacterium and Dickeya spp.) and Pseudomonas spp. Furthermore, we inferred over 900 transcription factor binding sites and alternative sigma factor promoters in the annotated genome of Pectobacterium atrosepticum. These regulatory signals control putative transcription units covering about 40% of the P. atrosepticum chromosome. Reviewing the annotation in cases where it didn’t fit with regulatory information allowed us to correct product and gene names for over 300 loci. PMID:27257541

  11. Annotated Bibliography: Understanding Ambulatory Care Practices in the Context of Patient Safety and Quality Improvement.

    PubMed

    Montano, Maria F; Mehdi, Harshal; Nash, David B

    2016-11-01

    The ambulatory care setting is an increasingly important component of the patient safety conversation. Inpatient safety is the primary focus of the vast majority of safety research and interventions, but the ambulatory setting is actually where most medical care is administered. Recent attention has shifted toward examining ambulatory care in order to implement better health care quality and safety practices. This annotated bibliography was created to analyze and augment the current literature on ambulatory care practices with regard to patient safety and quality improvement. By providing a thorough examination of current practices, potential improvement strategies in ambulatory care health care settings can be suggested. A better understanding of the myriad factors that influence delivery of patient care will catalyze future health care system development and implementation in the ambulatory setting.

  12. A novel method for crosstalk analysis of biological networks: improving accuracy of pathway annotation

    PubMed Central

    Ogris, Christoph; Guala, Dimitri; Helleday, Thomas; Sonnhammer, Erik L. L.

    2017-01-01

    Analyzing gene expression patterns is a mainstay to gain functional insights of biological systems. A plethora of tools exist to identify significant enrichment of pathways for a set of differentially expressed genes. Most tools analyze gene overlap between gene sets and are therefore severely hampered by the current state of pathway annotation, yet at the same time they run a high risk of false assignments. A way to improve both true positive and false positive rates (FPRs) is to use a functional association network and instead look for enrichment of network connections between gene sets. We present a new network crosstalk analysis method BinoX that determines the statistical significance of network link enrichment or depletion between gene sets, using the binomial distribution. This is a much more appropriate statistical model than previous methods have employed, and as a result BinoX yields substantially better true positive and FPRs than was possible before. A number of benchmarks were performed to assess the accuracy of BinoX and competing methods. We demonstrate examples of how BinoX finds many biologically meaningful pathway annotations for gene sets from cancer and other diseases, which are not found by other methods. BinoX is available at http://sonnhammer.org/BinoX. PMID:27664219

  13. The new modern era of yeast genomics: community sequencing and the resulting annotation of multiple Saccharomyces cerevisiae strains at the Saccharomyces Genome Database

    PubMed Central

    Engel, Stacia R.; Cherry, J. Michael

    2013-01-01

    The first completed eukaryotic genome sequence was that of the yeast Saccharomyces cerevisiae, and the Saccharomyces Genome Database (SGD; http://www.yeastgenome.org/) is the original model organism database. SGD remains the authoritative community resource for the S. cerevisiae reference genome sequence and its annotation, and continues to provide comprehensive biological information correlated with S. cerevisiae genes and their products. A diverse set of yeast strains have been sequenced to explore commercial and laboratory applications, and a brief history of those strains is provided. The publication of these new genomes has motivated the creation of new tools, and SGD will annotate and provide comparative analyses of these sequences, correlating changes with variations in strain phenotypes and protein function. We are entering a new era at SGD, as we incorporate these new sequences and make them accessible to the scientific community, all in an effort to continue in our mission of educating researchers and facilitating discovery. Database URL: http://www.yeastgenome.org/ PMID:23487186

  14. dbNSFP v3.0: A One-Stop Database of Functional Predictions and Annotations for Human Nonsynonymous and Splice-Site SNVs.

    PubMed

    Liu, Xiaoming; Wu, Chunlei; Li, Chang; Boerwinkle, Eric

    2016-03-01

    The purpose of the dbNSFP is to provide a one-stop resource for functional predictions and annotations for human nonsynonymous single-nucleotide variants (nsSNVs) and splice-site variants (ssSNVs), and to facilitate the steps of filtering and prioritizing SNVs from a large list of SNVs discovered in an exome-sequencing study. A list of all potential nsSNVs and ssSNVs based on the human reference sequence were created and functional predictions and annotations were curated and compiled for each SNV. Here, we report a recent major update of the database to version 3.0. The SNV list has been rebuilt based on GENCODE 22 and currently the database includes 82,832,027 nsSNVs and ssSNVs. An attached database dbscSNV, which compiled all potential human SNVs within splicing consensus regions and their deleteriousness predictions, add another 15,030,459 potentially functional SNVs. Eleven prediction scores (MetaSVM, MetaLR, CADD, VEST3, PROVEAN, 4× fitCons, fathmm-MKL, and DANN) and allele frequencies from the UK10K cohorts and the Exome Aggregation Consortium (ExAC), among others, have been added. The original seven prediction scores in v2.0 (SIFT, 2× Polyphen2, LRT, MutationTaster, MutationAssessor, and FATHMM) as well as many SNV and gene functional annotations have been updated. dbNSFP v3.0 is freely available at http://sites.google.com/site/jpopgen/dbNSFP.

  15. Improving genome annotation of enterotoxigenic Escherichia coli TW10598 by a label-free quantitative MS/MS approach.

    PubMed

    Pettersen, Veronika Kuchařová; Steinsland, Hans; Wiker, Harald G

    2015-11-01

    The most commonly used genome annotation processes are to a great extent based on computational methods. However, those can only predict genes that have been described earlier or that have sequence signatures indicative of a gene function. Here, we report a synonymous proteogenomic approach for experimentally improving microbial genome annotation based on label-free quantitative MS/MS. The approach is exemplified by analysis of cell extracts from in vitro cultured enterotoxigenic Escherichia coli (ETEC) strain TW10598, as part of an effort to create a new reference ETEC genome sequence. The proteomic analysis yielded identification of 2060 proteins, out of which 312 proteins were originally described as hypothetical. For 84% of the identified proteins we have provided description of their relative quantitative levels, among others, for 20 abundantly expressed ETEC virulence factors. Proteogenomic mapping supported the existence of four protein-coding genes that had not been annotated, and led to correction of translation start positions of another nine. The addition of the proteomic analysis into TW10598 genome re-annotation project improved quality of the annotation, and provided experimental evidence for a significant portion of ETEC expressed proteome. Data are available via ProteomeXchange with identifier PXD002473 (http://proteomecentral.proteomexchange.org/dataset/PXD002473).

  16. Re-annotation, improved large-scale assembly and establishment of a catalogue of noncoding loci for the genome of the model brown alga Ectocarpus.

    PubMed

    Cormier, Alexandre; Avia, Komlan; Sterck, Lieven; Derrien, Thomas; Wucher, Valentin; Andres, Gwendoline; Monsoor, Misharl; Godfroy, Olivier; Lipinska, Agnieszka; Perrineau, Marie-Mathilde; Van De Peer, Yves; Hitte, Christophe; Corre, Erwan; Coelho, Susana M; Cock, J Mark

    2017-04-01

    The genome of the filamentous brown alga Ectocarpus was the first to be completely sequenced from within the brown algal group and has served as a key reference genome both for this lineage and for the stramenopiles. We present a complete structural and functional reannotation of the Ectocarpus genome. The large-scale assembly of the Ectocarpus genome was significantly improved and genome-wide gene re-annotation using extensive RNA-seq data improved the structure of 11 108 existing protein-coding genes and added 2030 new loci. A genome-wide analysis of splicing isoforms identified an average of 1.6 transcripts per locus. A large number of previously undescribed noncoding genes were identified and annotated, including 717 loci that produce long noncoding RNAs. Conservation of lncRNAs between Ectocarpus and another brown alga, the kelp Saccharina japonica, suggests that at least a proportion of these loci serve a function. Finally, a large collection of single nucleotide polymorphism-based markers was developed for genetic analyses. These resources are available through an updated and improved genome database. This study significantly improves the utility of the Ectocarpus genome as a high-quality reference for the study of many important aspects of brown algal biology and as a reference for genomic analyses across the stramenopiles. © 2016 The Authors. New Phytologist © 2016 New Phytologist Trust.

  17. A semantic analysis of the annotations of the human genome

    PubMed Central

    Khatri, Purvesh; Done, Bogdan; Rao, Archana; Done, Arina

    2008-01-01

    The correct interpretation of any biological experiment depends in an essential way on the accuracy and consistency of the existing annotation databases. Such databases are ubiquitous and used by all life scientists in most experiments. However, it is well known that such databases are incomplete and many annotations may also be incorrect. In this paper we describe a technique that can be used to analyze the semantic content of such annotation databases. Our approach is able to extract implicit semantic relationships between genes and functions. This ability allows us to discover novel functions for known genes. This approach is able to identify missing and inaccurate annotations in existing annotation databases, and thus help improve their accuracy. We used our technique to analyze the current annotations of the human genome. From this body of annotations, we were able to predict 212 additional gene–function assignments. A subsequent literature search found that 138 of these gene–functions assignments are supported by existing peer-reviewed papers. An additional 23 assignments have been confirmed in the meantime by the addition of the respective annotations in later releases of the Gene Ontology database. Overall, the 161 confirmed assignments represent 75.95% of the proposed gene–function assignments. Only one of our predictions (0.4%) was contradicted by the existing literature. We could not find any relevant articles for 50 of our predictions (23.58%). The method is independent of the organism and can be used to analyze and improve the quality of the data of any public or private annotation database. Availability http://vortex.cs.wayne.edu/papers/semantic_analysis_bioinfo.pdf Contact sod@cs.wayne.edu PMID:15955782

  18. Perceived Usefulness of a Strategy-Based Peer Annotation System for Improving Academic Reading Comprehension

    ERIC Educational Resources Information Center

    Chen, I-Jung; Chen, Wen-Chun

    2016-01-01

    This study examines the enhancing effect of peer annotation on the academic English reading of nonnative-Englishspeaking graduate students. To facilitate peer collaboration, the present study included the development of a strategybased online reading system. Through peer annotation, the students not only achieved enhanced reading comprehension but…

  19. Perceived Usefulness of a Strategy-Based Peer Annotation System for Improving Academic Reading Comprehension

    ERIC Educational Resources Information Center

    Chen, I-Jung; Chen, Wen-Chun

    2016-01-01

    This study examines the enhancing effect of peer annotation on the academic English reading of nonnative-Englishspeaking graduate students. To facilitate peer collaboration, the present study included the development of a strategybased online reading system. Through peer annotation, the students not only achieved enhanced reading comprehension but…

  20. Bacterial genome annotation.

    PubMed

    Beckloff, Nicholas; Starkenburg, Shawn; Freitas, Tracey; Chain, Patrick

    2012-01-01

    Annotation of prokaryotic sequences can be separated into structural and functional annotation. Structural annotation is dependent on algorithmic interrogation of experimental evidence to discover the physical characteristics of a gene. This is done in an effort to construct accurate gene models, so understanding function or evolution of genes among organisms is not impeded. Functional annotation is dependent on sequence similarity to other known genes or proteins in an effort to assess the function of the gene. Combining structural and functional annotation across genomes in a comparative manner promotes higher levels of accurate annotation as well as an advanced understanding of genome evolution. As the availability of bacterial sequences increases and annotation methods improve, the value of comparative annotation will increase.

  1. Improved genome annotation through untargeted detection of pathway-specific metabolites

    PubMed Central

    2011-01-01

    Background Mass spectrometry-based metabolomics analyses have the potential to complement sequence-based methods of genome annotation, but only if raw mass spectral data can be linked to specific metabolic pathways. In untargeted metabolomics, the measured mass of a detected compound is used to define the location of the compound in chemical space, but uncertainties in mass measurements lead to "degeneracies" in chemical space since multiple chemical formulae correspond to the same measured mass. We compare two methods to eliminate these degeneracies. One method relies on natural isotopic abundances, and the other relies on the use of stable-isotope labeling (SIL) to directly determine C and N atom counts. Both depend on combinatorial explorations of the "chemical space" comprised of all possible chemical formulae comprised of biologically relevant chemical elements. Results Of 1532 metabolic pathways curated in the MetaCyc database, 412 contain a metabolite having a chemical formula unique to that metabolic pathway. Thus, chemical formulae alone can suffice to infer the presence of some metabolic pathways. Of 248,928 unique chemical formulae selected from the PubChem database, more than 95% had at least one degeneracy on the basis of accurate mass information alone. Consideration of natural isotopic abundance reduced degeneracy to 64%, but mainly for formulae less than 500 Da in molecular weight, and only if the error in the relative isotopic peak intensity was less than 10%. Knowledge of exact C and N atom counts as determined by SIL enabled reduced degeneracy, allowing for determination of unique chemical formula for 55% of the PubChem formulae. Conclusions To facilitate the assignment of chemical formulae to unknown mass-spectral features, profiling can be performed on cultures uniformly labeled with stable isotopes of nitrogen (15N) or carbon (13C). This makes it possible to accurately count the number of carbon and nitrogen atoms in each molecule, providing a

  2. Correction of the Caulobacter crescentus NA1000 genome annotation.

    PubMed

    Ely, Bert; Scott, LaTia Etheredge

    2014-01-01

    Bacterial genome annotations are accumulating rapidly in the GenBank database and the use of automated annotation technologies to create these annotations has become the norm. However, these automated methods commonly result in a small, but significant percentage of genome annotation errors. To improve accuracy and reliability, we analyzed the Caulobacter crescentus NA1000 genome utilizing computer programs Artemis and MICheck to manually examine the third codon position GC content, alignment to a third codon position GC frame plot peak, and matches in the GenBank database. We identified 11 new genes, modified the start site of 113 genes, and changed the reading frame of 38 genes that had been incorrectly annotated. Furthermore, our manual method of identifying protein-coding genes allowed us to remove 112 non-coding regions that had been designated as coding regions. The improved NA1000 genome annotation resulted in a reduction in the use of rare codons since noncoding regions with atypical codon usage were removed from the annotation and 49 new coding regions were added to the annotation. Thus, a more accurate codon usage table was generated as well. These results demonstrate that a comparison of the location of peaks third codon position GC content to the location of protein coding regions could be used to verify the annotation of any genome that has a GC content that is greater than 60%.

  3. Development of the Community Health Improvement Navigator Database of Interventions

    PubMed Central

    Roy, Brita; Stanojevich, Joel; Stange, Paul; Jiwani, Nafisa; King, Raymond; Koo, Denise

    2016-01-01

    Summary With the passage of the Patient Protection and Affordable Care Act, the requirements for hospitals to achieve tax-exempt status include performing a triennial community health needs assessment and developing a plan to address identified needs. To address community health needs, multisector collaborative efforts to improve both health care and non–health care determinants of health outcomes have been the most effective and sustainable. In 2015, CDC released the Community Health Improvement Navigator to facilitate the development of these efforts. This report describes the development of the database of interventions included in the Community Health Improvement Navigator. The database of interventions allows the user to easily search for multisector, collaborative, evidence-based interventions to address the underlying causes of the greatest morbidity and mortality in the United States: tobacco use and exposure, physical inactivity, unhealthy diet, high cholesterol, high blood pressure, diabetes, and obesity. PMID:26917110

  4. Expression profiling of hypothetical genes in Desulfovibrio vulgaris leads to improved functional annotation

    SciTech Connect

    Elias, Dwayne A.; Mukhopadhyay, Aindrila; Joachimiak, Marcin P.; Drury, Elliott C.; Redding, Alyssa M.; Yen, Huei-Che B.; Fields, Matthew W.; Hazen, Terry C.; Arkin, Adam P.; Keasling, Jay D.; Wall, Judy D.

    2008-10-27

    Hypothetical and conserved hypothetical genes account for>30percent of sequenced bacterial genomes. For the sulfate-reducing bacterium Desulfovibrio vulgaris Hildenborough, 347 of the 3634 genes were annotated as conserved hypothetical (9.5percent) along with 887 hypothetical genes (24.4percent). Given the large fraction of the genome, it is plausible that some of these genes serve critical cellular roles. The study goals were to determine which genes were expressed and provide a more functionally based annotation. To accomplish this, expression profiles of 1234 hypothetical and conserved genes were used from transcriptomic datasets of 11 environmental stresses, complemented with shotgun LC-MS/MS and AMT tag proteomic data. Genes were divided into putatively polycistronic operons and those predicted to be monocistronic, then classified by basal expression levels and grouped according to changes in expression for one or multiple stresses. 1212 of these genes were transcribed with 786 producing detectable proteins. There was no evidence for expression of 17 predicted genes. Except for the latter, monocistronic gene annotation was expanded using the above criteria along with matching Clusters of Orthologous Groups. Polycistronic genes were annotated in the same manner with inferences from their proximity to more confidently annotated genes. Two targeted deletion mutants were used as test cases to determine the relevance of the inferred functional annotations.

  5. Expression profiling of hypothetical genes in Desulfovibrio vulgaris leads to improved functional annotation

    PubMed Central

    Elias, Dwayne A.; Mukhopadhyay, Aindrila; Joachimiak, Marcin P.; Drury, Elliott C.; Redding, Alyssa M.; Yen, Huei-Che B.; Fields, Matthew W.; Hazen, Terry C.; Arkin, Adam P.; Keasling, Jay D.; Wall, Judy D.

    2009-01-01

    Hypothetical (HyP) and conserved HyP genes account for >30% of sequenced bacterial genomes. For the sulfate-reducing bacterium Desulfovibrio vulgaris Hildenborough, 347 of the 3634 genes were annotated as conserved HyP (9.5%) along with 887 HyP genes (24.4%). Given the large fraction of the genome, it is plausible that some of these genes serve critical cellular roles. The study goals were to determine which genes were expressed and provide a more functionally based annotation. To accomplish this, expression profiles of 1234 HyP and conserved genes were used from transcriptomic datasets of 11 environmental stresses, complemented with shotgun LC–MS/MS and AMT tag proteomic data. Genes were divided into putatively polycistronic operons and those predicted to be monocistronic, then classified by basal expression levels and grouped according to changes in expression for one or multiple stresses. One thousand two hundred and twelve of these genes were transcribed with 786 producing detectable proteins. There was no evidence for expression of 17 predicted genes. Except for the latter, monocistronic gene annotation was expanded using the above criteria along with matching Clusters of Orthologous Groups. Polycistronic genes were annotated in the same manner with inferences from their proximity to more confidently annotated genes. Two targeted deletion mutants were used as test cases to determine the relevance of the inferred functional annotations. PMID:19293273

  6. Dynamic multimedia annotation tool

    NASA Astrophysics Data System (ADS)

    Pfund, Thomas; Marchand-Maillet, Stephane

    2001-12-01

    Annotating image collections is crucial for different multimedia applications. Not only this provides an alternative access to visual information but it is a critical step to perform the evaluation of content-based image retrieval systems. Annotation is a tedious task so that there is a real need for developing tools that lighten the work of annotators. The tool should be flexible and offer customization so as to make the annotator the most comfortable. It should also automate the most tasks as possible. In this paper, we present a still image annotation tool that has been developed with the aim of being flexible and adaptive. The principle is to create a set of dynamic web pages that are an interface to a SQL database. The keyword set is fixed and every image receives from concurrent annotators a set of keywords along with time stamps and annotator Ids. Each annotator has the possibility of going back and forth within the collection and its previous annotations. He is helped by a number of search services and customization options. An administrative section allows the supervisor to control the parameter of the annotation, including the keyword set, given via an XML structure. The architecture of the tool is made flexible so as to accommodate further options through its development.

  7. Solving the Problem: Genome Annotation Standards before the Data Deluge

    PubMed Central

    Klimke, William; O'Donovan, Claire; White, Owen; Brister, J. Rodney; Clark, Karen; Fedorov, Boris; Mizrachi, Ilene; Pruitt, Kim D.; Tatusova, Tatiana

    2011-01-01

    The promise of genome sequencing was that the vast undiscovered country would be mapped out by comparison of the multitude of sequences available and would aid researchers in deciphering the role of each gene in every organism. Researchers recognize that there is a need for high quality data. However, different annotation procedures, numerous databases, and a diminishing percentage of experimentally determined gene functions have resulted in a spectrum of annotation quality. NCBI in collaboration with sequencing centers, archival databases, and researchers, has developed the first international annotation standards, a fundamental step in ensuring that high quality complete prokaryotic genomes are available as gold standard references. Highlights include the development of annotation assessment tools, community acceptance of protein naming standards, comparison of annotation resources to provide consistent annotation, and improved tracking of the evidence used to generate a particular annotation. The development of a set of minimal standards, including the requirement for annotated complete prokaryotic genomes to contain a full set of ribosomal RNAs, transfer RNAs, and proteins encoding core conserved functions, is an historic milestone. The use of these standards in existing genomes and future submissions will increase the quality of databases, enabling researchers to make accurate biological discoveries. PMID:22180819

  8. A Coding System with Independent Annotations of Gesture Forms and Functions during Verbal Communication: Development of a Database of Speech and GEsture (DoSaGE).

    PubMed

    Kong, Anthony Pak-Hin; Law, Sam-Po; Kwan, Connie Ching-Yin; Lai, Christy; Lam, Vivian

    2015-03-01

    Gestures are commonly used together with spoken language in human communication. One major limitation of gesture investigations in the existing literature lies in the fact that the coding of forms and functions of gestures has not been clearly differentiated. This paper first described a recently developed Database of Speech and GEsture (DoSaGE) based on independent annotation of gesture forms and functions among 119 neurologically unimpaired right-handed native speakers of Cantonese (divided into three age and two education levels), and presented findings of an investigation examining how gesture use was related to age and linguistic performance. Consideration of these two factors, for which normative data are currently very limited or lacking in the literature, is relevant and necessary when one evaluates gesture employment among individuals with and without language impairment. Three speech tasks, including monologue of a personally important event, sequential description, and story-telling, were used for elicitation. The EUDICO Linguistic ANnotator (ELAN) software was used to independently annotate each participant's linguistic information of the transcript, forms of gestures used, and the function for each gesture. About one-third of the subjects did not use any co-verbal gestures. While the majority of gestures were non-content-carrying, which functioned mainly for reinforcing speech intonation or controlling speech flow, the content-carrying ones were used to enhance speech content. Furthermore, individuals who are younger or linguistically more proficient tended to use fewer gestures, suggesting that normal speakers gesture differently as a function of age and linguistic performance.

  9. A Coding System with Independent Annotations of Gesture Forms and Functions during Verbal Communication: Development of a Database of Speech and GEsture (DoSaGE)

    PubMed Central

    Kong, Anthony Pak-Hin; Law, Sam-Po; Kwan, Connie Ching-Yin; Lai, Christy; Lam, Vivian

    2014-01-01

    Gestures are commonly used together with spoken language in human communication. One major limitation of gesture investigations in the existing literature lies in the fact that the coding of forms and functions of gestures has not been clearly differentiated. This paper first described a recently developed Database of Speech and GEsture (DoSaGE) based on independent annotation of gesture forms and functions among 119 neurologically unimpaired right-handed native speakers of Cantonese (divided into three age and two education levels), and presented findings of an investigation examining how gesture use was related to age and linguistic performance. Consideration of these two factors, for which normative data are currently very limited or lacking in the literature, is relevant and necessary when one evaluates gesture employment among individuals with and without language impairment. Three speech tasks, including monologue of a personally important event, sequential description, and story-telling, were used for elicitation. The EUDICO Linguistic ANnotator (ELAN) software was used to independently annotate each participant’s linguistic information of the transcript, forms of gestures used, and the function for each gesture. About one-third of the subjects did not use any co-verbal gestures. While the majority of gestures were non-content-carrying, which functioned mainly for reinforcing speech intonation or controlling speech flow, the content-carrying ones were used to enhance speech content. Furthermore, individuals who are younger or linguistically more proficient tended to use fewer gestures, suggesting that normal speakers gesture differently as a function of age and linguistic performance. PMID:25667563

  10. Comparing the Effectiveness of Alerts and Dynamically Annotated Visualizations (DAVs) in Improving Clinical Decision Making.

    PubMed

    Rayo, Michael F; Kowalczyk, Nina; Liston, Beth W; Sanders, Elizabeth B-N; White, Susan; Patterson, Emily S

    2015-09-01

    The aim of this study was to compare the effectiveness of two types of real-time decision support, an interrupting pop-up alert and a noninterrupting dynamically annotated visualization (DAV), in reducing clinically inappropriate diagnostic imaging orders. Alerts in electronic health record software are frequently disregarded due to high false-alarm rates, interruptions, and uncertainty about what triggered the alert. In other settings, providing visualizations and improving understandability of the guidance has been shown to improve overall decision making. Using a between-subject design, we examined the effect of two forms of decision support, alerts and DAVs, on reducing the proportion of inappropriate diagnostic imaging orders for 11 patients in a simulated environment. Nine attending and 11 resident physicians with experience using an electronic health record were randomly assigned to the form of decision support. Secondary measures were self-reported understandability, algorithm transparency, and clinical relevance. Fewer inappropriate diagnostic imaging tests were ordered with DAVs than with alerts (18% vs. 34%, p < .001). The DAV was rated higher for all three secondary measures (p < .001) for all participants. DAVs were more effective than alerts in reducing inappropriate imaging orders and were preferred for all patient scenarios, especially scenarios where guidance was ambiguous or based on inaccurate information. Creating visualizations that are permanently displayed and vary in the strength of their guidance can mitigate the risk of system performance degradation due to incomplete or incorrect data. This interaction paradigm may be applicable for other settings with high false-alarm rates or where there is a need to reduce interruptions during decision making. © 2015, Human Factors and Ergonomics Society.

  11. Evaluating Computational Gene Ontology Annotations.

    PubMed

    Škunca, Nives; Roberts, Richard J; Steffen, Martin

    2017-01-01

    Two avenues to understanding gene function are complementary and often overlapping: experimental work and computational prediction. While experimental annotation generally produces high-quality annotations, it is low throughput. Conversely, computational annotations have broad coverage, but the quality of annotations may be variable, and therefore evaluating the quality of computational annotations is a critical concern.In this chapter, we provide an overview of strategies to evaluate the quality of computational annotations. First, we discuss why evaluating quality in this setting is not trivial. We highlight the various issues that threaten to bias the evaluation of computational annotations, most of which stem from the incompleteness of biological databases. Second, we discuss solutions that address these issues, for example, targeted selection of new experimental annotations and leveraging the existing experimental annotations.

  12. Towards computational improvement of DNA database indexing and short DNA query searching

    PubMed Central

    Stojanov, Done; Koceski, Sašo; Mileva, Aleksandra; Koceska, Nataša; Bande, Cveta Martinovska

    2014-01-01

    In order to facilitate and speed up the search of massive DNA databases, the database is indexed at the beginning, employing a mapping function. By searching through the indexed data structure, exact query hits can be identified. If the database is searched against an annotated DNA query, such as a known promoter consensus sequence, then the starting locations and the number of potential genes can be determined. This is particularly relevant if unannotated DNA sequences have to be functionally annotated. However, indexing a massive DNA database and searching an indexed data structure with millions of entries is a time-demanding process. In this paper, we propose a fast DNA database indexing and searching approach, identifying all query hits in the database, without having to examine all entries in the indexed data structure, limiting the maximum length of a query that can be searched against the database. By applying the proposed indexing equation, the whole human genome could be indexed in 10 hours on a personal computer, under the assumption that there is enough RAM to store the indexed data structure. Analysing the methodology proposed by Reneker, we observed that hits at starting positions are not reported, if the database is searched against a query shorter than nucleotides, such that is the length of the DNA database words being mapped and is the length of the query. A solution of this drawback is also presented. PMID:26019584

  13. Towards computational improvement of DNA database indexing and short DNA query searching.

    PubMed

    Stojanov, Done; Koceski, Sašo; Mileva, Aleksandra; Koceska, Nataša; Bande, Cveta Martinovska

    2014-09-03

    In order to facilitate and speed up the search of massive DNA databases, the database is indexed at the beginning, employing a mapping function. By searching through the indexed data structure, exact query hits can be identified. If the database is searched against an annotated DNA query, such as a known promoter consensus sequence, then the starting locations and the number of potential genes can be determined. This is particularly relevant if unannotated DNA sequences have to be functionally annotated. However, indexing a massive DNA database and searching an indexed data structure with millions of entries is a time-demanding process. In this paper, we propose a fast DNA database indexing and searching approach, identifying all query hits in the database, without having to examine all entries in the indexed data structure, limiting the maximum length of a query that can be searched against the database. By applying the proposed indexing equation, the whole human genome could be indexed in 10 hours on a personal computer, under the assumption that there is enough RAM to store the indexed data structure. Analysing the methodology proposed by Reneker, we observed that hits at starting positions [Formula: see text] are not reported, if the database is searched against a query shorter than [Formula: see text] nucleotides, such that [Formula: see text] is the length of the DNA database words being mapped and [Formula: see text] is the length of the query. A solution of this drawback is also presented.

  14. AraPPISite: a database of fine-grained protein-protein interaction site annotations for Arabidopsis thaliana.

    PubMed

    Li, Hong; Yang, Shiping; Wang, Chuan; Zhou, Yuan; Zhang, Ziding

    2016-09-01

    Knowledge about protein interaction sites provides detailed information of protein-protein interactions (PPIs). To date, nearly 20,000 of PPIs from Arabidopsis thaliana have been identified. Nevertheless, the interaction site information has been largely missed by previously published PPI databases. Here, AraPPISite, a database that presents fine-grained interaction details for A. thaliana PPIs is established. First, the experimentally determined 3D structures of 27 A. thaliana PPIs are collected from the Protein Data Bank database and the predicted 3D structures of 3023 A. thaliana PPIs are modeled by using two well-established template-based docking methods. For each experimental/predicted complex structure, AraPPISite not only provides an interactive user interface for browsing interaction sites, but also lists detailed evolutionary and physicochemical properties of these sites. Second, AraPPISite assigns domain-domain interactions or domain-motif interactions to 4286 PPIs whose 3D structures cannot be modeled. In this case, users can easily query protein interaction regions at the sequence level. AraPPISite is a free and user-friendly database, which does not require user registration or any configuration on local machines. We anticipate AraPPISite can serve as a helpful database resource for the users with less experience in structural biology or protein bioinformatics to probe the details of PPIs, and thus accelerate the studies of plant genetics and functional genomics. AraPPISite is available at http://systbio.cau.edu.cn/arappisite/index.html .

  15. Expression of cinnamyl alcohol dehydrogenases and their putative homologues during Arabidopsis thaliana growth and development: lessons for database annotations?

    PubMed

    Kim, Sung-Jin; Kim, Kye-Won; Cho, Man-Ho; Franceschi, Vincent R; Davin, Laurence B; Lewis, Norman G

    2007-07-01

    A major goal currently in Arabidopsis research is determination of the (biochemical) function of each of its approximately 27,000 genes. To date, however, 12% of its genes actually have known biochemical roles. In this study, we considered it instructive to identify the gene expression patterns of nine (so-called AtCAD1-9) of 17 genes originally annotated by The Arabidopsis Information Resource (TAIR) as cinnamyl alcohol dehydrogenase (CAD, EC 1.1.1.195) homologues [see Costa, M.A., Collins, R.E., Anterola, A.M., Cochrane, F.C., Davin, L.B., Lewis N.G., 2003. An in silico assessment of gene function and organization of the phenylpropanoid pathway metabolic networks in Arabidopsis thaliana and limitations thereof. Phytochemistry 64, 1097-1112.]. In agreement with our biochemical studies in vitro [Kim, S.-J., Kim, M.-R., Bedgar, D.L., Moinuddin, S.G.A., Cardenas, C.L., Davin, L.B., Kang, C.-H., Lewis, N.G., 2004. Functional reclassification of the putative cinnamyl alcohol dehydrogenase multigene family in Arabidopsis. Proc. Natl. Acad. Sci. USA 101, 1455-1460.], and analysis of a double mutant [Sibout, R., Eudes, A., Mouille, G., Pollet, B., Lapierre, C., Jouanin, L., Séguin A., 2005. Cinnamyl Alcohol Dehydrogenase-C and -D are the primary genes involved in lignin biosynthesis in the floral stem of Arabidopsis. Plant Cell 17, 2059-2076.], both AtCAD5 (At4g34230) and AtCAD4 (At3g19450) were found to have expression patterns consistent with development/formation of different forms of the lignified vascular apparatus, e.g. lignifying stem tissues, bases of trichomes, hydathodes, abscission zones of siliques, etc. Expression was also observed in various non-lignifying zones (e.g. root caps) indicative of, perhaps, a role in plant defense. In addition, expression patterns of the four CAD-like homologues were investigated, i.e. AtCAD2 (At2g21730), AtCAD3 (At2g21890), AtCAD7 (At4g37980) and AtCAD8 (At4g37990), each of which previously had been demonstrated to have low CAD

  16. The Plant Ontology Database: A community resource for plant structure and developmental stages controlled vocabulary and annotations

    USDA-ARS?s Scientific Manuscript database

    The Plant Ontology Consortium (POC, http://www.plantontology.org) is a collaborative effort among model plant genome databases and plant researchers that aim to create, maintain, and facilitate the use of a controlled vocabulary(ontology) for plants. The ontology allows users to ascribe attributes o...

  17. The Plant Ontology Database: A Community Resource for Plant Structure and Developmental Stages Controlled Vocabulary and Annotations

    USDA-ARS?s Scientific Manuscript database

    The Plant Ontology Consortium (POC, http://www.plantontology.org) is a collaborative effort among model plant genome databases and plant researchers that aims to create, maintain and facilitate the use of a controlled vocabulary(ontology) for plants. The ontology allows users to ascribe attributes o...

  18. ERIC Digests: An Annotated Bibliography of All ERIC Digests Announced in the ERIC Database through March 1993.

    ERIC Educational Resources Information Center

    Pugh, Elizabeth, Comp.; Brandhorst, Ted, Comp.

    Digests prepared by clearinghouses of the Educational Resources Information Center (ERIC) began to emerge seriously in 1984 as one of the most desirable and popular products of the various clearinghouses. Although ERIC Digests were not initially listed in the ERIC database because of their brevity, today they are entered routinely. This…

  19. CIPRO 2.5: Ciona intestinalis protein database, a unique integrated repository of large-scale omics data, bioinformatic analyses and curated annotation, with user rating and reviewing functionality.

    PubMed

    Endo, Toshinori; Ueno, Keisuke; Yonezawa, Kouki; Mineta, Katsuhiko; Hotta, Kohji; Satou, Yutaka; Yamada, Lixy; Ogasawara, Michio; Takahashi, Hiroki; Nakajima, Ayako; Nakachi, Mia; Nomura, Mamoru; Yaguchi, Junko; Sasakura, Yasunori; Yamasaki, Chisato; Sera, Miho; Yoshizawa, Akiyasu C; Imanishi, Tadashi; Taniguchi, Hisaaki; Inaba, Kazuo

    2011-01-01

    The Ciona intestinalis protein database (CIPRO) is an integrated protein database for the tunicate species C. intestinalis. The database is unique in two respects: first, because of its phylogenetic position, Ciona is suitable model for understanding vertebrate evolution; and second, the database includes original large-scale transcriptomic and proteomic data. Ciona intestinalis has also been a favorite of developmental biologists. Therefore, large amounts of data exist on its development and morphology, along with a recent genome sequence and gene expression data. The CIPRO database is aimed at collecting those published data as well as providing unique information from unpublished experimental data, such as 3D expression profiling, 2D-PAGE and mass spectrometry-based large-scale analyses at various developmental stages, curated annotation data and various bioinformatic data, to facilitate research in diverse areas, including developmental, comparative and evolutionary biology. For medical and evolutionary research, homologs in humans and major model organisms are intentionally included. The current database is based on a recently developed KH model containing 36,034 unique sequences, but for higher usability it covers 89,683 all known and predicted proteins from all gene models for this species. Of these sequences, more than 10,000 proteins have been manually annotated. Furthermore, to establish a community-supported protein database, these annotations are open to evaluation by users through the CIPRO website. CIPRO 2.5 is freely accessible at http://cipro.ibio.jp/2.5.

  20. Total Quality Management in Higher Education: From Assessment to Improvement. An Annotated Bibliography, Second Edition.

    ERIC Educational Resources Information Center

    Peterson, Marvin W.; And Others

    This publication is an annotated bibliography of articles and key organizational sources on Total Quality Management (TQM) in higher education. The list was developed through searches of local and national data bases including Education Resources Information Center (ERIC), Public Affairs Information System (PAIS), Wilson Indexes to Journal…

  1. Aviation and the Environment. A Selected, Annotated Bibliography Related to Aviation's Responses Toward Improving the Environment.

    ERIC Educational Resources Information Center

    Marshall, Jane

    This informal, brief bibliography attempts to stress the positive side of aviation, annotating documents that explain how the airlines, aircraft engine manufacturers, government agencies, military aviation, and general aviation are meeting their responsibilities in solving environmental problems. Topics arousing public concern are identified:…

  2. The Drosophila melanogaster PeptideAtlas facilitates the use of peptide data for improved fly proteomics and genome annotation

    PubMed Central

    Loevenich, Sandra N; Brunner, Erich; King, Nichole L; Deutsch, Eric W; Stein, Stephen E; Aebersold, Ruedi; Hafen, Ernst

    2009-01-01

    Background Crucial foundations of any quantitative systems biology experiment are correct genome and proteome annotations. Protein databases compiled from high quality empirical protein identifications that are in turn based on correct gene models increase the correctness, sensitivity, and quantitative accuracy of systems biology genome-scale experiments. Results In this manuscript, we present the Drosophila melanogaster PeptideAtlas, a fly proteomics and genomics resource of unsurpassed depth. Based on peptide mass spectrometry data collected in our laboratory the portal allows querying fly protein data observed with respect to gene model confirmation and splice site verification as well as for the identification of proteotypic peptides suited for targeted proteomics studies. Additionally, the database provides consensus mass spectra for observed peptides along with qualitative and quantitative information about the number of observations of a particular peptide and the sample(s) in which it was observed. Conclusion PeptideAtlas is an open access database for the Drosophila community that has several features and applications that support (1) reduction of the complexity inherently associated with performing targeted proteomic studies, (2) designing and accelerating shotgun proteomics experiments, (3) confirming or questioning gene models, and (4) adjusting gene models such that they are in line with observed Drosophila peptides. While the database consists of proteomic data it is not required that the user is a proteomics expert. PMID:19210778

  3. Hedgehog Signaling Pathway Database: a repository of current annotation efforts and resources for the Hh research community.

    PubMed

    Hervold, Kieran; Martin, Andrew; Kirkpatrick, Roger A; Mc Kenna, Paul F; Ramirez-Weber, F A

    2007-01-01

    The Hedgehog Signaling Pathway Database is a curated repository of information pertaining to the Hedgehog developmental pathway. It was designed to provide centralized access to a wide range of relevant information in an organism-agnostic manner. Data are provided for all genes and gene targets known to be involved in the Hh pathway across various organisms. The data provided include DNA and protein sequences as well as domain structure motifs. All known human diseases associated with the Hh pathway are indexed including experimental data on therapeutic agents and their molecular targets. Hh researchers will find useful information on relevant protocols, tissue cell lines and reagents used in current Hh research projects. Curated content is also provided for publications, grants and patents relating to the Hh pathway. The database can be accessed at http://www.hedgehog.sfsu.edu.

  4. The Plant Ontology Database: a community resource for plant structure and developmental stages controlled vocabulary and annotations.

    PubMed

    Avraham, Shulamit; Tung, Chih-Wei; Ilic, Katica; Jaiswal, Pankaj; Kellogg, Elizabeth A; McCouch, Susan; Pujar, Anuradha; Reiser, Leonore; Rhee, Seung Y; Sachs, Martin M; Schaeffer, Mary; Stein, Lincoln; Stevens, Peter; Vincent, Leszek; Zapata, Felipe; Ware, Doreen

    2008-01-01

    The Plant Ontology Consortium (POC, http://www.plantontology.org) is a collaborative effort among model plant genome databases and plant researchers that aims to create, maintain and facilitate the use of a controlled vocabulary (ontology) for plants. The ontology allows users to ascribe attributes of plant structure (anatomy and morphology) and developmental stages to data types, such as genes and phenotypes, to provide a semantic framework to make meaningful cross-species and database comparisons. The POC builds upon groundbreaking work by the Gene Ontology Consortium (GOC) by adopting and extending the GOC's principles, existing software and database structure. Over the past year, POC has added hundreds of ontology terms to associate with thousands of genes and gene products from Arabidopsis, rice and maize, which are available through a newly updated web-based browser (http://www.plantontology.org/amigo/go.cgi) for viewing, searching and querying. The Consortium has also implemented new functionalities to facilitate the application of PO in genomic research and updated the website to keep the contents current. In this report, we present a brief description of resources available from the website, changes to the interfaces, data updates, community activities and future enhancement.

  5. Improving knowledge management through the support of image examination and data annotation using DICOM structured reporting.

    PubMed

    Torres, José Salavert; Damian Segrelles Quilis, J; Espert, Ignacio Blanquer; García, Vicente Hernandez

    2012-12-01

    An important effort has been invested on improving the image diagnosis process in different medical areas using information technologies. The field of medical imaging involves two main data types: medical imaging and reports. Developments based on the DICOM standard have demonstrated to be a convenient and widespread solution among the medical community. The main objective of this work is to design a Web application prototype that will be able to improve diagnosis and follow-on of breast cancer patients. It is based on TRENCADIS middleware, which provides a knowledge-oriented storage model composed by federated repositories of DICOM image studies and DICOM-SR medical reports. The full structure and contents of the diagnosis reports are used as metadata for indexing images. The TRENCADIS infrastructure takes full advantage of Grid technologies by deploying multi-resource grid services that enable multiple views (reports schemes) of the knowledge database. The paper presents a real deployment of such Web application prototype in the Dr. Peset Hospital providing radiologists with a tool to create, store and search diagnostic reports based on breast cancer explorations (mammography, magnetic resonance, ultrasound, pre-surgery biopsy and post-surgery biopsy), improving support for diagnostics decisions. A technical details for use cases (outlining enhanced multi-resource grid services communication and processing steps) and interactions between actors and the deployed prototype are described. As a result, information is more structured, the logic is clearer, network messages have been reduced and, in general, the system is more resistant to failures.

  6. Negatome 2.0: a database of non-interacting proteins derived by literature mining, manual annotation and protein structure analysis.

    PubMed

    Blohm, Philipp; Frishman, Goar; Smialowski, Pawel; Goebels, Florian; Wachinger, Benedikt; Ruepp, Andreas; Frishman, Dmitrij

    2014-01-01

    Knowledge about non-interacting proteins (NIPs) is important for training the algorithms to predict protein-protein interactions (PPIs) and for assessing the false positive rates of PPI detection efforts. We present the second version of Negatome, a database of proteins and protein domains that are unlikely to engage in physical interactions (available online at http://mips.helmholtz-muenchen.de/proj/ppi/negatome). Negatome is derived by manual curation of literature and by analyzing three-dimensional structures of protein complexes. The main methodological innovation in Negatome 2.0 is the utilization of an advanced text mining procedure to guide the manual annotation process. Potential non-interactions were identified by a modified version of Excerbt, a text mining tool based on semantic sentence analysis. Manual verification shows that nearly a half of the text mining results with the highest confidence values correspond to NIP pairs. Compared to the first version the contents of the database have grown by over 300%.

  7. Algal functional annotation tool

    SciTech Connect

    2012-07-12

    Abstract BACKGROUND: Progress in genome sequencing is proceeding at an exponential pace, and several new algal genomes are becoming available every year. One of the challenges facing the community is the association of protein sequences encoded in the genomes with biological function. While most genome assembly projects generate annotations for predicted protein sequences, they are usually limited and integrate functional terms from a limited number of databases. Another challenge is the use of annotations to interpret large lists of 'interesting' genes generated by genome-scale datasets. Previously, these gene lists had to be analyzed across several independent biological databases, often on a gene-by-gene basis. In contrast, several annotation databases, such as DAVID, integrate data from multiple functional databases and reveal underlying biological themes of large gene lists. While several such databases have been constructed for animals, none is currently available for the study of algae. Due to renewed interest in algae as potential sources of biofuels and the emergence of multiple algal genome sequences, a significant need has arisen for such a database to process the growing compendiums of algal genomic data. DESCRIPTION: The Algal Functional Annotation Tool is a web-based comprehensive analysis suite integrating annotation data from several pathway, ontology, and protein family databases. The current version provides annotation for the model alga Chlamydomonas reinhardtii, and in the future will include additional genomes. The site allows users to interpret large gene lists by identifying associated functional terms, and their enrichment. Additionally, expression data for several experimental conditions were compiled and analyzed to provide an expression-based enrichment search. A tool to search for functionally-related genes based on gene expression across these conditions is also provided. Other features include dynamic visualization of genes on KEGG

  8. Managing the data deluge: data-driven GO category assignment improves while complexity of functional annotation increases.

    PubMed

    Gobeill, Julien; Pasche, Emilie; Vishnyakova, Dina; Ruch, Patrick

    2013-01-01

    The available curated data lag behind current biological knowledge contained in the literature. Text mining can assist biologists and curators to locate and access this knowledge, for instance by characterizing the functional profile of publications. Gene Ontology (GO) category assignment in free text already supports various applications, such as powering ontology-based search engines, finding curation-relevant articles (triage) or helping the curator to identify and encode functions. Popular text mining tools for GO classification are based on so called thesaurus-based--or dictionary-based--approaches, which exploit similarities between the input text and GO terms themselves. But their effectiveness remains limited owing to the complex nature of GO terms, which rarely occur in text. In contrast, machine learning approaches exploit similarities between the input text and already curated instances contained in a knowledge base to infer a functional profile. GO Annotations (GOA) and MEDLINE make possible to exploit a growing amount of curated abstracts (97 000 in November 2012) for populating this knowledge base. Our study compares a state-of-the-art thesaurus-based system with a machine learning system (based on a k-Nearest Neighbours algorithm) for the task of proposing a functional profile for unseen MEDLINE abstracts, and shows how resources and performances have evolved. Systems are evaluated on their ability to propose for a given abstract the GO terms (2.8 on average) used for curation in GOA. We show that since 2006, although a massive effort was put into adding synonyms in GO (+300%), our thesaurus-based system effectiveness is rather constant, reaching from 0.28 to 0.31 for Recall at 20 (R20). In contrast, thanks to its knowledge base growth, our machine learning system has steadily improved, reaching from 0.38 in 2006 to 0.56 for R20 in 2012. Integrated in semi-automatic workflows or in fully automatic pipelines, such systems are more and more efficient

  9. Search for 5'-leader regulatory RNA structures based on gene annotation aided by the RiboGap database.

    PubMed

    Naghdi, Mohammad Reza; Smail, Katia; Wang, Joy X; Wade, Fallou; Breaker, Ronald R; Perreault, Jonathan

    2017-03-15

    The discovery of noncoding RNAs (ncRNAs) and their importance for gene regulation led us to develop bioinformatics tools to pursue the discovery of novel ncRNAs. Finding ncRNAs de novo is challenging, first due to the difficulty of retrieving large numbers of sequences for given gene activities, and second due to exponential demands on calculation needed for comparative genomics on a large scale. Recently, several tools for the prediction of conserved RNA secondary structure were developed, but many of them are not designed to uncover new ncRNAs, or are too slow for conducting analyses on a large scale. Here we present various approaches using the database RiboGap as a primary tool for finding known ncRNAs and for uncovering simple sequence motifs with regulatory roles. This database also can be used to easily extract intergenic sequences of eubacteria and archaea to find conserved RNA structures upstream of given genes. We also show how to extend analysis further to choose the best candidate ncRNAs for experimental validation.

  10. DSSTOX STRUCTURE-SEARCHABLE PUBLIC TOXICITY DATABASE NETWORK: CURRENT PROGRESS AND NEW INITIATIVES TO IMPROVE CHEMO-BIOINFORMATICS CAPABILITIES

    EPA Science Inventory

    The EPA DSSTox website (http://www/epa.gov/nheerl/dsstox) publishes standardized, structure-annotated toxicity databases, covering a broad range of toxicity disciplines. Each DSSTox database features documentation written in collaboration with the source authors and toxicity expe...

  11. DSSTOX STRUCTURE-SEARCHABLE PUBLIC TOXICITY DATABASE NETWORK: CURRENT PROGRESS AND NEW INITIATIVES TO IMPROVE CHEMO-BIOINFORMATICS CAPABILITIES

    EPA Science Inventory

    The EPA DSSTox website (http://www/epa.gov/nheerl/dsstox) publishes standardized, structure-annotated toxicity databases, covering a broad range of toxicity disciplines. Each DSSTox database features documentation written in collaboration with the source authors and toxicity expe...

  12. Genic and Intergenic SSR Database Generation, SNPs Determination and Pathway Annotations, in Date Palm (Phoenix dactylifera L.)

    PubMed Central

    2016-01-01

    The present investigation was carried out aiming to use the bioinformatics tools in order to identify and characterize, simple sequence repeats within the third Version of the date palm genome and develop a new SSR primers database. In addition single nucleotide polymorphisms (SNPs) that are located within the SSR flanking regions were recognized. Moreover, the pathways for the sequences assigned by SSR primers, the biological functions and gene interaction were determined. A total of 172,075 SSR motifs was identified on date palm genome sequence with a frequency of 450.97 SSRs per Mb. Out of these, 130,014 SSRs (75.6%) were located within the intergenic regions with a frequency of 499 SSRs per Mb. While, only 42,061 SSRs (24.4%) were located within the genic regions with a frequency of 347.5 SSRs per Mb. A total of 111,403 of SSR primer pairs were designed, that represents 291.9 SSR primers per Mb. Out of the 111,403, only 31,380 SSR primers were in the genic regions, while 80,023 primers were in the intergenic regions. A number of 250,507 SNPs were recognized in 84,172 SSR flanking regions, which represents 75.55% of the total SSR flanking regions. Out of 12,274 genes only 463 genes comprising 896 SSR primers were mapped onto 111 pathways using KEGG data base. The most abundant enzymes were identified in the pathway related to the biosynthesis of antibiotics. We tested 1031 SSR primers using both publicly available date palm genome sequences as templates in the in silico PCR reactions. Concerning in vitro validation, 31 SSR primers among those used in the in silico PCR were synthesized and tested for their ability to detect polymorphism among six Egyptian date palm cultivars. All tested primers have successfully amplified products, but only 18 primers detected polymorphic amplicons among the studied date palm cultivars. PMID:27434138

  13. U.S. EPA computational toxicology programs: Central role of chemical-annotation efforts and molecular databases

    EPA Science Inventory

    EPA’s National Center for Computational Toxicology is engaged in high-profile research efforts to improve the ability to more efficiently and effectively prioritize and screen thousands of environmental chemicals for potential toxicity. A central component of these efforts invol...

  14. U.S. EPA computational toxicology programs: Central role of chemical-annotation efforts and molecular databases

    EPA Science Inventory

    EPA’s National Center for Computational Toxicology is engaged in high-profile research efforts to improve the ability to more efficiently and effectively prioritize and screen thousands of environmental chemicals for potential toxicity. A central component of these efforts invol...

  15. Novel scripts for improved annotation and selection of variants from whole exome sequencing in cancer research.

    PubMed

    Hansen, Marcus Celik; Nederby, Line; Roug, Anne; Villesen, Palle; Kjeldsen, Eigil; Nyvold, Charlotte Guldborg; Hokland, Peter

    2015-01-01

    Sequencing the exome is quickly becoming the preferred method for discovering disease-inducing mutations. While obtaining data sets is a straightforward procedure, the subsequent analysis and interpretation of the data is a limiting step for clinical applications. Thus, while the initial mutation and variant calling can be performed by a bioinformatician or trained researcher, the output from robust packages such as MuTect and GATK is not directly informative for the general life scientists. In attempt to obviate this problem we have created complementary Wolfram scripts, which enable easy downstream annotation and selection, presented here in the perspective of hematological relevance. It also provides the researcher with the opportunity to extend the analysis by having a full-fledged programming and analysis environment of Mathematica at hand. In brief, post-processing is performed by: •Mapping of germ line and somatic variants to coding regions, and defining variant sets within Mathematica.•Processing of variants in variant effect predictor.•Extended annotation, relevance scoring and defining focus areas through the provided functions.

  16. Expression profiling of hypothetical genes in Desulfovibrio vulgaris leads to improved functional annotation

    SciTech Connect

    Elias, Dwayne A.; Mukhopadhyay, Aindrila; Joachimiak, Marcine P.; Drury, Elliott C.; Redding, Alyssa M.; Yen, Huei-Che B.; Fields, Matthew; Hazen, Terry C.; Arkin, Adam P.; Keasling, Jay D.; Wall, Judy D.

    2009-03-17

    Hypothetical (HyP) and conserved HyP genes account for >30% of sequenced bacterial genomes. For the sulfate-reducing bacterium Desulfovibrio vulgaris Hildenborough, 347 of the 3634 genes were annotated as conserved HyP (9.5%) along with 887 HyP genes (24.4%). Given the large fraction of the genome, it is plausible that some of these genes serve critical cellular roles. The study goals were to determine which genes were expressed and provide a more functionally based annotation. To accomplish this, expression profiles of 1234 HyP and conserved genes were used from transcriptomic datasets of 11 environmental stresses, complemented with shotgun LC–MS/MS and AMT tag proteomic data. Genes were divided into putatively polycistronic operons and those predicted to be monocistronic, then classified by basal expression levels and grouped according to changes in expression for one or multiple stresses. One thousand two hundred and twelve of these genes were transcribed with 786 producing detectable proteins. There was no evidence for expression of 17 predicted genes.

  17. Novel scripts for improved annotation and selection of variants from whole exome sequencing in cancer research

    PubMed Central

    Hansen, Marcus Celik; Nederby, Line; Roug, Anne; Villesen, Palle; Kjeldsen, Eigil; Nyvold, Charlotte Guldborg; Hokland, Peter

    2015-01-01

    Sequencing the exome is quickly becoming the preferred method for discovering disease-inducing mutations. While obtaining data sets is a straightforward procedure, the subsequent analysis and interpretation of the data is a limiting step for clinical applications. Thus, while the initial mutation and variant calling can be performed by a bioinformatician or trained researcher, the output from robust packages such as MuTect and GATK is not directly informative for the general life scientists. In attempt to obviate this problem we have created complementary Wolfram scripts, which enable easy downstream annotation and selection, presented here in the perspective of hematological relevance. It also provides the researcher with the opportunity to extend the analysis by having a full-fledged programming and analysis environment of Mathematica at hand. In brief, post-processing is performed by: • Mapping of germ line and somatic variants to coding regions, and defining variant sets within Mathematica. • Processing of variants in variant effect predictor. • Extended annotation, relevance scoring and defining focus areas through the provided functions. PMID:26150983

  18. Metabolomic database annotations via query of elemental compositions: Mass accuracy is insufficient even at less than 1 ppm

    PubMed Central

    Kind, Tobias; Fiehn, Oliver

    2006-01-01

    Background Metabolomic studies are targeted at identifying and quantifying all metabolites in a given biological context. Among the tools used for metabolomic research, mass spectrometry is one of the most powerful tools. However, metabolomics by mass spectrometry always reveals a high number of unknown compounds which complicate in depth mechanistic or biochemical understanding. In principle, mass spectrometry can be utilized within strategies of de novo structure elucidation of small molecules, starting with the computation of the elemental composition of an unknown metabolite using accurate masses with errors <5 ppm (parts per million). However even with very high mass accuracy (<1 ppm) many chemically possible formulae are obtained in higher mass regions. In automatic routines an additional orthogonal filter therefore needs to be applied in order to reduce the number of potential elemental compositions. This report demonstrates the necessity of isotope abundance information by mathematical confirmation of the concept. Results High mass accuracy (<1 ppm) alone is not enough to exclude enough candidates with complex elemental compositions (C, H, N, S, O, P, and potentially F, Cl, Br and Si). Use of isotopic abundance patterns as a single further constraint removes >95% of false candidates. This orthogonal filter can condense several thousand candidates down to only a small number of molecular formulas. Example calculations for 10, 5, 3, 1 and 0.1 ppm mass accuracy are given. Corresponding software scripts can be downloaded from . A comparison of eight chemical databases revealed that PubChem and the Dictionary of Natural Products can be recommended for automatic queries using molecular formulae. Conclusion More than 1.6 million molecular formulae in the range 0–500 Da were generated in an exhaustive manner under strict observation of mathematical and chemical rules. Assuming that ion species are fully resolved (either by chromatography or by high resolution

  19. Improving sequence variant descriptions in mutation databases and literature using the Mutalyzer sequence variation nomenclature checker.

    PubMed

    Wildeman, Martin; van Ophuizen, Ernest; den Dunnen, Johan T; Taschner, Peter E M

    2008-01-01

    Unambiguous and correct sequence variant descriptions are of utmost importance, not in the least since mistakes and uncertainties may lead to undesired errors in clinical diagnosis. We developed the Mutation Analyzer (Mutalyzer) sequence variation nomenclature checker (www.lovd.nl/mutalyzer; last accessed 13 September 2007) for automated analysis and correction of sequence variant descriptions using reference sequences from any organism. Mutalyzer handles most variation types: substitution, deletion, duplication, insertion, indel, and splice-site changes following current recommendations of the Human Genome Variation Society (HGVS). Input is a GenBank accession number or an uploaded reference sequence file in GenBank format with user-modified annotation, an HGNC gene symbol, and the variant (single or in a batch file). Mutalyzer generates variant descriptions at DNA level, the level of all annotated transcripts and the deduced outcome at protein level. To validate Mutalyzer's performance and to investigate the sequence variant description quality in locus-specific mutation databases (LSDBs), more than 11,000 variants in the PAH, BIC BRCA2, and HbVar databases were analyzed, showing that 87%, 25%, and 38%, respectively, were error-free and following the recommendations. Low recognition rates in BIC and HbVar (38% and 51%, respectively) were due to lack of a well-annotated genomic reference sequence (HbVar) or noncompliance to the guidelines (BRCA2). Provided with well-annotated genomic reference sequences, Mutalyzer is very effective for the curation of newly discovered sequence variation descriptions and existing LSDB data. Mutalyzer will be linked to the Leiden Open source Variation Database (LOVD) (www.LOVD.nl; last accessed 13 September 2007) and is the first module of a sequence variant effect prediction package. (c) 2007 Wiley-Liss, Inc.

  20. Comparative sequence analysis of Sordaria macrospora and Neurospora crassa as a means to improve genome annotation.

    PubMed

    Nowrousian, Minou; Würtz, Christian; Pöggeler, Stefanie; Kück, Ulrich

    2004-03-01

    One of the most challenging parts of large scale sequencing projects is the identification of functional elements encoded in a genome. Recently, studies of genomes of up to six different Saccharomyces species have demonstrated that a comparative analysis of genome sequences from closely related species is a powerful approach to identify open reading frames and other functional regions within genomes [Science 301 (2003) 71, Nature 423 (2003) 241]. Here, we present a comparison of selected sequences from Sordaria macrospora to their corresponding Neurospora crassa orthologous regions. Our analysis indicates that due to the high degree of sequence similarity and conservation of overall genomic organization, S. macrospora sequence information can be used to simplify the annotation of the N. crassa genome.

  1. Novelty Indicator for Enhanced Prioritization of Predicted Gene Ontology Annotations.

    PubMed

    Chicco, Davide; Palluzzi, Fernando; Masseroli, Marco

    2017-04-18

    Biomolecular controlled annotations have become pivotal in computational biology, because they allow scientists to analyze large amounts of biological data to better understand test results, and to infer new knowledge. Yet, biomolecular annotation databases are incomplete by definition, like our knowledge of biology, and might contain errors and inconsistent information. In this context, machine-learning algorithms able to predict and prioritize new annotations are both effective and efficient, especially if compared with time-consuming trials of biological validation. To limit the possibility that these techniques predict obvious and trivial high-level features, and to help prioritizing their results, we introduce a new element that can improve accuracy and relevance of the results of an annotation prediction and prioritization pipeline. We propose a novelty indicator able to state the level of "originality" of the annotations predicted for a specific gene to Gene Ontology (GO) terms. This indicator, joint with our previously introduced prediction steps, helps prioritizing the most novel interesting annotations predicted. We performed an accurate biological functional analysis of the prioritized annotations predicted with high accuracy by our indicator and previously proposed methods. The relevance of our biological findings proves effectiveness and trustworthiness of our indicator and of its prioritization of predicted annotations.

  2. The Ensembl gene annotation system

    PubMed Central

    Aken, Bronwen L.; Ayling, Sarah; Barrell, Daniel; Clarke, Laura; Curwen, Valery; Fairley, Susan; Fernandez Banet, Julio; Billis, Konstantinos; García Girón, Carlos; Hourlier, Thibaut; Howe, Kevin; Kähäri, Andreas; Kokocinski, Felix; Martin, Fergal J.; Murphy, Daniel N.; Nag, Rishi; Ruffier, Magali; Schuster, Michael; Tang, Y. Amy; Vogel, Jan-Hinnerk; White, Simon; Zadissa, Amonida; Flicek, Paul

    2016-01-01

    The Ensembl gene annotation system has been used to annotate over 70 different vertebrate species across a wide range of genome projects. Furthermore, it generates the automatic alignment-based annotation for the human and mouse GENCODE gene sets. The system is based on the alignment of biological sequences, including cDNAs, proteins and RNA-seq reads, to the target genome in order to construct candidate transcript models. Careful assessment and filtering of these candidate transcripts ultimately leads to the final gene set, which is made available on the Ensembl website. Here, we describe the annotation process in detail. Database URL: http://www.ensembl.org/index.html PMID:27337980

  3. Improvement of barley genome annotations by deciphering the Haruna Nijo genome.

    PubMed

    Sato, Kazuhiro; Tanaka, Tsuyoshi; Shigenobu, Shuji; Motoi, Yuka; Wu, Jianzhong; Itoh, Takeshi

    2016-02-01

    Full-length (FL) cDNA sequences provide the most reliable evidence for the presence of genes in genomes. In this report, detailed gene structures of barley, whole genome shotgun (WGS) and additional transcript data of the cultivar Haruna Nijo were quality controlled and compared with the published Morex genome information. Haruna Nijo scaffolds have longer total sequence length with much higher N50 and fewer sequences than those in Morex WGS contigs. The longer Haruna Nijo scaffolds provided efficient FLcDNA mapping, resulting in high coverage and detection of the transcription start sites. In combination with FLcDNAs and RNA-Seq data from four different tissue samples of Haruna Nijo, we identified 51,249 gene models on 30,606 loci. Overall sequence similarity between Haruna Nijo and Morex genome was 95.99%, while that of exon regions was higher (99.71%). These sequence and annotation data of Haruna Nijo are combined with Morex genome data and released from a genome browser. The genome sequence of Haruna Nijo may provide detailed gene structures in addition to the current Morex barley genome information.

  4. Re-annotation of the woodland strawberry (Fragaria vesca) genome.

    PubMed

    Darwish, Omar; Shahan, Rachel; Liu, Zhongchi; Slovin, Janet P; Alkharouf, Nadim W

    2015-01-27

    Fragaria vesca is a low-growing, small-fruited diploid strawberry species commonly called woodland strawberry. It is native to temperate regions of Eurasia and North America and while it produces edible fruits, it is most highly useful as an experimental perennial plant system that can serve as a model for the agriculturally important Rosaceae family. A draft of the F. vesca genome sequence was published in 2011 [Nat Genet 43:223,2011]. The first generation annotation (version 1.1) were developed using GeneMark-ES+[Nuc Acids Res 33:6494,2005]which is a self-training gene prediction tool that relies primarily on the combination of ab initio predictions with mapping high confidence ESTs in addition to mapping gene deserts from transposable elements. Based on over 25 different tissue transcriptomes, we have revised the F. vesca genome annotation, thereby providing several improvements over version 1.1. The new annotation, which was achieved using Maker, describes many more predicted protein coding genes compared to the GeneMark generated annotation that is currently hosted at the Genome Database for Rosaceae ( http://www.rosaceae.org/ ). Our new annotation also results in an increase in the overall total coding length, and the number of coding regions found. The total number of gene predictions that do not overlap with the previous annotations is 2286, most of which were found to be homologous to other plant genes. We have experimentally verified one of the new gene model predictions to validate our results. Using the RNA-Seq transcriptome sequences from 25 diverse tissue types, the re-annotation pipeline improved existing annotations by increasing the annotation accuracy based on extensive transcriptome data. It uncovered new genes, added exons to current genes, and extended or merged exons. This complete genome re-annotation will significantly benefit functional genomic studies of the strawberry and other members of the Rosaceae.

  5. HAMAP in 2015: updates to the protein family classification and annotation system

    PubMed Central

    Pedruzzi, Ivo; Rivoire, Catherine; Auchincloss, Andrea H.; Coudert, Elisabeth; Keller, Guillaume; de Castro, Edouard; Baratin, Delphine; Cuche, Béatrice A.; Bougueleret, Lydie; Poux, Sylvain; Redaschi, Nicole; Xenarios, Ioannis; Bridge, Alan

    2015-01-01

    HAMAP (High-quality Automated and Manual Annotation of Proteins—available at http://hamap.expasy.org/) is a system for the automatic classification and annotation of protein sequences. HAMAP provides annotation of the same quality and detail as UniProtKB/Swiss-Prot, using manually curated profiles for protein sequence family classification and expert curated rules for functional annotation of family members. HAMAP data and tools are made available through our website and as part of the UniRule pipeline of UniProt, providing annotation for millions of unreviewed sequences of UniProtKB/TrEMBL. Here we report on the growth of HAMAP and updates to the HAMAP system since our last report in the NAR Database Issue of 2013. We continue to augment HAMAP with new family profiles and annotation rules as new protein families are characterized and annotated in UniProtKB/Swiss-Prot; the latest version of HAMAP (as of 3 September 2014) contains 1983 family classification profiles and 1998 annotation rules (up from 1780 and 1720). We demonstrate how the complex logic of HAMAP rules allows for precise annotation of individual functional variants within large homologous protein families. We also describe improvements to our web-based tool HAMAP-Scan which simplify the classification and annotation of sequences, and the incorporation of an improved sequence-profile search algorithm. PMID:25348399

  6. Improving Recall Using Database Management Systems: A Learning Strategy.

    ERIC Educational Resources Information Center

    Jonassen, David H.

    1986-01-01

    Describes the use of microcomputer database management systems to facilitate the instructional uses of learning strategies relating to information processing skills, especially recall. Two learning strategies, cross-classification matrixing and node acquisition and integration, are highlighted. (Author/LRW)

  7. Algal functional annotation tool

    SciTech Connect

    Lopez, D.; Casero, D.; Cokus, S. J.; Merchant, S. S.; Pellegrini, M.

    2012-07-01

    The Algal Functional Annotation Tool is a web-based comprehensive analysis suite integrating annotation data from several pathway, ontology, and protein family databases. The current version provides annotation for the model alga Chlamydomonas reinhardtii, and in the future will include additional genomes. The site allows users to interpret large gene lists by identifying associated functional terms, and their enrichment. Additionally, expression data for several experimental conditions were compiled and analyzed to provide an expression-based enrichment search. A tool to search for functionally-related genes based on gene expression across these conditions is also provided. Other features include dynamic visualization of genes on KEGG pathway maps and batch gene identifier conversion.

  8. Genix: a new online automated pipeline for bacterial genome annotation.

    PubMed

    Kremer, Frederico Schmitt; Eslabão, Marcus Redü; Dellagostin, Odir Antônio; Pinto, Luciano da Silva

    2016-12-01

    Next-generation sequencing has significantly reduced the cost of genome-sequencing projects, resulting in an expressive increase in the availability of genomic data in public databases. The cheaper and easier is to sequence new genomes, the more accurate the annotation steps have to be to avoid both the loss of information and the accumulation of erroneous features that may affect the accuracy of further analysis. In the case of bacteria genomes, a range of web annotation software has been developed; however, many applications have yet to incorporate the steps required to improve their result, including the removal of false-positive/spurious and a more complete identification of non-coding features. We present Genix, a new web-based bacterial genome annotation pipeline. A comparison of the results generated by Genix for four reference genomes against those generated by other annotation tools indicated that our pipeline is able to provide results that are closer to the reference genome annotation, with a smaller amount of false-positive proteins and missing functional annotated proteins. Additionally, the metrics obtained by Genix were slightly better than those obtained by Prokka, a state-of-art standalone annotation system. Our results indicate that Genix is a useful tool that is able to provide a more refined result, and may be a user-friendly way to obtain high-quality results.

  9. WOVOdat, A Worldwide Volcano Unrest Database, to Improve Eruption Forecasts

    NASA Astrophysics Data System (ADS)

    Widiwijayanti, C.; Costa, F.; Win, N. T. Z.; Tan, K.; Newhall, C. G.; Ratdomopurbo, A.

    2015-12-01

    WOVOdat is the World Organization of Volcano Observatories' Database of Volcanic Unrest. An international effort to develop common standards for compiling and storing data on volcanic unrests in a centralized database and freely web-accessible for reference during volcanic crises, comparative studies, and basic research on pre-eruption processes. WOVOdat will be to volcanology as an epidemiological database is to medicine. Despite the large spectrum of monitoring techniques, the interpretation of monitoring data throughout the evolution of the unrest and making timely forecasts remain the most challenging tasks for volcanologists. The field of eruption forecasting is becoming more quantitative, based on the understanding of the pre-eruptive magmatic processes and dynamic interaction between variables that are at play in a volcanic system. Such forecasts must also acknowledge and express the uncertainties, therefore most of current research in this field focused on the application of event tree analysis to reflect multiple possible scenarios and the probability of each scenario. Such forecasts are critically dependent on comprehensive and authoritative global volcano unrest data sets - the very information currently collected in WOVOdat. As the database becomes more complete, Boolean searches, side-by-side digital and thus scalable comparisons of unrest, pattern recognition, will generate reliable results. Statistical distribution obtained from WOVOdat can be then used to estimate the probabilities of each scenario after specific patterns of unrest. We established main web interface for data submission and visualizations, and have now incorporated ~20% of worldwide unrest data into the database, covering more than 100 eruptive episodes. In the upcoming years we will concentrate in acquiring data from volcano observatories develop a robust data query interface, optimizing data mining, and creating tools by which WOVOdat can be used for probabilistic eruption

  10. Improving data accuracy of commercial food outlet databases.

    PubMed

    Ohri-Vachaspati, Punam; Martinez, Diane; Yedidia, Michael J; Petlick, Nirvana

    2011-01-01

    PURPOSE. Assessing food environments often requires using commercially available data. Disparate methods used for classifying food outlets in these databases call for creating a classification approach using common definitions. A systematic strategy for reclassifying food stores and restaurants, as they appear in commercial databases, into categories that differentiate the availability of healthy options is described here. DESIGN AND SETTING. Commercially available data on food outlets including names, addresses, North American Industry Classification System codes, and associated characteristics was obtained for five New Jersey communities. ANALYSIS. A reclassification methodology was developed using criteria and definitions from the literature to categorize food outlets based on availability of healthy options. Information in the database was supplemented by systematic Internet and key word searches, and from phone calls to food outlets. RESULTS. The methodology resulted in 622 supermarket/grocery stores, 183 convenience stores, and 148 specialty stores in the original data to be reclassified into 58 supermarkets, 30 grocery stores, 692 convenience stores, and 115 specialty stores. Outlets from the original list of 1485 full-service restaurants and 506 limited-service restaurants were reclassified as 563 full-service restaurants and 1247 limited-service restaurants. Reclassification resulted in less than one-seventh the number of supermarkets and grocery stores, more than three times the number of convenience stores, and twice as many limited-service restaurants-a much less healthy profile than the one generated by using exclusively the commercial databases. CONCLUSION. An explicit and replicable strategy is proposed for reclassifying food outlets in commercial databases into categories that differentiate on the basis of healthy food availability. The intent is to contribute towards building a consensus among researchers on definitions used in public health

  11. Assessment and improvement of Indian-origin rhesus macaque and Mauritian-origin cynomolgus macaque genome annotations using deep transcriptome sequencing data

    PubMed Central

    Peng, Xinxia; Pipes, Lenore; Xiong, Hao; Green, Richard R.; Jones, Daniel C.; Ruzzo, Walter L.; Schroth, Gary P.; Mason, Christopher E.; Palermo, Robert E.; Katze, Michael G.

    2014-01-01

    Background The genome annotations of rhesus (Macaca mulatta) and cynomolgus (Macaca fascicularis) macaques, two of the most common nonhuman primate animal models, are limited. Methods We analyzed large-scale macaque RNA-based next-generation sequencing (RNAseq) data to identify un-annotated macaque transcripts. Results For both macaque species, we uncovered thousands of novel isoforms for annotated genes and thousands of un-annotated intergenic transcripts enriched with non-coding RNAs. We also identified thousands of transcript sequences which are partially or completely ‘missing’ from current macaque genome assemblies. We showed that many newly identified transcripts were differentially expressed during SIV infection of rhesus macaques or during Ebola virus infection of cynomolgus macaques. Conclusions For two important macaque species, we uncovered thousands of novel isoforms and un-annotated intergenic transcripts including coding and non-coding RNAs, polyadenylated and non-polyadenylated transcripts. This resource will greatly improve future macaque studies, as demonstrated by their applications in infectious disease studies. PMID:24810475

  12. Next Generation Models for Storage and Representation of Microbial Biological Annotation

    SciTech Connect

    Quest, Daniel J; Land, Miriam L; Brettin, Thomas S; Cottingham, Robert W

    2010-01-01

    Background Traditional genome annotation systems were developed in a very different computing era, one where the World Wide Web was just emerging. Consequently, these systems are built as centralized black boxes focused on generating high quality annotation submissions to GenBank/EMBL supported by expert manual curation. The exponential growth of sequence data drives a growing need for increasingly higher quality and automatically generated annotation. Typical annotation pipelines utilize traditional database technologies, clustered computing resources, Perl, C, and UNIX file systems to process raw sequence data, identify genes, and predict and categorize gene function. These technologies tightly couple the annotation software system to hardware and third party software (e.g. relational database systems and schemas). This makes annotation systems hard to reproduce, inflexible to modification over time, difficult to assess, difficult to partition across multiple geographic sites, and difficult to understand for those who are not domain experts. These systems are not readily open to scrutiny and therefore not scientifically tractable. The advent of Semantic Web standards such as Resource Description Framework (RDF) and OWL Web Ontology Language (OWL) enables us to construct systems that address these challenges in a new comprehensive way. Results Here, we develop a framework for linking traditional data to OWL-based ontologies in genome annotation. We show how data standards can decouple hardware and third party software tools from annotation pipelines, thereby making annotation pipelines easier to reproduce and assess. An illustrative example shows how TURTLE (Terse RDF Triple Language) can be used as a human readable, but also semantically-aware, equivalent to GenBank/EMBL files. Conclusions The power of this approach lies in its ability to assemble annotation data from multiple databases across multiple locations into a representation that is understandable to

  13. Graph-based sequence annotation using a data integration approach.

    PubMed

    Pesch, Robert; Lysenko, Artem; Hindle, Matthew; Hassani-Pak, Keywan; Thiele, Ralf; Rawlings, Christopher; Köhler, Jacob; Taubert, Jan

    2008-08-25

    The automated annotation of data from high throughput sequencing and genomics experiments is a significant challenge for bioinformatics. Most current approaches rely on sequential pipelines of gene finding and gene function prediction methods that annotate a gene with information from different reference data sources. Each function prediction method contributes evidence supporting a functional assignment. Such approaches generally ignore the links between the information in the reference datasets. These links, however, are valuable for assessing the plausibility of a function assignment and can be used to evaluate the confidence in a prediction. We are working towards a novel annotation system that uses the network of information supporting the function assignment to enrich the annotation process for use by expert curators and predicting the function of previously unannotated genes. In this paper we describe our success in the first stages of this development. We present the data integration steps that are needed to create the core database of integrated reference databases (UniProt, PFAM, PDB, GO and the pathway database Ara-Cyc) which has been established in the ONDEX data integration system. We also present a comparison between different methods for integration of GO terms as part of the function assignment pipeline and discuss the consequences of this analysis for improving the accuracy of gene function annotation. The methods and algorithms presented in this publication are an integral part of the ONDEX system which is freely available from http://ondex.sf.net/.

  14. Adaptation and Recommendation Techniques to Improve the Quality of Annotations and the Relevance of Resources in Web 2.0 and Semantic Web-Based Applications

    NASA Astrophysics Data System (ADS)

    Torre, Ilaria

    The Web 2.0 and the Semantic Web represent different forms of evolution of the first-generation Web, and both of them enrich Web resources with semantic annotations. Recommendation and personalization of Web resources is another trend that becomes more and more important with the growth of information, and both the Web 2.0 and the Semantic Web are deeply connected to it. The objective of this paper is to analyze the contribution of recommendation and adaptation techniques to these paradigms and to investigate if these techniques can be used as a bridge for their integration. More specifically, the paper will focus on the contribution of adaptation and recommendation techniques to improve the quality of annotations in the Web 2.0, Semantic Web, and mixed approaches and the relevance of annotated resources that are retrieved or filtered to users.

  15. Improved annotation of antibiotic resistance determinants reveals microbial resistomes cluster by ecology.

    PubMed

    Gibson, Molly K; Forsberg, Kevin J; Dantas, Gautam

    2015-01-01

    Antibiotic resistance is a dire clinical problem with important ecological dimensions. While antibiotic resistance in human pathogens continues to rise at alarming rates, the impact of environmental resistance on human health is still unclear. To investigate the relationship between human-associated and environmental resistomes, we analyzed functional metagenomic selections for resistance against 18 clinically relevant antibiotics from soil and human gut microbiota as well as a set of multidrug-resistant cultured soil isolates. These analyses were enabled by Resfams, a new curated database of protein families and associated highly precise and accurate profile hidden Markov models, confirmed for antibiotic resistance function and organized by ontology. We demonstrate that the antibiotic resistance functions that give rise to the resistance profiles observed in environmental and human-associated microbial communities significantly differ between ecologies. Antibiotic resistance functions that most discriminate between ecologies provide resistance to β-lactams and tetracyclines, two of the most widely used classes of antibiotics in the clinic and agriculture. We also analyzed the antibiotic resistance gene composition of over 6000 sequenced microbial genomes, revealing significant enrichment of resistance functions by both ecology and phylogeny. Together, our results indicate that environmental and human-associated microbial communities harbor distinct resistance genes, suggesting that antibiotic resistance functions are largely constrained by ecology.

  16. A new improved database to support spanish phenological observations

    NASA Astrophysics Data System (ADS)

    Romero-Fresneda, Ramiro; Martínez-Núñez, Lourdes; Botey-Fullat, Roser; Gallego-Abaroa, Teresa; De Cara-García, Juan Antonio; Rodríguez-Ballesteros, César

    2017-04-01

    Since the last 30 years, phenology has regained scientific interest as the most reported biological indicator of anthropogenic climate change. AEMET (Spanish National Meteorological Agency) has long records in the field of phenological observations, since the 1940s. However, there is a large variety of paper records which are necessary to digitalize. On the other hand, it had been necessary to adapt our methods to the World Meteorological Organization (WMO) guidelines (BBCH code, data documentation- metadata…) and to standardize phenological stages and species in order to provide information to PEP725 (Pan European Phenology Database). Consequently, AEMET is developing a long-term, multi-taxa phenological database to support research and scientific studies about climate, their variability and influence on natural ecosystems, agriculture, etc. This paper presents the steps that are being carried out in order to achieve this goal.

  17. The SUPERFAMILY database in 2004: additions and improvements.

    PubMed

    Madera, Martin; Vogel, Christine; Kummerfeld, Sarah K; Chothia, Cyrus; Gough, Julian

    2004-01-01

    The SUPERFAMILY database provides structural assignments to protein sequences and a framework for analysis of the results. At the core of the database is a library of profile Hidden Markov Models that represent all proteins of known structure. The library is based on the SCOP classification of proteins: each model corresponds to a SCOP domain and aims to represent an entire superfamily. We have applied the library to predicted proteins from all completely sequenced genomes (currently 154), the Swiss-Prot and TrEMBL databases and other sequence collections. Close to 60% of all proteins have at least one match, and one half of all residues are covered by assignments. All models and full results are available for download and online browsing at http://supfam.org. Users can study the distribution of their superfamily of interest across all completely sequenced genomes, investigate with which other superfamilies it combines and retrieve proteins in which it occurs. Alternatively, concentrating on a particular genome as a whole, it is possible first, to find out its superfamily composition, and secondly, to compare it with that of other genomes to detect superfamilies that are over- or under-represented. In addition, the webserver provides the following standard services: sequence search; keyword search for genomes, superfamilies and sequence identifiers; and multiple alignment of genomic, PDB and custom sequences.

  18. The SUPERFAMILY database in 2004: additions and improvements

    PubMed Central

    Madera, Martin; Vogel, Christine; Kummerfeld, Sarah K.; Chothia, Cyrus; Gough, Julian

    2004-01-01

    The SUPERFAMILY database provides structural assignments to protein sequences and a framework for analysis of the results. At the core of the database is a library of profile Hidden Markov Models that represent all proteins of known structure. The library is based on the SCOP classification of proteins: each model corresponds to a SCOP domain and aims to represent an entire superfamily. We have applied the library to predicted proteins from all completely sequenced genomes (currently 154), the Swiss-Prot and TrEMBL databases and other sequence collections. Close to 60% of all proteins have at least one match, and one half of all residues are covered by assignments. All models and full results are available for download and online browsing at http://supfam.org. Users can study the distribution of their superfamily of interest across all completely sequenced genomes, investigate with which other superfamilies it combines and retrieve proteins in which it occurs. Alternatively, concentrating on a particular genome as a whole, it is possible first, to find out its superfamily composition, and secondly, to compare it with that of other genomes to detect superfamilies that are over- or under-represented. In addition, the webserver provides the following standard services: sequence search; keyword search for genomes, superfamilies and sequence identifiers; and multiple alignment of genomic, PDB and custom sequences. PMID:14681402

  19. Morphosyntactic annotation of CHILDES transcripts*

    PubMed Central

    SAGAE, KENJI; DAVIS, ERIC; LAVIE, ALON; MACWHINNEY, BRIAN; WINTNER, SHULY

    2014-01-01

    Corpora of child language are essential for research in child language acquisition and psycholinguistics. Linguistic annotation of the corpora provides researchers with better means for exploring the development of grammatical constructions and their usage. We describe a project whose goal is to annotate the English section of the CHILDES database with grammatical relations in the form of labeled dependency structures. We have produced a corpus of over 18,800 utterances (approximately 65,000 words) with manually curated gold-standard grammatical relation annotations. Using this corpus, we have developed a highly accurate data-driven parser for the English CHILDES data, which we used to automatically annotate the remainder of the English section of CHILDES. We have also extended the parser to Spanish, and are currently working on supporting more languages. The parser and the manually and automatically annotated data are freely available for research purposes. PMID:20334720

  20. Improved oilfield GHG accounting using a global oilfield database

    NASA Astrophysics Data System (ADS)

    Roberts, S.; Brandt, A. R.; Masnadi, M.

    2016-12-01

    The definition of oil is shifting in considerable ways. Conventional oil resources are declining as oil sands, heavy oils, and others emerge. Technological advances mean that these unconventional hydrocarbons are now viable resources. Meanwhile, scientific evidence is mounting that climate change is occurring. The oil sector is responsible for 35% of global greenhouse gas (GHG) emissions, but the climate impacts of these new unconventional oils are not well understood. As such, the Oil Climate Index (OCI) project has been an international effort to evaluate the total life-cycle environmental GHG emissions of different oil fields globally. Over the course of the first and second phases of the project, 30 and 75 global oil fields have been investigated, respectively. The 75 fields account for about 25% of global oil production. For the third phase of the project, it is aimed to expand the OCI to contain closing to 100% of global oil production; leading to the analysis of 8000 fields. To accomplish this, a robust database system is required to handle and manipulate the data. Therefore, the integration of the data into the computer science language SQL (Structured Query Language) was performed. The implementation of SQL allows users to process the data more efficiently than would be possible by using the previously established program (Microsoft Excel). Next, a graphic user interface (gui) was implemented, in the computer science language of C#, in order to make the data interactive; enabling people to update the database without prior knowledge of SQL being necessary.

  1. Improved structural annotation of protein-coding genes in the Meloidogyne hapla genome using RNA-Seq.

    PubMed

    Guo, Yuelong; Bird, David McK; Nielsen, Dahlia M

    2014-01-01

    As high-throughput cDNA sequencing (RNA-Seq) is increasingly applied to hypothesis-driven biological studies, the prediction of protein coding genes based on these data are usurping strictly in silico approaches. Compared with computationally derived gene predictions, structural annotation is more accurate when based on biological evidence, particularly RNA-Seq data. Here, we refine the current genome annotation for the Meloidogyne hapla genome utilizing RNA-Seq data. Published structural annotation defines 14 420 protein-coding genes in the M. hapla genome. Of these, 25% (3751) were found to exhibit some incongruence with RNA-Seq data. Manual annotation enabled these discrepancies to be resolved. Our analysis revealed 544 new gene models that were missing from the prior annotation. Additionally, 1457 transcribed regions were newly identified on the ends of as-yet-unjoined contigs. We also searched for trans-spliced leaders, and based on RNA-Seq data, identified genes that appear to be trans-spliced. Four 22-bp trans-spliced leaders were identified using our pipeline, including the known trans-spliced leader, which is the M. hapla ortholog of SL1. In silico predictions of trans-splicing were validated by comparison with earlier results derived from an independent cDNA library constructed to capture trans-spliced transcripts. The new annotation, which we term HapPep5, is publically available at www.hapla.org.

  2. GO annotation in InterPro: why stability does not indicate accuracy in a sea of changing annotations.

    PubMed

    Sangrador-Vegas, Amaia; Mitchell, Alex L; Chang, Hsin-Yu; Yong, Siew-Yit; Finn, Robert D

    2016-01-01

    The removal of annotation from biological databases is often perceived as an indicator of erroneous annotation. As a corollary, annotation stability is considered to be a measure of reliability. However, diverse data-driven events can affect the stability of annotations in both primary protein sequence databases and the protein family databases that are built upon the sequence databases and used to help annotate them. Here, we describe some of these events and their consequences for the InterPro database, and demonstrate that annotation removal or reassignment is not always linked to incorrect annotation by the curator. Database URL: http://www.ebi.ac.uk/interpro. © The Author(s) 2016. Published by Oxford University Press.

  3. Toward Topology Dualism: Improving the Accuracy of AS Annotations for Routers

    NASA Astrophysics Data System (ADS)

    Huffaker, Bradley; Dhamdhere, Amogh; Fomenkov, Marina; Claffy, Kc

    To describe, analyze, and model the topological and structural characteristics of the Internet, researchers use Internet maps constructed at the router or autonomous system (AS) level. Although progress has been made on each front individually, a dual graph representing connectivity of routers with AS labels remains an elusive goal. We take steps toward merging the router-level and AS-level views of the Internet. We start from a collection of traces, i.e. sequences of IP addresses obtained with large-scale traceroute measurements from a distributed set of vantage points. We use state-of-the-art alias resolution techniques to identify interfaces belonging to the same router. We develop novel heuristics to assign routers to ASes, producing an AS-router dual graph. We validate our router assignment heuristics using data provided by tier-1 and tier-2 ISPs and five research networks, and show that we successfully assign 80% of routers with interfaces from multiple ASes to the correct AS. When we include routers with interfaces from a single AS, the accuracy drops to 71%, due to the 24% of total inferred routers for which our measurement or alias resolution fails to find an interface belonging to the correct AS. We use our dual graph construct to estimate economic properties of the AS-router dual graph, such as the number of internal and border routers owned by different types of ASes. We also demonstrate how our techniques can improve IP-AS mapping, including resolving up to 62% of false loops we observed in AS paths derived from traceroutes.

  4. Genome re-annotation: a wiki solution?

    PubMed Central

    Salzberg, Steven L

    2007-01-01

    The annotation of most genomes becomes outdated over time, owing in part to our ever-improving knowledge of genomes and in part to improvements in bioinformatics software. Unfortunately, annotation is rarely if ever updated and resources to support routine reannotation are scarce. Wiki software, which would allow many scientists to edit each genome's annotation, offers one possible solution. PMID:17274839

  5. DSSTOX WEBSITE LAUNCH: IMPROVING PUBLIC ACCESS TO DATABASES FOR BUILDING STRUCTURE-TOXICITY PREDICTION MODELS

    EPA Science Inventory

    DSSTox Website Launch: Improving Public Access to Databases for Building Structure-Toxicity Prediction Models
    Ann M. Richard
    US Environmental Protection Agency, Research Triangle Park, NC, USA

    Distributed: Decentralized set of standardized, field-delimited databases,...

  6. DSSTOX WEBSITE LAUNCH: IMPROVING PUBLIC ACCESS TO DATABASES FOR BUILDING STRUCTURE-TOXICITY PREDICTION MODELS

    EPA Science Inventory

    DSSTox Website Launch: Improving Public Access to Databases for Building Structure-Toxicity Prediction Models
    Ann M. Richard
    US Environmental Protection Agency, Research Triangle Park, NC, USA

    Distributed: Decentralized set of standardized, field-delimited databases,...

  7. Annotated Videography.

    ERIC Educational Resources Information Center

    United States Holocaust Memorial Museum, Washington, DC.

    This annotated list of 43 videotapes recommended for classroom use addresses various themes for teaching about the Holocaust, including: (1) overviews of the Holocaust; (2) life before the Holocaust; (3) propaganda; (4) racism, anti-Semitism; (5) "enemies of the state"; (6) ghettos; (7) camps; (8) genocide; (9) rescue; (10) resistance;…

  8. Chemical annotation of small and peptide-like molecules at the Protein Data Bank.

    PubMed

    Young, Jasmine Y; Feng, Zukang; Dimitropoulos, Dimitris; Sala, Raul; Westbrook, John; Zhuravleva, Marina; Shao, Chenghua; Quesada, Martha; Peisach, Ezra; Berman, Helen M

    2013-01-01

    Over the past decade, the number of polymers and their complexes with small molecules in the Protein Data Bank archive (PDB) has continued to increase significantly. To support scientific advancements and ensure the best quality and completeness of the data files over the next 10 years and beyond, the Worldwide PDB partnership that manages the PDB archive is developing a new deposition and annotation system. This system focuses on efficient data capture across all supported experimental methods. The new deposition and annotation system is composed of four major modules that together support all of the processing requirements for a PDB entry. In this article, we describe one such module called the Chemical Component Annotation Tool. This tool uses information from both the Chemical Component Dictionary and Biologically Interesting molecule Reference Dictionary to aid in annotation. Benchmark studies have shown that the Chemical Component Annotation Tool provides significant improvements in processing efficiency and data quality. Database URL: http://wwpdb.org.

  9. Chemical annotation of small and peptide-like molecules at the Protein Data Bank

    PubMed Central

    Young, Jasmine Y.; Feng, Zukang; Dimitropoulos, Dimitris; Sala, Raul; Westbrook, John; Zhuravleva, Marina; Shao, Chenghua; Quesada, Martha; Peisach, Ezra; Berman, Helen M.

    2013-01-01

    Over the past decade, the number of polymers and their complexes with small molecules in the Protein Data Bank archive (PDB) has continued to increase significantly. To support scientific advancements and ensure the best quality and completeness of the data files over the next 10 years and beyond, the Worldwide PDB partnership that manages the PDB archive is developing a new deposition and annotation system. This system focuses on efficient data capture across all supported experimental methods. The new deposition and annotation system is composed of four major modules that together support all of the processing requirements for a PDB entry. In this article, we describe one such module called the Chemical Component Annotation Tool. This tool uses information from both the Chemical Component Dictionary and Biologically Interesting molecule Reference Dictionary to aid in annotation. Benchmark studies have shown that the Chemical Component Annotation Tool provides significant improvements in processing efficiency and data quality. Database URL: http://wwpdb.org PMID:24291661

  10. An evaluation of GO annotation retrieval for BioCreAtIvE and GOA.

    PubMed

    Camon, Evelyn B; Barrell, Daniel G; Dimmer, Emily C; Lee, Vivian; Magrane, Michele; Maslen, John; Binns, David; Apweiler, Rolf

    2005-01-01

    The Gene Ontology Annotation (GOA) database http://www.ebi.ac.uk/GOA aims to provide high-quality supplementary GO annotation to proteins in the UniProt Knowledgebase. Like many other biological databases, GOA gathers much of its content from the careful manual curation of literature. However, as both the volume of literature and of proteins requiring characterization increases, the manual processing capability can become overloaded. Consequently, semi-automated aids are often employed to expedite the curation process. Traditionally, electronic techniques in GOA depend largely on exploiting the knowledge in existing resources such as InterPro. However, in recent years, text mining has been hailed as a potentially useful tool to aid the curation process. To encourage the development of such tools, the GOA team at EBI agreed to take part in the functional annotation task of the BioCreAtIvE (Critical Assessment of Information Extraction systems in Biology) challenge. BioCreAtIvE task 2 was an experiment to test if automatically derived classification using information retrieval and extraction could assist expert biologists in the annotation of the GO vocabulary to the proteins in the UniProt Knowledgebase. GOA provided the training corpus of over 9000 manual GO annotations extracted from the literature. For the test set, we provided a corpus of 200 new Journal of Biological Chemistry articles used to annotate 286 human proteins with GO terms. A team of experts manually evaluated the results of 9 participating groups, each of which provided highlighted sentences to support their GO and protein annotation predictions. Here, we give a biological perspective on the evaluation, explain how we annotate GO using literature and offer some suggestions to improve the precision of future text-retrieval and extraction techniques. Finally, we provide the results of the first inter-annotator agreement study for manual GO curation, as well as an assessment of our current electronic

  11. KEGG as a reference resource for gene and protein annotation

    PubMed Central

    Kanehisa, Minoru; Sato, Yoko; Kawashima, Masayuki; Furumichi, Miho; Tanabe, Mao

    2016-01-01

    KEGG (http://www.kegg.jp/ or http://www.genome.jp/kegg/) is an integrated database resource for biological interpretation of genome sequences and other high-throughput data. Molecular functions of genes and proteins are associated with ortholog groups and stored in the KEGG Orthology (KO) database. The KEGG pathway maps, BRITE hierarchies and KEGG modules are developed as networks of KO nodes, representing high-level functions of the cell and the organism. Currently, more than 4000 complete genomes are annotated with KOs in the KEGG GENES database, which can be used as a reference data set for KO assignment and subsequent reconstruction of KEGG pathways and other molecular networks. As an annotation resource, the following improvements have been made. First, each KO record is re-examined and associated with protein sequence data used in experiments of functional characterization. Second, the GENES database now includes viruses, plasmids, and the addendum category for functionally characterized proteins that are not represented in complete genomes. Third, new automatic annotation servers, BlastKOALA and GhostKOALA, are made available utilizing the non-redundant pangenome data set generated from the GENES database. As a resource for translational bioinformatics, various data sets are created for antimicrobial resistance and drug interaction networks. PMID:26476454

  12. KEGG as a reference resource for gene and protein annotation.

    PubMed

    Kanehisa, Minoru; Sato, Yoko; Kawashima, Masayuki; Furumichi, Miho; Tanabe, Mao

    2016-01-04

    KEGG (http://www.kegg.jp/ or http://www.genome.jp/kegg/) is an integrated database resource for biological interpretation of genome sequences and other high-throughput data. Molecular functions of genes and proteins are associated with ortholog groups and stored in the KEGG Orthology (KO) database. The KEGG pathway maps, BRITE hierarchies and KEGG modules are developed as networks of KO nodes, representing high-level functions of the cell and the organism. Currently, more than 4000 complete genomes are annotated with KOs in the KEGG GENES database, which can be used as a reference data set for KO assignment and subsequent reconstruction of KEGG pathways and other molecular networks. As an annotation resource, the following improvements have been made. First, each KO record is re-examined and associated with protein sequence data used in experiments of functional characterization. Second, the GENES database now includes viruses, plasmids, and the addendum category for functionally characterized proteins that are not represented in complete genomes. Third, new automatic annotation servers, BlastKOALA and GhostKOALA, are made available utilizing the non-redundant pangenome data set generated from the GENES database. As a resource for translational bioinformatics, various data sets are created for antimicrobial resistance and drug interaction networks. © The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research.

  13. Introduction to the STS National Database Series: Outcomes Analysis, Quality Improvement, and Patient Safety.

    PubMed

    Jacobs, Jeffrey P; Shahian, David M; Prager, Richard L; Edwards, Fred H; McDonald, Donna; Han, Jane M; D'Agostino, Richard S; Jacobs, Marshall L; Kozower, Benjamin D; Badhwar, Vinay; Thourani, Vinod H; Gaissert, Henning A; Fernandez, Felix G; Wright, Cam; Fann, James I; Paone, Gaetano; Sanchez, Juan A; Cleveland, Joseph C; Brennan, J Matthew; Dokholyan, Rachel S; O'Brien, Sean M; Peterson, Eric D; Grover, Frederick L; Patterson, G Alexander

    2015-12-01

    The Society of Thoracic Surgeons (STS) National Database is the foundation for most of the Society's quality, research, and patient safety activities. Beginning in January 2016 and repeating each year, The Annals of Thoracic Surgery will publish a monthly Database series of scholarly articles on outcomes analysis, quality improvement, and patient safety. Six articles will be directly derived from the STS National Database and will be published every other month: three articles on outcomes and quality (one each from the STS Adult Cardiac Surgery Database, the STS Congenital Heart Surgery Database, and the STS General Thoracic Surgery Database), and three articles on research (one from each of these three specialty databases). These six articles will alternate with five additional articles on topics related to patient safety. The final article, to be published in December, will provide a summary of the prior 11 manuscripts. This series will allow STS and its Workforces on National Databases, Research Development, and Patient Safety to convey timely information aimed at improving the quality and safety of cardiothoracic surgery. Copyright © 2015 The Society of Thoracic Surgeons. Published by Elsevier Inc. All rights reserved.

  14. The National Clinical Database as an Initiative for Quality Improvement in Japan

    PubMed Central

    Murakami, Arata; Hirata, Yasutaka; Motomura, Noboru; Miyata, Hiroaki; Iwanaka, Tadashi; Takamoto, Shinichi

    2014-01-01

    The JCVSD (Japan Cardiovascular Surgery Database) was organized in 2000 to improve the quality of cardiovascular surgery in Japan. Web-based data harvesting on adult cardiac surgery was started (Japan Adult Cardiovascular Surgery Database, JACVSD) in 2001, and on congenital heart surgery (Japan Congenital Cardiovascular Surgery Database, JCCVSD) in 2008. Both databases grew to become national databases by the end of 2013. This was influenced by the success of the Society for Thoracic Surgeons’ National Database, which contains comparable input items. In 2011, the Japanese Board of Cardiovascular Surgery announced that the JACVSD and JCCVSD data are to be used for board certification, which improved the quality of the first paperless and web-based board certification review undertaken in 2013. These changes led to a further step. In 2011, the National Clinical Database (NCD) was organized to investigate the feasibility of clinical databases in other medical fields, especially surgery. In the NCD, the board certification system of the Japan Surgical Society, the basic association of surgery was set as the first level in the hierarchy of specialties, and nine associations and six board certification systems were set at the second level as subspecialties. The NCD grew rapidly, and now covers 95% of total surgical procedures. The participating associations will release or have released risk models, and studies that use ‘big data’ from these databases have been published. The national databases have contributed to evidence-based medicine, to the accountability of medical professionals, and to quality assessment and quality improvement of surgery in Japan. PMID:25346898

  15. eggNOG 4.5: a hierarchical orthology framework with improved functional annotations for eukaryotic, prokaryotic and viral sequences.

    PubMed

    Huerta-Cepas, Jaime; Szklarczyk, Damian; Forslund, Kristoffer; Cook, Helen; Heller, Davide; Walter, Mathias C; Rattei, Thomas; Mende, Daniel R; Sunagawa, Shinichi; Kuhn, Michael; Jensen, Lars Juhl; von Mering, Christian; Bork, Peer

    2016-01-04

    eggNOG is a public resource that provides Orthologous Groups (OGs) of proteins at different taxonomic levels, each with integrated and summarized functional annotations. Developments since the latest public release include changes to the algorithm for creating OGs across taxonomic levels, making nested groups hierarchically consistent. This allows for a better propagation of functional terms across nested OGs and led to the novel annotation of 95 890 previously uncharacterized OGs, increasing overall annotation coverage from 67% to 72%. The functional annotations of OGs have been expanded to also provide Gene Ontology terms, KEGG pathways and SMART/Pfam domains for each group. Moreover, eggNOG now provides pairwise orthology relationships within OGs based on analysis of phylogenetic trees. We have also incorporated a framework for quickly mapping novel sequences to OGs based on precomputed HMM profiles. Finally, eggNOG version 4.5 incorporates a novel data set spanning 2605 viral OGs, covering 5228 proteins from 352 viral proteomes. All data are accessible for bulk downloading, as a web-service, and through a completely redesigned web interface. The new access points provide faster searches and a number of new browsing and visualization capabilities, facilitating the needs of both experts and less experienced users. eggNOG v4.5 is available at http://eggnog.embl.de.

  16. PolyQ 2.0: an improved version of PolyQ, a database of human polyglutamine proteins

    PubMed Central

    Li, Chen; Nagel, Jeremy; Androulakis, Steve; Song, Jiangning; Buckle, Ashley M.

    2016-01-01

    Proteins with expanded polyglutamine (polyQ) repeats are involved in human neurodegenerative diseases, via a gain-of-function mechanism of neuronal toxicity involving protein conformational changes that result in the formation and deposition of β-sheet-rich aggregates. Aggregation is dependent on the context and properties of the host protein, such as domain context and location of the repeat tract. In order to explore this relationship in greater detail, here we describe PolyQ 2.0, an updated database that provides a comprehensive knowledgebase for human polyQ proteins. Compared with the previous PolyQ database, our new database provides a variety of substantial updates including detailed biological annotations and search options. Biological annotations in terms of domain context information, protein structural and functional annotation, single point mutations, predicted disordered regions, protein–protein interaction partners, metabolic/signaling pathways, post-translational modification sites and evolutionary information are made available. Several new database functionalities have also been provided, including search using multiple/combinatory keywords, and submission of new data entries. Also, several third-party plug-ins are employed to enhance data visualization in PolyQ 2.0. In PolyQ 2.0 the proteins are reclassified into 3 new categories and contain 9 reviewed disease-associated polyQ proteins, 105 reviewed non-disease polyQ proteins and 146 un-reviewed polyQ proteins (reviewed by UniProt curators). We envisage that this updated database will be a useful resource for functional and structural investigation of human polyQ proteins. Database URL: http://lightning.med.monash.edu/polyq2/ PMID:26980520

  17. PolyQ 2.0: an improved version of PolyQ, a database of human polyglutamine proteins.

    PubMed

    Li, Chen; Nagel, Jeremy; Androulakis, Steve; Song, Jiangning; Buckle, Ashley M

    2016-01-01

    Proteins with expanded polyglutamine (polyQ) repeats are involved in human neurodegenerative diseases, via a gain-of-function mechanism of neuronal toxicity involving protein conformational changes that result in the formation and deposition of β-sheet-rich aggregates. Aggregation is dependent on the context and properties of the host protein, such as domain context and location of the repeat tract. In order to explore this relationship in greater detail, here we describe PolyQ 2.0, an updated database that provides a comprehensive knowledgebase for human polyQ proteins. Compared with the previous PolyQ database, our new database provides a variety of substantial updates including detailed biological annotations and search options. Biological annotations in terms of domain context information, protein structural and functional annotation, single point mutations, predicted disordered regions, protein-protein interaction partners, metabolic/signaling pathways, post-translational modification sites and evolutionary information are made available. Several new database functionalities have also been provided, including search using multiple/combinatory keywords, and submission of new data entries. Also, several third-party plug-ins are employed to enhance data visualization in PolyQ 2.0. In PolyQ 2.0 the proteins are reclassified into 3 new categories and contain 9 reviewed disease-associated polyQ proteins, 105 reviewed non-disease polyQ proteins and 146 un-reviewed polyQ proteins (reviewed by UniProt curators). We envisage that this updated database will be a useful resource for functional and structural investigation of human polyQ proteins. Database URL: http://lightning.med.monash.edu/polyq2/. © The Author(s) 2016. Published by Oxford University Press.

  18. Improving classification in protein structure databases using text mining.

    PubMed

    Koussounadis, Antonis; Redfern, Oliver C; Jones, David T

    2009-05-05

    The classification of protein domains in the CATH resource is primarily based on structural comparisons, sequence similarity and manual analysis. One of the main bottlenecks in the processing of new entries is the evaluation of 'borderline' cases by human curators with reference to the literature, and better tools for helping both expert and non-expert users quickly identify relevant functional information from text are urgently needed. A text based method for protein classification is presented, which complements the existing sequence and structure-based approaches, especially in cases exhibiting low similarity to existing members and requiring manual intervention. The method is based on the assumption that textual similarity between sets of documents relating to proteins reflects biological function similarities and can be exploited to make classification decisions. An optimal strategy for the text comparisons was identified by using an established gold standard enzyme dataset. Filtering of the abstracts using a machine learning approach to discriminate sentences containing functional, structural and classification information that are relevant to the protein classification task improved performance. Testing this classification scheme on a dataset of 'borderline' protein domains that lack significant sequence or structure similarity to classified proteins showed that although, as expected, the structural similarity classifiers perform better on average, there is a significant benefit in incorporating text similarity in logistic regression models, indicating significant orthogonality in this additional information. Coverage was significantly increased especially at low error rates, which is important for routine classification tasks: 15.3% for the combined structure and text classifier compared to 10% for the structural classifier alone, at 10-3 error rate. Finally when only the highest scoring predictions were used to infer classification, an extra 4.2% of

  19. Mouse Phenome Database

    PubMed Central

    Grubb, Stephen C.; Bult, Carol J.; Bogue, Molly A.

    2014-01-01

    The Mouse Phenome Database (MPD; phenome.jax.org) was launched in 2001 as the data coordination center for the international Mouse Phenome Project. MPD integrates quantitative phenotype, gene expression and genotype data into a common annotated framework to facilitate query and analysis. MPD contains >3500 phenotype measurements or traits relevant to human health, including cancer, aging, cardiovascular disorders, obesity, infectious disease susceptibility, blood disorders, neurosensory disorders, drug addiction and toxicity. Since our 2012 NAR report, we have added >70 new data sets, including data from Collaborative Cross lines and Diversity Outbred mice. During this time we have completely revamped our homepage, improved search and navigational aspects of the MPD application, developed several web-enabled data analysis and visualization tools, annotated phenotype data to public ontologies, developed an ontology browser and released new single nucleotide polymorphism query functionality with much higher density coverage than before. Here, we summarize recent data acquisitions and describe our latest improvements. PMID:24243846

  20. A generic method for improving the spatial interoperability of medical and ecological databases.

    PubMed

    Ghenassia, A; Beuscart, J B; Ficheur, G; Occelli, F; Babykina, E; Chazard, E; Genin, M

    2017-10-03

    The availability of big data in healthcare and the intensive development of data reuse and georeferencing have opened up perspectives for health spatial analysis. However, fine-scale spatial studies of ecological and medical databases are limited by the change of support problem and thus a lack of spatial unit interoperability. The use of spatial disaggregation methods to solve this problem introduces errors into the spatial estimations. Here, we present a generic, two-step method for merging medical and ecological databases that avoids the use of spatial disaggregation methods, while maximizing the spatial resolution. Firstly, a mapping table is created after one or more transition matrices have been defined. The latter link the spatial units of the original databases to the spatial units of the final database. Secondly, the mapping table is validated by (1) comparing the covariates contained in the two original databases, and (2) checking the spatial validity with a spatial continuity criterion and a spatial resolution index. We used our novel method to merge a medical database (the French national diagnosis-related group database, containing 5644 spatial units) with an ecological database (produced by the French National Institute of Statistics and Economic Studies, and containing with 36,594 spatial units). The mapping table yielded 5632 final spatial units. The mapping table's validity was evaluated by comparing the number of births in the medical database and the ecological databases in each final spatial unit. The median [interquartile range] relative difference was 2.3% [0; 5.7]. The spatial continuity criterion was low (2.4%), and the spatial resolution index was greater than for most French administrative areas. Our innovative approach improves interoperability between medical and ecological databases and facilitates fine-scale spatial analyses. We have shown that disaggregation models and large aggregation techniques are not necessarily the best ways to

  1. ATGC database and ATGC-COGs: an updated resource for micro- and macro-evolutionary studies of prokaryotic genomes and protein family annotation

    PubMed Central

    Kristensen, David M.; Wolf, Yuri I.; Koonin, Eugene V.

    2017-01-01

    The Alignable Tight Genomic Clusters (ATGCs) database is a collection of closely related bacterial and archaeal genomes that provides several tools to aid research into evolutionary processes in the microbial world. Each ATGC is a taxonomy-independent cluster of 2 or more completely sequenced genomes that meet the objective criteria of a high degree of local gene order (synteny) and a small number of synonymous substitutions in the protein-coding genes. As such, each ATGC is suited for analysis of microevolutionary variations within a cohesive group of organisms (e.g. species), whereas the entire collection of ATGCs is useful for macroevolutionary studies. The ATGC database includes many forms of pre-computed data, in particular ATGC-COGs (Clusters of Orthologous Genes), multiple sequence alignments, a set of ‘index’ orthologs representing the most well-conserved members of each ATGC-COG, the phylogenetic tree of the organisms within each ATGC, etc. Although the ATGC database contains several million proteins from thousands of genomes organized into hundreds of clusters (roughly a 4-fold increase since the last version of the ATGC database), it is now built with completely automated methods and will be regularly updated following new releases of the NCBI RefSeq database. The ATGC database is hosted jointly at the University of Iowa at dmk-brain.ecn.uiowa.edu/ATGC/ and the NCBI at ftp.ncbi.nlm.nih.gov/pub/kristensen/ATGC/atgc_home.html. PMID:28053163

  2. FunctionAnnotator, a versatile and efficient web tool for non-model organism annotation.

    PubMed

    Chen, Ting-Wen; Gan, Ruei-Chi; Fang, Yi-Kai; Chien, Kun-Yi; Liao, Wei-Chao; Chen, Chia-Chun; Wu, Timothy H; Chang, Ian Yi-Feng; Yang, Chi; Huang, Po-Jung; Yeh, Yuan-Ming; Chiu, Cheng-Hsun; Huang, Tzu-Wen; Tang, Petrus

    2017-09-05

    Along with the constant improvement in high-throughput sequencing technology, an increasing number of transcriptome sequencing projects are carried out in organisms without decoded genome information and even on environmental biological samples. To study the biological functions of novel transcripts, the very first task is to identify their potential functions. We present a web-based annotation tool, FunctionAnnotator, which offers comprehensive annotations, including GO term assignment, enzyme annotation, domain/motif identification and predictions for subcellular localization. To accelerate the annotation process, we have optimized the computation processes and used parallel computing for all annotation steps. Moreover, FunctionAnnotator is designed to be versatile, and it generates a variety of useful outputs for facilitating other analyses. Here, we demonstrate how FunctionAnnotator can be helpful in annotating non-model organisms. We further illustrate that FunctionAnnotator can estimate the taxonomic composition of environmental samples and assist in the identification of novel proteins by combining RNA-Seq data with proteomics technology. In summary, FunctionAnnotator can efficiently annotate transcriptomes and greatly benefits studies focusing on non-model organisms or metatranscriptomes. FunctionAnnotator, a comprehensive annotation web-service tool, is freely available online at: http://fa.cgu.edu.tw/ . This new web-based annotator will shed light on field studies involving organisms without a reference genome.

  3. Improved white spruce (Picea glauca) genome assemblies and annotation of large gene families of conifer terpenoid and phenolic defense metabolism.

    PubMed

    Warren, René L; Keeling, Christopher I; Yuen, Macaire Man Saint; Raymond, Anthony; Taylor, Greg A; Vandervalk, Benjamin P; Mohamadi, Hamid; Paulino, Daniel; Chiu, Readman; Jackman, Shaun D; Robertson, Gordon; Yang, Chen; Boyle, Brian; Hoffmann, Margarete; Weigel, Detlef; Nelson, David R; Ritland, Carol; Isabel, Nathalie; Jaquish, Barry; Yanchuk, Alvin; Bousquet, Jean; Jones, Steven J M; MacKay, John; Birol, Inanc; Bohlmann, Joerg

    2015-07-01

    White spruce (Picea glauca), a gymnosperm tree, has been established as one of the models for conifer genomics. We describe the draft genome assemblies of two white spruce genotypes, PG29 and WS77111, innovative tools for the assembly of very large genomes, and the conifer genomics resources developed in this process. The two white spruce genotypes originate from distant geographic regions of western (PG29) and eastern (WS77111) North America, and represent elite trees in two Canadian tree-breeding programs. We present an update (V3 and V4) for a previously reported PG29 V2 draft genome assembly and introduce a second white spruce genome assembly for genotype WS77111. Assemblies of the PG29 and WS77111 genomes confirm the reconstructed white spruce genome size in the 20 Gbp range, and show broad synteny. Using the PG29 V3 assembly and additional white spruce genomics and transcriptomics resources, we performed MAKER-P annotation and meticulous expert annotation of very large gene families of conifer defense metabolism, the terpene synthases and cytochrome P450s. We also comprehensively annotated the white spruce mevalonate, methylerythritol phosphate and phenylpropanoid pathways. These analyses highlighted the large extent of gene and pseudogene duplications in a conifer genome, in particular for genes of secondary (i.e. specialized) metabolism, and the potential for gain and loss of function for defense and adaptation.

  4. Ranking Biomedical Annotations with Annotator's Semantic Relevancy

    PubMed Central

    2014-01-01

    Biomedical annotation is a common and affective artifact for researchers to discuss, show opinion, and share discoveries. It becomes increasing popular in many online research communities, and implies much useful information. Ranking biomedical annotations is a critical problem for data user to efficiently get information. As the annotator's knowledge about the annotated entity normally determines quality of the annotations, we evaluate the knowledge, that is, semantic relationship between them, in two ways. The first is extracting relational information from credible websites by mining association rules between an annotator and a biomedical entity. The second way is frequent pattern mining from historical annotations, which reveals common features of biomedical entities that an annotator can annotate with high quality. We propose a weighted and concept-extended RDF model to represent an annotator, a biomedical entity, and their background attributes and merge information from the two ways as the context of an annotator. Based on that, we present a method to rank the annotations by evaluating their correctness according to user's vote and the semantic relevancy between the annotator and the annotated entity. The experimental results show that the approach is applicable and efficient even when data set is large. PMID:24899918

  5. Ranking biomedical annotations with annotator's semantic relevancy.

    PubMed

    Wu, Aihua

    2014-01-01

    Biomedical annotation is a common and affective artifact for researchers to discuss, show opinion, and share discoveries. It becomes increasing popular in many online research communities, and implies much useful information. Ranking biomedical annotations is a critical problem for data user to efficiently get information. As the annotator's knowledge about the annotated entity normally determines quality of the annotations, we evaluate the knowledge, that is, semantic relationship between them, in two ways. The first is extracting relational information from credible websites by mining association rules between an annotator and a biomedical entity. The second way is frequent pattern mining from historical annotations, which reveals common features of biomedical entities that an annotator can annotate with high quality. We propose a weighted and concept-extended RDF model to represent an annotator, a biomedical entity, and their background attributes and merge information from the two ways as the context of an annotator. Based on that, we present a method to rank the annotations by evaluating their correctness according to user's vote and the semantic relevancy between the annotator and the annotated entity. The experimental results show that the approach is applicable and efficient even when data set is large.

  6. Improvements to PATRIC, the all-bacterial Bioinformatics Database and Analysis Resource Center

    PubMed Central

    Wattam, Alice R.; Davis, James J.; Assaf, Rida; Boisvert, Sébastien; Brettin, Thomas; Bun, Christopher; Conrad, Neal; Dietrich, Emily M.; Disz, Terry; Gabbard, Joseph L.; Gerdes, Svetlana; Henry, Christopher S.; Kenyon, Ronald W.; Machi, Dustin; Mao, Chunhong; Nordberg, Eric K.; Olsen, Gary J.; Murphy-Olson, Daniel E.; Olson, Robert; Overbeek, Ross; Parrello, Bruce; Pusch, Gordon D.; Shukla, Maulik; Vonstein, Veronika; Warren, Andrew; Xia, Fangfang; Yoo, Hyunseung; Stevens, Rick L.

    2017-01-01

    The Pathosystems Resource Integration Center (PATRIC) is the bacterial Bioinformatics Resource Center (https://www.patricbrc.org). Recent changes to PATRIC include a redesign of the web interface and some new services that provide users with a platform that takes them from raw reads to an integrated analysis experience. The redesigned interface allows researchers direct access to tools and data, and the emphasis has changed to user-created genome-groups, with detailed summaries and views of the data that researchers have selected. Perhaps the biggest change has been the enhanced capability for researchers to analyze their private data and compare it to the available public data. Researchers can assemble their raw sequence reads and annotate the contigs using RASTtk. PATRIC also provides services for RNA-Seq, variation, model reconstruction and differential expression analysis, all delivered through an updated private workspace. Private data can be compared by ‘virtual integration’ to any of PATRIC's public data. The number of genomes available for comparison in PATRIC has expanded to over 80 000, with a special emphasis on genomes with antimicrobial resistance data. PATRIC uses this data to improve both subsystem annotation and k-mer classification, and tags new genomes as having signatures that indicate susceptibility or resistance to specific antibiotics. PMID:27899627

  7. Improvements to PATRIC, the all-bacterial Bioinformatics Database and Analysis Resource Center.

    PubMed

    Wattam, Alice R; Davis, James J; Assaf, Rida; Boisvert, Sébastien; Brettin, Thomas; Bun, Christopher; Conrad, Neal; Dietrich, Emily M; Disz, Terry; Gabbard, Joseph L; Gerdes, Svetlana; Henry, Christopher S; Kenyon, Ronald W; Machi, Dustin; Mao, Chunhong; Nordberg, Eric K; Olsen, Gary J; Murphy-Olson, Daniel E; Olson, Robert; Overbeek, Ross; Parrello, Bruce; Pusch, Gordon D; Shukla, Maulik; Vonstein, Veronika; Warren, Andrew; Xia, Fangfang; Yoo, Hyunseung; Stevens, Rick L

    2017-01-04

    The Pathosystems Resource Integration Center (PATRIC) is the bacterial Bioinformatics Resource Center (https://www.patricbrc.org). Recent changes to PATRIC include a redesign of the web interface and some new services that provide users with a platform that takes them from raw reads to an integrated analysis experience. The redesigned interface allows researchers direct access to tools and data, and the emphasis has changed to user-created genome-groups, with detailed summaries and views of the data that researchers have selected. Perhaps the biggest change has been the enhanced capability for researchers to analyze their private data and compare it to the available public data. Researchers can assemble their raw sequence reads and annotate the contigs using RASTtk. PATRIC also provides services for RNA-Seq, variation, model reconstruction and differential expression analysis, all delivered through an updated private workspace. Private data can be compared by 'virtual integration' to any of PATRIC's public data. The number of genomes available for comparison in PATRIC has expanded to over 80 000, with a special emphasis on genomes with antimicrobial resistance data. PATRIC uses this data to improve both subsystem annotation and k-mer classification, and tags new genomes as having signatures that indicate susceptibility or resistance to specific antibiotics. Published by Oxford University Press on behalf of Nucleic Acids Research 2016. This work is written by (a) US Government employee(s) and is in the public domain in the US.

  8. Improvements to PATRIC, the all-bacterial Bioinformatics Database and Analysis Resource Center

    DOE PAGES

    Davis, James J.; Brettin, Thomas; Dietrich, Emily M.; ...

    2016-11-28

    Here, the Pathosystems Resource Integration Center (PATRIC) is the bacterial Bioinformatics Resource Center. Recent changes to PATRIC include a redesign of the web interface and some new services that provide users with a platform that takes them from raw reads to an integrated analysis experience. The redesigned interface allows researchers direct access to tools and data, and the emphasis has changed to user-created genome-groups, with detailed summaries and views of the data that researchers have selected. Perhaps the biggest change has been the enhanced capability for researchers to analyze their private data and compare it to the available public data.more » Researchers can assemble their raw sequence reads and annotate the contigs using RASTtk. PATRIC also provides services for RNA-Seq, variation, model reconstruction and differential expression analysis, all delivered through an updated private workspace. Private data can be compared by `virtual integration' to any of PATRIC's public data. The number of genomes available for comparison in PATRIC has expanded to over 80 000, with a special emphasis on genomes with antimicrobial resistance data. PATRIC uses this data to improve both subsystem annotation and k-mer classification, and tags new genomes as having signatures that indicate susceptibility or resistance to specific antibiotics.« less

  9. Improvements to PATRIC, the all-bacterial Bioinformatics Database and Analysis Resource Center

    SciTech Connect

    Davis, James J.; Brettin, Thomas; Dietrich, Emily M.; Gerdes, Svetlana; Henry, Christopher S.; Murphy-Olson, Daniel E.; Olson, Robert; Parrello, Bruce D.; Pusch, Gordon D.; Shukla, Maulik P.; Vonstein, Veronika; Xia, Fangfang; Yoo, Hyunseung; Stevens, Rick L.

    2016-11-28

    Here, the Pathosystems Resource Integration Center (PATRIC) is the bacterial Bioinformatics Resource Center. Recent changes to PATRIC include a redesign of the web interface and some new services that provide users with a platform that takes them from raw reads to an integrated analysis experience. The redesigned interface allows researchers direct access to tools and data, and the emphasis has changed to user-created genome-groups, with detailed summaries and views of the data that researchers have selected. Perhaps the biggest change has been the enhanced capability for researchers to analyze their private data and compare it to the available public data. Researchers can assemble their raw sequence reads and annotate the contigs using RASTtk. PATRIC also provides services for RNA-Seq, variation, model reconstruction and differential expression analysis, all delivered through an updated private workspace. Private data can be compared by `virtual integration' to any of PATRIC's public data. The number of genomes available for comparison in PATRIC has expanded to over 80 000, with a special emphasis on genomes with antimicrobial resistance data. PATRIC uses this data to improve both subsystem annotation and k-mer classification, and tags new genomes as having signatures that indicate susceptibility or resistance to specific antibiotics.

  10. Towards a Library of Standard Operating Procedures (SOPs) for (meta)genomic annotation

    SciTech Connect

    Kyrpides, Nikos; Angiuoli, Samuel V.; Cochrane, Guy; Field, Dawn; Garrity, George; Gussman, Aaron; Kodira, Chinnappa D.; Klimke, William; Kyrpides, Nikos; Madupu, Ramana; Markowitz, Victor; Tatusova, Tatiana; Thomson, Nick; White, Owen

    2008-04-01

    Genome annotations describe the features of genomes and accompany sequences in genome databases. The methodologies used to generate genome annotation are diverse and typically vary amongst groups. Descriptions of the annotation procedure are helpful in interpreting genome annotation data. Standard Operating Procedures (SOPs) for genome annotation describe the processes that generate genome annotations. Some groups are currently documenting procedures but standards are lacking for structure and content of annotation SOPs. In addition, there is no central repository to store and disseminate procedures and protocols for genome annotation. We highlight the importance of SOPs for genome annotation and endorse a central online repository of SOPs.

  11. UCSC Data Integrator and Variant Annotation Integrator

    PubMed Central

    Hinrichs, Angie S.; Raney, Brian J.; Speir, Matthew L.; Rhead, Brooke; Casper, Jonathan; Karolchik, Donna; Kuhn, Robert M.; Rosenbloom, Kate R.; Zweig, Ann S.; Haussler, David; Kent, W. James

    2016-01-01

    Summary: Two new tools on the UCSC Genome Browser web site provide improved ways of combining information from multiple datasets, optionally including the user's own custom track data and/or data from track hubs. The Data Integrator combines columns from multiple data tracks, showing all items from the first track along with overlapping items from the other tracks. The Variant Annotation Integrator is tailored to adding functional annotations to variant calls; it offers a more restricted set of underlying data tracks but adds predictions of each variant's consequences for any overlapping or nearby gene transcript. When available, it optionally adds additional annotations including effect prediction scores from dbNSFP for missense mutations, ENCODE regulatory summary tracks and conservation scores. Availability and implementation: The web tools are freely available at http://genome.ucsc.edu/ and the underlying database is available for download at http://hgdownload.cse.ucsc.edu/. The software (written in C and Javascript) is available from https://genome-store.ucsc.edu/ and is freely available for academic and non-profit usage; commercial users must obtain a license. Contact: angie@soe.ucsc.edu Supplementary information: Supplementary data are available at Bioinformatics online. PMID:26740527

  12. TheViral MetaGenome Annotation Pipeline(VMGAP):an automated tool for the functional annotation of viral Metagenomic shotgun sequencing data

    PubMed Central

    Lorenzi, Hernan A.; Hoover, Jeff; Inman, Jason; Safford, Todd; Murphy, Sean; Kagan, Leonid; Williamson, Shannon J.

    2011-01-01

    In the past few years, the field of metagenomics has been growing at an accelerated pace, particularly in response to advancements in new sequencing technologies. The large volume of sequence data from novel organisms generated by metagenomic projects has triggered the development of specialized databases and tools focused on particular groups of organisms or data types. Here we describe a pipeline for the functional annotation of viral metagenomic sequence data. The Viral MetaGenome Annotation Pipeline (VMGAP) pipeline takes advantage of a number of specialized databases, such as collections of mobile genetic elements and environmental metagenomes to improve the classification and functional prediction of viral gene products. The pipeline assigns a functional term to each predicted protein sequence following a suite of comprehensive analyses whose results are ranked according to a priority rules hierarchy. Additional annotation is provided in the form of enzyme commission (EC) numbers, GO/MeGO terms and Hidden Markov Models together with supporting evidence. PMID:21886867

  13. Improvements to the Magnetics Information Consortium (MagIC) Paleo and Rock Magnetic Database

    NASA Astrophysics Data System (ADS)

    Jarboe, N.; Minnett, R.; Tauxe, L.; Koppers, A. A. P.; Constable, C.; Jonestrask, L.

    2015-12-01

    The Magnetic Information Consortium (MagIC) database (http://earthref.org/MagIC/) continues to improve the ease of data uploading and editing, the creation of complex searches, data visualization, and data downloads for the paleomagnetic, geomagnetic, and rock magnetic communities. Online data editing is now available and the need for proprietary spreadsheet software is therefore entirely negated. The data owner can change values in the database or delete entries through an HTML 5 web interface that resembles typical spreadsheets in behavior and uses. Additive uploading now allows for additions to data sets to be uploaded with a simple drag and drop interface. Searching the database has improved with the addition of more sophisticated search parameters and with the facility to use them in complex combinations. A comprehensive summary view of a search result has been added for increased quick data comprehension while a raw data view is available if one desires to see all data columns as stored in the database. Data visualization plots (ARAI, equal area, demagnetization, Zijderveld, etc.) are presented with the data when appropriate to aid the user in understanding the dataset. MagIC data associated with individual contributions or from online searches may be downloaded in the tab delimited MagIC text file format for susbsequent offline use and analysis. With input from the paleomagnetic, geomagnetic, and rock magnetic communities, the MagIC database will continue to improve as a data warehouse and resource.

  14. Adopting a corporate perspective on databases. Improving support for research and decision making.

    PubMed

    Meistrell, M; Schlehuber, C

    1996-03-01

    The Veterans Health Administration (VHA) is at the forefront of designing and managing health care information systems that accommodate the needs of clinicians, researchers, and administrators at all levels. Rather than using one single-site, centralized corporate database VHA has constructed several large databases with different configurations to meet the needs of users with different perspectives. The largest VHA database is the Decentralized Hospital Computer Program (DHCP), a multisite, distributed data system that uses decoupled hospital databases. The centralization of DHCP policy has promoted data coherence, whereas the decentralization of DHCP management has permitted system development to be done with maximum relevance to the users'local practices. A more recently developed VHA data system, the Event Driven Reporting system (EDR), uses multiple, highly coupled databases to provide workload data at facility, regional, and national levels. The EDR automatically posts a subset of DHCP data to local and national VHA management. The development of the EDR illustrates how adoption of a corporate perspective can offer significant database improvements at reasonable cost and with modest impact on the legacy system.

  15. Lessons learned while building the Deepwater Horizon Database: Toward improved data sharing in coastal science

    NASA Astrophysics Data System (ADS)

    Thessen, Anne E.; McGinnis, Sean; North, Elizabeth W.

    2016-02-01

    Process studies and coupled-model validation efforts in geosciences often require integration of multiple data types across time and space. For example, improved prediction of hydrocarbon fate and transport is an important societal need which fundamentally relies upon synthesis of oceanography and hydrocarbon chemistry. Yet, there are no publically accessible databases which integrate these diverse data types in a georeferenced format, nor are there guidelines for developing such a database. The objective of this research was to analyze the process of building one such database to provide baseline information on data sources and data sharing and to document the challenges and solutions that arose during this major undertaking. The resulting Deepwater Horizon Database was approximately 2.4 GB in size and contained over 8 million georeferenced data points collected from industry, government databases, volunteer networks, and individual researchers. The major technical challenges that were overcome were reconciliation of terms, units, and quality flags which were necessary to effectively integrate the disparate data sets. Assembling this database required the development of relationships with individual researchers and data managers which often involved extensive e-mail contacts. The average number of emails exchanged per data set was 7.8. Of the 95 relevant data sets that were discovered, 38 (40%) were obtained, either in whole or in part. Over one third (36%) of the requests for data went unanswered. The majority of responses were received after the first request (64%) and within the first week of the first request (67%). Although fewer than half of the potentially relevant datasets were incorporated into the database, the level of sharing (40%) was high compared to some other disciplines where sharing can be as low as 10%. Our suggestions for building integrated databases include budgeting significant time for e-mail exchanges, being cognizant of the cost versus

  16. Gene Ontology annotations and resources.

    PubMed

    Blake, J A; Dolan, M; Drabkin, H; Hill, D P; Li, Ni; Sitnikov, D; Bridges, S; Burgess, S; Buza, T; McCarthy, F; Peddinti, D; Pillai, L; Carbon, S; Dietze, H; Ireland, A; Lewis, S E; Mungall, C J; Gaudet, P; Chrisholm, R L; Fey, P; Kibbe, W A; Basu, S; Siegele, D A; McIntosh, B K; Renfro, D P; Zweifel, A E; Hu, J C; Brown, N H; Tweedie, S; Alam-Faruque, Y; Apweiler, R; Auchinchloss, A; Axelsen, K; Bely, B; Blatter, M -C; Bonilla, C; Bouguerleret, L; Boutet, E; Breuza, L; Bridge, A; Chan, W M; Chavali, G; Coudert, E; Dimmer, E; Estreicher, A; Famiglietti, L; Feuermann, M; Gos, A; Gruaz-Gumowski, N; Hieta, R; Hinz, C; Hulo, C; Huntley, R; James, J; Jungo, F; Keller, G; Laiho, K; Legge, D; Lemercier, P; Lieberherr, D; Magrane, M; Martin, M J; Masson, P; Mutowo-Muellenet, P; O'Donovan, C; Pedruzzi, I; Pichler, K; Poggioli, D; Porras Millán, P; Poux, S; Rivoire, C; Roechert, B; Sawford, T; Schneider, M; Stutz, A; Sundaram, S; Tognolli, M; Xenarios, I; Foulgar, R; Lomax, J; Roncaglia, P; Khodiyar, V K; Lovering, R C; Talmud, P J; Chibucos, M; Giglio, M Gwinn; Chang, H -Y; Hunter, S; McAnulla, C; Mitchell, A; Sangrador, A; Stephan, R; Harris, M A; Oliver, S G; Rutherford, K; Wood, V; Bahler, J; Lock, A; Kersey, P J; McDowall, D M; Staines, D M; Dwinell, M; Shimoyama, M; Laulederkind, S; Hayman, T; Wang, S -J; Petri, V; Lowry, T; D'Eustachio, P; Matthews, L; Balakrishnan, R; Binkley, G; Cherry, J M; Costanzo, M C; Dwight, S S; Engel, S R; Fisk, D G; Hitz, B C; Hong, E L; Karra, K; Miyasato, S R; Nash, R S; Park, J; Skrzypek, M S; Weng, S; Wong, E D; Berardini, T Z; Huala, E; Mi, H; Thomas, P D; Chan, J; Kishore, R; Sternberg, P; Van Auken, K; Howe, D; Westerfield, M

    2013-01-01

    The Gene Ontology (GO) Consortium (GOC, http://www.geneontology.org) is a community-based bioinformatics resource that classifies gene product function through the use of structured, controlled vocabularies. Over the past year, the GOC has implemented several processes to increase the quantity, quality and specificity of GO annotations. First, the number of manual, literature-based annotations has grown at an increasing rate. Second, as a result of a new 'phylogenetic annotation' process, manually reviewed, homology-based annotations are becoming available for a broad range of species. Third, the quality of GO annotations has been improved through a streamlined process for, and automated quality checks of, GO annotations deposited by different annotation groups. Fourth, the consistency and correctness of the ontology itself has increased by using automated reasoning tools. Finally, the GO has been expanded not only to cover new areas of biology through focused interaction with experts, but also to capture greater specificity in all areas of the ontology using tools for adding new combinatorial terms. The GOC works closely with other ontology developers to support integrated use of terminologies. The GOC supports its user community through the use of e-mail lists, social media and web-based resources.

  17. The Protein Information Resource: an integrated public resource of functional annotation of proteins

    PubMed Central

    Wu, Cathy H.; Huang, Hongzhan; Arminski, Leslie; Castro-Alvear, Jorge; Chen, Yongxing; Hu, Zhang-Zhi; Ledley, Robert S.; Lewis, Kali C.; Mewes, Hans-Werner; Orcutt, Bruce C.; Suzek, Baris E.; Tsugita, Akira; Vinayaka, C. R.; Yeh, Lai-Su L.; Zhang, Jian; Barker, Winona C.

    2002-01-01

    The Protein Information Resource (PIR) serves as an integrated public resource of functional annotation of protein data to support genomic/proteomic research and scientific discovery. The PIR, in collaboration with the Munich Information Center for Protein Sequences (MIPS) and the Japan International Protein Information Database (JIPID), produces the PIR-International Protein Sequence Database (PSD), the major annotated protein sequence database in the public domain, containing about 250 000 proteins. To improve protein annotation and the coverage of experimentally validated data, a bibliography submission system is developed for scientists to submit, categorize and retrieve literature information. Comprehensive protein information is available from iProClass, which includes family classification at the superfamily, domain and motif levels, structural and functional features of proteins, as well as cross-references to over 40 biological databases. To provide timely and comprehensive protein data with source attribution, we have introduced a non-redundant reference protein database, PIR-NREF. The database consists of about 800 000 proteins collected from PIR-PSD, SWISS-PROT, TrEMBL, GenPept, RefSeq and PDB, with composite protein names and literature data. To promote database interoperability, we provide XML data distribution and open database schema, and adopt common ontologies. The PIR web site (http://pir.georgetown.edu/) features data mining and sequence analysis tools for information retrieval and functional identification of proteins based on both sequence and annotation information. The PIR databases and other files are also available by FTP (ftp://nbrfa.georgetown.edu/pir_databases). PMID:11752247

  18. Curation of the genome annotation of Pichia pastoris (Komagataella phaffii) CBS7435 from gene level to protein function.

    PubMed

    Valli, Minoska; Tatto, Nadine E; Peymann, Armin; Gruber, Clemens; Landes, Nils; Ekker, Heinz; Thallinger, Gerhard G; Mattanovich, Diethard; Gasser, Brigitte; Graf, Alexandra B

    2016-09-01

    As manually curated and non-automated BLAST analysis of the published Pichia pastoris genome sequences revealed many differences between the gene annotations of the strains GS115 and CBS7435, RNA-Seq analysis, supported by proteomics, was performed to improve the genome annotation. Detailed analysis of sequence alignment and protein domain predictions were made to extend the functional genome annotation to all P. pastoris sequences. This allowed the identification of 492 new ORFs, 4916 hypothetical UTRs and the correction of 341 incorrect ORF predictions, which were mainly due to the presence of upstream ATG or erroneous intron predictions. Moreover, 175 previously erroneously annotated ORFs need to be removed from the annotation. In total, we have annotated 5325 ORFs. Regarding the functionality of those genes, we improved all gene and protein descriptions. Thereby, the percentage of ORFs with functional annotation was increased from 48% to 73%. Furthermore, we defined functional groups, covering 25 biological cellular processes of interest, by grouping all genes that are part of the defined process. All data are presented in the newly launched genome browser and database available at www.pichiagenome.org In summary, we present a wide spectrum of curation of the P. pastoris genome annotation from gene level to protein function.

  19. Improving the genome annotation of the acarbose producer Actinoplanes sp. SE50/110 by sequencing enriched 5'-ends of primary transcripts.

    PubMed

    Schwientek, Patrick; Neshat, Armin; Kalinowski, Jörn; Klein, Andreas; Rückert, Christian; Schneiker-Bekel, Susanne; Wendler, Sergej; Stoye, Jens; Pühler, Alfred

    2014-11-20

    Actinoplanes sp. SE50/110 is the producer of the alpha-glucosidase inhibitor acarbose, which is an economically relevant and potent drug in the treatment of type-2 diabetes mellitus. In this study, we present the detection of transcription start sites on this genome by sequencing enriched 5'-ends of primary transcripts. Altogether, 1427 putative transcription start sites were initially identified. With help of the annotated genome sequence, 661 transcription start sites were found to belong to the leader region of protein-coding genes with the surprising result that roughly 20% of these genes rank among the class of leaderless transcripts. Next, conserved promoter motifs were identified for protein-coding genes with and without leader sequences. The mapped transcription start sites were finally used to improve the annotation of the Actinoplanes sp. SE50/110 genome sequence. Concerning protein-coding genes, 41 translation start sites were corrected and 9 novel protein-coding genes could be identified. In addition to this, 122 previously undetermined non-coding RNA (ncRNA) genes of Actinoplanes sp. SE50/110 were defined. Focusing on antisense transcription start sites located within coding genes or their leader sequences, it was discovered that 96 of those ncRNA genes belong to the class of antisense RNA (asRNA) genes. The remaining 26 ncRNA genes were found outside of known protein-coding genes. Four chosen examples of prominent ncRNA genes, namely the transfer messenger RNA gene ssrA, the ribonuclease P class A RNA gene rnpB, the cobalamin riboswitch RNA gene cobRS, and the selenocysteine-specific tRNA gene selC, are presented in more detail. This study demonstrates that sequencing of enriched 5'-ends of primary transcripts and the identification of transcription start sites are valuable tools for advanced genome annotation of Actinoplanes sp. SE50/110 and most probably also for other bacteria.

  20. CASME II: An Improved Spontaneous Micro-Expression Database and the Baseline Evaluation

    PubMed Central

    Yan, Wen-Jing; Li, Xiaobai; Wang, Su-Jing; Zhao, Guoying; Liu, Yong-Jin; Chen, Yu-Hsin; Fu, Xiaolan

    2014-01-01

    A robust automatic micro-expression recognition system would have broad applications in national safety, police interrogation, and clinical diagnosis. Developing such a system requires high quality databases with sufficient training samples which are currently not available. We reviewed the previously developed micro-expression databases and built an improved one (CASME II), with higher temporal resolution (200 fps) and spatial resolution (about 280×340 pixels on facial area). We elicited participants' facial expressions in a well-controlled laboratory environment and proper illumination (such as removing light flickering). Among nearly 3000 facial movements, 247 micro-expressions were selected for the database with action units (AUs) and emotions labeled. For baseline evaluation, LBP-TOP and SVM were employed respectively for feature extraction and classifier with the leave-one-subject-out cross-validation method. The best performance is 63.41% for 5-class classification. PMID:24475068

  1. Annotating the biomedical literature for the human variome

    PubMed Central

    Verspoor, Karin; Jimeno Yepes, Antonio; Cavedon, Lawrence; McIntosh, Tara; Herten-Crabb, Asha; Thomas, Zoë; Plazzer, John-Paul

    2013-01-01

    This article introduces the Variome Annotation Schema, a schema that aims to capture the core concepts and relations relevant to cataloguing and interpreting human genetic variation and its relationship to disease, as described in the published literature. The schema was inspired by the needs of the database curators of the International Society for Gastrointestinal Hereditary Tumours (InSiGHT) database, but is intended to have application to genetic variation information in a range of diseases. The schema has been applied to a small corpus of full text journal publications on the subject of inherited colorectal cancer. We show that the inter-annotator agreement on annotation of this corpus ranges from 0.78 to 0.95 F-score across different entity types when exact matching is measured, and improves to a minimum F-score of 0.87 when boundary matching is relaxed. Relations show more variability in agreement, but several are reliable, with the highest, cohort-has-size, reaching 0.90 F-score. We also explore the relevance of the schema to the InSiGHT database curation process. The schema and the corpus represent an important new resource for the development of text mining solutions that address relationships among patient cohorts, disease and genetic variation, and therefore, we also discuss the role text mining might play in the curation of information related to the human variome. The corpus is available at http://opennicta.com/home/health/variome. PMID:23584833

  2. An atlas of bovine gene expression reveals novel distinctive tissue characteristics and evidence for improving genome annotation

    PubMed Central

    2010-01-01

    Background A comprehensive transcriptome survey, or gene atlas, provides information essential for a complete understanding of the genomic biology of an organism. We present an atlas of RNA abundance for 92 adult, juvenile and fetal cattle tissues and three cattle cell lines. Results The Bovine Gene Atlas was generated from 7.2 million unique digital gene expression tag sequences (300.2 million total raw tag sequences), from which 1.59 million unique tag sequences were identified that mapped to the draft bovine genome accounting for 85% of the total raw tag abundance. Filtering these tags yielded 87,764 unique tag sequences that unambiguously mapped to 16,517 annotated protein-coding loci in the draft genome accounting for 45% of the total raw tag abundance. Clustering of tissues based on tag abundance profiles generally confirmed ontology classification based on anatomy. There were 5,429 constitutively expressed loci and 3,445 constitutively expressed unique tag sequences mapping outside annotated gene boundaries that represent a resource for enhancing current gene models. Physical measures such as inferred transcript length or antisense tag abundance identified tissues with atypical transcriptional tag profiles. We report for the first time the tissue-specific variation in the proportion of mitochondrial transcriptional tag abundance. Conclusions The Bovine Gene Atlas is the deepest and broadest transcriptome survey of any livestock genome to date. Commonalities and variation in sense and antisense transcript tag profiles identified in different tissues facilitate the examination of the relationship between gene expression, tissue, and gene function. PMID:20961407

  3. Computational algorithms to predict Gene Ontology annotations

    PubMed Central

    2015-01-01

    Background Gene function annotations, which are associations between a gene and a term of a controlled vocabulary describing gene functional features, are of paramount importance in modern biology. Datasets of these annotations, such as the ones provided by the Gene Ontology Consortium, are used to design novel biological experiments and interpret their results. Despite their importance, these sources of information have some known issues. They are incomplete, since biological knowledge is far from being definitive and it rapidly evolves, and some erroneous annotations may be present. Since the curation process of novel annotations is a costly procedure, both in economical and time terms, computational tools that can reliably predict likely annotations, and thus quicken the discovery of new gene annotations, are very useful. Methods We used a set of computational algorithms and weighting schemes to infer novel gene annotations from a set of known ones. We used the latent semantic analysis approach, implementing two popular algorithms (Latent Semantic Indexing and Probabilistic Latent Semantic Analysis) and propose a novel method, the Semantic IMproved Latent Semantic Analysis, which adds a clustering step on the set of considered genes. Furthermore, we propose the improvement of these algorithms by weighting the annotations in the input set. Results We tested our methods and their weighted variants on the Gene Ontology annotation sets of three model organism genes (Bos taurus, Danio rerio and Drosophila melanogaster ). The methods showed their ability in predicting novel gene annotations and the weighting procedures demonstrated to lead to a valuable improvement, although the obtained results vary according to the dimension of the input annotation set and the considered algorithm. Conclusions Out of the three considered methods, the Semantic IMproved Latent Semantic Analysis is the one that provides better results. In particular, when coupled with a proper

  4. Publicly Available Database : Improved Spectral Line Measurements In SDSS DR7 Galaxies

    NASA Astrophysics Data System (ADS)

    Oh, Kyuseok; Sarzi, M.; Schawinski, K.; Yi, S. K.

    2012-01-01

    We present a new database of absorption and emission line measurements based on the Sloan Digital Sky Survey 7th data release for the galaxies within a redshift of 0.2. Our work makes use of the publicly available penalized pixel-fitting(pPXF) and GANDALF codes, aiming to improve the existing measurements for stellar kinematics, the strength of various absorption-line features, and the flux and width of the emissions from different species of ionized gas. The absorption line strengths measured by SDSS pipeline are seriously contaminated by emission fill-in. We effectively separate emission lines from absorption lines. For instance, this work successfully extract [NI] doublet from Mgb and it leads to more realistic result of alpha enhancement on late-type galaxies compared to the previous database. Besides accurately measuring line strengths, the database provides new parameters that are indicative of line strength measurement quality. Users can build a subset of database optimal for their studies using specific cuts in the fitting quality parameters as well as empirical signal-to-noise. Applying these parameters, we found `hidden’ broad-line-region galaxies and they turned out to be Seyfert I nuclei that were not picked up as AGN by SDSS. The database is publicly available at http://gem.yonsei.ac.kr/ossy

  5. TriAnnot: A Versatile and High Performance Pipeline for the Automated Annotation of Plant Genomes

    PubMed Central

    Leroy, Philippe; Guilhot, Nicolas; Sakai, Hiroaki; Bernard, Aurélien; Choulet, Frédéric; Theil, Sébastien; Reboux, Sébastien; Amano, Naoki; Flutre, Timothée; Pelegrin, Céline; Ohyanagi, Hajime; Seidel, Michael; Giacomoni, Franck; Reichstadt, Mathieu; Alaux, Michael; Gicquello, Emmanuelle; Legeai, Fabrice; Cerutti, Lorenzo; Numa, Hisataka; Tanaka, Tsuyoshi; Mayer, Klaus; Itoh, Takeshi; Quesneville, Hadi; Feuillet, Catherine

    2012-01-01

    In support of the international effort to obtain a reference sequence of the bread wheat genome and to provide plant communities dealing with large and complex genomes with a versatile, easy-to-use online automated tool for annotation, we have developed the TriAnnot pipeline. Its modular architecture allows for the annotation and masking of transposable elements, the structural, and functional annotation of protein-coding genes with an evidence-based quality indexing, and the identification of conserved non-coding sequences and molecular markers. The TriAnnot pipeline is parallelized on a 712 CPU computing cluster that can run a 1-Gb sequence annotation in less than 5 days. It is accessible through a web interface for small scale analyses or through a server for large scale annotations. The performance of TriAnnot was evaluated in terms of sensitivity, specificity, and general fitness using curated reference sequence sets from rice and wheat. In less than 8 h, TriAnnot was able to predict more than 83% of the 3,748 CDS from rice chromosome 1 with a fitness of 67.4%. On a set of 12 reference Mb-sized contigs from wheat chromosome 3B, TriAnnot predicted and annotated 93.3% of the genes among which 54% were perfectly identified in accordance with the reference annotation. It also allowed the curation of 12 genes based on new biological evidences, increasing the percentage of perfect gene prediction to 63%. TriAnnot systematically showed a higher fitness than other annotation pipelines that are not improved for wheat. As it is easily adaptable to the annotation of other plant genomes, TriAnnot should become a useful resource for the annotation of large and complex genomes in the future. PMID:22645565

  6. Updating annotations with the distributed annotation system and the automated sequence annotation pipeline

    PubMed Central

    Speier, William; Ochs, Michael F.

    2012-01-01

    Summary: The integration between BioDAS ProServer and Automated Sequence Annotation Pipeline (ASAP) provides an interface for querying diverse annotation sources, chaining and linking results, and standardizing the output using the Distributed Annotation System (DAS) protocol. This interface allows pipeline plans in ASAP to be integrated into any system using HTTP and also allows the information returned by ASAP to be included in the DAS registry for use in any DAS-aware system. Three example implementations have been developed: the first accesses TRANSFAC information to automatically create gene sets for the Coordinated Gene Activity in Pattern Sets (CoGAPS) algorithm; the second integrates annotations from multiple array platforms and provides unified annotations in an R environment; and the third wraps the UniProt database for integration with the SPICE DAS client. Availability: Source code for ASAP 2.7 and the DAS 1.6 interface is available under the GNU public license. Proserver 2.20 is free software available from SourceForge. Scripts for installation and configuration on Linux are provided at our website: http://www.rits.onc.jhmi.edu/dbb/custom/A6/ Contact: Speier@mii.ucla.edu or mfo@jhu.edu Supplementary information: Supplementary data are available at Bioinformatics online. PMID:22945787

  7. Gene Ontology Annotations and Resources

    PubMed Central

    2013-01-01

    The Gene Ontology (GO) Consortium (GOC, http://www.geneontology.org) is a community-based bioinformatics resource that classifies gene product function through the use of structured, controlled vocabularies. Over the past year, the GOC has implemented several processes to increase the quantity, quality and specificity of GO annotations. First, the number of manual, literature-based annotations has grown at an increasing rate. Second, as a result of a new ‘phylogenetic annotation’ process, manually reviewed, homology-based annotations are becoming available for a broad range of species. Third, the quality of GO annotations has been improved through a streamlined process for, and automated quality checks of, GO annotations deposited by different annotation groups. Fourth, the consistency and correctness of the ontology itself has increased by using automated reasoning tools. Finally, the GO has been expanded not only to cover new areas of biology through focused interaction with experts, but also to capture greater specificity in all areas of the ontology using tools for adding new combinatorial terms. The GOC works closely with other ontology developers to support integrated use of terminologies. The GOC supports its user community through the use of e-mail lists, social media and web-based resources. PMID:23161678

  8. ToxGen: an improved reference database for the identification of type B-trichothecene genotypes in Fusarium

    PubMed Central

    2017-01-01

    Type B trichothecenes, which pose a serious hazard to consumer health, occur worldwide in grains. These mycotoxins are produced mainly by three different trichothecene genotypes/chemotypes: 3ADON (3-acetyldeoxynivalenol), 15ADON (15-acetyldeoxynivalenol) and NIV (nivalenol), named after these three major mycotoxin compounds. Correct identification of these genotypes is elementary for all studies relating to population surveys, fungal ecology and mycotoxicology. Trichothecene producers exhibit enormous strain-dependent chemical diversity, which may result in variation in levels of the genotype’s determining toxin and in the production of low to high amounts of atypical compounds. New high-throughput DNA-sequencing technologies promise to boost the diagnostics of mycotoxin genotypes. However, this requires a reference database containing a satisfactory taxonomic sampling of sequences showing high correlation to actually produced chemotypes. We believe that one of the most pressing current challenges of such a database is the linking of molecular identification with chemical diversity of the strains, as well as other metadata. In this study, we use the Tri12 gene involved in mycotoxin biosynthesis for identification of Tri genotypes through sequence comparison. Tri12 sequences from a range of geographically diverse fungal strains comprising 22 Fusarium species were stored in the ToxGen database, which covers descriptive and up-to-date annotations such as indication on Tri genotype and chemotype of the strains, chemical diversity, information on trichothecene-inducing host, substrate or media, geographical locality, and most recent taxonomic affiliations. The present initiative bridges the gap between the demands of comprehensive studies on trichothecene producers and the existing nucleotide sequence databases, which lack toxicological and other auxiliary data. We invite researchers working in the fields of fungal taxonomy, epidemiology and mycotoxicology to join the

  9. Computer systems for annotation of single molecule fragments

    DOEpatents

    Schwartz, David Charles; Severin, Jessica

    2016-07-19

    There are provided computer systems for visualizing and annotating single molecule images. Annotation systems in accordance with this disclosure allow a user to mark and annotate single molecules of interest and their restriction enzyme cut sites thereby determining the restriction fragments of single nucleic acid molecules. The markings and annotations may be automatically generated by the system in certain embodiments and they may be overlaid translucently onto the single molecule images. An image caching system may be implemented in the computer annotation systems to reduce image processing time. The annotation systems include one or more connectors connecting to one or more databases capable of storing single molecule data as well as other biomedical data. Such diverse array of data can be retrieved and used to validate the markings and annotations. The annotation systems may be implemented and deployed over a computer network. They may be ergonomically optimized to facilitate user interactions.

  10. Support Vector Machines for Improved Peptide Identification from Tandem Mass Spectrometry Database Search

    SciTech Connect

    Webb-Robertson, Bobbie-Jo M.

    2009-05-06

    Accurate identification of peptides is a current challenge in mass spectrometry (MS) based proteomics. The standard approach uses a search routine to compare tandem mass spectra to a database of peptides associated with the target organism. These database search routines yield multiple metrics associated with the quality of the mapping of the experimental spectrum to the theoretical spectrum of a peptide. The structure of these results make separating correct from false identifications difficult and has created a false identification problem. Statistical confidence scores are an approach to battle this false positive problem that has led to significant improvements in peptide identification. We have shown that machine learning, specifically support vector machine (SVM), is an effective approach to separating true peptide identifications from false ones. The SVM-based peptide statistical scoring method transforms a peptide into a vector representation based on database search metrics to train and validate the SVM. In practice, following the database search routine, a peptides is denoted in its vector representation and the SVM generates a single statistical score that is then used to classify presence or absence in the sample

  11. An extended bioreaction database that significantly improves reconstruction and analysis of genome-scale metabolic networks.

    PubMed

    Stelzer, Michael; Sun, Jibin; Kamphans, Tom; Fekete, Sándor P; Zeng, An-Ping

    2011-11-01

    The bioreaction database established by Ma and Zeng (Bioinformatics, 2003, 19, 270-277) for in silico reconstruction of genome-scale metabolic networks has been widely used. Based on more recent information in the reference databases KEGG LIGAND and Brenda, we upgrade the bioreaction database in this work by almost doubling the number of reactions from 3565 to 6851. Over 70% of the reactions have been manually updated/revised in terms of reversibility, reactant pairs, currency metabolites and error correction. For the first time, 41 spontaneous sugar mutarotation reactions are introduced into the biochemical database. The upgrade significantly improves the reconstruction of genome scale metabolic networks. Many gaps or missing biochemical links can be recovered, as exemplified with three model organisms Homo sapiens, Aspergillus niger, and Escherichia coli. The topological parameters of the constructed networks were also largely affected, however, the overall network structure remains scale-free. Furthermore, we consider the problem of computing biologically feasible shortest paths in reconstructed metabolic networks. We show that these paths are hard to compute and present solutions to find such paths in networks of small and medium size.

  12. Data pre-processing to improve the mining of large feed databases.

    PubMed

    Maroto-Molina, F; Gómez-Cabrera, A; Guerrero-Ginel, J E; Garrido-Varo, A; Sauvant, D; Tran, G; Heuzé, V; Pérez-Marín, D C

    2013-07-01

    The information stored in animal feed databases is highly variable, in terms of both provenance and quality; therefore, data pre-processing is essential to ensure reliable results. Yet, pre-processing at best tends to be unsystematic; at worst, it may even be wholly ignored. This paper sought to develop a systematic approach to the various stages involved in pre-processing to improve feed database outputs. The database used contained analytical and nutritional data on roughly 20 000 alfalfa samples. A range of techniques were examined for integrating data from different sources, for detecting duplicates and, particularly, for detecting outliers. Special attention was paid to the comparison of univariate and multivariate solutions. Major issues relating to the heterogeneous nature of data contained in this database were explored, the observed outliers were characterized and ad hoc routines were designed for error control. Finally, a heuristic diagram was designed to systematize the various aspects involved in the detection and management of outliers and errors.

  13. SMPDB 2.0: big improvements to the Small Molecule Pathway Database.

    PubMed

    Jewison, Timothy; Su, Yilu; Disfany, Fatemeh Miri; Liang, Yongjie; Knox, Craig; Maciejewski, Adam; Poelzer, Jenna; Huynh, Jessica; Zhou, You; Arndt, David; Djoumbou, Yannick; Liu, Yifeng; Deng, Lu; Guo, An Chi; Han, Beomsoo; Pon, Allison; Wilson, Michael; Rafatnia, Shahrzad; Liu, Philip; Wishart, David S

    2014-01-01

    The Small Molecule Pathway Database (SMPDB, http://www.smpdb.ca) is a comprehensive, colorful, fully searchable and highly interactive database for visualizing human metabolic, drug action, drug metabolism, physiological activity and metabolic disease pathways. SMPDB contains >600 pathways with nearly 75% of its pathways not found in any other database. All SMPDB pathway diagrams are extensively hyperlinked and include detailed information on the relevant tissues, organs, organelles, subcellular compartments, protein cofactors, protein locations, metabolite locations, chemical structures and protein quaternary structures. Since its last release in 2010, SMPDB has undergone substantial upgrades and significant expansion. In particular, the total number of pathways in SMPDB has grown by >70%. Additionally, every previously entered pathway has been completely redrawn, standardized, corrected, updated and enhanced with additional molecular or cellular information. Many SMPDB pathways now include transporter proteins as well as much more physiological, tissue, target organ and reaction compartment data. Thanks to the development of a standardized pathway drawing tool (called PathWhiz) all SMPDB pathways are now much more easily drawn and far more rapidly updated. PathWhiz has also allowed all SMPDB pathways to be saved in a BioPAX format. Significant improvements to SMPDB's visualization interface now make the browsing, selection, recoloring and zooming of pathways far easier and far more intuitive. Because of its utility and breadth of coverage, SMPDB is now integrated into several other databases including HMDB and DrugBank.

  14. Dictionary-driven protein annotation

    PubMed Central

    Rigoutsos, Isidore; Huynh, Tien; Floratos, Aris; Parida, Laxmi; Platt, Daniel

    2002-01-01

    Computational methods seeking to automatically determine the properties (functional, structural, physicochemical, etc.) of a protein directly from the sequence have long been the focus of numerous research groups. With the advent of advanced sequencing methods and systems, the number of amino acid sequences that are being deposited in the public databases has been increasing steadily. This has in turn generated a renewed demand for automated approaches that can annotate individual sequences and complete genomes quickly, exhaustively and objectively. In this paper, we present one such approach that is centered around and exploits the Bio-Dictionary, a collection of amino acid patterns that completely covers the natural sequence space and can capture functional and structural signals that have been reused during evolution, within and across protein families. Our annotation approach also makes use of a weighted, position-specific scoring scheme that is unaffected by the over-representation of well-conserved proteins and protein fragments in the databases used. For a given query sequence, the method permits one to determine, in a single pass, the following: local and global similarities between the query and any protein already present in a public database; the likeness of the query to all available archaeal/bacterial/eukaryotic/viral sequences in the database as a function of amino acid position within the query; the character of secondary structure of the query as a function of amino acid position within the query; the cytoplasmic, transmembrane or extracellular behavior of the query; the nature and position of binding domains, active sites, post-translationally modified sites, signal peptides, etc. In terms of performance, the proposed method is exhaustive, objective and allows for the rapid annotation of individual sequences and full genomes. Annotation examples are presented and discussed in Results, including individual queries and complete genomes that were

  15. Dictionary-driven protein annotation.

    PubMed

    Rigoutsos, Isidore; Huynh, Tien; Floratos, Aris; Parida, Laxmi; Platt, Daniel

    2002-09-01

    Computational methods seeking to automatically determine the properties (functional, structural, physicochemical, etc.) of a protein directly from the sequence have long been the focus of numerous research groups. With the advent of advanced sequencing methods and systems, the number of amino acid sequences that are being deposited in the public databases has been increasing steadily. This has in turn generated a renewed demand for automated approaches that can annotate individual sequences and complete genomes quickly, exhaustively and objectively. In this paper, we present one such approach that is centered around and exploits the Bio-Dictionary, a collection of amino acid patterns that completely covers the natural sequence space and can capture functional and structural signals that have been reused during evolution, within and across protein families. Our annotation approach also makes use of a weighted, position-specific scoring scheme that is unaffected by the over-representation of well-conserved proteins and protein fragments in the databases used. For a given query sequence, the method permits one to determine, in a single pass, the following: local and global similarities between the query and any protein already present in a public database; the likeness of the query to all available archaeal/ bacterial/eukaryotic/viral sequences in the database as a function of amino acid position within the query; the character of secondary structure of the query as a function of amino acid position within the query; the cytoplasmic, transmembrane or extracellular behavior of the query; the nature and position of binding domains, active sites, post-translationally modified sites, signal peptides, etc. In terms of performance, the proposed method is exhaustive, objective and allows for the rapid annotation of individual sequences and full genomes. Annotation examples are presented and discussed in Results, including individual queries and complete genomes that were

  16. The UCSC Genome Browser database: 2014 update.

    PubMed

    Karolchik, Donna; Barber, Galt P; Casper, Jonathan; Clawson, Hiram; Cline, Melissa S; Diekhans, Mark; Dreszer, Timothy R; Fujita, Pauline A; Guruvadoo, Luvina; Haeussler, Maximilian; Harte, Rachel A; Heitner, Steve; Hinrichs, Angie S; Learned, Katrina; Lee, Brian T; Li, Chin H; Raney, Brian J; Rhead, Brooke; Rosenbloom, Kate R; Sloan, Cricket A; Speir, Matthew L; Zweig, Ann S; Haussler, David; Kuhn, Robert M; Kent, W James

    2014-01-01

    The University of California Santa Cruz (UCSC) Genome Browser (http://genome.ucsc.edu) offers online public access to a growing database of genomic sequence and annotations for a large collection of organisms, primarily vertebrates, with an emphasis on the human and mouse genomes. The Browser's web-based tools provide an integrated environment for visualizing, comparing, analysing and sharing both publicly available and user-generated genomic data sets. As of September 2013, the database contained genomic sequence and a basic set of annotation 'tracks' for ∼90 organisms. Significant new annotations include a 60-species multiple alignment conservation track on the mouse, updated UCSC Genes tracks for human and mouse, and several new sets of variation and ENCODE data. New software tools include a Variant Annotation Integrator that returns predicted functional effects of a set of variants uploaded as a custom track, an extension to UCSC Genes that displays haplotype alleles for protein-coding genes and an expansion of data hubs that includes the capability to display remotely hosted user-provided assembly sequence in addition to annotation data. To improve European access, we have added a Genome Browser mirror (http://genome-euro.ucsc.edu) hosted at Bielefeld University in Germany.

  17. Validating Annotations for Uncharacterized Proteins in Shewanella oneidensis

    PubMed Central

    Louie, Brenton; Tarczy-Hornoch, Peter; Higdon, Roger

    2008-01-01

    Abstract Proteins of unknown function are a barrier to our understanding of molecular biology. Assigning function to these “uncharacterized” proteins is imperative, but challenging. The usual approach is similarity searches using annotation databases, which are useful for predicting function. However, since the performance of these databases on uncharacterized proteins is basically unknown, the accuracy of their predictions is suspect, making annotation difficult. To address this challenge, we developed a benchmark annotation dataset of 30 proteins in Shewanella oneidensis. The proteins in the dataset were originally uncharacterized after the initial annotation of the S. oneidensis proteome in 2002. In the intervening 5 years, the accumulation of new experimental evidence has enabled specific functions to be predicted. We utilized this benchmark dataset to evaluate several commonly utilized annotation databases. According to our criteria, six annotation databases accurately predicted functions for at least 60% of proteins in our dataset. Two of these six even had a “conditional accuracy” of 90%. Conditional accuracy is another evaluation metric we developed which excludes results from databases where no function was predicted. Also, 27 of the 30 proteins' functions were correctly predicted by at least one database. These represent one of the first performance evaluations of annotation databases on uncharacterized proteins. Our evaluation indicates that these databases readily incorporate new information and are accurate in predicting functions for uncharacterized proteins, provided that experimental function evidence exists. PMID:18687039

  18. Semantic annotation in biomedicine: the current landscape.

    PubMed

    Jovanović, Jelena; Bagheri, Ebrahim

    2017-09-22

    The abundance and unstructured nature of biomedical texts, be it clinical or research content, impose significant challenges for the effective and efficient use of information and knowledge stored in such texts. Annotation of biomedical documents with machine intelligible semantics facilitates advanced, semantics-based text management, curation, indexing, and search. This paper focuses on annotation of biomedical entity mentions with concepts from relevant biomedical knowledge bases such as UMLS. As a result, the meaning of those mentions is unambiguously and explicitly defined, and thus made readily available for automated processing. This process is widely known as semantic annotation, and the tools that perform it are known as semantic annotators.Over the last dozen years, the biomedical research community has invested significant efforts in the development of biomedical semantic annotation technology. Aiming to establish grounds for further developments in this area, we review a selected set of state of the art biomedical semantic annotators, focusing particularly on general purpose annotators, that is, semantic annotation tools that can be customized to work with texts from any area of biomedicine. We also examine potential directions for further improvements of today's annotators which could make them even more capable of meeting the needs of real-world applications. To motivate and encourage further developments in this area, along the suggested and/or related directions, we review existing and potential practical applications and benefits of semantic annotators.

  19. Improving N-terminal protein annotation of Plasmodium species based on signal peptide prediction of orthologous proteins

    PubMed Central

    2012-01-01

    Background Signal peptide is one of the most important motifs involved in protein trafficking and it ultimately influences protein function. Considering the expected functional conservation among orthologs it was hypothesized that divergence in signal peptides within orthologous groups is mainly due to N-terminal protein sequence misannotation. Thus, discrepancies in signal peptide prediction of orthologous proteins were used to identify misannotated proteins in five Plasmodium species. Methods Signal peptide (SignalP) and orthology (OrthoMCL) were combined in an innovative strategy to identify orthologous groups showing discrepancies in signal peptide prediction among their protein members (Mixed groups). In a comparative analysis, multiple alignments for each of these groups and gene models were visually inspected in search of misannotated proteins and, whenever possible, alternative gene models were proposed. Thresholds for signal peptide prediction parameters were also modified to reduce their impact as a possible source of discrepancy among orthologs. Validation of new gene models was based on RT-PCR (few examples) or on experimental evidence already published (ApiLoc). Results The rate of misannotated proteins was significantly higher in Mixed groups than in Positive or Negative groups, corroborating the proposed hypothesis. A total of 478 proteins were reannotated and change of signal peptide prediction from negative to positive was the most common. Reannotations triggered the conversion of almost 50% of all Mixed groups, which were further reduced by optimization of signal peptide prediction parameters. Conclusions The methodological novelty proposed here combining orthology and signal peptide prediction proved to be an effective strategy for the identification of proteins showing wrongly N-terminal annotated sequences, and it might have an important impact in the available data for genome-wide searching of potential vaccine and drug targets and proteins

  20. Improving N-terminal protein annotation of Plasmodium species based on signal peptide prediction of orthologous proteins.

    PubMed

    Neto, Armando de Menezes; Alvarenga, Denise A; Rezende, Antônio M; Resende, Sarah S; Ribeiro, Ricardo de Souza; Fontes, Cor J F; Carvalho, Luzia H; de Brito, Cristiana F Alves

    2012-11-15

    Signal peptide is one of the most important motifs involved in protein trafficking and it ultimately influences protein function. Considering the expected functional conservation among orthologs it was hypothesized that divergence in signal peptides within orthologous groups is mainly due to N-terminal protein sequence misannotation. Thus, discrepancies in signal peptide prediction of orthologous proteins were used to identify misannotated proteins in five Plasmodium species. Signal peptide (SignalP) and orthology (OrthoMCL) were combined in an innovative strategy to identify orthologous groups showing discrepancies in signal peptide prediction among their protein members (Mixed groups). In a comparative analysis, multiple alignments for each of these groups and gene models were visually inspected in search of misannotated proteins and, whenever possible, alternative gene models were proposed. Thresholds for signal peptide prediction parameters were also modified to reduce their impact as a possible source of discrepancy among orthologs. Validation of new gene models was based on RT-PCR (few examples) or on experimental evidence already published (ApiLoc). The rate of misannotated proteins was significantly higher in Mixed groups than in Positive or Negative groups, corroborating the proposed hypothesis. A total of 478 proteins were reannotated and change of signal peptide prediction from negative to positive was the most common. Reannotations triggered the conversion of almost 50% of all Mixed groups, which were further reduced by optimization of signal peptide prediction parameters. The methodological novelty proposed here combining orthology and signal peptide prediction proved to be an effective strategy for the identification of proteins showing wrongly N-terminal annotated sequences, and it might have an important impact in the available data for genome-wide searching of potential vaccine and drug targets and proteins involved in host/parasite interactions

  1. Improvement of the Database on the 1.13-microns Band of Water Vapor

    NASA Technical Reports Server (NTRS)

    Giver, Lawrence P.; Schwenke, David W.; Chackerian, Charles, Jr.; Varanasi, Prasad; Freedman, Richard S.; Gore, Warren J. (Technical Monitor)

    2000-01-01

    Corrections have recently been reported (Giver et al.) on the short-wave (visible and near-infrared) line intensities of water vapor that were catalogued in the spectroscopic database known as HITRAN. These updates have been posted on www.hitran.com, and are being used to reanalyze the polar stratospheric absorption in the 0.94 microns band as observed in POAM. We are currently investigating additional improvement in the 1.13 microns band using data obtained by us with an absorption path length of 1.107 km and 4 torr of water vapor and the ab initio line list of Partridge and Schwenke (needs ref). We are proposing the following four types of improvement of the HITRAN database in this region: 1) HITRAN has nearly 200 lines in this region without proper assignments of rotational quantum levels. Nearly all of them can now be assigned. 2) We have measured positions of the observable H2O-17 and H2O-18 lines. These lines in HITRAN currently have approximate positions based upon rather aged computations. 3) Some additional lines are observed and assigned which should be included in the database. 4) Corrections are necessary for the lower state energies E" for the HITRAN lines of the 121-010 "hot" band.

  2. Improved classification of mass spectrometry database search results using newer machine learning approaches.

    PubMed

    Ulintz, Peter J; Zhu, Ji; Qin, Zhaohui S; Andrews, Philip C

    2006-03-01

    Manual analysis of mass spectrometry data is a current bottleneck in high throughput proteomics. In particular, the need to manually validate the results of mass spectrometry database searching algorithms can be prohibitively time-consuming. Development of software tools that attempt to quantify the confidence in the assignment of a protein or peptide identity to a mass spectrum is an area of active interest. We sought to extend work in this area by investigating the potential of recent machine learning algorithms to improve the accuracy of these approaches and as a flexible framework for accommodating new data features. Specifically we demonstrated the ability of boosting and random forest approaches to improve the discrimination of true hits from false positive identifications in the results of mass spectrometry database search engines compared with thresholding and other machine learning approaches. We accommodated additional attributes obtainable from database search results, including a factor addressing proton mobility. Performance was evaluated using publically available electrospray data and a new collection of MALDI data generated from purified human reference proteins.

  3. Improvement of the Database on the 1.13-microns Band of Water Vapor

    NASA Technical Reports Server (NTRS)

    Giver, Lawrence P.; Schwenke, David W.; Chackerian, Charles, Jr.; Varanasi, Prasad; Freedman, Richard S.; Gore, Warren J. (Technical Monitor)

    2000-01-01

    Corrections have recently been reported (Giver et al.) on the short-wave (visible and near-infrared) line intensities of water vapor that were catalogued in the spectroscopic database known as HITRAN. These updates have been posted on www.hitran.com, and are being used to reanalyze the polar stratospheric absorption in the 0.94 microns band as observed in POAM. We are currently investigating additional improvement in the 1.13 microns band using data obtained by us with an absorption path length of 1.107 km and 4 torr of water vapor and the ab initio line list of Partridge and Schwenke (needs ref). We are proposing the following four types of improvement of the HITRAN database in this region: 1) HITRAN has nearly 200 lines in this region without proper assignments of rotational quantum levels. Nearly all of them can now be assigned. 2) We have measured positions of the observable H2O-17 and H2O-18 lines. These lines in HITRAN currently have approximate positions based upon rather aged computations. 3) Some additional lines are observed and assigned which should be included in the database. 4) Corrections are necessary for the lower state energies E" for the HITRAN lines of the 121-010 "hot" band.

  4. Recommendations for Improving the National Education Fiscal Database. Education Data Improvement Project.

    ERIC Educational Resources Information Center

    Clements, Barbara S.; And Others

    A summary of recommendations for improving the fiscal data portion of the Center for Education Statistics' Common Core of Data includes a discussion of the issues surrounding each recommendation. The following actions are recommended: (1) revenues and expenditures should include a breakdown of function by funds, and further detail should be…

  5. The SILVA ribosomal RNA gene database project: improved data processing and web-based tools.

    PubMed

    Quast, Christian; Pruesse, Elmar; Yilmaz, Pelin; Gerken, Jan; Schweer, Timmy; Yarza, Pablo; Peplies, Jörg; Glöckner, Frank Oliver

    2013-01-01

    SILVA (from Latin silva, forest, http://www.arb-silva.de) is a comprehensive web resource for up to date, quality-controlled databases of aligned ribosomal RNA (rRNA) gene sequences from the Bacteria, Archaea and Eukaryota domains and supplementary online services. The referred database release 111 (July 2012) contains 3 194 778 small subunit and 288 717 large subunit rRNA gene sequences. Since the initial description of the project, substantial new features have been introduced, including advanced quality control procedures, an improved rRNA gene aligner, online tools for probe and primer evaluation and optimized browsing, searching and downloading on the website. Furthermore, the extensively curated SILVA taxonomy and the new non-redundant SILVA datasets provide an ideal reference for high-throughput classification of data from next-generation sequencing approaches.

  6. Large-scale inference of gene function through phylogenetic annotation of Gene Ontology terms: case study of the apoptosis and autophagy cellular processes

    PubMed Central

    Feuermann, Marc; Gaudet, Pascale; Mi, Huaiyu; Lewis, Suzanna E.; Thomas, Paul D.

    2016-01-01

    We previously reported a paradigm for large-scale phylogenomic analysis of gene families that takes advantage of the large corpus of experimentally supported Gene Ontology (GO) annotations. This ‘GO Phylogenetic Annotation’ approach integrates GO annotations from evolutionarily related genes across ∼100 different organisms in the context of a gene family tree, in which curators build an explicit model of the evolution of gene functions. GO Phylogenetic Annotation models the gain and loss of functions in a gene family tree, which is used to infer the functions of uncharacterized (or incompletely characterized) gene products, even for human proteins that are relatively well studied. Here, we report our results from applying this paradigm to two well-characterized cellular processes, apoptosis and autophagy. This revealed several important observations with respect to GO annotations and how they can be used for function inference. Notably, we applied only a small fraction of the experimentally supported GO annotations to infer function in other family members. The majority of other annotations describe indirect effects, phenotypes or results from high throughput experiments. In addition, we show here how feedback from phylogenetic annotation leads to significant improvements in the PANTHER trees, the GO annotations and GO itself. Thus GO phylogenetic annotation both increases the quantity and improves the accuracy of the GO annotations provided to the research community. We expect these phylogenetically based annotations to be of broad use in gene enrichment analysis as well as other applications of GO annotations. Database URL: http://amigo.geneontology.org/amigo PMID:28025345

  7. Predicting Novel Human Gene Ontology Annotations Using Semantic Analysis

    PubMed Central

    Done, Bogdan; Khatri, Purvesh; Done, Arina; Draghici, Sorin

    2013-01-01

    The correct interpretation of many molecular biology experiments depends in an essential way on the accuracy and consistency of the existing annotation databases. Such databases are meant to act as repositories for our biological knowledge as we acquire and refine it. Hence, by definition, they are incomplete at any given time. In this paper, we describe a technique that improves our previous method for predicting novel GO annotations by extracting implicit semantic relationships between genes and functions. In this work, we use a vector space model and a number of weighting schemes in addition to our previous latent semantic indexing approach. The technique described here is able to take into consideration the hierarchical structure of the Gene Ontology (GO) and can weight differently GO terms situated at different depths. The prediction abilities of 15 different weighting schemes are compared and evaluated. Nine such schemes were previously used in other problem domains, while six of them are introduced in this paper. The best weighting scheme was a novel scheme, n2tn. Out of the top 50 functional annotations predicted using this weighting scheme, we found support in the literature for 84 percent of them, while 6 percent of the predictions were contradicted by the existing literature. For the remaining 10 percent, we did not find any relevant publications to confirm or contradict the predictions. The n2tn weighting scheme also outperformed the simple binary scheme used in our previous approach. PMID:20150671

  8. Harnessing Collaborative Annotations on Online Formative Assessments

    ERIC Educational Resources Information Center

    Lin, Jian-Wei; Lai, Yuan-Cheng

    2013-01-01

    This paper harnesses collaborative annotations by students as learning feedback on online formative assessments to improve the learning achievements of students. Through the developed Web platform, students can conduct formative assessments, collaboratively annotate, and review historical records in a convenient way, while teachers can generate…

  9. Harnessing Collaborative Annotations on Online Formative Assessments

    ERIC Educational Resources Information Center

    Lin, Jian-Wei; Lai, Yuan-Cheng

    2013-01-01

    This paper harnesses collaborative annotations by students as learning feedback on online formative assessments to improve the learning achievements of students. Through the developed Web platform, students can conduct formative assessments, collaboratively annotate, and review historical records in a convenient way, while teachers can generate…

  10. Evaluation of training with an annotation schema for manual annotation of clinical conditions from emergency department reports.

    PubMed

    Chapman, Wendy W; Dowling, John N; Hripcsak, George

    2008-02-01

    Determine whether agreement among annotators improves after being trained to use an annotation schema that specifies: what types of clinical conditions to annotate, the linguistic form of the annotations, and which modifiers to include. Three physicians and 3 lay people individually annotated all clinical conditions in 23 emergency department reports. For annotations made using a Baseline Schema and annotations made after training on a detailed annotation schema, we compared: (1) variability of annotation length and number and (2) annotator agreement, using the F-measure. Physicians showed higher agreement and lower variability after training on the detailed annotation schema than when applying the Baseline Schema. Lay people agreed with physicians almost as well as other physicians did but showed a slower learning curve. Training annotators on the annotation schema we developed increased agreement among annotators and should be useful in generating reference standard sets for natural language processing studies. The methodology we used to evaluate the schema could be applied to other types of annotation or classification tasks in biomedical informatics.

  11. CyanoBase: the cyanobacteria genome database update 2010.

    PubMed

    Nakao, Mitsuteru; Okamoto, Shinobu; Kohara, Mitsuyo; Fujishiro, Tsunakazu; Fujisawa, Takatomo; Sato, Shusei; Tabata, Satoshi; Kaneko, Takakazu; Nakamura, Yasukazu

    2010-01-01

    CyanoBase (http://genome.kazusa.or.jp/cyanobase) is the genome database for cyanobacteria, which are model organisms for photosynthesis. The database houses cyanobacteria species information, complete genome sequences, genome-scale experiment data, gene information, gene annotations and mutant information. In this version, we updated these datasets and improved the navigation and the visual display of the data views. In addition, a web service API now enables users to retrieve the data in various formats with other tools, seamlessly.

  12. CGKB: an annotation knowledge base for cowpea (Vigna unguiculata L.) methylation filtered genomic genespace sequences

    PubMed Central

    Chen, Xianfeng; Laudeman, Thomas W; Rushton, Paul J; Spraggins, Thomas A; Timko, Michael P

    2007-01-01

    Background Cowpea [Vigna unguiculata (L.) Walp.] is one of the most important food and forage legumes in the semi-arid tropics because of its ability to tolerate drought and grow on poor soils. It is cultivated mostly by poor farmers in developing countries, with 80% of production taking place in the dry savannah of tropical West and Central Africa. Cowpea is largely an underexploited crop with relatively little genomic information available for use in applied plant breeding. The goal of the Cowpea Genomics Initiative (CGI), funded by the Kirkhouse Trust, a UK-based charitable organization, is to leverage modern molecular genetic tools for gene discovery and cowpea improvement. One aspect of the initiative is the sequencing of the gene-rich region of the cowpea genome (termed the genespace) recovered using methylation filtration technology and providing annotation and analysis of the sequence data. Description CGKB, Cowpea Genespace/Genomics Knowledge Base, is an annotation knowledge base developed under the CGI. The database is based on information derived from 298,848 cowpea genespace sequences (GSS) isolated by methylation filtering of genomic DNA. The CGKB consists of three knowledge bases: GSS annotation and comparative genomics knowledge base, GSS enzyme and metabolic pathway knowledge base, and GSS simple sequence repeats (SSRs) knowledge base for molecular marker discovery. A homology-based approach was applied for annotations of the GSS, mainly using BLASTX against four public FASTA formatted protein databases (NCBI GenBank Proteins, UniProtKB-Swiss-Prot, UniprotKB-PIR (Protein Information Resource), and UniProtKB-TrEMBL). Comparative genome analysis was done by BLASTX searches of the cowpea GSS against four plant proteomes from Arabidopsis thaliana, Oryza sativa, Medicago truncatula, and Populus trichocarpa. The possible exons and introns on each cowpea GSS were predicted using the HMM-based Genscan gene predication program and the potential domains on

  13. Concept annotation in the CRAFT corpus

    PubMed Central

    2012-01-01

    Background Manually annotated corpora are critical for the training and evaluation of automated methods to identify concepts in biomedical text. Results This paper presents the concept annotations of the Colorado Richly Annotated Full-Text (CRAFT) Corpus, a collection of 97 full-length, open-access biomedical journal articles that have been annotated both semantically and syntactically to serve as a research resource for the biomedical natural-language-processing (NLP) community. CRAFT identifies all mentions of nearly all concepts from nine prominent biomedical ontologies and terminologies: the Cell Type Ontology, the Chemical Entities of Biological Interest ontology, the NCBI Taxonomy, the Protein Ontology, the Sequence Ontology, the entries of the Entrez Gene database, and the three subontologies of the Gene Ontology. The first public release includes the annotations for 67 of the 97 articles, reserving two sets of 15 articles for future text-mining competitions (after which these too will be released). Concept annotations were created based on a single set of guidelines, which has enabled us to achieve consistently high interannotator agreement. Conclusions As the initial 67-article release contains more than 560,000 tokens (and the full set more than 790,000 tokens), our corpus is among the largest gold-standard annotated biomedical corpora. Unlike most others, the journal articles that comprise the corpus are drawn from diverse biomedical disciplines and are marked up in their entirety. Additionally, with a concept-annotation count of nearly 100,000 in the 67-article subset (and more than 140,000 in the full collection), the scale of conceptual markup is also among the largest of comparable corpora. The concept annotations of the CRAFT Corpus have the potential to significantly advance biomedical text mining by providing a high-quality gold standard for NLP systems. The corpus, annotation guidelines, and other associated resources are freely available at http

  14. An alternative database approach for management of SNOMED CT and improved patient data queries.

    PubMed

    Campbell, W Scott; Pedersen, Jay; McClay, James C; Rao, Praveen; Bastola, Dhundy; Campbell, James R

    2015-10-01

    statistics generated using the graph database were identical to those using validated methods. Patient queries produced identical patient count results to the Oracle RDBMS with comparable times. Database queries involving defining attributes of SNOMED CT concepts were possible with the graph DB. The same queries could not be directly performed with the Oracle RDBMS representation of the patient data and required the creation and use of external terminology services. Further, queries of undefined depth were successful in identifying unknown relationships between patient cohorts. The results of this study supported the hypothesis that a patient database built upon and around the semantic model of SNOMED CT was possible. The model supported queries that leveraged all aspects of the SNOMED CT logical model to produce clinically relevant query results. Logical disjunction and negation queries were possible using the data model, as well as, queries that extended beyond the structural IS_A hierarchy of SNOMED CT to include queries that employed defining attribute-values of SNOMED CT concepts as search parameters. As medical terminologies, such as SNOMED CT, continue to expand, they will become more complex and model consistency will be more difficult to assure. Simultaneously, consumers of data will increasingly demand improvements to query functionality to accommodate additional granularity of clinical concepts without sacrificing speed. This new line of research provides an alternative approach to instantiating and querying patient data represented using advanced computable clinical terminologies. Copyright © 2015 Elsevier Inc. All rights reserved.

  15. Annotating the human genome with Disease Ontology

    PubMed Central

    Osborne, John D; Flatow, Jared; Holko, Michelle; Lin, Simon M; Kibbe, Warren A; Zhu, Lihua (Julie); Danila, Maria I; Feng, Gang; Chisholm, Rex L

    2009-01-01

    Background The human genome has been extensively annotated with Gene Ontology for biological functions, but minimally computationally annotated for diseases. Results We used the Unified Medical Language System (UMLS) MetaMap Transfer tool (MMTx) to discover gene-disease relationships from the GeneRIF database. We utilized a comprehensive subset of UMLS, which is disease-focused and structured as a directed acyclic graph (the Disease Ontology), to filter and interpret results from MMTx. The results were validated against the Homayouni gene collection using recall and precision measurements. We compared our results with the widely used Online Mendelian Inheritance in Man (OMIM) annotations. Conclusion The validation data set suggests a 91% recall rate and 97% precision rate of disease annotation using GeneRIF, in contrast with a 22% recall and 98% precision using OMIM. Our thesaurus-based approach allows for comparisons to be made between disease containing databases and allows for increased accuracy in disease identification through synonym matching. The much higher recall rate of our approach demonstrates that annotating human genome with Disease Ontology and GeneRIF for diseases dramatically increases the coverage of the disease annotation of human genome. PMID:19594883

  16. Improving burn care and preventing burns by establishing a burn database in Ukraine.

    PubMed

    Fuzaylov, Gennadiy; Murthy, Sushila; Dunaev, Alexander; Savchyn, Vasyl; Knittel, Justin; Zabolotina, Olga; Dylewski, Maggie L; Driscoll, Daniel N

    2014-08-01

    Burns are a challenge for trauma care and a contribution to the surgical burden. The former Soviet republic of Ukraine has a foundation for burn care; however data concerning burns in Ukraine has historically been scant. The objective of this paper was to compare a new burn database to identify problems and implement improvements in burn care and prevention in this country. Retrospective analyses of demographic and clinical data of burn patients including Tukey's post hoc test, analysis of variance, and chi square analyses, and Fisher's exact test were used. Data were compared to the American Burn Association (ABA) burn repository. This study included 1752 thermally injured patients treated in 20 hospitals including Specialized Burn Unit in Municipal Hospital #8 Lviv, Lviv province in Ukraine. Scald burns were the primary etiology of burns injuries (70%) and burns were more common among children less than five years of age (34%). Length of stay, mechanical ventilation use, infection rates, and morbidity increased with greater burn size. Mortality was significantly related to burn size, inhalation injury, age, and length of stay. Wound infections were associated with burn size and older age. Compared to ABA data, Ukrainian patients had double the length of stay and a higher rate of wound infections (16% vs. 2.4%). We created one of the first burn databases from a region of the former Soviet Union in an effort to bring attention to burn injury and improve burn care. Copyright © 2013 Elsevier Ltd and ISBI. All rights reserved.

  17. An Improved Algorithm to Generate a Wi-Fi Fingerprint Database for Indoor Positioning

    PubMed Central

    Chen, Lina; Li, Binghao; Zhao, Kai; Rizos, Chris; Zheng, Zhengqi

    2013-01-01

    The major problem of Wi-Fi fingerprint-based positioning technology is the signal strength fingerprint database creation and maintenance. The significant temporal variation of received signal strength (RSS) is the main factor responsible for the positioning error. A probabilistic approach can be used, but the RSS distribution is required. The Gaussian distribution or an empirically-derived distribution (histogram) is typically used. However, these distributions are either not always correct or require a large amount of data for each reference point. Double peaks of the RSS distribution have been observed in experiments at some reference points. In this paper a new algorithm based on an improved double-peak Gaussian distribution is proposed. Kurtosis testing is used to decide if this new distribution, or the normal Gaussian distribution, should be applied. Test results show that the proposed algorithm can significantly improve the positioning accuracy, as well as reduce the workload of the off-line data training phase. PMID:23966197

  18. Improving Quality and Quantity of Contributions: Two Models for Promoting Knowledge Exchange with Shared Databases

    ERIC Educational Resources Information Center

    Cress, U.; Barquero, B.; Schwan, S.; Hesse, F. W.

    2007-01-01

    Shared databases are used for knowledge exchange in groups. Whether a person is willing to contribute knowledge to a shared database presents a social dilemma: Each group member saves time and energy by not contributing any information to the database and by using the database only to retrieve information which was contributed by others. But if…

  19. Improving Quality and Quantity of Contributions: Two Models for Promoting Knowledge Exchange with Shared Databases

    ERIC Educational Resources Information Center

    Cress, U.; Barquero, B.; Schwan, S.; Hesse, F. W.

    2007-01-01

    Shared databases are used for knowledge exchange in groups. Whether a person is willing to contribute knowledge to a shared database presents a social dilemma: Each group member saves time and energy by not contributing any information to the database and by using the database only to retrieve information which was contributed by others. But if…

  20. Integrative Tissue-Specific Functional Annotations in the Human Genome Provide Novel Insights on Many Complex Traits and Improve Signal Prioritization in Genome Wide Association Studies

    PubMed Central

    Wang, Qian; He, Beixin Julie; Zhao, Hongyu

    2016-01-01

    Extensive efforts have been made to understand genomic function through both experimental and computational approaches, yet proper annotation still remains challenging, especially in non-coding regions. In this manuscript, we introduce GenoSkyline, an unsupervised learning framework to predict tissue-specific functional regions through integrating high-throughput epigenetic annotations. GenoSkyline successfully identified a variety of non-coding regulatory machinery including enhancers, regulatory miRNA, and hypomethylated transposable elements in extensive case studies. Integrative analysis of GenoSkyline annotations and results from genome-wide association studies (GWAS) led to novel biological insights on the etiologies of a number of human complex traits. We also explored using tissue-specific functional annotations to prioritize GWAS signals and predict relevant tissue types for each risk locus. Brain and blood-specific annotations led to better prioritization performance for schizophrenia than standard GWAS p-values and non-tissue-specific annotations. As for coronary artery disease, heart-specific functional regions was highly enriched of GWAS signals, but previously identified risk loci were found to be most functional in other tissues, suggesting a substantial proportion of still undetected heart-related loci. In summary, GenoSkyline annotations can guide genetic studies at multiple resolutions and provide valuable insights in understanding complex diseases. GenoSkyline is available at http://genocanyon.med.yale.edu/GenoSkyline. PMID:27058395

  1. Integrative Tissue-Specific Functional Annotations in the Human Genome Provide Novel Insights on Many Complex Traits and Improve Signal Prioritization in Genome Wide Association Studies.

    PubMed

    Lu, Qiongshi; Powles, Ryan Lee; Wang, Qian; He, Beixin Julie; Zhao, Hongyu

    2016-04-01

    Extensive efforts have been made to understand genomic function through both experimental and computational approaches, yet proper annotation still remains challenging, especially in non-coding regions. In this manuscript, we introduce GenoSkyline, an unsupervised learning framework to predict tissue-specific functional regions through integrating high-throughput epigenetic annotations. GenoSkyline successfully identified a variety of non-coding regulatory machinery including enhancers, regulatory miRNA, and hypomethylated transposable elements in extensive case studies. Integrative analysis of GenoSkyline annotations and results from genome-wide association studies (GWAS) led to novel biological insights on the etiologies of a number of human complex traits. We also explored using tissue-specific functional annotations to prioritize GWAS signals and predict relevant tissue types for each risk locus. Brain and blood-specific annotations led to better prioritization performance for schizophrenia than standard GWAS p-values and non-tissue-specific annotations. As for coronary artery disease, heart-specific functional regions was highly enriched of GWAS signals, but previously identified risk loci were found to be most functional in other tissues, suggesting a substantial proportion of still undetected heart-related loci. In summary, GenoSkyline annotations can guide genetic studies at multiple resolutions and provide valuable insights in understanding complex diseases. GenoSkyline is available at http://genocanyon.med.yale.edu/GenoSkyline.

  2. Evaluating the effects of machine pre-annotation and an interactive annotation interface on manual de-identification of clinical text.

    PubMed

    South, Brett R; Mowery, Danielle; Suo, Ying; Leng, Jianwei; Ferrández, Óscar; Meystre, Stephane M; Chapman, Wendy W

    2014-08-01

    The Health Insurance Portability and Accountability Act (HIPAA) Safe Harbor method requires removal of 18 types of protected health information (PHI) from clinical documents to be considered "de-identified" prior to use for research purposes. Human review of PHI elements from a large corpus of clinical documents can be tedious and error-prone. Indeed, multiple annotators may be required to consistently redact information that represents each PHI class. Automated de-identification has the potential to improve annotation quality and reduce annotation time. For instance, using machine-assisted annotation by combining de-identification system outputs used as pre-annotations and an interactive annotation interface to provide annotators with PHI annotations for "curation" rather than manual annotation from "scratch" on raw clinical documents. In order to assess whether machine-assisted annotation improves the reliability and accuracy of the reference standard quality and reduces annotation effort, we conducted an annotation experiment. In this annotation study, we assessed the generalizability of the VA Consortium for Healthcare Informatics Research (CHIR) annotation schema and guidelines applied to a corpus of publicly available clinical documents called MTSamples. Specifically, our goals were to (1) characterize a heterogeneous corpus of clinical documents manually annotated for risk-ranked PHI and other annotation types (clinical eponyms and person relations), (2) evaluate how well annotators apply the CHIR schema to the heterogeneous corpus, (3) compare whether machine-assisted annotation (experiment) improves annotation quality and reduces annotation time compared to manual annotation (control), and (4) assess the change in quality of reference standard coverage with each added annotator's annotations.

  3. Database Constraints Applied to Metabolic Pathway Reconstruction Tools

    PubMed Central

    Vilaplana, Jordi; Solsona, Francesc; Teixido, Ivan; Usié, Anabel; Karathia, Hiren; Alves, Rui; Mateo, Jordi

    2014-01-01

    Our group developed two biological applications, Biblio-MetReS and Homol-MetReS, accessing the same database of organisms with annotated genes. Biblio-MetReS is a data-mining application that facilitates the reconstruction of molecular networks based on automated text-mining analysis of published scientific literature. Homol-MetReS allows functional (re)annotation of proteomes, to properly identify both the individual proteins involved in the process(es) of interest and their function. It also enables the sets of proteins involved in the process(es) in different organisms to be compared directly. The efficiency of these biological applications is directly related to the design of the shared database. We classified and analyzed the different kinds of access to the database. Based on this study, we tried to adjust and tune the configurable parameters of the database server to reach the best performance of the communication data link to/from the database system. Different database technologies were analyzed. We started the study with a public relational SQL database, MySQL. Then, the same database was implemented by a MapReduce-based database named HBase. The results indicated that the standard configuration of MySQL gives an acceptable performance for low or medium size databases. Nevertheless, tuning database parameters can greatly improve the performance and lead to very competitive runtimes. PMID:25202745

  4. Improved annotation of 3' untranslated regions and complex loci by combination of strand-specific direct RNA sequencing, RNA-Seq and ESTs.

    PubMed

    Schurch, Nicholas J; Cole, Christian; Sherstnev, Alexander; Song, Junfang; Duc, Céline; Storey, Kate G; McLean, W H Irwin; Brown, Sara J; Simpson, Gordon G; Barton, Geoffrey J

    2014-01-01

    The reference annotations made for a genome sequence provide the framework for all subsequent analyses of the genome. Correct and complete annotation in addition to the underlying genomic sequence is particularly important when interpreting the results of RNA-seq experiments where short sequence reads are mapped against the genome and assigned to genes according to the annotation. Inconsistencies in annotations between the reference and the experimental system can lead to incorrect interpretation of the effect on RNA expression of an experimental treatment or mutation in the system under study. Until recently, the genome-wide annotation of 3' untranslated regions received less attention than coding regions and the delineation of intron/exon boundaries. In this paper, data produced for samples in Human, Chicken and A. thaliana by the novel single-molecule, strand-specific, Direct RNA Sequencing technology from Helicos Biosciences which locates 3' polyadenylation sites to within +/- 2 nt, were combined with archival EST and RNA-Seq data. Nine examples are illustrated where this combination of data allowed: (1) gene and 3' UTR re-annotation (including extension of one 3' UTR by 5.9 kb); (2) disentangling of gene expression in complex regions; (3) clearer interpretation of small RNA expression and (4) identification of novel genes. While the specific examples displayed here may become obsolete as genome sequences and their annotations are refined, the principles laid out in this paper will be of general use both to those annotating genomes and those seeking to interpret existing publically available annotations in the context of their own experimental data.

  5. WormBase: Annotating many nematode genomes.

    PubMed

    Howe, Kevin; Davis, Paul; Paulini, Michael; Tuli, Mary Ann; Williams, Gary; Yook, Karen; Durbin, Richard; Kersey, Paul; Sternberg, Paul W

    2012-01-01

    WormBase (www.wormbase.org) has been serving the scientific community for over 11 years as the central repository for genomic and genetic information for the soil nematode Caenorhabditis elegans. The resource has evolved from its beginnings as a database housing the genomic sequence and genetic and physical maps of a single species, and now represents the breadth and diversity of nematode research, currently serving genome sequence and annotation for around 20 nematodes. In this article, we focus on WormBase's role of genome sequence annotation, describing how we annotate and integrate data from a growing collection of nematode species and strains. We also review our approaches to sequence curation, and discuss the impact on annotation quality of large functional genomics projects such as modENCODE.

  6. Missing genes in the annotation of prokaryotic genomes.

    PubMed

    Warren, Andrew S; Archuleta, Jeremy; Feng, Wu-Chun; Setubal, João Carlos

    2010-03-15

    Protein-coding gene detection in prokaryotic genomes is considered a much simpler problem than in intron-containing eukaryotic genomes. However there have been reports that prokaryotic gene finder programs have problems with small genes (either over-predicting or under-predicting). Therefore the question arises as to whether current genome annotations have systematically missing, small genes. We have developed a high-performance computing methodology to investigate this problem. In this methodology we compare all ORFs larger than or equal to 33 aa from all fully-sequenced prokaryotic replicons. Based on that comparison, and using conservative criteria requiring a minimum taxonomic diversity between conserved ORFs in different genomes, we have discovered 1,153 candidate genes that are missing from current genome annotations. These missing genes are similar only to each other and do not have any strong similarity to gene sequences in public databases, with the implication that these ORFs belong to missing gene families. We also uncovered 38,895 intergenic ORFs, readily identified as putative genes by similarity to currently annotated genes (we call these absent annotations). The vast majority of the missing genes found are small (less than 100 aa). A comparison of select examples with GeneMark, EasyGene and Glimmer predictions yields evidence that some of these genes are escaping detection by these programs. Prokaryotic gene finders and prokaryotic genome annotations require improvement for accurate prediction of small genes. The number of missing gene families found is likely a lower bound on the actual number, due to the conservative criteria used to determine whether an ORF corresponds to a real gene.

  7. Improvement of the quality of medical databases: data-mining-based prediction of diagnostic codes from previous patient codes.

    PubMed

    Djennaoui, Mehdi; Ficheur, Grégoire; Beuscart, Régis; Chazard, Emmanuel

    2015-01-01

    Diagnoses and medical procedures collected under the French system of information are recorded in a nationwide database, the "PMSI national database", which is accessible for exploitation. Quality of the data in this database is directly related to the quality of coding, which can be of poor quality. Among the proposed methods for the exploitation of health databases, data mining techniques are particularly interesting. Our objective is to build sequential rules for missing diagnoses prediction by data mining of the PMSI national database. Our working sample was constructed from the national database for years 2007 to 2010. The information retained for rules construction were medical diagnoses and medical procedures. The rules were selected using a statistical filter, and selected rules were validated by case review based on medical letters, which enabled to estimate the improvement of diagnoses recoding. The work sample was made of 59,170 inpatient stays. The predicted ICD codes were E11 (non-insulin-dependent diabetes mellitus), I48 (atrial fibrillation and flutter) and I50 (heart failure).We validated three sequential rules with a substantial improvement of positive predictive value: {E11,I10,DZQM006}=>{E11} {E11,I10,I48}=>{E11} {I48,I69}=>{I48} We were able to extract by data mining three simple, reliable and effective sequential rules, with a substantial improvement in diagnoses recoding. The results of our study indicate the opportunity to improve the data quality of the national database by data mining methods.

  8. Annotation of Fusarium graminearum (PH-1) Version 5.0

    PubMed Central

    Hammond-Kosack, Kim E.

    2017-01-01

    ABSTRACT Fusarium graminearum floral infections are a major risk to the global supply of safe cereal grains. We report updates to the PH-1 reference genome and significant improvements to the annotation. Changes include introduction of legacy annotation identifiers, new gene models, secretome and effectorP predictions, and inclusion of extensive untranslated region (UTR) annotations. PMID:28082505

  9. Targeted journal curation as a method to improve data currency at the Comparative Toxicogenomics Database.

    PubMed

    Davis, Allan Peter; Johnson, Robin J; Lennon-Hopkins, Kelley; Sciaky, Daniela; Rosenstein, Michael C; Wiegers, Thomas C; Mattingly, Carolyn J

    2012-01-01

    The Comparative Toxicogenomics Database (CTD) is a public resource that promotes understanding about the effects of environmental chemicals on human health. CTD biocurators read the scientific literature and manually curate a triad of chemical-gene, chemical-disease and gene-disease interactions. Typically, articles for CTD are selected using a chemical-centric approach by querying PubMed to retrieve a corpus containing the chemical of interest. Although this technique ensures adequate coverage of knowledge about the chemical (i.e. data completeness), it does not necessarily reflect the most current state of all toxicological research in the community at large (i.e. data currency). Keeping databases current with the most recent scientific results, as well as providing a rich historical background from legacy articles, is a challenging process. To address this issue of data currency, CTD designed and tested a journal-centric approach of curation to complement our chemical-centric method. We first identified priority journals based on defined criteria. Next, over 7 weeks, three biocurators reviewed 2425 articles from three consecutive years (2009-2011) of three targeted journals. From this corpus, 1252 articles contained relevant data for CTD and 52 752 interactions were manually curated. Here, we describe our journal selection process, two methods of document delivery for the biocurators and the analysis of the resulting curation metrics, including data currency, and both intra-journal and inter-journal comparisons of research topics. Based on our results, we expect that curation by select journals can (i) be easily incorporated into the curation pipeline to complement our chemical-centric approach; (ii) build content more evenly for chemicals, genes and diseases in CTD (rather than biasing data by chemicals-of-interest); (iii) reflect developing areas in environmental health and (iv) improve overall data currency for chemicals, genes and diseases. Database URL

  10. Targeted journal curation as a method to improve data currency at the Comparative Toxicogenomics Database

    PubMed Central

    Davis, Allan Peter; Johnson, Robin J.; Lennon-Hopkins, Kelley; Sciaky, Daniela; Rosenstein, Michael C.; Wiegers, Thomas C.; Mattingly, Carolyn J.

    2012-01-01

    The Comparative Toxicogenomics Database (CTD) is a public resource that promotes understanding about the effects of environmental chemicals on human health. CTD biocurators read the scientific literature and manually curate a triad of chemical–gene, chemical–disease and gene–disease interactions. Typically, articles for CTD are selected using a chemical-centric approach by querying PubMed to retrieve a corpus containing the chemical of interest. Although this technique ensures adequate coverage of knowledge about the chemical (i.e. data completeness), it does not necessarily reflect the most current state of all toxicological research in the community at large (i.e. data currency). Keeping databases current with the most recent scientific results, as well as providing a rich historical background from legacy articles, is a challenging process. To address this issue of data currency, CTD designed and tested a journal-centric approach of curation to complement our chemical-centric method. We first identified priority journals based on defined criteria. Next, over 7 weeks, three biocurators reviewed 2425 articles from three consecutive years (2009–2011) of three targeted journals. From this corpus, 1252 articles contained relevant data for CTD and 52 752 interactions were manually curated. Here, we describe our journal selection process, two methods of document delivery for the biocurators and the analysis of the resulting curation metrics, including data currency, and both intra-journal and inter-journal comparisons of research topics. Based on our results, we expect that curation by select journals can (i) be easily incorporated into the curation pipeline to complement our chemical-centric approach; (ii) build content more evenly for chemicals, genes and diseases in CTD (rather than biasing data by chemicals-of-interest); (iii) reflect developing areas in environmental health and (iv) improve overall data currency for chemicals, genes and diseases. Database

  11. Automated Data Mining of A Proprietary Database System for Physician Quality Improvement

    SciTech Connect

    Johnstone, Peter Crenshaw, Tim; Cassels, Diane G.; Fox, Timothy H.

    2008-04-01

    Purpose: Physician practice quality improvement is a subject of intense national debate. This report describes using a software data acquisition program to mine an existing, commonly used proprietary radiation oncology database to assess physician performance. Methods and Materials: Between 2003 and 2004, a manual analysis was performed of electronic portal image (EPI) review records. Custom software was recently developed to mine the record-and-verify database and the review process of EPI at our institution. In late 2006, a report was developed that allowed for immediate review of physician completeness and speed of EPI review for any prescribed period. Results: The software extracted >46,000 EPIs between 2003 and 2007, providing EPI review status and time to review by each physician. Between 2003 and 2007, the department EPI review improved from 77% to 97% (range, 85.4-100%), with a decrease in the mean time to review from 4.2 days to 2.4 days. The initial intervention in 2003 to 2004 was moderately successful in changing the EPI review patterns; it was not repeated because of the time required to perform it. However, the implementation in 2006 of the automated review tool yielded a profound change in practice. Using the software, the automated chart review required {approx}1.5 h for mining and extracting the data for the 4-year period. Conclusion: This study quantified the EPI review process as it evolved during a 4-year period at our institution and found that automation of data retrieval and review simplified and facilitated physician quality improvement.

  12. Improving machine operation management efficiency via improving the vehicle park structure and using the production operation information database

    NASA Astrophysics Data System (ADS)

    Koptev, V. Yu

    2017-02-01

    The work represents the results of studying basic interconnected criteria of separate equipment units of the transport network machines fleet, depending on production and mining factors to improve the transport systems management. Justifying the selection of a control system necessitates employing new methodologies and models, augmented with stability and transport flow criteria, accounting for mining work development dynamics on mining sites. A necessary condition is the accounting of technical and operating parameters related to vehicle operation. Modern open pit mining dispatching systems must include such kinds of the information database. An algorithm forming a machine fleet is presented based on multi-variation task solution in connection with defining reasonable operating features of a machine working as a part of a complex. Proposals cited in the work may apply to mining machines (drilling equipment, excavators) and construction equipment (bulldozers, cranes, pile-drivers), city transport and other types of production activities using machine fleet.

  13. Orienteering: An Annotated Bibliography = Orientierungslauf: Eine kommentierte Bibliographie.

    ERIC Educational Resources Information Center

    Seiler, Roland, Ed.; Hartmann, Wolfgang, Ed.

    1994-01-01

    Annotated bibliography of 220 books, monographs, and journal articles on orienteering published 1984-94, from SPOLIT database of the Federal Institute of Sport Science (Cologne, Germany). Annotations in English or German. Ten sections including psychological, physiological, health, sociological, and environmental aspects; training and coaching;…

  14. The University of Minnesota Biocatalysis/Biodegradation Database: improving public access

    PubMed Central

    Gao, Junfeng; Ellis, Lynda B. M.; Wackett, Lawrence P.

    2010-01-01

    The University of Minnesota Biocatalysis/Biodegradation Database (UM-BBD, http://umbbd.msi.umn.edu/) began in 1995 and now contains information on almost 1200 compounds, over 800 enzymes, almost 1300 reactions and almost 500 microorganism entries. Besides these data, it includes a Biochemical Periodic Table (UM-BPT) and a rule-based Pathway Prediction System (UM-PPS) (http://umbbd.msi.umn.edu/predict/) that predicts plausible pathways for microbial degradation of organic compounds. Currently, the UM-PPS contains 260 biotransformation rules derived from reactions found in the UM-BBD and scientific literature. Public access to UM-BBD data is increasing. UM-BBD compound data are now contributed to PubChem and ChemSpider, the public chemical databases. A new mirror website of the UM-BBD, UM-BPT and UM-PPS is being developed at ETH Zürich to improve speed and reliability of online access from anywhere in the world. PMID:19767608

  15. An improved database of coastal flooding in the United Kingdom from 1915 to 2016.

    PubMed

    Haigh, Ivan D; Ozsoy, Ozgun; Wadey, Matthew P; Nicholls, Robert J; Gallop, Shari L; Wahl, Thomas; Brown, Jennifer M

    2017-08-01

    Coastal flooding caused by extreme sea levels can produce devastating and wide-ranging consequences. The 'SurgeWatch' v1.0 database systematically documents and assesses the consequences of historical coastal flood events around the UK. The original database was inevitably biased due to the inconsistent spatial and temporal coverage of sea-level observations utilised. Therefore, we present an improved version integrating a variety of 'soft' data such as journal papers, newspapers, weather reports, and social media. SurgeWatch2.0 identifies 329 coastal flooding events from 1915 to 2016, a more than fivefold increase compared to the 59 events in v1.0. Moreover, each flood event is now ranked using a multi-level categorisation based on inundation, transport disruption, costs, and fatalities: from 1 (Nuisance) to 6 (Disaster). For the 53 most severe events ranked Category 3 and above, an accompanying event description based upon the Source-Pathway-Receptor-Consequence framework was produced. Thus, SurgeWatch v2.0 provides the most comprehensive and coherent historical record of UK coastal flooding. It is designed to be a resource for research, planning, management and education.

  16. AnnotCompute: annotation-based exploration and meta-analysis of genomics experiments

    PubMed Central

    Zheng, Jie; Stoyanovich, Julia; Manduchi, Elisabetta; Liu, Junmin; Stoeckert, Christian J.

    2011-01-01

    The ever-increasing scale of biological data sets, particularly those arising in the context of high-throughput technologies, requires the development of rich data exploration tools. In this article, we present AnnotCompute, an information discovery platform for repositories of functional genomics experiments such as ArrayExpress. Our system leverages semantic annotations of functional genomics experiments with controlled vocabulary and ontology terms, such as those from the MGED Ontology, to compute conceptual dissimilarities between pairs of experiments. These dissimilarities are then used to support two types of exploratory analysis—clustering and query-by-example. We show that our proposed dissimilarity measures correspond to a user's intuition about conceptual dissimilarity, and can be used to support effective query-by-example. We also evaluate the quality of clustering based on these measures. While AnnotCompute can support a richer data exploration experience, its effectiveness is limited in some cases, due to the quality of available annotations. Nonetheless, tools such as AnnotCompute may provide an incentive for richer annotations of experiments. Code is available for download at http://www.cbil.upenn.edu/downloads/AnnotCompute. Database URL: http://www.cbil.upenn.edu/annotCompute/ PMID:22190598

  17. Web-based Video Annotation and its Applications

    NASA Astrophysics Data System (ADS)

    Yamamoto, Daisuke; Nagao, Katashi

    In this paper, we developed a Web-based video annotation system, named iVAS (intelligent Video Annotation Server). Audiences can associate any video content on the Internet with annotations. The system analyzes video content in order to acquire cut/shot information and color histograms. And it also automatically generates a Web page for editing annotations. Then, audiences can create annotation data by two methods. The first one helps the users to create text data such as person/object names, scene descriptions, and comments interactively. The second method facilitates the users associating any video fragments with their subjective impression by just clicking a mouse button. The generated annotation data are accumulated and managed by an XML database connected with iVAS. We also developed some application systems based on annotations such as video retrieval, video simplification, and video-content-based community support. One of the major advantages of our approach is easy integration of hand-coded and automatically-generated (such as color histograms and cut/shot information) annotations. Additionally, since our annotation system is open for public, we must consider some reliability or correctness of annotation data. We also developed an automatic evaluation method of annotation reliability using the users' feedback. In the future, these fundamental technologies will contribute to the formation of new communities centered around video content.

  18. The Basic Course: A Selected, Annotated Bibliography.

    ERIC Educational Resources Information Center

    Demo, Penny

    Defining basic speech communication courses as those public speaking, interpersonal, or communication courses that treat fundamental communication concepts, this annotated bibliography reflects the current thought of speech educators on the basic course. The bibliography consists of 27 citations, all of which are drawn from the ERIC database. (SKC)

  19. Computing human image annotation.

    PubMed

    Channin, David S; Mongkolwat, Pattanasak; Kleper, Vladimir; Rubin, Daniel L

    2009-01-01

    An image annotation is the explanatory or descriptive information about the pixel data of an image that is generated by a human (or machine) observer. An image markup is the graphical symbols placed over the image to depict an annotation. In the majority of current, clinical and research imaging practice, markup is captured in proprietary formats and annotations are referenced only in free text radiology reports. This makes these annotations difficult to query, retrieve and compute upon, hampering their integration into other data mining and analysis efforts. This paper describes the National Cancer Institute's Cancer Biomedical Informatics Grid's (caBIG) Annotation and Image Markup (AIM) project, focusing on how to use AIM to query for annotations. The AIM project delivers an information model for image annotation and markup. The model uses controlled terminologies for important concepts. All of the classes and attributes of the model have been harmonized with the other models and common data elements in use at the National Cancer Institute. The project also delivers XML schemata necessary to instantiate AIMs in XML as well as a software application for translating AIM XML into DICOM S/R and HL7 CDA. Large collections of AIM annotations can be built and then queried as Grid or Web services. Using the tools of the AIM project, image annotations and their markup can be captured and stored in human and machine readable formats. This enables the inclusion of human image observation and inference as part of larger data mining and analysis activities.

  20. Galileo Reader and Annotator

    NASA Astrophysics Data System (ADS)

    Besomi, O.

    2011-06-01

    In his readings, Galileo made frequent use of annotations. Here, I will offer a general glance at them by discussing the case of the annotations to the Libra astronomica published in 1619 by Orazio Grassi, a Jesuit mathematician of the Collegio Romano. The annotations directly reflect Galileo's reaction to Grassi's book in a heated debate between the two astronomers. Galileo and Grassi had opposite ideas about the nature of the comets, which resulted in different scientific and theological implications. The annotations represent the starting point for Galileo's reply to the Libra, namely Il Saggiatore, which was published four years later and dedicated to the new pope Urban VIII.

  1. Real time metagenomics: using k-mers to annotate metagenomes.

    PubMed

    Edwards, Robert A; Olson, Robert; Disz, Terry; Pusch, Gordon D; Vonstein, Veronika; Stevens, Rick; Overbeek, Ross

    2012-12-15

    Annotation of metagenomes involves comparing the individual sequence reads with a database of known sequences and assigning a unique function to each read. This is a time-consuming task that is computationally intensive (though not computationally complex). Here we present a novel approach to annotate metagenomes using unique k-mer oligopeptide sequences from 7 to 12 amino acids long. We demonstrate that k-mer-based annotations are faster and approach the sensitivity and precision of blastx-based annotations without loosing accuracy. A last-common ancestor approach was also developed to describe the members of the community.

  2. Semi-automatic semantic annotation of PubMed queries: a study on quality, efficiency, satisfaction.

    PubMed

    Névéol, Aurélie; Islamaj Doğan, Rezarta; Lu, Zhiyong

    2011-04-01

    Information processing algorithms require significant amounts of annotated data for training and testing. The availability of such data is often hindered by the complexity and high cost of production. In this paper, we investigate the benefits of a state-of-the-art tool to help with the semantic annotation of a large set of biomedical queries. Seven annotators were recruited to annotate a set of 10,000 PubMed® queries with 16 biomedical and bibliographic categories. About half of the queries were annotated from scratch, while the other half were automatically pre-annotated and manually corrected. The impact of the automatic pre-annotations was assessed on several aspects of the task: time, number of actions, annotator satisfaction, inter-annotator agreement, quality and number of the resulting annotations. The analysis of annotation results showed that the number of required hand annotations is 28.9% less when using pre-annotated results from automatic tools. As a result, the overall annotation time was substantially lower when pre-annotations were used, while inter-annotator agreement was significantly higher. In addition, there was no statistically significant difference in the semantic distribution or number of annotations produced when pre-annotations were used. The annotated query corpus is freely available to the research community. This study shows that automatic pre-annotations are found helpful by most annotators. Our experience suggests using an automatic tool to assist large-scale manual annotation projects. This helps speed-up the annotation time and improve annotation consistency while maintaining high quality of the final annotations.

  3. Semi-automatic semantic annotation of PubMed Queries: a study on quality, efficiency, satisfaction

    PubMed Central

    Névéol, Aurélie; Islamaj-Doğan, Rezarta; Lu, Zhiyong

    2010-01-01

    Information processing algorithms require significant amounts of annotated data for training and testing. The availability of such data is often hindered by the complexity and high cost of production. In this paper, we investigate the benefits of a state-of-the-art tool to help with the semantic annotation of a large set of biomedical information queries. Seven annotators were recruited to annotate a set of 10,000 PubMed® queries with 16 biomedical and bibliographic categories. About half of the queries were annotated from scratch, while the other half were automatically pre-annotated and manually corrected. The impact of the automatic pre-annotations was assessed on several aspects of the task: time, number of actions, annotator satisfaction, inter-annotator agreement, quality and number of the resulting annotations. The analysis of annotation results showed that the number of required hand annotations is 28.9% less when using pre-annotated results from automatic tools. As a result, the overall annotation time was substantially lower when pre-annotations were used, while inter-annotator agreement was significantly higher. In addition, there was no statistically significant difference in the semantic distribution or number of annotations produced when pre-annotations were used. The annotated query corpus is freely available to the research community. This study shows that automatic pre-annotations are found helpful by most annotators. Our experience suggests using an automatic tool to assist large-scale manual annotation projects. This helps speed-up the annotation time and improve annotation consistency while maintaining high quality of the final annotations. PMID:21094696

  4. T4SP Database 2.0: An Improved Database for Type IV Secretion Systems in Bacterial Genomes with New Online Analysis Tools.

    PubMed

    Han, Na; Yu, Weiwen; Qiang, Yujun; Zhang, Wen

    2016-01-01

    Type IV secretion system (T4SS) can mediate the passage of macromolecules across cellular membranes and is essential for virulent and genetic material exchange among bacterial species. The Type IV Secretion Project 2.0 (T4SP 2.0) database is an improved and extended version of the platform released in 2013 aimed at assisting with the detection of Type IV secretion systems (T4SS) in bacterial genomes. This advanced version provides users with web server tools for detecting the existence and variations of T4SS genes online. The new interface for the genome browser provides a user-friendly access to the most complete and accurate resource of T4SS gene information (e.g., gene number, name, type, position, sequence, related articles, and quick links to other webs). Currently, this online database includes T4SS information of 5239 bacterial strains. Conclusions. T4SS is one of the most versatile secretion systems necessary for the virulence and survival of bacteria and the secretion of protein and/or DNA substrates from a donor to a recipient cell. This database on virB/D genes of the T4SS system will help scientists worldwide to improve their knowledge on secretion systems and also identify potential pathogenic mechanisms of various microbial species.

  5. A Radiocarbon Database for Improving Understanding of Global Soil Carbon Dynamics: Part I

    NASA Astrophysics Data System (ADS)

    Torn, M. S.; Trumbore, S.; Smith, L. J.; Nave, L. E.; Sierra, C. A.; Harden, J. W.; Agarwal, D.; van Ingen, C.; Radiocarbon Database Workshop 2011

    2011-12-01

    Soils play a large role in the global carbon cycle, but soil carbon stocks and dynamics remain highly uncertain. Radiocarbon (14C) observations from soils and soil respiration provide one of the only ways to infer terrestrial carbon turnover times or to test ecosystem carbon models. Although a wealth of such observations exists, they are scattered in small data sets held by individual researchers, and have not been compiled in a form easy to use for multi-site analysis, global assessments, or model testing. Here we introduce a new, global radiocarbon database that will synthesize datasets from multiple contributors to facilitate research on three broad questions: (1) What are current patterns of soil carbon dynamics, and what factors influence these patterns? (2) What is the sequestration capacity of different soils? (3) What are likely impacts of global change on the soil resource? (4) How well do models represent important carbon cycle processes, and how can they be improved? In addition to assembling data in a common format for analyses, this database will offer query capabilities and the ability to combine data with gridded global products, such as temporally resolved temperature and precipitation, NPP and GPP, and a climate-based decomposition index. Some of the near-term synthesis goals include analyzing depth profiles of 14C for across gradients in ecosystem state factors (climate, organisms, relief, parent material, time, and human influence) and soil orders; mapping surface-soil 14C values on soil temperature and moisture; and comparing soil carbon turnover times to NPP and soil carbon stocks. We are currently incorporating data from 18 contributors and six continents, with 14C measurements from soils representing nine soil orders, plant and microbial tissues, and respiration fluxes. Our intention is to grow the database and make it available to a wide community of scientists. For example, observations for different disturbance, experimental treatment, or

  6. MyPro: A seamless pipeline for automated prokaryotic genome assembly and annotation

    PubMed Central

    Liao, Yu-Chieh; Lin, Hsin-Hung; Sabharwal, Amarpreet; Haase, Elaine M.; Scannapieco, Frank A.

    2016-01-01

    MyPro is a software pipeline for high-quality prokaryotic genome assembly and annotation. It was validated on 18 oral streptococcal strains to produce submission-ready, annotated draft genomes. MyPro installed as a virtual machine and supported by updated databases will enable biologists to perform quality prokaryotic genome assembly and annotation with ease. PMID:25911337

  7. MyPro: A seamless pipeline for automated prokaryotic genome assembly and annotation.

    PubMed

    Liao, Yu-Chieh; Lin, Hsin-Hung; Sabharwal, Amarpreet; Haase, Elaine M; Scannapieco, Frank A

    2015-06-01

    MyPro is a software pipeline for high-quality prokaryotic genome assembly and annotation. It was validated on 18 oral streptococcal strains to produce submission-ready, annotated draft genomes. MyPro installed as a virtual machine and supported by updated databases will enable biologists to perform quality prokaryotic genome assembly and annotation with ease.

  8. Evaluating techniques for metagenome annotation using simulated sequence data

    PubMed Central

    Randle-Boggis, Richard J.; Helgason, Thorunn; Sapp, Melanie; Ashton, Peter D.

    2016-01-01

    The advent of next-generation sequencing has allowed huge amounts of DNA sequence data to be produced, advancing the capabilities of microbial ecosystem studies. The current challenge is to identify from which microorganisms and genes the DNA originated. Several tools and databases are available for annotating DNA sequences. The tools, databases and parameters used can have a significant impact on the results: naïve choice of these factors can result in a false representation of community composition and function. We use a simulated metagenome to show how different parameters affect annotation accuracy by evaluating the sequence annotation performances of MEGAN, MG-RAST, One Codex and Megablast. This simulated metagenome allowed the recovery of known organism and function abundances to be quantitatively evaluated, which is not possible for environmental metagenomes. The performance of each program and database varied, e.g. One Codex correctly annotated many sequences at the genus level, whereas MG-RAST RefSeq produced many false positive annotations. This effect decreased as the taxonomic level investigated increased. Selecting more stringent parameters decreases the annotation sensitivity, but increases precision. Ultimately, there is a trade-off between taxonomic resolution and annotation accuracy. These results should be considered when annotating metagenomes and interpreting results from previous studies. PMID:27162180

  9. Transparency, usability, and reproducibility: Guiding principles for improving comparative databases using primates as examples.

    PubMed

    Borries, Carola; Sandel, Aaron A; Koenig, Andreas; Fernandez-Duque, Eduardo; Kamilar, Jason M; Amoroso, Caroline R; Barton, Robert A; Bray, Joel; Di Fiore, Anthony; Gilby, Ian C; Gordon, Adam D; Mundry, Roger; Port, Markus; Powell, Lauren E; Pusey, Anne E; Spriggs, Amanda; Nunn, Charles L

    2016-09-01

    Recent decades have seen rapid development of new analytical methods to investigate patterns of interspecific variation. Yet these cutting-edge statistical analyses often rely on data of questionable origin, varying accuracy, and weak comparability, which seem to have reduced the reproducibility of studies. It is time to improve the transparency of comparative data while also making these improved data more widely available. We, the authors, met to discuss how transparency, usability, and reproducibility of comparative data can best be achieved. We propose four guiding principles: 1) data identification with explicit operational definitions and complete descriptions of methods; 2) inclusion of metadata that capture key characteristics of the data, such as sample size, geographic coordinates, and nutrient availability (for example, captive versus wild animals); 3) documentation of the original reference for each datum; and 4) facilitation of effective interactions with the data via user friendly and transparent interfaces. We urge reviewers, editors, publishers, database developers and users, funding agencies, researchers publishing their primary data, and those performing comparative analyses to embrace these standards to increase the transparency, usability, and reproducibility of comparative studies. © 2016 Wiley Periodicals, Inc.

  10. Annotated Humanities Programs.

    ERIC Educational Resources Information Center

    Adler, Richard R.; Applebee, Arthur

    The humanities programs offered in 1968 by 227 United States secondary schools are listed alphabetically by state, including almost 100 new programs not annotated in the 1967 listing (see TE 000 224). Each annotation presents a brief description of the approach to study used in the particular humanities course (e.g., American Studies, Culture…

  11. SEED Software Annotations.

    ERIC Educational Resources Information Center

    Bethke, Dee; And Others

    This document provides a composite index of the first five sets of software annotations produced by Project SEED. The software has been indexed by title, subject area, and grade level, and it covers sets of annotations distributed in September 1986, April 1987, September 1987, November 1987, and February 1988. The date column in the index…

  12. Development and Evaluation of an Automated Annotation Pipeline and cDNA Annotation System

    PubMed Central

    Kasukawa, Takeya; Furuno, Masaaki; Nikaido, Itoshi; Bono, Hidemasa; Hume, David A.; Bult, Carol; Hill, David P.; Baldarelli, Richard; Gough, Julian; Kanapin, Alexander; Matsuda, Hideo; Schriml, Lynn M.; Hayashizaki, Yoshihide; Okazaki, Yasushi; Quackenbush, John

    2003-01-01

    Manual curation has long been held to be the “gold standard” for functional annotation of DNA sequence. Our experience with the annotation of more than 20,000 full-length cDNA sequences revealed problems with this approach, including inaccurate and inconsistent assignment of gene names, as well as many good assignments that were difficult to reproduce using only computational methods. For the FANTOM2 annotation of more than 60,000 cDNA clones, we developed a number of methods and tools to circumvent some of these problems, including an automated annotation pipeline that provides high-quality preliminary annotation for each sequence by introducing an “uninformative filter” that eliminates uninformative annotations, controlled vocabularies to accurately reflect both the functional assignments and the evidence supporting them, and a highly refined, Web-based manual annotation tool that allows users to view a wide array of sequence analyses and to assign gene names and putative functions using a consistent nomenclature. The ultimate utility of our approach is reflected in the low rate of reassignment of automated assignments by manual curation. Based on these results, we propose a new standard for large-scale annotation, in which the initial automated annotations are manually investigated and then computational methods are iteratively modified and improved based on the results of manual curation. PMID:12819153

  13. Scientific and Technical Document Database

    National Institute of Standards and Technology Data Gateway

    NIST Scientific and Technical Document Database (PC database for purchase)   The images in NIST Special Database 20 contain a very rich set of graphic elements from scientific and technical documents, such as graphs, tables, equations, two column text, maps, pictures, footnotes, annotations, and arrays of such elements.

  14. Omics data management and annotation.

    PubMed

    Harel, Arye; Dalah, Irina; Pietrokovski, Shmuel; Safran, Marilyn; Lancet, Doron

    2011-01-01

    Technological Omics breakthroughs, including next generation sequencing, bring avalanches of data which need to undergo effective data management to ensure integrity, security, and maximal knowledge-gleaning. Data management system requirements include flexible input formats, diverse data entry mechanisms and views, user friendliness, attention to standards, hardware and software platform definition, as well as robustness. Relevant solutions elaborated by the scientific community include Laboratory Information Management Systems (LIMS) and standardization protocols facilitating data sharing and managing. In project planning, special consideration has to be made when choosing relevant Omics annotation sources, since many of them overlap and require sophisticated integration heuristics. The data modeling step defines and categorizes the data into objects (e.g., genes, articles, disorders) and creates an application flow. A data storage/warehouse mechanism must be selected, such as file-based systems and relational databases, the latter typically used for larger projects. Omics project life cycle considerations must include the definition and deployment of new versions, incorporating either full or partial updates. Finally, quality assurance (QA) procedures must validate data and feature integrity, as well as system performance expectations. We illustrate these data management principles with examples from the life cycle of the GeneCards Omics project (http://www.genecards.org), a comprehensive, widely used compendium of annotative information about human genes. For example, the GeneCards infrastructure has recently been changed from text files to a relational database, enabling better organization and views of the growing data. Omics data handling benefits from the wealth of Web-based information, the vast amount of public domain software, increasingly affordable hardware, and effective use of data management and annotation principles as outlined in this chapter.

  15. PAnnBuilder: an R package for assembling proteomic annotation data.

    PubMed

    Li, Hong; Ding, Guohui; Xie, Lu; Li, Yixue

    2009-04-15

    PAnnBuilder is an R package to automatically assemble protein annotation information from public resources to provide uniform annotation data for large-scale proteomic studies. Sixteen public databases have been parsed and 54 annotation packages have been constructed based on R environment or SQLite database. These ready-to-use packages cover most frequently needed protein annotation for three model species including human, mouse and rat. Several extended applications such as annotation based on protein sequence similarity are also provided. Sophisticated users can develop their own packages using PAnnBuilder. PAnnBuilder may become an important tool for proteomic research.

  16. Automated Knowledge Annotation for Dynamic Collaborative Environments

    SciTech Connect

    Cowell, Andrew J.; Gregory, Michelle L.; Marshall, Eric J.; McGrath, Liam R.

    2009-05-19

    This paper describes the Knowledge Encapsulation Framework (KEF), a suite of tools to enable automated knowledge annotation for modeling and simulation projects. This framework can be used to capture evidence (e.g., facts extracted from journal articles and government reports), discover new evidence (from similar peer-reviewed material as well as social media), enable discussions surrounding domain-specific topics and provide automatically generated semantic annotations for improved corpus investigation. The current KEF implementation is presented within a wiki environment, providing a simple but powerful collaborative space for team members to review, annotate, discuss and align evidence with their modeling frameworks.

  17. Genome Annotation Transfer Utility (GATU): rapid annotation of viral genomes using a closely related reference genome.

    PubMed

    Tcherepanov, Vasily; Ehlers, Angelika; Upton, Chris

    2006-06-13

    Since DNA sequencing has become easier and cheaper, an increasing number of closely related viral genomes have been sequenced. However, many of these have been deposited in GenBank without annotations, severely limiting their value to researchers. While maintaining comprehensive genomic databases for a set of virus families at the Viral Bioinformatics Resource Center http://www.biovirus.org and Viral Bioinformatics - Canada http://www.virology.ca, we found that researchers were unnecessarily spending time annotating viral genomes that were close relatives of already annotated viruses. We have therefore designed and implemented a novel tool, Genome Annotation Transfer Utility (GATU), to transfer annotations from a previously annotated reference genome to a new target genome, thereby greatly reducing this laborious task. GATU transfers annotations from a reference genome to a closely related target genome, while still giving the user final control over which annotations should be included. GATU also detects open reading frames present in the target but not the reference genome and provides the user with a variety of bioinformatics tools to quickly determine if these ORFs should also be included in the annotation. After this process is complete, GATU saves the newly annotated genome as a GenBank, EMBL or XML-format file. The software is coded in Java and runs on a variety of computer platforms. Its user-friendly Graphical User Interface is specifically designed for users trained in the biological sciences. GATU greatly simplifies the initial stages of genome annotation by using a closely related genome as a reference. It is not intended to be a gene prediction tool or a "complete" annotation system, but we have found that it significantly reduces the time required for annotation of genes and mature peptides as well as helping to standardize gene names between related organisms by transferring reference genome annotations to the target genome. The program is freely

  18. GPA: a statistical approach to prioritizing GWAS results by integrating pleiotropy and annotation.

    PubMed

    Chung, Dongjun; Yang, Can; Li, Cong; Gelernter, Joel; Zhao, Hongyu

    2014-11-01

    Results from Genome-Wide Association Studies (GWAS) have shown that complex diseases are often affected by many genetic variants with small or moderate effects. Identifications of these risk variants remain a very challenging problem. There is a need to develop more powerful statistical methods to leverage available information to improve upon traditional approaches that focus on a single GWAS dataset without incorporating additional data. In this paper, we propose a novel statistical approach, GPA (Genetic analysis incorporating Pleiotropy and Annotation), to increase statistical power to identify risk variants through joint analysis of multiple GWAS data sets and annotation information because: (1) accumulating evidence suggests that different complex diseases share common risk bases, i.e., pleiotropy; and (2) functionally annotated variants have been consistently demonstrated to be enriched among GWAS hits. GPA can integrate multiple GWAS datasets and functional annotations to seek association signals, and it can also perform hypothesis testing to test the presence of pleiotropy and enrichment of functional annotation. Statistical inference of the model parameters and SNP ranking is achieved through an EM algorithm that can handle genome-wide markers efficiently. When we applied GPA to jointly analyze five psychiatric disorders with annotation information, not only did GPA identify many weak signals missed by the traditional single phenotype analysis, but it also revealed relationships in the genetic architecture of these disorders. Using our hypothesis testing framework, statistically significant pleiotropic effects were detected among these psychiatric disorders, and the markers annotated in the central nervous system genes and eQTLs from the Genotype-Tissue Expression (GTEx) database were significantly enriched. We also applied GPA to a bladder cancer GWAS data set with the ENCODE DNase-seq data from 125 cell lines. GPA was able to detect cell lines that are

  19. GPA: A Statistical Approach to Prioritizing GWAS Results by Integrating Pleiotropy and Annotation

    PubMed Central

    Li, Cong; Gelernter, Joel; Zhao, Hongyu

    2014-01-01

    Results from Genome-Wide Association Studies (GWAS) have shown that complex diseases are often affected by many genetic variants with small or moderate effects. Identifications of these risk variants remain a very challenging problem. There is a need to develop more powerful statistical methods to leverage available information to improve upon traditional approaches that focus on a single GWAS dataset without incorporating additional data. In this paper, we propose a novel statistical approach, GPA (Genetic analysis incorporating Pleiotropy and Annotation), to increase statistical power to identify risk variants through joint analysis of multiple GWAS data sets and annotation information because: (1) accumulating evidence suggests that different complex diseases share common risk bases, i.e., pleiotropy; and (2) functionally annotated variants have been consistently demonstrated to be enriched among GWAS hits. GPA can integrate multiple GWAS datasets and functional annotations to seek association signals, and it can also perform hypothesis testing to test the presence of pleiotropy and enrichment of functional annotation. Statistical inference of the model parameters and SNP ranking is achieved through an EM algorithm that can handle genome-wide markers efficiently. When we applied GPA to jointly analyze five psychiatric disorders with annotation information, not only did GPA identify many weak signals missed by the traditional single phenotype analysis, but it also revealed relationships in the genetic architecture of these disorders. Using our hypothesis testing framework, statistically significant pleiotropic effects were detected among these psychiatric disorders, and the markers annotated in the central nervous system genes and eQTLs from the Genotype-Tissue Expression (GTEx) database were significantly enriched. We also applied GPA to a bladder cancer GWAS data set with the ENCODE DNase-seq data from 125 cell lines. GPA was able to detect cell lines that are

  20. AnnotCompute: annotation-based exploration and meta-analysis of genomics experiments.

    PubMed

    Zheng, Jie; Stoyanovich, Julia; Manduchi, Elisabetta; Liu, Junmin; Stoeckert, Christian J

    2011-01-01

    The ever-increasing scale of biological data sets, particularly those arising in the context of high-throughput technologies, requires the development of rich data exploration tools. In this article, we present AnnotCompute, an information discovery platform for repositories of functional genomics experiments such as ArrayExpress. Our system leverages semantic annotations of functional genomics experiments with controlled vocabulary and ontology terms, such as those from the MGED Ontology, to compute conceptual dissimilarities between pairs of experiments. These dissimilarities are then used to support two types of exploratory analysis-clustering and query-by-example. We show that our proposed dissimilarity measures correspond to a user's intuition about conceptual dissimilarity, and can be used to support effective query-by-example. We also evaluate the quality of clustering based on these measures. While AnnotCompute can support a richer data exploration experience, its effectiveness is limited in some cases, due to the quality of available annotations. Nonetheless, tools such as AnnotCompute may provide an incentive for richer annotations of experiments. Database URL: http://www.cbil.upenn.edu/annotCompute/

  1. DFAST and DAGA: web-based integrated genome annotation tools and resources

    PubMed Central

    TANIZAWA, Yasuhiro; FUJISAWA, Takatomo; KAMINUMA, Eli; NAKAMURA, Yasukazu; ARITA, Masanori

    2016-01-01

    Quality assurance and correct taxonomic affiliation of data submitted to public sequence databases have been an everlasting problem. The DDBJ Fast Annotation and Submission Tool (DFAST) is a newly developed genome annotation pipeline with quality and taxonomy assessment tools. To enable annotation of ready-to-submit quality, we also constructed curated reference protein databases tailored for lactic acid bacteria. DFAST was developed so that all the procedures required for DDBJ submission could be done seamlessly online. The online workspace would be especially useful for users not familiar with bioinformatics skills. In addition, we have developed a genome repository, DFAST Archive of Genome Annotation (DAGA), which currently includes 1,421 genomes covering 179 species and 18 subspecies of two genera, Lactobacillus and Pediococcus, obtained from both DDBJ/ENA/GenBank and Sequence Read Archive (SRA). All the genomes deposited in DAGA were annotated consistently and assessed using DFAST. To assess the taxonomic position based on genomic sequence information, we used the average nucleotide identity (ANI), which showed high discriminative power to determine whether two given genomes belong to the same species. We corrected mislabeled or misidentified genomes in the public database and deposited the curated information in DAGA. The repository will improve the accessibility and reusability of genome resources for lactic acid bacteria. By exploiting the data deposited in DAGA, we found intraspecific subgroups in Lactobacillus gasseri and Lactobacillus jensenii, whose variation between subgroups is larger than the well-accepted ANI threshold of 95% to differentiate species. DFAST and DAGA are freely accessible at https://dfast.nig.ac.jp. PMID:27867804

  2. DFAST and DAGA: web-based integrated genome annotation tools and resources.

    PubMed

    Tanizawa, Yasuhiro; Fujisawa, Takatomo; Kaminuma, Eli; Nakamura, Yasukazu; Arita, Masanori

    2016-01-01

    Quality assurance and correct taxonomic affiliation of data submitted to public sequence databases have been an everlasting problem. The DDBJ Fast Annotation and Submission Tool (DFAST) is a newly developed genome annotation pipeline with quality and taxonomy assessment tools. To enable annotation of ready-to-submit quality, we also constructed curated reference protein databases tailored for lactic acid bacteria. DFAST was developed so that all the procedures required for DDBJ submission could be done seamlessly online. The online workspace would be especially useful for users not familiar with bioinformatics skills. In addition, we have developed a genome repository, DFAST Archive of Genome Annotation (DAGA), which currently includes 1,421 genomes covering 179 species and 18 subspecies of two genera, Lactobacillus and Pediococcus, obtained from both DDBJ/ENA/GenBank and Sequence Read Archive (SRA). All the genomes deposited in DAGA were annotated consistently and assessed using DFAST. To assess the taxonomic position based on genomic sequence information, we used the average nucleotide identity (ANI), which showed high discriminative power to determine whether two given genomes belong to the same species. We corrected mislabeled or misidentified genomes in the public database and deposited the curated information in DAGA. The repository will improve the accessibility and reusability of genome resources for lactic acid bacteria. By exploiting the data deposited in DAGA, we found intraspecific subgroups in Lactobacillus gasseri and Lactobacillus jensenii, whose variation between subgroups is larger than the well-accepted ANI threshold of 95% to differentiate species. DFAST and DAGA are freely accessible at https://dfast.nig.ac.jp.

  3. Improving pharmaceutical innovation by building a more comprehensive database on drug development and use.

    PubMed

    Daniel, Gregory W; Cazé, Alexis; Romine, Morgan H; Audibert, Céline; Leff, Jonathan S; McClellan, Mark B

    2015-02-01

    New drugs and biologics have had a tremendous impact on the treatment of many diseases. However, available measures suggest that pharmaceutical innovation has remained relatively flat, despite substantial growth in research and development spending. We review recent literature on pharmaceutical innovation to identify limitations in measuring and assessing innovation, and we describe the framework and collaborative approach we are using to develop more comprehensive, publicly available metrics for innovation. Our research teams at the Brookings Institution and Deerfield Institute are collaborating with experts from multiple areas of drug development and regulatory review to identify and collect comprehensive data elements related to key development and regulatory characteristics for each new molecular entity approved over the past several decades in the United States and the European Union. Subsequent phases of our effort will add data on downstream product use and patient outcomes and will also include drugs that have failed or been abandoned in development. Such a database will enable researchers to better analyze the drivers of drug innovation, trends in the output of new medicines, and the effect of policy efforts designed to improve innovation.

  4. Systemic inaccuracies in the National Surgical Quality Improvement Program database: Implications for accuracy and validity for neurosurgery outcomes research.

    PubMed

    Rolston, John D; Han, Seunggu J; Chang, Edward F

    2017-03-01

    The American College of Surgeons (ACS) National Surgical Quality Improvement Program (NSQIP) provides a rich database of North American surgical procedures and their complications. Yet no external source has validated the accuracy of the information within this database. Using records from the 2006 to 2013 NSQIP database, we used two methods to identify errors: (1) mismatches between the Current Procedural Terminology (CPT) code that was used to identify the surgical procedure, and the International Classification of Diseases (ICD-9) post-operative diagnosis: i.e., a diagnosis that is incompatible with a certain procedure. (2) Primary anesthetic and CPT code mismatching: i.e., anesthesia not indicated for a particular procedure. Analyzing data for movement disorders, epilepsy, and tumor resection, we found evidence of CPT code and postoperative diagnosis mismatches in 0.4-100% of cases, depending on the CPT code examined. When analyzing anesthetic data from brain tumor, epilepsy, trauma, and spine surgery, we found evidence of miscoded anesthesia in 0.1-0.8% of cases. National databases like NSQIP are an important tool for quality improvement. Yet all databases are subject to errors, and measures of internal consistency show that errors affect up to 100% of case records for certain procedures in NSQIP. Steps should be taken to improve data collection on the frontend of NSQIP, and also to ensure that future studies with NSQIP take steps to exclude erroneous cases from analysis.

  5. Relational-database model for improving quality assurance and process control in a composite manufacturing environment

    NASA Astrophysics Data System (ADS)

    Gentry, Jeffery D.

    2000-05-01

    A relational database is a powerful tool for collecting and analyzing the vast amounts of inner-related data associated with the manufacture of composite materials. A relational database contains many individual database tables that store data that are related in some fashion. Manufacturing process variables as well as quality assurance measurements can be collected and stored in database tables indexed according to lot numbers, part type or individual serial numbers. Relationships between manufacturing process and product quality can then be correlated over a wide range of product types and process variations. This paper presents details on how relational databases are used to collect, store, and analyze process variables and quality assurance data associated with the manufacture of advanced composite materials. Important considerations are covered including how the various types of data are organized and how relationships between the data are defined. Employing relational database techniques to establish correlative relationships between process variables and quality assurance measurements is then explored. Finally, the benefits of database techniques such as data warehousing, data mining and web based client/server architectures are discussed in the context of composite material manufacturing.

  6. IPD-MHC 2.0: an improved inter-species database for the study of the major histocompatibility complex.

    PubMed

    Maccari, Giuseppe; Robinson, James; Ballingall, Keith; Guethlein, Lisbeth A; Grimholt, Unni; Kaufman, Jim; Ho, Chak-Sum; de Groot, Natasja G; Flicek, Paul; Bontrop, Ronald E; Hammond, John A; Marsh, Steven G E

    2017-01-04

    The IPD-MHC Database project (http://www.ebi.ac.uk/ipd/mhc/) collects and expertly curates sequences of the major histocompatibility complex from non-human species and provides the infrastructure and tools to enable accurate analysis. Since the first release of the database in 2003, IPD-MHC has grown and currently hosts a number of specific sections, with more than 7000 alleles from 70 species, including non-human primates, canines, felines, equids, ovids, suids, bovins, salmonids and murids. These sequences are expertly curated and made publicly available through an open access website. The IPD-MHC Database is a key resource in its field, and this has led to an average of 1500 unique visitors and more than 5000 viewed pages per month. As the database has grown in size and complexity, it has created a number of challenges in maintaining and organizing information, particularly the need to standardize nomenclature and taxonomic classification, while incorporating new allele submissions. Here, we describe the latest database release, the IPD-MHC 2.0 and discuss planned developments. This release incorporates sequence updates and new tools that enhance database queries and improve the submission procedure by utilizing common tools that are able to handle the varied requirements of each MHC-group.

  7. IPD-MHC 2.0: an improved inter-species database for the study of the major histocompatibility complex

    PubMed Central

    Maccari, Giuseppe; Robinson, James; Ballingall, Keith; Guethlein, Lisbeth A.; Grimholt, Unni; Kaufman, Jim; Ho, Chak-Sum; de Groot, Natasja G.; Flicek, Paul; Bontrop, Ronald E.; Hammond, John A.; Marsh, Steven G. E.

    2017-01-01

    The IPD-MHC Database project (http://www.ebi.ac.uk/ipd/mhc/) collects and expertly curates sequences of the major histocompatibility complex from non-human species and provides the infrastructure and tools to enable accurate analysis. Since the first release of the database in 2003, IPD-MHC has grown and currently hosts a number of specific sections, with more than 7000 alleles from 70 species, including non-human primates, canines, felines, equids, ovids, suids, bovins, salmonids and murids. These sequences are expertly curated and made publicly available through an open access website. The IPD-MHC Database is a key resource in its field, and this has led to an average of 1500 unique visitors and more than 5000 viewed pages per month. As the database has grown in size and complexity, it has created a number of challenges in maintaining and organizing information, particularly the need to standardize nomenclature and taxonomic classification, while incorporating new allele submissions. Here, we describe the latest database release, the IPD-MHC 2.0 and discuss planned developments. This release incorporates sequence updates and new tools that enhance database queries and improve the submission procedure by utilizing common tools that are able to handle the varied requirements of each MHC-group. PMID:27899604

  8. Exploring inconsistencies in genome-wide protein function annotations: a machine learning approach

    PubMed Central

    Andorf, Carson; Dobbs, Drena; Honavar, Vasant

    2007-01-01

    Background Incorrectly annotated sequence data are becoming more commonplace as databases increasingly rely on automated techniques for annotation. Hence, there is an urgent need for computational methods for checking consistency of such annotations against independent sources of evidence and detecting potential annotation errors. We show how a machine learning approach designed to automatically predict a protein's Gene Ontology (GO) functional class can be employed to identify potential gene annotation errors. Results In a set of 211 previously annotated mouse protein kinases, we found that 201 of the GO annotations returned by AmiGO appear to be inconsistent with the UniProt functions assigned to their human counterparts. In contrast, 97% of the predicted annotations generated using a machine learning approach were consistent with the UniProt annotations of the human counterparts, as well as with available annotations for these mouse protein kinases in the Mouse Kinome database. Conclusion We conjecture that most of our predicted annotations are, therefore, correct and suggest that the machine learning approach developed here could be routinely used to detect potential errors in GO annotations generated by high-throughput gene annotation projects. Editors Note : Authors from the original publication (Okazaki et al.: Nature 2002, 420:563–73) have provided their response to Andorf et al, directly following the correspondence. PMID:17683567

  9. An annotated energy bibliography

    NASA Technical Reports Server (NTRS)

    Blow, S. J.

    1979-01-01

    Comprehensive annotated compilation of books, journals, periodicals, and reports on energy and energy related topics, contains approximately 10,0000 tehcnical and nontechnical references from bibliographic and other sources dated January 1975 through May 1977.

  10. An annotated energy bibliography

    NASA Technical Reports Server (NTRS)

    Blow, S. J.

    1979-01-01

    Comprehensive annotated compilation of books, journals, periodicals, and reports on energy and energy related topics, contains approximately 10,0000 tehcnical and nontechnical references from bibliographic and other sources dated January 1975 through May 1977.

  11. ERAIZDA: a model for holistic annotation of animal infectious and zoonotic diseases

    PubMed Central

    Buza, Teresia M.; Jack, Sherman W.; Kirunda, Halid; Khaitsa, Margaret L.; Lawrence, Mark L.; Pruett, Stephen; Peterson, Daniel G.

    2015-01-01

    There is an urgent need for a unified resource that integrates trans-disciplinary annotations of emerging and reemerging animal infectious and zoonotic diseases. Such data integration will provide wonderful opportunity for epidemiologists, researchers and health policy makers to make data-driven decisions designed to improve animal health. Integrating emerging and reemerging animal infectious and zoonotic disease data from a large variety of sources into a unified open-access resource provides more plausible arguments to achieve better understanding of infectious and zoonotic diseases. We have developed a model for interlinking annotations of these diseases. These diseases are of particular interest because of the threats they pose to animal health, human health and global health security. We demonstrated the application of this model using brucellosis, an infectious and zoonotic disease. Preliminary annotations were deposited into VetBioBase database (http://vetbiobase.igbb.msstate.edu). This database is associated with user-friendly tools to facilitate searching, retrieving and downloading of disease-related information. Database URL: http://vetbiobase.igbb.msstate.edu PMID:26581408

  12. The Subsystems Approach to Genome Annotation and its Use in the Project to Annotate 1000 Genomes

    PubMed Central

    Overbeek, Ross; Begley, Tadhg; Butler, Ralph M.; Choudhuri, Jomuna V.; Chuang, Han-Yu; Cohoon, Matthew; de Crécy-Lagard, Valérie; Diaz, Naryttza; Disz, Terry; Edwards, Robert; Fonstein, Michael; Frank, Ed D.; Gerdes, Svetlana; Glass, Elizabeth M.; Goesmann, Alexander; Hanson, Andrew; Iwata-Reuyl, Dirk; Jensen, Roy; Jamshidi, Neema; Krause, Lutz; Kubal, Michael; Larsen, Niels; Linke, Burkhard; McHardy, Alice C.; Meyer, Folker; Neuweger, Heiko; Olsen, Gary; Olson, Robert; Osterman, Andrei; Portnoy, Vasiliy; Pusch, Gordon D.; Rodionov, Dmitry A.; Rückert, Christian; Steiner, Jason; Stevens, Rick; Thiele, Ines; Vassieva, Olga; Ye, Yuzhen; Zagnitko, Olga; Vonstein, Veronika

    2005-01-01

    The release of the 1000th complete microbial genome will occur in the next two to three years. In anticipation of this milestone, the Fellowship for Interpretation of Genomes (FIG) launched the Project to Annotate 1000 Genomes. The project is built around the principle that the key to improved accuracy in high-throughput annotation technology is to have experts annotate single subsystems over the complete collection of genomes, rather than having an annotation expert attempt to annotate all of the genes in a single genome. Using the subsystems approach, all of the genes implementing the subsystem are analyzed by an expert in that subsystem. An annotation environment was created where populated subsystems are curated and projected to new genomes. A portable notion of a populated subsystem was defined, and tools developed for exchanging and curating these objects. Tools were also developed to resolve conflicts between populated subsystems. The SEED is the first annotation environment that supports this model of annotation. Here, we describe the subsystem approach, and offer the first release of our growing library of populated subsystems. The initial release of data includes 180 177 distinct proteins with 2133 distinct functional roles. This data comes from 173 subsystems and 383 different organisms. PMID:16214803

  13. The subsystems approach to genome annotation and its use in the project to annotate 1000 genomes.

    PubMed

    Overbeek, Ross; Begley, Tadhg; Butler, Ralph M; Choudhuri, Jomuna V; Chuang, Han-Yu; Cohoon, Matthew; de Crécy-Lagard, Valérie; Diaz, Naryttza; Disz, Terry; Edwards, Robert; Fonstein, Michael; Frank, Ed D; Gerdes, Svetlana; Glass, Elizabeth M; Goesmann, Alexander; Hanson, Andrew; Iwata-Reuyl, Dirk; Jensen, Roy; Jamshidi, Neema; Krause, Lutz; Kubal, Michael; Larsen, Niels; Linke, Burkhard; McHardy, Alice C; Meyer, Folker; Neuweger, Heiko; Olsen, Gary; Olson, Robert; Osterman, Andrei; Portnoy, Vasiliy; Pusch, Gordon D; Rodionov, Dmitry A; Rückert, Christian; Steiner, Jason; Stevens, Rick; Thiele, Ines; Vassieva, Olga; Ye, Yuzhen; Zagnitko, Olga; Vonstein, Veronika

    2005-01-01

    The release of the 1000th complete microbial genome will occur in the next two to three years. In anticipation of this milestone, the Fellowship for Interpretation of Genomes (FIG) launched the Project to Annotate 1000 Genomes. The project is built around the principle that the key to improved accuracy in high-throughput annotation technology is to have experts annotate single subsystems over the complete collection of genomes, rather than having an annotation expert attempt to annotate all of the genes in a single genome. Using the subsystems approach, all of the genes implementing the subsystem are analyzed by an expert in that subsystem. An annotation environment was created where populated subsystems are curated and projected to new genomes. A portable notion of a populated subsystem was defined, and tools developed for exchanging and curating these objects. Tools were also developed to resolve conflicts between populated subsystems. The SEED is the first annotation environment that supports this model of annotation. Here, we describe the subsystem approach, and offer the first release of our growing library of populated subsystems. The initial release of data includes 180 177 distinct proteins with 2133 distinct functional roles. This data comes from 173 subsystems and 383 different organisms.

  14. Hip Fractures in Centenarians: Has Care Improved in the National Hip Fracture Database Era?

    PubMed

    Dick, Alastair G; Davenport, Dominic; Bansal, Mohit; Burch, Therese S; Edwards, Max R

    2017-09-01

    The number of centenarians in the United Kingdom is increasing. An associated increase in the incidence of hip fractures in the extreme elderly population is expected. The National Hip Fracture Database (NHFD) initiative was introduced in 2007 aiming to improve hip fracture care. There is a paucity of literature on the outcomes of centenarians with hip fractures since its introduction. The aim of this study is to report our experience of hip fractures in centenarians in the era since the introduction of the NHFD to assess outcomes in terms of mortality, time to surgery, length of stay, and complications. A retrospective case note study of all centenarians managed for a hip fracture over a 7-year period at a London district general hospital. We report on 22 centenarians sustaining 23 hip fractures between 2008 and 2015. Twenty-one fractures were managed operatively. For patients managed operatively, in-hospital, 30-day, 3-month, 6-month, 1-year, 2-year, 3-year, and 5-year cumulative mortalities were 30%, 30%, 39%, 50%, 77%, 86%, 95%, and 100%, respectively. In-hospital mortality was 100% for those managed nonoperatively. Mean time to surgery was 1.6 days (range: 0.7-6.3 days). Mean length of stay on the acute orthopedic ward was 23 days (range: 2-51 days). Seventy-one percent had a postoperative complication most commonly a hospital-acquired pneumonia or urinary tract infection. Compared to a series of centenarians with hip fractures prior to the introduction of the NHFD, we report a reduced time to surgery. Mortality and hospital length of stay were similar.

  15. Risk Factors Leading to Free Flap Failure: Analysis From the National Surgical Quality Improvement Program Database.

    PubMed

    Sanati-Mehrizy, Paymon; Massenburg, Benjamin B; Rozehnal, John M; Ingargiola, Michael J; Hernandez Rosa, Jonatan; Taub, Peter J

    2016-11-01

    The objective of this study was to identify risk factors for free flap failure among various anatomically based free flap subgroups. The 2005 to 2012 American College of Surgeons National Surgical Quality Improvement Program database was queried for patients undergoing microvascular free tissue transfer based on current procedural terminology codes. Univariate analysis was performed to identify any association between flap failure and the following factors: age, gender, race, body mass index (BMI), diabetes, smoking, alcohol use, hypertension, intraoperative transfusion, functional health status, American Society of Anesthesiologists class, operative time, and flap location. Factors yielding a significance of P < 0.20 were included in multivariate logistic regression models in order to identify independent risk factor significance for flap failure. Furthermore, patients were stratified based on recipient site (breast, head and neck, trunk, or extremity), and analysis was repeated in order to identify risk factors specific to each location. A total of 1921 of 2103 patients who underwent microvascular free flap reconstruction met inclusion criteria. Multivariate logistic regression identified BMI (adjusted odds ratio [AOR] = 1.07, P = 0.004) and male gender (AOR = 2.16, P = 0.033) as independent risk factors for flap failure. Among the "breast flaps" subgroup, BMI (AOR = 1.075, P = 0.012) and smoking (AOR = 3.35, P = 0.02) were independent variables associated with flap failure. In "head and neck flaps," operative time (AOR = 1.003, P = 0.018) was an independent risk factor for flap failure. No independent risk factors were identified for the "extremity flaps" or "trunk flaps" subtypes. BMI, smoking, and operative time were identified as independent risk factors for free flap failure among all flaps or within flap subsets.

  16. DDBJ progress report: a new submission system for leading to a correct annotation

    PubMed Central

    Kosuge, Takehide; Mashima, Jun; Kodama, Yuichi; Fujisawa, Takatomo; Kaminuma, Eli; Ogasawara, Osamu; Okubo, Kousaku; Takagi, Toshihisa; Nakamura, Yasukazu

    2014-01-01

    The DNA Data Bank of Japan (DDBJ; http://www.ddbj.nig.ac.jp) maintains and provides archival, retrieval and analytical resources for biological information. This database content is shared with the US National Center for Biotechnology Information (NCBI) and the European Bioinformatics Institute (EBI) within the framework of the International Nucleotide Sequence Database Collaboration (INSDC). DDBJ launched a new nucleotide sequence submission system for receiving traditional nucleotide sequence. We expect that the new submission system will be useful for many submitters to input accurate annotation and reduce the time needed for data input. In addition, DDBJ has started a new service, the Japanese Genotype–phenotype Archive (JGA), with our partner institute, the National Bioscience Database Center (NBDC). JGA permanently archives and shares all types of individual human genetic and phenotypic data. We also introduce improvements in the DDBJ services and databases made during the past year. PMID:24194602

  17. An Introduction to Genome Annotation.

    PubMed

    Campbell, Michael S; Yandell, Mark

    2015-12-17

    Genome projects have evolved from large international undertakings to tractable endeavors for a single lab. Accurate genome annotation is critical for successful genomic, genetic, and molecular biology experiments. These annotations can be generated using a number of approaches and available software tools. This unit describes methods for genome annotation and a number of software tools commonly used in gene annotation.

  18. A high-quality annotated transcriptome of swine peripheral blood.

    PubMed

    Liu, Haibo; Smith, Timothy P L; Nonneman, Dan J; Dekkers, Jack C M; Tuggle, Christopher K

    2017-06-24

    High throughput gene expression profiling assays of peripheral blood are widely used in biomedicine, as well as in animal genetics and physiology research. Accurate, comprehensive, and precise interpretation of such high throughput assays relies on well-characterized reference genomes and/or transcriptomes. However, neither the reference genome nor the peripheral blood transcriptome of the pig have been sufficiently assembled and annotated to support such profiling assays in this emerging biomedical model organism. We aimed to assemble published and novel RNA-seq data to provide a comprehensive, well-annotated blood transcriptome for pigs by integrating a de novo assembly with a genome-guided assembly. A de novo and a genome-guided transcriptome of porcine whole peripheral blood was assembled with ~162 million pairs of paired-end and ~183 million single-end, trimmed and normalized Illumina RNA-seq reads (~6 billion initial reads from 146 RNA-seq libraries) from five independent studies by using the Trinity and Cufflinks software, respectively. We then removed putative transcripts (PTs) of low confidence from both assemblies and merged the remaining PTs into an integrated transcriptome consisting of 132,928 PTs, with 126,225 (~95%) PTs from the de novo assembly and more than 91% of PTs spliced. In the integrated transcriptome, ~90% and 63% of PTs had significant sequence similarity to sequences in the NCBI NT and NR databases, respectively; 68,754 (~52%) PTs were annotated with 15,965 unique gene ontology (GO) terms; and 7618 PTs annotated with Enzyme Commission codes were assigned to 134 pathways curated by the Kyoto Encyclopedia of Genes and Genomes (KEGG). Full exon-intron junctions of 17,528 PTs were validated by PacBio IsoSeq full-length cDNA reads from 3 other porcine tissues, NCBI pig RefSeq mRNAs and transcripts from Ensembl Sscrofa10.2 annotation. Completeness of the 5' termini of 37,569 PTs was validated by public cap analysis of gene expression (CAGE

  19. The ability of older adults to use customized online medical databases to improve their health-related knowledge.

    PubMed

    Freund, Ophir; Reychav, Iris; McHaney, Roger; Goland, Ella; Azuri, Joseph

    2017-06-01

    Patient compliance with medical advice and recommended treatment depends on perception of health condition, medical knowledge, attitude, and self-efficacy. This study investigated how use of customized online medical databases, intended to improve knowledge in a variety of relevant medical topics, influenced senior adults' perceptions. Seventy-nine older adults in residence homes completed a computerized, tablet-based questionnaire, with medical scenarios and related questions. Following an intervention, control group participants answered questions without online help while an experimental group received internet links that directed them to customized, online medical databases. Medical knowledge and test scores among the experimental group significantly improved from pre- to post-intervention (p<0.0001) and was higher in comparison with the control group (p<0.0001). No significant change occurred in the control group. Older adults improved their knowledge in desired medical topic areas using customized online medical databases. The study demonstrated how such databases help solve health-related questions among older adult population members, and that older patients appear willing to consider technology usage in information acquisition. Copyright © 2017 Elsevier B.V. All rights reserved.

  20. HIV Sequence Databases

    PubMed Central

    Kuiken, Carla; Korber, Bette; Shafer, Robert W.

    2008-01-01

    Two important databases are often used in HIV genetic research, the HIV Sequence Database in Los Alamos, which collects all sequences and focuses on annotation and data analysis, and the HIV RT/Protease Sequence Database in Stanford, which collects sequences associated with the development of viral resistance against anti-retroviral drugs and focuses on analysis of those sequences. The types of data and services these two databases offer, the tools they provide, and the way they are set up and operated are described in detail. PMID:12875108

  1. Deep Question Answering for protein annotation.

    PubMed

    Gobeill, Julien; Gaudinat, Arnaud; Pasche, Emilie; Vishnyakova, Dina; Gaudet, Pascale; Bairoch, Amos; Ruch, Patrick

    2015-01-01

    Biomedical professionals have access to a huge amount of literature, but when they use a search engine, they often have to deal with too many documents to efficiently find the appropriate information in a reasonable time. In this perspective, question-answering (QA) engines are designed to display answers, which were automatically extracted from the retrieved documents. Standard QA engines in literature process a user question, then retrieve relevant documents and finally extract some possible answers out of these documents using various named-entity recognition processes. In our study, we try to answer complex genomics questions, which can be adequately answered only using Gene Ontology (GO) concepts. Such complex answers cannot be found using state-of-the-art dictionary- and redundancy-based QA engines. We compare the effectiveness of two dictionary-based classifiers for extracting correct GO answers from a large set of 100 retrieved abstracts per question. In the same way, we also investigate the power of GOCat, a GO supervised classifier. GOCat exploits the GOA database to propose GO concepts that were annotated by curators for similar abstracts. This approach is called deep QA, as it adds an original classification step, and exploits curated biological data to infer answers, which are not explicitly mentioned in the retrieved documents. We show that for complex answers such as protein functional descriptions, the redundancy phenomenon has a limited effect. Similarly usual dictionary-based approaches are relatively ineffective. In contrast, we demonstrate how existing curated data, beyond information extraction, can be exploited by a supervised classifier, such as GOCat, to massively improve both the quantity and the quality of the answers with a +100% improvement for both recall and precision. Database URL: http://eagl.unige.ch/DeepQA4PA/. © The Author(s) 2015. Published by Oxford University Press.

  2. Deep Question Answering for protein annotation

    PubMed Central

    Gobeill, Julien; Gaudinat, Arnaud; Pasche, Emilie; Vishnyakova, Dina; Gaudet, Pascale; Bairoch, Amos; Ruch, Patrick

    2015-01-01

    Biomedical professionals have access to a huge amount of literature, but when they use a search engine, they often have to deal with too many documents to efficiently find the appropriate information in a reasonable time. In this perspective, question-answering (QA) engines are designed to display answers, which were automatically extracted from the retrieved documents. Standard QA engines in literature process a user question, then retrieve relevant documents and finally extract some possible answers out of these documents using various named-entity recognition processes. In our study, we try to answer complex genomics questions, which can be adequately answered only using Gene Ontology (GO) concepts. Such complex answers cannot be found using state-of-the-art dictionary- and redundancy-based QA engines. We compare the effectiveness of two dictionary-based classifiers for extracting correct GO answers from a large set of 100 retrieved abstracts per question. In the same way, we also investigate the power of GOCat, a GO supervised classifier. GOCat exploits the GOA database to propose GO concepts that were annotated by curators for similar abstracts. This approach is called deep QA, as it adds an original classification step, and exploits curated biological data to infer answers, which are not explicitly mentioned in the retrieved documents. We show that for complex answers such as protein functional descriptions, the redundancy phenomenon has a limited effect. Similarly usual dictionary-based approaches are relatively ineffective. In contrast, we demonstrate how existing curated data, beyond information extraction, can be exploited by a supervised classifier, such as GOCat, to massively improve both the quantity and the quality of the answers with a +100% improvement for both recall and precision. Database URL: http://eagl.unige.ch/DeepQA4PA/ PMID:26384372

  3. Improved Differentiation of Streptococcus pneumoniae and Other S. mitis Group Streptococci by MALDI Biotyper Using an Improved MALDI Biotyper Database Content and a Novel Result Interpretation Algorithm.

    PubMed

    Harju, Inka; Lange, Christoph; Kostrzewa, Markus; Maier, Thomas; Rantakokko-Jalava, Kaisu; Haanperä, Marjo

    2017-03-01

    Reliable distinction of Streptococcus pneumoniae and viridans group streptococci is important because of the different pathogenic properties of these organisms. Differentiation between S. pneumoniae and closely related Sreptococcusmitis species group streptococci has always been challenging, even when using such modern methods as 16S rRNA gene sequencing or matrix-assisted laser desorption ionization-time of flight (MALDI-TOF) mass spectrometry. In this study, a novel algorithm combined with an enhanced database was evaluated for differentiation between S. pneumoniae and S. mitis species group streptococci. One hundred one clinical S. mitis species group streptococcal strains and 188 clinical S. pneumoniae strains were identified by both the standard MALDI Biotyper database alone and that combined with a novel algorithm. The database update from 4,613 strains to 5,627 strains drastically improved the differentiation of S. pneumoniae and S. mitis species group streptococci: when the new database version containing 5,627 strains was used, only one of the 101 S. mitis species group isolates was misidentified as S. pneumoniae, whereas 66 of them were misidentified as S. pneumoniae when the earlier 4,613-strain MALDI Biotyper database version was used. The updated MALDI Biotyper database combined with the novel algorithm showed even better performance, producing no misidentifications of the S. mitis species group strains as S. pneumoniae All S. pneumoniae strains were correctly identified as S. pneumoniae with both the standard MALDI Biotyper database and the standard MALDI Biotyper database combined with the novel algorithm. This new algorithm thus enables reliable differentiation between pneumococci and other S. mitis species group streptococci with the MALDI Biotyper.

  4. Improved systematic tRNA gene annotation allows new insights into the evolution of mitochondrial tRNA structures and into the mechanisms of mitochondrial genome rearrangements

    PubMed Central

    Jühling, Frank; Pütz, Joern; Bernt, Matthias; Donath, Alexander; Middendorf, Martin; Florentz, Catherine; Stadler, Peter F.

    2012-01-01

    Transfer RNAs (tRNAs) are present in all types of cells as well as in organelles. tRNAs of animal mitochondria show a low level of primary sequence conservation and exhibit ‘bizarre’ secondary structures, lacking complete domains of the common cloverleaf. Such sequences are hard to detect and hence frequently missed in computational analyses and mitochondrial genome annotation. Here, we introduce an automatic annotation procedure for mitochondrial tRNA genes in Metazoa based on sequence and structural information in manually curated covariance models. The method, applied to re-annotate 1876 available metazoan mitochondrial RefSeq genomes, allows to distinguish between remaining functional genes and degrading ‘pseudogenes’, even at early stages of divergence. The subsequent analysis of a comprehensive set of mitochondrial tRNA genes gives new insights into the evolution of structures of mitochondrial tRNA sequences as well as into the mechanisms of genome rearrangements. We find frequent losses of tRNA genes concentrated in basal Metazoa, frequent independent losses of individual parts of tRNA genes, particularly in Arthropoda, and wide-spread conserved overlaps of tRNAs in opposite reading direction. Direct evidence for several recent Tandem Duplication-Random Loss events is gained, demonstrating that this mechanism has an impact on the appearance of new mitochondrial gene orders. PMID:22139921

  5. Improved Search of Principal Component Analysis Databases for Spectro-polarimetric Inversion

    NASA Astrophysics Data System (ADS)

    Casini, R.; Asensio Ramos, A.; Lites, B. W.; López Ariste, A.

    2013-08-01

    We describe a simple technique for the acceleration of spectro-polarimetric inversions based on principal component analysis (PCA) of Stokes profiles. This technique involves the indexing of the database models based on the sign of the projections (PCA coefficients) of the first few relevant orders of principal components of the four Stokes parameters. In this way, each model in the database can be attributed a distinctive binary number of 24n bits, where n is the number of PCA orders used for the indexing. Each of these binary numbers (indices) identifies a group of "compatible" models for the inversion of a given set of observed Stokes profiles sharing the same index. The complete set of the binary numbers so constructed evidently determines a partition of the database. The search of the database for the PCA inversion of spectro-polarimetric data can profit greatly from this indexing. In practical cases it becomes possible to approach the ideal acceleration factor of 24n as compared to the systematic search of a non-indexed database for a traditional PCA inversion. This indexing method relies on the existence of a physical meaning in the sign of the PCA coefficients of a model. For this reason, the presence of model ambiguities and of spectro-polarimetric noise in the observations limits in practice the number n of relevant PCA orders that can be used for the indexing.

  6. IMPROVED SEARCH OF PRINCIPAL COMPONENT ANALYSIS DATABASES FOR SPECTRO-POLARIMETRIC INVERSION

    SciTech Connect

    Casini, R.; Lites, B. W.; Ramos, A. Asensio

    2013-08-20

    We describe a simple technique for the acceleration of spectro-polarimetric inversions based on principal component analysis (PCA) of Stokes profiles. This technique involves the indexing of the database models based on the sign of the projections (PCA coefficients) of the first few relevant orders of principal components of the four Stokes parameters. In this way, each model in the database can be attributed a distinctive binary number of 2{sup 4n} bits, where n is the number of PCA orders used for the indexing. Each of these binary numbers (indices) identifies a group of ''compatible'' models for the inversion of a given set of observed Stokes profiles sharing the same index. The complete set of the binary numbers so constructed evidently determines a partition of the database. The search of the database for the PCA inversion of spectro-polarimetric data can profit greatly from this indexing. In practical cases it becomes possible to approach the ideal acceleration factor of 2{sup 4n} as compared to the systematic search of a non-indexed database for a traditional PCA inversion. This indexing method relies on the existence of a physical meaning in the sign of the PCA coefficients of a model. For this reason, the presence of model ambiguities and of spectro-polarimetric noise in the observations limits in practice the number n of relevant PCA orders that can be used for the indexing.

  7. Semantic Annotation of Mutable Data

    PubMed Central

    Morris, Robert A.; Dou, Lei; Hanken, James; Kelly, Maureen; Lowery, David B.; Ludäscher, Bertram; Macklin, James A.; Morris, Paul J.

    2013-01-01

    Electronic annotation of scientific data is very similar to annotation of documents. Both types of annotation amplify the original object, add related knowledge to it, and dispute or support assertions in it. In each case, annotation is a framework for discourse about the original object, and, in each case, an annotation needs to clearly identify its scope and its own terminology. However, electronic annotation of data differs from annotation of documents: the content of the annotations, including expectations and supporting evidence, is more often shared among members of networks. Any consequent actions taken by the holders of the annotated data could be shared as well. But even those current annotation systems that admit data as their subject often make it difficult or impossible to annotate at fine-enough granularity to use the results in this way for data quality control. We address these kinds of issues by offering simple extensions to an existing annotation ontology and describe how the results support an interest-based distribution of annotations. We are using the result to design and deploy a platform that supports annotation services overlaid on networks of distributed data, with particular application to data quality control. Our initial instance supports a set of natural science collection metadata services. An important application is the support for data quality control and provision of missing data. A previous proof of concept demonstrated such use based on data annotations modeled with XML-Schema. PMID:24223697

  8. Animal QTLdb: an improved database tool for livestock animal QTL/association data dissemination in the post-genome era.

    PubMed

    Hu, Zhi-Liang; Park, Carissa A; Wu, Xiao-Lin; Reecy, James M

    2013-01-01

    The Animal QTL database (QTLdb; http://www.animalgenome.org/QTLdb) is designed to house all publicly available QTL and single-nucleotide polymorphism/gene association data on livestock animal species. An earlier version was published in the Nucleic Acids Research Database issue in 2007. Since then, we have continued our efforts to develop new and improved database tools to allow more data types, parameters and functions. Our efforts have transformed the Animal QTLdb into a tool that actively serves the research community as a quality data repository and more importantly, a provider of easily accessible tools and functions to disseminate QTL and gene association information. The QTLdb has been heavily used by the livestock genomics community since its first public release in 2004. To date, there are 5920 cattle, 3442 chicken, 7451 pigs, 753 sheep and 88 rainbow trout data points in the database, and at least 290 publications that cite use of the database. The rapid advancement in genomic studies of cattle, chicken, pigs, sheep and other livestock animals has presented us with challenges, as well as opportunities for the QTLdb to meet the evolving needs of the research community. Here, we report our progress over the recent years and highlight new functions and services available to the general public.

  9. Complementary use of the SciSearch database for improved biomedical information searching.

    PubMed Central

    Brown, C M

    1998-01-01

    The use of at least two complementary online biomedical databases is generally considered critical for biomedical scientists seeking to keep fully abreast of recent research developments as well as to retrieve the highest number of relevant citations possible. Although the National Library of Medicine's MEDLINE is usually the database of choice, this paper illustrates the benefits of using another database, the Institute for Scientific Information's SciSearch, when conducting a biomedical information search. When a simple query about red wine consumption and coronary artery disease was posed simultaneously in both MEDLINE and SciSearch, a greater number of relevant citations were retrieved through SciSearch. This paper also provides suggestions for carrying out a comprehensive biomedical literature search in a rapid and efficient manner by using SciSearch in conjunction with MEDLINE. PMID:9549014

  10. Improving HJ-1B IRS land surface temperature product using ASTER global emissivity database

    NASA Astrophysics Data System (ADS)

    Li, H.; Hu, T.; Meng, X.; Yongming, D.; Cao, B.; Liu, Q.

    2015-12-01

    Land surface temperature (LST) is a key parameter for hydrological, meteorological, climatological and environmental studies. Currently many operational LST products have been generated using European and American satellite data, i.e., the Advanced Very High Resolution Radiometer (AVHRR), Advanced Along-Track Scanning Radiometer (AATSR) and Moderate Resolution Imaging Spectroradiometer (MODIS). However, few LST product has been produced using Chinese satellite data. Thus, the objective of this study is to generate reliable LST product using Chinese HJ-1B satellite data. The HJ-1B satellite of China, were launched on September 6, 2008, which are used for disaster and environment monitoring. IRS (Infrared Scanner) is one of the key instruments onboard HJ-1B satellite, it can scan the earth every four days, has four spectral bands ranging from the near-infrared to thermal infrared bands (band 1 0.75 - 1.10μm, band 2 1.55-1.75μm, MIR band 3 3.50 - 3.90μm, band 4 10.5-12.5μm) with 720 km swath. It scans ±29° from nadir and the spatial resolution for band1-3 is 150m and 300m for band4. In this study, a single-channel parametric model (SC-PM) algorithm were used to produce 300m LST product from HJ-1B IRS data. The NCEP atmospheric profiles and a parametric model were used for atmospheric correction. In order to improve the accuracy of the land surface emissivity (LSE), the 1km ASTER Global Emissivity Database (GED) and self-developed 5-day 1km vegetation cover product were used for estimating the LSE based on the Vegetation Cover Method. Two years of HJ-1B IRS LST product in Heihe River basin (Gansu province, China) from June 2012 to June 2014 were generated. The LST products were evaluated against ground observations in an arid area of northwest China during the Heihe Watershed Allied Telemetry Experimental Research (HiWATER) experiment. Four barren surface sites and ten vegetated sites were chosen for the evaluation. The results show that the developed HJ-1B IRS

  11. The Vertebrate Genome Annotation browser 10 years on.

    PubMed

    Harrow, Jennifer L; Steward, Charles A; Frankish, Adam; Gilbert, James G; Gonzalez, Jose M; Loveland, Jane E; Mudge, Jonathan; Sheppard, Dan; Thomas, Mark; Trevanion, Stephen; Wilming, Laurens G

    2014-01-01

    The Vertebrate Genome Annotation (VEGA) database (http://vega.sanger.ac.uk), initially designed as a community resource for browsing manual annotation of the human genome project, now contains five reference genomes (human, mouse, zebrafish, pig and rat). Its introduction pages have been redesigned to enable the user to easily navigate between whole genomes and smaller multi-species haplotypic regions of interest such as the major histocompatibility complex. The VEGA browser is unique in that annotation is updated via the Human And Vertebrate Analysis aNd Annotation (HAVANA) update track every 2 weeks, allowing single gene updates to be made publicly available to the research community quickly. The user can now access different haplotypic subregions more easily, such as those from the non-obese diabetic mouse, and display them in a more intuitive way using the comparative tools. We also highlight how the user can browse manually annotated updated patches from the Genome Reference Consortium (GRC).

  12. The Coral Triangle Atlas: An Integrated Online Spatial Database System for Improving Coral Reef Management

    PubMed Central

    Cros, Annick; Ahamad Fatan, Nurulhuda; White, Alan; Teoh, Shwu Jiau; Tan, Stanley; Handayani, Christian; Huang, Charles; Peterson, Nate; Venegas Li, Ruben; Siry, Hendra Yusran; Fitriana, Ria; Gove, Jamison; Acoba, Tomoko; Knight, Maurice; Acosta, Renerio; Andrew, Neil; Beare, Doug

    2014-01-01

    In this paper we describe the construction of an online GIS database system, hosted by WorldFish, which stores bio-physical, ecological and socio-economic data for the ‘Coral Triangle Area’ in South-east Asia and the Pacific. The database has been built in partnership with all six (Timor-Leste, Malaysia, Indonesia, The Philippines, Solomon Islands and Papua New Guinea) of the Coral Triangle countries, and represents a valuable source of information for natural resource managers at the regional scale. Its utility is demonstrated using biophysical data, data summarising marine habitats, and data describing the extent of marine protected areas in the region. PMID:24941442

  13. The Coral Triangle Atlas: an integrated online spatial database system for improving coral reef management.

    PubMed

    Cros, Annick; Ahamad Fatan, Nurulhuda; White, Alan; Teoh, Shwu Jiau; Tan, Stanley; Handayani, Christian; Huang, Charles; Peterson, Nate; Venegas Li, Ruben; Siry, Hendra Yusran; Fitriana, Ria; Gove, Jamison; Acoba, Tomoko; Knight, Maurice; Acosta, Renerio; Andrew, Neil; Beare, Doug

    2014-01-01

    In this paper we describe the construction of an online GIS database system, hosted by WorldFish, which stores bio-physical, ecological and socio-economic data for the 'Coral Triangle Area' in South-east Asia and the Pacific. The database has been built in partnership with all six (Timor-Leste, Malaysia, Indonesia, The Philippines, Solomon Islands and Papua New Guinea) of the Coral Triangle countries, and represents a valuable source of information for natural resource managers at the regional scale. Its utility is demonstrated using biophysical data, data summarising marine habitats, and data describing the extent of marine protected areas in the region.

  14. Biomedical article retrieval using multimodal features and image annotations in region-based CBIR

    NASA Astrophysics Data System (ADS)

    You, Daekeun; Antani, Sameer; Demner-Fushman, Dina; Rahman, Md Mahmudur; Govindaraju, Venu; Thoma, George R.

    2010-01-01

    Biomedical images are invaluable in establishing diagnosis, acquiring technical skills, and implementing best practices in many areas of medicine. At present, images needed for instructional purposes or in support of clinical decisions appear in specialized databases and in biomedical articles, and are often not easily accessible to retrieval tools. Our goal is to automatically annotate images extracted from scientific publications with respect to their usefulness for clinical decision support and instructional purposes, and project the annotations onto images stored in databases by linking images through content-based image similarity. Authors often use text labels and pointers overlaid on figures and illustrations in the articles to highlight regions of interest (ROI). These annotations are then referenced in the caption text or figure citations in the article text. In previous research we have developed two methods (a heuristic and dynamic time warping-based methods) for localizing and recognizing such pointers on biomedical images. In this work, we add robustness to our previous efforts by using a machine learning based approach to localizing and recognizing the pointers. Identifying these can assist in extracting relevant image content at regions within the image that are likely to be highly relevant to the discussion in the article text. Image regions can then be annotated using biomedical concepts from extracted snippets of text pertaining to images in scientific biomedical articles that are identified using National Library of Medicine's Unified Medical Language System® (UMLS) Metathesaurus. The resulting regional annotation and extracted image content are then used as indices for biomedical article retrieval using the multimodal features and region-based content-based image retrieval (CBIR) techniques. The hypothesis that such an approach would improve biomedical document retrieval is validated through experiments on an expert-marked biomedical article

  15. Visualizing Genomic Annotations with the UCSC Genome Browser.

    PubMed

    Hung, Jui-Hung; Weng, Zhiping

    2016-11-01

    Genomic data and annotations are rapidly accumulating in databases such as the UCSC Genome Browser, NCBI, and Ensembl. Given the massive scale of these genomic databases, it is important to be able to easily retrieve known data and annotations of a specified genomic locus. For example, for a newly identified cis-regulatory element bound by a transcription factor, questions that immediately come to mind include whether the element is near a transcriptional start site and, if so, the name of the corresponding gene, and whether the histones or DNA at the locus are modified. The UCSC Genome Browser organizes data and annotations (called tracks) around the reference sequences or draft assemblies of many eukaryotic genomes and presents them using a powerful web-based graphical interface. This protocol describes how to use the UCSC Genome Browser to visualize selected tracks at specified genomic regions, download the data and annotations for further analysis, and retrieve multiple sequence alignments and their conservation scores.

  16. Human Genome Annotation

    NASA Astrophysics Data System (ADS)

    Gerstein, Mark

    A central problem for 21st century science is annotating the human genome and making this annotation useful for the interpretation of personal genomes. My talk will focus on annotating the 99% of the genome that does not code for canonical genes, concentrating on intergenic features such as structural variants (SVs), pseudogenes (protein fossils), binding sites, and novel transcribed RNAs (ncRNAs). In particular, I will describe how we identify regulatory sites and variable blocks (SVs) based on processing next-generation sequencing experiments. I will further explain how we cluster together groups of sites to create larger annotations. Next, I will discuss a comprehensive pseudogene identification pipeline, which has enabled us to identify >10K pseudogenes in the genome and analyze their distribution with respect to age, protein family, and chromosomal location. Throughout, I will try to introduce some of the computational algorithms and approaches that are required for genome annotation. Much of this work has been carried out in the framework of the ENCODE, modENCODE, and 1000 genomes projects.

  17. Improving wilderness stewardship through searchable databases of U.S. legislative history and legislated special provisions

    Treesearch

    David R. Craig; Peter Landres; Laurie Yung

    2010-01-01

    The online resource Wilderness.net currently provides quick access to the text of every public law designating wilderness in the U.S. National Wilderness Preservation System (NWPS). This article describes two new searchable databases recently completed and added to the information available on Wilderness.net to help wilderness managers and others understand and...

  18. Improving Database Vendors' Usage Statistics Reporting through Collaboration between Libraries and Vendors.

    ERIC Educational Resources Information Center

    Shim, Wonsik; McClure, Charles R.

    2002-01-01

    Reports results from the Association of Research Libraries (ARL) E-Metrics study to investigate issues associated with the usage statistics provided by database vendors. Considers standardization of key usage statistics and reporting format and validation of reported statistics; and offers recommendations for libraries and calls for continuous…

  19. Improving Child Outcomes with Data-Based Decision Making: Collecting Data

    ERIC Educational Resources Information Center

    Hojnoski, Robin L.; Gischlar, Karen L.; Missall, Kristen N.

    2009-01-01

    Collecting and graphing performance data are important parts of the educational process. Such procedures help educators, caregivers, and other important stakeholders make data-based decisions to accelerate child progress. With school-age children, collecting and graphing data have been associated with more frequent instructional changes to better…

  20. The UCSC Genome Browser database: 2016 update

    PubMed Central

    Speir, Matthew L.; Zweig, Ann S.; Rosenbloom, Kate R.; Raney, Brian J.; Paten, Benedict; Nejad, Parisa; Lee, Brian T.; Learned, Katrina; Karolchik, Donna; Hinrichs, Angie S.; Heitner, Steve; Harte, Rachel A.; Haeussler, Maximilian; Guruvadoo, Luvina; Fujita, Pauline A.; Eisenhart, Christopher; Diekhans, Mark; Clawson, Hiram; Casper, Jonathan; Barber, Galt P.; Haussler, David; Kuhn, Robert M.; Kent, W. James

    2016-01-01

    For the past 15 years, the UCSC Genome Browser (http://genome.ucsc.edu/) has served the international research community by offering an integrated platform for viewing and analyzing information from a large database of genome assemblies and their associated annotations. The UCSC Genome Browser has been under continuous development since its inception with new data sets and software features added frequently. Some release highlights of this year include new and updated genome browsers for various assemblies, including bonobo and zebrafish; new gene annotation sets; improvements to track and assembly hub support; and a new interactive tool, the “Data Integrator”, for intersecting data from multiple tracks. We have greatly expanded the data sets available on the most recent human assembly, hg38/GRCh38, to include updated gene prediction sets from GENCODE, more phenotype- and disease-associated variants from ClinVar and ClinGen, more genomic regulatory data, and a new multiple genome alignment. PMID:26590259

  1. The UCSC Genome Browser database: 2016 update.

    PubMed

    Speir, Matthew L; Zweig, Ann S; Rosenbloom, Kate R; Raney, Brian J; Paten, Benedict; Nejad, Parisa; Lee, Brian T; Learned, Katrina; Karolchik, Donna; Hinrichs, Angie S; Heitner, Steve; Harte, Rachel A; Haeussler, Maximilian; Guruvadoo, Luvina; Fujita, Pauline A; Eisenhart, Christopher; Diekhans, Mark; Clawson, Hiram; Casper, Jonathan; Barber, Galt P; Haussler, David; Kuhn, Robert M; Kent, W James

    2016-01-04

    For the past 15 years, the UCSC Genome Browser (http://genome.ucsc.edu/) has served the international research community by offering an integrated platform for viewing and analyzing information from a large database of genome assemblies and their associated annotations. The UCSC Genome Browser has been under continuous development since its inception with new data sets and software features added frequently. Some release highlights of this year include new and updated genome browsers for various assemblies, including bonobo and zebrafish; new gene annotation sets; improvements to track and assembly hub support; and a new interactive tool, the "Data Integrator", for intersecting data from multiple tracks. We have greatly expanded the data sets available on the most recent human assembly, hg38/GRCh38, to include updated gene prediction sets from GENCODE, more phenotype- and disease-associated variants from ClinVar and ClinGen, more genomic regulatory data, and a new multiple genome alignment.

  2. IUPHAR-DB: updated database content and new features.

    PubMed

    Sharman, Joanna L; Benson, Helen E; Pawson, Adam J; Lukito, Veny; Mpamhanga, Chidochangu P; Bombail, Vincent; Davenport, Anthony P; Peters, John A; Spedding, Michael; Harmar, Anthony J

    2013-01-01

    The International Union of Basic and Clinical Pharmacology (IUPHAR) database, IUPHAR-DB (http://www.iuphar-db.org) is an open access, online database providing detailed, expert-driven annotation of the primary literature on human and rodent receptors and other drug targets, together with the substances that act on them. The present release includes information on the products of 646 genes from four major protein classes (G protein-coupled receptors, nuclear hormone receptors, voltage- and ligand-gated ion channels) and ∼3180 bioactive molecules (endogenous ligands, licensed drugs and key pharmacological tools) that interact with them. We have described previously the classification and curation of data for small molecule ligands in the database; in this update we have annotated 366 endogenous peptide ligands with their amino acid sequences, post-translational modifications, links to precursor genes, species differences and relationships with other molecules in the database (e.g. those derived from the same precursor). We have also matched targets with their endogenous ligands (peptides and small molecules), with particular attention paid to identifying bioactive peptide ligands generated by post-translational modification of precursor proteins. Other improvements to the database include enhanced information on the clinical relevance of targets and ligands in the database, more extensive links to other databases and a pilot project for the curation of enzymes as drug targets.

  3. An annotation system for 3D fluid flow visualization

    NASA Technical Reports Server (NTRS)

    Loughlin, Maria M.; Hughes, John F.

    1995-01-01

    Annotation is a key activity of data analysis. However, current systems for data analysis focus almost exclusively on visualization. We propose a system which integrates annotations into a visualization system. Annotations are embedded in 3D data space, using the Post-it metaphor. This embedding allows contextual-based information storage and retrieval, and facilitates information sharing in collaborative environments. We provide a traditional database filter and a Magic Lens filter to create specialized views of the data. The system has been customized for fluid flow applications, with features which allow users to store parameters of visualization tools and sketch 3D volumes.

  4. Gene ontology annotation by density and gravitation models.

    PubMed

    Hou, Wen-Juan; Lin, Kevin Hsin-Yih; Chen, Hsin-Hsi

    2006-01-01

    Gene Ontology (GO) is developed to provide standard vocabularies of gene products in different databases. The process of annotating GO terms to genes requires curators to read through lengthy articles. Methods for speeding up or automating the annotation process are thus of great importance. We propose a GO annotation approach using full-text biomedical documents for directing more relevant papers to curators. This system explores word density and gravitation relationships between genes and GO terms. Different density and gravitation models are built and several evaluation criteria are employed to assess the effects of the proposed methods.

  5. The ChEMBL database in 2017.

    PubMed

    Gaulton, Anna; Hersey, Anne; Nowotka, Michał; Bento, A Patrícia; Chambers, Jon; Mendez, David; Mutowo, Prudence; Atkinson, Francis; Bellis, Louisa J; Cibrián-Uhalte, Elena; Davies, Mark; Dedman, Nathan; Karlsson, Anneli; Magariños, María Paula; Overington, John P; Papadatos, George; Smit, Ines; Leach, Andrew R

    2017-01-04

    ChEMBL is an open large-scale bioactivity database (https://www.ebi.ac.uk/chembl), previously described in the 2012 and 2014 Nucleic Acids Research Database Issues. Since then, alongside the continued extraction of data from the medicinal chemistry literature, new sources of bioactivity data have also been added to the database. These include: deposited data sets from neglected disease screening; crop protection data; drug metabolism and disposition data and bioactivity data from patents. A number of improvements and new features have also been incorporated. These include the annotation of assays and targets using ontologies, the inclusion of targets and indications for clinical candidates, addition of metabolic pathways for drugs and calculation of structural alerts. The ChEMBL data can be accessed via a web-interface, RDF distribution, data downloads and RESTful web-services.

  6. The ChEMBL database in 2017

    PubMed Central

    Gaulton, Anna; Hersey, Anne; Nowotka, Michał; Bento, A. Patrícia; Chambers, Jon; Mendez, David; Mutowo, Prudence; Atkinson, Francis; Bellis, Louisa J.; Cibrián-Uhalte, Elena; Davies, Mark; Dedman, Nathan; Karlsson, Anneli; Magariños, María Paula; Overington, John P.; Papadatos, George; Smit, Ines; Leach, Andrew R.

    2017-01-01

    ChEMBL is an open large-scale bioactivity database (https://www.ebi.ac.uk/chembl), previously described in the 2012 and 2014 Nucleic Acids Research Database Issues. Since then, alongside the continued extraction of data from the medicinal chemistry literature, new sources of bioactivity data have also been added to the database. These include: deposited data sets from neglected disease screening; crop protection data; drug metabolism and disposition data and bioactivity data from patents. A number of improvements and new features have also been incorporated. These include the annotation of assays and targets using ontologies, the inclusion of targets and indications for clinical candidates, addition of metabolic pathways for drugs and calculation of structural alerts. The ChEMBL data can be accessed via a web-interface, RDF distribution, data downloads and RESTful web-services. PMID:27899562

  7. Injectors and Annotations

    NASA Technical Reports Server (NTRS)

    Filman, Robert E.

    2004-01-01

    In a previous paper, we presented the Object Infrastructure Framework. The goal of that system is to simplify the creation of distributed applications. The primary claim of that work is that non-functional 'ilities' could be achieved by controlling and manipulating the communications between components, thereby simplifying the development of distributed systems. A secondary element of that paper is to argue for extending the conventional distributed objects model in two important ways: 1) The ability to insert injectors (filters, wrappers) into the communication path between components; 2) The ability to annotate communications with additional information, and to propagate these annotations through an application. Here we express the descriptions of that paper.

  8. The Genopolis Microarray Database

    PubMed Central

    Splendiani, Andrea; Brandizi, Marco; Even, Gael; Beretta, Ottavio; Pavelka, Norman; Pelizzola, Mattia; Mayhaus, Manuel; Foti, Maria; Mauri, Giancarlo; Ricciardi-Castagnoli, Paola

    2007-01-01

    Background Gene expression databases are key resources for microarray data management and analysis and the importance of a proper annotation of their content is well understood. Public repositories as well as microarray database systems that can be implemented by single laboratories exist. However, there is not yet a tool that can easily support a collaborative environment where different users with different rights of access to data can interact to define a common highly coherent content. The scope of the Genopolis database is to provide a resource that allows different groups performing microarray experiments related to a common subject to create a common coherent knowledge base and to analyse it. The Genopolis database has been implemented as a dedicated system for the scientific community studying dendritic and macrophage cells functions and host-parasite interactions. Results The Genopolis Database system allows the community to build an object based MIAME compliant annotation of their experiments and to store images, raw and processed data from the Affymetrix GeneChip® platform. It supports dynamical definition of controlled vocabularies and provides automated and supervised steps to control the coherence of data and annotations. It allows a precise control of the visibility of the database content to different sub groups in the community and facilitates exports of its content to public repositories. It provides an interactive users interface for data analysis: this allows users to visualize data matrices based on functional lists and sample characterization, and to navigate to other data matrices defined by similarity of expression values as well as functional characterizations of genes involved. A collaborative environment is also provided for the definition and sharing of functional annotation by users. Conclusion The Genopolis Database supports a community in building a common coherent knowledge base and analyse it. This fills a gap between a local

  9. MIPS: analysis and annotation of genome information in 2007.

    PubMed

    Mewes, H W; Dietmann, S; Frishman, D; Gregory, R; Mannhaupt, G; Mayer, K F X; Münsterkötter, M; Ruepp, A; Spannagl, M; Stümpflen, V; Rattei, T

    2008-01-01

    The Munich Information Center for Protein Sequences (MIPS-GSF, Neuherberg, Germany) combines automatic processing of large amounts of sequences with manual annotation of selected model genomes. Due to the massive growth of the available data, the depth of annotation varies widely between independent databases. Also, the criteria for the transfer of information from known to orthologous sequences are diverse. To cope with the task of global in-depth genome annotation has become unfeasible. Therefore, our efforts are dedicated to three levels of annotation: (i) the curation of selected genomes, in particular from fungal and plant taxa (e.g. CYGD, MNCDB, MatDB), (ii) the comprehensive, consistent, automatic annotation employing exhaustive methods for the computation of sequence similarities and sequence-related attributes as well as the classification of individual sequences (SIMAP, PEDANT and FunCat) and (iii) the compilation of manually curated databases for protein interactions based on scrutinized information from the literature to serve as an accepted set of reliable annotated interaction data (MPACT, MPPI, CORUM). All databases and tools described as well as the detailed descriptions of our projects can be accessed through the MIPS web server (http://mips.gsf.de).

  10. SweetSEQer, Simple de Novo Filtering and Annotation of Glycoconjugate Mass Spectra

    PubMed Central

    Serang, Oliver; Froehlich, John W.; Muntel, Jan; McDowell, Gary; Steen, Hanno; Lee, Richard S.; Steen, Judith A.

    2013-01-01

    The past 15 years have seen significant progress in LC-MS/MS peptide sequencing, including the advent of successful de novo and database search methods; however, analysis of glycopeptide and, more generally, glycoconjugate spectra remains a much more open problem, and much annotation is still performed manually. This is partly because glycans, unlike peptides, need not be linear chains and are instead described by trees. In this study, we introduce SweetSEQer, an extremely simple open source tool for identifying potential glycopeptide MS/MS spectra. We evaluate SweetSEQer on manually curated glycoconjugate spectra and on negative controls, and we demonstrate high quality filtering that can be easily improved for specific applications. We also demonstrate a high overlap between peaks annotated by experts and peaks annotated by SweetSEQer, as well as demonstrate inferred glycan graphs consistent with canonical glycan tree motifs. This study presents a novel tool for annotating spectra and producing glycan graphs from LC-MS/MS spectra. The tool is evaluated and shown to perform similarly to an expert on manually curated data. PMID:23443135

  11. Gene3D: Multi-domain annotations for protein sequence and comparative genome analysis.

    PubMed

    Lees, Jonathan G; Lee, David; Studer, Romain A; Dawson, Natalie L; Sillitoe, Ian; Das, Sayoni; Yeats, Corin; Dessailly, Benoit H; Rentzsch, Robert; Orengo, Christine A

    2014-01-01

    Gene3D (http://gene3d.biochem.ucl.ac.uk) is a database of protein domain structure annotations for protein sequences. Domains are predicted using a library of profile HMMs from 2738 CATH superfamilies. Gene3D assigns domain annotations to Ensembl and UniProt sequence sets including >6000 cellular genomes and >20 million unique protein sequences. This represents an increase of 45% in the number of protein sequences since our last publication. Thanks to improvements in the underlying data and pipeline, we see large increases in the domain coverage of sequences. We have expanded this coverage by integrating Pfam and SUPERFAMILY domain annotations, and we now resolve domain overlaps to provide highly comprehensive composite multi-domain architectures. To make these data more accessible for comparative genome analyses, we have developed novel search algorithms for searching genomes to identify related multi-domain architectures. In addition to providing domain family annotations, we have now developed a pipeline for 3D homology modelling of domains in Gene3D. This has been applied to the human genome and will be rolled out to other major organisms over the next year.

  12. The GATO gene annotation tool for research laboratories.

    PubMed

    Fujita, A; Massirer, K B; Durham, A M; Ferreira, C E; Sogayar, M C

    2005-11-01

    Large-scale genome projects have generated a rapidly increasing number of DNA sequences. Therefore, development of computational methods to rapidly analyze these sequences is essential for progress in genomic research. Here we present an automatic annotation system for preliminary analysis of DNA sequences. The gene annotation tool (GATO) is a Bioinformatics pipeline designed to facilitate routine functional annotation and easy access to annotated genes. It was designed in view of the frequent need of genomic researchers to access data pertaining to a common set of genes. In the GATO system, annotation is generated by querying some of the Web-accessible resources and the information is stored in a local database, which keeps a record of all previous annotation results. GATO may be accessed from everywhere through the internet or may be run locally if a large number of sequences are going to be annotated. It is implemented in PHP and Perl and may be run on any suitable Web server. Usually, installation and application of annotation systems require experience and are time consuming, but GATO is simple and practical, allowing anyone with basic skills in informatics to access it without any special training. GATO can be downloaded at [http://mariwork.iq.usp.br/gato/]. Minimum computer free space required is 2 MB.

  13. Handling of missing data to improve the mining of large feed databases.

    PubMed

    Maroto-Molina, F; Gómez-Cabrera, A; Guerrero-Ginel, J E; Garrido-Varo, A; Sauvant, D; Tran, G; Heuzé, V; Pérez-Marín, D C

    2013-01-01

    Feed databases often have missing data. Despite their potentially major effect on data analysis (e.g., as a source of biased results and loss of statistical power), database managers and nutrition researchers have paid little attention to missing data. This study evaluated various methods of handling missing data using mining outputs from a database containing data on chemical composition and nutritive value for 18,864 alfalfa samples. A complete reference dataset was obtained comprising the 2,303 cases with no missing data for the attributes CP, crude fiber (CF), NDF, ADF and ADL. This dataset was used to simulate 2 types of missing data (at random and not at random), each with 2 loss intensities (33 and 66%), thus yielding a total of 4 incomplete datasets. Missing data from these datasets were handled using 2 deletion methods and 4 imputation methods, and outputs in terms of the identification and typing of alfalfa (using ANOVA and descriptive statistics) and of correlations between attributes (using regressions) were compared with outputs from the complete dataset. Imputation methods, particularly model-based versions, were found to perform better than deletion methods in terms of maximizing information use and minimizing bias although the extent of differences between methods depended on the type of missing data. The best approximation to the uncertainty value was provided by multiple imputation methods. It was concluded that the choice of the most suitable method for handling missing data depended both on the type of missing data and on the purpose of data analysis.

  14. Global Mapping of Traditional Chinese Medicine into Bioactivity Space and Pathways Annotation Improves Mechanistic Understanding and Discovers Relationships between Therapeutic Action (Sub)classes

    PubMed Central

    Mohamad Zobir, Siti Zuraidah; Mohd Fauzi, Fazlin; Liggi, Sonia; Drakakis, Georgios; Fu, Xianjun; Fan, Tai-Ping; Bender, Andreas

    2016-01-01

    Traditional Chinese medicine (TCM) still needs more scientific rationale to be proven for it to be accepted further in the West. We are now in the position to propose computational hypotheses for the mode-of-actions (MOAs) of 45 TCM therapeutic action (sub)classes from in silico target prediction algorithms, whose target was later annotated with Kyoto Encyclopedia of Genes and Genomes pathway, and to discover the relationship between them by generating a hierarchical clustering. The results of 10,749 TCM compounds showed 183 enriched targets and 99 enriched pathways from Estimation Score ≤ 0 and ≥ 5% of compounds/targets in a (sub)class. The MOA of a (sub)class was established from supporting literature. Overall, the most frequent top three enriched targets/pathways were immune-related targets such as tyrosine-protein phosphatase nonreceptor type 2 (PTPN2) and digestive system such as mineral absorption. We found two major protein families, G-protein coupled receptor (GPCR), and protein kinase family contributed to the diversity of the bioactivity space, while digestive system was consistently annotated pathway motif, which agreed with the important treatment principle of TCM, “the foundation of acquired constitution” that includes spleen and stomach. In short, the TCM (sub)classes, in many cases share similar targets/pathways despite having different indications. PMID:26989424

  15. Global Mapping of Traditional Chinese Medicine into Bioactivity Space and Pathways Annotation Improves Mechanistic Understanding and Discovers Relationships between Therapeutic Action (Sub)classes.

    PubMed

    Mohamad Zobir, Siti Zuraidah; Mohd Fauzi, Fazlin; Liggi, Sonia; Drakakis, Georgios; Fu, Xianjun; Fan, Tai-Ping; Bender, Andreas

    2016-01-01

    Traditional Chinese medicine (TCM) still needs more scientific rationale to be proven for it to be accepted further in the West. We are now in the position to propose computational hypotheses for the mode-of-actions (MOAs) of 45 TCM therapeutic action (sub)classes from in silico target prediction algorithms, whose target was later annotated with Kyoto Encyclopedia of Genes and Genomes pathway, and to discover the relationship between them by generating a hierarchical clustering. The results of 10,749 TCM compounds showed 183 enriched targets and 99 enriched pathways from Estimation Score ≤ 0 and ≥ 5% of compounds/targets in a (sub)class. The MOA of a (sub)class was established from supporting literature. Overall, the most frequent top three enriched targets/pathways were immune-related targets such as tyrosine-protein phosphatase nonreceptor type 2 (PTPN2) and digestive system such as mineral absorption. We found two major protein families, G-protein coupled receptor (GPCR), and protein kinase family contributed to the diversity of the bioactivity space, while digestive system was consistently annotated pathway motif, which agreed with the important treatment principle of TCM, "the foundation of acquired constitution" that includes spleen and stomach. In short, the TCM (sub)classes, in many cases share similar targets/pathways despite having different indications.

  16. Assembly and annotation of a non-model gastropod (Nerita melanotragus) transcriptome: a comparison of de novo assemblers.

    PubMed

    Amin, Shorash; Prentis, Peter J; Gilding, Edward K; Pavasovic, Ana

    2014-08-01

    The sequencing, de novo assembly and annotation of transcriptome datasets generated with next generation sequencing (NGS) has enabled biologists to answer genomic questions in non-model species with unprecedented ease. Reliable and accurate de novo assembly and annotation of transcriptomes, however, is a critically important step for transcriptome assemblies generated from short read sequences. Typical benchmarks for assembly and annotation reliability have been performed with model species. To address the reliability and accuracy of de novo transcriptome assembly in non-model species, we generated an RNAseq dataset for an intertidal gastropod mollusc species, Nerita melanotragus, and compared the assembly produced by four different de novo transcriptome assemblers; Velvet, Oases, Geneious and Trinity, for a number of quality metrics and redundancy. Transcriptome sequencing on the Ion Torrent PGM™ produced 1,883,624 raw reads with a mean length of 133 base pairs (bp). Both the Trinity and Oases de novo assemblers produced the best assemblies based on all quality metrics including fewer contigs, increased N50 and average contig length and contigs of greater length. Overall the BLAST and annotation success of our assemblies was not high with only 15-19% of contigs assigned a putative function. We believe that any improvement in annotation success of gastropod species will require more gastropod genome sequences, but in particular an increase in mollusc protein sequences in public databases. Overall, this paper demonstrates that reliable and accurate de novo transcriptome assemblies can be generated from short read sequencers with the right assembly algorithms.

  17. Mouse genome database 2016

    PubMed Central

    Bult, Carol J.; Eppig, Janan T.; Blake, Judith A.; Kadin, James A.; Richardson, Joel E.

    2016-01-01

    The Mouse Genome Database (MGD; http://www.informatics.jax.org) is the primary community model organism database for the laboratory mouse and serves as the source for key biological reference data related to mouse genes, gene functions, phenotypes and disease models with a strong emphasis on the relationship of these data to human biology and disease. As the cost of genome-scale sequencing continues to decrease and new technologies for genome editing become widely adopted, the laboratory mouse is more important than ever as a model system for understanding the biological significance of human genetic variation and for advancing the basic research needed to support the emergence of genome-guided precision medicine. Recent enhancements to MGD include new graphical summaries of biological annotations for mouse genes, support for mobile access to the database, tools to support the annotation and analysis of sets of genes, and expanded support for comparative biology through the expansion of homology data. PMID:26578600

  18. tagtog: interactive and text-mining-assisted annotation of gene mentions in PLOS full-text articles.

    PubMed

    Cejuela, Juan Miguel; McQuilton, Peter; Ponting, Laura; Marygold, Steven J; Stefancsik, Raymund; Millburn, Gillian H; Rost, Burkhard

    2014-01-01

    The breadth and depth of biomedical literature are increasing year upon year. To keep abreast of these increases, FlyBase, a database for Drosophila genomic and genetic information, is constantly exploring new ways to mine the published literature to increase the efficiency and accuracy of manual curation and to automate some aspects, such as triaging and entity extraction. Toward this end, we present the 'tagtog' system, a web-based annotation framework that can be used to mark up biological entities (such as genes) and concepts (such as Gene Ontology terms) in full-text articles. tagtog leverages manual user annotation in combination with automatic machine-learned annotation to provide accurate identification of gene symbols and gene names. As part of the BioCreative IV Interactive Annotation Task, FlyBase has used tagtog to identify and extract mentions of Drosophila melanogaster gene symbols and names in full-text biomedical articles from the PLOS stable of journals. We show here the results of three experiments with different sized corpora and assess gene recognition performance and curation speed. We conclude that tagtog-named entity recognition improves with a larger corpus and that tagtog-assisted curation is quicker than manual curation. DATABASE URL: www.tagtog.net, www.flybase.org.

  19. Cheating. An Annotated Bibliography.

    ERIC Educational Resources Information Center

    Wildemuth, Barbara M., Comp.

    This 89-item, annotated bibliography was compiled to provide access to research and discussions of cheating and, specifically, cheating on tests. It is not limited to any educational level, nor is it confined to any specific curriculum area. Two data bases were searched by computer, and a library search was conducted. A computer search of the…

  20. Automated Microbial Genome Annotation

    SciTech Connect

    Land, Miriam

    2009-05-29

    Miriam Land of the DOE Joint Genome Institute at Oak Ridge National Laboratory gives a talk on the current state and future challenges of moving toward automated microbial genome annotation at the "Sequencing, Finishing, Analysis in the Future" meeting in Santa Fe, NM