Science.gov

Sample records for annotation databases improves

  1. Considerations to improve functional annotations in biological databases.

    PubMed

    Benítez-Páez, Alfonso

    2009-12-01

    Despite the great effort to design efficient systems allowing the electronic indexation of information concerning genes, proteins, structures, and interactions published daily in scientific journals, some problems are still observed in specific tasks such as functional annotation. The annotation of function is a critical issue for bioinformatic routines, such as functional genomics and the prediction of unknown protein function, which are highly dependent on the quality of existing annotations. Some information management systems evolve to efficiently incorporate information from large-scale projects, but annotation of single records from the literature is often difficult and slow. In this short report, functional characterizations of a representative sample of the entire set of uncharacterized proteins from Escherichia coli K12 were compiled from Swiss-Prot, PubMed, and EcoCyc; they demonstrate a functional annotation deficit in biological databases. Some issues are postulated as causes of the lack of annotation, and different solutions are evaluated and proposed to avoid them. The hope is that as a consequence of these observations, there will be new impetus to improve the speed and quality of functional annotation and ultimately provide updated, reliable information to the scientific community.

  2. Incorporating the human gene annotations in different databases significantly improved transcriptomic and genetic analyses.

    PubMed

    Chen, Geng; Wang, Charles; Shi, Leming; Qu, Xiongfei; Chen, Jiwei; Yang, Jianmin; Shi, Caiping; Chen, Long; Zhou, Peiying; Ning, Baitang; Tong, Weida; Shi, Tieliu

    2013-04-01

    Human gene annotation is crucial for conducting transcriptomic and genetic studies; however, the impacts of human gene annotations in diverse databases on related studies have been less evaluated. To enable full use of various human annotation resources and better understand the human transcriptome, here we systematically compare the human annotations present in RefSeq, Ensembl (GENCODE), and AceView on diverse transcriptomic and genetic analyses. We found that the human gene annotations in the three databases are far from complete. Although Ensembl and AceView annotated more genes than RefSeq, more than 15,800 genes from Ensembl (or AceView) are within the intergenic and intronic regions of AceView (or Ensembl) annotation. The human transcriptome annotations in RefSeq, Ensembl, and AceView had distinct effects on short-read mapping, gene and isoform expression profiling, and differential expression calling. Furthermore, our findings indicate that the integrated annotation of these databases can obtain a more complete gene set and significantly enhance those transcriptomic analyses. We also observed that many more known SNPs were located within genes annotated in Ensembl and AceView than in RefSeq. In particular, 1033 of 3041 trait/disease-associated SNPs involved in about 200 human traits/diseases that were previously reported to be in RefSeq intergenic regions could be relocated within Ensembl and AceView genes. Our findings illustrate that a more complete transcriptome generated by incorporating human gene annotations in diverse databases can strikingly improve the overall results of transcriptomic and genetic studies.
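    The relocation of SNPs described in this abstract (an SNP that is intergenic under one annotation but falls inside a gene under another) comes down to a point-in-interval lookup. The following is a minimal sketch, not the authors' pipeline: the function names, the toy coordinates and the assumption of non-overlapping genes on a single chromosome are all illustrative.

```python
from bisect import bisect_right


def build_index(genes):
    """Sort (start, end, name) gene intervals; coordinates are 1-based inclusive.

    Assumption: genes do not overlap and all lie on one chromosome.
    """
    ordered = sorted(genes)
    starts = [start for start, _end, _name in ordered]
    return starts, ordered


def gene_at(index, position):
    """Return the name of the gene containing `position`, or None if intergenic."""
    starts, ordered = index
    i = bisect_right(starts, position) - 1  # last gene starting at or before position
    if i >= 0:
        start, end, name = ordered[i]
        if start <= position <= end:
            return name
    return None


def relocated_snps(snps, narrow_genes, broad_genes):
    """SNPs that are intergenic in the narrow annotation but genic in the broad one."""
    narrow = build_index(narrow_genes)
    broad = build_index(broad_genes)
    return {
        snp: gene_at(broad, pos)
        for snp, pos in snps.items()
        if gene_at(narrow, pos) is None and gene_at(broad, pos) is not None
    }


# Hypothetical toy data: the broader annotation adds one gene.
narrow_genes = [(100, 200, "geneA")]
broad_genes = [(100, 200, "geneA"), (300, 400, "geneB")]
snps = {"snp1": 150, "snp2": 350, "snp3": 500}

relocated = relocated_snps(snps, narrow_genes, broad_genes)
print(relocated)
```

    With real annotations the same lookup would be run per chromosome, and overlapping gene models would call for an interval tree rather than a sorted list.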

  3. STANDARDIZATION AND STRUCTURAL ANNOTATION OF PUBLIC TOXICITY DATABASES: IMPROVING SAR CAPABILITIES AND LINKAGE TO 'OMICS DATA

    EPA Science Inventory

    Standardization and structural annotation of public toxicity databases: Improving SAR capabilities and linkage to 'omics data
    Ann M. Richard1, ClarLynda Williams1, Jamie Burch2
    1Nat Health & Environ Res Lab, US EPA, RTP, NC 27711; 2EPA/NC Central Univ Student COOP Trainee...

  4. The Aspergillus Genome Database: multispecies curation and incorporation of RNA-Seq data to improve structural gene annotations

    PubMed Central

    Cerqueira, Gustavo C.; Arnaud, Martha B.; Inglis, Diane O.; Skrzypek, Marek S.; Binkley, Gail; Simison, Matt; Miyasato, Stuart R.; Binkley, Jonathan; Orvis, Joshua; Shah, Prachi; Wymore, Farrell; Sherlock, Gavin; Wortman, Jennifer R.

    2014-01-01

    The Aspergillus Genome Database (AspGD; http://www.aspgd.org) is a freely available web-based resource that was designed for Aspergillus researchers and is also a valuable source of information for the entire fungal research community. In addition to being a repository and central point of access to genome, transcriptome and polymorphism data, AspGD hosts a comprehensive comparative genomics toolbox that facilitates the exploration of precomputed orthologs among the 20 currently available Aspergillus genomes. AspGD curators perform gene product annotation based on review of the literature for four key Aspergillus species: Aspergillus nidulans, Aspergillus oryzae, Aspergillus fumigatus and Aspergillus niger. We have iteratively improved the structural annotation of Aspergillus genomes through the analysis of publicly available transcription data, mostly expressed sequence tags, as described in a previous NAR Database article (Arnaud et al. 2012). In this update, we report substantive structural annotation improvements for A. nidulans, A. oryzae and A. fumigatus genomes based on recently available RNA-Seq data. Over 26 000 loci were updated across these species; although those primarily comprise the addition and extension of untranslated regions (UTRs), the new analysis also enabled over 1000 modifications affecting the coding sequence of genes in each target genome. PMID:24194595

  5. Mining GO annotations for improving annotation consistency.

    PubMed

    Faria, Daniel; Schlicker, Andreas; Pesquita, Catia; Bastos, Hugo; Ferreira, António E N; Albrecht, Mario; Falcão, André O

    2012-01-01

    Despite the structure and objectivity provided by the Gene Ontology (GO), the annotation of proteins is a complex task that is subject to errors and inconsistencies. Electronically inferred annotations in particular are widely considered unreliable. However, given that manual curation of all GO annotations is unfeasible, it is imperative to improve the quality of electronically inferred annotations. In this work, we analyze the full GO molecular function annotation of UniProtKB proteins, and discuss some of the issues that affect their quality, focusing particularly on the lack of annotation consistency. Based on our analysis, we estimate that 64% of the UniProtKB proteins are incompletely annotated, and that inconsistent annotations affect 83% of the protein functions and at least 23% of the proteins. Additionally, we present and evaluate a data mining algorithm, based on the association rule learning methodology, for identifying implicit relationships between molecular function terms. The goal of this algorithm is to assist GO curators in updating GO and correcting and preventing inconsistent annotations. Our algorithm predicted 501 relationships with an estimated precision of 94%, whereas the basic association rule learning methodology predicted 12,352 relationships with a precision below 9%.
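    The "basic association rule learning methodology" that this abstract uses as a baseline can be illustrated with a small support/confidence computation over co-annotated terms. This is a hedged sketch of the generic technique only, not the authors' refined algorithm; the function name, the thresholds and the placeholder term and protein identifiers are assumptions made for illustration.

```python
from collections import Counter
from itertools import permutations


def mine_pair_rules(annotations, min_support=2, min_confidence=0.8):
    """Find rules "a implies b" between terms annotated to the same protein.

    annotations: dict mapping a protein ID to a set of term IDs.
    support    = number of proteins annotated with both a and b
    confidence = support / number of proteins annotated with a
    """
    term_counts = Counter()
    pair_counts = Counter()
    for terms in annotations.values():
        unique = set(terms)
        term_counts.update(unique)
        # Ordered pairs, so that a->b and b->a are scored separately.
        pair_counts.update(permutations(sorted(unique), 2))

    rules = []
    for (a, b), support in pair_counts.items():
        confidence = support / term_counts[a]
        if support >= min_support and confidence >= min_confidence:
            rules.append((a, b, support, confidence))
    return sorted(rules)


# Hypothetical toy annotation set (placeholder IDs, not real GO terms).
toy = {
    "P1": {"term_A", "term_B"},
    "P2": {"term_A", "term_B"},
    "P3": {"term_A", "term_B", "term_C"},
    "P4": {"term_B"},
}

rules = mine_pair_rules(toy)
for a, b, support, confidence in rules:
    print(f"{a} -> {b}: support={support}, confidence={confidence:.2f}")
```

    In the toy data every protein carrying term_A also carries term_B, so only that rule passes both thresholds; the reverse rule fails because P4 has term_B alone. A protein annotated with term_A but not term_B would then be a candidate for an incomplete or inconsistent annotation.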

  6. Discovering gene annotations in biomedical text databases

    PubMed Central

    Cakmak, Ali; Ozsoyoglu, Gultekin

    2008-01-01

    Background: Genes and gene products are frequently annotated with Gene Ontology concepts based on the evidence provided in genomics articles. Manually locating and curating information about a genomic entity from the biomedical literature requires vast amounts of human effort. Hence, there is clearly a need for automated computational tools to annotate the genes and gene products with Gene Ontology concepts by computationally capturing the related knowledge embedded in textual data. Results: In this article, we present an automated genomic entity annotation system, GEANN, which extracts information about the characteristics of genes and gene products in article abstracts from PubMed, and translates the discovered knowledge into Gene Ontology (GO) concepts, a widely-used standardized vocabulary of genomic traits. GEANN utilizes textual "extraction patterns", and a semantic matching framework to locate phrases matching a pattern and produce Gene Ontology annotations for genes and gene products. In our experiments, GEANN reached a precision of 78% at a recall of 61%. On a select set of Gene Ontology concepts, GEANN either outperforms or is comparable to two other automated annotation studies. Use of WordNet for semantic pattern matching improves the precision and recall by 24% and 15%, respectively, and the improvement due to semantic pattern matching becomes more apparent as the Gene Ontology terms become more general. Conclusion: GEANN is useful for two distinct purposes: (i) automating the annotation of genomic entities with Gene Ontology concepts, and (ii) providing existing annotations with additional "evidence articles" from the literature. The use of textual extraction patterns that are constructed based on the existing annotations achieves high precision. The semantic pattern matching framework provides a more flexible pattern matching scheme with respect to "exact matching" with the advantage of locating approximate pattern occurrences with

  7. Bovine Genome Database: integrated tools for genome annotation and discovery.

    PubMed

    Childers, Christopher P; Reese, Justin T; Sundaram, Jaideep P; Vile, Donald C; Dickens, C Michael; Childs, Kevin L; Salih, Hanni; Bennett, Anna K; Hagen, Darren E; Adelson, David L; Elsik, Christine G

    2011-01-01

    The Bovine Genome Database (BGD; http://BovineGenome.org) strives to improve annotation of the bovine genome and to integrate the genome sequence with other genomics data. BGD includes GBrowse genome browsers, the Apollo Annotation Editor, a quantitative trait loci (QTL) viewer, BLAST databases and gene pages. Genome browsers, available for both scaffold and chromosome coordinate systems, display the bovine Official Gene Set (OGS), RefSeq and Ensembl gene models, non-coding RNA, repeats, pseudogenes, single-nucleotide polymorphism, markers, QTL and alignments to complementary DNAs, ESTs and protein homologs. The Bovine QTL viewer is connected to the BGD Chromosome GBrowse, allowing for the identification of candidate genes underlying QTL. The Apollo Annotation Editor connects directly to the BGD Chado database to provide researchers with remote access to gene evidence in a graphical interface that allows editing and creating new gene models. Researchers may upload their annotations to the BGD server for review and integration into the subsequent release of the OGS. Gene pages display information for individual OGS gene models, including gene structure, transcript variants, functional descriptions, gene symbols, Gene Ontology terms, annotator comments and links to National Center for Biotechnology Information and Ensembl. Each gene page is linked to a wiki page to allow input from the research community.

  8. Estimating the annotation error rate of curated GO database sequence annotations

    PubMed Central

    Jones, Craig E; Brown, Alfred L; Baumann, Ute

    2007-01-01

    Background: Annotations that describe the function of sequences are enormously important to researchers during laboratory investigations and when making computational inferences. However, there has been little investigation into the data quality of sequence function annotations. Here we have developed a new method of estimating the error rate of curated sequence annotations, and applied this to the Gene Ontology (GO) sequence database (GOSeqLite). This method involved artificially adding errors to sequence annotations at known rates, and used regression to model the impact on the precision of annotations based on BLAST matched sequences. Results: We estimated the error rate of curated GO sequence annotations in the GOSeqLite database (March 2006) at between 28% and 30%. Annotations made without use of sequence similarity based methods (non-ISS) had an estimated error rate of between 13% and 18%. Annotations made with the use of sequence similarity methodology (ISS) had an estimated error rate of 49%. Conclusion: While the overall error rate is reasonably low, it would be prudent to treat all ISS annotations with caution. Electronic annotators that use ISS annotations as the basis of predictions are likely to have higher false prediction rates, and for this reason designers of these systems should consider avoiding ISS annotations where possible. Electronic annotators that use ISS annotations to make predictions should be viewed sceptically. We recommend that curators thoroughly review ISS annotations before accepting them as valid. Overall, users of curated sequence annotations from the GO database should feel assured that they are using a comparatively high quality source of information. PMID:17519041

  9. MODBASE, a database of annotated comparative protein structure models.

    PubMed

    Pieper, Ursula; Eswar, Narayanan; Stuart, Ashley C; Ilyin, Valentin A; Sali, Andrej

    2002-01-01

    MODBASE (http://guitar.rockefeller.edu/modbase) is a relational database of annotated comparative protein structure models for all available protein sequences matched to at least one known protein structure. The models are calculated by MODPIPE, an automated modeling pipeline that relies on PSI-BLAST, IMPALA and MODELLER. MODBASE uses the MySQL relational database management system for flexible and efficient querying, and the MODVIEW Netscape plugin for viewing and manipulating multiple sequences and structures. It is updated regularly to reflect the growth of the protein sequence and structure databases, as well as improvements in the software for calculating the models. For ease of access, MODBASE is organized into different datasets. The largest dataset contains models for domains in 304 517 out of 539 171 unique protein sequences in the complete TrEMBL database (23 March 2001); only models based on significant alignments (PSI-BLAST E-value < 10^-4) and models assessed to have the correct fold are included. Other datasets include models for target selection and structure-based annotation by the New York Structural Genomics Research Consortium, models for prediction of genes in the Drosophila melanogaster genome, models for structure determination of several ribosomal particles and models calculated by the MODWEB comparative modeling web server.

  10. The GOA database: Gene Ontology annotation updates for 2015

    PubMed Central

    Huntley, Rachael P.; Sawford, Tony; Mutowo-Meullenet, Prudence; Shypitsyna, Aleksandra; Bonilla, Carlos; Martin, Maria J.; O'Donovan, Claire

    2015-01-01

    The Gene Ontology Annotation (GOA) resource (http://www.ebi.ac.uk/GOA) provides evidence-based Gene Ontology (GO) annotations to proteins in the UniProt Knowledgebase (UniProtKB). Manual annotations provided by UniProt curators are supplemented by manual and automatic annotations from model organism databases and specialist annotation groups. GOA currently supplies 368 million GO annotations to almost 54 million proteins in more than 480 000 taxonomic groups. The resource now provides annotations to five times the number of proteins it did 4 years ago. As a member of the GO Consortium, we adhere to the most up-to-date Consortium-agreed annotation guidelines via the use of quality control checks that ensure that the GOA resource supplies high-quality functional information to proteins from a wide range of species. Annotations from GOA are freely available and are accessible through a powerful web browser as well as a variety of annotation file formats. PMID:25378336

  11. Improving gene annotation using peptide mass spectrometry

    PubMed Central

    Tanner, Stephen; Shen, Zhouxin; Ng, Julio; Florea, Liliana; Guigó, Roderic; Briggs, Steven P.; Bafna, Vineet

    2007-01-01

    Annotation of protein-coding genes is a key goal of genome sequencing projects. In spite of tremendous recent advances in computational gene finding, comprehensive annotation remains a challenge. Peptide mass spectrometry is a powerful tool for researching the dynamic proteome and suggests an attractive approach to discover and validate protein-coding genes. We present algorithms to construct and efficiently search spectra against a genomic database, with no prior knowledge of encoded proteins. By searching a corpus of 18.5 million tandem mass spectra (MS/MS) from human proteomic samples, we validate 39,000 exons and 11,000 introns at the level of translation. We present translation-level evidence for novel or extended exons in 16 genes, confirm translation of 224 hypothetical proteins, and discover or confirm over 40 alternative splicing events. Polymorphisms are efficiently encoded in our database, allowing us to observe variant alleles for 308 coding SNPs. Finally, we demonstrate the use of mass spectrometry to improve automated gene prediction, adding 800 correct exons to our predictions using a simple rescoring strategy. Our results demonstrate that proteomic profiling should play a role in any genome sequencing project. PMID:17189379

  12. Hymenoptera Genome Database: integrating genome annotations in HymenopteraMine

    PubMed Central

    Elsik, Christine G.; Tayal, Aditi; Diesh, Colin M.; Unni, Deepak R.; Emery, Marianne L.; Nguyen, Hung N.; Hagen, Darren E.

    2016-01-01

    We report an update of the Hymenoptera Genome Database (HGD) (http://HymenopteraGenome.org), a model organism database for insect species of the order Hymenoptera (ants, bees and wasps). HGD maintains genomic data for 9 bee species, 10 ant species and 1 wasp, including the versions of genome and annotation data sets published by the genome sequencing consortiums and those provided by NCBI. A new data-mining warehouse, HymenopteraMine, based on the InterMine data warehousing system, integrates the genome data with data from external sources and facilitates cross-species analyses based on orthology. New genome browsers and annotation tools based on JBrowse/WebApollo provide easy genome navigation, and viewing of high throughput sequence data sets and can be used for collaborative genome annotation. All of the genomes and annotation data sets are combined into a single BLAST server that allows users to select and combine sequence data sets to search. PMID:26578564

  13. Hymenoptera Genome Database: integrating genome annotations in HymenopteraMine.

    PubMed

    Elsik, Christine G; Tayal, Aditi; Diesh, Colin M; Unni, Deepak R; Emery, Marianne L; Nguyen, Hung N; Hagen, Darren E

    2016-01-04

    We report an update of the Hymenoptera Genome Database (HGD) (http://HymenopteraGenome.org), a model organism database for insect species of the order Hymenoptera (ants, bees and wasps). HGD maintains genomic data for 9 bee species, 10 ant species and 1 wasp, including the versions of genome and annotation data sets published by the genome sequencing consortiums and those provided by NCBI. A new data-mining warehouse, HymenopteraMine, based on the InterMine data warehousing system, integrates the genome data with data from external sources and facilitates cross-species analyses based on orthology. New genome browsers and annotation tools based on JBrowse/WebApollo provide easy genome navigation, and viewing of high throughput sequence data sets and can be used for collaborative genome annotation. All of the genomes and annotation data sets are combined into a single BLAST server that allows users to select and combine sequence data sets to search.

  14. NECTAR: a database of codon-centric missense variant annotations.

    PubMed

    Gong, Sungsam; Ware, James S; Walsh, Roddy; Cook, Stuart A

    2014-01-01

    NECTAR (Non-synonymous Enriched Coding muTation ARchive; http://nectarmutation.org) is a database and web application to annotate disease-related and functionally important amino acids in human proteins. A number of tools are available to facilitate the interpretation of DNA variants identified in diagnostic or research sequencing. These typically identify previous reports of DNA variation at a given genomic location, predict its effects on transcript and protein sequence and may predict downstream functional consequences. Previous reports and functional annotations are typically linked by the genomic location of the variant observed. NECTAR collates disease-causing variants and functionally important amino acid residues from a number of sources. Importantly, rather than simply linking annotations by a shared genomic location, NECTAR annotates variants of interest with details of previously reported variation affecting the same codon. This provides a much richer data set for the interpretation of a novel DNA variant. NECTAR also identifies functionally equivalent amino acid residues in evolutionarily related proteins (paralogues) and, where appropriate, transfers annotations between them. As well as accessing these data through a web interface, users can upload batches of variants in variant call format (VCF) for annotation on-the-fly. The database is freely available to download from the ftp site: ftp://ftp.nectarmutation.org.
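    The codon-centric linking described in this abstract (annotating a variant with earlier reports that hit the same codon, not just the same nucleotide) can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not NECTAR's implementation: the record fields, function names, transcript ID and variant labels are all hypothetical, and positions are taken to be 1-based offsets into the coding sequence.

```python
from collections import defaultdict


def codon_number(cds_position):
    """Map a 1-based coding-sequence position to its 1-based codon number."""
    if cds_position < 1:
        raise ValueError("CDS positions are 1-based")
    return (cds_position - 1) // 3 + 1


def index_by_codon(known_variants):
    """Group known variant labels by (transcript, codon number)."""
    index = defaultdict(list)
    for variant in known_variants:
        key = (variant["transcript"], codon_number(variant["cds_pos"]))
        index[key].append(variant["label"])
    return index


def same_codon_reports(index, transcript, cds_pos):
    """Return labels of known variants affecting the same codon as a new variant."""
    return index.get((transcript, codon_number(cds_pos)), [])


# Hypothetical records: three known variants in codon 34, one in codon 35.
known = [
    {"transcript": "tx1", "cds_pos": 100, "label": "known_1"},
    {"transcript": "tx1", "cds_pos": 101, "label": "known_2"},
    {"transcript": "tx1", "cds_pos": 102, "label": "known_3"},
    {"transcript": "tx1", "cds_pos": 103, "label": "known_4"},
]
index = index_by_codon(known)

# A novel variant at position 102 shares a codon with all three codon-34 reports,
# whereas a purely position-based lookup would match only known_3.
hits = same_codon_reports(index, "tx1", 102)
print(hits)
```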

  15. microRNAs Databases: Developmental Methodologies, Structural and Functional Annotations.

    PubMed

    Singh, Nagendra Kumar

    2016-03-28

    microRNA (miRNA) is an endogenous and evolutionarily conserved non-coding RNA that acts post-transcriptionally as a gene repressor and in mRNA cleavage through formation of the RNA-induced silencing complex (RISC). In RISC, miRNA binds by complementary base pairing to the targeted mRNA, together with the Argonaute protein complex, and causes gene repression or endonucleolytic cleavage of mRNAs, which can result in many diseases and syndromes. After the discovery of the miRNAs lin-4 and let-7, large numbers of miRNAs involved in various biological and metabolic processes were discovered by low-throughput and high-throughput experimental techniques along with computational methods. miRNAs are important non-coding RNAs for understanding complex biological phenomena in organisms because they control gene regulation. This paper reviews miRNA databases with structural and functional annotations developed by various researchers. These databases contain structural and functional information on animal, plant and virus miRNAs, including miRNA-associated diseases, stress resistance in plants, the roles of miRNAs in various biological processes, the effects of miRNA interactions on drugs and the environment, the effects of variance on miRNAs, miRNA gene expression analysis, and the sequences and structures of miRNAs. This review focuses on the developmental methodology of miRNA databases, such as the computational tools and methods used to extract miRNA annotations from different resources or through experiment. This study also discusses the efficiency of the user interface design of each database along with its current entries and annotations of miRNAs (pathways, gene ontology, disease ontology, etc.). An integrated schematic diagram of the construction process for databases is also drawn, along with tabular and graphical comparisons of the various types of entries in different databases. The aim of this paper is to present the importance of miRNA-related resources in a single place.

  16. Enhanced annotations and features for comparing thousands of Pseudomonas genomes in the Pseudomonas genome database.

    PubMed

    Winsor, Geoffrey L; Griffiths, Emma J; Lo, Raymond; Dhillon, Bhavjinder K; Shay, Julie A; Brinkman, Fiona S L

    2016-01-04

    The Pseudomonas Genome Database (http://www.pseudomonas.com) is well known for the application of community-based annotation approaches for producing a high-quality Pseudomonas aeruginosa PAO1 genome annotation, and facilitating whole-genome comparative analyses with other Pseudomonas strains. To aid analysis of potentially thousands of complete and draft genome assemblies, this database and analysis platform was upgraded to integrate curated genome annotations and isolate metadata with enhanced tools for larger scale comparative analysis and visualization. Manually curated gene annotations are supplemented with improved computational analyses that help identify putative drug targets and vaccine candidates or assist with evolutionary studies by identifying orthologs, pathogen-associated genes and genomic islands. The database schema has been updated to integrate isolate metadata that will facilitate more powerful analysis of genomes across datasets in the future. We continue to place an emphasis on providing high-quality updates to gene annotations through regular review of the scientific literature and using community-based approaches including a major new Pseudomonas community initiative for the assignment of high-quality gene ontology terms to genes. As we further expand from thousands of genomes, we plan to provide enhancements that will aid data visualization and analysis arising from whole-genome comparative studies including more pan-genome and population-based approaches.

  17. SATPdb: a database of structurally annotated therapeutic peptides.

    PubMed

    Singh, Sandeep; Chaudhary, Kumardeep; Dhanda, Sandeep Kumar; Bhalla, Sherry; Usmani, Salman Sadullah; Gautam, Ankur; Tuknait, Abhishek; Agrawal, Piyush; Mathur, Deepika; Raghava, Gajendra P S

    2016-01-04

    SATPdb (http://crdd.osdd.net/raghava/satpdb/) is a database of structurally annotated therapeutic peptides, curated from 22 public domain peptide databases/datasets including 9 of our own. The current version holds 19192 unique experimentally validated therapeutic peptide sequences having length between 2 and 50 amino acids. It covers peptides having natural, non-natural and modified residues. These peptides were systematically grouped into 10 categories based on their major function or therapeutic property like 1099 anticancer, 10585 antimicrobial, 1642 drug delivery and 1698 antihypertensive peptides. We assigned or annotated structure of these therapeutic peptides using structural databases (Protein Data Bank) and state-of-the-art structure prediction methods like I-TASSER, HHsearch and PEPstrMOD. In addition, SATPdb facilitates users in performing various tasks that include: (i) structure and sequence similarity search, (ii) peptide browsing based on their function and properties, (iii) identification of moonlighting peptides and (iv) searching of peptides having desired structure and therapeutic activities. We hope this database will be useful for researchers working in the field of peptide-based therapeutics.

  18. Vespucci: a system for building annotated databases of nascent transcripts

    PubMed Central

    Allison, Karmel A.; Kaikkonen, Minna U.; Gaasterland, Terry; Glass, Christopher K.

    2014-01-01

    Global run-on sequencing (GRO-seq) is a recent addition to the series of high-throughput sequencing methods that enables new insights into transcriptional dynamics within a cell. However, GRO-sequencing presents new algorithmic challenges, as existing analysis platforms for ChIP-seq and RNA-seq do not address the unique problem of identifying transcriptional units de novo from short reads located all across the genome. Here, we present a novel algorithm for de novo transcript identification from GRO-sequencing data, along with a system that determines transcript regions, stores them in a relational database and associates them with known reference annotations. We use this method to analyze GRO-sequencing data from primary mouse macrophages and derive novel quantitative insights into the extent and characteristics of non-coding transcription in mammalian cells. In doing so, we demonstrate that Vespucci expands existing annotations for mRNAs and lincRNAs by defining the primary transcript beyond the polyadenylation site. In addition, Vespucci generates assemblies for un-annotated non-coding RNAs such as those transcribed from enhancer-like elements. Vespucci thereby provides a robust system for defining, storing and analyzing diverse classes of primary RNA transcripts that are of increasing biological interest. PMID:24304890
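    The core difficulty named in this abstract, grouping short reads scattered across the genome into transcriptional units without a reference annotation, can be illustrated with a deliberately simple stand-in: merging reads on one strand whenever the gap between them is below a threshold. This is not Vespucci's algorithm, only a toy sketch of de novo unit calling; the function name, the `max_gap` parameter and the coordinates are assumptions.

```python
def merge_reads(reads, max_gap=100):
    """Merge (start, end) reads into candidate transcript units.

    Assumptions: all reads come from one strand of one chromosome, and two
    reads belong to the same unit if the gap between them is at most max_gap.
    """
    units = []
    for start, end in sorted(reads):
        if units and start - units[-1][1] <= max_gap:
            # Close enough to the current unit: extend it.
            units[-1][1] = max(units[-1][1], end)
        else:
            # Too far from the previous unit: start a new one.
            units.append([start, end])
    return [tuple(unit) for unit in units]


# Hypothetical read coordinates.
reads = [(1, 50), (40, 90), (150, 200), (1000, 1050)]
units = merge_reads(reads, max_gap=100)
print(units)
```

    The resulting intervals could then be stored in a relational table and compared against reference annotations, which is the role the abstract assigns to the database layer.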

  19. An Improved microRNA Annotation of the Canine Genome.

    PubMed

    Penso-Dolfin, Luca; Swofford, Ross; Johnson, Jeremy; Alföldi, Jessica; Lindblad-Toh, Kerstin; Swarbreck, David; Moxon, Simon; Di Palma, Federica

    2016-01-01

    The domestic dog, Canis familiaris, is a valuable model for studying human diseases. The publication of the latest canine genome build and annotation, CanFam3.1, provides an opportunity to enhance our understanding of gene regulation across tissues in the dog model system. In this study, we used the latest dog genome assembly and small RNA sequencing data from 9 different dog tissues to predict novel miRNAs in the dog genome, as well as to annotate conserved miRNAs from the miRBase database that were missing from the current dog annotation. We used both miRCat and miRDeep2 algorithms to computationally predict miRNA loci. The resulting putative hairpin sequences were analysed to discard false positives, based on predicted secondary structures and patterns of small RNA read alignments. Results were further divided into high and low confidence miRNAs, using the same criteria. We generated tissue specific expression profiles for the resulting set of 811 loci: 720 conserved miRNAs (207 of which had not been previously annotated in the dog genome) and 91 novel miRNA loci. Comparative analyses revealed 8 putative homologues of some novel miRNAs in ferret, and one in microbat. All miRNAs were also classified into the genic and intergenic categories, based on the Ensembl RefSeq gene annotation for CanFam3.1. This additionally allowed us to identify four previously undescribed mirtrons among our total set of miRNAs. We additionally annotated piRNAs, using proTRAC on the same input data. We thus identified 263 putative clusters, most of which (211 clusters) were found to be expressed in testis. Our results represent an important improvement of the dog genome annotation, paving the way to further research on the evolution of gene regulation, as well as on the contribution of post-transcriptional regulation to pathological conditions.

  20. Rice Annotation Project Database (RAP-DB): an integrative and interactive database for rice genomics.

    PubMed

    Sakai, Hiroaki; Lee, Sung Shin; Tanaka, Tsuyoshi; Numa, Hisataka; Kim, Jungsok; Kawahara, Yoshihiro; Wakimoto, Hironobu; Yang, Ching-chia; Iwamoto, Masao; Abe, Takashi; Yamada, Yuko; Muto, Akira; Inokuchi, Hachiro; Ikemura, Toshimichi; Matsumoto, Takashi; Sasaki, Takuji; Itoh, Takeshi

    2013-02-01

    The Rice Annotation Project Database (RAP-DB, http://rapdb.dna.affrc.go.jp/) has been providing a comprehensive set of gene annotations for the genome sequence of rice, Oryza sativa (japonica group) cv. Nipponbare. Since the first release in 2005, RAP-DB has been updated several times along with the genome assembly updates. Here, we present our newest RAP-DB based on the latest genome assembly, Os-Nipponbare-Reference-IRGSP-1.0 (IRGSP-1.0), which was released in 2011. We detected 37,869 loci by mapping transcript and protein sequences of 150 monocot species. To provide plant researchers with highly reliable and up-to-date rice gene annotations, we have been incorporating literature-based manually curated data, and 1,626 loci currently incorporate literature-based annotation data, including commonly used gene names or gene symbols. Transcriptional activities are shown at the nucleotide level by mapping RNA-Seq reads derived from 27 samples. We also mapped the Illumina reads of a leading Japanese japonica cultivar, Koshihikari, and a Chinese indica cultivar, Guangluai-4, to the genome and show alignments together with the single nucleotide polymorphisms (SNPs) and gene functional annotations through a newly developed browser, Short-Read Assembly Browser (S-RAB). We have developed two satellite databases, Plant Gene Family Database (PGFD) and Integrative Database of Cereal Gene Phylogeny (IDCGP), which display gene family and homologous gene relationships among diverse plant species. RAP-DB and the satellite databases offer simple and user-friendly web interfaces, enabling plant and genome researchers to access the data easily and facilitating a broad range of plant research topics.

  1. A manual curation strategy to improve genome annotation: application to a set of haloarchael genomes.

    PubMed

    Pfeiffer, Friedhelm; Oesterhelt, Dieter

    2015-06-02

    Genome annotation errors are a persistent problem that impedes research in the biosciences. A manual curation effort is described that attempts to produce high-quality genome annotations for a set of haloarchaeal genomes (Halobacterium salinarum and Hbt. hubeiense, Haloferax volcanii and Hfx. mediterranei, Natronomonas pharaonis and Nmn. moolapensis, Haloquadratum walsbyi strains HBSQ001 and C23, Natrialba magadii, Haloarcula marismortui and Har. hispanica, and Halohasta litchfieldiae). Genomes are checked for missing genes, start codon misassignments, and disrupted genes. Assignments of a specific function are preferably based on experimentally characterized homologs (Gold Standard Proteins). To avoid overannotation, which is a major source of database errors, we restrict annotation to only general function assignments when support for a specific substrate assignment is insufficient. This strategy results in annotations that are resistant to the plethora of errors that compromise public databases. Annotation consistency is rigorously validated for ortholog pairs from the genomes surveyed. The annotation is regularly crosschecked against the UniProt database to further improve annotations and increase the level of standardization. Enhanced genome annotations are submitted to public databases (EMBL/GenBank, UniProt), to the benefit of the scientific community. The enhanced annotations are also publicly available via HaloLex.
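
    The ortholog-pair consistency check mentioned in the abstract above can be pictured with a minimal sketch. The locus tags, function names, and the normalisation rule here are all hypothetical, chosen for illustration; the abstract does not describe how the authors implement the check.

```python
def normalise(name):
    """Collapse whitespace and case so trivial spelling differences are ignored."""
    return " ".join(name.split()).casefold()


def inconsistent_ortholog_pairs(annotations, ortholog_pairs):
    """Return ortholog pairs whose assigned function names differ after normalisation."""
    return [
        (gene_a, gene_b)
        for gene_a, gene_b in ortholog_pairs
        if normalise(annotations[gene_a]) != normalise(annotations[gene_b])
    ]


# Hypothetical annotations for two genomes (illustrative only).
annotations = {
    "genomeA_0001": "dihydrofolate reductase",
    "genomeB_0107": "Dihydrofolate  reductase",
    "genomeA_0002": "ABC-type transport system permease protein",
    "genomeB_0250": "ABC-type transport system permease protein (probable substrate phosphate)",
}
ortholog_pairs = [
    ("genomeA_0001", "genomeB_0107"),
    ("genomeA_0002", "genomeB_0250"),
]

mismatches = inconsistent_ortholog_pairs(annotations, ortholog_pairs)
print(mismatches)
```

    In this toy data the second pair is reported because one ortholog carries a more specific substrate assignment than the other, which is the kind of discrepancy a curator would then resolve by hand.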

  2. An annotated database of Arabidopsis mutants of acyl lipid metabolism

    SciTech Connect

    McGlew, Kathleen; Shaw, Vincent; Zhang, Meng; Kim, Ryeo Jin; Yang, Weili; Shorrosh, Basil; Suh, Mi Chung; Ohlrogge, John

    2014-12-10

    Mutants have played a fundamental role in gene discovery and in understanding the function of genes involved in plant acyl lipid metabolism. The first mutant in Arabidopsis lipid metabolism (fad4) was described in 1985. Since that time, characterization of mutants in more than 280 genes associated with acyl lipid metabolism has been reported. This review provides a brief background and history on identification of mutants in acyl lipid metabolism, an analysis of the distribution of mutants in different areas of acyl lipid metabolism and presents an annotated database (ARALIPmutantDB) of these mutants. The database provides information on the phenotypes of mutants, pathways and enzymes/proteins associated with the mutants, and allows rapid access via hyperlinks to summaries of information about each mutant and to literature that provides information on the lipid composition of the mutants. Mutants for at least 30 % of the genes in the database have multiple names, which have been compiled here to reduce ambiguities in searches for information. Furthermore, the database should also provide a tool for exploring the relationships between mutants in acyl lipid-related genes and their lipid phenotypes and point to opportunities for further research.

  3. An annotated database of Arabidopsis mutants of acyl lipid metabolism

    DOE PAGES

    McGlew, Kathleen; Shaw, Vincent; Zhang, Meng; ...

    2014-12-10

    Mutants have played a fundamental role in gene discovery and in understanding the function of genes involved in plant acyl lipid metabolism. The first mutant in Arabidopsis lipid metabolism (fad4) was described in 1985. Since that time, characterization of mutants in more than 280 genes associated with acyl lipid metabolism has been reported. This review provides a brief background and history on identification of mutants in acyl lipid metabolism, an analysis of the distribution of mutants in different areas of acyl lipid metabolism and presents an annotated database (ARALIPmutantDB) of these mutants. The database provides information on the phenotypes of mutants, pathways and enzymes/proteins associated with the mutants, and allows rapid access via hyperlinks to summaries of information about each mutant and to literature that provides information on the lipid composition of the mutants. Mutants for at least 30 % of the genes in the database have multiple names, which have been compiled here to reduce ambiguities in searches for information. Furthermore, the database should also provide a tool for exploring the relationships between mutants in acyl lipid-related genes and their lipid phenotypes and point to opportunities for further research.

  4. MitoFish and MitoAnnotator: a mitochondrial genome database of fish with an accurate and automatic annotation pipeline.

    PubMed

    Iwasaki, Wataru; Fukunaga, Tsukasa; Isagozawa, Ryota; Yamada, Koichiro; Maeda, Yasunobu; Satoh, Takashi P; Sado, Tetsuya; Mabuchi, Kohji; Takeshima, Hirohiko; Miya, Masaki; Nishida, Mutsumi

    2013-11-01

    MitoFish is a database of fish mitochondrial genomes (mitogenomes) that includes powerful and precise de novo annotations for mitogenome sequences. Fish occupy an important position in the evolution of vertebrates and the ecology of the hydrosphere, and mitogenomic sequence data have served as a rich source of information for resolving fish phylogenies and identifying new fish species. The importance of a mitogenomic database continues to grow at a rapid pace as massive amounts of mitogenomic data are generated with the advent of new sequencing technologies. A severe bottleneck seems likely to occur with regard to mitogenome annotation because of the overwhelming pace of data accumulation and the intrinsic difficulties in annotating sequences with degenerating transfer RNA structures, divergent start/stop codons of the coding elements, and the overlapping of adjacent elements. To ease this data backlog, we developed an annotation pipeline named MitoAnnotator. MitoAnnotator automatically annotates a fish mitogenome with a high degree of accuracy in approximately 5 min; thus, it is readily applicable to data sets of dozens of sequences. MitoFish also contains re-annotations of previously sequenced fish mitogenomes, enabling researchers to refer to them when they find annotations that are likely to be erroneous or while conducting comparative mitogenomic analyses. For users who need more information on the taxonomy, habitats, phenotypes, or life cycles of fish, MitoFish provides links to related databases. MitoFish and MitoAnnotator are freely available at http://mitofish.aori.u-tokyo.ac.jp/ (last accessed August 28, 2013); all of the data can be batch downloaded, and the annotation pipeline can be used via a web interface.

  5. MitoFish and MitoAnnotator: A Mitochondrial Genome Database of Fish with an Accurate and Automatic Annotation Pipeline

    PubMed Central

    Iwasaki, Wataru; Fukunaga, Tsukasa; Isagozawa, Ryota; Yamada, Koichiro; Maeda, Yasunobu; Satoh, Takashi P.; Sado, Tetsuya; Mabuchi, Kohji; Takeshima, Hirohiko; Miya, Masaki; Nishida, Mutsumi

    2013-01-01

    MitoFish is a database of fish mitochondrial genomes (mitogenomes) that includes powerful and precise de novo annotations for mitogenome sequences. Fish occupy an important position in the evolution of vertebrates and the ecology of the hydrosphere, and mitogenomic sequence data have served as a rich source of information for resolving fish phylogenies and identifying new fish species. The importance of a mitogenomic database continues to grow at a rapid pace as massive amounts of mitogenomic data are generated with the advent of new sequencing technologies. A severe bottleneck seems likely to occur with regard to mitogenome annotation because of the overwhelming pace of data accumulation and the intrinsic difficulties in annotating sequences with degenerating transfer RNA structures, divergent start/stop codons of the coding elements, and the overlapping of adjacent elements. To ease this data backlog, we developed an annotation pipeline named MitoAnnotator. MitoAnnotator automatically annotates a fish mitogenome with a high degree of accuracy in approximately 5 min; thus, it is readily applicable to data sets of dozens of sequences. MitoFish also contains re-annotations of previously sequenced fish mitogenomes, enabling researchers to refer to them when they find annotations that are likely to be erroneous or while conducting comparative mitogenomic analyses. For users who need more information on the taxonomy, habitats, phenotypes, or life cycles of fish, MitoFish provides links to related databases. MitoFish and MitoAnnotator are freely available at http://mitofish.aori.u-tokyo.ac.jp/ (last accessed August 28, 2013); all of the data can be batch downloaded, and the annotation pipeline can be used via a web interface. PMID:23955518

  6. MannDB: A microbial annotation database for protein characterization

    SciTech Connect

    Zhou, C; Lam, M; Smith, J; Zemla, A; Dyer, M; Kuczmarski, T; Vitalis, E; Slezak, T

    2006-05-19

    MannDB was created to meet a need for rapid, comprehensive automated protein sequence analyses to support selection of proteins suitable as targets for driving the development of reagents for pathogen or protein toxin detection. Because a large number of open-source tools were needed, it was necessary to produce a software system to scale the computations for whole-proteome analysis. Thus, we built a fully automated system for executing software tools and for storage, integration, and display of automated protein sequence analysis and annotation data. MannDB is a relational database that organizes data resulting from fully automated, high-throughput protein-sequence analyses using open-source tools. Types of analyses provided include predictions of cleavage, chemical properties, classification, features, functional assignment, post-translational modifications, motifs, antigenicity, and secondary structure. Proteomes (lists of hypothetical and known proteins) are downloaded and parsed from GenBank and then inserted into MannDB, and annotations from SwissProt are downloaded when identifiers are found in the GenBank entry or when identical sequences are identified. Currently 36 open-source tools are run against MannDB protein sequences either on local systems or by means of batch submission to external servers. In addition, BLAST against protein entries in MvirDB, our database of microbial virulence factors, is performed. A web client browser enables viewing of computational results and downloaded annotations, and a query tool enables structured and free-text search capabilities. When available, links to external databases, including MvirDB, are provided. MannDB contains whole-proteome analyses for at least one representative organism from each category of biological threat organism listed by APHIS, CDC, HHS, NIAID, USDA, USFDA, and WHO. MannDB comprises a large number of genomes and comprehensive protein sequence analyses representing organisms listed as high

  7. Improving functional annotation for industrial microbes: a case study with Pichia pastoris

    PubMed Central

    Dikicioglu, Duygu; Wood, Valerie; Rutherford, Kim M.; McDowall, Mark D.; Oliver, Stephen G.

    2014-01-01

    The research communities studying microbial model organisms, such as Escherichia coli or Saccharomyces cerevisiae, are well served by model organism databases that have extensive functional annotation. However, this is not true of many industrial microbes that are used widely in biotechnology. In this Opinion piece, we use Pichia (Komagataella) pastoris to illustrate the limitations of the available annotation. We consider the resources that can be implemented in the short term both to improve Gene Ontology (GO) annotation coverage based on annotation transfer, and to establish curation pipelines for the literature corpus of this organism. PMID:24929579

  8. Building and Querying RDF/OWL Database of Semantically Annotated Nuclear Medicine Images.

    PubMed

    Hwang, Kyung Hoon; Lee, Haejun; Koh, Geon; Willrett, Debra; Rubin, Daniel L

    2017-02-01

    As the use of positron emission tomography-computed tomography (PET-CT) has increased rapidly, there is a need to retrieve relevant medical images that can assist image interpretation. However, the images themselves lack the explicit information needed for query. We constructed a semantically structured database of nuclear medicine images using the Annotation and Image Markup (AIM) format and evaluated the ability of the AIM annotations to improve image search. We created AIM annotation templates specific to the nuclear medicine domain and used them to annotate 100 nuclear medicine PET-CT studies in AIM format using controlled vocabulary. We evaluated image retrieval from 20 specific clinical queries. As the gold standard, two nuclear medicine physicians manually retrieved the relevant images from the image database using free text search of radiology reports for the same queries. We compared query results with the manually retrieved results obtained by the physicians. The query performance indicated a 98 % recall for simple queries and an 89 % recall for complex queries. In total, the queries provided 95 % (75 of 79 images) recall, 100 % precision, and an F1 score of 0.97 for the 20 clinical queries. Three of the four images missed by the queries required reasoning for successful retrieval. Nuclear medicine images augmented using semantic annotations in AIM enabled high recall and precision for simple queries, helping physicians to retrieve the relevant images. Further study using a larger data set and the implementation of an inference engine may improve query results for more complex queries.
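
    The overall retrieval figures quoted above follow from the standard definitions of precision, recall, and F1. The short sketch below recomputes them from the counts given in the abstract (75 of 79 relevant images retrieved; 100 % precision, which implies no irrelevant images were returned). The function name is our own.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Return (precision, recall, F1) from raw retrieval counts."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# 75 relevant images retrieved, 0 irrelevant images returned, 4 relevant images missed.
precision, recall, f1 = precision_recall_f1(75, 0, 4)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
# prints: precision=1.00 recall=0.95 F1=0.97
```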

  9. An Improved microRNA Annotation of the Canine Genome

    PubMed Central

    Swofford, Ross; Johnson, Jeremy; Alföldi, Jessica; Lindblad-Toh, Kerstin; Swarbreck, David; Moxon, Simon; Di Palma, Federica

    2016-01-01

    The domestic dog, Canis familiaris, is a valuable model for studying human diseases. The publication of the latest canine genome build and annotation, CanFam3.1, provides an opportunity to enhance our understanding of gene regulation across tissues in the dog model system. In this study, we used the latest dog genome assembly and small RNA sequencing data from 9 different dog tissues to predict novel miRNAs in the dog genome, as well as to annotate conserved miRNAs from the miRBase database that were missing from the current dog annotation. We used both miRCat and miRDeep2 algorithms to computationally predict miRNA loci. The resulting putative hairpin sequences were analysed in order to discard false positives, based on predicted secondary structures and patterns of small RNA read alignments. Results were further divided into high and low confidence miRNAs, using the same criteria. We generated tissue specific expression profiles for the resulting set of 811 loci: 720 conserved miRNAs (207 of which had not been previously annotated in the dog genome) and 91 novel miRNA loci. Comparative analyses revealed 8 putative homologues of some novel miRNAs in ferret, and one in microbat. All miRNAs were also classified into the genic and intergenic categories, based on the Ensembl RefSeq gene annotation for CanFam3.1. This additionally allowed us to identify four previously undescribed mirtrons among our total set of miRNAs. We additionally annotated piRNAs, using proTRAC on the same input data. We thus identified 263 putative clusters, most of which (211 clusters) were found to be expressed in testis. Our results represent an important improvement of the dog genome annotation, paving the way to further research on the evolution of gene regulation, as well as on the contribution of post-transcriptional regulation to pathological conditions. PMID:27119849
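
    The genic versus intergenic classification described above amounts to an interval-overlap test against a gene annotation. The following is a minimal sketch of that idea only: the coordinates and locus names are hypothetical, strand is ignored, and any overlap counts as genic, which may differ from the criteria the authors actually applied.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def overlaps(a, b):
    """True if two intervals share at least one base on the same chromosome."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


def classify_loci(mirna_loci, genes):
    """Label each miRNA locus 'genic' if it overlaps any gene, else 'intergenic'."""
    return {
        name: "genic" if any(overlaps(locus, gene) for gene in genes) else "intergenic"
        for name, locus in mirna_loci.items()
    }


# Hypothetical coordinates, for illustration only.
genes = [Interval("chr1", 1000, 5000), Interval("chr2", 200, 900)]
mirna_loci = {
    "mir-a": Interval("chr1", 1200, 1280),  # inside the chr1 gene
    "mir-b": Interval("chr1", 6000, 6080),  # downstream of the chr1 gene
    "mir-c": Interval("chr2", 880, 960),    # partially overlaps the chr2 gene
}

labels = classify_loci(mirna_loci, genes)
print(labels)
```

    The linear scan over all genes is fine for a toy example; a genome-scale run would normally sort the intervals or use an interval tree.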

  10. Bovine Genome Database: supporting community annotation and analysis of the Bos taurus genome

    PubMed Central

    2010-01-01

    Background: A goal of the Bovine Genome Database (BGD; http://BovineGenome.org) has been to support the Bovine Genome Sequencing and Analysis Consortium (BGSAC) in the annotation and analysis of the bovine genome. We were faced with several challenges, including the need to maintain consistent quality despite diversity in annotation expertise in the research community, the need to maintain consistent data formats, and the need to minimize the potential duplication of annotation effort. With new sequencing technologies allowing many more eukaryotic genomes to be sequenced, the demand for collaborative annotation is likely to increase. Here we present our approach, challenges and solutions facilitating a large distributed annotation project. Results and Discussion: BGD has provided annotation tools that supported 147 members of the BGSAC in contributing 3,871 gene models over a fifteen-week period, and these annotations have been integrated into the bovine Official Gene Set. Our approach has been to provide an annotation system, which includes a BLAST site, multiple genome browsers, an annotation portal, and the Apollo Annotation Editor configured to connect directly to our Chado database. In addition to implementing and integrating components of the annotation system, we have performed computational analyses to create gene evidence tracks and a consensus gene set, which can be viewed on individual gene pages at BGD. Conclusions: We have provided annotation tools that alleviate challenges associated with distributed annotation. Our system provides a consistent set of data to all annotators and eliminates the need for annotators to format data. Involving the bovine research community in genome annotation has allowed us to leverage expertise in various areas of bovine biology to provide biological insight into the genome sequence. PMID:21092105

  11. Combining computational models, semantic annotations and simulation experiments in a graph database

    PubMed Central

    Henkel, Ron; Wolkenhauer, Olaf; Waltemath, Dagmar

    2015-01-01

    Model repositories such as the BioModels Database, the CellML Model Repository or JWS Online are frequently accessed to retrieve computational models of biological systems. However, their storage concepts support only restricted types of queries and not all data inside the repositories can be retrieved. In this article we present a storage concept that meets this challenge. It is built on a graph database, reflects the models’ structure, incorporates semantic annotations and simulation descriptions and ultimately connects different types of model-related data. The connections between heterogeneous model-related data and bio-ontologies enable efficient search via biological facts and grant access to new model features. The introduced concept notably improves access to computational models and associated simulations in a model repository. This has positive effects on tasks such as model search, retrieval, ranking, matching and filtering. Furthermore, our work for the first time enables CellML- and Systems Biology Markup Language-encoded models to be effectively maintained in one database. We show how these models can be linked via annotations and queried. Database URL: https://sems.uni-rostock.de/projects/masymos/ PMID:25754863
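
    To make the idea of linking models through shared annotations concrete, here is a small in-memory sketch. It is a plain Python adjacency structure, not the graph database used by the authors, and the model identifiers and annotation terms are hypothetical placeholders.

```python
from collections import defaultdict


class AnnotationGraph:
    """Toy in-memory graph: model nodes linked to annotation-term nodes."""

    def __init__(self):
        self.neighbours = defaultdict(set)  # node id -> adjacent node ids
        self.model_format = {}              # model id -> encoding format

    def add_model(self, model_id, fmt, terms):
        """Add a model node and connect it to each of its annotation terms."""
        self.model_format[model_id] = fmt
        for term in terms:
            self.neighbours[model_id].add(term)
            self.neighbours[term].add(model_id)

    def models_annotated_with(self, term):
        """Search by biological fact: all models attached to one term."""
        return sorted(self.neighbours.get(term, set()))

    def linked_models(self, model_id):
        """Models that share at least one annotation term with model_id."""
        linked = set()
        for term in self.neighbours.get(model_id, set()):
            linked |= self.neighbours[term]
        linked.discard(model_id)
        return sorted(linked)


# Hypothetical models and terms, for illustration only.
graph = AnnotationGraph()
graph.add_model("model_sbml_1", "SBML", {"term:glycolysis", "term:glucose"})
graph.add_model("model_cellml_1", "CellML", {"term:glycolysis"})
graph.add_model("model_sbml_2", "SBML", {"term:cell_cycle"})

print(graph.models_annotated_with("term:glycolysis"))
print(graph.linked_models("model_sbml_1"))
```

    Because both encodings attach to the same term nodes, a query that starts from an SBML model can reach a CellML model in two hops, which is the cross-format linking the abstract describes.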

  12. Combining computational models, semantic annotations and simulation experiments in a graph database.

    PubMed

    Henkel, Ron; Wolkenhauer, Olaf; Waltemath, Dagmar

    2015-01-01

    Model repositories such as the BioModels Database, the CellML Model Repository or JWS Online are frequently accessed to retrieve computational models of biological systems. However, their storage concepts support only restricted types of queries and not all data inside the repositories can be retrieved. In this article we present a storage concept that meets this challenge. It is built on a graph database, reflects the models' structure, incorporates semantic annotations and simulation descriptions and ultimately connects different types of model-related data. The connections between heterogeneous model-related data and bio-ontologies enable efficient search via biological facts and grant access to new model features. The introduced concept notably improves access to computational models and associated simulations in a model repository. This has positive effects on tasks such as model search, retrieval, ranking, matching and filtering. Furthermore, our work for the first time enables CellML- and Systems Biology Markup Language-encoded models to be effectively maintained in one database. We show how these models can be linked via annotations and queried. Database URL: https://sems.uni-rostock.de/projects/masymos/

  13. Using computational predictions to improve literature-based Gene Ontology annotations: a feasibility study.

    PubMed

    Costanzo, Maria C; Park, Julie; Balakrishnan, Rama; Cherry, J Michael; Hong, Eurie L

    2011-01-01

    Annotation using Gene Ontology (GO) terms is one of the most important ways in which biological information about specific gene products can be expressed in a searchable, computable form that may be compared across genomes and organisms. Because literature-based GO annotations are often used to propagate functional predictions between related proteins, their accuracy is critically important. We present a strategy that employs a comparison of literature-based annotations with computational predictions to identify and prioritize genes whose annotations need review. Using this method, we show that comparison of manually assigned 'unknown' annotations in the Saccharomyces Genome Database (SGD) with InterPro-based predictions can identify annotations that need to be updated. A survey of literature-based annotations and computational predictions made by the Gene Ontology Annotation (GOA) project at the European Bioinformatics Institute (EBI) across several other databases shows that this comparison strategy could be used to maintain and improve the quality of GO annotations for other organisms besides yeast. The survey also shows that although GOA-assigned predictions are the most comprehensive source of functional information for many genomes, a large proportion of genes in a variety of different organisms entirely lack these predictions but do have manual annotations. This underscores the critical need for manually performed, literature-based curation to provide functional information about genes that are outside the scope of widely used computational methods. Thus, the combination of manual and computational methods is essential to provide the most accurate and complete functional annotation of a genome. Database URL: http://www.yeastgenome.org.
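
    The comparison strategy in this abstract can be sketched as a simple filter: find genes whose manual annotation is 'unknown' but for which a computational prediction exists, then rank them for curator review. Gene names, term labels, and the ranking rule (more predicted terms first) are assumptions made for this illustration and are not taken from the paper.

```python
UNKNOWN = "unknown"


def genes_needing_review(manual, predicted):
    """Return (gene, predicted terms) for genes manually annotated as unknown
    that nevertheless have computational predictions, most predictions first."""
    candidates = [
        (gene, sorted(predicted[gene]))
        for gene, terms in manual.items()
        if terms == {UNKNOWN} and predicted.get(gene)
    ]
    return sorted(candidates, key=lambda item: (-len(item[1]), item[0]))


# Hypothetical annotations, for illustration only.
manual = {
    "geneA": {UNKNOWN},
    "geneB": {"kinase activity"},
    "geneC": {UNKNOWN},
    "geneD": {UNKNOWN},
}
predicted = {
    "geneA": {"hydrolase activity"},
    "geneB": {"kinase activity"},
    "geneC": set(),
    "geneD": {"transferase activity", "ATP binding"},
}

review_queue = genes_needing_review(manual, predicted)
for gene, terms in review_queue:
    print(gene, terms)
```

    Here geneC stays out of the queue because no prediction exists for it, and geneB is skipped because it already carries an informative manual annotation.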

  14. CycADS: an annotation database system to ease the development and update of BioCyc databases

    PubMed Central

    Vellozo, Augusto F.; Véron, Amélie S.; Baa-Puyoulet, Patrice; Huerta-Cepas, Jaime; Cottret, Ludovic; Febvay, Gérard; Calevro, Federica; Rahbé, Yvan; Douglas, Angela E.; Gabaldón, Toni; Sagot, Marie-France; Charles, Hubert; Colella, Stefano

    2011-01-01

    In recent years, genomes from an increasing number of organisms have been sequenced, but their annotation remains a time-consuming process. The BioCyc databases offer a framework for the integrated analysis of metabolic networks. The Pathway Tools software suite allows the automated construction of a database starting from an annotated genome, but it requires prior integration of all annotations into a specific summary file or into a GenBank file. To allow the easy creation and update of a BioCyc database starting from the multiple genome annotation resources available over time, we have developed an ad hoc data management system that we called Cyc Annotation Database System (CycADS). CycADS is centred on a specific database model and on a set of Java programs to import, filter and export relevant information. Data from GenBank and other annotation sources (including for example: KAAS, PRIAM, Blast2GO and PhylomeDB) are collected into a database to be subsequently filtered and extracted to generate a complete annotation file. This file is then used to build an enriched BioCyc database using the PathoLogic program of Pathway Tools. The CycADS pipeline for annotation management was used to build the AcypiCyc database for the pea aphid (Acyrthosiphon pisum) whose genome was recently sequenced. The AcypiCyc database webpage also includes, for comparative analyses, two other metabolic reconstruction BioCyc databases generated using CycADS: TricaCyc for Tribolium castaneum and DromeCyc for Drosophila melanogaster. Linked to its flexible design, CycADS offers a powerful software tool for the generation and regular updating of enriched BioCyc databases. The CycADS system is particularly suited for metabolic gene annotation and network reconstruction in newly sequenced genomes. Because of the uniform annotation used for metabolic network reconstruction, CycADS is particularly useful for comparative analysis of the metabolism of different organisms. Database URL: http

  15. CycADS: an annotation database system to ease the development and update of BioCyc databases.

    PubMed

    Vellozo, Augusto F; Véron, Amélie S; Baa-Puyoulet, Patrice; Huerta-Cepas, Jaime; Cottret, Ludovic; Febvay, Gérard; Calevro, Federica; Rahbé, Yvan; Douglas, Angela E; Gabaldón, Toni; Sagot, Marie-France; Charles, Hubert; Colella, Stefano

    2011-01-01

    In recent years, genomes from an increasing number of organisms have been sequenced, but their annotation remains a time-consuming process. The BioCyc databases offer a framework for the integrated analysis of metabolic networks. The Pathway Tools software suite allows the automated construction of a database starting from an annotated genome, but it requires prior integration of all annotations into a specific summary file or into a GenBank file. To allow the easy creation and update of a BioCyc database starting from the multiple genome annotation resources available over time, we have developed an ad hoc data management system that we called Cyc Annotation Database System (CycADS). CycADS is centred on a specific database model and on a set of Java programs to import, filter and export relevant information. Data from GenBank and other annotation sources (including for example: KAAS, PRIAM, Blast2GO and PhylomeDB) are collected into a database to be subsequently filtered and extracted to generate a complete annotation file. This file is then used to build an enriched BioCyc database using the PathoLogic program of Pathway Tools. The CycADS pipeline for annotation management was used to build the AcypiCyc database for the pea aphid (Acyrthosiphon pisum) whose genome was recently sequenced. The AcypiCyc database webpage also includes, for comparative analyses, two other metabolic reconstruction BioCyc databases generated using CycADS: TricaCyc for Tribolium castaneum and DromeCyc for Drosophila melanogaster. Linked to its flexible design, CycADS offers a powerful software tool for the generation and regular updating of enriched BioCyc databases. The CycADS system is particularly suited for metabolic gene annotation and network reconstruction in newly sequenced genomes. Because of the uniform annotation used for metabolic network reconstruction, CycADS is particularly useful for comparative analysis of the metabolism of different organisms. Database URL: http://www.cycadsys.org.

  16. Applying negative rule mining to improve genome annotation

    PubMed Central

    Artamonova, Irena I; Frishman, Goar; Frishman, Dmitrij

    2007-01-01

    Background: Unsupervised annotation of proteins by software pipelines suffers from very high error rates. Spurious functional assignments are usually caused by unwarranted homology-based transfer of information from existing database entries to the new target sequences. We have previously demonstrated that data mining in large sequence annotation databanks can help identify annotation items that are strongly associated with each other, and that exceptions from strong positive association rules often point to potential annotation errors. Here we investigate the applicability of negative association rule mining to revealing erroneously assigned annotation items. Results: Almost all exceptions from strong negative association rules are connected to at least one wrong attribute in the feature combination making up the rule. The fraction of annotation features flagged by this approach as suspicious is strongly enriched in errors and constitutes about 0.6% of the whole body of the similarity-transferred annotation in the PEDANT genome database. Positive rule mining does not identify two thirds of these errors. The approach based on exceptions from negative rules is much more specific than positive rule mining, but its coverage is significantly lower. Conclusion: Mining of both negative and positive association rules is a potent tool for finding significant trends in protein annotation and flagging doubtful features for further inspection. PMID:17659089
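
    A negative association rule has the form "entries carrying item A do not carry item B"; entries that violate a strong rule of this kind become candidates for inspection. The sketch below illustrates that idea on a toy data set. The annotation items, entry identifiers, and the support and confidence thresholds are hypothetical, and the simple confidence measure used here is an assumption rather than the scoring used in the paper.

```python
from itertools import permutations


def mine_negative_rules(entries, min_support=3, min_confidence=0.75):
    """Find strong rules 'A implies not B' and the entries that violate them.

    entries maps an entry id to its set of annotation items.
    """
    items = sorted(set().union(*entries.values()))
    support = {
        item: [eid for eid, feats in entries.items() if item in feats]
        for item in items
    }
    rules = []
    for a, b in permutations(items, 2):
        # Both items must be common enough for their non-co-occurrence to mean something.
        if len(support[a]) < min_support or len(support[b]) < min_support:
            continue
        exceptions = [eid for eid in support[a] if b in entries[eid]]
        confidence = 1 - len(exceptions) / len(support[a])
        # A strong negative rule with a few violations: the violators are suspects.
        if exceptions and confidence >= min_confidence:
            rules.append((a, b, confidence, exceptions))
    return rules


# Hypothetical annotation items, for illustration only.
entries = {
    "p1": {"nuclear", "DNA-binding"},
    "p2": {"nuclear", "DNA-binding"},
    "p3": {"nuclear", "DNA-binding"},
    "p4": {"nuclear", "secreted"},  # unusual combination
    "p5": {"secreted", "signal peptide"},
    "p6": {"secreted", "signal peptide"},
    "p7": {"secreted", "signal peptide"},
}

rules = mine_negative_rules(entries)
suspects = {eid for rule in rules for eid in rule[3]}
for a, b, confidence, exceptions in rules:
    print(f"{a} -> not {b} (confidence {confidence:.2f}); exceptions: {exceptions}")
```

    In this toy data only entry p4 is flagged, because it combines two items that otherwise never occur together.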

  17. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation.

    PubMed

    O'Leary, Nuala A; Wright, Mathew W; Brister, J Rodney; Ciufo, Stacy; Haddad, Diana; McVeigh, Rich; Rajput, Bhanu; Robbertse, Barbara; Smith-White, Brian; Ako-Adjei, Danso; Astashyn, Alexander; Badretdin, Azat; Bao, Yiming; Blinkova, Olga; Brover, Vyacheslav; Chetvernin, Vyacheslav; Choi, Jinna; Cox, Eric; Ermolaeva, Olga; Farrell, Catherine M; Goldfarb, Tamara; Gupta, Tripti; Haft, Daniel; Hatcher, Eneida; Hlavina, Wratko; Joardar, Vinita S; Kodali, Vamsi K; Li, Wenjun; Maglott, Donna; Masterson, Patrick; McGarvey, Kelly M; Murphy, Michael R; O'Neill, Kathleen; Pujar, Shashikant; Rangwala, Sanjida H; Rausch, Daniel; Riddick, Lillian D; Schoch, Conrad; Shkeda, Andrei; Storz, Susan S; Sun, Hanzhen; Thibaud-Nissen, Francoise; Tolstoy, Igor; Tully, Raymond E; Vatsan, Anjana R; Wallin, Craig; Webb, David; Wu, Wendy; Landrum, Melissa J; Kimchi, Avi; Tatusova, Tatiana; DiCuccio, Michael; Kitts, Paul; Murphy, Terence D; Pruitt, Kim D

    2016-01-04

    The RefSeq project at the National Center for Biotechnology Information (NCBI) maintains and curates a publicly available database of annotated genomic, transcript, and protein sequence records (http://www.ncbi.nlm.nih.gov/refseq/). The RefSeq project leverages the data submitted to the International Nucleotide Sequence Database Collaboration (INSDC) against a combination of computation, manual curation, and collaboration to produce a standard set of stable, non-redundant reference sequences. The RefSeq project augments these reference sequences with current knowledge including publications, functional features and informative nomenclature. The database currently represents sequences from more than 55,000 organisms (>4800 viruses, >40,000 prokaryotes and >10,000 eukaryotes; RefSeq release 71), ranging from a single record to complete genomes. This paper summarizes the current status of the viral, prokaryotic, and eukaryotic branches of the RefSeq project, reports on improvements to data access and details efforts to further expand the taxonomic representation of the collection. We also highlight diverse functional curation initiatives that support multiple uses of RefSeq data including taxonomic validation, genome annotation, comparative genomics, and clinical testing. We summarize our approach to utilizing available RNA-Seq and other data types in our manual curation process for vertebrate, plant, and other species, and describe a new direction for prokaryotic genomes and protein name management.

  18. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation

    PubMed Central

    O'Leary, Nuala A.; Wright, Mathew W.; Brister, J. Rodney; Ciufo, Stacy; Haddad, Diana; McVeigh, Rich; Rajput, Bhanu; Robbertse, Barbara; Smith-White, Brian; Ako-Adjei, Danso; Astashyn, Alexander; Badretdin, Azat; Bao, Yiming; Blinkova, Olga; Brover, Vyacheslav; Chetvernin, Vyacheslav; Choi, Jinna; Cox, Eric; Ermolaeva, Olga; Farrell, Catherine M.; Goldfarb, Tamara; Gupta, Tripti; Haft, Daniel; Hatcher, Eneida; Hlavina, Wratko; Joardar, Vinita S.; Kodali, Vamsi K.; Li, Wenjun; Maglott, Donna; Masterson, Patrick; McGarvey, Kelly M.; Murphy, Michael R.; O'Neill, Kathleen; Pujar, Shashikant; Rangwala, Sanjida H.; Rausch, Daniel; Riddick, Lillian D.; Schoch, Conrad; Shkeda, Andrei; Storz, Susan S.; Sun, Hanzhen; Thibaud-Nissen, Francoise; Tolstoy, Igor; Tully, Raymond E.; Vatsan, Anjana R.; Wallin, Craig; Webb, David; Wu, Wendy; Landrum, Melissa J.; Kimchi, Avi; Tatusova, Tatiana; DiCuccio, Michael; Kitts, Paul; Murphy, Terence D.; Pruitt, Kim D.

    2016-01-01

    The RefSeq project at the National Center for Biotechnology Information (NCBI) maintains and curates a publicly available database of annotated genomic, transcript, and protein sequence records (http://www.ncbi.nlm.nih.gov/refseq/). The RefSeq project leverages the data submitted to the International Nucleotide Sequence Database Collaboration (INSDC) against a combination of computation, manual curation, and collaboration to produce a standard set of stable, non-redundant reference sequences. The RefSeq project augments these reference sequences with current knowledge including publications, functional features and informative nomenclature. The database currently represents sequences from more than 55 000 organisms (>4800 viruses, >40 000 prokaryotes and >10 000 eukaryotes; RefSeq release 71), ranging from a single record to complete genomes. This paper summarizes the current status of the viral, prokaryotic, and eukaryotic branches of the RefSeq project, reports on improvements to data access and details efforts to further expand the taxonomic representation of the collection. We also highlight diverse functional curation initiatives that support multiple uses of RefSeq data including taxonomic validation, genome annotation, comparative genomics, and clinical testing. We summarize our approach to utilizing available RNA-Seq and other data types in our manual curation process for vertebrate, plant, and other species, and describe a new direction for prokaryotic genomes and protein name management. PMID:26553804

  19. Evaluation of relational and NoSQL database architectures to manage genomic annotations.

    PubMed

    Schulz, Wade L; Nelson, Brent G; Felker, Donn K; Durant, Thomas J S; Torres, Richard

    2016-12-01

    While the adoption of next generation sequencing has rapidly expanded, the informatics infrastructure used to manage the data generated by this technology has not kept pace. Historically, relational databases have provided much of the framework for data storage and retrieval. Newer technologies based on NoSQL architectures may provide significant advantages in storage and query efficiency, thereby reducing the cost of data management. But their relative advantage when applied to biomedical data sets, such as genetic data, has not been characterized. To this end, we compared the storage, indexing, and query efficiency of a common relational database (MySQL), a document-oriented NoSQL database (MongoDB), and a relational database with NoSQL support (PostgreSQL). When used to store genomic annotations from the dbSNP database, we found the NoSQL architectures to outperform traditional, relational models for speed of data storage, indexing, and query retrieval in nearly every operation. These findings strongly support the use of novel database technologies to improve the efficiency of data management within the biological sciences.
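
    As a rough illustration of the two storage models compared in this record, the sketch below stores one annotation both as normalized relational rows and as a single JSON document. It is not the authors' benchmark: SQLite from the Python standard library stands in for MySQL, MongoDB, and PostgreSQL, and the record, field names, and helper functions are hypothetical.

```python
import json
import sqlite3

# Hypothetical annotation record; identifiers and field names are illustrative only.
record = {"rsid": "rs0000001", "chrom": "1", "pos": 12345, "genes": ["GENE1", "GENE2"]}

conn = sqlite3.connect(":memory:")

# Relational model: one row per SNP plus a child table for the one-to-many gene links.
conn.execute("CREATE TABLE snp (rsid TEXT PRIMARY KEY, chrom TEXT, pos INTEGER)")
conn.execute("CREATE TABLE snp_gene (rsid TEXT REFERENCES snp(rsid), gene TEXT)")
conn.execute(
    "INSERT INTO snp VALUES (?, ?, ?)",
    (record["rsid"], record["chrom"], record["pos"]),
)
conn.executemany(
    "INSERT INTO snp_gene VALUES (?, ?)",
    [(record["rsid"], gene) for gene in record["genes"]],
)

# Document model: the whole record serialized as one JSON value keyed by rsid.
conn.execute("CREATE TABLE snp_doc (rsid TEXT PRIMARY KEY, doc TEXT)")
conn.execute(
    "INSERT INTO snp_doc VALUES (?, ?)",
    (record["rsid"], json.dumps(record)),
)


def genes_relational(rsid):
    """Look up linked genes through the child table."""
    rows = conn.execute(
        "SELECT gene FROM snp_gene WHERE rsid = ? ORDER BY gene", (rsid,)
    )
    return [gene for (gene,) in rows]


def genes_document(rsid):
    """Fetch the stored document and read the gene list from it."""
    (doc,) = conn.execute(
        "SELECT doc FROM snp_doc WHERE rsid = ?", (rsid,)
    ).fetchone()
    return sorted(json.loads(doc)["genes"])
```

    The relational form needs a second table and a lookup per one-to-many field, while the document form returns the nested record in one read; which is faster in practice depends on the engine and workload, which is what the study measured.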

  20. BioBuilder as a database development and functional annotation platform for proteins

    PubMed Central

    Navarro, J Daniel; Talreja, Naveen; Peri, Suraj; Vrushabendra, BM; Rashmi, BP; Padma, N; Surendranath, Vineeth; Jonnalagadda, Chandra Kiran; Kousthub, PS; Deshpande, Nandan; Shanker, K; Pandey, Akhilesh

    2004-01-01

    Background: The explosion in biological information creates the need for databases that are easy to develop, easy to maintain and can be easily manipulated by annotators who are most likely to be biologists. However, deployment of scalable and extensible databases is not an easy task and generally requires substantial expertise in database development. Results: BioBuilder is a Zope-based software tool that was developed to facilitate intuitive creation of protein databases. Protein data can be entered and annotated through web forms, with the flexibility to add customized annotation features to protein entries. A built-in review system permits a global team of scientists to coordinate their annotation efforts. We have already used BioBuilder to develop the Human Protein Reference Database, a comprehensive annotated repository of the human proteome. The data can be exported in the extensible markup language (XML) format, which is rapidly becoming the standard format for data exchange. Conclusions: As the proteomic data for several organisms begin to accumulate, BioBuilder will prove to be an invaluable platform for functional annotation and development of customizable protein-centric databases. BioBuilder is open source and is available under the terms of the LGPL. PMID:15099404

  1. ASGARD: an open-access database of annotated transcriptomes for emerging model arthropod species.

    PubMed

    Zeng, Victor; Extavour, Cassandra G

    2012-01-01

    The increased throughput and decreased cost of next-generation sequencing (NGS) have shifted the bottleneck in genomic research from sequencing to annotation, analysis and accessibility. This is particularly challenging for research communities working on organisms that lack the basic infrastructure of a sequenced genome, or an efficient way to utilize whatever sequence data may be available. Here we present a new database, the Assembled Searchable Giant Arthropod Read Database (ASGARD). This database is a repository and search engine for transcriptomic data from arthropods that are of high interest to multiple research communities but currently lack sequenced genomes. We demonstrate the functionality and utility of ASGARD using de novo assembled transcriptomes from the milkweed bug Oncopeltus fasciatus, the cricket Gryllus bimaculatus and the amphipod crustacean Parhyale hawaiensis. We have annotated these transcriptomes to assign putative orthology, coding region determination, protein domain identification and Gene Ontology (GO) term annotation to all possible assembly products. ASGARD allows users to search all assemblies by orthology annotation, GO term annotation or Basic Local Alignment Search Tool. User-friendly features of ASGARD include search term auto-completion suggestions based on database content, the ability to download assembly product sequences in FASTA format, direct links to NCBI data for predicted orthologs and graphical representation of the location of protein domains and matches to similar sequences from the NCBI non-redundant database. ASGARD will be a useful repository for transcriptome data from future NGS studies on these and other emerging model arthropods, regardless of sequencing platform, assembly or annotation status. This database thus provides easy, one-stop access to multi-species annotated transcriptome information. We anticipate that this database will be useful for members of multiple research communities, including developmental

  2. The H-Invitational Database (H-InvDB), a comprehensive annotation resource for human genes and transcripts.

    PubMed

    Yamasaki, Chisato; Murakami, Katsuhiko; Fujii, Yasuyuki; Sato, Yoshiharu; Harada, Erimi; Takeda, Jun-ichi; Taniya, Takayuki; Sakate, Ryuichi; Kikugawa, Shingo; Shimada, Makoto; Tanino, Motohiko; Koyanagi, Kanako O; Barrero, Roberto A; Gough, Craig; Chun, Hong-Woo; Habara, Takuya; Hanaoka, Hideki; Hayakawa, Yosuke; Hilton, Phillip B; Kaneko, Yayoi; Kanno, Masako; Kawahara, Yoshihiro; Kawamura, Toshiyuki; Matsuya, Akihiro; Nagata, Naoki; Nishikata, Kensaku; Noda, Akiko Ogura; Nurimoto, Shin; Saichi, Naomi; Sakai, Hiroaki; Sanbonmatsu, Ryoko; Shiba, Rie; Suzuki, Mami; Takabayashi, Kazuhiko; Takahashi, Aiko; Tamura, Takuro; Tanaka, Masayuki; Tanaka, Susumu; Todokoro, Fusano; Yamaguchi, Kaori; Yamamoto, Naoyuki; Okido, Toshihisa; Mashima, Jun; Hashizume, Aki; Jin, Lihua; Lee, Kyung-Bum; Lin, Yi-Chueh; Nozaki, Asami; Sakai, Katsunaga; Tada, Masahito; Miyazaki, Satoru; Makino, Takashi; Ohyanagi, Hajime; Osato, Naoki; Tanaka, Nobuhiko; Suzuki, Yoshiyuki; Ikeo, Kazuho; Saitou, Naruya; Sugawara, Hideaki; O'Donovan, Claire; Kulikova, Tamara; Whitfield, Eleanor; Halligan, Brian; Shimoyama, Mary; Twigger, Simon; Yura, Kei; Kimura, Kouichi; Yasuda, Tomohiro; Nishikawa, Tetsuo; Akiyama, Yutaka; Motono, Chie; Mukai, Yuri; Nagasaki, Hideki; Suwa, Makiko; Horton, Paul; Kikuno, Reiko; Ohara, Osamu; Lancet, Doron; Eveno, Eric; Graudens, Esther; Imbeaud, Sandrine; Debily, Marie Anne; Hayashizaki, Yoshihide; Amid, Clara; Han, Michael; Osanger, Andreas; Endo, Toshinori; Thomas, Michael A; Hirakawa, Mika; Makalowski, Wojciech; Nakao, Mitsuteru; Kim, Nam-Soon; Yoo, Hyang-Sook; De Souza, Sandro J; Bonaldo, Maria de Fatima; Niimura, Yoshihito; Kuryshev, Vladimir; Schupp, Ingo; Wiemann, Stefan; Bellgard, Matthew; Shionyu, Masafumi; Jia, Libin; Thierry-Mieg, Danielle; Thierry-Mieg, Jean; Wagner, Lukas; Zhang, Qinghua; Go, Mitiko; Minoshima, Shinsei; Ohtsubo, Masafumi; Hanada, Kousuke; Tonellato, Peter; Isogai, Takao; Zhang, Ji; Lenhard, Boris; Kim, Sangsoo; Chen, Zhu; Hinz, Ursula; 
Estreicher, Anne; Nakai, Kenta; Makalowska, Izabela; Hide, Winston; Tiffin, Nicola; Wilming, Laurens; Chakraborty, Ranajit; Soares, Marcelo Bento; Chiusano, Maria Luisa; Suzuki, Yutaka; Auffray, Charles; Yamaguchi-Kabata, Yumi; Itoh, Takeshi; Hishiki, Teruyoshi; Fukuchi, Satoshi; Nishikawa, Ken; Sugano, Sumio; Nomura, Nobuo; Tateno, Yoshio; Imanishi, Tadashi; Gojobori, Takashi

    2008-01-01

    Here we report the new features and improvements in our latest release of the H-Invitational Database (H-InvDB; http://www.h-invitational.jp/), a comprehensive annotation resource for human genes and transcripts. H-InvDB, originally developed as an integrated database of the human transcriptome based on extensive annotation of large sets of full-length cDNA (FLcDNA) clones, now provides annotation for 120 558 human mRNAs extracted from the International Nucleotide Sequence Databases (INSD), in addition to 54 978 human FLcDNAs, in the latest release H-InvDB_4.6. We mapped those human transcripts onto the human genome sequences (NCBI build 36.1) and determined 34 699 human gene clusters, which could define 34 057 (98.1%) protein-coding and 642 (1.9%) non-protein-coding loci; 858 (2.5%) transcribed loci overlapped with predicted pseudogenes. For all these transcripts and genes, we provide comprehensive annotation including gene structures, gene functions, alternative splicing variants, functional non-protein-coding RNAs, functional domains, predicted subcellular localizations, metabolic pathways, predictions of protein 3D structure, mapping of SNPs and microsatellite repeat motifs, co-localization with orphan diseases, gene expression profiles, orthologous genes, protein-protein interactions (PPI) and annotation for gene families. The current H-InvDB annotation resources consist of two main views (Transcript view and Locus view) and eight sub-databases: the DiseaseInfo Viewer, H-ANGEL, the Clustering Viewer, G-integra, the TOPO Viewer, Evola, the PPI view and the Gene family/group.

  3. The H-Invitational Database (H-InvDB), a comprehensive annotation resource for human genes and transcripts*

    PubMed Central

    2008-01-01

    Here we report the new features and improvements in our latest release of the H-Invitational Database (H-InvDB; http://www.h-invitational.jp/), a comprehensive annotation resource for human genes and transcripts. H-InvDB, originally developed as an integrated database of the human transcriptome based on extensive annotation of large sets of full-length cDNA (FLcDNA) clones, now provides annotation for 120 558 human mRNAs extracted from the International Nucleotide Sequence Databases (INSD), in addition to 54 978 human FLcDNAs, in the latest release H-InvDB_4.6. We mapped those human transcripts onto the human genome sequences (NCBI build 36.1) and determined 34 699 human gene clusters, which could define 34 057 (98.1%) protein-coding and 642 (1.9%) non-protein-coding loci; 858 (2.5%) transcribed loci overlapped with predicted pseudogenes. For all these transcripts and genes, we provide comprehensive annotation including gene structures, gene functions, alternative splicing variants, functional non-protein-coding RNAs, functional domains, predicted subcellular localizations, metabolic pathways, predictions of protein 3D structure, mapping of SNPs and microsatellite repeat motifs, co-localization with orphan diseases, gene expression profiles, orthologous genes, protein–protein interactions (PPI) and annotation for gene families. The current H-InvDB annotation resources consist of two main views (Transcript view and Locus view) and eight sub-databases: the DiseaseInfo Viewer, H-ANGEL, the Clustering Viewer, G-integra, the TOPO Viewer, Evola, the PPI view and the Gene family/group. PMID:18089548

  4. Assessment of community-submitted ontology annotations from a novel database-journal partnership.

    PubMed

    Berardini, Tanya Z; Li, Donghui; Muller, Robert; Chetty, Raymond; Ploetz, Larry; Singh, Shanker; Wensel, April; Huala, Eva

    2012-01-01

    As the scientific literature grows, leading to an increasing volume of published experimental data, so does the need to access and analyze this data using computational tools. The most commonly used method to convert published experimental data on gene function into controlled vocabulary annotations relies on a professional curator, employed by a model organism database or a more general resource such as UniProt, to read published articles and compose annotation statements based on the articles' contents. A more cost-effective and scalable approach capable of capturing gene function data across the whole range of biological research organisms in computable form is urgently needed. We have analyzed a set of ontology annotations generated through collaborations between the Arabidopsis Information Resource and several plant science journals. Analysis of the submissions entered using the online submission tool shows that most community annotations were well supported and the ontology terms chosen were at an appropriate level of specificity. Of the 503 individual annotations that were submitted, 97% were approved and community submissions captured 72% of all possible annotations. This new method for capturing experimental results in a computable form provides a cost-effective way to greatly increase the available body of annotations without sacrificing annotation quality. Database URL: www.arabidopsis.org.

  5. CHPVDB ‐ a sequence annotation database for Chandipura Virus

    PubMed Central

    Dikhit, Manas Ranjan; Rana, Sindhu Prava; Das, Pradeep; Sahoo, Ganesh Chandra

    2009-01-01

    Databases containing proteomic information have become indispensable for virology studies. As the gap between the amount of sequence information and functional characterization widens, increasing efforts are being directed to the development of databases. For virologists, it is therefore desirable to have a single data collection point that integrates research-related data from different domains. CHPVDB is our effort to provide virologists with such a one-step information center. We describe herein the creation of CHPVDB, a new database that integrates information on different proteins into a single resource. For basic curation of protein information, the database relies on features from other selected databases, servers and published reports. The database links molecular analysis, cleavage sites, and possible protein functional families assigned to different proteins of Chandipura virus (CHPV) by SVMProt and related tools. Availability: The database is freely available at http://chpvdb.biomedinformri.org/. PMID:19293996

  6. SNPnexus: a web database for functional annotation of newly discovered and public domain single nucleotide polymorphisms

    PubMed Central

    Chelala, Claude; Khan, Arshad; Lemoine, Nicholas R

    2009-01-01

    Motivation: Design a new computational tool allowing scientists to functionally annotate newly discovered and public domain single nucleotide polymorphisms in order to help in prioritizing targets in further disease studies and large-scale genotyping projects. Summary: The SNPnexus database provides functional annotation for both novel and public SNPs. Possible effects on the transcriptome and proteome levels are characterized and reported from five major annotation systems providing the most extensive information on alternative splicing. Additional information on HapMap genotype and allele frequency, overlaps with potential regulatory elements or structural variations, as well as related genetic diseases, can also be retrieved. The SNPnexus database has a user-friendly web interface, providing single or batch query options using SNP identifiers from dbSNP as well as genomic location on clones, contigs or chromosomes. Therefore, SNPnexus is the only database currently providing a complete set of functional annotations of SNPs in public databases and newly detected from sequencing projects. Hence, we describe SNPnexus and provide details of the query options, the annotation categories, and biological examples of use. Availability: The SNPnexus database is freely available at http://www.snp-nexus.org. Contact: claude.chelala@cancer.org.uk PMID:19098027

  7. Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies.

    PubMed

    Haas, Brian J; Delcher, Arthur L; Mount, Stephen M; Wortman, Jennifer R; Smith, Roger K; Hannick, Linda I; Maiti, Rama; Ronning, Catherine M; Rusch, Douglas B; Town, Christopher D; Salzberg, Steven L; White, Owen

    2003-10-01

    The spliced alignment of expressed sequence data to genomic sequence has proven a key tool in the comprehensive annotation of genes in eukaryotic genomes. A novel algorithm was developed to assemble clusters of overlapping transcript alignments (ESTs and full-length cDNAs) into maximal alignment assemblies, thereby comprehensively incorporating all available transcript data and capturing subtle splicing variations. Complete and partial gene structures identified by this method were used to improve The Institute for Genomic Research Arabidopsis genome annotation (TIGR release v.4.0). The alignment assemblies permitted the automated modeling of several novel genes and >1000 alternative splicing variations as well as updates (including UTR annotations) to nearly half of the approximately 27 000 annotated protein coding genes. The algorithm of the Program to Assemble Spliced Alignments (PASA) tool is described, as well as the results of automated updates to Arabidopsis gene annotations.

  8. GBshape: a genome browser database for DNA shape annotations.

    PubMed

    Chiu, Tsu-Pei; Yang, Lin; Zhou, Tianyin; Main, Bradley J; Parker, Stephen C J; Nuzhdin, Sergey V; Tullius, Thomas D; Rohs, Remo

    2015-01-01

    Many regulatory mechanisms require a high degree of specificity in protein-DNA binding. Nucleotide sequence does not provide an answer to the question of why a protein binds only to a small subset of the many putative binding sites in the genome that share the same core motif. Whereas higher-order effects, such as chromatin accessibility, cooperativity and cofactors, have been described, DNA shape recently gained attention as another feature that fine-tunes the DNA binding specificities of some transcription factor families. Our Genome Browser for DNA shape annotations (GBshape; freely available at http://rohslab.cmb.usc.edu/GBshape/) provides minor groove width, propeller twist, roll, helix twist and hydroxyl radical cleavage predictions for the entire genomes of 94 organisms. Additional genomes can easily be added using the GBshape framework. GBshape can be used to visualize DNA shape annotations qualitatively in a genome browser track format, and to download quantitative values of DNA shape features as a function of genomic position at nucleotide resolution. As biological applications, we illustrate the periodicity of DNA shape features that are present in nucleosome-occupied sequences from human, fly and worm, and we demonstrate structural similarities between transcription start sites in the genomes of four Drosophila species.

  9. CEBS: a comprehensive annotated database of toxicological data

    PubMed Central

    Lea, Isabel A.; Gong, Hui; Paleja, Anand; Rashid, Asif; Fostel, Jennifer

    2017-01-01

    The Chemical Effects in Biological Systems database (CEBS) is a comprehensive and unique toxicology resource that compiles individual and summary animal data from the National Toxicology Program (NTP) testing program and other depositors into a single electronic repository. CEBS has undergone significant updates in recent years and currently contains over 11 000 test articles (exposure agents) and over 8000 studies including all available NTP carcinogenicity, short-term toxicity and genetic toxicity studies. Study data provided to CEBS are manually curated, accessioned and subject to quality assurance review prior to release to ensure high quality. The CEBS database has two main components: data collection and data delivery. To accommodate the breadth of data produced by NTP, the CEBS data collection component is an integrated relational design that allows the flexibility to capture any type of electronic data (to date). The data delivery component of the database comprises a series of dedicated user interface tables containing pre-processed data that support each component of the user interface. The user interface has been updated to include a series of nine Guided Search tools that allow access to NTP summary and conclusion data and larger non-NTP datasets. The CEBS database can be accessed online at http://www.niehs.nih.gov/research/resources/databases/cebs/. PMID:27899660

  10. The Flux: creating a large annotated image database

    NASA Astrophysics Data System (ADS)

    Tamburrino, Daniel; Schönmann, Patrick; Vandewalle, Patrick; Süsstrunk, Sabine

    2008-01-01

    From image retrieval to image classification, all research shares one common requirement: a good image database to test or train the algorithms. In order to create a large database of images, we set up a project that allowed gathering a collection of more than 33000 photographs with keywords and tags from all over the world. This project was part of the "We Are All Photographers Now!" exhibition at the Musee de l'Elysee in Lausanne, Switzerland. The "Flux," as it was called, gave all photographers, professional or amateur, the opportunity to have their images shown in the museum. Anyone could upload pictures on a website. We required that some simple tags were filled in. Keywords were optional. The information was collected in a MySQL database along with the original photos. The pictures were projected at the museum in five second intervals. A webcam snapshot was taken and sent back to the photographers via email to show how and when their image was displayed at the museum. During the 14 weeks of the exhibition, we collected more than 33000 JPEG pictures with tags and keywords. These pictures come from 133 countries and were taken by 9042 different photographers. This database can be used for non-commercial research at EPFL. We present some preliminary analysis here.

  11. A Tunable Machine Vision-based Strategy for Automated Annotation of Chemical Databases

    PubMed Central

    Park, Jungkap; Rosania, Gus R.; Saitou, Kazuhiro

    2010-01-01

    We present a tunable, machine vision-based strategy for automated annotation of virtual small molecule databases. The proposed strategy is based on the use of a machine vision-based tool for extracting structure diagrams in research articles and converting them into connection tables, a virtual “Chemical Expert” system for screening the converted structures based on adjustable levels of estimated conversion accuracy, and a fragment-based measure for calculating intermolecular similarity. For annotation, the calculated chemical similarity between the converted structures and entries in a virtual small molecule database is used to establish the links. The overall annotation performance can be tuned by adjusting the cutoff threshold of the estimated conversion accuracy. We performed an annotation test that attempts to link 121 journal articles registered in PubMed to entries in PubChem, the largest publicly accessible chemical database. Two test cases were performed and their results compared to see how the overall annotation performance is affected by different threshold levels of the estimated accuracy of the converted structure. Our work demonstrates that over 45% of articles could have true positive links to entries in the PubChem database, with promising recall and precision rates in both tests. Furthermore, we illustrate that the Chemical Expert system, which can screen the converted structures based on adjustable levels of estimated conversion accuracy, is a key factor impacting the overall annotation performance. We propose that this machine vision-based strategy can be combined with the text-mining approach to facilitate extraction of contextual scientific knowledge about a chemical structure from the scientific literature. PMID:19621901

  12. Improving the Caenorhabditis elegans genome annotation using machine learning.

    PubMed

    Rätsch, Gunnar; Sonnenburg, Sören; Srinivasan, Jagan; Witte, Hanh; Müller, Klaus-R; Sommer, Ralf-J; Schölkopf, Bernhard

    2007-02-23

    For modern biology, precise genome annotations are of prime importance, as they allow the accurate definition of genic regions. We employ state-of-the-art machine learning methods to assay and improve the accuracy of the genome annotation of the nematode Caenorhabditis elegans. The proposed machine learning system is trained to recognize exons and introns on the unspliced mRNA, utilizing recent advances in support vector machines and label sequence learning. In 87% (coding and untranslated regions) and 95% (coding regions only) of all genes tested in several out-of-sample evaluations, our method correctly identified all exons and introns. Notably, only 37% and 50%, respectively, of the presently unconfirmed genes in the C. elegans genome annotation agree with our predictions, thus we hypothesize that a sizable fraction of those genes are not correctly annotated. A retrospective evaluation of the Wormbase WS120 annotation of C. elegans reveals that splice form predictions on unconfirmed genes in WS120 are inaccurate in about 18% of the considered cases, while our predictions deviate from the truth only in 10%-13%. We experimentally analyzed 20 controversial genes on which our system and the annotation disagree, confirming the superiority of our predictions. While our method correctly predicted 75% of those cases, the standard annotation was never completely correct. The accuracy of our system is further corroborated by a comparison with two other recently proposed systems that can be used for splice form prediction: SNAP and ExonHunter. We conclude that the genome annotation of C. elegans and other organisms can be greatly enhanced using modern machine learning technology.

  13. CvManGO, a method for leveraging computational predictions to improve literature-based Gene Ontology annotations.

    PubMed

    Park, Julie; Costanzo, Maria C; Balakrishnan, Rama; Cherry, J Michael; Hong, Eurie L

    2012-01-01

    The set of annotations at the Saccharomyces Genome Database (SGD) that classifies the cellular function of S. cerevisiae gene products using Gene Ontology (GO) terms has become an important resource for facilitating experimental analysis. In addition to capturing and summarizing experimental results, the structured nature of GO annotations allows for functional comparison across organisms as well as propagation of functional predictions between related gene products. Due to their relevance to many areas of research, ensuring the accuracy and quality of these annotations is a priority at SGD. GO annotations are assigned either manually, by biocurators extracting experimental evidence from the scientific literature, or through automated methods that leverage computational algorithms to predict functional information. Here, we discuss the relationship between literature-based and computationally predicted GO annotations in SGD and extend a strategy whereby comparison of these two types of annotation identifies genes whose annotations need review. Our method, CvManGO (Computational versus Manual GO annotations), pairs literature-based GO annotations with computational GO predictions and evaluates the relationship of the two terms within GO, looking for instances of discrepancy. We found that this method will identify genes that require annotation updates, taking an important step towards finding ways to prioritize literature review. Additionally, we explored factors that may influence the effectiveness of CvManGO in identifying relevant gene targets to find in particular those genes that are missing literature-supported annotations, but our survey found that there are no immediately identifiable criteria by which one could enrich for these under-annotated genes. Finally, we discuss possible ways to improve this strategy, and the applicability of this method to other projects that use the GO for curation. DATABASE URL: http://www.yeastgenome.org.
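
    The pairing step described in this record can be sketched as follows. This is a minimal illustration, not SGD's implementation: the toy ontology, the term identifiers, the category names, and the rule for what counts as a discrepancy are all assumptions made for the example.

```python
# Toy ontology mapping each term to its parent terms.
# Identifiers are hypothetical and are not real GO IDs.
PARENTS = {
    "T:root": set(),
    "T:binding": {"T:root"},
    "T:dna_binding": {"T:binding"},
    "T:catalysis": {"T:root"},
    "T:kinase": {"T:catalysis"},
}


def ancestors(term):
    """Return every term reachable by following parent links from `term`."""
    seen = set()
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def compare(manual, computational):
    """Classify how a computational term relates to a manual term."""
    if manual == computational:
        return "same"
    if manual in ancestors(computational):
        return "computational_more_specific"
    if computational in ancestors(manual):
        return "manual_more_specific"
    return "different_lineage"


def flag_for_review(pairs):
    """Return genes whose paired terms suggest the manual annotation may be out of date.

    `pairs` is an iterable of (gene, manual_term, computational_term).
    Treating these two categories as discrepancies is an assumption of this sketch.
    """
    discrepant = {"computational_more_specific", "different_lineage"}
    return [gene for gene, manual, comp in pairs if compare(manual, comp) in discrepant]
```

    For example, a gene annotated manually with the hypothetical term "T:binding" but predicted computationally as "T:dna_binding" would be flagged, because the prediction is more specific than the curated term.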

  14. Manual Gene Ontology annotation workflow at the Mouse Genome Informatics Database.

    PubMed

    Drabkin, Harold J; Blake, Judith A

    2012-01-01

    The Mouse Genome Database, the Gene Expression Database and the Mouse Tumor Biology database are integrated components of the Mouse Genome Informatics (MGI) resource (http://www.informatics.jax.org). The MGI system presents both a consensus view and an experimental view of the knowledge concerning the genetics and genomics of the laboratory mouse. From genotype to phenotype, this information resource integrates information about genes, sequences, maps, expression analyses, alleles, strains and mutant phenotypes. Comparative mammalian data are also presented particularly in regards to the use of the mouse as a model for the investigation of molecular and genetic components of human diseases. These data are collected from literature curation as well as downloads of large datasets (SwissProt, LocusLink, etc.). MGI is one of the founding members of the Gene Ontology (GO) and uses the GO for functional annotation of genes. Here, we discuss the workflow associated with manual GO annotation at MGI, from literature collection to display of the annotations. Peer-reviewed literature is collected mostly from a set of journals available electronically. Selected articles are entered into a master bibliography and indexed to one of eight areas of interest such as 'GO' or 'homology' or 'phenotype'. Each article is then either indexed to a gene already contained in the database or funneled through a separate nomenclature database to add genes. The master bibliography and associated indexing provide information for various curator-reports such as 'papers selected for GO that refer to genes with NO GO annotation'. Once indexed, curators who have expertise in appropriate disciplines enter pertinent information. MGI makes use of several controlled vocabularies that ensure uniform data encoding, enable robust analysis and support the construction of complex queries. These vocabularies range from pick-lists to structured vocabularies such as the GO. 
All data associations are supported

  15. CSE database: extended annotations and new recommendations for ECG software testing.

    PubMed

    Smíšek, Radovan; Maršánová, Lucie; Němcová, Andrea; Vítek, Martin; Kozumplík, Jiří; Nováková, Marie

    2016-12-31

    Nowadays, cardiovascular diseases represent the most common cause of death in western countries. Among various examination techniques, electrocardiography (ECG) is still a highly valuable tool used for the diagnosis of many cardiovascular disorders. In order to diagnose a person based on ECG, cardiologists can use automatic diagnostic algorithms. Research in this area is still necessary. In order to compare various algorithms correctly, it is necessary to test them on standard annotated databases, such as the Common Standards for Quantitative Electrocardiography (CSE) database. According to Scopus, the CSE database is the second most cited standard database. There were two main objectives in this work. First, new diagnoses were added to the CSE database, which extended its original annotations. Second, new recommendations for diagnostic software quality estimation were established. The ECG recordings were diagnosed by five new cardiologists independently, and in total, 59 different diagnoses were found. Such a large number of diagnoses is unique, even in terms of standard databases. Based on the cardiologists' diagnoses, a four-round consensus (4R consensus) was established. Such a 4R consensus means a correct final diagnosis, which should ideally be the output of any tested classification software. The accuracy of the cardiologists' diagnoses compared with the 4R consensus was the basis for the establishment of accuracy recommendations. The accuracy was determined in terms of sensitivity = 79.20-86.81%, positive predictive value = 79.10-87.11%, and the Jaccard coefficient = 72.21-81.14%, respectively. Within these ranges, the accuracy of the software is comparable with the accuracy of cardiologists. The accuracy quantification of the correct classification is unique. Diagnostic software developers can objectively evaluate the success of their algorithm and promote its further development. The annotations and recommendations proposed in this work will allow
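
    The three accuracy measures named in this record can be computed from the sets of diagnoses assigned to a recording. The sketch below uses the standard set-based definitions, which are assumed here because the abstract does not spell them out; the function name and the example labels are hypothetical.

```python
def diagnosis_metrics(reference, predicted):
    """Compare predicted diagnoses with reference (consensus) diagnoses.

    Returns (sensitivity, positive predictive value, Jaccard coefficient),
    using the usual definitions:
      sensitivity = TP / (TP + FN)
      PPV         = TP / (TP + FP)
      Jaccard     = TP / (TP + FP + FN)
    """
    reference, predicted = set(reference), set(predicted)
    tp = len(reference & predicted)   # diagnoses found in both sets
    fn = len(reference - predicted)   # reference diagnoses that were missed
    fp = len(predicted - reference)   # predicted diagnoses not in the reference
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    ppv = tp / (tp + fp) if tp + fp else 0.0
    jaccard = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return sensitivity, ppv, jaccard
```

    With a reference set of four diagnoses and a prediction that matches three of them and adds one incorrect label, the counts are TP = 3, FN = 1, FP = 1, giving a sensitivity of 3/4, a positive predictive value of 3/4, and a Jaccard coefficient of 3/5.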

  16. Annotated checklist and database for vascular plants of the Jemez Mountains

    SciTech Connect

    Foxx, T. S.; Pierce, L.; Tierney, G. D.; Hansen, L. A.

    1998-03-01

    Studies done in the last 40 years have provided the information needed to construct a checklist of the vascular plants of the Jemez Mountains. The present database and checklist build on the basic list compiled by Teralene Foxx and Gail Tierney in the early 1980s. The checklist is annotated with taxonomic information, geographic and biological information, economic uses, wildlife cover, revegetation potential, and ethnographic uses. Nearly 1000 species have been noted for the Jemez Mountains. This list is cross-referenced with the US Department of Agriculture Natural Resource Conservation Service PLANTS database species names and acronyms. All information will soon be available on a Web Page.

  17. circRNADb: A comprehensive database for human circular RNAs with protein-coding annotations

    PubMed Central

    Chen, Xiaoping; Han, Ping; Zhou, Tao; Guo, Xuejiang; Song, Xiaofeng; Li, Yan

    2016-01-01

    Circular RNAs are known to be widely expressed in human tissues and cells and to play important regulatory roles in physiological or pathological processes. However, a comprehensively annotated database of human circular RNAs has been lacking. In this study we established a circRNA database, named circRNADb, containing 32,914 human exonic circRNAs carefully selected from diverse sources. Detailed information on each circRNA, including genomic information, exon splicing, genome sequence, internal ribosome entry site (IRES), open reading frame (ORF) and references, is provided in circRNADb. In addition, circRNAs were found to be able to encode proteins, which has not been reported in any species. In total, 16,328 circRNAs were annotated as having an ORF longer than 100 amino acids, of which 7,170 have IRES elements. Forty-six circRNAs from 37 genes were found to have their corresponding proteins expressed according to mass spectrometry. The database provides functions for data search, browsing, download, submission and feedback, so that users can study a particular circular RNA of interest and the database can be updated continually. circRNADb is intended to become a biological information platform for circRNA molecules and related biological functions in the future. The database is freely available through the web server at http://reprod.njmu.edu.cn/circrnadb. PMID:27725737

  18. Annotation error in public databases: misannotation of molecular function in enzyme superfamilies.

    PubMed

    Schnoes, Alexandra M; Brown, Shoshana D; Dodevski, Igor; Babbitt, Patricia C

    2009-12-01

    Due to the rapid release of new data from genome sequencing projects, the majority of protein sequences in public databases have not been experimentally characterized; rather, sequences are annotated using computational analysis. The level of misannotation and the types of misannotation in large public databases are currently unknown and have not been analyzed in depth. We have investigated the misannotation levels for molecular function in four public protein sequence databases (UniProtKB/Swiss-Prot, GenBank NR, UniProtKB/TrEMBL, and KEGG) for a model set of 37 enzyme families for which extensive experimental information is available. The manually curated database Swiss-Prot shows the lowest annotation error levels (close to 0% for most families); the two other protein sequence databases (GenBank NR and TrEMBL) and the protein sequences in the KEGG pathways database exhibit similar and surprisingly high levels of misannotation that average 5%-63% across the six superfamilies studied. For 10 of the 37 families examined, the level of misannotation in one or more of these databases is >80%. Examination of the NR database over time shows that misannotation has increased from 1993 to 2005. The types of misannotation that were found fall into several categories, most associated with "overprediction" of molecular function. These results suggest that misannotation in enzyme superfamilies containing multiple families that catalyze different reactions is a larger problem than has been recognized. Strategies are suggested for addressing some of the systematic problems contributing to these high levels of misannotation.

  19. The Saccharomyces Genome Database: Gene Product Annotation of Function, Process, and Component.

    PubMed

    Cherry, J Michael

    2015-12-02

    An ontology is a highly structured form of controlled vocabulary. Each entry in the ontology is commonly called a term. These terms are used when talking about an annotation. However, each term has a definition that, like the definition of a word found within a dictionary, provides the complete usage and detailed explanation of the term. It is critical to consult a term's definition because the distinction between terms can be subtle. The use of ontologies in biology started as a way of unifying communication between scientific communities and to provide a standard dictionary for different topics, including molecular functions, biological processes, mutant phenotypes, chemical properties and structures. The creation of ontology terms and their definitions often requires debate to reach agreement but the result has been a unified descriptive language used to communicate knowledge. In addition to terms and definitions, ontologies require a relationship used to define the type of connection between terms. In an ontology, a term can have more than one parent term, the term above it in an ontology, as well as more than one child, the term below it in the ontology. Many ontologies are used to construct annotations in the Saccharomyces Genome Database (SGD), as in all modern biological databases; however, Gene Ontology (GO), a descriptive system used to categorize gene function, is the most extensively used ontology in SGD annotations. Examples included in this protocol illustrate the structure and features of this ontology.
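
    The parent and child structure described above, in which a term may have more than one parent, can be sketched as a small directed graph. The terms and relationships below are an illustrative toy, not copied from the Gene Ontology or from SGD, and the function name is hypothetical.

```python
# Toy ontology: each term maps to a list of (relationship, parent term) pairs.
# A term may list several parents, so the structure is a graph, not a tree.
PARENTS = {
    "mitochondrial membrane": [("is_a", "organelle membrane"),
                               ("part_of", "mitochondrion")],
    "organelle membrane": [("is_a", "membrane")],
    "mitochondrion": [("is_a", "organelle")],
    "membrane": [("is_a", "cellular component")],
    "organelle": [("is_a", "cellular component")],
}


def ancestors(term, parents):
    """Return every term reachable by following parent links upward."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _relationship, parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


print(sorted(ancestors("mitochondrial membrane", PARENTS)))
```

    Collecting ancestors this way is what allows a gene annotated to a specific term to be found when searching with a broader term.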

  20. Construction of customized sub-databases from NCBI-nr database for rapid annotation of huge metagenomic datasets using a combined BLAST and MEGAN approach.

    PubMed

    Yu, Ke; Zhang, Tong

    2013-01-01

    We developed a fast method to construct local sub-databases from the NCBI-nr database for the quick similarity search and annotation of huge metagenomic datasets based on BLAST-MEGAN approach. A three-step sub-database annotation pipeline (SAP) was further proposed to conduct the annotation in a much more time-efficient way which required far less computational capacity than the direct NCBI-nr database BLAST-MEGAN approach. The 1st BLAST of SAP was conducted using the original metagenomic dataset against the constructed sub-database for a quick screening of candidate target sequences. Then, the candidate target sequences identified in the 1st BLAST were subjected to the 2nd BLAST against the whole NCBI-nr database. The BLAST results were finally annotated using MEGAN to filter out those mistakenly selected sequences in the 1st BLAST to guarantee the accuracy of the results. Based on the tests conducted in this study, SAP achieved a speedup of ~150-385 times at the BLAST e-value of 1e-5, compared to the direct BLAST against NCBI-nr database. The annotation results of SAP are exactly in agreement with those of the direct NCBI-nr database BLAST-MEGAN approach, which is very time-consuming and computationally intensive. Selecting rigorous thresholds (e.g. e-value of 1e-10) would further accelerate SAP process. The SAP pipeline may also be coupled with novel similarity search tools (e.g. RAPsearch) other than BLAST to achieve even faster annotation of huge metagenomic datasets. Above all, this sub-database construction method and SAP pipeline provides a new time-efficient and convenient annotation similarity search strategy for laboratories without access to high performance computing facilities. SAP also offers a solution to high performance computing facilities for the processing of more similarity search tasks.
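
    The three-step logic described above (screen against a small sub-database, re-search only the candidates against the full database, then discard mistaken selections) can be sketched in a few lines. This is a toy under stated assumptions: a k-mer Jaccard similarity stands in for BLAST, a simple best-hit check stands in for the MEGAN step, and all sequences, names and the threshold are hypothetical.

```python
def kmers(seq, k=4):
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similarity(a, b, k=4):
    """Jaccard similarity of k-mer sets; a toy stand-in for a BLAST score."""
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    return len(ka & kb) / len(union) if union else 0.0


def best_hit(read, database):
    """Return (name, score) of the most similar database entry."""
    return max(((name, similarity(read, seq)) for name, seq in database.items()),
               key=lambda hit: hit[1])


def two_stage_annotation(reads, full_db, target_names, threshold=0.3):
    # Step 1: quick screen of every read against the small sub-database.
    sub_db = {name: full_db[name] for name in target_names}
    candidates = [r for r in reads if best_hit(r, sub_db)[1] >= threshold]
    # Step 2: only the candidates are searched against the full database.
    # Step 3: keep a candidate only if its best full-database hit is a target.
    confirmed = {}
    for read in candidates:
        name, score = best_hit(read, full_db)
        if name in target_names and score >= threshold:
            confirmed[read] = name
    return candidates, confirmed


full_db = {"target_gene": "AAAACCCCGGGG", "other_gene": "AAAACCCCTTTT"}
reads = ["AAAACCCCGGGG", "AAAACCCCTTTT", "TGTGTGTGTGTG"]
candidates, confirmed = two_stage_annotation(reads, full_db, {"target_gene"})
print(candidates)
print(confirmed)
```

    In this toy run the second read passes the initial screen because it partly resembles the target, but it is dropped once the full database shows a better match elsewhere, while the third read never reaches the expensive second search.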

  1. Construction of Customized Sub-Databases from NCBI-nr Database for Rapid Annotation of Huge Metagenomic Datasets Using a Combined BLAST and MEGAN Approach

    PubMed Central

    Yu, Ke; Zhang, Tong

    2013-01-01

    We developed a fast method to construct local sub-databases from the NCBI-nr database for the quick similarity search and annotation of huge metagenomic datasets based on BLAST-MEGAN approach. A three-step sub-database annotation pipeline (SAP) was further proposed to conduct the annotation in a much more time-efficient way which required far less computational capacity than the direct NCBI-nr database BLAST-MEGAN approach. The 1st BLAST of SAP was conducted using the original metagenomic dataset against the constructed sub-database for a quick screening of candidate target sequences. Then, the candidate target sequences identified in the 1st BLAST were subjected to the 2nd BLAST against the whole NCBI-nr database. The BLAST results were finally annotated using MEGAN to filter out those mistakenly selected sequences in the 1st BLAST to guarantee the accuracy of the results. Based on the tests conducted in this study, SAP achieved a speedup of ∼150–385 times at the BLAST e-value of 1e–5, compared to the direct BLAST against NCBI-nr database. The annotation results of SAP are exactly in agreement with those of the direct NCBI-nr database BLAST-MEGAN approach, which is very time-consuming and computationally intensive. Selecting rigorous thresholds (e.g. e-value of 1e–10) would further accelerate SAP process. The SAP pipeline may also be coupled with novel similarity search tools (e.g. RAPsearch) other than BLAST to achieve even faster annotation of huge metagenomic datasets. Above all, this sub-database construction method and SAP pipeline provides a new time-efficient and convenient annotation similarity search strategy for laboratories without access to high performance computing facilities. SAP also offers a solution to high performance computing facilities for the processing of more similarity search tasks. PMID:23573212

  2. DNApod: DNA polymorphism annotation database from next-generation sequence read archives.

    PubMed

    Mochizuki, Takako; Tanizawa, Yasuhiro; Fujisawa, Takatomo; Ohta, Tazro; Nikoh, Naruo; Shimizu, Tokurou; Toyoda, Atsushi; Fujiyama, Asao; Kurata, Nori; Nagasaki, Hideki; Kaminuma, Eli; Nakamura, Yasukazu

    2017-01-01

    With the rapid advances in next-generation sequencing (NGS), datasets for DNA polymorphisms among various species and strains have been produced, stored, and distributed. However, reliability varies among these datasets because the experimental and analytical conditions used differ among assays. Furthermore, such datasets have been frequently distributed from the websites of individual sequencing projects. It is desirable to integrate DNA polymorphism data into one database featuring uniform quality control that is distributed from a single platform at a single place. DNA polymorphism annotation database (DNApod; http://tga.nig.ac.jp/dnapod/) is an integrated database that stores genome-wide DNA polymorphism datasets acquired under uniform analytical conditions, and this includes uniformity in the quality of the raw data, the reference genome version, and evaluation algorithms. DNApod genotypic data are re-analyzed whole-genome shotgun datasets extracted from sequence read archives, and DNApod distributes genome-wide DNA polymorphism datasets and known-gene annotations for each DNA polymorphism. This new database was developed for storing genome-wide DNA polymorphism datasets of plants, with crops being the first priority. Here, we describe our analyzed data for 679, 404, and 66 strains of rice, maize, and sorghum, respectively. The analytical methods are available as a DNApod workflow in an NGS annotation system of the DNA Data Bank of Japan and a virtual machine image. Furthermore, DNApod provides tables of links of identifiers between DNApod genotypic data and public phenotypic data. To advance the sharing of organism knowledge, DNApod offers basic and ubiquitous functions for multiple alignment and phylogenetic tree construction by using orthologous gene information.

  3. Design and implementation of a database for Brucella melitensis genome annotation.

    PubMed

    De Hertogh, Benoît; Lahlimi, Leïla; Lambert, Christophe; Letesson, Jean-Jacques; Depiereux, Eric

    2008-03-18

    The genome sequences of three Brucella biovars and of some species close to Brucella sp. have become available, enabling new analyses of their relationships. Moreover, the automatic genome annotation of the pathogenic bacterium Brucella melitensis has been manually corrected by a consortium of experts, leading to 899 modifications of start site predictions among the 3198 open reading frames (ORFs) examined. This new annotation, coupled with the results of automatic annotation tools applied to the complete genome sequence of B. melitensis (including BLASTs to 9 genomes close to Brucella), provides numerous data sets related to predicted functions, biochemical properties and phylogenetic comparisons. To make these results available, alphaPAGe, a functional auto-updatable database of the corrected genome sequence of B. melitensis, has been built using the entity-relationship (ER) approach and a multi-purpose database structure. A user-friendly graphical interface has been designed, and users can retrieve different kinds of information through three levels of queries: (1) the basic search uses classical keywords or sequence identifiers; (2) the original advanced search engine allows numerous criteria to be combined with logical operators: (a) keywords (textual comparison) related to the pCDS's function, family domains and cellular localization; (b) physico-chemical characteristics (numerical comparison) such as isoelectric point or molecular weight and structural criteria such as the nucleic length or the number of transmembrane helices (TMH); (c) similarity scores with Escherichia coli and 10 species phylogenetically close to B. melitensis; (3) complex queries can be performed by using a SQL field, which allows all queries respecting the database's structure. The database is publicly available through a Web server at the following url: http://www.fundp.ac.be/urbm/bioinfo/aPAGe.
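
    A query that combines a textual criterion with numerical ones using logical operators, as in the advanced and SQL query levels described above, might look like the following sketch. The table name, column names and rows are hypothetical and are not the actual alphaPAGe schema; an in-memory SQLite database is used only to keep the example self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical table of predicted coding sequences (pCDS).
conn.execute("""
    CREATE TABLE pcds (
        identifier        TEXT PRIMARY KEY,
        product           TEXT,
        isoelectric_point REAL,
        molecular_weight  REAL,
        tmh_count         INTEGER
    )
""")
conn.executemany(
    "INSERT INTO pcds VALUES (?, ?, ?, ?, ?)",
    [
        ("pCDS_0001", "ABC transporter permease", 9.1, 31000.0, 6),
        ("pCDS_0002", "ABC transporter ATP-binding protein", 5.8, 27000.0, 0),
        ("pCDS_0003", "transcriptional regulator", 6.4, 18000.0, 0),
    ],
)

# Keyword (textual) AND isoelectric point (numerical) AND helix count (structural).
rows = conn.execute(
    """
    SELECT identifier FROM pcds
    WHERE product LIKE ?
      AND isoelectric_point > ?
      AND tmh_count >= ?
    ORDER BY identifier
    """,
    ("%transporter%", 7.0, 1),
).fetchall()

print(rows)
conn.close()
```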

  4. BambooGDB: a bamboo genome database with functional annotation and an analysis platform

    PubMed Central

    Zhao, Hansheng; Peng, Zhenhua; Fei, Benhua; Li, Lubin; Hu, Tao; Gao, Zhimin; Jiang, Zehui

    2014-01-01

    Bamboo, as one of the most important non-timber forest products and fastest-growing plants in the world, represents the only major lineage of grasses that is native to forests. The recent success in producing the first high-quality draft genome sequence of moso bamboo (Phyllostachys edulis) provides new insights into bamboo genetics and evolution. To further extend our understanding of the bamboo genome and facilitate future studies on the basis of previous achievements, here we have developed BambooGDB, a bamboo genome database with functional annotation and an analysis platform. The de novo sequencing data, together with the full-length complementary DNA and RNA-seq data of moso bamboo, make up the main content of this database. Based on these sequence data, a comprehensive functional annotation of the bamboo genome was made. In addition, an analytical platform comprising comparative genomic analysis, a protein–protein interaction network, pathway analysis and visualization of genomic data was constructed. As a set of discovery tools for understanding and identifying biological mechanisms of bamboo, the platform can be used as a systematic framework for designing experiments for further validation. Moreover, diverse and powerful search tools and a convenient browser were incorporated to facilitate the navigation of these data. As far as we know, this is the first genome database for bamboo. Through integrating high-throughput sequencing data, a full functional annotation and several analysis modules, BambooGDB aims to provide worldwide researchers with a central genomic resource and an extensible analysis platform for the bamboo genome. BambooGDB is freely available at http://www.bamboogdb.org/. Database URL: http://www.bamboogdb.org PMID:24602877

  5. DNApod: DNA polymorphism annotation database from next-generation sequence read archives

    PubMed Central

    Mochizuki, Takako; Tanizawa, Yasuhiro; Fujisawa, Takatomo; Ohta, Tazro; Nikoh, Naruo; Shimizu, Tokurou; Toyoda, Atsushi; Fujiyama, Asao; Kurata, Nori; Nagasaki, Hideki; Kaminuma, Eli; Nakamura, Yasukazu

    2017-01-01

    With the rapid advances in next-generation sequencing (NGS), datasets for DNA polymorphisms among various species and strains have been produced, stored, and distributed. However, reliability varies among these datasets because the experimental and analytical conditions used differ among assays. Furthermore, such datasets have been frequently distributed from the websites of individual sequencing projects. It is desirable to integrate DNA polymorphism data into one database featuring uniform quality control that is distributed from a single platform at a single place. DNA polymorphism annotation database (DNApod; http://tga.nig.ac.jp/dnapod/) is an integrated database that stores genome-wide DNA polymorphism datasets acquired under uniform analytical conditions, and this includes uniformity in the quality of the raw data, the reference genome version, and evaluation algorithms. DNApod genotypic data are re-analyzed whole-genome shotgun datasets extracted from sequence read archives, and DNApod distributes genome-wide DNA polymorphism datasets and known-gene annotations for each DNA polymorphism. This new database was developed for storing genome-wide DNA polymorphism datasets of plants, with crops being the first priority. Here, we describe our analyzed data for 679, 404, and 66 strains of rice, maize, and sorghum, respectively. The analytical methods are available as a DNApod workflow in an NGS annotation system of the DNA Data Bank of Japan and a virtual machine image. Furthermore, DNApod provides tables of links of identifiers between DNApod genotypic data and public phenotypic data. To advance the sharing of organism knowledge, DNApod offers basic and ubiquitous functions for multiple alignment and phylogenetic tree construction by using orthologous gene information. PMID:28234924

  6. BambooGDB: a bamboo genome database with functional annotation and an analysis platform.

    PubMed

    Zhao, Hansheng; Peng, Zhenhua; Fei, Benhua; Li, Lubin; Hu, Tao; Gao, Zhimin; Jiang, Zehui

    2014-01-01

    Bamboo, as one of the most important non-timber forest products and fastest-growing plants in the world, represents the only major lineage of grasses that is native to forests. The recent success in producing the first high-quality draft genome sequence of moso bamboo (Phyllostachys edulis) provides new insights into bamboo genetics and evolution. To further extend our understanding of the bamboo genome and facilitate future studies on the basis of previous achievements, here we have developed BambooGDB, a bamboo genome database with functional annotation and an analysis platform. The de novo sequencing data, together with the full-length complementary DNA and RNA-seq data of moso bamboo, make up the main content of this database. Based on these sequence data, a comprehensive functional annotation of the bamboo genome was made. In addition, an analytical platform comprising comparative genomic analysis, a protein-protein interaction network, pathway analysis and visualization of genomic data was constructed. As a set of discovery tools for understanding and identifying biological mechanisms of bamboo, the platform can be used as a systematic framework for designing experiments for further validation. Moreover, diverse and powerful search tools and a convenient browser were incorporated to facilitate the navigation of these data. As far as we know, this is the first genome database for bamboo. Through integrating high-throughput sequencing data, a full functional annotation and several analysis modules, BambooGDB aims to provide worldwide researchers with a central genomic resource and an extensible analysis platform for the bamboo genome. BambooGDB is freely available at http://www.bamboogdb.org/. Database URL: http://www.bamboogdb.org.

  7. RefSeq microbial genomes database: new representation and annotation strategy

    PubMed Central

    Tatusova, Tatiana; Ciufo, Stacy; Fedorov, Boris; O’Neill, Kathleen; Tolstoy, Igor

    2014-01-01

    The source of the microbial genomic sequences in the RefSeq collection is the set of primary sequence records submitted to the International Nucleotide Sequence Database public archives. These can be accessed through the Entrez search and retrieval system at http://www.ncbi.nlm.nih.gov/genome. Next-generation sequencing has enabled researchers to perform genomic sequencing at rates that were unimaginable in the past. Microbial genomes can now be sequenced in a matter of hours, which has led to a significant increase in the number of assembled genomes deposited in the public archives. This huge increase in DNA sequence data presents new challenges for the annotation, analysis and visualization bioinformatics tools. New strategies have been developed for the annotation and representation of reference genomes and sequence variations derived from population studies and clinical outbreaks. PMID:24316578

  8. RefSeq microbial genomes database: new representation and annotation strategy.

    PubMed

    Tatusova, Tatiana; Ciufo, Stacy; Fedorov, Boris; O'Neill, Kathleen; Tolstoy, Igor

    2014-01-01

    The source of the microbial genomic sequences in the RefSeq collection is the set of primary sequence records submitted to the International Nucleotide Sequence Database public archives. These can be accessed through the Entrez search and retrieval system at http://www.ncbi.nlm.nih.gov/genome. Next-generation sequencing has enabled researchers to perform genomic sequencing at rates that were unimaginable in the past. Microbial genomes can now be sequenced in a matter of hours, which has led to a significant increase in the number of assembled genomes deposited in the public archives. This huge increase in DNA sequence data presents new challenges for the annotation, analysis and visualization bioinformatics tools. New strategies have been developed for the annotation and representation of reference genomes and sequence variations derived from population studies and clinical outbreaks.

  9. MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects

    PubMed Central

    2011-01-01

    Background: Second-generation sequencing technologies are precipitating major shifts with regards to what kinds of genomes are being sequenced and how they are annotated. While the first generation of genome projects focused on well-studied model organisms, many of today's projects involve exotic organisms whose genomes are largely terra incognita. This complicates their annotation, because unlike first-generation projects, there are no pre-existing 'gold-standard' gene-models with which to train gene-finders. Improvements in genome assembly and the wide availability of mRNA-seq data are also creating opportunities to update and re-annotate previously published genome annotations. Today's genome projects are thus in need of new genome annotation tools that can meet the challenges and opportunities presented by second-generation sequencing technologies. Results: We present MAKER2, a genome annotation and data management tool designed for second-generation genome projects. MAKER2 is a multi-threaded, parallelized application that can process second-generation datasets of virtually any size. We show that MAKER2 can produce accurate annotations for novel genomes where training-data are limited, of low quality or even non-existent. MAKER2 also provides an easy means to use mRNA-seq data to improve annotation quality; and it can use these data to update legacy annotations, significantly improving their quality. We also show that MAKER2 can evaluate the quality of genome annotations, and identify and prioritize problematic annotations for manual review. Conclusions: MAKER2 is the first annotation engine specifically designed for second-generation genome projects. MAKER2 scales to datasets of any size, requires little in the way of training data, and can use mRNA-seq data to improve annotation quality. It can also update and manage legacy genome annotation datasets. PMID:22192575

  10. ASAP: automated sequence annotation pipeline for web-based updating of sequence information with a local dynamic database.

    PubMed

    Kossenkov, Andrew; Manion, Frank J; Korotkov, Eugene; Moloshok, Thomas D; Ochs, Michael F

    2003-03-22

    The automated sequence annotation pipeline (ASAP) is designed to ease routine investigation of new functional annotations on unknown sequences, such as expressed sequence tags (ESTs), through querying of web-accessible resources and maintenance of a local database. The system allows easy use of the output from one search as the input for a new search, as well as the filtering of results. The database is used to store formats and parameters and information for parsing data from web sites. The database permits easy updating of format information should a site modify the format of a query or of a returned web page.

  11. Improving tRNAscan-SE Annotation Results via Ensemble Classifiers.

    PubMed

    Zou, Quan; Guo, Jiasheng; Ju, Ying; Wu, Meihong; Zeng, Xiangxiang; Hong, Zhiling

    2015-11-01

    tRNAscan-SE is a tRNA detection program that is widely used for tRNA annotation; however, the false positive rate of tRNAscan-SE is unacceptable for large sequences. Here, we used a machine learning method to try to improve the tRNAscan-SE results. A new predictor, tRNA-Predict, was designed. We obtained real and pseudo-tRNA sequences as training data sets using tRNAscan-SE and constructed three different tRNA feature sets. We then set up an ensemble classifier, LibMutil, to predict tRNAs from the training data. The positive data set of 623 tRNA sequences was obtained from tRNAdb 2009 and the negative data set was the false positive tRNAs predicted by tRNAscan-SE. Our in silico experiments revealed a prediction accuracy rate of 95.1% for tRNA-Predict using 10-fold cross-validation. tRNA-Predict was developed to distinguish functional tRNAs from pseudo-tRNAs rather than to predict tRNAs from a genome-wide scan. However, tRNA-Predict can work with the output of tRNAscan-SE, which is a genome-wide scanning method, to improve the tRNAscan-SE annotation results. The tRNA-Predict web server is accessible at http://datamining.xmu.edu.cn/~gjs/tRNA-Predict.

  12. A statistical framework for improving genomic annotations of prokaryotic essential genes.

    PubMed

    Deng, Jingyuan; Su, Shengchang; Lin, Xiaodong; Hassett, Daniel J; Lu, Long Jason

    2013-01-01

    Large-scale systematic analysis of gene essentiality is an important step toward unraveling the complex relationship between genotypes and phenotypes. Such analysis cannot be accomplished without unbiased and accurate annotations of essential genes. In current genomic databases, most of the essential gene annotations are derived from whole-genome transposon mutagenesis (TM), the most frequently used experimental approach for determining essential genes in microorganisms under defined conditions. However, there are substantial systematic biases associated with TM experiments. In this study, we developed a novel Poisson model-based statistical framework to simulate the TM insertion process and subsequently correct the experimental biases. We first quantitatively assessed the effects of major factors that potentially influence the accuracy of TM and subsequently incorporated relevant factors into the framework. By iteratively optimizing parameters, we inferred the insertion events that actually occurred and described each gene's essentiality as a probability. Evaluated against the definitive mapping of the essential gene profile in Escherichia coli, our model significantly improved the accuracy of the original TM datasets, resulting in more accurate annotations of essential genes. Our method also showed encouraging results in improving subsaturation-level TM datasets. To test our model's broad applicability to other bacteria, we applied it to Pseudomonas aeruginosa PAO1 and Francisella tularensis novicida TM datasets. We validated our predictions by literature as well as allelic exchange experiments in PAO1. Our model was correct on six of the seven tested genes. Remarkably, in all three cases in which our predictions contradicted the TM assignments, experimental validations supported our predictions. In summary, our method will be a promising tool in improving genomic annotations of essential genes and enabling large-scale explorations of gene essentiality. Our

  13. CyanoBase and RhizoBase: databases of manually curated annotations for cyanobacterial and rhizobial genomes.

    PubMed

    Fujisawa, Takatomo; Okamoto, Shinobu; Katayama, Toshiaki; Nakao, Mitsuteru; Yoshimura, Hidehisa; Kajiya-Kanegae, Hiromi; Yamamoto, Sumiko; Yano, Chiyoko; Yanaka, Yuka; Maita, Hiroko; Kaneko, Takakazu; Tabata, Satoshi; Nakamura, Yasukazu

    2014-01-01

    To understand newly sequenced genomes of closely related species, comprehensively curated reference genome databases are becoming increasingly important. We have extended CyanoBase (http://genome.microbedb.jp/cyanobase), a genome database for cyanobacteria, and newly developed RhizoBase (http://genome.microbedb.jp/rhizobase), a genome database for rhizobia, nitrogen-fixing bacteria associated with leguminous plants. Both databases focus on the representation and reusability of reference genome annotations, which are continuously updated by manual curation. Domain experts have extracted names, products and functions of each gene reported in the literature. To ensure effectiveness of this procedure, we developed the TogoAnnotation system offering a web-based user interface and a uniform storage of annotations for the curators of the CyanoBase and RhizoBase databases. The number of references investigated for CyanoBase increased from 2260 in our previous report to 5285, and for RhizoBase, we perused 1216 references. The results of these intensive annotations are displayed on the GeneView pages of each database. Advanced users can also retrieve this information through the representational state transfer-based web application programming interface in an automated manner.

  14. RepeatsDB 2.0: improved annotation, classification, search and visualization of repeat protein structures

    PubMed Central

    Paladin, Lisanna; Hirsh, Layla; Piovesan, Damiano; Andrade-Navarro, Miguel A.; Kajava, Andrey V.; Tosatto, Silvio C.E.

    2017-01-01

    RepeatsDB 2.0 (URL: http://repeatsdb.bio.unipd.it/) is an update of the database of annotated tandem repeat protein structures. Repeat proteins are a widespread class of non-globular proteins carrying heterogeneous functions involved in several diseases. Here we provide a new version of RepeatsDB with an improved classification schema including high quality annotations for ∼5400 protein structures. RepeatsDB 2.0 features information on start and end positions for the repeat regions and units for all entries. The extensive growth of repeat unit characterization was possible by applying the novel ReUPred annotation method over the entire Protein Data Bank, with data quality guaranteed by an extensive manual validation for >60% of the entries. The updated web interface includes a new search engine for complex queries and a fully re-designed entry page for a better overview of structural data. It is now possible to compare unit positions, together with secondary structure, fold information and Pfam domains. Moreover, a new classification level has been introduced on top of the existing scheme as an independent layer for sequence similarity relationships at 40%, 60% and 90% identity. PMID:27899671

  15. SureChEMBL: a large-scale, chemically annotated patent document database

    PubMed Central

    Papadatos, George; Davies, Mark; Dedman, Nathan; Chambers, Jon; Gaulton, Anna; Siddle, James; Koks, Richard; Irvine, Sean A.; Pettersson, Joe; Goncharoff, Nicko; Hersey, Anne; Overington, John P.

    2016-01-01

    SureChEMBL is a publicly available large-scale resource containing compounds extracted from the full text, images and attachments of patent documents. The data are extracted from the patent literature according to an automated text and image-mining pipeline on a daily basis. SureChEMBL provides access to a previously unavailable, open and timely set of annotated compound-patent associations, complemented with sophisticated combined structure and keyword-based search capabilities against the compound repository and patent document corpus; given the wealth of knowledge hidden in patent documents, analysis of SureChEMBL data has immediate applications in drug discovery, medicinal chemistry and other commercial areas of chemical science. Currently, the database contains 17 million compounds extracted from 14 million patent documents. Access is available through a dedicated web-based interface and data downloads at: https://www.surechembl.org/. PMID:26582922

  16. Re-Annotator: Annotation Pipeline for Microarray Probe Sequences.

    PubMed

    Arloth, Janine; Bader, Daniel M; Röh, Simone; Altmann, Andre

    2015-01-01

    Microarray technologies are established approaches for high throughput gene expression, methylation and genotyping analysis. An accurate mapping of the array probes is essential to generate reliable biological findings. However, manufacturers of the microarray platforms typically provide incomplete and outdated annotation tables, which often rely on older genome and transcriptome versions that differ substantially from up-to-date sequence databases. Here, we present the Re-Annotator, a re-annotation pipeline for microarray probe sequences. It is primarily designed for gene expression microarrays but can also be adapted to other types of microarrays. The Re-Annotator uses a custom-built mRNA reference database to identify the positions of gene expression array probe sequences. We applied Re-Annotator to the Illumina HumanHT-12 v4 microarray platform and found that about one quarter (25%) of the probes differed from the manufacturer's annotation. In further computational experiments on experimental gene expression data, we compared Re-Annotator to another probe re-annotation tool, ReMOAT, and found that Re-Annotator provided an improved re-annotation of microarray probes. A thorough re-annotation of probe information is crucial to any microarray analysis. The Re-Annotator pipeline is freely available at http://sourceforge.net/projects/reannotator along with re-annotated files for Illumina microarrays HumanHT-12 v3/v4 and MouseRef-8 v2.
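
    The comparison behind the "about one quarter of the probes differed" figure can be illustrated with a minimal sketch. It assumes each annotation is available as a mapping from probe ID to gene symbol; the function name, probe IDs and gene symbols below are hypothetical and are not taken from the Re-Annotator pipeline.

```python
def fraction_changed(manufacturer, reannotated):
    """Return the fraction of shared probes whose gene assignment differs."""
    # Only probes present in both annotation tables can be compared.
    shared = manufacturer.keys() & reannotated.keys()
    if not shared:
        raise ValueError("the two annotation tables share no probes")
    changed = sum(1 for probe in shared if manufacturer[probe] != reannotated[probe])
    return changed / len(shared)


# Toy data: hypothetical probe IDs and gene symbols.
manufacturer = {"probe_1": "GENE_A", "probe_2": "GENE_B",
                "probe_3": "GENE_C", "probe_4": "GENE_D"}
reannotated = {"probe_1": "GENE_A", "probe_2": "GENE_B",
               "probe_3": "GENE_C", "probe_4": "GENE_E"}

print(fraction_changed(manufacturer, reannotated))  # 0.25
```

    Restricting the comparison to shared probes is a design choice of this sketch; probes that a re-annotation drops entirely would need to be counted separately.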

  17. Novel transcriptome assembly and improved annotation of the whiteleg shrimp (Litopenaeus vannamei), a dominant crustacean in global seafood mariculture.

    PubMed

    Ghaffari, Noushin; Sanchez-Flores, Alejandro; Doan, Ryan; Garcia-Orozco, Karina D; Chen, Patricia L; Ochoa-Leyva, Adrian; Lopez-Zavala, Alonso A; Carrasco, J Salvador; Hong, Chris; Brieba, Luis G; Rudiño-Piñera, Enrique; Blood, Philip D; Sawyer, Jason E; Johnson, Charles D; Dindot, Scott V; Sotelo-Mundo, Rogerio R; Criscitiello, Michael F

    2014-11-25

    We present a new transcriptome assembly of the Pacific whiteleg shrimp (Litopenaeus vannamei), the species most farmed for human consumption. Its functional annotation, a substantial improvement over previous ones, is provided freely. RNA-Seq with Illumina HiSeq technology was used to analyze samples extracted from shrimp abdominal muscle, hepatopancreas, gills and pleopods. We used the Trinity and Trinotate software suites for transcriptome assembly and annotation, respectively. The quality of this assembly and the affiliated targeted homology searches greatly enrich the curated transcripts currently available in public databases for this species. Comparison with the model arthropod Daphnia allows some insights into defining characteristics of decapod crustaceans. This large-scale gene discovery gives the broadest depth yet to the annotated transcriptome of this important species and should be of value to ongoing genomics and immunogenetic resistance studies in this shrimp of paramount global economic importance.

  18. Novel transcriptome assembly and improved annotation of the whiteleg shrimp (Litopenaeus vannamei), a dominant crustacean in global seafood mariculture

    PubMed Central

    Ghaffari, Noushin; Sanchez-Flores, Alejandro; Doan, Ryan; Garcia-Orozco, Karina D.; Chen, Patricia L.; Ochoa-Leyva, Adrian; Lopez-Zavala, Alonso A.; Carrasco, J. Salvador; Hong, Chris; Brieba, Luis G.; Rudiño-Piñera, Enrique; Blood, Philip D.; Sawyer, Jason E.; Johnson, Charles D.; Dindot, Scott V.; Sotelo-Mundo, Rogerio R.; Criscitiello, Michael F.

    2014-01-01

    We present a new transcriptome assembly of the Pacific whiteleg shrimp (Litopenaeus vannamei), the species most farmed for human consumption. Its functional annotation, a substantial improvement over previous ones, is provided freely. RNA-Seq with Illumina HiSeq technology was used to analyze samples extracted from shrimp abdominal muscle, hepatopancreas, gills and pleopods. We used the Trinity and Trinotate software suites for transcriptome assembly and annotation, respectively. The quality of this assembly and the affiliated targeted homology searches greatly enrich the curated transcripts currently available in public databases for this species. Comparison with the model arthropod Daphnia allows some insights into defining characteristics of decapod crustaceans. This large-scale gene discovery gives the broadest depth yet to the annotated transcriptome of this important species and should be of value to ongoing genomics and immunogenetic resistance studies in this shrimp of paramount global economic importance. PMID:25420880

  19. Missing semantic annotation in databases. The root cause for data integration and migration problems in information systems.

    PubMed

    Dugas, M

    2014-01-01

    Data integration is a well-known grand challenge in information systems. It is highly relevant in medicine because of the multitude of patient data sources. Semantic annotations of data items regarding concept and value domain, based on comprehensive terminologies, can facilitate data integration and migration. Therefore, they should be implemented in databases from the very beginning.

  20. MITOS: improved de novo metazoan mitochondrial genome annotation.

    PubMed

    Bernt, Matthias; Donath, Alexander; Jühling, Frank; Externbrink, Fabian; Florentz, Catherine; Fritzsch, Guido; Pütz, Joern; Middendorf, Martin; Stadler, Peter F

    2013-11-01

    About 2000 completely sequenced mitochondrial genomes are available from the NCBI RefSeq database together with manually curated annotations of their protein-coding genes, rRNAs, and tRNAs. This annotation information, which has accumulated over two decades, has been obtained with a diverse set of computational tools and annotation strategies. Despite all efforts of manual curation it is still plagued by misassignments of reading directions, erroneous gene names, and missing as well as false positive annotations, in particular for the RNA genes. Taken together, this causes substantial problems for fully automatic pipelines that aim to use these data comprehensively for studies of animal phylogenetics and the molecular evolution of mitogenomes. The MITOS pipeline is designed to compute a consistent de novo annotation of the mitogenomic sequences. We show that the results of MITOS match RefSeq and MitoZoa in terms of annotation coverage and quality. At the same time we avoid biases, inconsistencies of nomenclature, and typos originating from manual curation strategies. The MITOS pipeline is accessible online at http://mitos.bioinf.uni-leipzig.de.

  1. Quality of computationally inferred gene ontology annotations.

    PubMed

    Skunca, Nives; Altenhoff, Adrian; Dessimoz, Christophe

    2012-05-01

    Gene Ontology (GO) has established itself as the undisputed standard for protein function annotation. Most annotations are inferred electronically, i.e. without individual curator supervision, but they are widely considered unreliable. At the same time, we crucially depend on those automated annotations, as most newly sequenced genomes are non-model organisms. Here, we introduce a methodology to systematically and quantitatively evaluate electronic annotations. By exploiting changes in successive releases of the UniProt Gene Ontology Annotation database, we assessed the quality of electronic annotations in terms of specificity, reliability, and coverage. Overall, we not only found that electronic annotations have significantly improved in recent years, but also that their reliability now rivals that of annotations inferred by curators when they use evidence other than experiments from primary literature. This work provides the means to identify the subset of electronic annotations that can be relied upon, an important outcome given that >98% of all annotations are inferred without direct curation.
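
    The idea of scoring electronic annotations against a later release can be sketched as follows. This is a simplified illustration only: the abstract does not define its reliability and coverage measures, so the two ratios below are stand-ins, the GO hierarchy is ignored, and the protein and term identifiers are placeholders.

```python
def confirmation_stats(old_electronic, new_experimental):
    """Compare (protein, term) pairs across two releases.

    Returns a tuple (confirmed_fraction, coverage):
      confirmed_fraction: share of old electronic annotations that appear
                          as experimental annotations in the later release
      coverage:           share of later experimental annotations that had
                          already been predicted electronically
    """
    old_electronic = set(old_electronic)
    new_experimental = set(new_experimental)
    confirmed = old_electronic & new_experimental
    confirmed_fraction = len(confirmed) / len(old_electronic) if old_electronic else 0.0
    coverage = len(confirmed) / len(new_experimental) if new_experimental else 0.0
    return confirmed_fraction, coverage


# Placeholder identifiers, not real UniProt accessions or GO terms.
old_release = {("P1", "term_a"), ("P2", "term_b"), ("P3", "term_c"), ("P4", "term_d")}
new_release = {("P1", "term_a"), ("P2", "term_b"), ("P5", "term_e")}

confirmed_fraction, coverage = confirmation_stats(old_release, new_release)
print(confirmed_fraction)  # 0.5
print(coverage)
```

    A fuller treatment would also need to distinguish annotations that were explicitly contradicted from those that simply have not been tested yet, which this sketch does not attempt.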

  2. Modeling loosely annotated images using both given and imagined annotations

    NASA Astrophysics Data System (ADS)

    Tang, Hong; Boujemaa, Nozha; Chen, Yunhao; Deng, Lei

    2011-12-01

    In this paper, we present an approach to learn latent semantic analysis models from loosely annotated images for automatic image annotation and indexing. The given annotation in training images is loose due to: 1. ambiguous correspondences between visual features and annotated keywords; 2. incomplete lists of annotated keywords. The second reason motivates us to enrich the incomplete annotation in a simple way before learning a topic model. In particular, some "imagined" keywords are poured into the incomplete annotation through measuring similarity between keywords in terms of their co-occurrence. Then, both given and imagined annotations are employed to learn probabilistic topic models for automatically annotating new images. We conduct experiments on two image databases (i.e., Corel and ESP) coupled with their loose annotations, and compare the proposed method with state-of-the-art discrete annotation methods. The proposed method improves word-driven probabilistic latent semantic analysis (PLSA-words) up to a comparable performance with the best discrete annotation method, while a merit of PLSA-words is still kept, i.e., a wider semantic range.
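
    The enrichment step, adding "imagined" keywords by co-occurrence similarity, can be illustrated with a minimal sketch. The abstract does not specify the similarity measure, so Jaccard similarity over the sets of images carrying each keyword is assumed here; the function names, the threshold and the toy annotations are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(annotations):
    """Count how many images carry each keyword and each keyword pair."""
    word_counts = Counter()
    pair_counts = Counter()
    for words in annotations:
        unique = set(words)
        word_counts.update(unique)
        # Store each pair in sorted order so (a, b) and (b, a) share one key.
        for a, b in combinations(sorted(unique), 2):
            pair_counts[(a, b)] += 1
    return word_counts, pair_counts


def similarity(a, b, word_counts, pair_counts):
    """Jaccard similarity between the image sets annotated with a and with b."""
    key = (a, b) if a < b else (b, a)
    both = pair_counts[key]
    either = word_counts[a] + word_counts[b] - both
    return both / either if either else 0.0


def imagine_keywords(words, word_counts, pair_counts, threshold=0.5):
    """Return keywords absent from `words` but similar to at least one of them."""
    given = set(words)
    imagined = set()
    for candidate in word_counts:
        if candidate in given:
            continue
        if any(similarity(candidate, w, word_counts, pair_counts) >= threshold
               for w in given):
            imagined.add(candidate)
    return sorted(imagined)


# Toy training annotations (hypothetical).
annotations = [["sky", "sea", "beach"], ["sky", "sea"], ["sea", "beach"],
               ["tiger", "grass"], ["sky"]]
word_counts, pair_counts = cooccurrence_counts(annotations)

print(imagine_keywords(["sky"], word_counts, pair_counts))  # ['sea']
```

    In the approach described above, the enriched keyword lists would then be passed to the topic-model training stage, which this sketch leaves out.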

  3. ModBase, a database of annotated comparative protein structure models and associated resources

    PubMed Central

    Pieper, Ursula; Webb, Benjamin M.; Dong, Guang Qiang; Schneidman-Duhovny, Dina; Fan, Hao; Kim, Seung Joong; Khuri, Natalia; Spill, Yannick G.; Weinkam, Patrick; Hammel, Michal; Tainer, John A.; Nilges, Michael; Sali, Andrej

    2014-01-01

    ModBase (http://salilab.org/modbase) is a database of annotated comparative protein structure models. The models are calculated by ModPipe, an automated modeling pipeline that relies primarily on Modeller for fold assignment, sequence-structure alignment, model building and model assessment (http://salilab.org/modeller/). ModBase currently contains almost 30 million reliable models for domains in 4.7 million unique protein sequences. ModBase allows users to compute or update comparative models on demand, through an interface to the ModWeb modeling server (http://salilab.org/modweb). ModBase models are also available through the Protein Model Portal (http://www.proteinmodelportal.org/). Recently developed associated resources include the AllosMod server for modeling ligand-induced protein dynamics (http://salilab.org/allosmod), the AllosMod-FoXS server for predicting a structural ensemble that fits an SAXS profile (http://salilab.org/allosmod-foxs), the FoXSDock server for protein–protein docking filtered by an SAXS profile (http://salilab.org/foxsdock), the SAXS Merge server for automatic merging of SAXS profiles (http://salilab.org/saxsmerge) and the Pose & Rank server for scoring protein–ligand complexes (http://salilab.org/poseandrank). In this update, we also highlight two applications of ModBase: a PSI:Biology initiative to maximize the structural coverage of the human alpha-helical transmembrane proteome and a determination of structural determinants of human immunodeficiency virus-1 protease specificity. PMID:24271400

  4. Genome-Wide Enzyme Annotation with Precision Control: Catalytic Families (CatFam) Databases

    DTIC Science & Technology

    2008-01-01

    classification, decision trees, association rules, neural networks,18 and support vector machines,19,20 to classify protein catalytic functions using various...genomes To evaluate the performance of CatFam for whole genome annotation, we select two Yersinia genomes [Y. pestis mediaevails (ypm) and Y...pestis and F. tularensis Organism Annotation Number of enzyme-catalyzed reactions Number of predicted pathways Number of pathways with holes Yersinia

  5. The PRINTS database: a fine-grained protein sequence annotation and analysis resource—its status in 2012

    PubMed Central

    Attwood, Teresa K.; Coletta, Alain; Muirhead, Gareth; Pavlopoulou, Athanasia; Philippou, Peter B.; Popov, Ivan; Romá-Mateo, Carlos; Theodosiou, Athina; Mitchell, Alex L.

    2012-01-01

    The PRINTS database, now in its 21st year, houses a collection of diagnostic protein family ‘fingerprints’. Fingerprints are groups of conserved motifs, evident in multiple sequence alignments, whose unique inter-relationships provide distinctive signatures for particular protein families and structural/functional domains. As such, they may be used to assign uncharacterized sequences to known families, and hence to infer tentative functional, structural and/or evolutionary relationships. The February 2012 release (version 42.0) includes 2156 fingerprints, encoding 12 444 individual motifs, covering a range of globular and membrane proteins, modular polypeptides and so on. Here, we report the current status of the database, and introduce a number of recent developments that help both to render a variety of our annotation and analysis tools easier to use and to make them more widely available. Database URL: www.bioinf.manchester.ac.uk/dbbrowser/PRINTS/ PMID:22508994

  6. Improving HIV proteome annotation: new features of BioAfrica HIV Proteomics Resource

    PubMed Central

    Druce, Megan; Hulo, Chantal; Masson, Patrick; Sommer, Paula; Xenarios, Ioannis; Le Mercier, Philippe; De Oliveira, Tulio

    2016-01-01

    The Human Immunodeficiency Virus (HIV) is one of the pathogens that cause the greatest global concern, with approximately 35 million people currently infected with HIV. Extensive HIV research has been performed, generating a large amount of HIV and host genomic data. However, no effective vaccine that protects the host from HIV infection is available and HIV is still spreading at an alarming rate, despite effective antiretroviral (ARV) treatment. In order to develop effective therapies, we need to expand our knowledge of the interaction between HIV and host proteins. In contrast to virus proteins, which often rapidly evolve drug resistance mutations, the host proteins are essentially invariant within all humans. Thus, if we can identify the host proteins needed for virus replication, such as those involved in transporting viral proteins to the cell surface, we have a chance of interrupting viral replication. There is no proteome resource that summarizes this interaction, making research on this subject a difficult enterprise. In order to fill this gap in knowledge, we curated a resource that presents detailed annotation on the interaction between the HIV proteome and host proteins. Our resource was produced in collaboration with ViralZone and used manual curation techniques developed by UniProtKB/Swiss-Prot. Our new website also used previous annotations of the BioAfrica HIV-1 Proteome Resource, which has been accessed by approximately 10 000 unique users a year since its inception in 2005. The novel features include a dedicated new page for each HIV protein, a graphic display of its function and a section on its interaction with host proteins. Our new webpages also add information on the genomic location of each HIV protein and the position of ARV drug resistance mutations. Our improved BioAfrica HIV-1 Proteome Resource fills a gap in the current knowledge of biocuration. Database URL: http://www.bioafrica.net/proteomics/HIVproteome.html PMID:27087306

  7. The Disease Portals, disease-gene annotation and the RGD disease ontology at the Rat Genome Database.

    PubMed

    Hayman, G Thomas; Laulederkind, Stanley J F; Smith, Jennifer R; Wang, Shur-Jen; Petri, Victoria; Nigam, Rajni; Tutaj, Marek; De Pons, Jeff; Dwinell, Melinda R; Shimoyama, Mary

    2016-01-01

    The Rat Genome Database (RGD; http://rgd.mcw.edu/) provides critical datasets and software tools to a diverse community of rat and non-rat researchers worldwide. To meet the needs of the many users whose research is disease oriented, RGD has created a series of Disease Portals and has prioritized its curation efforts on the datasets important to understanding the mechanisms of various diseases. Gene-disease relationships for three species, rat, human and mouse, are annotated to capture biomarkers, genetic associations, molecular mechanisms and therapeutic targets. To generate gene-disease annotations more effectively and in greater detail, RGD initially adopted the MEDIC disease vocabulary from the Comparative Toxicogenomics Database and adapted it for use by expanding this framework with the addition of over 1000 terms to create the RGD Disease Ontology (RDO). The RDO provides the foundation for, at present, 10 comprehensive disease area-related dataset and analysis platforms at RGD, the Disease Portals. Two major disease areas are the focus of data acquisition and curation efforts each year, leading to the release of the related Disease Portals. Collaborative efforts to realize a more robust disease ontology are underway. Database URL: http://rgd.mcw.edu.

  8. The Disease Portals, disease–gene annotation and the RGD disease ontology at the Rat Genome Database

    PubMed Central

    Hayman, G. Thomas; Laulederkind, Stanley J. F.; Smith, Jennifer R.; Wang, Shur-Jen; Petri, Victoria; Nigam, Rajni; Tutaj, Marek; De Pons, Jeff; Dwinell, Melinda R.; Shimoyama, Mary

    2016-01-01

    The Rat Genome Database (RGD; http://rgd.mcw.edu/) provides critical datasets and software tools to a diverse community of rat and non-rat researchers worldwide. To meet the needs of the many users whose research is disease oriented, RGD has created a series of Disease Portals and has prioritized its curation efforts on the datasets important to understanding the mechanisms of various diseases. Gene-disease relationships for three species, rat, human and mouse, are annotated to capture biomarkers, genetic associations, molecular mechanisms and therapeutic targets. To generate gene–disease annotations more effectively and in greater detail, RGD initially adopted the MEDIC disease vocabulary from the Comparative Toxicogenomics Database and adapted it for use by expanding this framework with the addition of over 1000 terms to create the RGD Disease Ontology (RDO). The RDO provides the foundation for, at present, 10 comprehensive disease area-related dataset and analysis platforms at RGD, the Disease Portals. Two major disease areas are the focus of data acquisition and curation efforts each year, leading to the release of the related Disease Portals. Collaborative efforts to realize a more robust disease ontology are underway. Database URL: http://rgd.mcw.edu PMID:27009807

  9. Strategies for annotation and curation of translational databases: the eTUMOUR project.

    PubMed

    Julià-Sapé, Margarida; Lurgi, Miguel; Mier, Mariola; Estanyol, Francesc; Rafael, Xavier; Candiota, Ana Paula; Barceló, Anna; García, Alina; Martínez-Bisbal, M Carmen; Ferrer-Luna, Rubén; Moreno-Torres, Ángel; Celda, Bernardo; Arús, Carles

    2012-01-01

    The eTUMOUR (eT) multi-centre project gathered in vivo and ex vivo magnetic resonance (MR) data, as well as transcriptomic and clinical information from brain tumour patients, with the purpose of improving the diagnostic and prognostic evaluation of future patients. In order to carry this out, among other work, a database, the eTDB, was developed. In addition to complex permission rules and software and management quality control (QC), it was necessary to develop anonymization, processing and data visualization tools for the data uploaded. It was also necessary to develop sophisticated curation strategies that involved, on the one hand, dedicated fields for QC-generated metadata, specialized queries and global permissions for senior curators and, on the other, a set of metrics to quantify the database's contents. The indispensable dataset (ID), completeness and pairedness indices were set. The database contains 1317 cases created as a result of the eT project and 304 from a previous project, INTERPRET. The number of cases fulfilling the ID was 656. Completeness and pairedness were heterogeneous, depending on the data type involved.

  10. A computational platform to maintain and migrate manual functional annotations for BioCyc databases

    Technology Transfer Automated Retrieval System (TEKTRAN)

    Model organism databases are an important resource for information on biological pathways and genomic data. Such databases represent the accumulation of biological data, some of which has been manually curated from literature. An essential feature of these databases is the continuing data integratio...

  11. Citrus sinensis annotation project (CAP): a comprehensive database for sweet orange genome.

    PubMed

    Wang, Jia; Chen, Dijun; Lei, Yang; Chang, Ji-Wei; Hao, Bao-Hai; Xing, Feng; Li, Sen; Xu, Qiang; Deng, Xiu-Xin; Chen, Ling-Ling

    2014-01-01

    Citrus is one of the most important and widely grown fruit crops, with global production ranking first among all the fruit crops in the world. Sweet orange accounts for more than half of the Citrus production, both as fresh fruit and as processed juice. We have sequenced the draft genome of a double-haploid sweet orange (C. sinensis cv. Valencia), and constructed the Citrus sinensis annotation project (CAP) to store and visualize the sequenced genomic and transcriptome data. CAP provides GBrowse-based organization of sweet orange genomic data, which integrates ab initio gene prediction, EST, RNA-seq and RNA-paired end tag (RNA-PET) evidence-based gene annotation. Furthermore, we provide a user-friendly web interface to show the predicted protein-protein interactions (PPIs) and metabolic pathways in sweet orange. CAP provides comprehensive information beneficial to the researchers of sweet orange and other woody plants, which is freely available at http://citrus.hzau.edu.cn/.

  12. A kingdom-specific protein domain HMM library for improved annotation of fungal genomes

    PubMed Central

    Alam, Intikhab; Hubbard, Simon J; Oliver, Stephen G; Rattray, Magnus

    2007-01-01

    Background: Pfam is a general-purpose database of protein domain alignments and profile Hidden Markov Models (HMMs), which is very popular for the annotation of sequence data produced by genome sequencing projects. Pfam provides models that are often very general in terms of the taxa that they cover and it has previously been suggested that such general models may lack some of the specificity or selectivity that would be provided by kingdom-specific models. Results: Here we present a general approach to create domain libraries of HMMs for sub-taxa of a kingdom. Taking fungal species as an example, we construct a domain library of HMMs (called Fungal Pfam or FPfam) using sequences from 30 genomes, consisting of 24 species from the ascomycetes group and two basidiomycetes, Ustilago maydis, a fungal pathogen of maize, and the white rot fungus Phanerochaete chrysosporium. In addition, we include the microsporidian Encephalitozoon cuniculi, an obligate intracellular parasite, and two non-fungal species, the oomycetes Phytophthora sojae and Phytophthora ramorum, both plant pathogens. We evaluate the performance in terms of coverage against the original 30 genomes used in training FPfam and against five more recently sequenced fungal genomes that can be considered as an independent test set. We show that kingdom-specific models such as FPfam can find instances of both novel and well characterized domains, increase overall coverage and detect more domains per sequence with typically higher bitscores than Pfam for the same domain families. An evaluation of the effect of changing E-values on the coverage shows that the performance of FPfam is consistent over the range of E-values applied. Conclusion: Kingdom-specific models are shown to provide improved coverage. However, as the models become more specific, some sequences found by Pfam may be missed by the models in FPfam and some of the families represented in the test set are not present in FPfam. Therefore, we recommend
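
    The evaluation of coverage across E-value cutoffs mentioned in this record can be illustrated with a minimal sketch. It assumes coverage means the fraction of sequences with at least one domain hit at or below the cutoff, which the abstract does not state explicitly; the hit tuples, sequence IDs and domain names are hypothetical.

```python
def coverage(hits, n_sequences, evalue_cutoff):
    """Fraction of sequences with at least one domain hit at or below the cutoff.

    hits: iterable of (sequence_id, domain_name, evalue) tuples
    n_sequences: total number of sequences searched, including those without hits
    """
    if n_sequences <= 0:
        raise ValueError("n_sequences must be positive")
    covered = {seq_id for seq_id, _domain, evalue in hits if evalue <= evalue_cutoff}
    return len(covered) / n_sequences


# Toy search results for a proteome of four sequences (seq4 has no hits).
hits = [
    ("seq1", "dom_a", 1e-30),
    ("seq1", "dom_b", 1e-4),
    ("seq2", "dom_a", 1e-8),
    ("seq3", "dom_c", 0.5),
]

for cutoff in (1e-10, 1e-5, 1.0):
    print(cutoff, coverage(hits, n_sequences=4, evalue_cutoff=cutoff))
```

    Running the same function on the hits from two model libraries at each cutoff would give the kind of side-by-side coverage comparison the abstract describes.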

  13. Strategies for annotation and curation of translational databases: the eTUMOUR project

    PubMed Central

    Julià-Sapé, Margarida; Lurgi, Miguel; Mier, Mariola; Estanyol, Francesc; Rafael, Xavier; Candiota, Ana Paula; Barceló, Anna; García, Alina; Martínez-Bisbal, M. Carmen; Ferrer-Luna, Rubén; Moreno-Torres, Àngel; Celda, Bernardo; Arús, Carles

    2012-01-01

    The eTUMOUR (eT) multi-centre project gathered in vivo and ex vivo magnetic resonance (MR) data, as well as transcriptomic and clinical information from brain tumour patients, with the purpose of improving the diagnostic and prognostic evaluation of future patients. In order to carry this out, among other work, a database, the eTDB, was developed. In addition to complex permission rules and software and management quality control (QC), it was necessary to develop anonymization, processing and data visualization tools for the data uploaded. It was also necessary to develop sophisticated curation strategies that involved, on the one hand, dedicated fields for QC-generated metadata, specialized queries and global permissions for senior curators and, on the other, a set of metrics to quantify the database's contents. The indispensable dataset (ID), completeness and pairedness indices were set. The database contains 1317 cases created as a result of the eT project and 304 from a previous project, INTERPRET. The number of cases fulfilling the ID was 656. Completeness and pairedness were heterogeneous, depending on the data type involved. PMID:23180768

  14. Comparative high-throughput transcriptome sequencing and development of SiESTa, the Silene EST annotation database

    PubMed Central

    2011-01-01

    Background: The genus Silene is widely used as a model system for addressing ecological and evolutionary questions in plants, but advances in using the genus as a model system are impeded by the lack of available resources for studying its genome. Massively parallel sequencing of cDNA has recently developed into an efficient method for characterizing the transcriptomes of non-model organisms, generating massive amounts of data that enable the study of multiple species in a comparative framework. The sequences generated provide an excellent resource for identifying expressed genes, characterizing functional variation and developing molecular markers, thereby laying the foundations for future studies on gene sequence and gene expression divergence. Here, we report the results of a comparative transcriptome sequencing study of eight individuals representing four Silene and one Dianthus species as outgroup. All sequences and annotations have been deposited in a newly developed and publicly available database called SiESTa, the Silene EST annotation database. Results: A total of 1,041,122 EST reads were generated in two runs on a Roche GS-FLX 454 pyrosequencing platform. EST reads were analyzed separately for all eight individuals sequenced and were assembled into contigs using TGICL. These were annotated with results from BLASTX searches and Gene Ontology (GO) terms, and thousands of single-nucleotide polymorphisms (SNPs) were characterized. Unassembled reads were kept as singletons and together with the contigs contributed to the unigenes characterized in each individual. The high quality of unigenes is evidenced by the proportion (49%) that have significant hits in similarity searches with the A. thaliana proteome. The SiESTa database is accessible at http://www.siesta.ethz.ch. Conclusion: The sequence collections established in the present study provide an important genomic resource for four Silene and one Dianthus species and will help to further develop Silene as a

  15. Towards Viral Genome Annotation Standards, Report from the 2010 NCBI Annotation Workshop.

    PubMed

    Brister, James Rodney; Bao, Yiming; Kuiken, Carla; Lefkowitz, Elliot J; Le Mercier, Philippe; Leplae, Raphael; Madupu, Ramana; Scheuermann, Richard H; Schobel, Seth; Seto, Donald; Shrivastava, Susmita; Sterk, Peter; Zeng, Qiandong; Klimke, William; Tatusova, Tatiana

    2010-10-01

    Improvements in DNA sequencing technologies portend a new era in virology and could possibly lead to a giant leap in our understanding of viral evolution and ecology. Yet, as viral genome sequences begin to fill the world's biological databases, it is critically important to recognize that the scientific promise of this era is dependent on consistent and comprehensive genome annotation. With this in mind, the NCBI Genome Annotation Workshop recently hosted a study group tasked with developing sequence, function, and metadata annotation standards for viral genomes. This report describes the issues involved in viral genome annotation and reviews policy recommendations presented at the NCBI Annotation Workshop.

  16. Towards Viral Genome Annotation Standards, Report from the 2010 NCBI Annotation Workshop

    PubMed Central

    Brister, James Rodney; Bao, Yiming; Kuiken, Carla; Lefkowitz, Elliot J.; Le Mercier, Philippe; Leplae, Raphael; Madupu, Ramana; Scheuermann, Richard H.; Schobel, Seth; Seto, Donald; Shrivastava, Susmita; Sterk, Peter; Zeng, Qiandong; Klimke, William; Tatusova, Tatiana

    2010-01-01

    Improvements in DNA sequencing technologies portend a new era in virology and could possibly lead to a giant leap in our understanding of viral evolution and ecology. Yet, as viral genome sequences begin to fill the world’s biological databases, it is critically important to recognize that the scientific promise of this era is dependent on consistent and comprehensive genome annotation. With this in mind, the NCBI Genome Annotation Workshop recently hosted a study group tasked with developing sequence, function, and metadata annotation standards for viral genomes. This report describes the issues involved in viral genome annotation and reviews policy recommendations presented at the NCBI Annotation Workshop. PMID:21994619

  17. Citrus sinensis Annotation Project (CAP): A Comprehensive Database for Sweet Orange Genome

    PubMed Central

    Chang, Ji-Wei; Hao, Bao-Hai; Xing, Feng; Li, Sen; Xu, Qiang; Deng, Xiu-Xin; Chen, Ling-Ling

    2014-01-01

    Citrus is one of the most important and widely grown fruit crops, with global production ranking first among all the fruit crops in the world. Sweet orange accounts for more than half of the Citrus production, both as fresh fruit and as processed juice. We have sequenced the draft genome of a double-haploid sweet orange (C. sinensis cv. Valencia), and constructed the Citrus sinensis annotation project (CAP) to store and visualize the sequenced genomic and transcriptome data. CAP provides GBrowse-based organization of sweet orange genomic data, which integrates ab initio gene prediction, EST, RNA-seq and RNA-paired end tag (RNA-PET) evidence-based gene annotation. Furthermore, we provide a user-friendly web interface to show the predicted protein-protein interactions (PPIs) and metabolic pathways in sweet orange. CAP provides comprehensive information beneficial to the researchers of sweet orange and other woody plants, which is freely available at http://citrus.hzau.edu.cn/. PMID:24489955

  18. A database of annotated tentative orthologs from crop abiotic stress transcripts.

    PubMed

    Balaji, Jayashree; Crouch, Jonathan H; Petite, Prasad V N S; Hoisington, David A

    2006-10-07

    A minimal requirement to initiate a comparative genomics study on plant responses to abiotic stresses is a dataset of orthologous sequences. The availability of a large amount of sequence information, including that derived from stress cDNA libraries, allows for the identification of stress-related genes and orthologs associated with the stress response. Orthologous sequences serve as tools to explore genes and their relationships across species. For this purpose, ESTs from stress cDNA libraries across 16 crop species, including 6 important cereal crops and 10 dicots, were systematically collated and subjected to bioinformatics analysis such as clustering, grouping of tentative orthologous sets, identification of protein motifs/patterns in the predicted protein sequence, and annotation with stress conditions, tissue/library source and putative function. All data are available to the scientific community at http://intranet.icrisat.org/gt1/tog/homepage.htm. We believe that the availability of annotated plant abiotic stress ortholog sets will be a valuable resource for researchers studying the biology of environmental stresses in plant systems, molecular evolution and genomics.

  19. Improving Automated Annotation of Benthic Survey Images Using Wide-band Fluorescence

    PubMed Central

    Beijbom, Oscar; Treibitz, Tali; Kline, David I.; Eyal, Gal; Khen, Adi; Neal, Benjamin; Loya, Yossi; Mitchell, B. Greg; Kriegman, David

    2016-01-01

    Large-scale imaging techniques are used increasingly for ecological surveys. However, manual analysis can be prohibitively expensive, creating a bottleneck between collected images and desired data products. This bottleneck is particularly severe for benthic surveys, where millions of images are obtained each year. Recent automated annotation methods may provide a solution, but reflectance images do not always contain sufficient information for adequate classification accuracy. In this work, the FluorIS, a low-cost modified consumer camera, was used to capture wide-band wide-field-of-view fluorescence images during a field deployment in Eilat, Israel. The fluorescence images were registered with standard reflectance images, and an automated annotation method based on convolutional neural networks was developed. Our results demonstrate a 22% reduction of classification error rate when using both image types compared to using only reflectance images. The improvements were particularly large for the coral reef genera Platygyra, Acropora and Millepora, where classification recall improved by 38%, 33%, and 41%, respectively. We conclude that convolutional neural networks can be used to combine reflectance and fluorescence imagery in order to significantly improve automated annotation accuracy and reduce the manual annotation bottleneck. PMID:27021133

  20. Eight Elements of High School Improvement: An Annotated Bibliography

    ERIC Educational Resources Information Center

    Jones, Wehmah

    2009-01-01

    This paper provides a table that contains a summary of resources available that support the National High School Center's "Eight Elements of High School Improvement: A Mapping Framework". This framework is based on eight common elements of systemic school reform and provides a structure for understanding, aligning, and prioritizing…

  1. Polymorphism identification and improved genome annotation of Brassica rapa through Deep RNA sequencing.

    PubMed

    Devisetty, Upendra Kumar; Covington, Michael F; Tat, An V; Lekkala, Saradadevi; Maloof, Julin N

    2014-08-12

    The mapping and functional analysis of quantitative traits in Brassica rapa can be greatly improved with the availability of physically positioned, gene-based genetic markers and accurate genome annotation. In this study, deep transcriptome RNA sequencing (RNA-Seq) of Brassica rapa was undertaken with two objectives: SNP detection and improved transcriptome annotation. We performed SNP detection on two varieties that are parents of a mapping population to aid in development of a marker system for this population and subsequent development of a high-resolution genetic map. An improved Brassica rapa transcriptome was constructed to detect novel transcripts and to improve the current genome annotation. This is useful for accurate estimation of mRNA abundance and detection of expression QTL (eQTLs) in mapping populations. Deep RNA-Seq of two Brassica rapa genotypes, R500 (var. trilocularis, Yellow Sarson) and IMB211 (a rapid cycling variety), using eight different tissues (root, internode, leaf, petiole, apical meristem, floral meristem, silique, and seedling) grown across three different environments (growth chamber, greenhouse and field) and under two different treatments (simulated sun and simulated shade) generated 2.3 billion high-quality Illumina reads. A total of 330,995 SNPs were identified in transcribed regions between the two genotypes, with an average frequency of one SNP in every 200 bases. The deep RNA-Seq reassembled Brassica rapa transcriptome identified 44,239 protein-coding genes. Compared with current gene models of B. rapa, we detected 3537 novel transcripts, 23,754 gene models had structural modifications, and 3655 annotated proteins changed. Gaps in the current genome assembly of B. rapa are highlighted by our identification of 780 unmapped transcripts. All the SNPs, annotations, and predicted transcripts can be viewed at http://phytonetworks.ucdavis.edu/.

  2. A Post-assembly genome-improvement toolkit (PAGIT) to obtain annotated genomes from contigs

    PubMed Central

    Swain, Martin T.; Tsai, Isheng J.; Assefa, Samual A.; Newbold, Chris; Berriman, Matthew; Otto, Thomas D.

    2013-01-01

    Genome projects now produce draft assemblies within weeks thanks to advanced high-throughput sequencing technologies. For milestone projects like E. coli or H. sapiens, teams of scientists were employed to manually curate and finish these genomes to a high standard. Nowadays, this is not feasible for most projects and the quality of genomes is generally of a much lower standard. This protocol describes software (PAGIT, post-assembly genome-improvement toolkit) to improve the quality of draft genomes. It offers flexible functionality to close gaps in scaffolds, correct base errors in the consensus sequence, and exploit reference genomes (if available) for improving scaffolding and generating annotations. The protocol is most accessible for bacterial and small eukaryotic genomes (up to 300 Mb), such as pathogenic bacteria, malaria and parasitic worms. Applying PAGIT to an E. coli assembly takes approximately 24 hours: it doubles the average contig size and annotates over 4300 gene models. PMID:22678431

  3. An annotated bibliography of selected guides for stream habitat improvement in the Pacific Northwest

    USGS Publications Warehouse

    Keim, R.F.; Price, A.B.; Hardin, T. S.; Skaugset, Arne E.; Bateman, D.S.; Gresswell, R.E.; Tesch, S. D.

    2004-01-01

    This annotated bibliography is a response to widespread interest in stream habitat improvement in the Pacific Northwest by land managers, governmental and nongovernmental organizations, and the lay public. Several guides to stream habitat improvement have been written in the past, but may not be easily accessible to people from diverse backgrounds. This annotated bibliography reviews 11 guides to stream habitat improvement so that readers can find literature appropriate to their needs. All reviews begin with summaries of the contents, stated audiences, and goals of each guide. Reviews also include subjective comments on the strengths and weaknesses of each guide. Finally, this bibliography includes recommendations of guides and combinations of guides judged most useful for a range of purposes. 

  4. GELBANK : A database of annotated two-dimensional gel electrophoresis patterns of biological systems with completed genomes.

    SciTech Connect

    Babnigg, G.; Giometti, C. S.; Biosciences Division

    2004-01-01

    GELBANK is a publicly available database of two-dimensional gel electrophoresis (2DE) gel patterns of proteomes from organisms with known genome information (available at and ftp://bioinformatics.anl.gov/gelbank/). Currently it includes 131 completed, mostly microbial proteomes available from the National Center for Biotechnology Information. A web interface allows the upload of 2D gel patterns and their annotation for registered users. The images are organized by species, tissue type, separation method, sample type and staining method. The database can be queried based on protein or 2DE-pattern attributes. A web interface allows registered users to assign molecular weight and pH gradient profiles to their own 2D gel patterns as well as to link protein identifications to a given spot on the pattern. The website presents all of the submitted 2D gel patterns where the end-user can dynamically display the images or parts of images along with molecular weight, pH profile information and linked protein identification. A collection of images can be selected for the creation of animations from which the user can select sub-regions of interest and unlimited 2D gel patterns for visualization. The website currently presents 233 identifications for 81 gel patterns for Homo sapiens, Methanococcus jannaschii, Pyrococcus furiosus, Shewanella oneidensis, Escherichia coli and Deinococcus radiodurans.

  5. Annotating nonspecific SAGE tags with microarray data.

    PubMed

    Ge, Xijin; Jung, Yong-Chul; Wu, Qingfa; Kibbe, Warren A; Wang, San Ming

    2006-01-01

    SAGE (serial analysis of gene expression) detects transcripts by extracting short tags from the transcripts. Because of the limited length, many SAGE tags are shared by transcripts from different genes. Relying on sequence information in the general gene expression database has limited power to solve this problem due to the highly heterogeneous nature of the deposited sequences. Considering that the complexity of gene expression at a single tissue level should be much simpler than that in the general expression database, we reasoned that by restricting gene expression to tissue level, the accuracy of gene annotation for the nonspecific SAGE tags should be significantly improved. To test the idea, we developed a tissue-specific SAGE annotation database based on microarray data (). This database contains microarray expression information represented as UniGene clusters for 73 normal human tissues and 18 cancer tissues and cell lines. The nonspecific SAGE tag is first matched to the database by the same tissue type used by both SAGE and microarray analysis; then the multiple UniGene clusters assigned to the nonspecific SAGE tag are searched in the database under the matched tissue type. The UniGene cluster presented solely or at higher expression levels in the database is annotated to represent the specific gene for the nonspecific SAGE tags. The accuracy of gene annotation by this database was largely confirmed by experimental data. Our study shows that microarray data provide a useful source for annotating the nonspecific SAGE tags.
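
    The matching rule described in this abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function name annotate_tag, the nested-dictionary layout, and the cluster IDs and expression values are all hypothetical, and ties are broken arbitrarily (the first candidate listed wins).

```python
from typing import Dict, List, Optional

# Hypothetical layout: tissue name -> {UniGene cluster ID -> expression level}.
# The IDs and values below are invented for illustration only.
TissueExpression = Dict[str, Dict[str, float]]


def annotate_tag(
    candidates: List[str], tissue: str, expression: TissueExpression
) -> Optional[str]:
    """Choose the candidate cluster expressed solely, or most highly, in a tissue.

    `candidates` are the clusters that share one nonspecific SAGE tag.
    Returns None when no candidate is expressed in the matched tissue.
    """
    levels = expression.get(tissue, {})
    # Keep only candidates with a positive expression level in this tissue.
    expressed = {c: levels[c] for c in candidates if levels.get(c, 0.0) > 0.0}
    if not expressed:
        return None
    # A single expressed candidate is returned as is; otherwise take the highest.
    return max(expressed, key=expressed.get)


EXAMPLE: TissueExpression = {
    "liver": {"Hs.1": 0.0, "Hs.2": 12.5, "Hs.3": 3.0},
}
```

    With the invented table above, a tag shared by Hs.1, Hs.2 and Hs.3 would be assigned to Hs.2 in liver, because that cluster has the highest expression level there.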

  6. The new modern era of yeast genomics: community sequencing and the resulting annotation of multiple Saccharomyces cerevisiae strains at the Saccharomyces Genome Database.

    PubMed

    Engel, Stacia R; Cherry, J Michael

    2013-01-01

    The first completed eukaryotic genome sequence was that of the yeast Saccharomyces cerevisiae, and the Saccharomyces Genome Database (SGD; http://www.yeastgenome.org/) is the original model organism database. SGD remains the authoritative community resource for the S. cerevisiae reference genome sequence and its annotation, and continues to provide comprehensive biological information correlated with S. cerevisiae genes and their products. A diverse set of yeast strains have been sequenced to explore commercial and laboratory applications, and a brief history of those strains is provided. The publication of these new genomes has motivated the creation of new tools, and SGD will annotate and provide comparative analyses of these sequences, correlating changes with variations in strain phenotypes and protein function. We are entering a new era at SGD, as we incorporate these new sequences and make them accessible to the scientific community, all in an effort to continue in our mission of educating researchers and facilitating discovery.

  7. Quality of Computationally Inferred Gene Ontology Annotations

    PubMed Central

    Škunca, Nives; Altenhoff, Adrian; Dessimoz, Christophe

    2012-01-01

    Gene Ontology (GO) has established itself as the undisputed standard for protein function annotation. Most annotations are inferred electronically, i.e. without individual curator supervision, but they are widely considered unreliable. At the same time, we crucially depend on those automated annotations, as most newly sequenced genomes are non-model organisms. Here, we introduce a methodology to systematically and quantitatively evaluate electronic annotations. By exploiting changes in successive releases of the UniProt Gene Ontology Annotation database, we assessed the quality of electronic annotations in terms of specificity, reliability, and coverage. Overall, we not only found that electronic annotations have significantly improved in recent years, but also that their reliability now rivals that of annotations inferred by curators when they use evidence other than experiments from primary literature. This work provides the means to identify the subset of electronic annotations that can be relied upon—an important outcome given that >98% of all annotations are inferred without direct curation. PMID:22693439

  8. EnzDP: improved enzyme annotation for metabolic network reconstruction based on domain composition profiles.

    PubMed

    Nguyen, Nam-Ninh; Srihari, Sriganesh; Leong, Hon Wai; Chong, Ket-Fah

    2015-10-01

    Determining the entire complement of enzymes and their enzymatic functions is a fundamental step for reconstructing the metabolic network of cells. High-quality enzyme annotation helps in enhancing metabolic networks reconstructed from the genome, especially by reducing gaps and increasing the enzyme coverage. Currently, structure-based and network-based approaches can only cover a limited number of enzyme families, and the accuracy of homology-based approaches can be further improved. The bottom-up homology-based approach improves the coverage by rebuilding Hidden Markov Model (HMM) profiles for all known enzymes. However, its clustering procedure relies heavily on the BLAST similarity score, ignoring protein domains/patterns, and is sensitive to changes in cut-off thresholds. Here, we use functional domain architecture to score the association between domain families and enzyme families (Domain-Enzyme Association Scoring, DEAS). The DEAS score is used to calculate the similarity between proteins, which is then used in the clustering procedure instead of a sequence similarity score. We improve the enzyme annotation protocol using a stringent classification procedure, and by choosing optimal threshold settings and checking for active sites. Our analysis shows that our stringent protocol EnzDP can cover up to 90% of enzyme families available in Swiss-Prot. It achieves a high accuracy of 94.5% based on five-fold cross-validation. EnzDP outperforms existing methods across several testing scenarios. Thus, EnzDP serves as a reliable automated tool for enzyme annotation and metabolic network reconstruction. Available at: www.comp.nus.edu.sg/~nguyennn/EnzDP.

  9. KEGG orthology-based annotation of the predicted proteome of Acropora digitifera: ZoophyteBase - an open access and searchable database of a coral genome

    PubMed Central

    2013-01-01

    Background: Contemporary coral reef research has firmly established that a genomic approach is urgently needed to better understand the effects of anthropogenic environmental stress and global climate change on coral holobiont interactions. Here we present KEGG orthology-based annotation of the complete genome sequence of the scleractinian coral Acropora digitifera and provide the first comprehensive view of the genome of a reef-building coral by applying advanced bioinformatics. Description: Sequences from the KEGG database of protein function were used to construct hidden Markov models. These models were used to search the predicted proteome of A. digitifera to establish complete genomic annotation. The annotated dataset is published in ZoophyteBase, an open access format with different options for searching the data. A particularly useful feature is the ability to use a Google-like search engine that links query words to protein attributes. We present features of the annotation that underpin the molecular structure of key processes of coral physiology that include (1) regulatory proteins of symbiosis, (2) planula and early developmental proteins, (3) neural messengers, receptors and sensory proteins, (4) calcification and Ca2+-signalling proteins, (5) plant-derived proteins, (6) proteins of nitrogen metabolism, (7) DNA repair proteins, (8) stress response proteins, (9) antioxidant and redox-protective proteins, (10) proteins of cellular apoptosis, (11) microbial symbioses and pathogenicity proteins, (12) proteins of viral pathogenicity, (13) toxins and venom, (14) proteins of the chemical defensome and (15) coral epigenetics. Conclusions: We advocate that providing annotation in an open-access searchable database available to the public domain will give an unprecedented foundation to interrogate the fundamental molecular structure and interactions of coral symbiosis and allow critical questions to be addressed at the genomic level based on combined aspects of

  10. A multi-label image annotation scheme based on improved SVM multiple kernel learning

    NASA Astrophysics Data System (ADS)

    Jin, Cong; Jin, Shu-Wei

    2017-02-01

    Multi-label image annotation (MIA) has been widely studied in recent years, and many MIA schemes have been proposed. However, most existing schemes are not satisfactory. In this paper, an improved multiple kernel learning (IMKL) method for support vector machines (SVM) is proposed to improve the classification accuracy of SVM. A novel MIA scheme based on IMKL is then presented, which uses the discriminant loss to control the number of top semantic labels; a feature selection approach is also used to improve the performance of MIA. The experimental results show that the proposed MIA scheme achieves higher performance than other existing MIA schemes, and that its performance is satisfactory for large image datasets.

  11. NDER: A novel web application using annotated whole slide images for rapid improvements in human pattern recognition

    PubMed Central

    Reder, Nicholas P.; Glasser, Daniel; Dintzis, Suzanne M.; Rendi, Mara H.; Garcia, Rochelle L.; Henriksen, Jonathan C.; Kilgore, Mark R.

    2016-01-01

    Context: Whole-slide images (WSIs) present a rich source of information for education, training, and quality assurance. However, they are often used in a fashion similar to glass slides rather than in novel ways that leverage the advantages of WSI. We have created a pipeline to transform annotated WSI into a pattern recognition training and quality assurance web application called novel diagnostic electronic resource (NDER). Aims: Create an efficient workflow for extracting annotated WSI for use by NDER, an attractive web application that provides high-throughput training. Materials and Methods: WSI were annotated by a resident and classified into five categories. Two methods of extracting images and creating image databases were compared. Extraction Method 1: Manual extraction of still images and validation of each image by four breast pathologists. Extraction Method 2: Validation of annotated regions on the WSI by a single experienced breast pathologist and automated extraction of still images tagged by diagnosis. The extracted still images were used by NDER. NDER briefly displays an image, requires users to classify the image after time has expired, then gives users immediate feedback. Results: The NDER workflow is efficient: annotation of a WSI requires 5 min and validation by an expert pathologist requires an additional 1 to 2 min. The pipeline is highly automated, with only annotation and validation requiring human input. NDER effectively displays hundreds of high-quality, high-resolution images and provides immediate feedback to users during a 30 min session. Conclusions: NDER efficiently uses annotated WSI to rapidly increase pattern recognition and evaluate for diagnostic proficiency. PMID:27563490

  12. SigmoID: a user-friendly tool for improving bacterial genome annotation through analysis of transcription control signals

    PubMed Central

    Damienikan, Aliaksandr U.

    2016-01-01

    The majority of bacterial genome annotations are currently automated and based on a ‘gene by gene’ approach. Regulatory signals and operon structures are rarely taken into account which often results in incomplete and even incorrect gene function assignments. Here we present SigmoID, a cross-platform (OS X, Linux and Windows) open-source application aiming at simplifying the identification of transcription regulatory sites (promoters, transcription factor binding sites and terminators) in bacterial genomes and providing assistance in correcting annotations in accordance with regulatory information. SigmoID combines a user-friendly graphical interface to well known command line tools with a genome browser for visualising regulatory elements in genomic context. Integrated access to online databases with regulatory information (RegPrecise and RegulonDB) and web-based search engines speeds up genome analysis and simplifies correction of genome annotation. We demonstrate some features of SigmoID by constructing a series of regulatory protein binding site profiles for two groups of bacteria: Soft Rot Enterobacteriaceae (Pectobacterium and Dickeya spp.) and Pseudomonas spp. Furthermore, we inferred over 900 transcription factor binding sites and alternative sigma factor promoters in the annotated genome of Pectobacterium atrosepticum. These regulatory signals control putative transcription units covering about 40% of the P. atrosepticum chromosome. Reviewing the annotation in cases where it didn’t fit with regulatory information allowed us to correct product and gene names for over 300 loci. PMID:27257541

  13. SigmoID: a user-friendly tool for improving bacterial genome annotation through analysis of transcription control signals.

    PubMed

    Nikolaichik, Yevgeny; Damienikan, Aliaksandr U

    2016-01-01

    The majority of bacterial genome annotations are currently automated and based on a 'gene by gene' approach. Regulatory signals and operon structures are rarely taken into account which often results in incomplete and even incorrect gene function assignments. Here we present SigmoID, a cross-platform (OS X, Linux and Windows) open-source application aiming at simplifying the identification of transcription regulatory sites (promoters, transcription factor binding sites and terminators) in bacterial genomes and providing assistance in correcting annotations in accordance with regulatory information. SigmoID combines a user-friendly graphical interface to well known command line tools with a genome browser for visualising regulatory elements in genomic context. Integrated access to online databases with regulatory information (RegPrecise and RegulonDB) and web-based search engines speeds up genome analysis and simplifies correction of genome annotation. We demonstrate some features of SigmoID by constructing a series of regulatory protein binding site profiles for two groups of bacteria: Soft Rot Enterobacteriaceae (Pectobacterium and Dickeya spp.) and Pseudomonas spp. Furthermore, we inferred over 900 transcription factor binding sites and alternative sigma factor promoters in the annotated genome of Pectobacterium atrosepticum. These regulatory signals control putative transcription units covering about 40% of the P. atrosepticum chromosome. Reviewing the annotation in cases where it didn't fit with regulatory information allowed us to correct product and gene names for over 300 loci.

  14. Annotated Bibliography: Understanding Ambulatory Care Practices in the Context of Patient Safety and Quality Improvement.

    PubMed

    Montano, Maria F; Mehdi, Harshal; Nash, David B

    2016-11-01

    The ambulatory care setting is an increasingly important component of the patient safety conversation. Inpatient safety is the primary focus of the vast majority of safety research and interventions, but the ambulatory setting is actually where most medical care is administered. Recent attention has shifted toward examining ambulatory care in order to implement better health care quality and safety practices. This annotated bibliography was created to analyze and augment the current literature on ambulatory care practices with regard to patient safety and quality improvement. By providing a thorough examination of current practices, potential improvement strategies in ambulatory care health care settings can be suggested. A better understanding of the myriad factors that influence delivery of patient care will catalyze future health care system development and implementation in the ambulatory setting.

  15. A novel method for crosstalk analysis of biological networks: improving accuracy of pathway annotation

    PubMed Central

    Ogris, Christoph; Guala, Dimitri; Helleday, Thomas; Sonnhammer, Erik L. L.

    2017-01-01

    Analyzing gene expression patterns is a mainstay for gaining functional insights into biological systems. A plethora of tools exist to identify significant enrichment of pathways for a set of differentially expressed genes. Most tools analyze gene overlap between gene sets and are therefore severely hampered by the current state of pathway annotation, yet at the same time they run a high risk of false assignments. A way to improve both true positive and false positive rates (FPRs) is to use a functional association network and instead look for enrichment of network connections between gene sets. We present a new network crosstalk analysis method, BinoX, that determines the statistical significance of network link enrichment or depletion between gene sets, using the binomial distribution. This is a much more appropriate statistical model than previous methods have employed, and as a result BinoX yields substantially better true positive and FPRs than was possible before. A number of benchmarks were performed to assess the accuracy of BinoX and competing methods. We demonstrate examples of how BinoX finds many biologically meaningful pathway annotations for gene sets from cancer and other diseases, which are not found by other methods. BinoX is available at http://sonnhammer.org/BinoX. PMID:27664219
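
    As a rough illustration of testing link enrichment or depletion with a binomial model, the sketch below computes a one-sided binomial tail probability. It is a generic binomial test, not the BinoX implementation: the abstract does not say how the number of possible links or the per-link probability are estimated, so the parameters possible and p_link, and the function names, are assumptions made for this example.

```python
from math import comb
from typing import Tuple


def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1.0 - p) ** (n - k)


def crosstalk_pvalue(observed: int, possible: int, p_link: float) -> Tuple[str, float]:
    """One-sided binomial p-value for links between two gene sets.

    observed: links actually seen between the two sets.
    possible: number of link 'trials' between the sets (assumed to be given).
    p_link:   assumed probability of a link under the null model.
    """
    expected = possible * p_link
    if observed >= expected:
        # Enrichment: probability of seeing at least `observed` links.
        tail = sum(binom_pmf(i, possible, p_link) for i in range(observed, possible + 1))
        return "enrichment", tail
    # Depletion: probability of seeing at most `observed` links.
    tail = sum(binom_pmf(i, possible, p_link) for i in range(0, observed + 1))
    return "depletion", tail
```

    For example, crosstalk_pvalue(3, 3, 0.5) reports enrichment with a tail probability of 0.125, and crosstalk_pvalue(0, 3, 0.5) reports depletion with the same probability. A real analysis would also need multiple-testing correction across gene-set pairs.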

  16. The new modern era of yeast genomics: community sequencing and the resulting annotation of multiple Saccharomyces cerevisiae strains at the Saccharomyces Genome Database

    PubMed Central

    Engel, Stacia R.; Cherry, J. Michael

    2013-01-01

    The first completed eukaryotic genome sequence was that of the yeast Saccharomyces cerevisiae, and the Saccharomyces Genome Database (SGD; http://www.yeastgenome.org/) is the original model organism database. SGD remains the authoritative community resource for the S. cerevisiae reference genome sequence and its annotation, and continues to provide comprehensive biological information correlated with S. cerevisiae genes and their products. A diverse set of yeast strains have been sequenced to explore commercial and laboratory applications, and a brief history of those strains is provided. The publication of these new genomes has motivated the creation of new tools, and SGD will annotate and provide comparative analyses of these sequences, correlating changes with variations in strain phenotypes and protein function. We are entering a new era at SGD, as we incorporate these new sequences and make them accessible to the scientific community, all in an effort to continue in our mission of educating researchers and facilitating discovery. Database URL: http://www.yeastgenome.org/ PMID:23487186

  17. dbNSFP v3.0: A One-Stop Database of Functional Predictions and Annotations for Human Nonsynonymous and Splice-Site SNVs.

    PubMed

    Liu, Xiaoming; Wu, Chunlei; Li, Chang; Boerwinkle, Eric

    2016-03-01

    The purpose of dbNSFP is to provide a one-stop resource for functional predictions and annotations for human nonsynonymous single-nucleotide variants (nsSNVs) and splice-site variants (ssSNVs), and to facilitate the steps of filtering and prioritizing SNVs from a large list of SNVs discovered in an exome-sequencing study. A list of all potential nsSNVs and ssSNVs based on the human reference sequence was created, and functional predictions and annotations were curated and compiled for each SNV. Here, we report a recent major update of the database to version 3.0. The SNV list has been rebuilt based on GENCODE 22 and currently the database includes 82,832,027 nsSNVs and ssSNVs. An attached database, dbscSNV, which compiles all potential human SNVs within splicing consensus regions and their deleteriousness predictions, adds another 15,030,459 potentially functional SNVs. Eleven prediction scores (MetaSVM, MetaLR, CADD, VEST3, PROVEAN, 4× fitCons, fathmm-MKL, and DANN) and allele frequencies from the UK10K cohorts and the Exome Aggregation Consortium (ExAC), among others, have been added. The original seven prediction scores in v2.0 (SIFT, 2× Polyphen2, LRT, MutationTaster, MutationAssessor, and FATHMM) as well as many SNV and gene functional annotations have been updated. dbNSFP v3.0 is freely available at http://sites.google.com/site/jpopgen/dbNSFP.
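
    The filtering and prioritizing step mentioned in this abstract might look something like the sketch below, which keeps variants whose prediction score meets a threshold and ranks them from highest to lowest. Everything specific here is an assumption for illustration: the column names, the use of "." for a missing score, the threshold, and the sample rows are invented and have not been checked against the actual dbNSFP file layout.

```python
import csv
import io
from typing import Dict, List


def prioritize(rows: List[Dict[str, str]], score_column: str, threshold: float) -> List[Dict[str, str]]:
    """Keep rows whose score is at least `threshold`, highest score first."""
    kept = []
    for row in rows:
        raw = row.get(score_column, ".")
        try:
            score = float(raw)
        except (TypeError, ValueError):
            # Missing or non-numeric score: skip the variant.
            continue
        if score >= threshold:
            kept.append((score, row))
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [row for _, row in kept]


# Invented tab-delimited sample; a real annotation table would be read from a file.
SAMPLE = (
    "chr\tpos\tref\talt\tscore\n"
    "1\t1000\tA\tG\t25.1\n"
    "1\t2000\tC\tT\t.\n"
    "2\t3000\tG\tA\t31.0\n"
    "2\t4000\tT\tC\t4.2\n"
)

variants = list(csv.DictReader(io.StringIO(SAMPLE), delimiter="\t"))
top_variants = prioritize(variants, "score", 20.0)
```

    With the invented sample, the variants at positions 3000 and 1000 are kept, in that order, while the low-scoring and unscored rows are dropped.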

  18. A semantic analysis of the annotations of the human genome

    PubMed Central

    Khatri, Purvesh; Done, Bogdan; Rao, Archana; Done, Arina

    2008-01-01

    The correct interpretation of any biological experiment depends in an essential way on the accuracy and consistency of the existing annotation databases. Such databases are ubiquitous and used by all life scientists in most experiments. However, it is well known that such databases are incomplete and many annotations may also be incorrect. In this paper we describe a technique that can be used to analyze the semantic content of such annotation databases. Our approach is able to extract implicit semantic relationships between genes and functions. This ability allows us to discover novel functions for known genes. This approach is able to identify missing and inaccurate annotations in existing annotation databases, and thus help improve their accuracy. We used our technique to analyze the current annotations of the human genome. From this body of annotations, we were able to predict 212 additional gene–function assignments. A subsequent literature search found that 138 of these gene–function assignments are supported by existing peer-reviewed papers. An additional 23 assignments have been confirmed in the meantime by the addition of the respective annotations in later releases of the Gene Ontology database. Overall, the 161 confirmed assignments represent 75.95% of the proposed gene–function assignments. Only one of our predictions (0.4%) was contradicted by the existing literature. We could not find any relevant articles for 50 of our predictions (23.58%). The method is independent of the organism and can be used to analyze and improve the quality of the data of any public or private annotation database. Availability: http://vortex.cs.wayne.edu/papers/semantic_analysis_bioinfo.pdf Contact: sod@cs.wayne.edu PMID:15955782
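
    The breakdown of the 212 predictions reported in this abstract can be re-derived with a few lines of arithmetic. The counts are taken from the abstract; the variable and function names are only illustrative. Computed this way, the shares round to 75.94%, 0.47% and 23.58%, so the last digits of the first two differ slightly from the quoted figures, presumably because of rounding in the original.

```python
# Counts as stated in the abstract.
TOTAL_PREDICTIONS = 212
SUPPORTED_BY_LITERATURE = 138
CONFIRMED_BY_LATER_GO_RELEASES = 23
CONTRADICTED = 1
NO_RELEVANT_ARTICLES = 50

CONFIRMED = SUPPORTED_BY_LITERATURE + CONFIRMED_BY_LATER_GO_RELEASES


def share(count: int, total: int = TOTAL_PREDICTIONS) -> float:
    """Percentage of `total` represented by `count`, rounded to two decimals."""
    return round(100.0 * count / total, 2)


# The three outcome groups account for every prediction.
assert CONFIRMED + CONTRADICTED + NO_RELEVANT_ARTICLES == TOTAL_PREDICTIONS

confirmed_pct = share(CONFIRMED)                # 75.94
contradicted_pct = share(CONTRADICTED)          # 0.47
no_evidence_pct = share(NO_RELEVANT_ARTICLES)   # 23.58
```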

  19. Re-annotation, improved large-scale assembly and establishment of a catalogue of noncoding loci for the genome of the model brown alga Ectocarpus.

    PubMed

    Cormier, Alexandre; Avia, Komlan; Sterck, Lieven; Derrien, Thomas; Wucher, Valentin; Andres, Gwendoline; Monsoor, Misharl; Godfroy, Olivier; Lipinska, Agnieszka; Perrineau, Marie-Mathilde; Van De Peer, Yves; Hitte, Christophe; Corre, Erwan; Coelho, Susana M; Cock, J Mark

    2017-04-01

    The genome of the filamentous brown alga Ectocarpus was the first to be completely sequenced from within the brown algal group and has served as a key reference genome both for this lineage and for the stramenopiles. We present a complete structural and functional reannotation of the Ectocarpus genome. The large-scale assembly of the Ectocarpus genome was significantly improved and genome-wide gene re-annotation using extensive RNA-seq data improved the structure of 11 108 existing protein-coding genes and added 2030 new loci. A genome-wide analysis of splicing isoforms identified an average of 1.6 transcripts per locus. A large number of previously undescribed noncoding genes were identified and annotated, including 717 loci that produce long noncoding RNAs. Conservation of lncRNAs between Ectocarpus and another brown alga, the kelp Saccharina japonica, suggests that at least a proportion of these loci serve a function. Finally, a large collection of single nucleotide polymorphism-based markers was developed for genetic analyses. These resources are available through an updated and improved genome database. This study significantly improves the utility of the Ectocarpus genome as a high-quality reference for the study of many important aspects of brown algal biology and as a reference for genomic analyses across the stramenopiles.

  20. Improving genome annotation of enterotoxigenic Escherichia coli TW10598 by a label-free quantitative MS/MS approach.

    PubMed

    Pettersen, Veronika Kuchařová; Steinsland, Hans; Wiker, Harald G

    2015-11-01

    The most commonly used genome annotation processes are to a great extent based on computational methods. However, those can only predict genes that have been described earlier or that have sequence signatures indicative of a gene function. Here, we report a synonymous proteogenomic approach for experimentally improving microbial genome annotation based on label-free quantitative MS/MS. The approach is exemplified by analysis of cell extracts from in vitro cultured enterotoxigenic Escherichia coli (ETEC) strain TW10598, as part of an effort to create a new reference ETEC genome sequence. The proteomic analysis yielded identification of 2060 proteins, out of which 312 proteins were originally described as hypothetical. For 84% of the identified proteins we have provided a description of their relative quantitative levels, among others, for 20 abundantly expressed ETEC virulence factors. Proteogenomic mapping supported the existence of four protein-coding genes that had not been annotated, and led to correction of the translation start positions of another nine. The addition of the proteomic analysis to the TW10598 genome re-annotation project improved the quality of the annotation, and provided experimental evidence for a significant portion of the ETEC expressed proteome. Data are available via ProteomeXchange with identifier PXD002473 (http://proteomecentral.proteomexchange.org/dataset/PXD002473).

  1. Perceived Usefulness of a Strategy-Based Peer Annotation System for Improving Academic Reading Comprehension

    ERIC Educational Resources Information Center

    Chen, I-Jung; Chen, Wen-Chun

    2016-01-01

    This study examines the enhancing effect of peer annotation on the academic English reading of nonnative-English-speaking graduate students. To facilitate peer collaboration, the present study included the development of a strategy-based online reading system. Through peer annotation, the students not only achieved enhanced reading comprehension but…

  2. Correction of the Caulobacter crescentus NA1000 genome annotation.

    PubMed

    Ely, Bert; Scott, LaTia Etheredge

    2014-01-01

    Bacterial genome annotations are accumulating rapidly in the GenBank database and the use of automated annotation technologies to create these annotations has become the norm. However, these automated methods commonly result in a small, but significant percentage of genome annotation errors. To improve accuracy and reliability, we analyzed the Caulobacter crescentus NA1000 genome utilizing the computer programs Artemis and MICheck to manually examine the third codon position GC content, alignment to a third codon position GC frame plot peak, and matches in the GenBank database. We identified 11 new genes, modified the start site of 113 genes, and changed the reading frame of 38 genes that had been incorrectly annotated. Furthermore, our manual method of identifying protein-coding genes allowed us to remove 112 non-coding regions that had been designated as coding regions. The improved NA1000 genome annotation resulted in a reduction in the use of rare codons since noncoding regions with atypical codon usage were removed from the annotation and 49 new coding regions were added to the annotation. Thus, a more accurate codon usage table was generated as well. These results demonstrate that a comparison of the location of peaks in third codon position GC content to the location of protein coding regions could be used to verify the annotation of any genome that has a GC content that is greater than 60%.
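
    The third codon position GC content mentioned in this abstract can be illustrated with a short calculation. The Python sketch below is only an illustration and is not taken from the paper, Artemis, or MICheck: the function names gc3_content and gc3_by_frame are hypothetical, and the code scores a whole sequence rather than the sliding windows a frame plot would use.

```python
def gc3_content(cds: str) -> float:
    """Fraction of G or C bases at the third position of each complete codon."""
    seq = cds.upper()
    usable = len(seq) - len(seq) % 3      # ignore a trailing partial codon
    third_bases = seq[2:usable:3]         # bases at indices 2, 5, 8, ...
    if not third_bases:
        raise ValueError("sequence must contain at least one complete codon")
    return sum(base in "GC" for base in third_bases) / len(third_bases)


def gc3_by_frame(dna: str) -> list[float]:
    """GC3 for each of the three forward reading frames (offsets 0, 1, 2)."""
    return [gc3_content(dna[offset:]) for offset in range(3)]


if __name__ == "__main__":
    # Toy sequence: frame 0 reads ATG GCC AAG, so every third base is G or C.
    print(gc3_by_frame("ATGGCCAAGA"))
```

    In a GC-rich genome, the frame whose third-position GC content stands out over a region is the one most likely to be coding, which is the signal the abstract describes comparing against annotated gene locations.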

  3. Development of the Community Health Improvement Navigator Database of Interventions

    PubMed Central

    Roy, Brita; Stanojevich, Joel; Stange, Paul; Jiwani, Nafisa; King, Raymond; Koo, Denise

    2016-01-01

    Summary With the passage of the Patient Protection and Affordable Care Act, the requirements for hospitals to achieve tax-exempt status include performing a triennial community health needs assessment and developing a plan to address identified needs. To address community health needs, multisector collaborative efforts to improve both health care and non–health care determinants of health outcomes have been the most effective and sustainable. In 2015, CDC released the Community Health Improvement Navigator to facilitate the development of these efforts. This report describes the development of the database of interventions included in the Community Health Improvement Navigator. The database of interventions allows the user to easily search for multisector, collaborative, evidence-based interventions to address the underlying causes of the greatest morbidity and mortality in the United States: tobacco use and exposure, physical inactivity, unhealthy diet, high cholesterol, high blood pressure, diabetes, and obesity. PMID:26917110

  4. Improved genome annotation through untargeted detection of pathway-specific metabolites

    PubMed Central

    2011-01-01

    Background Mass spectrometry-based metabolomics analyses have the potential to complement sequence-based methods of genome annotation, but only if raw mass spectral data can be linked to specific metabolic pathways. In untargeted metabolomics, the measured mass of a detected compound is used to define the location of the compound in chemical space, but uncertainties in mass measurements lead to "degeneracies" in chemical space since multiple chemical formulae correspond to the same measured mass. We compare two methods to eliminate these degeneracies. One method relies on natural isotopic abundances, and the other relies on the use of stable-isotope labeling (SIL) to directly determine C and N atom counts. Both depend on combinatorial explorations of the "chemical space" comprising all possible chemical formulae composed of biologically relevant chemical elements. Results Of 1532 metabolic pathways curated in the MetaCyc database, 412 contain a metabolite having a chemical formula unique to that metabolic pathway. Thus, chemical formulae alone can suffice to infer the presence of some metabolic pathways. Of 248,928 unique chemical formulae selected from the PubChem database, more than 95% had at least one degeneracy on the basis of accurate mass information alone. Consideration of natural isotopic abundance reduced degeneracy to 64%, but mainly for formulae less than 500 Da in molecular weight, and only if the error in the relative isotopic peak intensity was less than 10%. Knowledge of exact C and N atom counts as determined by SIL enabled reduced degeneracy, allowing for determination of unique chemical formula for 55% of the PubChem formulae. Conclusions To facilitate the assignment of chemical formulae to unknown mass-spectral features, profiling can be performed on cultures uniformly labeled with stable isotopes of nitrogen (15N) or carbon (13C). This makes it possible to accurately count the number of carbon and nitrogen atoms in each molecule, providing a
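
    The notion of "degeneracy" in this abstract (several chemical formulae matching one measured mass) can be made concrete with a small enumeration. The Python sketch below is an illustration under stated assumptions, not the authors' method: it considers only C, H, N and O, uses hypothetical atom-count limits, applies no chemical validity rules, and the function name candidate_formulae and the known parameter are invented for this example. The known argument mimics the effect of fixing C and N counts, as stable-isotope labeling would.

```python
from itertools import product

# Monoisotopic masses (Da) of the most abundant isotopes.
MONOISOTOPIC = {"C": 12.0, "H": 1.00782503, "N": 14.0030740, "O": 15.9949146}


def formula_mass(counts: dict) -> float:
    """Monoisotopic mass of a formula given as {element: atom count}."""
    return sum(MONOISOTOPIC[el] * n for el, n in counts.items())


def candidate_formulae(measured_mass, ppm=5.0, known=None, limits=None):
    """List CHNO formulae whose mass lies within +/- ppm of measured_mass.

    known  -- optional {element: count} constraints (e.g. C and N counts)
    limits -- hypothetical upper bounds on atom counts for the search
    """
    limits = limits or {"C": 20, "H": 40, "N": 6, "O": 10}
    known = known or {}
    tolerance = measured_mass * ppm * 1e-6
    hits = []
    for c, n, o in product(range(limits["C"] + 1),
                           range(limits["N"] + 1),
                           range(limits["O"] + 1)):
        heavy = formula_mass({"C": c, "N": n, "O": o})
        # Only the nearest integer hydrogen count can fall inside the tolerance.
        h = round((measured_mass - heavy) / MONOISOTOPIC["H"])
        if not 0 <= h <= limits["H"]:
            continue
        counts = {"C": c, "H": h, "N": n, "O": o}
        if any(counts[el] != v for el, v in known.items()):
            continue
        if abs(formula_mass(counts) - measured_mass) <= tolerance:
            hits.append(counts)
    return hits


if __name__ == "__main__":
    mass = 180.06339  # close to the monoisotopic mass of C6H12O6
    print(len(candidate_formulae(mass, ppm=20)))
    print(candidate_formulae(mass, ppm=20, known={"C": 6, "N": 0}))
```

    Supplying the carbon and nitrogen counts shrinks the candidate list, which is the effect the abstract attributes to labeling with 13C or 15N.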

  5. Bacterial genome annotation.

    PubMed

    Beckloff, Nicholas; Starkenburg, Shawn; Freitas, Tracey; Chain, Patrick

    2012-01-01

    Annotation of prokaryotic sequences can be separated into structural and functional annotation. Structural annotation is dependent on algorithmic interrogation of experimental evidence to discover the physical characteristics of a gene. This is done in an effort to construct accurate gene models, so understanding function or evolution of genes among organisms is not impeded. Functional annotation is dependent on sequence similarity to other known genes or proteins in an effort to assess the function of the gene. Combining structural and functional annotation across genomes in a comparative manner promotes higher levels of accurate annotation as well as an advanced understanding of genome evolution. As the availability of bacterial sequences increases and annotation methods improve, the value of comparative annotation will increase.

  6. Towards computational improvement of DNA database indexing and short DNA query searching.

    PubMed

    Stojanov, Done; Koceski, Sašo; Mileva, Aleksandra; Koceska, Nataša; Bande, Cveta Martinovska

    2014-09-03

    In order to facilitate and speed up the search of massive DNA databases, the database is indexed at the beginning, employing a mapping function. By searching through the indexed data structure, exact query hits can be identified. If the database is searched against an annotated DNA query, such as a known promoter consensus sequence, then the starting locations and the number of potential genes can be determined. This is particularly relevant if unannotated DNA sequences have to be functionally annotated. However, indexing a massive DNA database and searching an indexed data structure with millions of entries is a time-demanding process. In this paper, we propose a fast DNA database indexing and searching approach, identifying all query hits in the database, without having to examine all entries in the indexed data structure, limiting the maximum length of a query that can be searched against the database. By applying the proposed indexing equation, the whole human genome could be indexed in 10 hours on a personal computer, under the assumption that there is enough RAM to store the indexed data structure. Analysing the methodology proposed by Reneker, we observed that hits at starting positions [Formula: see text] are not reported, if the database is searched against a query shorter than [Formula: see text] nucleotides, such that [Formula: see text] is the length of the DNA database words being mapped and [Formula: see text] is the length of the query. A solution of this drawback is also presented.
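
    The general idea of indexing a DNA database with a mapping function and then looking up exact query hits can be sketched as follows. This Python example is a generic k-mer index, not the indexing equation proposed in the paper or Reneker's method; the names encode, build_index and search, and the base-4 mapping, are assumptions made for illustration.

```python
from collections import defaultdict

# Base-4 code for each nucleotide; words containing other symbols are skipped.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode(word: str) -> int:
    """Map a DNA word to an integer by reading it as a base-4 number."""
    value = 0
    for base in word:
        value = value * 4 + CODE[base]
    return value


def build_index(database: str, k: int) -> dict:
    """Index every k-length word of the database by its encoded value."""
    index = defaultdict(list)
    for start in range(len(database) - k + 1):
        word = database[start:start + k]
        if all(base in CODE for base in word):
            index[encode(word)].append(start)
    return index


def search(database: str, index: dict, k: int, query: str) -> list:
    """Return all start positions of exact matches of query (len(query) >= k)."""
    if len(query) < k:
        raise ValueError("query must be at least k nucleotides long")
    if any(base not in CODE for base in query):
        raise ValueError("query must contain only A, C, G, T")
    # Look up the first k-mer, then verify the rest of the query in place.
    candidates = index.get(encode(query[:k]), [])
    return [pos for pos in candidates if database.startswith(query, pos)]


if __name__ == "__main__":
    db = "ACGTACGTTACG"
    idx = build_index(db, k=3)
    print(search(db, idx, 3, "ACG"))
    print(search(db, idx, 3, "ACGTT"))
```

    Only the positions stored under the query's first word are examined, so the whole index is never scanned. This sketch simply rejects queries shorter than the word length; the abstract notes that handling such short queries correctly is one of the issues the paper addresses.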

  7. Solving the Problem: Genome Annotation Standards before the Data Deluge

    PubMed Central

    Klimke, William; O'Donovan, Claire; White, Owen; Brister, J. Rodney; Clark, Karen; Fedorov, Boris; Mizrachi, Ilene; Pruitt, Kim D.; Tatusova, Tatiana

    2011-01-01

    The promise of genome sequencing was that the vast undiscovered country would be mapped out by comparison of the multitude of sequences available and would aid researchers in deciphering the role of each gene in every organism. Researchers recognize that there is a need for high quality data. However, different annotation procedures, numerous databases, and a diminishing percentage of experimentally determined gene functions have resulted in a spectrum of annotation quality. NCBI in collaboration with sequencing centers, archival databases, and researchers, has developed the first international annotation standards, a fundamental step in ensuring that high quality complete prokaryotic genomes are available as gold standard references. Highlights include the development of annotation assessment tools, community acceptance of protein naming standards, comparison of annotation resources to provide consistent annotation, and improved tracking of the evidence used to generate a particular annotation. The development of a set of minimal standards, including the requirement for annotated complete prokaryotic genomes to contain a full set of ribosomal RNAs, transfer RNAs, and proteins encoding core conserved functions, is an historic milestone. The use of these standards in existing genomes and future submissions will increase the quality of databases, enabling researchers to make accurate biological discoveries. PMID:22180819

  8. Expression profiling of hypothetical genes in Desulfovibrio vulgaris leads to improved functional annotation

    PubMed Central

    Elias, Dwayne A.; Mukhopadhyay, Aindrila; Joachimiak, Marcin P.; Drury, Elliott C.; Redding, Alyssa M.; Yen, Huei-Che B.; Fields, Matthew W.; Hazen, Terry C.; Arkin, Adam P.; Keasling, Jay D.; Wall, Judy D.

    2009-01-01

    Hypothetical (HyP) and conserved HyP genes account for >30% of sequenced bacterial genomes. For the sulfate-reducing bacterium Desulfovibrio vulgaris Hildenborough, 347 of the 3634 genes were annotated as conserved HyP (9.5%) along with 887 HyP genes (24.4%). Given the large fraction of the genome, it is plausible that some of these genes serve critical cellular roles. The study goals were to determine which genes were expressed and provide a more functionally based annotation. To accomplish this, expression profiles of 1234 HyP and conserved genes were used from transcriptomic datasets of 11 environmental stresses, complemented with shotgun LC–MS/MS and AMT tag proteomic data. Genes were divided into putatively polycistronic operons and those predicted to be monocistronic, then classified by basal expression levels and grouped according to changes in expression for one or multiple stresses. One thousand two hundred and twelve of these genes were transcribed with 786 producing detectable proteins. There was no evidence for expression of 17 predicted genes. Except for the latter, monocistronic gene annotation was expanded using the above criteria along with matching Clusters of Orthologous Groups. Polycistronic genes were annotated in the same manner with inferences from their proximity to more confidently annotated genes. Two targeted deletion mutants were used as test cases to determine the relevance of the inferred functional annotations. PMID:19293273

  9. Expression profiling of hypothetical genes in Desulfovibrio vulgaris leads to improved functional annotation

    SciTech Connect

    Elias, Dwayne A.; Mukhopadhyay, Aindrila; Joachimiak, Marcin P.; Drury, Elliott C.; Redding, Alyssa M.; Yen, Huei-Che B.; Fields, Matthew W.; Hazen, Terry C.; Arkin, Adam P.; Keasling, Jay D.; Wall, Judy D.

    2008-10-27

    Hypothetical and conserved hypothetical genes account for >30% of sequenced bacterial genomes. For the sulfate-reducing bacterium Desulfovibrio vulgaris Hildenborough, 347 of the 3634 genes were annotated as conserved hypothetical (9.5%) along with 887 hypothetical genes (24.4%). Given the large fraction of the genome, it is plausible that some of these genes serve critical cellular roles. The study goals were to determine which genes were expressed and provide a more functionally based annotation. To accomplish this, expression profiles of 1234 hypothetical and conserved genes were used from transcriptomic datasets of 11 environmental stresses, complemented with shotgun LC-MS/MS and AMT tag proteomic data. Genes were divided into putatively polycistronic operons and those predicted to be monocistronic, then classified by basal expression levels and grouped according to changes in expression for one or multiple stresses. 1212 of these genes were transcribed with 786 producing detectable proteins. There was no evidence for expression of 17 predicted genes. Except for the latter, monocistronic gene annotation was expanded using the above criteria along with matching Clusters of Orthologous Groups. Polycistronic genes were annotated in the same manner with inferences from their proximity to more confidently annotated genes. Two targeted deletion mutants were used as test cases to determine the relevance of the inferred functional annotations.

  10. A Coding System with Independent Annotations of Gesture Forms and Functions during Verbal Communication: Development of a Database of Speech and GEsture (DoSaGE).

    PubMed

    Kong, Anthony Pak-Hin; Law, Sam-Po; Kwan, Connie Ching-Yin; Lai, Christy; Lam, Vivian

    2015-03-01

    Gestures are commonly used together with spoken language in human communication. One major limitation of gesture investigations in the existing literature lies in the fact that the coding of forms and functions of gestures has not been clearly differentiated. This paper first described a recently developed Database of Speech and GEsture (DoSaGE) based on independent annotation of gesture forms and functions among 119 neurologically unimpaired right-handed native speakers of Cantonese (divided into three age and two education levels), and presented findings of an investigation examining how gesture use was related to age and linguistic performance. Consideration of these two factors, for which normative data are currently very limited or lacking in the literature, is relevant and necessary when one evaluates gesture employment among individuals with and without language impairment. Three speech tasks, including monologue of a personally important event, sequential description, and story-telling, were used for elicitation. The EUDICO Linguistic ANnotator (ELAN) software was used to independently annotate each participant's linguistic information of the transcript, forms of gestures used, and the function for each gesture. About one-third of the subjects did not use any co-verbal gestures. While the majority of gestures were non-content-carrying, which functioned mainly for reinforcing speech intonation or controlling speech flow, the content-carrying ones were used to enhance speech content. Furthermore, individuals who are younger or linguistically more proficient tended to use fewer gestures, suggesting that normal speakers gesture differently as a function of age and linguistic performance.

  11. A Coding System with Independent Annotations of Gesture Forms and Functions during Verbal Communication: Development of a Database of Speech and GEsture (DoSaGE)

    PubMed Central

    Kong, Anthony Pak-Hin; Law, Sam-Po; Kwan, Connie Ching-Yin; Lai, Christy; Lam, Vivian

    2014-01-01

    Gestures are commonly used together with spoken language in human communication. One major limitation of gesture investigations in the existing literature lies in the fact that the coding of forms and functions of gestures has not been clearly differentiated. This paper first described a recently developed Database of Speech and GEsture (DoSaGE) based on independent annotation of gesture forms and functions among 119 neurologically unimpaired right-handed native speakers of Cantonese (divided into three age and two education levels), and presented findings of an investigation examining how gesture use was related to age and linguistic performance. Consideration of these two factors, for which normative data are currently very limited or lacking in the literature, is relevant and necessary when one evaluates gesture employment among individuals with and without language impairment. Three speech tasks, including monologue of a personally important event, sequential description, and story-telling, were used for elicitation. The EUDICO Linguistic ANnotator (ELAN) software was used to independently annotate each participant’s linguistic information of the transcript, forms of gestures used, and the function for each gesture. About one-third of the subjects did not use any co-verbal gestures. While the majority of gestures were non-content-carrying, which functioned mainly for reinforcing speech intonation or controlling speech flow, the content-carrying ones were used to enhance speech content. Furthermore, individuals who are younger or linguistically more proficient tended to use fewer gestures, suggesting that normal speakers gesture differently as a function of age and linguistic performance. PMID:25667563

  12. Evaluating Computational Gene Ontology Annotations.

    PubMed

    Škunca, Nives; Roberts, Richard J; Steffen, Martin

    2017-01-01

    Two avenues to understanding gene function are complementary and often overlapping: experimental work and computational prediction. While experimental annotation generally produces high-quality annotations, it is low throughput. Conversely, computational annotations have broad coverage, but the quality of annotations may be variable, and therefore evaluating the quality of computational annotations is a critical concern. In this chapter, we provide an overview of strategies to evaluate the quality of computational annotations. First, we discuss why evaluating quality in this setting is not trivial. We highlight the various issues that threaten to bias the evaluation of computational annotations, most of which stem from the incompleteness of biological databases. Second, we discuss solutions that address these issues, for example, targeted selection of new experimental annotations and leveraging the existing experimental annotations.

  13. CIPRO 2.5: Ciona intestinalis protein database, a unique integrated repository of large-scale omics data, bioinformatic analyses and curated annotation, with user rating and reviewing functionality.

    PubMed

    Endo, Toshinori; Ueno, Keisuke; Yonezawa, Kouki; Mineta, Katsuhiko; Hotta, Kohji; Satou, Yutaka; Yamada, Lixy; Ogasawara, Michio; Takahashi, Hiroki; Nakajima, Ayako; Nakachi, Mia; Nomura, Mamoru; Yaguchi, Junko; Sasakura, Yasunori; Yamasaki, Chisato; Sera, Miho; Yoshizawa, Akiyasu C; Imanishi, Tadashi; Taniguchi, Hisaaki; Inaba, Kazuo

    2011-01-01

    The Ciona intestinalis protein database (CIPRO) is an integrated protein database for the tunicate species C. intestinalis. The database is unique in two respects: first, because of its phylogenetic position, Ciona is a suitable model for understanding vertebrate evolution; and second, the database includes original large-scale transcriptomic and proteomic data. Ciona intestinalis has also been a favorite of developmental biologists. Therefore, large amounts of data exist on its development and morphology, along with a recent genome sequence and gene expression data. The CIPRO database is aimed at collecting those published data as well as providing unique information from unpublished experimental data, such as 3D expression profiling, 2D-PAGE and mass spectrometry-based large-scale analyses at various developmental stages, curated annotation data and various bioinformatic data, to facilitate research in diverse areas, including developmental, comparative and evolutionary biology. For medical and evolutionary research, homologs in humans and major model organisms are intentionally included. The current database is based on a recently developed KH model containing 36,034 unique sequences, but for higher usability it covers all 89,683 known and predicted proteins from all gene models for this species. Of these sequences, more than 10,000 proteins have been manually annotated. Furthermore, to establish a community-supported protein database, these annotations are open to evaluation by users through the CIPRO website. CIPRO 2.5 is freely accessible at http://cipro.ibio.jp/2.5.

  14. The Plant Ontology Database: A community resource for plant structure and developmental stages controlled vocabulary and annotations

    Technology Transfer Automated Retrieval System (TEKTRAN)

    The Plant Ontology Consortium (POC, http://www.plantontology.org) is a collaborative effort among model plant genome databases and plant researchers that aims to create, maintain, and facilitate the use of a controlled vocabulary (ontology) for plants. The ontology allows users to ascribe attributes o...

  15. The Plant Ontology Database: A Community Resource for Plant Structure and Developmental Stages Controlled Vocabulary and Annotations

    Technology Transfer Automated Retrieval System (TEKTRAN)

    The Plant Ontology Consortium (POC, http://www.plantontology.org) is a collaborative effort among model plant genome databases and plant researchers that aims to create, maintain and facilitate the use of a controlled vocabulary (ontology) for plants. The ontology allows users to ascribe attributes o...

  16. ERIC Digests: An Annotated Bibliography of All ERIC Digests Announced in the ERIC Database through March 1993.

    ERIC Educational Resources Information Center

    Pugh, Elizabeth, Comp.; Brandhorst, Ted, Comp.

    Digests prepared by clearinghouses of the Educational Resources Information Center (ERIC) began to emerge seriously in 1984 as one of the most desirable and popular products of the various clearinghouses. Although ERIC Digests were not initially listed in the ERIC database because of their brevity, today they are entered routinely. This…

  17. The Drosophila melanogaster PeptideAtlas facilitates the use of peptide data for improved fly proteomics and genome annotation

    PubMed Central

    Loevenich, Sandra N; Brunner, Erich; King, Nichole L; Deutsch, Eric W; Stein, Stephen E; Aebersold, Ruedi; Hafen, Ernst

    2009-01-01

    Background Crucial foundations of any quantitative systems biology experiment are correct genome and proteome annotations. Protein databases compiled from high quality empirical protein identifications that are in turn based on correct gene models increase the correctness, sensitivity, and quantitative accuracy of systems biology genome-scale experiments. Results In this manuscript, we present the Drosophila melanogaster PeptideAtlas, a fly proteomics and genomics resource of unsurpassed depth. Based on peptide mass spectrometry data collected in our laboratory the portal allows querying fly protein data observed with respect to gene model confirmation and splice site verification as well as for the identification of proteotypic peptides suited for targeted proteomics studies. Additionally, the database provides consensus mass spectra for observed peptides along with qualitative and quantitative information about the number of observations of a particular peptide and the sample(s) in which it was observed. Conclusion PeptideAtlas is an open access database for the Drosophila community that has several features and applications that support (1) reduction of the complexity inherently associated with performing targeted proteomic studies, (2) designing and accelerating shotgun proteomics experiments, (3) confirming or questioning gene models, and (4) adjusting gene models such that they are in line with observed Drosophila peptides. While the database consists of proteomic data it is not required that the user is a proteomics expert. PMID:19210778

  18. Expression of cinnamyl alcohol dehydrogenases and their putative homologues during Arabidopsis thaliana growth and development: lessons for database annotations?

    PubMed

    Kim, Sung-Jin; Kim, Kye-Won; Cho, Man-Ho; Franceschi, Vincent R; Davin, Laurence B; Lewis, Norman G

    2007-07-01

    A major goal currently in Arabidopsis research is determination of the (biochemical) function of each of its approximately 27,000 genes. To date, however, 12% of its genes actually have known biochemical roles. In this study, we considered it instructive to identify the gene expression patterns of nine (so-called AtCAD1-9) of 17 genes originally annotated by The Arabidopsis Information Resource (TAIR) as cinnamyl alcohol dehydrogenase (CAD, EC 1.1.1.195) homologues [see Costa, M.A., Collins, R.E., Anterola, A.M., Cochrane, F.C., Davin, L.B., Lewis N.G., 2003. An in silico assessment of gene function and organization of the phenylpropanoid pathway metabolic networks in Arabidopsis thaliana and limitations thereof. Phytochemistry 64, 1097-1112.]. In agreement with our biochemical studies in vitro [Kim, S.-J., Kim, M.-R., Bedgar, D.L., Moinuddin, S.G.A., Cardenas, C.L., Davin, L.B., Kang, C.-H., Lewis, N.G., 2004. Functional reclassification of the putative cinnamyl alcohol dehydrogenase multigene family in Arabidopsis. Proc. Natl. Acad. Sci. USA 101, 1455-1460.], and analysis of a double mutant [Sibout, R., Eudes, A., Mouille, G., Pollet, B., Lapierre, C., Jouanin, L., Séguin A., 2005. Cinnamyl Alcohol Dehydrogenase-C and -D are the primary genes involved in lignin biosynthesis in the floral stem of Arabidopsis. Plant Cell 17, 2059-2076.], both AtCAD5 (At4g34230) and AtCAD4 (At3g19450) were found to have expression patterns consistent with development/formation of different forms of the lignified vascular apparatus, e.g. lignifying stem tissues, bases of trichomes, hydathodes, abscission zones of siliques, etc. Expression was also observed in various non-lignifying zones (e.g. root caps) indicative of, perhaps, a role in plant defense. In addition, expression patterns of the four CAD-like homologues were investigated, i.e. AtCAD2 (At2g21730), AtCAD3 (At2g21890), AtCAD7 (At4g37980) and AtCAD8 (At4g37990), each of which previously had been demonstrated to have low CAD

  19. Aviation and the Environment. A Selected, Annotated Bibliography Related to Aviation's Responses Toward Improving the Environment.

    ERIC Educational Resources Information Center

    Marshall, Jane

    This informal, brief bibliography attempts to stress the positive side of aviation, annotating documents that explain how the airlines, aircraft engine manufacturers, government agencies, military aviation, and general aviation are meeting their responsibilities in solving environmental problems. Topics arousing public concern are identified:…

  20. Hedgehog Signaling Pathway Database: a repository of current annotation efforts and resources for the Hh research community.

    PubMed

    Hervold, Kieran; Martin, Andrew; Kirkpatrick, Roger A; Mc Kenna, Paul F; Ramirez-Weber, F A

    2007-01-01

    The Hedgehog Signaling Pathway Database is a curated repository of information pertaining to the Hedgehog developmental pathway. It was designed to provide centralized access to a wide range of relevant information in an organism-agnostic manner. Data are provided for all genes and gene targets known to be involved in the Hh pathway across various organisms. The data provided include DNA and protein sequences as well as domain structure motifs. All known human diseases associated with the Hh pathway are indexed including experimental data on therapeutic agents and their molecular targets. Hh researchers will find useful information on relevant protocols, tissue cell lines and reagents used in current Hh research projects. Curated content is also provided for publications, grants and patents relating to the Hh pathway. The database can be accessed at http://www.hedgehog.sfsu.edu.

  1. Improving knowledge management through the support of image examination and data annotation using DICOM structured reporting.

    PubMed

    Torres, José Salavert; Damian Segrelles Quilis, J; Espert, Ignacio Blanquer; García, Vicente Hernandez

    2012-12-01

    Considerable effort has been invested in improving the image diagnosis process in different medical areas using information technologies. The field of medical imaging involves two main data types: medical images and reports. Developments based on the DICOM standard have proven to be a convenient and widespread solution among the medical community. The main objective of this work is to design a Web application prototype that will be able to improve diagnosis and follow-up of breast cancer patients. It is based on TRENCADIS middleware, which provides a knowledge-oriented storage model composed of federated repositories of DICOM image studies and DICOM-SR medical reports. The full structure and contents of the diagnosis reports are used as metadata for indexing images. The TRENCADIS infrastructure takes full advantage of Grid technologies by deploying multi-resource grid services that enable multiple views (report schemes) of the knowledge database. The paper presents a real deployment of this Web application prototype in the Dr. Peset Hospital, providing radiologists with a tool to create, store and search diagnostic reports based on breast cancer explorations (mammography, magnetic resonance, ultrasound, pre-surgery biopsy and post-surgery biopsy), improving support for diagnostic decisions. Technical details of the use cases (outlining enhanced multi-resource grid services communication and processing steps) and of the interactions between actors and the deployed prototype are described. As a result, information is more structured, the logic is clearer, network messages have been reduced and, in general, the system is more resistant to failures.

  2. Negatome 2.0: a database of non-interacting proteins derived by literature mining, manual annotation and protein structure analysis.

    PubMed

    Blohm, Philipp; Frishman, Goar; Smialowski, Pawel; Goebels, Florian; Wachinger, Benedikt; Ruepp, Andreas; Frishman, Dmitrij

    2014-01-01

    Knowledge about non-interacting proteins (NIPs) is important for training algorithms to predict protein-protein interactions (PPIs) and for assessing the false positive rates of PPI detection efforts. We present the second version of Negatome, a database of proteins and protein domains that are unlikely to engage in physical interactions (available online at http://mips.helmholtz-muenchen.de/proj/ppi/negatome). Negatome is derived by manual curation of literature and by analyzing three-dimensional structures of protein complexes. The main methodological innovation in Negatome 2.0 is the utilization of an advanced text mining procedure to guide the manual annotation process. Potential non-interactions were identified by a modified version of Excerbt, a text mining tool based on semantic sentence analysis. Manual verification shows that nearly half of the text mining results with the highest confidence values correspond to NIP pairs. Compared to the first version, the contents of the database have grown by over 300%.

  3. Algal functional annotation tool

    SciTech Connect

    2012-07-12

    Abstract BACKGROUND: Progress in genome sequencing is proceeding at an exponential pace, and several new algal genomes are becoming available every year. One of the challenges facing the community is the association of protein sequences encoded in the genomes with biological function. While most genome assembly projects generate annotations for predicted protein sequences, they are usually limited and integrate functional terms from a limited number of databases. Another challenge is the use of annotations to interpret large lists of 'interesting' genes generated by genome-scale datasets. Previously, these gene lists had to be analyzed across several independent biological databases, often on a gene-by-gene basis. In contrast, several annotation databases, such as DAVID, integrate data from multiple functional databases and reveal underlying biological themes of large gene lists. While several such databases have been constructed for animals, none is currently available for the study of algae. Due to renewed interest in algae as potential sources of biofuels and the emergence of multiple algal genome sequences, a significant need has arisen for such a database to process the growing compendiums of algal genomic data. DESCRIPTION: The Algal Functional Annotation Tool is a web-based comprehensive analysis suite integrating annotation data from several pathway, ontology, and protein family databases. The current version provides annotation for the model alga Chlamydomonas reinhardtii, and in the future will include additional genomes. The site allows users to interpret large gene lists by identifying associated functional terms, and their enrichment. Additionally, expression data for several experimental conditions were compiled and analyzed to provide an expression-based enrichment search. A tool to search for functionally-related genes based on gene expression across these conditions is also provided. Other features include dynamic visualization of genes on KEGG

  4. DSSTOX STRUCTURE-SEARCHABLE PUBLIC TOXICITY DATABASE NETWORK: CURRENT PROGRESS AND NEW INITIATIVES TO IMPROVE CHEMO-BIOINFORMATICS CAPABILITIES

    EPA Science Inventory

    The EPA DSSTox website (http://www.epa.gov/nheerl/dsstox) publishes standardized, structure-annotated toxicity databases, covering a broad range of toxicity disciplines. Each DSSTox database features documentation written in collaboration with the source authors and toxicity expe...

  5. Search for 5'-leader regulatory RNA structures based on gene annotation aided by the RiboGap database.

    PubMed

    Naghdi, Mohammad Reza; Smail, Katia; Wang, Joy X; Wade, Fallou; Breaker, Ronald R; Perreault, Jonathan

    2017-03-15

    The discovery of noncoding RNAs (ncRNAs) and their importance for gene regulation led us to develop bioinformatics tools to pursue the discovery of novel ncRNAs. Finding ncRNAs de novo is challenging, first due to the difficulty of retrieving large numbers of sequences for given gene activities, and second due to exponential demands on calculation needed for comparative genomics on a large scale. Recently, several tools for the prediction of conserved RNA secondary structure were developed, but many of them are not designed to uncover new ncRNAs, or are too slow for conducting analyses on a large scale. Here we present various approaches using the database RiboGap as a primary tool for finding known ncRNAs and for uncovering simple sequence motifs with regulatory roles. This database also can be used to easily extract intergenic sequences of eubacteria and archaea to find conserved RNA structures upstream of given genes. We also show how to extend analysis further to choose the best candidate ncRNAs for experimental validation.

  6. Genic and Intergenic SSR Database Generation, SNPs Determination and Pathway Annotations, in Date Palm (Phoenix dactylifera L.)

    PubMed Central

    2016-01-01

    The present investigation used bioinformatics tools to identify and characterize simple sequence repeats (SSRs) within the third version of the date palm genome and to develop a new SSR primer database. In addition, single nucleotide polymorphisms (SNPs) located within the SSR flanking regions were recognized. Moreover, the pathways for the sequences assigned by SSR primers, the biological functions and gene interactions were determined. A total of 172,075 SSR motifs were identified in the date palm genome sequence, with a frequency of 450.97 SSRs per Mb. Of these, 130,014 SSRs (75.6%) were located within the intergenic regions with a frequency of 499 SSRs per Mb, while only 42,061 SSRs (24.4%) were located within the genic regions with a frequency of 347.5 SSRs per Mb. A total of 111,403 SSR primer pairs were designed, representing 291.9 SSR primers per Mb. Of the 111,403, only 31,380 SSR primers were in the genic regions, while 80,023 primers were in the intergenic regions. A total of 250,507 SNPs were recognized in 84,172 SSR flanking regions, which represents 75.55% of the total SSR flanking regions. Out of 12,274 genes, only 463 genes comprising 896 SSR primers were mapped onto 111 pathways using the KEGG database. The most abundant enzymes were identified in the pathway related to the biosynthesis of antibiotics. We tested 1031 SSR primers using both publicly available date palm genome sequences as templates in in silico PCR reactions. For in vitro validation, 31 SSR primers among those used in the in silico PCR were synthesized and tested for their ability to detect polymorphism among six Egyptian date palm cultivars. All tested primers successfully amplified products, but only 18 primers detected polymorphic amplicons among the studied date palm cultivars. PMID:27434138
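
    As an illustration only, the kind of SSR scan summarized in this abstract can be sketched with a regular expression that looks for perfect tandem repeats. This is not the authors' tool: the function name `find_ssrs`, the unit-length range (2 to 6 bp) and the minimum repeat count are assumptions chosen for the example. The last lines re-derive the genic and intergenic shares from the counts quoted in the abstract.

```python
import re


def find_ssrs(sequence, min_unit=2, max_unit=6, min_repeats=4):
    """Return (start, motif, repeat_count) for each perfect tandem repeat.

    Hypothetical helper: thresholds are illustrative, not taken from the paper.
    """
    # The lazy quantifier tries the shortest repeating unit first at each position.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(sequence.upper()):
        motif = match.group(1)
        hits.append((match.start(), motif, len(match.group(0)) // len(motif)))
    return hits


# Toy sequence (made up): five AC repeats, a spacer, then four ATG repeats.
toy = "TT" + "AC" * 5 + "TTC" + "ATG" * 4 + "CC"
print(find_ssrs(toy))

# Shares recomputed from the counts reported in the abstract.
total, intergenic, genic = 172_075, 130_014, 42_061
print(round(100 * intergenic / total, 1), round(100 * genic / total, 1))
```

    A regular expression only finds perfect repeats; dedicated SSR finders typically also handle imperfect and compound repeats, which this sketch does not attempt.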

  7. U.S. EPA computational toxicology programs: Central role of chemical-annotation efforts and molecular databases

    EPA Science Inventory

    EPA’s National Center for Computational Toxicology is engaged in high-profile research efforts to improve the ability to more efficiently and effectively prioritize and screen thousands of environmental chemicals for potential toxicity. A central component of these efforts invol...

  8. Novel scripts for improved annotation and selection of variants from whole exome sequencing in cancer research

    PubMed Central

    Hansen, Marcus Celik; Nederby, Line; Roug, Anne; Villesen, Palle; Kjeldsen, Eigil; Nyvold, Charlotte Guldborg; Hokland, Peter

    2015-01-01

    Sequencing the exome is quickly becoming the preferred method for discovering disease-inducing mutations. While obtaining data sets is a straightforward procedure, the subsequent analysis and interpretation of the data is a limiting step for clinical applications. Thus, while the initial mutation and variant calling can be performed by a bioinformatician or trained researcher, the output from robust packages such as MuTect and GATK is not directly informative for the general life scientist. In an attempt to obviate this problem we have created complementary Wolfram scripts, which enable easy downstream annotation and selection, presented here from the perspective of hematological relevance. They also give the researcher the opportunity to extend the analysis by having the full-fledged programming and analysis environment of Mathematica at hand. In brief, post-processing is performed by: • Mapping of germ line and somatic variants to coding regions, and defining variant sets within Mathematica. • Processing of variants in variant effect predictor. • Extended annotation, relevance scoring and defining focus areas through the provided functions. PMID:26150983

  9. Expression profiling of hypothetical genes in Desulfovibrio vulgaris leads to improved functional annotation

    SciTech Connect

    Elias, Dwayne A.; Mukhopadhyay, Aindrila; Joachimiak, Marcine P.; Drury, Elliott C.; Redding, Alyssa M.; Yen, Huei-Che B.; Fields, Matthew; Hazen, Terry C.; Arkin, Adam P.; Keasling, Jay D.; Wall, Judy D.

    2009-03-17

    Hypothetical (HyP) and conserved HyP genes account for >30% of sequenced bacterial genomes. For the sulfate-reducing bacterium Desulfovibrio vulgaris Hildenborough, 347 of the 3634 genes were annotated as conserved HyP (9.5%) along with 887 HyP genes (24.4%). Given the large fraction of the genome, it is plausible that some of these genes serve critical cellular roles. The study goals were to determine which genes were expressed and provide a more functionally based annotation. To accomplish this, expression profiles of 1234 HyP and conserved genes were used from transcriptomic datasets of 11 environmental stresses, complemented with shotgun LC–MS/MS and AMT tag proteomic data. Genes were divided into putatively polycistronic operons and those predicted to be monocistronic, then classified by basal expression levels and grouped according to changes in expression for one or multiple stresses. One thousand two hundred and twelve of these genes were transcribed with 786 producing detectable proteins. There was no evidence for expression of 17 predicted genes.

  10. Metabolomic database annotations via query of elemental compositions: Mass accuracy is insufficient even at less than 1 ppm

    PubMed Central

    Kind, Tobias; Fiehn, Oliver

    2006-01-01

    Background Metabolomic studies are targeted at identifying and quantifying all metabolites in a given biological context. Among the tools used for metabolomic research, mass spectrometry is one of the most powerful tools. However, metabolomics by mass spectrometry always reveals a high number of unknown compounds which complicate in depth mechanistic or biochemical understanding. In principle, mass spectrometry can be utilized within strategies of de novo structure elucidation of small molecules, starting with the computation of the elemental composition of an unknown metabolite using accurate masses with errors <5 ppm (parts per million). However even with very high mass accuracy (<1 ppm) many chemically possible formulae are obtained in higher mass regions. In automatic routines an additional orthogonal filter therefore needs to be applied in order to reduce the number of potential elemental compositions. This report demonstrates the necessity of isotope abundance information by mathematical confirmation of the concept. Results High mass accuracy (<1 ppm) alone is not enough to exclude enough candidates with complex elemental compositions (C, H, N, S, O, P, and potentially F, Cl, Br and Si). Use of isotopic abundance patterns as a single further constraint removes >95% of false candidates. This orthogonal filter can condense several thousand candidates down to only a small number of molecular formulas. Example calculations for 10, 5, 3, 1 and 0.1 ppm mass accuracy are given. Corresponding software scripts can be downloaded from . A comparison of eight chemical databases revealed that PubChem and the Dictionary of Natural Products can be recommended for automatic queries using molecular formulae. Conclusion More than 1.6 million molecular formulae in the range 0–500 Da were generated in an exhaustive manner under strict observation of mathematical and chemical rules. Assuming that ion species are fully resolved (either by chromatography or by high resolution
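
    The point about mass accuracy can be made concrete with a small brute-force sketch. It is not the authors' software: the function names, the restriction to C, H, N and O, and the upper element counts are assumptions for illustration, and no chemical rules or isotope-pattern filters are applied. It enumerates every element combination whose monoisotopic mass falls inside a ppm window around a neutral target mass.

```python
from itertools import product

# Monoisotopic masses (Da) of the most abundant isotopes.
MONO = {"C": 12.0, "H": 1.00782503, "N": 14.00307401, "O": 15.99491462}


def ppm_window(mass, ppm):
    """Return the (low, high) mass interval for a given ppm tolerance."""
    delta = mass * ppm / 1e6
    return mass - delta, mass + delta


def formula_mass(c, h, n, o):
    """Monoisotopic mass of a neutral CcHhNnOo composition."""
    return c * MONO["C"] + h * MONO["H"] + n * MONO["N"] + o * MONO["O"]


def candidates(target, ppm, max_counts=(20, 40, 6, 10)):
    """Enumerate (C, H, N, O) counts whose mass lies within the ppm window.

    Assumption: max_counts is an arbitrary search bound; chemically
    impossible compositions are NOT filtered out.
    """
    low, high = ppm_window(target, ppm)
    hits = []
    for c, h, n, o in product(*(range(m + 1) for m in max_counts)):
        if low <= formula_mass(c, h, n, o) <= high:
            hits.append((c, h, n, o))
    return hits


# Glucose, C6H12O6, has a monoisotopic mass of about 180.06339 Da.
for tolerance in (10, 5, 3, 1, 0.1):
    print(tolerance, "ppm:", len(candidates(180.06339, tolerance)), "candidates")
```

    At low masses the candidate list is short; the abstract's argument concerns higher mass regions and larger element sets, where the number of hits grows quickly and an orthogonal filter such as isotope abundance becomes necessary.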

  11. The Ensembl gene annotation system

    PubMed Central

    Aken, Bronwen L.; Ayling, Sarah; Barrell, Daniel; Clarke, Laura; Curwen, Valery; Fairley, Susan; Fernandez Banet, Julio; Billis, Konstantinos; García Girón, Carlos; Hourlier, Thibaut; Howe, Kevin; Kähäri, Andreas; Kokocinski, Felix; Martin, Fergal J.; Murphy, Daniel N.; Nag, Rishi; Ruffier, Magali; Schuster, Michael; Tang, Y. Amy; Vogel, Jan-Hinnerk; White, Simon; Zadissa, Amonida; Flicek, Paul

    2016-01-01

    The Ensembl gene annotation system has been used to annotate over 70 different vertebrate species across a wide range of genome projects. Furthermore, it generates the automatic alignment-based annotation for the human and mouse GENCODE gene sets. The system is based on the alignment of biological sequences, including cDNAs, proteins and RNA-seq reads, to the target genome in order to construct candidate transcript models. Careful assessment and filtering of these candidate transcripts ultimately leads to the final gene set, which is made available on the Ensembl website. Here, we describe the annotation process in detail. Database URL: http://www.ensembl.org/index.html PMID:27337980

  12. Improving Recall Using Database Management Systems: A Learning Strategy.

    ERIC Educational Resources Information Center

    Jonassen, David H.

    1986-01-01

    Describes the use of microcomputer database management systems to facilitate the instructional uses of learning strategies relating to information processing skills, especially recall. Two learning strategies, cross-classification matrixing and node acquisition and integration, are highlighted. (Author/LRW)

  13. Improvement of barley genome annotations by deciphering the Haruna Nijo genome.

    PubMed

    Sato, Kazuhiro; Tanaka, Tsuyoshi; Shigenobu, Shuji; Motoi, Yuka; Wu, Jianzhong; Itoh, Takeshi

    2016-02-01

    Full-length (FL) cDNA sequences provide the most reliable evidence for the presence of genes in genomes. In this report, detailed gene structures of barley, whole genome shotgun (WGS) and additional transcript data of the cultivar Haruna Nijo were quality controlled and compared with the published Morex genome information. Haruna Nijo scaffolds have longer total sequence length with much higher N50 and fewer sequences than those in Morex WGS contigs. The longer Haruna Nijo scaffolds provided efficient FLcDNA mapping, resulting in high coverage and detection of the transcription start sites. In combination with FLcDNAs and RNA-Seq data from four different tissue samples of Haruna Nijo, we identified 51,249 gene models on 30,606 loci. Overall sequence similarity between Haruna Nijo and Morex genome was 95.99%, while that of exon regions was higher (99.71%). These sequence and annotation data of Haruna Nijo are combined with Morex genome data and released from a genome browser. The genome sequence of Haruna Nijo may provide detailed gene structures in addition to the current Morex barley genome information.

  14. WOVOdat, A Worldwide Volcano Unrest Database, to Improve Eruption Forecasts

    NASA Astrophysics Data System (ADS)

    Widiwijayanti, C.; Costa, F.; Win, N. T. Z.; Tan, K.; Newhall, C. G.; Ratdomopurbo, A.

    2015-12-01

    WOVOdat is the World Organization of Volcano Observatories' Database of Volcanic Unrest, an international effort to develop common standards for compiling and storing data on volcanic unrest in a centralized database that is freely web-accessible for reference during volcanic crises, comparative studies, and basic research on pre-eruption processes. WOVOdat will be to volcanology as an epidemiological database is to medicine. Despite the large spectrum of monitoring techniques, interpreting monitoring data throughout the evolution of the unrest and making timely forecasts remain the most challenging tasks for volcanologists. The field of eruption forecasting is becoming more quantitative, based on the understanding of the pre-eruptive magmatic processes and the dynamic interaction between variables that are at play in a volcanic system. Such forecasts must also acknowledge and express the uncertainties; therefore, most current research in this field focuses on the application of event tree analysis to reflect multiple possible scenarios and the probability of each scenario. Such forecasts are critically dependent on comprehensive and authoritative global volcano unrest data sets - the very information currently collected in WOVOdat. As the database becomes more complete, Boolean searches, side-by-side digital and thus scalable comparisons of unrest, and pattern recognition will generate reliable results. Statistical distributions obtained from WOVOdat can then be used to estimate the probabilities of each scenario after specific patterns of unrest. We have established the main web interface for data submission and visualization, and have now incorporated ~20% of worldwide unrest data into the database, covering more than 100 eruptive episodes. In the upcoming years we will concentrate on acquiring data from volcano observatories, developing a robust data query interface, optimizing data mining, and creating tools by which WOVOdat can be used for probabilistic eruption

  15. Improving data accuracy of commercial food outlet databases.

    PubMed

    Ohri-Vachaspati, Punam; Martinez, Diane; Yedidia, Michael J; Petlick, Nirvana

    2011-01-01

    PURPOSE. Assessing food environments often requires using commercially available data. Disparate methods used for classifying food outlets in these databases call for creating a classification approach using common definitions. A systematic strategy for reclassifying food stores and restaurants, as they appear in commercial databases, into categories that differentiate the availability of healthy options is described here. DESIGN AND SETTING. Commercially available data on food outlets including names, addresses, North American Industry Classification System codes, and associated characteristics was obtained for five New Jersey communities. ANALYSIS. A reclassification methodology was developed using criteria and definitions from the literature to categorize food outlets based on availability of healthy options. Information in the database was supplemented by systematic Internet and key word searches, and from phone calls to food outlets. RESULTS. Under the methodology, the 622 supermarket/grocery stores, 183 convenience stores, and 148 specialty stores in the original data were reclassified into 58 supermarkets, 30 grocery stores, 692 convenience stores, and 115 specialty stores. Outlets from the original list of 1485 full-service restaurants and 506 limited-service restaurants were reclassified as 563 full-service restaurants and 1247 limited-service restaurants. Reclassification resulted in less than one-seventh the number of supermarkets and grocery stores, more than three times the number of convenience stores, and twice as many limited-service restaurants, a much less healthy profile than the one generated by using exclusively the commercial databases. CONCLUSION. An explicit and replicable strategy is proposed for reclassifying food outlets in commercial databases into categories that differentiate on the basis of healthy food availability. The intent is to contribute towards building a consensus among researchers on definitions used in public health

  16. Next Generation Models for Storage and Representation of Microbial Biological Annotation

    SciTech Connect

    Quest, Daniel J; Land, Miriam L; Brettin, Thomas S; Cottingham, Robert W

    2010-01-01

    Background Traditional genome annotation systems were developed in a very different computing era, one where the World Wide Web was just emerging. Consequently, these systems are built as centralized black boxes focused on generating high quality annotation submissions to GenBank/EMBL supported by expert manual curation. The exponential growth of sequence data drives a growing need for increasingly higher quality and automatically generated annotation. Typical annotation pipelines utilize traditional database technologies, clustered computing resources, Perl, C, and UNIX file systems to process raw sequence data, identify genes, and predict and categorize gene function. These technologies tightly couple the annotation software system to hardware and third party software (e.g. relational database systems and schemas). This makes annotation systems hard to reproduce, inflexible to modification over time, difficult to assess, difficult to partition across multiple geographic sites, and difficult to understand for those who are not domain experts. These systems are not readily open to scrutiny and therefore not scientifically tractable. The advent of Semantic Web standards such as Resource Description Framework (RDF) and OWL Web Ontology Language (OWL) enables us to construct systems that address these challenges in a new comprehensive way. Results Here, we develop a framework for linking traditional data to OWL-based ontologies in genome annotation. We show how data standards can decouple hardware and third party software tools from annotation pipelines, thereby making annotation pipelines easier to reproduce and assess. An illustrative example shows how TURTLE (Terse RDF Triple Language) can be used as a human readable, but also semantically-aware, equivalent to GenBank/EMBL files. Conclusions The power of this approach lies in its ability to assemble annotation data from multiple databases across multiple locations into a representation that is understandable to
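
    To give a feel for what a Turtle rendering of an annotation record might look like, here is a minimal sketch. It does not reproduce the paper's illustrative example: the `ex:` prefix, the property names (`ex:start`, `ex:end`, `ex:strand`, `ex:product`), the base URI and the gene record itself are all hypothetical, and a real system would draw its terms from published ontologies.

```python
def turtle_literal(text):
    """Quote a string as a Turtle literal, escaping backslashes and quotes."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def gene_to_turtle(gene_id, start, end, strand, product,
                   base="http://example.org/annotation/"):
    """Serialize one gene record as Turtle using a made-up vocabulary.

    Assumption: gene_id is already a valid Turtle local name.
    """
    lines = [
        f"@prefix ex: <{base}> .",
        "",
        f"ex:{gene_id} a ex:Gene ;",
        f"    ex:start {int(start)} ;",    # bare integers are xsd:integer literals
        f"    ex:end {int(end)} ;",
        f"    ex:strand {turtle_literal(strand)} ;",
        f"    ex:product {turtle_literal(product)} .",
    ]
    return "\n".join(lines)


# Hypothetical record, for illustration only.
print(gene_to_turtle("gene0001", 190, 255, "+", "hypothetical protein"))
```

    Each statement is a subject-predicate-object triple, so records written this way can be merged with triples from other sources without agreeing on a shared relational schema first.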

  17. Graph-based sequence annotation using a data integration approach.

    PubMed

    Pesch, Robert; Lysenko, Artem; Hindle, Matthew; Hassani-Pak, Keywan; Thiele, Ralf; Rawlings, Christopher; Köhler, Jacob; Taubert, Jan

    2008-08-25

    The automated annotation of data from high throughput sequencing and genomics experiments is a significant challenge for bioinformatics. Most current approaches rely on sequential pipelines of gene finding and gene function prediction methods that annotate a gene with information from different reference data sources. Each function prediction method contributes evidence supporting a functional assignment. Such approaches generally ignore the links between the information in the reference datasets. These links, however, are valuable for assessing the plausibility of a function assignment and can be used to evaluate the confidence in a prediction. We are working towards a novel annotation system that uses the network of information supporting the function assignment to enrich the annotation process for use by expert curators and predicting the function of previously unannotated genes. In this paper we describe our success in the first stages of this development. We present the data integration steps that are needed to create the core database of integrated reference databases (UniProt, PFAM, PDB, GO and the pathway database Ara-Cyc) which has been established in the ONDEX data integration system. We also present a comparison between different methods for integration of GO terms as part of the function assignment pipeline and discuss the consequences of this analysis for improving the accuracy of gene function annotation. The methods and algorithms presented in this publication are an integral part of the ONDEX system which is freely available from http://ondex.sf.net/.

  18. Genix: a new online automated pipeline for bacterial genome annotation.

    PubMed

    Kremer, Frederico Schmitt; Eslabão, Marcus Redü; Dellagostin, Odir Antônio; Pinto, Luciano da Silva

    2016-12-01

    Next-generation sequencing has significantly reduced the cost of genome-sequencing projects, resulting in a marked increase in the availability of genomic data in public databases. The cheaper and easier it is to sequence new genomes, the more accurate the annotation steps have to be to avoid both the loss of information and the accumulation of erroneous features that may affect the accuracy of further analysis. In the case of bacterial genomes, a range of web annotation software has been developed; however, many applications have yet to incorporate the steps required to improve their results, including the removal of false-positive/spurious features and a more complete identification of non-coding features. We present Genix, a new web-based bacterial genome annotation pipeline. A comparison of the results generated by Genix for four reference genomes against those generated by other annotation tools indicated that our pipeline is able to provide results that are closer to the reference genome annotation, with fewer false-positive proteins and fewer missing functionally annotated proteins. Additionally, the metrics obtained by Genix were slightly better than those obtained by Prokka, a state-of-the-art standalone annotation system. Our results indicate that Genix is a useful tool that is able to provide a more refined result, and may be a user-friendly way to obtain high-quality results.

  19. The SUPERFAMILY database in 2004: additions and improvements.

    PubMed

    Madera, Martin; Vogel, Christine; Kummerfeld, Sarah K; Chothia, Cyrus; Gough, Julian

    2004-01-01

    The SUPERFAMILY database provides structural assignments to protein sequences and a framework for analysis of the results. At the core of the database is a library of profile Hidden Markov Models that represent all proteins of known structure. The library is based on the SCOP classification of proteins: each model corresponds to a SCOP domain and aims to represent an entire superfamily. We have applied the library to predicted proteins from all completely sequenced genomes (currently 154), the Swiss-Prot and TrEMBL databases and other sequence collections. Close to 60% of all proteins have at least one match, and one half of all residues are covered by assignments. All models and full results are available for download and online browsing at http://supfam.org. Users can study the distribution of their superfamily of interest across all completely sequenced genomes, investigate with which other superfamilies it combines and retrieve proteins in which it occurs. Alternatively, concentrating on a particular genome as a whole, it is possible first, to find out its superfamily composition, and secondly, to compare it with that of other genomes to detect superfamilies that are over- or under-represented. In addition, the webserver provides the following standard services: sequence search; keyword search for genomes, superfamilies and sequence identifiers; and multiple alignment of genomic, PDB and custom sequences.

  20. Adaptation and Recommendation Techniques to Improve the Quality of Annotations and the Relevance of Resources in Web 2.0 and Semantic Web-Based Applications

    NASA Astrophysics Data System (ADS)

    Torre, Ilaria

    The Web 2.0 and the Semantic Web represent different forms of evolution of the first-generation Web, and both of them enrich Web resources with semantic annotations. Recommendation and personalization of Web resources is another trend that becomes more and more important with the growth of information, and both the Web 2.0 and the Semantic Web are deeply connected to it. The objective of this paper is to analyze the contribution of recommendation and adaptation techniques to these paradigms and to investigate if these techniques can be used as a bridge for their integration. More specifically, the paper will focus on the contribution of adaptation and recommendation techniques to improve the quality of annotations in the Web 2.0, Semantic Web, and mixed approaches and the relevance of annotated resources that are retrieved or filtered to users.

  1. Improved annotation of antibiotic resistance determinants reveals microbial resistomes cluster by ecology.

    PubMed

    Gibson, Molly K; Forsberg, Kevin J; Dantas, Gautam

    2015-01-01

    Antibiotic resistance is a dire clinical problem with important ecological dimensions. While antibiotic resistance in human pathogens continues to rise at alarming rates, the impact of environmental resistance on human health is still unclear. To investigate the relationship between human-associated and environmental resistomes, we analyzed functional metagenomic selections for resistance against 18 clinically relevant antibiotics from soil and human gut microbiota as well as a set of multidrug-resistant cultured soil isolates. These analyses were enabled by Resfams, a new curated database of protein families and associated highly precise and accurate profile hidden Markov models, confirmed for antibiotic resistance function and organized by ontology. We demonstrate that the antibiotic resistance functions that give rise to the resistance profiles observed in environmental and human-associated microbial communities significantly differ between ecologies. Antibiotic resistance functions that most discriminate between ecologies provide resistance to β-lactams and tetracyclines, two of the most widely used classes of antibiotics in the clinic and agriculture. We also analyzed the antibiotic resistance gene composition of over 6000 sequenced microbial genomes, revealing significant enrichment of resistance functions by both ecology and phylogeny. Together, our results indicate that environmental and human-associated microbial communities harbor distinct resistance genes, suggesting that antibiotic resistance functions are largely constrained by ecology.

  2. Morphosyntactic annotation of CHILDES transcripts*

    PubMed Central

    SAGAE, KENJI; DAVIS, ERIC; LAVIE, ALON; MACWHINNEY, BRIAN; WINTNER, SHULY

    2014-01-01

    Corpora of child language are essential for research in child language acquisition and psycholinguistics. Linguistic annotation of the corpora provides researchers with better means for exploring the development of grammatical constructions and their usage. We describe a project whose goal is to annotate the English section of the CHILDES database with grammatical relations in the form of labeled dependency structures. We have produced a corpus of over 18,800 utterances (approximately 65,000 words) with manually curated gold-standard grammatical relation annotations. Using this corpus, we have developed a highly accurate data-driven parser for the English CHILDES data, which we used to automatically annotate the remainder of the English section of CHILDES. We have also extended the parser to Spanish, and are currently working on supporting more languages. The parser and the manually and automatically annotated data are freely available for research purposes. PMID:20334720

  3. DSSTOX WEBSITE LAUNCH: IMPROVING PUBLIC ACCESS TO DATABASES FOR BUILDING STRUCTURE-TOXICITY PREDICTION MODELS

    EPA Science Inventory

    DSSTox Website Launch: Improving Public Access to Databases for Building Structure-Toxicity Prediction Models
    Ann M. Richard
    US Environmental Protection Agency, Research Triangle Park, NC, USA

    Distributed: Decentralized set of standardized, field-delimited databases,...

  4. annot8r: GO, EC and KEGG annotation of EST datasets

    PubMed Central

    Schmid, Ralf; Blaxter, Mark L

    2008-01-01

    Background: The expressed sequence tag (EST) methodology is an attractive option for the generation of sequence data for species for which no completely sequenced genome is available. The annotation and comparative analysis of such datasets poses a formidable challenge for research groups that do not have the bioinformatics infrastructure of major genome sequencing centres. Therefore, there is a need for user-friendly tools to facilitate the annotation of non-model species EST datasets with well-defined ontologies that enable meaningful cross-species comparisons. To address this, we have developed annot8r, a platform for the rapid annotation of EST datasets with GO-terms, EC-numbers and KEGG-pathways. Results: annot8r automatically downloads all files relevant for the annotation process and generates a reference database that stores UniProt entries, their associated Gene Ontology (GO), Enzyme Commission (EC) and Kyoto Encyclopaedia of Genes and Genomes (KEGG) annotation and additional relevant data. For each of GO, EC and KEGG, annot8r extracts a specific sequence subset from the UniProt dataset based on the information stored in the reference database. These three subsets are then formatted for BLAST searches. The user provides the protein or nucleotide sequences to be annotated and annot8r runs BLAST searches against these three subsets. The BLAST results are parsed and the corresponding annotations retrieved from the reference database. The annotations are saved both as flat files and also in a relational PostgreSQL results database to facilitate more advanced searches within the results. annot8r is integrated with the PartiGene suite of EST analysis tools. Conclusion: annot8r is a tool that assigns GO, EC and KEGG annotations for data sets resulting from EST sequencing projects both rapidly and efficiently. The benefits of an underlying relational database, flexibility and the ease of use of the program make it ideally suited for non-model species EST
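
    The parse-and-look-up step described in this record can be illustrated with a minimal Python sketch. It is not annot8r's code: it assumes the BLAST hits are in the standard 12-column tabular layout (query, subject, ..., e-value, bit score), and the function names, the e-value cutoff, and the toy reference table are all hypothetical.

```python
import csv
import io


def best_hits(blast_tabular_text, max_evalue=1e-5):
    """Return {query_id: (subject_id, bitscore)}, keeping the top-scoring hit.

    Assumes 12 tab-separated columns with the e-value in column 11 and the
    bit score in column 12. The cutoff is an illustrative choice.
    """
    best = {}
    reader = csv.reader(io.StringIO(blast_tabular_text), delimiter="\t")
    for row in reader:
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query, subject = row[0], row[1]
        evalue, bitscore = float(row[10]), float(row[11])
        if evalue > max_evalue:
            continue  # hit too weak to transfer an annotation
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return best


def annotate(best, reference):
    """Look up each query's best subject in a reference annotation table."""
    return {query: reference.get(subject, []) for query, (subject, _) in best.items()}


# Hypothetical toy data: two ESTs, two reference proteins.
EXAMPLE_BLAST = "\n".join([
    "est1\tP12345\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-50\t200",
    "est1\tQ99999\t60.0\t100\t40\t0\t1\t100\t1\t100\t1e-10\t80",
    "est2\tQ99999\t30.0\t50\t35\t0\t1\t50\t1\t50\t0.5\t25",
])
EXAMPLE_REFERENCE = {"P12345": ["GO:0005524"], "Q99999": ["GO:0003677"]}
```

    In this toy input, est2 receives no annotation because its only hit fails the e-value cutoff; a real pipeline would choose its thresholds per annotation type.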

  5. Improved structural annotation of protein-coding genes in the Meloidogyne hapla genome using RNA-Seq.

    PubMed

    Guo, Yuelong; Bird, David McK; Nielsen, Dahlia M

    2014-01-01

    As high-throughput cDNA sequencing (RNA-Seq) is increasingly applied to hypothesis-driven biological studies, the prediction of protein-coding genes based on these data is usurping strictly in silico approaches. Compared with computationally derived gene predictions, structural annotation is more accurate when based on biological evidence, particularly RNA-Seq data. Here, we refine the current genome annotation for the Meloidogyne hapla genome utilizing RNA-Seq data. Published structural annotation defines 14 420 protein-coding genes in the M. hapla genome. Of these, 25% (3751) were found to exhibit some incongruence with RNA-Seq data. Manual annotation enabled these discrepancies to be resolved. Our analysis revealed 544 new gene models that were missing from the prior annotation. Additionally, 1457 transcribed regions were newly identified on the ends of as-yet-unjoined contigs. We also searched for trans-spliced leaders, and based on RNA-Seq data, identified genes that appear to be trans-spliced. Four 22-bp trans-spliced leaders were identified using our pipeline, including the known trans-spliced leader, which is the M. hapla ortholog of SL1. In silico predictions of trans-splicing were validated by comparison with earlier results derived from an independent cDNA library constructed to capture trans-spliced transcripts. The new annotation, which we term HapPep5, is publicly available at www.hapla.org.

  6. Introduction to the STS National Database Series: Outcomes Analysis, Quality Improvement, and Patient Safety.

    PubMed

    Jacobs, Jeffrey P; Shahian, David M; Prager, Richard L; Edwards, Fred H; McDonald, Donna; Han, Jane M; D'Agostino, Richard S; Jacobs, Marshall L; Kozower, Benjamin D; Badhwar, Vinay; Thourani, Vinod H; Gaissert, Henning A; Fernandez, Felix G; Wright, Cam; Fann, James I; Paone, Gaetano; Sanchez, Juan A; Cleveland, Joseph C; Brennan, J Matthew; Dokholyan, Rachel S; O'Brien, Sean M; Peterson, Eric D; Grover, Frederick L; Patterson, G Alexander

    2015-12-01

    The Society of Thoracic Surgeons (STS) National Database is the foundation for most of the Society's quality, research, and patient safety activities. Beginning in January 2016 and repeating each year, The Annals of Thoracic Surgery will publish a monthly Database series of scholarly articles on outcomes analysis, quality improvement, and patient safety. Six articles will be directly derived from the STS National Database and will be published every other month: three articles on outcomes and quality (one each from the STS Adult Cardiac Surgery Database, the STS Congenital Heart Surgery Database, and the STS General Thoracic Surgery Database), and three articles on research (one from each of these three specialty databases). These six articles will alternate with five additional articles on topics related to patient safety. The final article, to be published in December, will provide a summary of the prior 11 manuscripts. This series will allow STS and its Workforces on National Databases, Research Development, and Patient Safety to convey timely information aimed at improving the quality and safety of cardiothoracic surgery.

  7. Annotated Videography.

    ERIC Educational Resources Information Center

    United States Holocaust Memorial Museum, Washington, DC.

    This annotated list of 43 videotapes recommended for classroom use addresses various themes for teaching about the Holocaust, including: (1) overviews of the Holocaust; (2) life before the Holocaust; (3) propaganda; (4) racism, anti-Semitism; (5) "enemies of the state"; (6) ghettos; (7) camps; (8) genocide; (9) rescue; (10) resistance;…

  8. KEGG as a reference resource for gene and protein annotation

    PubMed Central

    Kanehisa, Minoru; Sato, Yoko; Kawashima, Masayuki; Furumichi, Miho; Tanabe, Mao

    2016-01-01

    KEGG (http://www.kegg.jp/ or http://www.genome.jp/kegg/) is an integrated database resource for biological interpretation of genome sequences and other high-throughput data. Molecular functions of genes and proteins are associated with ortholog groups and stored in the KEGG Orthology (KO) database. The KEGG pathway maps, BRITE hierarchies and KEGG modules are developed as networks of KO nodes, representing high-level functions of the cell and the organism. Currently, more than 4000 complete genomes are annotated with KOs in the KEGG GENES database, which can be used as a reference data set for KO assignment and subsequent reconstruction of KEGG pathways and other molecular networks. As an annotation resource, the following improvements have been made. First, each KO record is re-examined and associated with protein sequence data used in experiments of functional characterization. Second, the GENES database now includes viruses, plasmids, and the addendum category for functionally characterized proteins that are not represented in complete genomes. Third, new automatic annotation servers, BlastKOALA and GhostKOALA, are made available utilizing the non-redundant pangenome data set generated from the GENES database. As a resource for translational bioinformatics, various data sets are created for antimicrobial resistance and drug interaction networks. PMID:26476454

  9. eggNOG 4.5: a hierarchical orthology framework with improved functional annotations for eukaryotic, prokaryotic and viral sequences.

    PubMed

    Huerta-Cepas, Jaime; Szklarczyk, Damian; Forslund, Kristoffer; Cook, Helen; Heller, Davide; Walter, Mathias C; Rattei, Thomas; Mende, Daniel R; Sunagawa, Shinichi; Kuhn, Michael; Jensen, Lars Juhl; von Mering, Christian; Bork, Peer

    2016-01-04

    eggNOG is a public resource that provides Orthologous Groups (OGs) of proteins at different taxonomic levels, each with integrated and summarized functional annotations. Developments since the latest public release include changes to the algorithm for creating OGs across taxonomic levels, making nested groups hierarchically consistent. This allows for a better propagation of functional terms across nested OGs and led to the novel annotation of 95 890 previously uncharacterized OGs, increasing overall annotation coverage from 67% to 72%. The functional annotations of OGs have been expanded to also provide Gene Ontology terms, KEGG pathways and SMART/Pfam domains for each group. Moreover, eggNOG now provides pairwise orthology relationships within OGs based on analysis of phylogenetic trees. We have also incorporated a framework for quickly mapping novel sequences to OGs based on precomputed HMM profiles. Finally, eggNOG version 4.5 incorporates a novel data set spanning 2605 viral OGs, covering 5228 proteins from 352 viral proteomes. All data are accessible for bulk downloading, as a web-service, and through a completely redesigned web interface. The new access points provide faster searches and a number of new browsing and visualization capabilities, facilitating the needs of both experts and less experienced users. eggNOG v4.5 is available at http://eggnog.embl.de.

  10. An Experiment Using Document Annotations in Education.

    ERIC Educational Resources Information Center

    Lapique, Francis; Regev, Gil

    This paper describes an ongoing experiment at the Swiss Federal Institute of Technology in Lausanne regarding the use of World Wide Web-based document annotations for educational purposes. An annotation tool called Medium was built, merging database and Web technologies, to be used as a companion for classical university courses and collaborative…

  11. ATGC database and ATGC-COGs: an updated resource for micro- and macro-evolutionary studies of prokaryotic genomes and protein family annotation

    PubMed Central

    Kristensen, David M.; Wolf, Yuri I.; Koonin, Eugene V.

    2017-01-01

    The Alignable Tight Genomic Clusters (ATGCs) database is a collection of closely related bacterial and archaeal genomes that provides several tools to aid research into evolutionary processes in the microbial world. Each ATGC is a taxonomy-independent cluster of 2 or more completely sequenced genomes that meet the objective criteria of a high degree of local gene order (synteny) and a small number of synonymous substitutions in the protein-coding genes. As such, each ATGC is suited for analysis of microevolutionary variations within a cohesive group of organisms (e.g. species), whereas the entire collection of ATGCs is useful for macroevolutionary studies. The ATGC database includes many forms of pre-computed data, in particular ATGC-COGs (Clusters of Orthologous Genes), multiple sequence alignments, a set of ‘index’ orthologs representing the most well-conserved members of each ATGC-COG, the phylogenetic tree of the organisms within each ATGC, etc. Although the ATGC database contains several million proteins from thousands of genomes organized into hundreds of clusters (roughly a 4-fold increase since the last version of the ATGC database), it is now built with completely automated methods and will be regularly updated following new releases of the NCBI RefSeq database. The ATGC database is hosted jointly at the University of Iowa at dmk-brain.ecn.uiowa.edu/ATGC/ and the NCBI at ftp.ncbi.nlm.nih.gov/pub/kristensen/ATGC/atgc_home.html. PMID:28053163

  12. Improvements to the Magnetics Information Consortium (MagIC) Paleo and Rock Magnetic Database

    NASA Astrophysics Data System (ADS)

    Jarboe, N.; Minnett, R.; Tauxe, L.; Koppers, A. A. P.; Constable, C.; Jonestrask, L.

    2015-12-01

    The Magnetics Information Consortium (MagIC) database (http://earthref.org/MagIC/) continues to improve the ease of data uploading and editing, the creation of complex searches, data visualization, and data downloads for the paleomagnetic, geomagnetic, and rock magnetic communities. Online data editing is now available, so proprietary spreadsheet software is no longer needed. The data owner can change values in the database or delete entries through an HTML5 web interface that resembles a typical spreadsheet in behavior and use. Additive uploading now allows additions to data sets to be uploaded with a simple drag-and-drop interface. Searching the database has improved with the addition of more sophisticated search parameters and with the facility to use them in complex combinations. A comprehensive summary view of a search result has been added for quicker data comprehension, while a raw data view is available if one desires to see all data columns as stored in the database. Data visualization plots (ARAI, equal area, demagnetization, Zijderveld, etc.) are presented with the data when appropriate to aid the user in understanding the dataset. MagIC data associated with individual contributions or from online searches may be downloaded in the tab-delimited MagIC text file format for subsequent offline use and analysis. With input from the paleomagnetic, geomagnetic, and rock magnetic communities, the MagIC database will continue to improve as a data warehouse and resource.

  13. Improvements to PATRIC, the all-bacterial Bioinformatics Database and Analysis Resource Center

    PubMed Central

    Wattam, Alice R.; Davis, James J.; Assaf, Rida; Boisvert, Sébastien; Brettin, Thomas; Bun, Christopher; Conrad, Neal; Dietrich, Emily M.; Disz, Terry; Gabbard, Joseph L.; Gerdes, Svetlana; Henry, Christopher S.; Kenyon, Ronald W.; Machi, Dustin; Mao, Chunhong; Nordberg, Eric K.; Olsen, Gary J.; Murphy-Olson, Daniel E.; Olson, Robert; Overbeek, Ross; Parrello, Bruce; Pusch, Gordon D.; Shukla, Maulik; Vonstein, Veronika; Warren, Andrew; Xia, Fangfang; Yoo, Hyunseung; Stevens, Rick L.

    2017-01-01

    The Pathosystems Resource Integration Center (PATRIC) is the bacterial Bioinformatics Resource Center (https://www.patricbrc.org). Recent changes to PATRIC include a redesign of the web interface and some new services that provide users with a platform that takes them from raw reads to an integrated analysis experience. The redesigned interface allows researchers direct access to tools and data, and the emphasis has changed to user-created genome-groups, with detailed summaries and views of the data that researchers have selected. Perhaps the biggest change has been the enhanced capability for researchers to analyze their private data and compare it to the available public data. Researchers can assemble their raw sequence reads and annotate the contigs using RASTtk. PATRIC also provides services for RNA-Seq, variation, model reconstruction and differential expression analysis, all delivered through an updated private workspace. Private data can be compared by ‘virtual integration’ to any of PATRIC's public data. The number of genomes available for comparison in PATRIC has expanded to over 80 000, with a special emphasis on genomes with antimicrobial resistance data. PATRIC uses this data to improve both subsystem annotation and k-mer classification, and tags new genomes as having signatures that indicate susceptibility or resistance to specific antibiotics. PMID:27899627

  14. Improved white spruce (Picea glauca) genome assemblies and annotation of large gene families of conifer terpenoid and phenolic defense metabolism.

    PubMed

    Warren, René L; Keeling, Christopher I; Yuen, Macaire Man Saint; Raymond, Anthony; Taylor, Greg A; Vandervalk, Benjamin P; Mohamadi, Hamid; Paulino, Daniel; Chiu, Readman; Jackman, Shaun D; Robertson, Gordon; Yang, Chen; Boyle, Brian; Hoffmann, Margarete; Weigel, Detlef; Nelson, David R; Ritland, Carol; Isabel, Nathalie; Jaquish, Barry; Yanchuk, Alvin; Bousquet, Jean; Jones, Steven J M; MacKay, John; Birol, Inanc; Bohlmann, Joerg

    2015-07-01

    White spruce (Picea glauca), a gymnosperm tree, has been established as one of the models for conifer genomics. We describe the draft genome assemblies of two white spruce genotypes, PG29 and WS77111, innovative tools for the assembly of very large genomes, and the conifer genomics resources developed in this process. The two white spruce genotypes originate from distant geographic regions of western (PG29) and eastern (WS77111) North America, and represent elite trees in two Canadian tree-breeding programs. We present an update (V3 and V4) for a previously reported PG29 V2 draft genome assembly and introduce a second white spruce genome assembly for genotype WS77111. Assemblies of the PG29 and WS77111 genomes confirm the reconstructed white spruce genome size in the 20 Gbp range, and show broad synteny. Using the PG29 V3 assembly and additional white spruce genomics and transcriptomics resources, we performed MAKER-P annotation and meticulous expert annotation of very large gene families of conifer defense metabolism, the terpene synthases and cytochrome P450s. We also comprehensively annotated the white spruce mevalonate, methylerythritol phosphate and phenylpropanoid pathways. These analyses highlighted the large extent of gene and pseudogene duplications in a conifer genome, in particular for genes of secondary (i.e. specialized) metabolism, and the potential for gain and loss of function for defense and adaptation.

  15. The Viral MetaGenome Annotation Pipeline (VMGAP): an automated tool for the functional annotation of viral Metagenomic shotgun sequencing data

    PubMed Central

    Lorenzi, Hernan A.; Hoover, Jeff; Inman, Jason; Safford, Todd; Murphy, Sean; Kagan, Leonid; Williamson, Shannon J.

    2011-01-01

    In the past few years, the field of metagenomics has been growing at an accelerated pace, particularly in response to advancements in new sequencing technologies. The large volume of sequence data from novel organisms generated by metagenomic projects has triggered the development of specialized databases and tools focused on particular groups of organisms or data types. Here we describe a pipeline for the functional annotation of viral metagenomic sequence data. The Viral MetaGenome Annotation Pipeline (VMGAP) takes advantage of a number of specialized databases, such as collections of mobile genetic elements and environmental metagenomes, to improve the classification and functional prediction of viral gene products. The pipeline assigns a functional term to each predicted protein sequence following a suite of comprehensive analyses whose results are ranked according to a hierarchy of priority rules. Additional annotation is provided in the form of enzyme commission (EC) numbers, GO/MeGO terms and Hidden Markov Models together with supporting evidence. PMID:21886867
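
    The idea of ranking analysis results by priority rules can be sketched in a few lines of Python. The evidence types, their ordering, and the fallback name below are hypothetical and chosen only for illustration; the record does not list VMGAP's actual rules.

```python
# Hypothetical evidence types, most trusted first. These names and this
# ordering are assumptions for illustration, not VMGAP's real rule set.
PRIORITY = [
    "curated_hmm",
    "blast_characterized",
    "blast_environmental",
    "domain_only",
]


def assign_term(evidence, priority=PRIORITY, default="hypothetical protein"):
    """Pick the functional term backed by the highest-priority evidence.

    `evidence` maps an evidence type to the term it suggests. Returns a
    (term, evidence_type) pair, or (default, None) when nothing matches.
    """
    for source in priority:
        term = evidence.get(source)
        if term:
            return term, source
    return default, None
```

    For example, a protein with both a domain-only match and a hit to a characterized protein would take its name from the latter, because that evidence type sits higher in the list.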

  16. Towards a Library of Standard Operating Procedures (SOPs) for (meta)genomic annotation

    SciTech Connect

    Kyrpides, Nikos; Angiuoli, Samuel V.; Cochrane, Guy; Field, Dawn; Garrity, George; Gussman, Aaron; Kodira, Chinnappa D.; Klimke, William; Kyrpides, Nikos; Madupu, Ramana; Markowitz, Victor; Tatusova, Tatiana; Thomson, Nick; White, Owen

    2008-04-01

    Genome annotations describe the features of genomes and accompany sequences in genome databases. The methodologies used to generate genome annotation are diverse and typically vary amongst groups. Descriptions of the annotation procedure are helpful in interpreting genome annotation data. Standard Operating Procedures (SOPs) for genome annotation describe the processes that generate genome annotations. Some groups are currently documenting procedures but standards are lacking for structure and content of annotation SOPs. In addition, there is no central repository to store and disseminate procedures and protocols for genome annotation. We highlight the importance of SOPs for genome annotation and endorse a central online repository of SOPs.

  17. Lessons learned while building the Deepwater Horizon Database: Toward improved data sharing in coastal science

    NASA Astrophysics Data System (ADS)

    Thessen, Anne E.; McGinnis, Sean; North, Elizabeth W.

    2016-02-01

    Process studies and coupled-model validation efforts in geosciences often require integration of multiple data types across time and space. For example, improved prediction of hydrocarbon fate and transport is an important societal need which fundamentally relies upon synthesis of oceanography and hydrocarbon chemistry. Yet, there are no publicly accessible databases which integrate these diverse data types in a georeferenced format, nor are there guidelines for developing such a database. The objective of this research was to analyze the process of building one such database to provide baseline information on data sources and data sharing and to document the challenges and solutions that arose during this major undertaking. The resulting Deepwater Horizon Database was approximately 2.4 GB in size and contained over 8 million georeferenced data points collected from industry, government databases, volunteer networks, and individual researchers. The major technical challenges that were overcome were reconciliation of terms, units, and quality flags which were necessary to effectively integrate the disparate data sets. Assembling this database required the development of relationships with individual researchers and data managers which often involved extensive e-mail contacts. The average number of emails exchanged per data set was 7.8. Of the 95 relevant data sets that were discovered, 38 (40%) were obtained, either in whole or in part. Over one third (36%) of the requests for data went unanswered. The majority of responses were received after the first request (64%) and within the first week of the first request (67%). Although fewer than half of the potentially relevant datasets were incorporated into the database, the level of sharing (40%) was high compared to some other disciplines where sharing can be as low as 10%. Our suggestions for building integrated databases include budgeting significant time for e-mail exchanges, being cognizant of the cost versus

  18. UCSC Data Integrator and Variant Annotation Integrator

    PubMed Central

    Hinrichs, Angie S.; Raney, Brian J.; Speir, Matthew L.; Rhead, Brooke; Casper, Jonathan; Karolchik, Donna; Kuhn, Robert M.; Rosenbloom, Kate R.; Zweig, Ann S.; Haussler, David; Kent, W. James

    2016-01-01

    Summary: Two new tools on the UCSC Genome Browser web site provide improved ways of combining information from multiple datasets, optionally including the user's own custom track data and/or data from track hubs. The Data Integrator combines columns from multiple data tracks, showing all items from the first track along with overlapping items from the other tracks. The Variant Annotation Integrator is tailored to adding functional annotations to variant calls; it offers a more restricted set of underlying data tracks but adds predictions of each variant's consequences for any overlapping or nearby gene transcript. When available, it optionally adds further annotations including effect prediction scores from dbNSFP for missense mutations, ENCODE regulatory summary tracks and conservation scores. Availability and implementation: The web tools are freely available at http://genome.ucsc.edu/ and the underlying database is available for download at http://hgdownload.cse.ucsc.edu/. The software (written in C and JavaScript) is available from https://genome-store.ucsc.edu/ and is freely available for academic and non-profit usage; commercial users must obtain a license. Contact: angie@soe.ucsc.edu Supplementary information: Supplementary data are available at Bioinformatics online. PMID:26740527
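
    The Data Integrator behavior described in this record (every item of the first track is kept and paired with overlapping items from the other tracks) amounts to an interval left join. The Python sketch below illustrates that idea only; it is not the UCSC implementation, the `Item` type and function names are hypothetical, and half-open 0-based coordinates are assumed.

```python
from typing import NamedTuple


class Item(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive (assumed convention)
    end: int    # exclusive
    name: str


def overlaps(a, b):
    """True when two half-open intervals on the same chromosome intersect."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def integrate(primary, others):
    """Left-join: one row per primary item, plus overlapping names per track.

    Primary items with no overlaps are kept and paired with empty lists.
    This is a quadratic scan meant for clarity, not for genome-scale data.
    """
    rows = []
    for item in primary:
        row = [item.name]
        for track in others:
            row.append([o.name for o in track if overlaps(item, o)])
        rows.append(row)
    return rows
```

    A production tool would typically use an indexed structure (for example, binned or tree-based interval lookups) instead of the nested loops shown here.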

  19. Ranking Biomedical Annotations with Annotator's Semantic Relevancy

    PubMed Central

    2014-01-01

    Biomedical annotation is a common and effective artifact for researchers to discuss, show opinions, and share discoveries. It is becoming increasingly popular in many online research communities and carries much useful information. Ranking biomedical annotations is critical if data users are to find information efficiently. As the annotator's knowledge about the annotated entity normally determines the quality of the annotations, we evaluate that knowledge, that is, the semantic relationship between them, in two ways. The first is extracting relational information from credible websites by mining association rules between an annotator and a biomedical entity. The second is frequent pattern mining from historical annotations, which reveals common features of biomedical entities that an annotator can annotate with high quality. We propose a weighted and concept-extended RDF model to represent an annotator, a biomedical entity, and their background attributes, and merge information from the two ways as the context of an annotator. Based on that, we present a method to rank the annotations by evaluating their correctness according to users' votes and the semantic relevancy between the annotator and the annotated entity. The experimental results show that the approach is applicable and efficient even when the data set is large. PMID:24899918

  20. Ranking biomedical annotations with annotator's semantic relevancy.

    PubMed

    Wu, Aihua

    2014-01-01

    Biomedical annotation is a common and effective artifact for researchers to discuss, show opinions, and share discoveries. It is becoming increasingly popular in many online research communities and carries much useful information. Ranking biomedical annotations is critical if data users are to find information efficiently. As the annotator's knowledge about the annotated entity normally determines the quality of the annotations, we evaluate that knowledge, that is, the semantic relationship between them, in two ways. The first is extracting relational information from credible websites by mining association rules between an annotator and a biomedical entity. The second is frequent pattern mining from historical annotations, which reveals common features of biomedical entities that an annotator can annotate with high quality. We propose a weighted and concept-extended RDF model to represent an annotator, a biomedical entity, and their background attributes, and merge information from the two ways as the context of an annotator. Based on that, we present a method to rank the annotations by evaluating their correctness according to users' votes and the semantic relevancy between the annotator and the annotated entity. The experimental results show that the approach is applicable and efficient even when the data set is large.

  1. The Protein Information Resource: an integrated public resource of functional annotation of proteins

    PubMed Central

    Wu, Cathy H.; Huang, Hongzhan; Arminski, Leslie; Castro-Alvear, Jorge; Chen, Yongxing; Hu, Zhang-Zhi; Ledley, Robert S.; Lewis, Kali C.; Mewes, Hans-Werner; Orcutt, Bruce C.; Suzek, Baris E.; Tsugita, Akira; Vinayaka, C. R.; Yeh, Lai-Su L.; Zhang, Jian; Barker, Winona C.

    2002-01-01

    The Protein Information Resource (PIR) serves as an integrated public resource of functional annotation of protein data to support genomic/proteomic research and scientific discovery. The PIR, in collaboration with the Munich Information Center for Protein Sequences (MIPS) and the Japan International Protein Information Database (JIPID), produces the PIR-International Protein Sequence Database (PSD), the major annotated protein sequence database in the public domain, containing about 250 000 proteins. To improve protein annotation and the coverage of experimentally validated data, a bibliography submission system is developed for scientists to submit, categorize and retrieve literature information. Comprehensive protein information is available from iProClass, which includes family classification at the superfamily, domain and motif levels, structural and functional features of proteins, as well as cross-references to over 40 biological databases. To provide timely and comprehensive protein data with source attribution, we have introduced a non-redundant reference protein database, PIR-NREF. The database consists of about 800 000 proteins collected from PIR-PSD, SWISS-PROT, TrEMBL, GenPept, RefSeq and PDB, with composite protein names and literature data. To promote database interoperability, we provide XML data distribution and open database schema, and adopt common ontologies. The PIR web site (http://pir.georgetown.edu/) features data mining and sequence analysis tools for information retrieval and functional identification of proteins based on both sequence and annotation information. The PIR databases and other files are also available by FTP (ftp://nbrfa.georgetown.edu/pir_databases). PMID:11752247

  2. CASME II: An Improved Spontaneous Micro-Expression Database and the Baseline Evaluation

    PubMed Central

    Yan, Wen-Jing; Li, Xiaobai; Wang, Su-Jing; Zhao, Guoying; Liu, Yong-Jin; Chen, Yu-Hsin; Fu, Xiaolan

    2014-01-01

    A robust automatic micro-expression recognition system would have broad applications in national safety, police interrogation, and clinical diagnosis. Developing such a system requires high-quality databases with sufficient training samples, which are currently not available. We reviewed the previously developed micro-expression databases and built an improved one (CASME II), with higher temporal resolution (200 fps) and spatial resolution (about 280×340 pixels on facial area). We elicited participants' facial expressions in a well-controlled laboratory environment with proper illumination (such as removing light flickering). Among nearly 3000 facial movements, 247 micro-expressions were selected for the database with action units (AUs) and emotions labeled. For baseline evaluation, LBP-TOP and SVM were employed for feature extraction and classification, respectively, with the leave-one-subject-out cross-validation method. The best performance is 63.41% for 5-class classification. PMID:24475068
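
    Leave-one-subject-out cross-validation, the evaluation protocol named in this record, holds out all samples from one participant per fold so that no subject appears in both training and test data. The following Python sketch shows only the splitting logic; the function name and the toy subject labels are hypothetical, and feature extraction and the classifier are omitted.

```python
def leave_one_subject_out(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices), one fold per subject.

    `subject_ids[i]` is the subject that produced sample i. Every sample from
    the held-out subject goes to the test set; all others go to training.
    """
    for held_out in sorted(set(subject_ids)):
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train, test
```

    Splitting by subject rather than by sample matters here because clips from the same person are highly correlated; a per-sample split would likely overstate accuracy.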

  3. Publicly Available Database: Improved Spectral Line Measurements In SDSS DR7 Galaxies

    NASA Astrophysics Data System (ADS)

    Oh, Kyuseok; Sarzi, M.; Schawinski, K.; Yi, S. K.

    2012-01-01

    We present a new database of absorption and emission line measurements based on the Sloan Digital Sky Survey 7th data release for the galaxies within a redshift of 0.2. Our work makes use of the publicly available penalized pixel-fitting (pPXF) and GANDALF codes, aiming to improve the existing measurements for stellar kinematics, the strength of various absorption-line features, and the flux and width of the emissions from different species of ionized gas. The absorption line strengths measured by the SDSS pipeline are seriously contaminated by emission fill-in. We effectively separate emission lines from absorption lines. For instance, this work successfully extracts the [NI] doublet from Mgb, which leads to a more realistic result for alpha enhancement in late-type galaxies compared to the previous database. Besides accurately measuring line strengths, the database provides new parameters that are indicative of line strength measurement quality. Users can build a subset of the database optimal for their studies using specific cuts in the fitting quality parameters as well as empirical signal-to-noise. Applying these parameters, we found ‘hidden’ broad-line-region galaxies, which turned out to be Seyfert I nuclei that were not picked up as AGN by SDSS. The database is publicly available at http://gem.yonsei.ac.kr/ossy

  4. Curation of the genome annotation of Pichia pastoris (Komagataella phaffii) CBS7435 from gene level to protein function.

    PubMed

    Valli, Minoska; Tatto, Nadine E; Peymann, Armin; Gruber, Clemens; Landes, Nils; Ekker, Heinz; Thallinger, Gerhard G; Mattanovich, Diethard; Gasser, Brigitte; Graf, Alexandra B

    2016-09-01

    As manually curated and non-automated BLAST analysis of the published Pichia pastoris genome sequences revealed many differences between the gene annotations of the strains GS115 and CBS7435, RNA-Seq analysis, supported by proteomics, was performed to improve the genome annotation. Detailed analysis of sequence alignment and protein domain predictions were made to extend the functional genome annotation to all P. pastoris sequences. This allowed the identification of 492 new ORFs, 4916 hypothetical UTRs and the correction of 341 incorrect ORF predictions, which were mainly due to the presence of upstream ATG or erroneous intron predictions. Moreover, 175 previously erroneously annotated ORFs need to be removed from the annotation. In total, we have annotated 5325 ORFs. Regarding the functionality of those genes, we improved all gene and protein descriptions. Thereby, the percentage of ORFs with functional annotation was increased from 48% to 73%. Furthermore, we defined functional groups, covering 25 biological cellular processes of interest, by grouping all genes that are part of the defined process. All data are presented in the newly launched genome browser and database available at www.pichiagenome.org. In summary, we present a wide spectrum of curation of the P. pastoris genome annotation from gene level to protein function.

  5. Gene Ontology annotations and resources.

    PubMed

    Blake, J A; Dolan, M; Drabkin, H; Hill, D P; Li, Ni; Sitnikov, D; Bridges, S; Burgess, S; Buza, T; McCarthy, F; Peddinti, D; Pillai, L; Carbon, S; Dietze, H; Ireland, A; Lewis, S E; Mungall, C J; Gaudet, P; Chrisholm, R L; Fey, P; Kibbe, W A; Basu, S; Siegele, D A; McIntosh, B K; Renfro, D P; Zweifel, A E; Hu, J C; Brown, N H; Tweedie, S; Alam-Faruque, Y; Apweiler, R; Auchinchloss, A; Axelsen, K; Bely, B; Blatter, M -C; Bonilla, C; Bouguerleret, L; Boutet, E; Breuza, L; Bridge, A; Chan, W M; Chavali, G; Coudert, E; Dimmer, E; Estreicher, A; Famiglietti, L; Feuermann, M; Gos, A; Gruaz-Gumowski, N; Hieta, R; Hinz, C; Hulo, C; Huntley, R; James, J; Jungo, F; Keller, G; Laiho, K; Legge, D; Lemercier, P; Lieberherr, D; Magrane, M; Martin, M J; Masson, P; Mutowo-Muellenet, P; O'Donovan, C; Pedruzzi, I; Pichler, K; Poggioli, D; Porras Millán, P; Poux, S; Rivoire, C; Roechert, B; Sawford, T; Schneider, M; Stutz, A; Sundaram, S; Tognolli, M; Xenarios, I; Foulgar, R; Lomax, J; Roncaglia, P; Khodiyar, V K; Lovering, R C; Talmud, P J; Chibucos, M; Giglio, M Gwinn; Chang, H -Y; Hunter, S; McAnulla, C; Mitchell, A; Sangrador, A; Stephan, R; Harris, M A; Oliver, S G; Rutherford, K; Wood, V; Bahler, J; Lock, A; Kersey, P J; McDowall, D M; Staines, D M; Dwinell, M; Shimoyama, M; Laulederkind, S; Hayman, T; Wang, S -J; Petri, V; Lowry, T; D'Eustachio, P; Matthews, L; Balakrishnan, R; Binkley, G; Cherry, J M; Costanzo, M C; Dwight, S S; Engel, S R; Fisk, D G; Hitz, B C; Hong, E L; Karra, K; Miyasato, S R; Nash, R S; Park, J; Skrzypek, M S; Weng, S; Wong, E D; Berardini, T Z; Huala, E; Mi, H; Thomas, P D; Chan, J; Kishore, R; Sternberg, P; Van Auken, K; Howe, D; Westerfield, M

    2013-01-01

    The Gene Ontology (GO) Consortium (GOC, http://www.geneontology.org) is a community-based bioinformatics resource that classifies gene product function through the use of structured, controlled vocabularies. Over the past year, the GOC has implemented several processes to increase the quantity, quality and specificity of GO annotations. First, the number of manual, literature-based annotations has grown at an increasing rate. Second, as a result of a new 'phylogenetic annotation' process, manually reviewed, homology-based annotations are becoming available for a broad range of species. Third, the quality of GO annotations has been improved through a streamlined process for, and automated quality checks of, GO annotations deposited by different annotation groups. Fourth, the consistency and correctness of the ontology itself has increased by using automated reasoning tools. Finally, the GO has been expanded not only to cover new areas of biology through focused interaction with experts, but also to capture greater specificity in all areas of the ontology using tools for adding new combinatorial terms. The GOC works closely with other ontology developers to support integrated use of terminologies. The GOC supports its user community through the use of e-mail lists, social media and web-based resources.

  6. Improving the genome annotation of the acarbose producer Actinoplanes sp. SE50/110 by sequencing enriched 5'-ends of primary transcripts.

    PubMed

    Schwientek, Patrick; Neshat, Armin; Kalinowski, Jörn; Klein, Andreas; Rückert, Christian; Schneiker-Bekel, Susanne; Wendler, Sergej; Stoye, Jens; Pühler, Alfred

    2014-11-20

    Actinoplanes sp. SE50/110 is the producer of the alpha-glucosidase inhibitor acarbose, which is an economically relevant and potent drug in the treatment of type-2 diabetes mellitus. In this study, we present the detection of transcription start sites on this genome by sequencing enriched 5'-ends of primary transcripts. Altogether, 1427 putative transcription start sites were initially identified. With help of the annotated genome sequence, 661 transcription start sites were found to belong to the leader region of protein-coding genes with the surprising result that roughly 20% of these genes rank among the class of leaderless transcripts. Next, conserved promoter motifs were identified for protein-coding genes with and without leader sequences. The mapped transcription start sites were finally used to improve the annotation of the Actinoplanes sp. SE50/110 genome sequence. Concerning protein-coding genes, 41 translation start sites were corrected and 9 novel protein-coding genes could be identified. In addition to this, 122 previously undetermined non-coding RNA (ncRNA) genes of Actinoplanes sp. SE50/110 were defined. Focusing on antisense transcription start sites located within coding genes or their leader sequences, it was discovered that 96 of those ncRNA genes belong to the class of antisense RNA (asRNA) genes. The remaining 26 ncRNA genes were found outside of known protein-coding genes. Four chosen examples of prominent ncRNA genes, namely the transfer messenger RNA gene ssrA, the ribonuclease P class A RNA gene rnpB, the cobalamin riboswitch RNA gene cobRS, and the selenocysteine-specific tRNA gene selC, are presented in more detail. This study demonstrates that sequencing of enriched 5'-ends of primary transcripts and the identification of transcription start sites are valuable tools for advanced genome annotation of Actinoplanes sp. SE50/110 and most probably also for other bacteria.

  7. ToxGen: an improved reference database for the identification of type B-trichothecene genotypes in Fusarium

    PubMed Central

    2017-01-01

    Type B trichothecenes, which pose a serious hazard to consumer health, occur worldwide in grains. These mycotoxins are produced mainly by three different trichothecene genotypes/chemotypes: 3ADON (3-acetyldeoxynivalenol), 15ADON (15-acetyldeoxynivalenol) and NIV (nivalenol), named after these three major mycotoxin compounds. Correct identification of these genotypes is elementary for all studies relating to population surveys, fungal ecology and mycotoxicology. Trichothecene producers exhibit enormous strain-dependent chemical diversity, which may result in variation in levels of the genotype’s determining toxin and in the production of low to high amounts of atypical compounds. New high-throughput DNA-sequencing technologies promise to boost the diagnostics of mycotoxin genotypes. However, this requires a reference database containing a satisfactory taxonomic sampling of sequences showing high correlation to actually produced chemotypes. We believe that one of the most pressing current challenges of such a database is the linking of molecular identification with chemical diversity of the strains, as well as other metadata. In this study, we use the Tri12 gene involved in mycotoxin biosynthesis for identification of Tri genotypes through sequence comparison. Tri12 sequences from a range of geographically diverse fungal strains comprising 22 Fusarium species were stored in the ToxGen database, which covers descriptive and up-to-date annotations such as indication on Tri genotype and chemotype of the strains, chemical diversity, information on trichothecene-inducing host, substrate or media, geographical locality, and most recent taxonomic affiliations. The present initiative bridges the gap between the demands of comprehensive studies on trichothecene producers and the existing nucleotide sequence databases, which lack toxicological and other auxiliary data. We invite researchers working in the fields of fungal taxonomy, epidemiology and mycotoxicology to join the

  8. Computational algorithms to predict Gene Ontology annotations

    PubMed Central

    2015-01-01

    Background: Gene function annotations, which are associations between a gene and a term of a controlled vocabulary describing gene functional features, are of paramount importance in modern biology. Datasets of these annotations, such as the ones provided by the Gene Ontology Consortium, are used to design novel biological experiments and interpret their results. Despite their importance, these sources of information have some known issues. They are incomplete, since biological knowledge is far from definitive and evolves rapidly, and some erroneous annotations may be present. Since the curation of novel annotations is costly in both money and time, computational tools that can reliably predict likely annotations, and thus speed up the discovery of new gene annotations, are very useful. Methods: We used a set of computational algorithms and weighting schemes to infer novel gene annotations from a set of known ones. We used the latent semantic analysis approach, implementing two popular algorithms (Latent Semantic Indexing and Probabilistic Latent Semantic Analysis), and propose a novel method, the Semantic IMproved Latent Semantic Analysis, which adds a clustering step on the set of considered genes. Furthermore, we propose the improvement of these algorithms by weighting the annotations in the input set. Results: We tested our methods and their weighted variants on the Gene Ontology annotation sets of the genes of three model organisms (Bos taurus, Danio rerio and Drosophila melanogaster). The methods proved able to predict novel gene annotations, and the weighting procedures led to a valuable improvement, although the results vary with the size of the input annotation set and the algorithm considered. Conclusions: Of the three considered methods, the Semantic IMproved Latent Semantic Analysis is the one that provides better results. In particular, when coupled with a proper
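
    The idea behind Latent Semantic Indexing for annotation prediction can be illustrated with a small sketch. It is not the authors' implementation: the function name and the toy gene-by-term matrix are hypothetical, and for brevity only a single singular component is kept (a rank-1 truncated SVD obtained by power iteration), whereas a practical run would keep several. Known annotations form a 0/1 matrix; the low-rank reconstruction assigns a non-zero score to a missing gene-term pair when similar genes carry that term.

```python
import math

def rank1_scores(matrix, iterations=100):
    """Return the rank-1 SVD reconstruction of a gene-by-term 0/1 matrix."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    v = [1.0 / math.sqrt(n_cols)] * n_cols  # initial guess for the right singular vector
    for _ in range(iterations):
        # One power-iteration step on A^T A: u = A v, then v = A^T u, normalised.
        u = [sum(matrix[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
        v = [sum(matrix[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    # A v equals sigma times the left singular vector, so (A v) v^T is the rank-1 approximation.
    u = [sum(matrix[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
    return [[u[i] * v[j] for j in range(n_cols)] for i in range(n_rows)]

# Hypothetical toy input: rows are genes, columns are GO terms, 1 = known annotation.
annotations = [
    [1, 1, 1],
    [1, 1, 1],
    [1, 1, 0],  # the third gene lacks the third term
]
scores = rank1_scores(annotations)
print(round(scores[2][2], 3))  # score for the missing gene-term pair
```

    Pairs that are 0 in the input but receive a score above a chosen threshold would be reported as candidate annotations; the threshold is a tuning choice that the abstract does not specify.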

  9. Support Vector Machines for Improved Peptide Identification from Tandem Mass Spectrometry Database Search

    SciTech Connect

    Webb-Robertson, Bobbie-Jo M.

    2009-05-06

    Accurate identification of peptides is a current challenge in mass spectrometry (MS) based proteomics. The standard approach uses a search routine to compare tandem mass spectra to a database of peptides associated with the target organism. These database search routines yield multiple metrics associated with the quality of the mapping of the experimental spectrum to the theoretical spectrum of a peptide. The structure of these results makes separating correct from false identifications difficult and has created a false identification problem. Statistical confidence scores are an approach to battle this false positive problem that has led to significant improvements in peptide identification. We have shown that machine learning, specifically support vector machine (SVM), is an effective approach to separating true peptide identifications from false ones. The SVM-based peptide statistical scoring method transforms a peptide into a vector representation based on database search metrics to train and validate the SVM. In practice, following the database search routine, a peptide is denoted in its vector representation and the SVM generates a single statistical score that is then used to classify presence or absence in the sample

  10. SMPDB 2.0: big improvements to the Small Molecule Pathway Database.

    PubMed

    Jewison, Timothy; Su, Yilu; Disfany, Fatemeh Miri; Liang, Yongjie; Knox, Craig; Maciejewski, Adam; Poelzer, Jenna; Huynh, Jessica; Zhou, You; Arndt, David; Djoumbou, Yannick; Liu, Yifeng; Deng, Lu; Guo, An Chi; Han, Beomsoo; Pon, Allison; Wilson, Michael; Rafatnia, Shahrzad; Liu, Philip; Wishart, David S

    2014-01-01

    The Small Molecule Pathway Database (SMPDB, http://www.smpdb.ca) is a comprehensive, colorful, fully searchable and highly interactive database for visualizing human metabolic, drug action, drug metabolism, physiological activity and metabolic disease pathways. SMPDB contains >600 pathways with nearly 75% of its pathways not found in any other database. All SMPDB pathway diagrams are extensively hyperlinked and include detailed information on the relevant tissues, organs, organelles, subcellular compartments, protein cofactors, protein locations, metabolite locations, chemical structures and protein quaternary structures. Since its last release in 2010, SMPDB has undergone substantial upgrades and significant expansion. In particular, the total number of pathways in SMPDB has grown by >70%. Additionally, every previously entered pathway has been completely redrawn, standardized, corrected, updated and enhanced with additional molecular or cellular information. Many SMPDB pathways now include transporter proteins as well as much more physiological, tissue, target organ and reaction compartment data. Thanks to the development of a standardized pathway drawing tool (called PathWhiz) all SMPDB pathways are now much more easily drawn and far more rapidly updated. PathWhiz has also allowed all SMPDB pathways to be saved in a BioPAX format. Significant improvements to SMPDB's visualization interface now make the browsing, selection, recoloring and zooming of pathways far easier and far more intuitive. Because of its utility and breadth of coverage, SMPDB is now integrated into several other databases including HMDB and DrugBank.

  11. An extended bioreaction database that significantly improves reconstruction and analysis of genome-scale metabolic networks.

    PubMed

    Stelzer, Michael; Sun, Jibin; Kamphans, Tom; Fekete, Sándor P; Zeng, An-Ping

    2011-11-01

    The bioreaction database established by Ma and Zeng (Bioinformatics, 2003, 19, 270-277) for in silico reconstruction of genome-scale metabolic networks has been widely used. Based on more recent information in the reference databases KEGG LIGAND and Brenda, we upgrade the bioreaction database in this work by almost doubling the number of reactions from 3565 to 6851. Over 70% of the reactions have been manually updated/revised in terms of reversibility, reactant pairs, currency metabolites and error correction. For the first time, 41 spontaneous sugar mutarotation reactions are introduced into the biochemical database. The upgrade significantly improves the reconstruction of genome-scale metabolic networks. Many gaps or missing biochemical links can be recovered, as exemplified with three model organisms: Homo sapiens, Aspergillus niger, and Escherichia coli. The topological parameters of the constructed networks were also largely affected; however, the overall network structure remains scale-free. Furthermore, we consider the problem of computing biologically feasible shortest paths in reconstructed metabolic networks. We show that these paths are hard to compute and present solutions to find such paths in networks of small and medium size.
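
    Why reactant pairs, reversibility and currency metabolites matter for path finding can be shown with a small sketch. It is an illustration only, not the feasibility algorithm of the paper (which the authors report as hard to compute): it runs a plain breadth-first search over a metabolite graph and simply drops any link that touches a currency metabolite. The function name and the toy network (the main carbon route of glycolysis plus two spurious ADP links) are assumptions made for the example.

```python
from collections import deque

def shortest_path(reactant_pairs, source, target, currency=frozenset()):
    """Breadth-first shortest path over (substrate, product, reversible) pairs."""
    graph = {}
    for substrate, product, reversible in reactant_pairs:
        if substrate in currency or product in currency:
            continue  # ignore links that run through currency metabolites
        graph.setdefault(substrate, []).append(product)
        if reversible:
            graph.setdefault(product, []).append(substrate)
    previous = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:  # walk back to the source
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in graph.get(node, []):
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None

# Toy network: glycolysis carbon flow plus two naive links through ADP.
pairs = [
    ("glucose", "G6P", False), ("G6P", "F6P", True), ("F6P", "FBP", False),
    ("FBP", "GAP", True), ("GAP", "1,3-BPG", True), ("1,3-BPG", "3PG", True),
    ("3PG", "2PG", True), ("2PG", "PEP", True), ("PEP", "pyruvate", False),
    ("glucose", "ADP", False),   # hexokinase also produces ADP
    ("ADP", "pyruvate", False),  # pyruvate kinase also consumes ADP
]
naive_path = shortest_path(pairs, "glucose", "pyruvate")
feasible_path = shortest_path(pairs, "glucose", "pyruvate", currency={"ATP", "ADP"})
print(naive_path)
print(feasible_path)
```

    Without the currency filter the search takes the two-step shortcut through ADP, which links the metabolites on paper but carries no carbon from glucose to pyruvate; with the filter it follows the full nine-step route.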

  12. Updating annotations with the distributed annotation system and the automated sequence annotation pipeline

    PubMed Central

    Speier, William; Ochs, Michael F.

    2012-01-01

    Summary: The integration between BioDAS ProServer and Automated Sequence Annotation Pipeline (ASAP) provides an interface for querying diverse annotation sources, chaining and linking results, and standardizing the output using the Distributed Annotation System (DAS) protocol. This interface allows pipeline plans in ASAP to be integrated into any system using HTTP and also allows the information returned by ASAP to be included in the DAS registry for use in any DAS-aware system. Three example implementations have been developed: the first accesses TRANSFAC information to automatically create gene sets for the Coordinated Gene Activity in Pattern Sets (CoGAPS) algorithm; the second integrates annotations from multiple array platforms and provides unified annotations in an R environment; and the third wraps the UniProt database for integration with the SPICE DAS client. Availability: Source code for ASAP 2.7 and the DAS 1.6 interface is available under the GNU public license. Proserver 2.20 is free software available from SourceForge. Scripts for installation and configuration on Linux are provided at our website: http://www.rits.onc.jhmi.edu/dbb/custom/A6/ Contact: Speier@mii.ucla.edu or mfo@jhu.edu Supplementary information: Supplementary data are available at Bioinformatics online. PMID:22945787

  13. The UCSC Genome Browser database: 2014 update

    PubMed Central

    Karolchik, Donna; Barber, Galt P.; Casper, Jonathan; Clawson, Hiram; Cline, Melissa S.; Diekhans, Mark; Dreszer, Timothy R.; Fujita, Pauline A.; Guruvadoo, Luvina; Haeussler, Maximilian; Harte, Rachel A.; Heitner, Steve; Hinrichs, Angie S.; Learned, Katrina; Lee, Brian T.; Li, Chin H.; Raney, Brian J.; Rhead, Brooke; Rosenbloom, Kate R.; Sloan, Cricket A.; Speir, Matthew L.; Zweig, Ann S.; Haussler, David; Kuhn, Robert M.; Kent, W. James

    2014-01-01

    The University of California Santa Cruz (UCSC) Genome Browser (http://genome.ucsc.edu) offers online public access to a growing database of genomic sequence and annotations for a large collection of organisms, primarily vertebrates, with an emphasis on the human and mouse genomes. The Browser’s web-based tools provide an integrated environment for visualizing, comparing, analysing and sharing both publicly available and user-generated genomic data sets. As of September 2013, the database contained genomic sequence and a basic set of annotation ‘tracks’ for ∼90 organisms. Significant new annotations include a 60-species multiple alignment conservation track on the mouse, updated UCSC Genes tracks for human and mouse, and several new sets of variation and ENCODE data. New software tools include a Variant Annotation Integrator that returns predicted functional effects of a set of variants uploaded as a custom track, an extension to UCSC Genes that displays haplotype alleles for protein-coding genes and an expansion of data hubs that includes the capability to display remotely hosted user-provided assembly sequence in addition to annotation data. To improve European access, we have added a Genome Browser mirror (http://genome-euro.ucsc.edu) hosted at Bielefeld University in Germany. PMID:24270787

  14. The UCSC Genome Browser database: 2014 update.

    PubMed

    Karolchik, Donna; Barber, Galt P; Casper, Jonathan; Clawson, Hiram; Cline, Melissa S; Diekhans, Mark; Dreszer, Timothy R; Fujita, Pauline A; Guruvadoo, Luvina; Haeussler, Maximilian; Harte, Rachel A; Heitner, Steve; Hinrichs, Angie S; Learned, Katrina; Lee, Brian T; Li, Chin H; Raney, Brian J; Rhead, Brooke; Rosenbloom, Kate R; Sloan, Cricket A; Speir, Matthew L; Zweig, Ann S; Haussler, David; Kuhn, Robert M; Kent, W James

    2014-01-01

    The University of California Santa Cruz (UCSC) Genome Browser (http://genome.ucsc.edu) offers online public access to a growing database of genomic sequence and annotations for a large collection of organisms, primarily vertebrates, with an emphasis on the human and mouse genomes. The Browser's web-based tools provide an integrated environment for visualizing, comparing, analysing and sharing both publicly available and user-generated genomic data sets. As of September 2013, the database contained genomic sequence and a basic set of annotation 'tracks' for ∼90 organisms. Significant new annotations include a 60-species multiple alignment conservation track on the mouse, updated UCSC Genes tracks for human and mouse, and several new sets of variation and ENCODE data. New software tools include a Variant Annotation Integrator that returns predicted functional effects of a set of variants uploaded as a custom track, an extension to UCSC Genes that displays haplotype alleles for protein-coding genes and an expansion of data hubs that includes the capability to display remotely hosted user-provided assembly sequence in addition to annotation data. To improve European access, we have added a Genome Browser mirror (http://genome-euro.ucsc.edu) hosted at Bielefeld University in Germany.

  15. Improvement of the Database on the 1.13-microns Band of Water Vapor

    NASA Technical Reports Server (NTRS)

    Giver, Lawrence P.; Schwenke, David W.; Chackerian, Charles, Jr.; Varanasi, Prasad; Freedman, Richard S.; Gore, Warren J. (Technical Monitor)

    2000-01-01

    Corrections have recently been reported (Giver et al.) on the short-wave (visible and near-infrared) line intensities of water vapor that were catalogued in the spectroscopic database known as HITRAN. These updates have been posted on www.hitran.com, and are being used to reanalyze the polar stratospheric absorption in the 0.94 microns band as observed in POAM. We are currently investigating additional improvement in the 1.13 microns band using data obtained by us with an absorption path length of 1.107 km and 4 torr of water vapor and the ab initio line list of Partridge and Schwenke (needs ref). We are proposing the following four types of improvement of the HITRAN database in this region: 1) HITRAN has nearly 200 lines in this region without proper assignments of rotational quantum levels. Nearly all of them can now be assigned. 2) We have measured positions of the observable H2O-17 and H2O-18 lines. These lines in HITRAN currently have approximate positions based upon rather aged computations. 3) Some additional lines are observed and assigned which should be included in the database. 4) Corrections are necessary for the lower state energies E" for the HITRAN lines of the 121-010 "hot" band.

  16. The SILVA ribosomal RNA gene database project: improved data processing and web-based tools.

    PubMed

    Quast, Christian; Pruesse, Elmar; Yilmaz, Pelin; Gerken, Jan; Schweer, Timmy; Yarza, Pablo; Peplies, Jörg; Glöckner, Frank Oliver

    2013-01-01

    SILVA (from Latin silva, forest, http://www.arb-silva.de) is a comprehensive web resource for up to date, quality-controlled databases of aligned ribosomal RNA (rRNA) gene sequences from the Bacteria, Archaea and Eukaryota domains and supplementary online services. The referred database release 111 (July 2012) contains 3 194 778 small subunit and 288 717 large subunit rRNA gene sequences. Since the initial description of the project, substantial new features have been introduced, including advanced quality control procedures, an improved rRNA gene aligner, online tools for probe and primer evaluation and optimized browsing, searching and downloading on the website. Furthermore, the extensively curated SILVA taxonomy and the new non-redundant SILVA datasets provide an ideal reference for high-throughput classification of data from next-generation sequencing approaches.

  17. The SILVA ribosomal RNA gene database project: improved data processing and web-based tools

    PubMed Central

    Quast, Christian; Pruesse, Elmar; Yilmaz, Pelin; Gerken, Jan; Schweer, Timmy; Yarza, Pablo; Peplies, Jörg; Glöckner, Frank Oliver

    2013-01-01

    SILVA (from Latin silva, forest, http://www.arb-silva.de) is a comprehensive web resource for up to date, quality-controlled databases of aligned ribosomal RNA (rRNA) gene sequences from the Bacteria, Archaea and Eukaryota domains and supplementary online services. The referred database release 111 (July 2012) contains 3 194 778 small subunit and 288 717 large subunit rRNA gene sequences. Since the initial description of the project, substantial new features have been introduced, including advanced quality control procedures, an improved rRNA gene aligner, online tools for probe and primer evaluation and optimized browsing, searching and downloading on the website. Furthermore, the extensively curated SILVA taxonomy and the new non-redundant SILVA datasets provide an ideal reference for high-throughput classification of data from next-generation sequencing approaches. PMID:23193283

  18. Computer systems for annotation of single molecule fragments

    DOEpatents

    Schwartz, David Charles; Severin, Jessica

    2016-07-19

    There are provided computer systems for visualizing and annotating single molecule images. Annotation systems in accordance with this disclosure allow a user to mark and annotate single molecules of interest and their restriction enzyme cut sites thereby determining the restriction fragments of single nucleic acid molecules. The markings and annotations may be automatically generated by the system in certain embodiments and they may be overlaid translucently onto the single molecule images. An image caching system may be implemented in the computer annotation systems to reduce image processing time. The annotation systems include one or more connectors connecting to one or more databases capable of storing single molecule data as well as other biomedical data. Such diverse array of data can be retrieved and used to validate the markings and annotations. The annotation systems may be implemented and deployed over a computer network. They may be ergonomically optimized to facilitate user interactions.

  19. Validating Annotations for Uncharacterized Proteins in Shewanella oneidensis

    PubMed Central

    Louie, Brenton; Tarczy-Hornoch, Peter; Higdon, Roger

    2008-01-01

    Proteins of unknown function are a barrier to our understanding of molecular biology. Assigning function to these “uncharacterized” proteins is imperative, but challenging. The usual approach is similarity searches using annotation databases, which are useful for predicting function. However, since the performance of these databases on uncharacterized proteins is basically unknown, the accuracy of their predictions is suspect, making annotation difficult. To address this challenge, we developed a benchmark annotation dataset of 30 proteins in Shewanella oneidensis. The proteins in the dataset were originally uncharacterized after the initial annotation of the S. oneidensis proteome in 2002. In the intervening 5 years, the accumulation of new experimental evidence has enabled specific functions to be predicted. We utilized this benchmark dataset to evaluate several commonly utilized annotation databases. According to our criteria, six annotation databases accurately predicted functions for at least 60% of proteins in our dataset. Two of these six even had a “conditional accuracy” of 90%. Conditional accuracy is another evaluation metric we developed which excludes results from databases where no function was predicted. Also, 27 of the 30 proteins' functions were correctly predicted by at least one database. This represents one of the first performance evaluations of annotation databases on uncharacterized proteins. Our evaluation indicates that these databases readily incorporate new information and are accurate in predicting functions for uncharacterized proteins, provided that experimental function evidence exists. PMID:18687039
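
    The difference between plain accuracy and the conditional accuracy described in this abstract can be made concrete with a short sketch. It rests on one reading of the abstract, namely that proteins for which a database returns no prediction are dropped from the denominator; the function name, protein identifiers and function labels are hypothetical.

```python
def accuracy_metrics(predictions, truth):
    """Return (accuracy, conditional_accuracy) for one annotation database.

    predictions maps protein -> predicted function, or None when the
    database made no prediction; truth maps protein -> benchmark function.
    """
    total = len(truth)
    correct = sum(1 for p in truth if predictions.get(p) == truth[p])
    answered = sum(1 for p in truth if predictions.get(p) is not None)
    accuracy = correct / total
    # Conditional accuracy: only proteins with a prediction count in the denominator.
    conditional = correct / answered if answered else float("nan")
    return accuracy, conditional

# Hypothetical benchmark of five proteins and one database's output.
truth = {"p1": "kinase", "p2": "transporter", "p3": "protease",
         "p4": "reductase", "p5": "chaperone"}
predictions = {"p1": "kinase", "p2": "transporter", "p3": "protease",
               "p4": "hydrolase", "p5": None}
accuracy, conditional = accuracy_metrics(predictions, truth)
print(accuracy, conditional)
```

    In this toy case three of five proteins are predicted correctly (accuracy 0.6), while three of the four proteins that received any prediction are correct (conditional accuracy 0.75).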

  20. Gene Ontology Annotations and Resources

    PubMed Central

    2013-01-01

    The Gene Ontology (GO) Consortium (GOC, http://www.geneontology.org) is a community-based bioinformatics resource that classifies gene product function through the use of structured, controlled vocabularies. Over the past year, the GOC has implemented several processes to increase the quantity, quality and specificity of GO annotations. First, the number of manual, literature-based annotations has grown at an increasing rate. Second, as a result of a new ‘phylogenetic annotation’ process, manually reviewed, homology-based annotations are becoming available for a broad range of species. Third, the quality of GO annotations has been improved through a streamlined process for, and automated quality checks of, GO annotations deposited by different annotation groups. Fourth, the consistency and correctness of the ontology itself has increased by using automated reasoning tools. Finally, the GO has been expanded not only to cover new areas of biology through focused interaction with experts, but also to capture greater specificity in all areas of the ontology using tools for adding new combinatorial terms. The GOC works closely with other ontology developers to support integrated use of terminologies. The GOC supports its user community through the use of e-mail lists, social media and web-based resources. PMID:23161678

  1. Dictionary-driven protein annotation.

    PubMed

    Rigoutsos, Isidore; Huynh, Tien; Floratos, Aris; Parida, Laxmi; Platt, Daniel

    2002-09-01

    Computational methods seeking to automatically determine the properties (functional, structural, physicochemical, etc.) of a protein directly from the sequence have long been the focus of numerous research groups. With the advent of advanced sequencing methods and systems, the number of amino acid sequences that are being deposited in the public databases has been increasing steadily. This has in turn generated a renewed demand for automated approaches that can annotate individual sequences and complete genomes quickly, exhaustively and objectively. In this paper, we present one such approach that is centered around and exploits the Bio-Dictionary, a collection of amino acid patterns that completely covers the natural sequence space and can capture functional and structural signals that have been reused during evolution, within and across protein families. Our annotation approach also makes use of a weighted, position-specific scoring scheme that is unaffected by the over-representation of well-conserved proteins and protein fragments in the databases used. For a given query sequence, the method permits one to determine, in a single pass, the following: local and global similarities between the query and any protein already present in a public database; the likeness of the query to all available archaeal/bacterial/eukaryotic/viral sequences in the database as a function of amino acid position within the query; the character of secondary structure of the query as a function of amino acid position within the query; the cytoplasmic, transmembrane or extracellular behavior of the query; the nature and position of binding domains, active sites, post-translationally modified sites, signal peptides, etc. In terms of performance, the proposed method is exhaustive, objective and allows for the rapid annotation of individual sequences and full genomes. Annotation examples are presented and discussed in Results, including individual queries and complete genomes that were

  2. Large-scale inference of gene function through phylogenetic annotation of Gene Ontology terms: case study of the apoptosis and autophagy cellular processes

    PubMed Central

    Feuermann, Marc; Gaudet, Pascale; Mi, Huaiyu; Lewis, Suzanna E.; Thomas, Paul D.

    2016-01-01

    We previously reported a paradigm for large-scale phylogenomic analysis of gene families that takes advantage of the large corpus of experimentally supported Gene Ontology (GO) annotations. This ‘GO Phylogenetic Annotation’ approach integrates GO annotations from evolutionarily related genes across ∼100 different organisms in the context of a gene family tree, in which curators build an explicit model of the evolution of gene functions. GO Phylogenetic Annotation models the gain and loss of functions in a gene family tree, which is used to infer the functions of uncharacterized (or incompletely characterized) gene products, even for human proteins that are relatively well studied. Here, we report our results from applying this paradigm to two well-characterized cellular processes, apoptosis and autophagy. This revealed several important observations with respect to GO annotations and how they can be used for function inference. Notably, we applied only a small fraction of the experimentally supported GO annotations to infer function in other family members. The majority of other annotations describe indirect effects, phenotypes or results from high throughput experiments. In addition, we show here how feedback from phylogenetic annotation leads to significant improvements in the PANTHER trees, the GO annotations and GO itself. Thus GO phylogenetic annotation both increases the quantity and improves the accuracy of the GO annotations provided to the research community. We expect these phylogenetically based annotations to be of broad use in gene enrichment analysis as well as other applications of GO annotations. Database URL: http://amigo.geneontology.org/amigo PMID:28025345
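
    The gain-and-loss model described in this abstract can be illustrated with a minimal sketch. It assumes a family tree stored as a parent-to-children dictionary, with curator-recorded gains and losses attached to nodes; the function name, node names and function labels are all hypothetical and are not taken from PANTHER or the GO tooling.

```python
def propagate_functions(tree, gains, losses, root="root"):
    """Return the inferred function set for every leaf of a family tree.

    tree   : dict mapping a node to its list of child nodes (leaves are absent)
    gains  : dict mapping a node to functions gained on the branch leading to it
    losses : dict mapping a node to functions lost on the branch leading to it
    All three structures are illustrative assumptions, not a real file format.
    """
    leaf_functions = {}
    stack = [(root, frozenset())]
    while stack:
        node, inherited = stack.pop()
        # A node keeps what its ancestor had, plus gains, minus losses.
        current = (inherited | gains.get(node, set())) - losses.get(node, set())
        children = tree.get(node, [])
        if not children:
            leaf_functions[node] = set(current)
        for child in children:
            stack.append((child, current))
    return leaf_functions


# Hypothetical family: f1 arises at the root, f2 in one subfamily,
# and f1 is lost again on the branch leading to leaf_b.
example_tree = {"root": ["n1", "leaf_c"], "n1": ["leaf_a", "leaf_b"]}
example_gains = {"root": {"f1"}, "n1": {"f2"}}
example_losses = {"leaf_b": {"f1"}}
inferred = propagate_functions(example_tree, example_gains, example_losses)
```

    In this toy case an uncharacterized leaf inherits a function only if no loss event lies between it and the ancestor where the function was gained.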

  3. CyanoBase: the cyanobacteria genome database update 2010.

    PubMed

    Nakao, Mitsuteru; Okamoto, Shinobu; Kohara, Mitsuyo; Fujishiro, Tsunakazu; Fujisawa, Takatomo; Sato, Shusei; Tabata, Satoshi; Kaneko, Takakazu; Nakamura, Yasukazu

    2010-01-01

    CyanoBase (http://genome.kazusa.or.jp/cyanobase) is the genome database for cyanobacteria, which are model organisms for photosynthesis. The database houses cyanobacteria species information, complete genome sequences, genome-scale experiment data, gene information, gene annotations and mutant information. In this version, we updated these datasets and improved the navigation and the visual display of the data views. In addition, a web service API now enables users to retrieve the data in various formats with other tools, seamlessly.

  4. Predicting Novel Human Gene Ontology Annotations Using Semantic Analysis

    PubMed Central

    Done, Bogdan; Khatri, Purvesh; Done, Arina; Draghici, Sorin

    2013-01-01

    The correct interpretation of many molecular biology experiments depends in an essential way on the accuracy and consistency of the existing annotation databases. Such databases are meant to act as repositories for our biological knowledge as we acquire and refine it. Hence, by definition, they are incomplete at any given time. In this paper, we describe a technique that improves our previous method for predicting novel GO annotations by extracting implicit semantic relationships between genes and functions. In this work, we use a vector space model and a number of weighting schemes in addition to our previous latent semantic indexing approach. The technique described here is able to take into consideration the hierarchical structure of the Gene Ontology (GO) and can assign different weights to GO terms situated at different depths. The prediction abilities of 15 different weighting schemes are compared and evaluated. Nine such schemes were previously used in other problem domains, while six of them are introduced in this paper. The best weighting scheme was a novel scheme, n2tn. Out of the top 50 functional annotations predicted using this weighting scheme, we found support in the literature for 84 percent of them, while 6 percent of the predictions were contradicted by the existing literature. For the remaining 10 percent, we did not find any relevant publications to confirm or contradict the predictions. The n2tn weighting scheme also outperformed the simple binary scheme used in our previous approach. PMID:20150671
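
    A vector space model over gene annotations can be sketched as follows. The abstract does not define the n2tn scheme, so this sketch swaps in a plain inverse-document-frequency weighting with cosine similarity; the function names, gene names and term labels are hypothetical.

```python
import math
from collections import defaultdict


def idf_weights(annotations):
    """Weight each term by how rare it is across genes (smoothed IDF)."""
    n_genes = len(annotations)
    doc_freq = defaultdict(int)
    for terms in annotations.values():
        for term in terms:
            doc_freq[term] += 1
    return {t: math.log(n_genes / count) + 1.0 for t, count in doc_freq.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def predict_terms(gene, annotations):
    """Rank terms the gene lacks by similarity-weighted support from other genes."""
    weights = idf_weights(annotations)
    vectors = {g: {t: weights[t] for t in terms} for g, terms in annotations.items()}
    scores = defaultdict(float)
    for other, vec in vectors.items():
        if other == gene:
            continue
        sim = cosine(vectors[gene], vec)
        if sim == 0.0:
            continue  # genes sharing no terms contribute no evidence
        for term in annotations[other] - annotations[gene]:
            scores[term] += sim * weights[term]
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical annotation sets; real input would use GO identifiers.
example = {
    "gene_a": {"term_1", "term_2"},
    "gene_b": {"term_1", "term_2", "term_3"},
    "gene_c": {"term_4"},
}
ranked = predict_terms("gene_a", example)
```

    Here `gene_a` shares two terms with `gene_b`, so `term_3` is the only candidate suggested; a depth-aware scheme like the one in the paper would additionally adjust weights by each term's position in the GO hierarchy.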

  5. Database Constraints Applied to Metabolic Pathway Reconstruction Tools

    PubMed Central

    Vilaplana, Jordi; Solsona, Francesc; Teixido, Ivan; Usié, Anabel; Karathia, Hiren; Alves, Rui; Mateo, Jordi

    2014-01-01

    Our group developed two biological applications, Biblio-MetReS and Homol-MetReS, accessing the same database of organisms with annotated genes. Biblio-MetReS is a data-mining application that facilitates the reconstruction of molecular networks based on automated text-mining analysis of published scientific literature. Homol-MetReS allows functional (re)annotation of proteomes, to properly identify both the individual proteins involved in the process(es) of interest and their function. It also enables the sets of proteins involved in the process(es) in different organisms to be compared directly. The efficiency of these biological applications is directly related to the design of the shared database. We classified and analyzed the different kinds of access to the database. Based on this study, we tried to adjust and tune the configurable parameters of the database server to reach the best performance of the communication data link to/from the database system. Different database technologies were analyzed. We started the study with a public relational SQL database, MySQL. Then, the same database was implemented by a MapReduce-based database named HBase. The results indicated that the standard configuration of MySQL gives an acceptable performance for low or medium size databases. Nevertheless, tuning database parameters can greatly improve the performance and lead to very competitive runtimes. PMID:25202745

  6. Improving Quality and Quantity of Contributions: Two Models for Promoting Knowledge Exchange with Shared Databases

    ERIC Educational Resources Information Center

    Cress, U.; Barquero, B.; Schwan, S.; Hesse, F. W.

    2007-01-01

    Shared databases are used for knowledge exchange in groups. Whether a person is willing to contribute knowledge to a shared database presents a social dilemma: Each group member saves time and energy by not contributing any information to the database and by using the database only to retrieve information which was contributed by others. But if…

  7. Harnessing Collaborative Annotations on Online Formative Assessments

    ERIC Educational Resources Information Center

    Lin, Jian-Wei; Lai, Yuan-Cheng

    2013-01-01

    This paper harnesses collaborative annotations by students as learning feedback on online formative assessments to improve the learning achievements of students. Through the developed Web platform, students can conduct formative assessments, collaboratively annotate, and review historical records in a convenient way, while teachers can generate…

  8. Targeted journal curation as a method to improve data currency at the Comparative Toxicogenomics Database.

    PubMed

    Davis, Allan Peter; Johnson, Robin J; Lennon-Hopkins, Kelley; Sciaky, Daniela; Rosenstein, Michael C; Wiegers, Thomas C; Mattingly, Carolyn J

    2012-01-01

    The Comparative Toxicogenomics Database (CTD) is a public resource that promotes understanding about the effects of environmental chemicals on human health. CTD biocurators read the scientific literature and manually curate a triad of chemical-gene, chemical-disease and gene-disease interactions. Typically, articles for CTD are selected using a chemical-centric approach by querying PubMed to retrieve a corpus containing the chemical of interest. Although this technique ensures adequate coverage of knowledge about the chemical (i.e. data completeness), it does not necessarily reflect the most current state of all toxicological research in the community at large (i.e. data currency). Keeping databases current with the most recent scientific results, as well as providing a rich historical background from legacy articles, is a challenging process. To address this issue of data currency, CTD designed and tested a journal-centric approach of curation to complement our chemical-centric method. We first identified priority journals based on defined criteria. Next, over 7 weeks, three biocurators reviewed 2425 articles from three consecutive years (2009-2011) of three targeted journals. From this corpus, 1252 articles contained relevant data for CTD and 52 752 interactions were manually curated. Here, we describe our journal selection process, two methods of document delivery for the biocurators and the analysis of the resulting curation metrics, including data currency, and both intra-journal and inter-journal comparisons of research topics. Based on our results, we expect that curation by select journals can (i) be easily incorporated into the curation pipeline to complement our chemical-centric approach; (ii) build content more evenly for chemicals, genes and diseases in CTD (rather than biasing data by chemicals-of-interest); (iii) reflect developing areas in environmental health and (iv) improve overall data currency for chemicals, genes and diseases. Database URL

  9. Targeted journal curation as a method to improve data currency at the Comparative Toxicogenomics Database

    PubMed Central

    Davis, Allan Peter; Johnson, Robin J.; Lennon-Hopkins, Kelley; Sciaky, Daniela; Rosenstein, Michael C.; Wiegers, Thomas C.; Mattingly, Carolyn J.

    2012-01-01

    The Comparative Toxicogenomics Database (CTD) is a public resource that promotes understanding about the effects of environmental chemicals on human health. CTD biocurators read the scientific literature and manually curate a triad of chemical–gene, chemical–disease and gene–disease interactions. Typically, articles for CTD are selected using a chemical-centric approach by querying PubMed to retrieve a corpus containing the chemical of interest. Although this technique ensures adequate coverage of knowledge about the chemical (i.e. data completeness), it does not necessarily reflect the most current state of all toxicological research in the community at large (i.e. data currency). Keeping databases current with the most recent scientific results, as well as providing a rich historical background from legacy articles, is a challenging process. To address this issue of data currency, CTD designed and tested a journal-centric approach of curation to complement our chemical-centric method. We first identified priority journals based on defined criteria. Next, over 7 weeks, three biocurators reviewed 2425 articles from three consecutive years (2009–2011) of three targeted journals. From this corpus, 1252 articles contained relevant data for CTD and 52 752 interactions were manually curated. Here, we describe our journal selection process, two methods of document delivery for the biocurators and the analysis of the resulting curation metrics, including data currency, and both intra-journal and inter-journal comparisons of research topics. Based on our results, we expect that curation by select journals can (i) be easily incorporated into the curation pipeline to complement our chemical-centric approach; (ii) build content more evenly for chemicals, genes and diseases in CTD (rather than biasing data by chemicals-of-interest); (iii) reflect developing areas in environmental health and (iv) improve overall data currency for chemicals, genes and diseases. Database

  10. Improving machine operation management efficiency via improving the vehicle park structure and using the production operation information database

    NASA Astrophysics Data System (ADS)

    Koptev, V. Yu

    2017-02-01

    This work presents the results of studying the basic interconnected criteria of individual equipment units in the machine fleet of a transport network, as they depend on production and mining factors, with the aim of improving transport systems management. Justifying the selection of a control system requires new methodologies and models, augmented with stability and transport flow criteria, that account for the dynamics of mining work development on mining sites. A necessary condition is accounting for the technical and operating parameters related to vehicle operation. Modern open-pit mining dispatching systems must include this kind of information database. An algorithm for forming a machine fleet is presented, based on solving a multi-variant problem of defining reasonable operating features for a machine working as part of a complex. The proposals made in this work may apply to mining machines (drilling equipment, excavators), construction equipment (bulldozers, cranes, pile-drivers), city transport and other types of production activity that use a machine fleet.

  11. Integrative Tissue-Specific Functional Annotations in the Human Genome Provide Novel Insights on Many Complex Traits and Improve Signal Prioritization in Genome Wide Association Studies.

    PubMed

    Lu, Qiongshi; Powles, Ryan Lee; Wang, Qian; He, Beixin Julie; Zhao, Hongyu

    2016-04-01

    Extensive efforts have been made to understand genomic function through both experimental and computational approaches, yet proper annotation still remains challenging, especially in non-coding regions. In this manuscript, we introduce GenoSkyline, an unsupervised learning framework to predict tissue-specific functional regions through integrating high-throughput epigenetic annotations. GenoSkyline successfully identified a variety of non-coding regulatory machinery including enhancers, regulatory miRNA, and hypomethylated transposable elements in extensive case studies. Integrative analysis of GenoSkyline annotations and results from genome-wide association studies (GWAS) led to novel biological insights on the etiologies of a number of human complex traits. We also explored using tissue-specific functional annotations to prioritize GWAS signals and predict relevant tissue types for each risk locus. Brain and blood-specific annotations led to better prioritization performance for schizophrenia than standard GWAS p-values and non-tissue-specific annotations. As for coronary artery disease, heart-specific functional regions were highly enriched for GWAS signals, but previously identified risk loci were found to be most functional in other tissues, suggesting a substantial proportion of still undetected heart-related loci. In summary, GenoSkyline annotations can guide genetic studies at multiple resolutions and provide valuable insights in understanding complex diseases. GenoSkyline is available at http://genocanyon.med.yale.edu/GenoSkyline.

  12. Integrative Tissue-Specific Functional Annotations in the Human Genome Provide Novel Insights on Many Complex Traits and Improve Signal Prioritization in Genome Wide Association Studies

    PubMed Central

    Wang, Qian; He, Beixin Julie; Zhao, Hongyu

    2016-01-01

    Extensive efforts have been made to understand genomic function through both experimental and computational approaches, yet proper annotation still remains challenging, especially in non-coding regions. In this manuscript, we introduce GenoSkyline, an unsupervised learning framework to predict tissue-specific functional regions through integrating high-throughput epigenetic annotations. GenoSkyline successfully identified a variety of non-coding regulatory machinery including enhancers, regulatory miRNA, and hypomethylated transposable elements in extensive case studies. Integrative analysis of GenoSkyline annotations and results from genome-wide association studies (GWAS) led to novel biological insights on the etiologies of a number of human complex traits. We also explored using tissue-specific functional annotations to prioritize GWAS signals and predict relevant tissue types for each risk locus. Brain and blood-specific annotations led to better prioritization performance for schizophrenia than standard GWAS p-values and non-tissue-specific annotations. As for coronary artery disease, heart-specific functional regions were highly enriched for GWAS signals, but previously identified risk loci were found to be most functional in other tissues, suggesting a substantial proportion of still undetected heart-related loci. In summary, GenoSkyline annotations can guide genetic studies at multiple resolutions and provide valuable insights in understanding complex diseases. GenoSkyline is available at http://genocanyon.med.yale.edu/GenoSkyline. PMID:27058395

  13. Evaluating the effects of machine pre-annotation and an interactive annotation interface on manual de-identification of clinical text.

    PubMed

    South, Brett R; Mowery, Danielle; Suo, Ying; Leng, Jianwei; Ferrández, Óscar; Meystre, Stephane M; Chapman, Wendy W

    2014-08-01

    The Health Insurance Portability and Accountability Act (HIPAA) Safe Harbor method requires removal of 18 types of protected health information (PHI) from clinical documents for them to be considered "de-identified" prior to use for research purposes. Human review of PHI elements from a large corpus of clinical documents can be tedious and error-prone. Indeed, multiple annotators may be required to consistently redact information that represents each PHI class. Automated de-identification has the potential to improve annotation quality and reduce annotation time. For instance, machine-assisted annotation combines de-identification system outputs, used as pre-annotations, with an interactive annotation interface to provide annotators with PHI annotations for "curation" rather than manual annotation from "scratch" on raw clinical documents. In order to assess whether machine-assisted annotation improves the reliability and accuracy of the reference standard and reduces annotation effort, we conducted an annotation experiment. In this annotation study, we assessed the generalizability of the VA Consortium for Healthcare Informatics Research (CHIR) annotation schema and guidelines applied to a corpus of publicly available clinical documents called MTSamples. Specifically, our goals were to (1) characterize a heterogeneous corpus of clinical documents manually annotated for risk-ranked PHI and other annotation types (clinical eponyms and person relations), (2) evaluate how well annotators apply the CHIR schema to the heterogeneous corpus, (3) compare whether machine-assisted annotation (experiment) improves annotation quality and reduces annotation time compared to manual annotation (control), and (4) assess the change in quality of reference standard coverage with each added annotator's annotations.

  14. Orienteering: An Annotated Bibliography = Orientierungslauf: Eine kommentierte Bibliographie.

    ERIC Educational Resources Information Center

    Seiler, Roland, Ed.; Hartmann, Wolfgang, Ed.

    1994-01-01

    Annotated bibliography of 220 books, monographs, and journal articles on orienteering published 1984-94, from SPOLIT database of the Federal Institute of Sport Science (Cologne, Germany). Annotations in English or German. Ten sections including psychological, physiological, health, sociological, and environmental aspects; training and coaching;…

  15. Annotation of Fusarium graminearum (PH-1) Version 5.0

    PubMed Central

    Hammond-Kosack, Kim E.

    2017-01-01

    ABSTRACT Fusarium graminearum floral infections are a major risk to the global supply of safe cereal grains. We report updates to the PH-1 reference genome and significant improvements to the annotation. Changes include introduction of legacy annotation identifiers, new gene models, secretome and effectorP predictions, and inclusion of extensive untranslated region (UTR) annotations. PMID:28082505

  16. AnnotCompute: annotation-based exploration and meta-analysis of genomics experiments

    PubMed Central

    Zheng, Jie; Stoyanovich, Julia; Manduchi, Elisabetta; Liu, Junmin; Stoeckert, Christian J.

    2011-01-01

    The ever-increasing scale of biological data sets, particularly those arising in the context of high-throughput technologies, requires the development of rich data exploration tools. In this article, we present AnnotCompute, an information discovery platform for repositories of functional genomics experiments such as ArrayExpress. Our system leverages semantic annotations of functional genomics experiments with controlled vocabulary and ontology terms, such as those from the MGED Ontology, to compute conceptual dissimilarities between pairs of experiments. These dissimilarities are then used to support two types of exploratory analysis—clustering and query-by-example. We show that our proposed dissimilarity measures correspond to a user's intuition about conceptual dissimilarity, and can be used to support effective query-by-example. We also evaluate the quality of clustering based on these measures. While AnnotCompute can support a richer data exploration experience, its effectiveness is limited in some cases, due to the quality of available annotations. Nonetheless, tools such as AnnotCompute may provide an incentive for richer annotations of experiments. Code is available for download at http://www.cbil.upenn.edu/downloads/AnnotCompute. Database URL: http://www.cbil.upenn.edu/annotCompute/ PMID:22190598
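
    The idea of computing dissimilarities between annotated experiments and using them for query-by-example can be sketched as below. The abstract does not state which dissimilarity measures AnnotCompute uses, so this sketch assumes a simple Jaccard dissimilarity over sets of annotation terms; the experiment identifiers and terms are hypothetical.

```python
def jaccard_dissimilarity(a, b):
    """1 - |A ∩ B| / |A ∪ B|; two empty annotation sets count as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def query_by_example(query_id, experiments, top_k=3):
    """Return up to top_k experiments closest to the query, nearest first."""
    query = experiments[query_id]
    ranked = sorted(
        (jaccard_dissimilarity(query, terms), exp_id)
        for exp_id, terms in experiments.items()
        if exp_id != query_id
    )
    return [(exp_id, dist) for dist, exp_id in ranked[:top_k]]


# Hypothetical experiments annotated with controlled-vocabulary terms.
experiments = {
    "E1": {"mouse", "liver", "time_course"},
    "E2": {"mouse", "liver", "dose_response"},
    "E3": {"human", "blood"},
}
neighbours = query_by_example("E1", experiments)
```

    The same pairwise dissimilarities could feed a clustering routine; as the abstract notes, results are only as good as the annotations, since sparsely annotated experiments will look dissimilar to everything.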

  17. The Basic Course: A Selected, Annotated Bibliography.

    ERIC Educational Resources Information Center

    Demo, Penny

    Defining basic speech communication courses as those public speaking, interpersonal, or communication courses that treat fundamental communication concepts, this annotated bibliography reflects the current thought of speech educators on the basic course. The bibliography consists of 27 citations, all of which are drawn from the ERIC database. (SKC)

  18. Scientific and Technical Document Database

    National Institute of Standards and Technology Data Gateway

    NIST Scientific and Technical Document Database (PC database for purchase). The images in NIST Special Database 20 contain a very rich set of graphic elements from scientific and technical documents, such as graphs, tables, equations, two column text, maps, pictures, footnotes, annotations, and arrays of such elements.

  19. Transparency, usability, and reproducibility: Guiding principles for improving comparative databases using primates as examples.

    PubMed

    Borries, Carola; Sandel, Aaron A; Koenig, Andreas; Fernandez-Duque, Eduardo; Kamilar, Jason M; Amoroso, Caroline R; Barton, Robert A; Bray, Joel; Di Fiore, Anthony; Gilby, Ian C; Gordon, Adam D; Mundry, Roger; Port, Markus; Powell, Lauren E; Pusey, Anne E; Spriggs, Amanda; Nunn, Charles L

    2016-09-01

    Recent decades have seen rapid development of new analytical methods to investigate patterns of interspecific variation. Yet these cutting-edge statistical analyses often rely on data of questionable origin, varying accuracy, and weak comparability, which seem to have reduced the reproducibility of studies. It is time to improve the transparency of comparative data while also making these improved data more widely available. We, the authors, met to discuss how transparency, usability, and reproducibility of comparative data can best be achieved. We propose four guiding principles: 1) data identification with explicit operational definitions and complete descriptions of methods; 2) inclusion of metadata that capture key characteristics of the data, such as sample size, geographic coordinates, and nutrient availability (for example, captive versus wild animals); 3) documentation of the original reference for each datum; and 4) facilitation of effective interactions with the data via user friendly and transparent interfaces. We urge reviewers, editors, publishers, database developers and users, funding agencies, researchers publishing their primary data, and those performing comparative analyses to embrace these standards to increase the transparency, usability, and reproducibility of comparative studies.

  20. MyPro: A seamless pipeline for automated prokaryotic genome assembly and annotation

    PubMed Central

    Liao, Yu-Chieh; Lin, Hsin-Hung; Sabharwal, Amarpreet; Haase, Elaine M.; Scannapieco, Frank A.

    2016-01-01

    MyPro is a software pipeline for high-quality prokaryotic genome assembly and annotation. It was validated on 18 oral streptococcal strains to produce submission-ready, annotated draft genomes. MyPro installed as a virtual machine and supported by updated databases will enable biologists to perform quality prokaryotic genome assembly and annotation with ease. PMID:25911337

  1. MyPro: A seamless pipeline for automated prokaryotic genome assembly and annotation.

    PubMed

    Liao, Yu-Chieh; Lin, Hsin-Hung; Sabharwal, Amarpreet; Haase, Elaine M; Scannapieco, Frank A

    2015-06-01

    MyPro is a software pipeline for high-quality prokaryotic genome assembly and annotation. It was validated on 18 oral streptococcal strains to produce submission-ready, annotated draft genomes. MyPro installed as a virtual machine and supported by updated databases will enable biologists to perform quality prokaryotic genome assembly and annotation with ease.

  2. Computing human image annotation.

    PubMed

    Channin, David S; Mongkolwat, Pattanasak; Kleper, Vladimir; Rubin, Daniel L

    2009-01-01

    An image annotation is the explanatory or descriptive information about the pixel data of an image that is generated by a human (or machine) observer. An image markup is the graphical symbols placed over the image to depict an annotation. In the majority of current, clinical and research imaging practice, markup is captured in proprietary formats and annotations are referenced only in free text radiology reports. This makes these annotations difficult to query, retrieve and compute upon, hampering their integration into other data mining and analysis efforts. This paper describes the National Cancer Institute's Cancer Biomedical Informatics Grid's (caBIG) Annotation and Image Markup (AIM) project, focusing on how to use AIM to query for annotations. The AIM project delivers an information model for image annotation and markup. The model uses controlled terminologies for important concepts. All of the classes and attributes of the model have been harmonized with the other models and common data elements in use at the National Cancer Institute. The project also delivers XML schemata necessary to instantiate AIMs in XML as well as a software application for translating AIM XML into DICOM S/R and HL7 CDA. Large collections of AIM annotations can be built and then queried as Grid or Web services. Using the tools of the AIM project, image annotations and their markup can be captured and stored in human and machine readable formats. This enables the inclusion of human image observation and inference as part of larger data mining and analysis activities.

  3. Semi-automatic semantic annotation of PubMed queries: a study on quality, efficiency, satisfaction.

    PubMed

    Névéol, Aurélie; Islamaj Doğan, Rezarta; Lu, Zhiyong

    2011-04-01

    Information processing algorithms require significant amounts of annotated data for training and testing. The availability of such data is often hindered by the complexity and high cost of production. In this paper, we investigate the benefits of a state-of-the-art tool to help with the semantic annotation of a large set of biomedical queries. Seven annotators were recruited to annotate a set of 10,000 PubMed® queries with 16 biomedical and bibliographic categories. About half of the queries were annotated from scratch, while the other half were automatically pre-annotated and manually corrected. The impact of the automatic pre-annotations was assessed on several aspects of the task: time, number of actions, annotator satisfaction, inter-annotator agreement, quality and number of the resulting annotations. The analysis of annotation results showed that the number of required hand annotations is 28.9% less when using pre-annotated results from automatic tools. As a result, the overall annotation time was substantially lower when pre-annotations were used, while inter-annotator agreement was significantly higher. In addition, there was no statistically significant difference in the semantic distribution or number of annotations produced when pre-annotations were used. The annotated query corpus is freely available to the research community. This study shows that automatic pre-annotations are found helpful by most annotators. Our experience suggests using an automatic tool to assist large-scale manual annotation projects. This helps speed-up the annotation time and improve annotation consistency while maintaining high quality of the final annotations.
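
    Inter-annotator agreement of the kind reported in this abstract is often summarized with a chance-corrected statistic. The abstract does not say which measure the study used, so the sketch below assumes Cohen's kappa for two annotators assigning one category per query; the labels are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each annotator labelled at random
    # according to their own category frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical category
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels from two annotators for four queries.
annotator_1 = ["gene", "gene", "disease", "disease"]
annotator_2 = ["gene", "gene", "disease", "gene"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

    With seven annotators, as in the study, a pairwise average of this statistic or a multi-rater measure would be needed; this sketch covers only the two-annotator case.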

  4. Evaluating techniques for metagenome annotation using simulated sequence data

    PubMed Central

    Randle-Boggis, Richard J.; Helgason, Thorunn; Sapp, Melanie; Ashton, Peter D.

    2016-01-01

    The advent of next-generation sequencing has allowed huge amounts of DNA sequence data to be produced, advancing the capabilities of microbial ecosystem studies. The current challenge is to identify from which microorganisms and genes the DNA originated. Several tools and databases are available for annotating DNA sequences. The tools, databases and parameters used can have a significant impact on the results: naïve choice of these factors can result in a false representation of community composition and function. We use a simulated metagenome to show how different parameters affect annotation accuracy by evaluating the sequence annotation performances of MEGAN, MG-RAST, One Codex and Megablast. This simulated metagenome allowed the recovery of known organism and function abundances to be quantitatively evaluated, which is not possible for environmental metagenomes. The performance of each program and database varied, e.g. One Codex correctly annotated many sequences at the genus level, whereas MG-RAST RefSeq produced many false positive annotations. This effect decreased as the taxonomic level investigated increased. Selecting more stringent parameters decreases the annotation sensitivity, but increases precision. Ultimately, there is a trade-off between taxonomic resolution and annotation accuracy. These results should be considered when annotating metagenomes and interpreting results from previous studies. PMID:27162180
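
    Because a simulated metagenome has a known origin for every read, annotation output can be scored directly. The sketch below assumes per-read taxon labels in two dictionaries and computes precision and sensitivity; the data layout and names are hypothetical and do not reflect the output formats of MEGAN, MG-RAST, One Codex or Megablast.

```python
def precision_and_sensitivity(truth, predicted):
    """Score per-read annotations against the known simulated origins.

    truth     : dict read_id -> true taxon (every simulated read)
    predicted : dict read_id -> annotated taxon (unannotated reads are absent)
    """
    true_positives = sum(
        1 for read_id, taxon in predicted.items() if truth.get(read_id) == taxon
    )
    # Precision: of the reads that were annotated, how many were right.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Sensitivity: of all simulated reads, how many were annotated correctly.
    sensitivity = true_positives / len(truth) if truth else 0.0
    return precision, sensitivity


# Hypothetical genus-level labels for four simulated reads;
# read r4 was left unannotated and r2 was annotated incorrectly.
truth = {"r1": "genus_a", "r2": "genus_a", "r3": "genus_b", "r4": "genus_b"}
predicted = {"r1": "genus_a", "r2": "genus_b", "r3": "genus_b"}
precision, sensitivity = precision_and_sensitivity(truth, predicted)
```

    Stricter parameters typically shrink `predicted`, which can raise precision while lowering sensitivity, the trade-off the abstract describes.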

  5. Systemic inaccuracies in the National Surgical Quality Improvement Program database: Implications for accuracy and validity for neurosurgery outcomes research.

    PubMed

    Rolston, John D; Han, Seunggu J; Chang, Edward F

    2017-03-01

    The American College of Surgeons (ACS) National Surgical Quality Improvement Program (NSQIP) provides a rich database of North American surgical procedures and their complications. Yet no external source has validated the accuracy of the information within this database. Using records from the 2006 to 2013 NSQIP database, we used two methods to identify errors: (1) mismatches between the Current Procedural Terminology (CPT) code that was used to identify the surgical procedure, and the International Classification of Diseases (ICD-9) post-operative diagnosis: i.e., a diagnosis that is incompatible with a certain procedure. (2) Primary anesthetic and CPT code mismatching: i.e., anesthesia not indicated for a particular procedure. Analyzing data for movement disorders, epilepsy, and tumor resection, we found evidence of CPT code and postoperative diagnosis mismatches in 0.4-100% of cases, depending on the CPT code examined. When analyzing anesthetic data from brain tumor, epilepsy, trauma, and spine surgery, we found evidence of miscoded anesthesia in 0.1-0.8% of cases. National databases like NSQIP are an important tool for quality improvement. Yet all databases are subject to errors, and measures of internal consistency show that errors affect up to 100% of case records for certain procedures in NSQIP. Steps should be taken to improve data collection on the frontend of NSQIP, and also to ensure that future studies with NSQIP take steps to exclude erroneous cases from analysis.
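
    The first internal-consistency check described here, flagging records whose diagnosis is incompatible with the coded procedure, can be sketched as follows. The compatibility table, record layout and all code labels are hypothetical placeholders, not real CPT or ICD-9 codes and not the authors' actual mapping.

```python
from collections import defaultdict


def mismatch_rates(records, compatible):
    """Fraction of records per procedure code with an incompatible diagnosis.

    records    : iterable of (procedure_code, diagnosis_code) pairs
    compatible : dict procedure_code -> set of diagnosis codes deemed plausible
    """
    totals = defaultdict(int)
    mismatches = defaultdict(int)
    for procedure, diagnosis in records:
        totals[procedure] += 1
        if diagnosis not in compatible.get(procedure, set()):
            mismatches[procedure] += 1
    return {proc: mismatches[proc] / totals[proc] for proc in totals}


# Placeholder codes for illustration only.
records = [
    ("PROC_A", "DX_1"),
    ("PROC_A", "DX_9"),  # diagnosis not plausible for PROC_A
    ("PROC_B", "DX_2"),
    ("PROC_B", "DX_2"),
]
compatible = {"PROC_A": {"DX_1"}, "PROC_B": {"DX_2"}}
rates = mismatch_rates(records, compatible)
```

    Reporting the rate per procedure code, as done here, matches how the abstract presents its findings as a range across codes.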

  6. Improving pharmaceutical innovation by building a more comprehensive database on drug development and use.

    PubMed

    Daniel, Gregory W; Cazé, Alexis; Romine, Morgan H; Audibert, Céline; Leff, Jonathan S; McClellan, Mark B

    2015-02-01

    New drugs and biologics have had a tremendous impact on the treatment of many diseases. However, available measures suggest that pharmaceutical innovation has remained relatively flat, despite substantial growth in research and development spending. We review recent literature on pharmaceutical innovation to identify limitations in measuring and assessing innovation, and we describe the framework and collaborative approach we are using to develop more comprehensive, publicly available metrics for innovation. Our research teams at the Brookings Institution and Deerfield Institute are collaborating with experts from multiple areas of drug development and regulatory review to identify and collect comprehensive data elements related to key development and regulatory characteristics for each new molecular entity approved over the past several decades in the United States and the European Union. Subsequent phases of our effort will add data on downstream product use and patient outcomes and will also include drugs that have failed or been abandoned in development. Such a database will enable researchers to better analyze the drivers of drug innovation, trends in the output of new medicines, and the effect of policy efforts designed to improve innovation.

  7. DFAST and DAGA: web-based integrated genome annotation tools and resources

    PubMed Central

    TANIZAWA, Yasuhiro; FUJISAWA, Takatomo; KAMINUMA, Eli; NAKAMURA, Yasukazu; ARITA, Masanori

    2016-01-01

    Quality assurance and correct taxonomic affiliation of data submitted to public sequence databases have been an everlasting problem. The DDBJ Fast Annotation and Submission Tool (DFAST) is a newly developed genome annotation pipeline with quality and taxonomy assessment tools. To enable annotation of ready-to-submit quality, we also constructed curated reference protein databases tailored for lactic acid bacteria. DFAST was developed so that all the procedures required for DDBJ submission could be done seamlessly online. The online workspace would be especially useful for users not familiar with bioinformatics skills. In addition, we have developed a genome repository, DFAST Archive of Genome Annotation (DAGA), which currently includes 1,421 genomes covering 179 species and 18 subspecies of two genera, Lactobacillus and Pediococcus, obtained from both DDBJ/ENA/GenBank and Sequence Read Archive (SRA). All the genomes deposited in DAGA were annotated consistently and assessed using DFAST. To assess the taxonomic position based on genomic sequence information, we used the average nucleotide identity (ANI), which showed high discriminative power to determine whether two given genomes belong to the same species. We corrected mislabeled or misidentified genomes in the public database and deposited the curated information in DAGA. The repository will improve the accessibility and reusability of genome resources for lactic acid bacteria. By exploiting the data deposited in DAGA, we found intraspecific subgroups in Lactobacillus gasseri and Lactobacillus jensenii, whose variation between subgroups is larger than the well-accepted ANI threshold of 95% to differentiate species. DFAST and DAGA are freely accessible at https://dfast.nig.ac.jp. PMID:27867804
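
    The use of a 95% ANI threshold to check a species label can be sketched as below. The ANI values are invented for illustration, the function name is hypothetical, and the best-hit rule is an assumed simplification, not a description of how DFAST itself performs the assessment.

```python
ANI_SPECIES_THRESHOLD = 95.0  # percent; the threshold cited in the abstract


def check_label(labeled_species, ani_to_references, threshold=ANI_SPECIES_THRESHOLD):
    """Compare a genome's species label with its best ANI hit.

    ani_to_references: dict reference_species -> ANI (%) between the query
                       genome and that species' reference genome
    """
    best_species, best_ani = max(ani_to_references.items(), key=lambda kv: kv[1])
    if best_ani < threshold:
        return "unassigned"            # no reference is within species range
    if best_species == labeled_species:
        return "consistent"
    return "possibly mislabeled"       # best hit disagrees with the label


# Invented ANI values for illustration.
print(check_label("Lactobacillus gasseri",
                  {"Lactobacillus gasseri": 97.2, "Lactobacillus jensenii": 78.0}))
print(check_label("Lactobacillus gasseri",
                  {"Lactobacillus gasseri": 79.5, "Lactobacillus jensenii": 96.4}))
```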

  8. DFAST and DAGA: web-based integrated genome annotation tools and resources.

    PubMed

    Tanizawa, Yasuhiro; Fujisawa, Takatomo; Kaminuma, Eli; Nakamura, Yasukazu; Arita, Masanori

    2016-01-01

    Quality assurance and correct taxonomic affiliation of data submitted to public sequence databases have been an everlasting problem. The DDBJ Fast Annotation and Submission Tool (DFAST) is a newly developed genome annotation pipeline with quality and taxonomy assessment tools. To enable annotation of ready-to-submit quality, we also constructed curated reference protein databases tailored for lactic acid bacteria. DFAST was developed so that all the procedures required for DDBJ submission could be done seamlessly online. The online workspace would be especially useful for users not familiar with bioinformatics skills. In addition, we have developed a genome repository, DFAST Archive of Genome Annotation (DAGA), which currently includes 1,421 genomes covering 179 species and 18 subspecies of two genera, Lactobacillus and Pediococcus, obtained from both DDBJ/ENA/GenBank and Sequence Read Archive (SRA). All the genomes deposited in DAGA were annotated consistently and assessed using DFAST. To assess the taxonomic position based on genomic sequence information, we used the average nucleotide identity (ANI), which showed high discriminative power to determine whether two given genomes belong to the same species. We corrected mislabeled or misidentified genomes in the public database and deposited the curated information in DAGA. The repository will improve the accessibility and reusability of genome resources for lactic acid bacteria. By exploiting the data deposited in DAGA, we found intraspecific subgroups in Lactobacillus gasseri and Lactobacillus jensenii, whose variation between subgroups is larger than the well-accepted ANI threshold of 95% to differentiate species. DFAST and DAGA are freely accessible at https://dfast.nig.ac.jp.

  9. Omics data management and annotation.

    PubMed

    Harel, Arye; Dalah, Irina; Pietrokovski, Shmuel; Safran, Marilyn; Lancet, Doron

    2011-01-01

    Technological Omics breakthroughs, including next generation sequencing, bring avalanches of data which need to undergo effective data management to ensure integrity, security, and maximal knowledge-gleaning. Data management system requirements include flexible input formats, diverse data entry mechanisms and views, user friendliness, attention to standards, hardware and software platform definition, as well as robustness. Relevant solutions elaborated by the scientific community include Laboratory Information Management Systems (LIMS) and standardization protocols facilitating data sharing and managing. In project planning, special consideration has to be made when choosing relevant Omics annotation sources, since many of them overlap and require sophisticated integration heuristics. The data modeling step defines and categorizes the data into objects (e.g., genes, articles, disorders) and creates an application flow. A data storage/warehouse mechanism must be selected, such as file-based systems and relational databases, the latter typically used for larger projects. Omics project life cycle considerations must include the definition and deployment of new versions, incorporating either full or partial updates. Finally, quality assurance (QA) procedures must validate data and feature integrity, as well as system performance expectations. We illustrate these data management principles with examples from the life cycle of the GeneCards Omics project (http://www.genecards.org), a comprehensive, widely used compendium of annotative information about human genes. For example, the GeneCards infrastructure has recently been changed from text files to a relational database, enabling better organization and views of the growing data. Omics data handling benefits from the wealth of Web-based information, the vast amount of public domain software, increasingly affordable hardware, and effective use of data management and annotation principles as outlined in this chapter.

  10. Development and Evaluation of an Automated Annotation Pipeline and cDNA Annotation System

    PubMed Central

    Kasukawa, Takeya; Furuno, Masaaki; Nikaido, Itoshi; Bono, Hidemasa; Hume, David A.; Bult, Carol; Hill, David P.; Baldarelli, Richard; Gough, Julian; Kanapin, Alexander; Matsuda, Hideo; Schriml, Lynn M.; Hayashizaki, Yoshihide; Okazaki, Yasushi; Quackenbush, John

    2003-01-01

    Manual curation has long been held to be the “gold standard” for functional annotation of DNA sequence. Our experience with the annotation of more than 20,000 full-length cDNA sequences revealed problems with this approach, including inaccurate and inconsistent assignment of gene names, as well as many good assignments that were difficult to reproduce using only computational methods. For the FANTOM2 annotation of more than 60,000 cDNA clones, we developed a number of methods and tools to circumvent some of these problems, including an automated annotation pipeline that provides high-quality preliminary annotation for each sequence by introducing an “uninformative filter” that eliminates uninformative annotations, controlled vocabularies to accurately reflect both the functional assignments and the evidence supporting them, and a highly refined, Web-based manual annotation tool that allows users to view a wide array of sequence analyses and to assign gene names and putative functions using a consistent nomenclature. The ultimate utility of our approach is reflected in the low rate of reassignment of automated assignments by manual curation. Based on these results, we propose a new standard for large-scale annotation, in which the initial automated annotations are manually investigated and then computational methods are iteratively modified and improved based on the results of manual curation. PMID:12819153
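
    The idea of an "uninformative filter" can be sketched as follows: candidate descriptions are scanned in rank order and the first one that carries real information is kept. The pattern list and function name are assumptions made for illustration; the filter rules actually used in the FANTOM2 pipeline are not given in the abstract.

```python
import re

# Hypothetical patterns for descriptions that convey no functional information.
UNINFORMATIVE = [
    re.compile(pattern, re.IGNORECASE)
    for pattern in (r"\bhypothetical\b", r"\bunknown\b", r"\bunnamed\b", r"^\s*$")
]


def first_informative(candidates):
    """Return the first candidate description not matched by any pattern.

    candidates: descriptions ordered from best to worst supporting evidence
    """
    for description in candidates:
        if not any(p.search(description) for p in UNINFORMATIVE):
            return description
    return None  # nothing informative; leave for manual curation


print(first_informative(["hypothetical protein", "protein kinase C"]))
print(first_informative(["unknown", "unnamed protein product"]))
```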

  11. IPD-MHC 2.0: an improved inter-species database for the study of the major histocompatibility complex.

    PubMed

    Maccari, Giuseppe; Robinson, James; Ballingall, Keith; Guethlein, Lisbeth A; Grimholt, Unni; Kaufman, Jim; Ho, Chak-Sum; de Groot, Natasja G; Flicek, Paul; Bontrop, Ronald E; Hammond, John A; Marsh, Steven G E

    2017-01-04

    The IPD-MHC Database project (http://www.ebi.ac.uk/ipd/mhc/) collects and expertly curates sequences of the major histocompatibility complex from non-human species and provides the infrastructure and tools to enable accurate analysis. Since the first release of the database in 2003, IPD-MHC has grown and currently hosts a number of specific sections, with more than 7000 alleles from 70 species, including non-human primates, canines, felines, equids, ovids, suids, bovins, salmonids and murids. These sequences are expertly curated and made publicly available through an open access website. The IPD-MHC Database is a key resource in its field, and this has led to an average of 1500 unique visitors and more than 5000 viewed pages per month. As the database has grown in size and complexity, it has created a number of challenges in maintaining and organizing information, particularly the need to standardize nomenclature and taxonomic classification, while incorporating new allele submissions. Here, we describe the latest database release, the IPD-MHC 2.0 and discuss planned developments. This release incorporates sequence updates and new tools that enhance database queries and improve the submission procedure by utilizing common tools that are able to handle the varied requirements of each MHC-group.

  12. IPD-MHC 2.0: an improved inter-species database for the study of the major histocompatibility complex

    PubMed Central

    Maccari, Giuseppe; Robinson, James; Ballingall, Keith; Guethlein, Lisbeth A.; Grimholt, Unni; Kaufman, Jim; Ho, Chak-Sum; de Groot, Natasja G.; Flicek, Paul; Bontrop, Ronald E.; Hammond, John A.; Marsh, Steven G. E.

    2017-01-01

    The IPD-MHC Database project (http://www.ebi.ac.uk/ipd/mhc/) collects and expertly curates sequences of the major histocompatibility complex from non-human species and provides the infrastructure and tools to enable accurate analysis. Since the first release of the database in 2003, IPD-MHC has grown and currently hosts a number of specific sections, with more than 7000 alleles from 70 species, including non-human primates, canines, felines, equids, ovids, suids, bovins, salmonids and murids. These sequences are expertly curated and made publicly available through an open access website. The IPD-MHC Database is a key resource in its field, and this has led to an average of 1500 unique visitors and more than 5000 viewed pages per month. As the database has grown in size and complexity, it has created a number of challenges in maintaining and organizing information, particularly the need to standardize nomenclature and taxonomic classification, while incorporating new allele submissions. Here, we describe the latest database release, the IPD-MHC 2.0 and discuss planned developments. This release incorporates sequence updates and new tools that enhance database queries and improve the submission procedure by utilizing common tools that are able to handle the varied requirements of each MHC-group. PMID:27899604

  13. Annotated Humanities Programs.

    ERIC Educational Resources Information Center

    Adler, Richard R.; Applebee, Arthur

    The humanities programs offered in 1968 by 227 United States secondary schools are listed alphabetically by state, including almost 100 new programs not annotated in the 1967 listing (see TE 000 224). Each annotation presents a brief description of the approach to study used in the particular humanities course (e.g., American Studies, Culture…

  14. Automated Knowledge Annotation for Dynamic Collaborative Environments

    SciTech Connect

    Cowell, Andrew J.; Gregory, Michelle L.; Marshall, Eric J.; McGrath, Liam R.

    2009-05-19

    This paper describes the Knowledge Encapsulation Framework (KEF), a suite of tools to enable automated knowledge annotation for modeling and simulation projects. This framework can be used to capture evidence (e.g., facts extracted from journal articles and government reports), discover new evidence (from similar peer-reviewed material as well as social media), enable discussions surrounding domain-specific topics and provide automatically generated semantic annotations for improved corpus investigation. The current KEF implementation is presented within a wiki environment, providing a simple but powerful collaborative space for team members to review, annotate, discuss and align evidence with their modeling frameworks.

  15. AnnotCompute: annotation-based exploration and meta-analysis of genomics experiments.

    PubMed

    Zheng, Jie; Stoyanovich, Julia; Manduchi, Elisabetta; Liu, Junmin; Stoeckert, Christian J

    2011-01-01

    The ever-increasing scale of biological data sets, particularly those arising in the context of high-throughput technologies, requires the development of rich data exploration tools. In this article, we present AnnotCompute, an information discovery platform for repositories of functional genomics experiments such as ArrayExpress. Our system leverages semantic annotations of functional genomics experiments with controlled vocabulary and ontology terms, such as those from the MGED Ontology, to compute conceptual dissimilarities between pairs of experiments. These dissimilarities are then used to support two types of exploratory analysis: clustering and query-by-example. We show that our proposed dissimilarity measures correspond to a user's intuition about conceptual dissimilarity, and can be used to support effective query-by-example. We also evaluate the quality of clustering based on these measures. While AnnotCompute can support a richer data exploration experience, its effectiveness is limited in some cases, due to the quality of available annotations. Nonetheless, tools such as AnnotCompute may provide an incentive for richer annotations of experiments. Database URL: http://www.cbil.upenn.edu/annotCompute/
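
    Query-by-example over annotation-based dissimilarities can be sketched as below. Jaccard dissimilarity on sets of annotation terms is used here as a simple stand-in; it is not claimed to be the measure AnnotCompute uses, and the experiment IDs and terms are invented.

```python
def jaccard_dissimilarity(a, b):
    """1 - |A ∩ B| / |A ∪ B| for two collections of annotation terms."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def query_by_example(example_id, experiments, k=2):
    """Return the k experiments least dissimilar to the chosen example."""
    reference = experiments[example_id]
    scored = [
        (jaccard_dissimilarity(reference, terms), exp_id)
        for exp_id, terms in experiments.items()
        if exp_id != example_id
    ]
    return [exp_id for _, exp_id in sorted(scored)[:k]]


# Invented experiments annotated with controlled-vocabulary-style terms.
experiments = {
    "E1": {"mouse", "liver", "time course"},
    "E2": {"mouse", "liver", "dose response"},
    "E3": {"human", "blood", "time course"},
    "E4": {"yeast", "heat shock"},
}
print(query_by_example("E1", experiments))
```

    As the abstract notes, any such measure is only as good as the annotations: an experiment annotated with few or generic terms will look dissimilar to everything.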

  16. Relational-database model for improving quality assurance and process control in a composite manufacturing environment

    NASA Astrophysics Data System (ADS)

    Gentry, Jeffery D.

    2000-05-01

    A relational database is a powerful tool for collecting and analyzing the vast amounts of interrelated data associated with the manufacture of composite materials. A relational database contains many individual database tables that store data that are related in some fashion. Manufacturing process variables as well as quality assurance measurements can be collected and stored in database tables indexed according to lot numbers, part type or individual serial numbers. Relationships between manufacturing process and product quality can then be correlated over a wide range of product types and process variations. This paper presents details on how relational databases are used to collect, store, and analyze process variables and quality assurance data associated with the manufacture of advanced composite materials. Important considerations are covered including how the various types of data are organized and how relationships between the data are defined. Employing relational database techniques to establish correlative relationships between process variables and quality assurance measurements is then explored. Finally, the benefits of database techniques such as data warehousing, data mining and web-based client/server architectures are discussed in the context of composite material manufacturing.
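
    A minimal sketch of such a schema, using Python's built-in sqlite3 module, is shown below. The table and column names (for example cure_temp_c and void_content_pct) and all values are hypothetical; the paper's actual schema is not given in the abstract.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE lot (
    lot_id    TEXT PRIMARY KEY,
    part_type TEXT NOT NULL
);
CREATE TABLE process_run (           -- one row per manufactured part
    serial_no   TEXT PRIMARY KEY,
    lot_id      TEXT NOT NULL REFERENCES lot(lot_id),
    cure_temp_c REAL NOT NULL        -- hypothetical process variable
);
CREATE TABLE qa_measurement (        -- one row per inspection result
    serial_no        TEXT NOT NULL REFERENCES process_run(serial_no),
    void_content_pct REAL NOT NULL   -- hypothetical quality measurement
);
""")
conn.executemany("INSERT INTO lot VALUES (?, ?)",
                 [("L1", "panel"), ("L2", "panel")])
conn.executemany("INSERT INTO process_run VALUES (?, ?, ?)",
                 [("S1", "L1", 175.0), ("S2", "L1", 177.0), ("S3", "L2", 182.0)])
conn.executemany("INSERT INTO qa_measurement VALUES (?, ?)",
                 [("S1", 1.0), ("S2", 2.0), ("S3", 3.0)])

# Join process variables to QA results through lot and serial numbers.
rows = conn.execute("""
    SELECT l.lot_id, AVG(p.cure_temp_c), AVG(q.void_content_pct)
    FROM lot AS l
    JOIN process_run AS p ON p.lot_id = l.lot_id
    JOIN qa_measurement AS q ON q.serial_no = p.serial_no
    GROUP BY l.lot_id
    ORDER BY l.lot_id
""").fetchall()
print(rows)
```

    Keeping process variables and QA measurements in separate tables linked by serial number lets either side grow independently while still supporting the per-lot comparisons described above.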

  17. Annotation extension through protein family annotation coherence metrics

    PubMed Central

    Bastos, Hugo P.; Clarke, Luka A.; Couto, Francisco M.

    2013-01-01

    Protein functional annotation consists in associating proteins with textual descriptors elucidating their biological roles. The bulk of annotation is done via automated procedures that ultimately rely on annotation transfer. Despite a large number of existing protein annotation procedures the ever growing protein space is never completely annotated. One of the facets of annotation incompleteness derives from annotation uncertainty. Often when protein function cannot be predicted with enough specificity it is instead conservatively annotated with more generic terms. In a scenario of protein families or functionally related (or even dissimilar) sets this leads to a more difficult task of using annotations to compare the extent of functional relatedness among all family or set members. However, we postulate that identifying sub-sets of functionally coherent proteins annotated at a very specific level, can help the annotation extension of other incompletely annotated proteins within the same family or functionally related set. As an example we analyse the status of annotation of a set of CAZy families belonging to the Polysaccharide Lyase class. We show that through the use of visualization methods and semantic similarity based metrics it is possible to identify families and respective annotation terms within them that are suitable for possible annotation extension. Based on our analysis we then propose a semi-automatic methodology leading to the extension of single annotation terms within these partially annotated protein sets or families. PMID:24130572

  18. DDBJ progress report: a new submission system for leading to a correct annotation

    PubMed Central

    Kosuge, Takehide; Mashima, Jun; Kodama, Yuichi; Fujisawa, Takatomo; Kaminuma, Eli; Ogasawara, Osamu; Okubo, Kousaku; Takagi, Toshihisa; Nakamura, Yasukazu

    2014-01-01

    The DNA Data Bank of Japan (DDBJ; http://www.ddbj.nig.ac.jp) maintains and provides archival, retrieval and analytical resources for biological information. This database content is shared with the US National Center for Biotechnology Information (NCBI) and the European Bioinformatics Institute (EBI) within the framework of the International Nucleotide Sequence Database Collaboration (INSDC). DDBJ launched a new nucleotide sequence submission system for receiving traditional nucleotide sequence. We expect that the new submission system will be useful for many submitters to input accurate annotation and reduce the time needed for data input. In addition, DDBJ has started a new service, the Japanese Genotype–phenotype Archive (JGA), with our partner institute, the National Bioscience Database Center (NBDC). JGA permanently archives and shares all types of individual human genetic and phenotypic data. We also introduce improvements in the DDBJ services and databases made during the past year. PMID:24194602

  19. Improved Differentiation of Streptococcus pneumoniae and Other S. mitis Group Streptococci by MALDI Biotyper Using an Improved MALDI Biotyper Database Content and a Novel Result Interpretation Algorithm.

    PubMed

    Harju, Inka; Lange, Christoph; Kostrzewa, Markus; Maier, Thomas; Rantakokko-Jalava, Kaisu; Haanperä, Marjo

    2017-03-01

    Reliable distinction of Streptococcus pneumoniae and viridans group streptococci is important because of the different pathogenic properties of these organisms. Differentiation between S. pneumoniae and closely related Streptococcus mitis species group streptococci has always been challenging, even when using such modern methods as 16S rRNA gene sequencing or matrix-assisted laser desorption ionization-time of flight (MALDI-TOF) mass spectrometry. In this study, a novel algorithm combined with an enhanced database was evaluated for differentiation between S. pneumoniae and S. mitis species group streptococci. One hundred one clinical S. mitis species group streptococcal strains and 188 clinical S. pneumoniae strains were identified by both the standard MALDI Biotyper database alone and that combined with a novel algorithm. The database update from 4,613 strains to 5,627 strains drastically improved the differentiation of S. pneumoniae and S. mitis species group streptococci: when the new database version containing 5,627 strains was used, only one of the 101 S. mitis species group isolates was misidentified as S. pneumoniae, whereas 66 of them were misidentified as S. pneumoniae when the earlier 4,613-strain MALDI Biotyper database version was used. The updated MALDI Biotyper database combined with the novel algorithm showed even better performance, producing no misidentifications of the S. mitis species group strains as S. pneumoniae. All S. pneumoniae strains were correctly identified as S. pneumoniae with both the standard MALDI Biotyper database and the standard MALDI Biotyper database combined with the novel algorithm. This new algorithm thus enables reliable differentiation between pneumococci and other S. mitis species group streptococci with the MALDI Biotyper.

  20. ERAIZDA: a model for holistic annotation of animal infectious and zoonotic diseases

    PubMed Central

    Buza, Teresia M.; Jack, Sherman W.; Kirunda, Halid; Khaitsa, Margaret L.; Lawrence, Mark L.; Pruett, Stephen; Peterson, Daniel G.

    2015-01-01

    There is an urgent need for a unified resource that integrates trans-disciplinary annotations of emerging and reemerging animal infectious and zoonotic diseases. Such data integration will provide a wonderful opportunity for epidemiologists, researchers and health policy makers to make data-driven decisions designed to improve animal health. Integrating emerging and reemerging animal infectious and zoonotic disease data from a large variety of sources into a unified open-access resource provides more plausible arguments to achieve better understanding of infectious and zoonotic diseases. We have developed a model for interlinking annotations of these diseases. These diseases are of particular interest because of the threats they pose to animal health, human health and global health security. We demonstrated the application of this model using brucellosis, an infectious and zoonotic disease. Preliminary annotations were deposited into the VetBioBase database (http://vetbiobase.igbb.msstate.edu). This database is associated with user-friendly tools to facilitate searching, retrieving and downloading of disease-related information. Database URL: http://vetbiobase.igbb.msstate.edu PMID:26581408

  1. HIV Sequence Databases

    PubMed Central

    Kuiken, Carla; Korber, Bette; Shafer, Robert W.

    2008-01-01

    Two important databases are often used in HIV genetic research, the HIV Sequence Database in Los Alamos, which collects all sequences and focuses on annotation and data analysis, and the HIV RT/Protease Sequence Database in Stanford, which collects sequences associated with the development of viral resistance against anti-retroviral drugs and focuses on analysis of those sequences. The types of data and services these two databases offer, the tools they provide, and the way they are set up and operated are described in detail. PMID:12875108

  2. Exploring inconsistencies in genome-wide protein function annotations: a machine learning approach

    PubMed Central

    Andorf, Carson; Dobbs, Drena; Honavar, Vasant

    2007-01-01

    Background: Incorrectly annotated sequence data are becoming more commonplace as databases increasingly rely on automated techniques for annotation. Hence, there is an urgent need for computational methods for checking consistency of such annotations against independent sources of evidence and detecting potential annotation errors. We show how a machine learning approach designed to automatically predict a protein's Gene Ontology (GO) functional class can be employed to identify potential gene annotation errors. Results: In a set of 211 previously annotated mouse protein kinases, we found that 201 of the GO annotations returned by AmiGO appear to be inconsistent with the UniProt functions assigned to their human counterparts. In contrast, 97% of the predicted annotations generated using a machine learning approach were consistent with the UniProt annotations of the human counterparts, as well as with available annotations for these mouse protein kinases in the Mouse Kinome database. Conclusion: We conjecture that most of our predicted annotations are, therefore, correct and suggest that the machine learning approach developed here could be routinely used to detect potential errors in GO annotations generated by high-throughput gene annotation projects. Editors' Note: Authors from the original publication (Okazaki et al.: Nature 2002, 420:563–73) have provided their response to Andorf et al., directly following the correspondence. PMID:17683567

  3. Complementary use of the SciSearch database for improved biomedical information searching.

    PubMed Central

    Brown, C M

    1998-01-01

    The use of at least two complementary online biomedical databases is generally considered critical for biomedical scientists seeking to keep fully abreast of recent research developments as well as to retrieve the highest number of relevant citations possible. Although the National Library of Medicine's MEDLINE is usually the database of choice, this paper illustrates the benefits of using another database, the Institute for Scientific Information's SciSearch, when conducting a biomedical information search. When a simple query about red wine consumption and coronary artery disease was posed simultaneously in both MEDLINE and SciSearch, a greater number of relevant citations were retrieved through SciSearch. This paper also provides suggestions for carrying out a comprehensive biomedical literature search in a rapid and efficient manner by using SciSearch in conjunction with MEDLINE. PMID:9549014

  4. Improving HJ-1B IRS land surface temperature product using ASTER global emissivity database

    NASA Astrophysics Data System (ADS)

    Li, H.; Hu, T.; Meng, X.; Yongming, D.; Cao, B.; Liu, Q.

    2015-12-01

    Land surface temperature (LST) is a key parameter for hydrological, meteorological, climatological and environmental studies. Currently many operational LST products have been generated using European and American satellite data, i.e., the Advanced Very High Resolution Radiometer (AVHRR), Advanced Along-Track Scanning Radiometer (AATSR) and Moderate Resolution Imaging Spectroradiometer (MODIS). However, few LST products have been produced using Chinese satellite data. Thus, the objective of this study is to generate a reliable LST product using Chinese HJ-1B satellite data. The HJ-1B satellite of China, launched on September 6, 2008, is used for disaster and environment monitoring. The IRS (Infrared Scanner) is one of the key instruments onboard the HJ-1B satellite; it can scan the earth every four days and has four spectral bands ranging from the near-infrared to the thermal infrared (band 1: 0.75-1.10 μm; band 2: 1.55-1.75 μm; band 3, MIR: 3.50-3.90 μm; band 4: 10.5-12.5 μm) with a 720 km swath. It scans ±29° from nadir, and the spatial resolution is 150 m for bands 1-3 and 300 m for band 4. In this study, a single-channel parametric model (SC-PM) algorithm was used to produce a 300 m LST product from HJ-1B IRS data. The NCEP atmospheric profiles and a parametric model were used for atmospheric correction. In order to improve the accuracy of the land surface emissivity (LSE), the 1 km ASTER Global Emissivity Database (GED) and a self-developed 5-day 1 km vegetation cover product were used for estimating the LSE based on the Vegetation Cover Method. Two years of HJ-1B IRS LST product in the Heihe River basin (Gansu province, China) from June 2012 to June 2014 were generated. The LST products were evaluated against ground observations in an arid area of northwest China during the Heihe Watershed Allied Telemetry Experimental Research (HiWATER) experiment. Four barren surface sites and ten vegetated sites were chosen for the evaluation. The results show that the developed HJ-1B IRS

  5. The Coral Triangle Atlas: An Integrated Online Spatial Database System for Improving Coral Reef Management

    PubMed Central

    Cros, Annick; Ahamad Fatan, Nurulhuda; White, Alan; Teoh, Shwu Jiau; Tan, Stanley; Handayani, Christian; Huang, Charles; Peterson, Nate; Venegas Li, Ruben; Siry, Hendra Yusran; Fitriana, Ria; Gove, Jamison; Acoba, Tomoko; Knight, Maurice; Acosta, Renerio; Andrew, Neil; Beare, Doug

    2014-01-01

    In this paper we describe the construction of an online GIS database system, hosted by WorldFish, which stores bio-physical, ecological and socio-economic data for the ‘Coral Triangle Area’ in South-east Asia and the Pacific. The database has been built in partnership with all six (Timor-Leste, Malaysia, Indonesia, The Philippines, Solomon Islands and Papua New Guinea) of the Coral Triangle countries, and represents a valuable source of information for natural resource managers at the regional scale. Its utility is demonstrated using biophysical data, data summarising marine habitats, and data describing the extent of marine protected areas in the region. PMID:24941442

  6. The Coral Triangle Atlas: an integrated online spatial database system for improving coral reef management.

    PubMed

    Cros, Annick; Ahamad Fatan, Nurulhuda; White, Alan; Teoh, Shwu Jiau; Tan, Stanley; Handayani, Christian; Huang, Charles; Peterson, Nate; Venegas Li, Ruben; Siry, Hendra Yusran; Fitriana, Ria; Gove, Jamison; Acoba, Tomoko; Knight, Maurice; Acosta, Renerio; Andrew, Neil; Beare, Doug

    2014-01-01

    In this paper we describe the construction of an online GIS database system, hosted by WorldFish, which stores bio-physical, ecological and socio-economic data for the 'Coral Triangle Area' in South-east Asia and the Pacific. The database has been built in partnership with all six (Timor-Leste, Malaysia, Indonesia, The Philippines, Solomon Islands and Papua New Guinea) of the Coral Triangle countries, and represents a valuable source of information for natural resource managers at the regional scale. Its utility is demonstrated using biophysical data, data summarising marine habitats, and data describing the extent of marine protected areas in the region.

  7. An annotated energy bibliography

    NASA Technical Reports Server (NTRS)

    Blow, S. J.

    1979-01-01

    Comprehensive annotated compilation of books, journals, periodicals, and reports on energy and energy related topics, contains approximately 10,0000 technical and nontechnical references from bibliographic and other sources dated January 1975 through May 1977.

  8. An Introduction to Genome Annotation.

    PubMed

    Campbell, Michael S; Yandell, Mark

    2015-12-17

    Genome projects have evolved from large international undertakings to tractable endeavors for a single lab. Accurate genome annotation is critical for successful genomic, genetic, and molecular biology experiments. These annotations can be generated using a number of approaches and available software tools. This unit describes methods for genome annotation and a number of software tools commonly used in gene annotation.

  9. Improving Child Outcomes with Data-Based Decision Making: Collecting Data

    ERIC Educational Resources Information Center

    Hojnoski, Robin L.; Gischlar, Karen L.; Missall, Kristen N.

    2009-01-01

    Collecting and graphing performance data are important parts of the educational process. Such procedures help educators, caregivers, and other important stakeholders make data-based decisions to accelerate child progress. With school-age children, collecting and graphing data have been associated with more frequent instructional changes to better…

  10. The UCSC Genome Browser database: 2016 update

    PubMed Central

    Speir, Matthew L.; Zweig, Ann S.; Rosenbloom, Kate R.; Raney, Brian J.; Paten, Benedict; Nejad, Parisa; Lee, Brian T.; Learned, Katrina; Karolchik, Donna; Hinrichs, Angie S.; Heitner, Steve; Harte, Rachel A.; Haeussler, Maximilian; Guruvadoo, Luvina; Fujita, Pauline A.; Eisenhart, Christopher; Diekhans, Mark; Clawson, Hiram; Casper, Jonathan; Barber, Galt P.; Haussler, David; Kuhn, Robert M.; Kent, W. James

    2016-01-01

    For the past 15 years, the UCSC Genome Browser (http://genome.ucsc.edu/) has served the international research community by offering an integrated platform for viewing and analyzing information from a large database of genome assemblies and their associated annotations. The UCSC Genome Browser has been under continuous development since its inception with new data sets and software features added frequently. Some release highlights of this year include new and updated genome browsers for various assemblies, including bonobo and zebrafish; new gene annotation sets; improvements to track and assembly hub support; and a new interactive tool, the “Data Integrator”, for intersecting data from multiple tracks. We have greatly expanded the data sets available on the most recent human assembly, hg38/GRCh38, to include updated gene prediction sets from GENCODE, more phenotype- and disease-associated variants from ClinVar and ClinGen, more genomic regulatory data, and a new multiple genome alignment. PMID:26590259

  11. The UCSC Genome Browser database: 2016 update.

    PubMed

    Speir, Matthew L; Zweig, Ann S; Rosenbloom, Kate R; Raney, Brian J; Paten, Benedict; Nejad, Parisa; Lee, Brian T; Learned, Katrina; Karolchik, Donna; Hinrichs, Angie S; Heitner, Steve; Harte, Rachel A; Haeussler, Maximilian; Guruvadoo, Luvina; Fujita, Pauline A; Eisenhart, Christopher; Diekhans, Mark; Clawson, Hiram; Casper, Jonathan; Barber, Galt P; Haussler, David; Kuhn, Robert M; Kent, W James

    2016-01-04

    For the past 15 years, the UCSC Genome Browser (http://genome.ucsc.edu/) has served the international research community by offering an integrated platform for viewing and analyzing information from a large database of genome assemblies and their associated annotations. The UCSC Genome Browser has been under continuous development since its inception with new data sets and software features added frequently. Some release highlights of this year include new and updated genome browsers for various assemblies, including bonobo and zebrafish; new gene annotation sets; improvements to track and assembly hub support; and a new interactive tool, the "Data Integrator", for intersecting data from multiple tracks. We have greatly expanded the data sets available on the most recent human assembly, hg38/GRCh38, to include updated gene prediction sets from GENCODE, more phenotype- and disease-associated variants from ClinVar and ClinGen, more genomic regulatory data, and a new multiple genome alignment.

  12. Improved systematic tRNA gene annotation allows new insights into the evolution of mitochondrial tRNA structures and into the mechanisms of mitochondrial genome rearrangements

    PubMed Central

    Jühling, Frank; Pütz, Joern; Bernt, Matthias; Donath, Alexander; Middendorf, Martin; Florentz, Catherine; Stadler, Peter F.

    2012-01-01

    Transfer RNAs (tRNAs) are present in all types of cells as well as in organelles. tRNAs of animal mitochondria show a low level of primary sequence conservation and exhibit ‘bizarre’ secondary structures, lacking complete domains of the common cloverleaf. Such sequences are hard to detect and hence frequently missed in computational analyses and mitochondrial genome annotation. Here, we introduce an automatic annotation procedure for mitochondrial tRNA genes in Metazoa based on sequence and structural information in manually curated covariance models. The method, applied to re-annotate 1876 available metazoan mitochondrial RefSeq genomes, makes it possible to distinguish between remaining functional genes and degrading ‘pseudogenes’, even at early stages of divergence. The subsequent analysis of a comprehensive set of mitochondrial tRNA genes gives new insights into the evolution of structures of mitochondrial tRNA sequences as well as into the mechanisms of genome rearrangements. We find frequent losses of tRNA genes concentrated in basal Metazoa, frequent independent losses of individual parts of tRNA genes, particularly in Arthropoda, and wide-spread conserved overlaps of tRNAs in opposite reading direction. Direct evidence for several recent Tandem Duplication-Random Loss events is gained, demonstrating that this mechanism has an impact on the appearance of new mitochondrial gene orders. PMID:22139921

  13. The ChEMBL database in 2017

    PubMed Central

    Gaulton, Anna; Hersey, Anne; Nowotka, Michał; Bento, A. Patrícia; Chambers, Jon; Mendez, David; Mutowo, Prudence; Atkinson, Francis; Bellis, Louisa J.; Cibrián-Uhalte, Elena; Davies, Mark; Dedman, Nathan; Karlsson, Anneli; Magariños, María Paula; Overington, John P.; Papadatos, George; Smit, Ines; Leach, Andrew R.

    2017-01-01

    ChEMBL is an open large-scale bioactivity database (https://www.ebi.ac.uk/chembl), previously described in the 2012 and 2014 Nucleic Acids Research Database Issues. Since then, alongside the continued extraction of data from the medicinal chemistry literature, new sources of bioactivity data have also been added to the database. These include: deposited data sets from neglected disease screening; crop protection data; drug metabolism and disposition data and bioactivity data from patents. A number of improvements and new features have also been incorporated. These include the annotation of assays and targets using ontologies, the inclusion of targets and indications for clinical candidates, addition of metabolic pathways for drugs and calculation of structural alerts. The ChEMBL data can be accessed via a web-interface, RDF distribution, data downloads and RESTful web-services. PMID:27899562

  14. The ChEMBL database in 2017.

    PubMed

    Gaulton, Anna; Hersey, Anne; Nowotka, Michał; Bento, A Patrícia; Chambers, Jon; Mendez, David; Mutowo, Prudence; Atkinson, Francis; Bellis, Louisa J; Cibrián-Uhalte, Elena; Davies, Mark; Dedman, Nathan; Karlsson, Anneli; Magariños, María Paula; Overington, John P; Papadatos, George; Smit, Ines; Leach, Andrew R

    2017-01-04

    ChEMBL is an open large-scale bioactivity database (https://www.ebi.ac.uk/chembl), previously described in the 2012 and 2014 Nucleic Acids Research Database Issues. Since then, alongside the continued extraction of data from the medicinal chemistry literature, new sources of bioactivity data have also been added to the database. These include: deposited data sets from neglected disease screening; crop protection data; drug metabolism and disposition data and bioactivity data from patents. A number of improvements and new features have also been incorporated. These include the annotation of assays and targets using ontologies, the inclusion of targets and indications for clinical candidates, addition of metabolic pathways for drugs and calculation of structural alerts. The ChEMBL data can be accessed via a web-interface, RDF distribution, data downloads and RESTful web-services.

  15. Biomedical article retrieval using multimodal features and image annotations in region-based CBIR

    NASA Astrophysics Data System (ADS)

    You, Daekeun; Antani, Sameer; Demner-Fushman, Dina; Rahman, Md Mahmudur; Govindaraju, Venu; Thoma, George R.

    2010-01-01

    Biomedical images are invaluable in establishing diagnosis, acquiring technical skills, and implementing best practices in many areas of medicine. At present, images needed for instructional purposes or in support of clinical decisions appear in specialized databases and in biomedical articles, and are often not easily accessible to retrieval tools. Our goal is to automatically annotate images extracted from scientific publications with respect to their usefulness for clinical decision support and instructional purposes, and project the annotations onto images stored in databases by linking images through content-based image similarity. Authors often use text labels and pointers overlaid on figures and illustrations in the articles to highlight regions of interest (ROI). These annotations are then referenced in the caption text or figure citations in the article text. In previous research we have developed two methods (a heuristic method and a dynamic time warping-based method) for localizing and recognizing such pointers on biomedical images. In this work, we add robustness to our previous efforts by using a machine learning based approach to localizing and recognizing the pointers. Identifying these can assist in extracting relevant image content at regions within the image that are likely to be highly relevant to the discussion in the article text. Image regions can then be annotated using biomedical concepts from extracted snippets of text pertaining to images in scientific biomedical articles that are identified using the National Library of Medicine's Unified Medical Language System® (UMLS) Metathesaurus. The resulting regional annotation and extracted image content are then used as indices for biomedical article retrieval using the multimodal features and region-based content-based image retrieval (CBIR) techniques. The hypothesis that such an approach would improve biomedical document retrieval is validated through experiments on an expert-marked biomedical article

  16. The Vertebrate Genome Annotation browser 10 years on.

    PubMed

    Harrow, Jennifer L; Steward, Charles A; Frankish, Adam; Gilbert, James G; Gonzalez, Jose M; Loveland, Jane E; Mudge, Jonathan; Sheppard, Dan; Thomas, Mark; Trevanion, Stephen; Wilming, Laurens G

    2014-01-01

    The Vertebrate Genome Annotation (VEGA) database (http://vega.sanger.ac.uk), initially designed as a community resource for browsing manual annotation of the human genome project, now contains five reference genomes (human, mouse, zebrafish, pig and rat). Its introduction pages have been redesigned to enable the user to easily navigate between whole genomes and smaller multi-species haplotypic regions of interest such as the major histocompatibility complex. The VEGA browser is unique in that annotation is updated via the Human And Vertebrate Analysis aNd Annotation (HAVANA) update track every 2 weeks, allowing single gene updates to be made publicly available to the research community quickly. The user can now access different haplotypic subregions more easily, such as those from the non-obese diabetic mouse, and display them in a more intuitive way using the comparative tools. We also highlight how the user can browse manually annotated updated patches from the Genome Reference Consortium (GRC).

  17. Visualizing Genomic Annotations with the UCSC Genome Browser.

    PubMed

    Hung, Jui-Hung; Weng, Zhiping

    2016-11-01

    Genomic data and annotations are rapidly accumulating in databases such as the UCSC Genome Browser, NCBI, and Ensembl. Given the massive scale of these genomic databases, it is important to be able to easily retrieve known data and annotations of a specified genomic locus. For example, for a newly identified cis-regulatory element bound by a transcription factor, questions that immediately come to mind include whether the element is near a transcriptional start site and, if so, the name of the corresponding gene, and whether the histones or DNA at the locus are modified. The UCSC Genome Browser organizes data and annotations (called tracks) around the reference sequences or draft assemblies of many eukaryotic genomes and presents them using a powerful web-based graphical interface. This protocol describes how to use the UCSC Genome Browser to visualize selected tracks at specified genomic regions, download the data and annotations for further analysis, and retrieve multiple sequence alignments and their conservation scores.

  18. The Genopolis Microarray Database

    PubMed Central

    Splendiani, Andrea; Brandizi, Marco; Even, Gael; Beretta, Ottavio; Pavelka, Norman; Pelizzola, Mattia; Mayhaus, Manuel; Foti, Maria; Mauri, Giancarlo; Ricciardi-Castagnoli, Paola

    2007-01-01

    Background: Gene expression databases are key resources for microarray data management and analysis and the importance of a proper annotation of their content is well understood. Public repositories as well as microarray database systems that can be implemented by single laboratories exist. However, there is not yet a tool that can easily support a collaborative environment where different users with different rights of access to data can interact to define a common highly coherent content. The scope of the Genopolis database is to provide a resource that allows different groups performing microarray experiments related to a common subject to create a common coherent knowledge base and to analyse it. The Genopolis database has been implemented as a dedicated system for the scientific community studying dendritic and macrophage cell functions and host-parasite interactions. Results: The Genopolis Database system allows the community to build an object based MIAME compliant annotation of their experiments and to store images, raw and processed data from the Affymetrix GeneChip® platform. It supports dynamic definition of controlled vocabularies and provides automated and supervised steps to control the coherence of data and annotations. It allows a precise control of the visibility of the database content to different sub groups in the community and facilitates exports of its content to public repositories. It provides an interactive user interface for data analysis: this allows users to visualize data matrices based on functional lists and sample characterization, and to navigate to other data matrices defined by similarity of expression values as well as functional characterizations of genes involved. A collaborative environment is also provided for the definition and sharing of functional annotation by users. Conclusion: The Genopolis Database supports a community in building a common coherent knowledge base and analysing it. This fills a gap between a local

  19. Semantic Annotation of Mutable Data

    PubMed Central

    Morris, Robert A.; Dou, Lei; Hanken, James; Kelly, Maureen; Lowery, David B.; Ludäscher, Bertram; Macklin, James A.; Morris, Paul J.

    2013-01-01

    Electronic annotation of scientific data is very similar to annotation of documents. Both types of annotation amplify the original object, add related knowledge to it, and dispute or support assertions in it. In each case, annotation is a framework for discourse about the original object, and, in each case, an annotation needs to clearly identify its scope and its own terminology. However, electronic annotation of data differs from annotation of documents: the content of the annotations, including expectations and supporting evidence, is more often shared among members of networks. Any consequent actions taken by the holders of the annotated data could be shared as well. But even those current annotation systems that admit data as their subject often make it difficult or impossible to annotate at fine-enough granularity to use the results in this way for data quality control. We address these kinds of issues by offering simple extensions to an existing annotation ontology and describe how the results support an interest-based distribution of annotations. We are using the result to design and deploy a platform that supports annotation services overlaid on networks of distributed data, with particular application to data quality control. Our initial instance supports a set of natural science collection metadata services. An important application is the support for data quality control and provision of missing data. A previous proof of concept demonstrated such use based on data annotations modeled with XML-Schema. PMID:24223697

  20. NoisyGOA: Noisy GO annotations prediction using taxonomic and semantic similarity.

    PubMed

    Lu, Chang; Wang, Jun; Zhang, Zili; Yang, Pengyi; Yu, Guoxian

    2016-12-01

    Gene Ontology (GO) provides GO annotations (GOA) that associate gene products with GO terms that summarize their cellular, molecular and functional aspects in the context of biological pathways. The GO Consortium (GOC) resorts to various quality assurances to ensure the correctness of annotations. Due to resource limitations, only a small portion of annotations are manually added/checked by GO curators, and a large portion of available annotations are computationally inferred. While computationally inferred annotations provide greater coverage of known genes, they may also introduce annotation errors (noise) that could mislead the interpretation of the gene functions and their roles in cellular and biological processes. In this paper, we investigate how to identify noisy annotations, a rarely addressed problem, and propose a novel approach called NoisyGOA. NoisyGOA first measures taxonomic similarity between ontological terms using the GO hierarchy and semantic similarity between genes. Next, it leverages the taxonomic similarity and semantic similarity to predict noisy annotations. We compare NoisyGOA with alternative methods on identifying noisy annotations under different simulated cases of noisy annotations, and on archived GO annotations. NoisyGOA achieved higher accuracy than the alternative methods in the comparison. These results demonstrate that both taxonomic similarity and semantic similarity contribute to the identification of noisy annotations. Our study shows that annotation errors are predictable and removing noisy annotations improves the performance of gene function prediction. This study can prompt the community to study methods for removing inaccurate annotations, a critical step for annotating gene and pathway functions.
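
    The two similarity measures named in this abstract can be illustrated with a minimal sketch. The abstract does not give NoisyGOA's formulas, so the code below substitutes a generic stand-in: Jaccard overlap of ancestor sets for term similarity, and an average best-match score for gene similarity. The toy hierarchy, the term names, and the function names are all invented for illustration and are not real GO terms.

```python
# Toy is-a hierarchy: child term -> set of direct parents.
# Term names are invented; they are not real GO identifiers.
PARENTS = {
    "root": set(),
    "binding": {"root"},
    "catalysis": {"root"},
    "dna_binding": {"binding"},
    "rna_binding": {"binding"},
    "kinase": {"catalysis"},
}


def ancestors(term):
    """Return the term itself plus every term reachable by following parents."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def term_similarity(a, b):
    """Jaccard overlap of ancestor sets (an assumed stand-in measure)."""
    anc_a, anc_b = ancestors(a), ancestors(b)
    return len(anc_a & anc_b) / len(anc_a | anc_b)


def gene_similarity(terms_a, terms_b):
    """Average, over gene A's terms, of the best match among gene B's terms."""
    best = [max(term_similarity(a, b) for b in terms_b) for a in terms_a]
    return sum(best) / len(best)


sibling_score = term_similarity("dna_binding", "rna_binding")
distant_score = term_similarity("dna_binding", "kinase")
gene_score = gene_similarity({"dna_binding"}, {"rna_binding", "kinase"})
```

    In this toy hierarchy, sibling terms under the same parent score higher than terms that share only the root. An annotation whose term is dissimilar to the terms of otherwise similar genes would be one plausible candidate for closer review.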

  1. Handling of missing data to improve the mining of large feed databases.

    PubMed

    Maroto-Molina, F; Gómez-Cabrera, A; Guerrero-Ginel, J E; Garrido-Varo, A; Sauvant, D; Tran, G; Heuzé, V; Pérez-Marín, D C

    2013-01-01

    Feed databases often have missing data. Despite their potentially major effect on data analysis (e.g., as a source of biased results and loss of statistical power), database managers and nutrition researchers have paid little attention to missing data. This study evaluated various methods of handling missing data using mining outputs from a database containing data on chemical composition and nutritive value for 18,864 alfalfa samples. A complete reference dataset was obtained comprising the 2,303 cases with no missing data for the attributes CP, crude fiber (CF), NDF, ADF and ADL. This dataset was used to simulate 2 types of missing data (at random and not at random), each with 2 loss intensities (33 and 66%), thus yielding a total of 4 incomplete datasets. Missing data from these datasets were handled using 2 deletion methods and 4 imputation methods, and outputs in terms of the identification and typing of alfalfa (using ANOVA and descriptive statistics) and of correlations between attributes (using regressions) were compared with outputs from the complete dataset. Imputation methods, particularly model-based versions, were found to perform better than deletion methods in terms of maximizing information use and minimizing bias although the extent of differences between methods depended on the type of missing data. The best approximation to the uncertainty value was provided by multiple imputation methods. It was concluded that the choice of the most suitable method for handling missing data depended both on the type of missing data and on the purpose of data analysis.
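
    The contrast between deletion and imputation described in this abstract can be shown with a minimal sketch. It does not reproduce the study's procedure: the sample values, the 33% loss applied completely at random, and the helper names (`make_missing`, `listwise_delete`, `mean_impute`) are assumptions made for illustration only.

```python
import random
import statistics


def make_missing(values, loss, seed=0):
    """Return a copy of values with a fraction `loss` set to None at random."""
    rng = random.Random(seed)
    n_missing = round(len(values) * loss)
    missing_idx = set(rng.sample(range(len(values)), n_missing))
    return [None if i in missing_idx else v for i, v in enumerate(values)]


def listwise_delete(values):
    """Deletion method: drop every case with a missing value."""
    return [v for v in values if v is not None]


def mean_impute(values):
    """Single imputation: replace each missing value with the observed mean."""
    observed = listwise_delete(values)
    mean = statistics.fmean(observed)
    return [mean if v is None else v for v in values]


# Hypothetical crude-protein percentages for 12 samples (invented values).
crude_protein = [17.2, 18.5, 19.1, 16.8, 20.3, 18.0,
                 17.7, 19.6, 18.9, 16.5, 21.0, 18.4]
incomplete = make_missing(crude_protein, loss=0.33)

deleted = listwise_delete(incomplete)
imputed = mean_impute(incomplete)

# Deletion shrinks the sample; mean imputation keeps every case but
# understates the spread, because the filled-in values add no variation.
print(len(deleted), len(imputed))
print(statistics.stdev(deleted), statistics.stdev(imputed))
```

    Mean imputation keeps the sample size and the mean but shrinks the standard deviation, which is one reason a single imputed value gives an overly optimistic picture of uncertainty.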

  2. Gene ontology annotation by density and gravitation models.

    PubMed

    Hou, Wen-Juan; Lin, Kevin Hsin-Yih; Chen, Hsin-Hsi

    2006-01-01

    Gene Ontology (GO) is developed to provide standard vocabularies of gene products in different databases. The process of annotating GO terms to genes requires curators to read through lengthy articles. Methods for speeding up or automating the annotation process are thus of great importance. We propose a GO annotation approach using full-text biomedical documents for directing more relevant papers to curators. This system explores word density and gravitation relationships between genes and GO terms. Different density and gravitation models are built and several evaluation criteria are employed to assess the effects of the proposed methods.

  3. An annotation system for 3D fluid flow visualization

    NASA Technical Reports Server (NTRS)

    Loughlin, Maria M.; Hughes, John F.

    1995-01-01

    Annotation is a key activity of data analysis. However, current systems for data analysis focus almost exclusively on visualization. We propose a system which integrates annotations into a visualization system. Annotations are embedded in 3D data space, using the Post-it metaphor. This embedding allows contextual-based information storage and retrieval, and facilitates information sharing in collaborative environments. We provide a traditional database filter and a Magic Lens filter to create specialized views of the data. The system has been customized for fluid flow applications, with features which allow users to store parameters of visualization tools and sketch 3D volumes.

  4. Mouse genome database 2016

    PubMed Central

    Bult, Carol J.; Eppig, Janan T.; Blake, Judith A.; Kadin, James A.; Richardson, Joel E.

    2016-01-01

    The Mouse Genome Database (MGD; http://www.informatics.jax.org) is the primary community model organism database for the laboratory mouse and serves as the source for key biological reference data related to mouse genes, gene functions, phenotypes and disease models with a strong emphasis on the relationship of these data to human biology and disease. As the cost of genome-scale sequencing continues to decrease and new technologies for genome editing become widely adopted, the laboratory mouse is more important than ever as a model system for understanding the biological significance of human genetic variation and for advancing the basic research needed to support the emergence of genome-guided precision medicine. Recent enhancements to MGD include new graphical summaries of biological annotations for mouse genes, support for mobile access to the database, tools to support the annotation and analysis of sets of genes, and expanded support for comparative biology through the expansion of homology data. PMID:26578600

  5. Mouse genome database 2016.

    PubMed

    Bult, Carol J; Eppig, Janan T; Blake, Judith A; Kadin, James A; Richardson, Joel E

    2016-01-04

    The Mouse Genome Database (MGD; http://www.informatics.jax.org) is the primary community model organism database for the laboratory mouse and serves as the source for key biological reference data related to mouse genes, gene functions, phenotypes and disease models with a strong emphasis on the relationship of these data to human biology and disease. As the cost of genome-scale sequencing continues to decrease and new technologies for genome editing become widely adopted, the laboratory mouse is more important than ever as a model system for understanding the biological significance of human genetic variation and for advancing the basic research needed to support the emergence of genome-guided precision medicine. Recent enhancements to MGD include new graphical summaries of biological annotations for mouse genes, support for mobile access to the database, tools to support the annotation and analysis of sets of genes, and expanded support for comparative biology through the expansion of homology data.

  6. Human Genome Annotation

    NASA Astrophysics Data System (ADS)

    Gerstein, Mark

    A central problem for 21st century science is annotating the human genome and making this annotation useful for the interpretation of personal genomes. My talk will focus on annotating the 99% of the genome that does not code for canonical genes, concentrating on intergenic features such as structural variants (SVs), pseudogenes (protein fossils), binding sites, and novel transcribed RNAs (ncRNAs). In particular, I will describe how we identify regulatory sites and variable blocks (SVs) based on processing next-generation sequencing experiments. I will further explain how we cluster together groups of sites to create larger annotations. Next, I will discuss a comprehensive pseudogene identification pipeline, which has enabled us to identify >10K pseudogenes in the genome and analyze their distribution with respect to age, protein family, and chromosomal location. Throughout, I will try to introduce some of the computational algorithms and approaches that are required for genome annotation. Much of this work has been carried out in the framework of the ENCODE, modENCODE, and 1000 genomes projects.

  7. Gene3D: Multi-domain annotations for protein sequence and comparative genome analysis.

    PubMed

    Lees, Jonathan G; Lee, David; Studer, Romain A; Dawson, Natalie L; Sillitoe, Ian; Das, Sayoni; Yeats, Corin; Dessailly, Benoit H; Rentzsch, Robert; Orengo, Christine A

    2014-01-01

    Gene3D (http://gene3d.biochem.ucl.ac.uk) is a database of protein domain structure annotations for protein sequences. Domains are predicted using a library of profile HMMs from 2738 CATH superfamilies. Gene3D assigns domain annotations to Ensembl and UniProt sequence sets including >6000 cellular genomes and >20 million unique protein sequences. This represents an increase of 45% in the number of protein sequences since our last publication. Thanks to improvements in the underlying data and pipeline, we see large increases in the domain coverage of sequences. We have expanded this coverage by integrating Pfam and SUPERFAMILY domain annotations, and we now resolve domain overlaps to provide highly comprehensive composite multi-domain architectures. To make these data more accessible for comparative genome analyses, we have developed novel search algorithms for searching genomes to identify related multi-domain architectures. In addition to providing domain family annotations, we have now developed a pipeline for 3D homology modelling of domains in Gene3D. This has been applied to the human genome and will be rolled out to other major organisms over the next year.

  8. MIPS: analysis and annotation of genome information in 2007.

    PubMed

    Mewes, H W; Dietmann, S; Frishman, D; Gregory, R; Mannhaupt, G; Mayer, K F X; Münsterkötter, M; Ruepp, A; Spannagl, M; Stümpflen, V; Rattei, T

    2008-01-01

    The Munich Information Center for Protein Sequences (MIPS-GSF, Neuherberg, Germany) combines automatic processing of large amounts of sequences with manual annotation of selected model genomes. Due to the massive growth of the available data, the depth of annotation varies widely between independent databases. Also, the criteria for the transfer of information from known to orthologous sequences are diverse. Coping with the task of global in-depth genome annotation has become unfeasible. Therefore, our efforts are dedicated to three levels of annotation: (i) the curation of selected genomes, in particular from fungal and plant taxa (e.g. CYGD, MNCDB, MatDB), (ii) the comprehensive, consistent, automatic annotation employing exhaustive methods for the computation of sequence similarities and sequence-related attributes as well as the classification of individual sequences (SIMAP, PEDANT and FunCat) and (iii) the compilation of manually curated databases for protein interactions based on scrutinized information from the literature to serve as an accepted set of reliable annotated interaction data (MPACT, MPPI, CORUM). All databases and tools described as well as the detailed descriptions of our projects can be accessed through the MIPS web server (http://mips.gsf.de).

  9. The GATO gene annotation tool for research laboratories.

    PubMed

    Fujita, A; Massirer, K B; Durham, A M; Ferreira, C E; Sogayar, M C

    2005-11-01

    Large-scale genome projects have generated a rapidly increasing number of DNA sequences. Therefore, development of computational methods to rapidly analyze these sequences is essential for progress in genomic research. Here we present an automatic annotation system for preliminary analysis of DNA sequences. The gene annotation tool (GATO) is a Bioinformatics pipeline designed to facilitate routine functional annotation and easy access to annotated genes. It was designed in view of the frequent need of genomic researchers to access data pertaining to a common set of genes. In the GATO system, annotation is generated by querying some of the Web-accessible resources and the information is stored in a local database, which keeps a record of all previous annotation results. GATO may be accessed from anywhere through the internet or may be run locally if a large number of sequences are going to be annotated. It is implemented in PHP and Perl and may be run on any suitable Web server. Usually, installation and application of annotation systems require experience and are time consuming, but GATO is simple and practical, allowing anyone with basic skills in informatics to access it without any special training. GATO can be downloaded at [http://mariwork.iq.usp.br/gato/]. The minimum free disk space required is 2 MB.

  10. ProbMetab: an R package for Bayesian probabilistic annotation of LC-MS-based metabolomics.

    PubMed

    Silva, Ricardo R; Jourdan, Fabien; Salvanha, Diego M; Letisse, Fabien; Jamin, Emilien L; Guidetti-Gonzalez, Simone; Labate, Carlos A; Vêncio, Ricardo Z N

    2014-05-01

    We present ProbMetab, an R package that promotes substantial improvement in automatic probabilistic liquid chromatography-mass spectrometry-based metabolome annotation. The inference engine core is based on a Bayesian model implemented to (i) allow diverse sources of experimental data and metadata to be systematically incorporated into the model with alternative ways to calculate the likelihood function and (ii) allow sensitive selection of biologically meaningful biochemical reaction databases as Dirichlet-categorical prior distribution. Additionally, to ensure result interpretation by systems biologists, we display the annotation in a network where observed mass peaks are connected if their candidate metabolites are substrate/product of known biochemical reactions. This graph can be overlaid with other graph-based analyses, such as partial correlation networks, in a visualization scheme exported to Cytoscape, with web and stand-alone versions.
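
    The general idea of combining a prior over candidate compounds with a likelihood from the measured mass can be sketched as follows. This is not ProbMetab's model or API (the package is written in R and uses a Dirichlet-categorical prior). The sketch uses a fixed categorical prior and a Gaussian mass-error likelihood, and the compound names, masses, and `sigma` value are hypothetical.

```python
import math


def posterior(observed_mz, candidates, prior, sigma=0.005):
    """Posterior probability of each candidate compound for one observed mass.

    Assumes a Gaussian mass-error likelihood with standard deviation `sigma`
    and a fixed categorical prior; both are simplifying assumptions.
    """
    unnormalized = {}
    for name, mass in candidates.items():
        likelihood = math.exp(-0.5 * ((observed_mz - mass) / sigma) ** 2)
        unnormalized[name] = prior[name] * likelihood
    total = sum(unnormalized.values())
    return {name: weight / total for name, weight in unnormalized.items()}


# Hypothetical candidates with invented masses.
candidates = {"compound_A": 180.063, "compound_B": 180.070}

# With a uniform prior, the candidate closest in mass is favoured.
uniform = posterior(180.064, candidates, {"compound_A": 0.5, "compound_B": 0.5})

# A prior that strongly favours compound_B (for example, because it takes
# part in reactions with other observed metabolites) can reverse the ranking.
informed = posterior(180.064, candidates, {"compound_A": 0.2, "compound_B": 0.8})
```

    The second call shows why the choice of prior matters: the same mass peak is assigned differently once prior knowledge is taken into account.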

  11. Injectors and Annotations

    NASA Technical Reports Server (NTRS)

    Filman, Robert E.

    2004-01-01

    In a previous paper, we presented the Object Infrastructure Framework. The goal of that system is to simplify the creation of distributed applications. The primary claim of that work is that non-functional 'ilities' could be achieved by controlling and manipulating the communications between components, thereby simplifying the development of distributed systems. A secondary element of that paper is to argue for extending the conventional distributed objects model in two important ways: 1) The ability to insert injectors (filters, wrappers) into the communication path between components; 2) The ability to annotate communications with additional information, and to propagate these annotations through an application. Here we express the descriptions of that paper.

  12. On the Dynamic RSS Feedbacks of Indoor Fingerprinting Databases for Localization Reliability Improvement.

    PubMed

    Wen, Xiaoyang; Tao, Wenyuan; Own, Chung-Ming; Pan, Zhenjiang

    2016-08-15

    Location data is one of the most widely used context data types in context-aware and ubiquitous computing applications. To support locating applications in indoor environments, numerous systems with different deployment costs and positioning accuracies have been developed over the past decade. One useful method, based on received signal strength (RSS), provides a set of signal transmission access points. However, compiling a remeasurement RSS database involves a high cost, which is impractical in dynamically changing environments, particularly in highly crowded areas. In this study, we propose a dynamic estimation resampling method for certain locations chosen from a set of remeasurement fingerprinting databases. Our proposed method adaptively applies different, newly updated and offline fingerprinting points according to the temporal and spatial strength of the location. To achieve accuracy within a simulated area, the proposed method requires approximately 3% of the feedback to attain a double correctness probability comparable to similar methods; in a real environment, our proposed method can obtain excellent 1 m accuracy errors in the positioning system.

  13. On the Dynamic RSS Feedbacks of Indoor Fingerprinting Databases for Localization Reliability Improvement

    PubMed Central

    Wen, Xiaoyang; Tao, Wenyuan; Own, Chung-Ming; Pan, Zhenjiang

    2016-01-01

    Location data is one of the most widely used context data types in context-aware and ubiquitous computing applications. To support locating applications in indoor environments, numerous systems with different deployment costs and positioning accuracies have been developed over the past decade. One useful method, based on received signal strength (RSS), provides a set of signal transmission access points. However, compiling a remeasurement RSS database involves a high cost, which is impractical in dynamically changing environments, particularly in highly crowded areas. In this study, we propose a dynamic estimation resampling method for certain locations chosen from a set of remeasurement fingerprinting databases. Our proposed method adaptively applies different, newly updated and offline fingerprinting points according to the temporal and spatial strength of the location. To achieve accuracy within a simulated area, the proposed method requires approximately 3% of the feedback to attain a double correctness probability comparable to similar methods; in a real environment, our proposed method can obtain excellent 1 m accuracy errors in the positioning system. PMID:27537879

  14. Global Mapping of Traditional Chinese Medicine into Bioactivity Space and Pathways Annotation Improves Mechanistic Understanding and Discovers Relationships between Therapeutic Action (Sub)classes

    PubMed Central

    Mohamad Zobir, Siti Zuraidah; Mohd Fauzi, Fazlin; Liggi, Sonia; Drakakis, Georgios; Fu, Xianjun; Fan, Tai-Ping; Bender, Andreas

    2016-01-01

    Traditional Chinese medicine (TCM) still needs more scientific rationale to be proven for it to be accepted further in the West. We are now in the position to propose computational hypotheses for the modes of action (MOAs) of 45 TCM therapeutic action (sub)classes from in silico target prediction algorithms, whose targets were later annotated with Kyoto Encyclopedia of Genes and Genomes pathways, and to discover the relationships between them by generating a hierarchical clustering. The results of 10,749 TCM compounds showed 183 enriched targets and 99 enriched pathways from Estimation Score ≤ 0 and ≥ 5% of compounds/targets in a (sub)class. The MOA of a (sub)class was established from supporting literature. Overall, the most frequent top three enriched targets/pathways were immune-related targets such as tyrosine-protein phosphatase nonreceptor type 2 (PTPN2) and digestive system pathways such as mineral absorption. We found that two major protein families, the G-protein coupled receptor (GPCR) family and the protein kinase family, contributed to the diversity of the bioactivity space, while the digestive system was a consistently annotated pathway motif, which agreed with the important treatment principle of TCM, “the foundation of acquired constitution” that includes spleen and stomach. In short, the TCM (sub)classes in many cases share similar targets/pathways despite having different indications. PMID:26989424

  15. The Lung TIME: annotated lung nodule dataset and nodule detection framework

    NASA Astrophysics Data System (ADS)

    Dolejsi, Martin; Kybic, Jan; Polovincak, Michal; Tuma, Stanislav

    2009-02-01

    The Lung Test Images from Motol Environment (Lung TIME) is a new publicly available dataset of thoracic CT scans with manually annotated pulmonary nodules. It is larger than other publicly available datasets. Pulmonary nodules are lesions in the lungs, which may indicate lung cancer. Their early detection significantly improves survival rate of patients. Automatic nodule detecting systems using CT scans are being developed to reduce physicians' load and to improve detection quality. Besides presenting our own nodule detection system, in this article, we mainly address the problem of testing and comparison of automatic nodule detection methods. Our publicly available 157 CT scan dataset with 394 annotated nodules contains almost every nodule type (pleura attached, vessel attached, solitary, regular, irregular) with 2-10 mm in diameter, except ground glass opacities (GGO). Annotation was done consensually by two experienced radiologists. The data are in DICOM format, annotations are provided in XML format compatible with the Lung Imaging Database Consortium (LIDC). Our computer aided diagnosis system (CAD) is based on mathematical morphology and filtration with a subsequent classification step. We use an Asymmetric AdaBoost classifier. The system was tested using TIME, LIDC and ANODE09 databases. The performance was evaluated by cross-validation for Lung TIME and LIDC, and using the supplied evaluation procedure for ANODE09. The sensitivity at the chosen working point was 94.27% with 7.57 false positives/slice for TIME and LIDC datasets combined, 94.03% with 5.46 FPs/slice for the Lung TIME, 89.62% sensitivity with 12.03 FPs/slice for LIDC, and 78.68% with 4.61 FPs/slice when applied to ANODE09.

  16. Ontological Annotation with WordNet

    SciTech Connect

    Sanfilippo, Antonio P.; Tratz, Stephen C.; Gregory, Michelle L.; Chappell, Alan R.; Whitney, Paul D.; Posse, Christian; Paulson, Patrick R.; Baddeley, Bob; Hohimer, Ryan E.; White, Amanda M.

    2006-06-06

    Semantic Web applications require robust and accurate annotation tools that are capable of automating the assignment of ontological classes to words in naturally occurring text (ontological annotation). Most current ontologies do not include rich lexical databases and are therefore not easily integrated with word sense disambiguation algorithms that are needed to automate ontological annotation. WordNet provides a potentially ideal solution to this problem as it offers a highly structured lexical conceptual representation that has been extensively used to develop word sense disambiguation algorithms. However, WordNet has not been designed as an ontology, and while it can be easily turned into one, the result of doing this would present users with serious practical limitations due to the great number of concepts (synonym sets) it contains. Moreover, mapping WordNet to an existing ontology may be difficult and requires substantial labor. We propose to overcome these limitations by developing an analytical platform that (1) provides a WordNet-based ontology offering a manageable and yet comprehensive set of concept classes, (2) leverages the lexical richness of WordNet to give an extensive characterization of concept class in terms of lexical instances, and (3) integrates a class recognition algorithm that automates the assignment of concept classes to words in naturally occurring text. The ensuing framework makes available an ontological annotation platform that can be effectively integrated with intelligence analysis systems to facilitate evidence marshaling and sustain the creation and validation of inference models.

  17. Automating Ontological Annotation with WordNet

    SciTech Connect

    Sanfilippo, Antonio P.; Tratz, Stephen C.; Gregory, Michelle L.; Chappell, Alan R.; Whitney, Paul D.; Posse, Christian; Paulson, Patrick R.; Baddeley, Bob L.; Hohimer, Ryan E.; White, Amanda M.

    2006-01-22

    Semantic Web applications require robust and accurate annotation tools that are capable of automating the assignment of ontological classes to words in naturally occurring text (ontological annotation). Most current ontologies do not include rich lexical databases and are therefore not easily integrated with word sense disambiguation algorithms that are needed to automate ontological annotation. WordNet provides a potentially ideal solution to this problem as it offers a highly structured lexical conceptual representation that has been extensively used to develop word sense disambiguation algorithms. However, WordNet has not been designed as an ontology, and while it can be easily turned into one, the result of doing this would present users with serious practical limitations due to the great number of concepts (synonym sets) it contains. Moreover, mapping WordNet to an existing ontology may be difficult and requires substantial labor. We propose to overcome these limitations by developing an analytical platform that (1) provides a WordNet-based ontology offering a manageable and yet comprehensive set of concept classes, (2) leverages the lexical richness of WordNet to give an extensive characterization of concept class in terms of lexical instances, and (3) integrates a class recognition algorithm that automates the assignment of concept classes to words in naturally occurring text. The ensuing framework makes available an ontological annotation platform that can be effectively integrated with intelligence analysis systems to facilitate evidence marshaling and sustain the creation and validation of inference models.

  18. The OMA orthology database in 2015: function predictions, better plant support, synteny view and other improvements.

    PubMed

    Altenhoff, Adrian M; Škunca, Nives; Glover, Natasha; Train, Clément-Marie; Sueki, Anna; Piližota, Ivana; Gori, Kevin; Tomiczek, Bartlomiej; Müller, Steven; Redestig, Henning; Gonnet, Gaston H; Dessimoz, Christophe

    2015-01-01

    The Orthologous Matrix (OMA) project is a method and associated database inferring evolutionary relationships amongst currently 1706 complete proteomes (i.e. the protein sequence associated for every protein-coding gene in all genomes). In this update article, we present six major new developments in OMA: (i) a new web interface; (ii) Gene Ontology function predictions as part of the OMA pipeline; (iii) better support for plant genomes and in particular homeologs in the wheat genome; (iv) a new synteny viewer providing the genomic context of orthologs; (v) statically computed hierarchical orthologous groups subsets downloadable in OrthoXML format; and (vi) possibility to export parts of the all-against-all computations and to combine them with custom data for 'client-side' orthology prediction. OMA can be accessed through the OMA Browser and various programmatic interfaces at http://omabrowser.org.

  19. Improving the Analysis, Storage and Sharing of Neuroimaging Data using Relational Databases and Distributed Computing

    PubMed Central

    Hasson, Uri; Skipper, Jeremy I.; Wilde, Michael J.; Nusbaum, Howard C.; Small, Steven L.

    2007-01-01

    The increasingly complex research questions addressed by neuroimaging research impose substantial demands on computational infrastructures. These infrastructures need to support management of massive amounts of data in a way that affords rapid and precise data analysis, to allow collaborative research, and to achieve these aims securely and with minimum management overhead. Here we present an approach that overcomes many current limitations in data analysis and data sharing. This approach is based on open source database management systems that support complex data queries as an integral part of data analysis, flexible data sharing, and parallel and distributed data processing using cluster computing and Grid computing resources. We assess the strengths of these approaches as compared to current frameworks based on storage of binary or text files. We then describe in detail the implementation of such a system and provide a concrete description of how it was used to enable a complex analysis of fMRI time series data. PMID:17964812

  20. Improving the analysis, storage and sharing of neuroimaging data using relational databases and distributed computing.

    PubMed

    Hasson, Uri; Skipper, Jeremy I; Wilde, Michael J; Nusbaum, Howard C; Small, Steven L

    2008-01-15

    The increasingly complex research questions addressed by neuroimaging research impose substantial demands on computational infrastructures. These infrastructures need to support management of massive amounts of data in a way that affords rapid and precise data analysis, to allow collaborative research, and to achieve these aims securely and with minimum management overhead. Here we present an approach that overcomes many current limitations in data analysis and data sharing. This approach is based on open source database management systems that support complex data queries as an integral part of data analysis, flexible data sharing, and parallel and distributed data processing using cluster computing and Grid computing resources. We assess the strengths of these approaches as compared to current frameworks based on storage of binary or text files. We then describe in detail the implementation of such a system and provide a concrete description of how it was used to enable a complex analysis of fMRI time series data.

  1. A Web 2.0-Based Collaborative Annotation System for Enhancing Knowledge Sharing in Collaborative Learning Environments

    ERIC Educational Resources Information Center

    Su, Addison Y. S.; Yang, Stephen J. H.; Hwang, Wu-Yuin; Zhang, Jia

    2010-01-01

    A limitation of current Web-based collaborative learning is the restricted ability of students to create and share individual annotations with annotated documents. Applying Web 2.0 collaborative annotation systems and analyzing students' annotation behavior has attracted attention to improve collaborative learning. This study designed a…

  2. Annotated Bibliography. First Edition.

    ERIC Educational Resources Information Center

    Haring, Norris G.

    An annotated bibliography which presents approximately 300 references from 1951 to 1973 on the education of severely/profoundly handicapped persons. Citations are grouped alphabetically by author's name within the following categories: characteristics and treatment, gross motor development, sensory and motor development, physical therapy for the…

  3. Ghostwriting: An Annotated Bibliography.

    ERIC Educational Resources Information Center

    Simmons, Donald B.

    Drawn from communication journals, historical and news magazines, business and industrial magazines, political science and world affairs journals, general interest periodicals, and literary and political review magazines, the approximately 90 entries in this annotated bibliography discuss ghostwriting as practiced through the ages and reveal the…

  4. Automated Microbial Genome Annotation

    SciTech Connect

    Land, Miriam

    2009-05-29

    Miriam Land of the DOE Joint Genome Institute at Oak Ridge National Laboratory gives a talk on the current state and future challenges of moving toward automated microbial genome annotation at the "Sequencing, Finishing, Analysis in the Future" meeting in Santa Fe, NM.

  5. Cheating. An Annotated Bibliography.

    ERIC Educational Resources Information Center

    Wildemuth, Barbara M., Comp.

    This 89-item, annotated bibliography was compiled to provide access to research and discussions of cheating and, specifically, cheating on tests. It is not limited to any educational level, nor is it confined to any specific curriculum area. Two data bases were searched by computer, and a library search was conducted. A computer search of the…

  6. Current and future trends in marine image annotation software

    NASA Astrophysics Data System (ADS)

    Gomes-Pereira, Jose Nuno; Auger, Vincent; Beisiegel, Kolja; Benjamin, Robert; Bergmann, Melanie; Bowden, David; Buhl-Mortensen, Pal; De Leo, Fabio C.; Dionísio, Gisela; Durden, Jennifer M.; Edwards, Luke; Friedman, Ariell; Greinert, Jens; Jacobsen-Stout, Nancy; Lerner, Steve; Leslie, Murray; Nattkemper, Tim W.; Sameoto, Jessica A.; Schoening, Timm; Schouten, Ronald; Seager, James; Singh, Hanumant; Soubigou, Olivier; Tojeira, Inês; van den Beld, Inge; Dias, Frederico; Tempera, Fernando; Santos, Ricardo S.

    2016-12-01

    Given the need to describe, analyze and index large quantities of marine imagery data for exploration and monitoring activities, a range of specialized image annotation tools have been developed worldwide. Image annotation, the process of transposing objects or events represented in a video or still image to the semantic level, may involve human interactions and computer-assisted solutions. Marine image annotation software (MIAS) have enabled over 500 publications to date. We review the functioning, application trends and developments, by comparing general and advanced features of 23 different tools utilized in underwater image analysis. MIAS requiring human input are basically a graphical user interface, with a video player or image browser that recognizes a specific time code or image code, allowing events to be logged in a time-stamped (and/or geo-referenced) manner. MIAS differ from similar software by the capability of integrating data associated with video collection, the simplest being the position coordinates of the video recording platform. MIAS have three main characteristics: annotating events in real time, annotating posteriorly, and interacting with a database. These range from simple annotation interfaces, to full onboard data management systems, with a variety of toolboxes. Advanced packages allow input and display of data from multiple sensors or multiple annotators via intranet or internet. Posterior human-mediated annotation often includes tools for data display and image analysis, e.g. length, area, image segmentation, point count; and in a few cases the possibility of browsing and editing previous dive logs or analyzing the annotations. The interaction with a database allows the automatic integration of annotations from different surveys, repeated annotation and collaborative annotation of shared datasets, browsing and querying of data. Progress in the field of automated annotation is mostly in post processing, for stable platforms or still images

  7. NOAA's Integrated Tsunami Database: Data for improved forecasts, warnings, research, and risk assessments

    NASA Astrophysics Data System (ADS)

    Stroker, Kelly; Dunbar, Paula; Mungov, George; Sweeney, Aaron; McCullough, Heather; Carignan, Kelly

    2015-04-01

    The National Oceanic and Atmospheric Administration (NOAA) has primary responsibility in the United States for tsunami forecasting, warning, and research, and supports community resiliency. NOAA's National Geophysical Data Center (NGDC) and co-located World Data Service for Geophysics provide a unique collection of data enabling communities to ensure preparedness and resilience to tsunami hazards. Immediately following a damaging or fatal tsunami event there is a need for authoritative data and information. The NGDC Global Historical Tsunami Database (http://www.ngdc.noaa.gov/hazard/) includes all tsunami events, regardless of intensity, as well as earthquakes and volcanic eruptions that caused fatalities, moderate damage, or generated a tsunami. The long-term data from these events, including photographs of damage, provide clues to what might happen in the future. NGDC catalogs the information on global historical tsunamis and uses these data to produce qualitative tsunami hazard assessments at regional levels. In addition to the socioeconomic effects of a tsunami, NGDC also obtains water level data from the coasts and the deep-ocean at stations operated by the NOAA/NOS Center for Operational Oceanographic Products and Services, the NOAA Tsunami Warning Centers, and the National Data Buoy Center (NDBC) and produces research-quality data to isolate seismic waves (in the case of the deep-ocean sites) and the tsunami signal. These water-level data provide evidence of sea-level fluctuation and possible inundation events. NGDC is also building high-resolution digital elevation models (DEMs) to support real-time forecasts, implemented at 75 US coastal communities. After a damaging or fatal event NGDC begins to collect and integrate data and information from many organizations into the hazards databases. Sources of data include our NOAA partners, the U.S. Geological Survey, the UNESCO Intergovernmental Oceanographic Commission (IOC) and International Tsunami Information Center

  8. IMPROVING EMISSIONS ESTIMATES WITH COMPUTATIONAL INTELLIGENCE, DATABASE EXPANSION, AND COMPREHENSIVE VALIDATION

    EPA Science Inventory

    The report discusses an EPA investigation of techniques to improve methods for estimating volatile organic compound (VOC) emissions from area sources. Using the automobile refinishing industry for a detailed area source case study, an emission estimation method is being developed...

  9. Similarity landscapes: An improved method for scientific visualization of information from protein and DNA database searches

    SciTech Connect

    Dogget, N.; Myers, G.; Wills, C.J.

    1998-12-01

    This is the final report of a three-year, Laboratory Directed Research and Development (LDRD) project at the Los Alamos National Laboratory (LANL). The authors have used computer simulations and examination of a variety of databases to address a wide range of evolutionary questions. The authors have found that there is a clear distinction in the evolution of HIV-1 and HIV-2, with the former and more virulent virus evolving more rapidly at a functional level. The authors have discovered highly non-random patterns in the evolution of HIV-1 that can be attributed to a variety of selective pressures. In the course of examination of microsatellite DNA (short repeat regions) in microorganisms, the authors have found clear differences between prokaryotes and eukaryotes in their distribution, differences that can be tied to different selective pressures. They have developed a new method (topiary pruning) for enhancing the phylogenetic information contained in DNA sequences. Most recently, the authors have discovered effects in complex rainforest ecosystems that indicate strong frequency-dependent interactions between host species and their parasites, leading to the maintenance of ecosystem variability.

  10. AutoFACT: An Automatic Functional Annotation and Classification Tool

    PubMed Central

    Koski, Liisa B; Gray, Michael W; Lang, B Franz; Burger, Gertraud

    2005-01-01

    Background: Assignment of function to new molecular sequence data is an essential step in genomics projects. The usual process involves similarity searches of a given sequence against one or more databases, an arduous process for large datasets. Results: We present AutoFACT, a fully automated and customizable annotation tool that assigns biologically informative functions to a sequence. Key features of this tool are that it (1) analyzes nucleotide and protein sequence data; (2) determines the most informative functional description by combining multiple BLAST reports from several user-selected databases; (3) assigns putative metabolic pathways, functional classes, enzyme classes, GeneOntology terms and locus names; and (4) generates output in HTML, text and GFF formats for the user's convenience. We have compared AutoFACT to four well-established annotation pipelines. The error rate of functional annotation is estimated to be only between 1–2%. Comparison of AutoFACT to the traditional top-BLAST-hit annotation method shows that our procedure increases the number of functionally informative annotations by approximately 50%. Conclusion: AutoFACT will serve as a useful annotation tool for smaller sequencing groups lacking dedicated bioinformatics staff. It is implemented in PERL and runs on LINUX/UNIX platforms. AutoFACT is available at . PMID:15960857
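    The idea of preferring an informative description over the plain top BLAST hit can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not AutoFACT's actual decision rules (the tool itself is implemented in PERL, and the abstract does not give its rules): the list of uninformative terms, the bit-score cutoff, the database names, and the function name are all hypothetical.

```python
# Hypothetical list of terms that mark a description as uninformative.
UNINFORMATIVE = ("hypothetical", "unknown", "unnamed", "uncharacterized")


def pick_description(reports, min_bitscore=40.0):
    """Return (database, description) for the first informative hit.

    reports: list of (database_name, hits) in user-chosen priority order,
             where hits is a list of (description, bitscore) sorted best-first.
    Hits below min_bitscore (an assumed cutoff) or whose description
    contains an uninformative term are skipped. Returns None if nothing
    qualifies.
    """
    for database, hits in reports:
        for description, bitscore in hits:
            if bitscore < min_bitscore:
                continue
            if any(term in description.lower() for term in UNINFORMATIVE):
                continue
            return database, description
    return None


# Hypothetical reports for one query sequence from two databases.
reports = [
    ("db_a", [("hypothetical protein", 120.0), ("ATP synthase subunit", 30.0)]),
    ("db_b", [("DNA polymerase III", 85.0)]),
]
best = pick_description(reports)
```

    In this made-up example the top hit in the first database is skipped as uninformative and the weaker hit falls below the cutoff, so the description comes from the second database, which is the kind of case where a top-BLAST-hit approach would return nothing useful.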

  11. Automated analysis and annotation of basketball video

    NASA Astrophysics Data System (ADS)

    Saur, Drew D.; Tan, Yap-Peng; Kulkarni, Sanjeev R.; Ramadge, Peter J.

    1997-01-01

    Automated analysis and annotation of video sequences are important for digital video libraries, content-based video browsing and data mining projects. A successful video annotation system should provide users with a useful video content summary in a reasonable processing time. Given the wide variety of video genres available today, automatically extracting meaningful video content for annotation still remains hard using currently available techniques. However, a wide range of video has inherent structure such that some prior knowledge about the video content can be exploited to improve our understanding of the high-level video semantic content. In this paper, we develop tools and techniques for analyzing structured video by using the low-level information available directly from MPEG compressed video. Being able to work directly in the video compressed domain can greatly reduce the processing time and enhance storage efficiency. As a testbed, we have developed a basketball annotation system which combines the low-level information extracted from the MPEG stream with the prior knowledge of basketball video structure to provide high-level content analysis, annotation and browsing for events such as wide-angle and close-up views, fast breaks, steals, potential shots, number of possessions and possession times. We expect our approach can also be extended to structured video in other domains.

  12. Nutrition & Adolescent Pregnancy: A Selected Annotated Bibliography.

    ERIC Educational Resources Information Center

    National Agricultural Library (USDA), Washington, DC.

    This annotated bibliography on nutrition and adolescent pregnancy is intended to be a source of technical assistance for nurses, nutritionists, physicians, educators, social workers, and other personnel concerned with improving the health of teenage mothers and their babies. It is divided into two major sections. The first section lists selected…

  13. Semia: semi-automatic interactive graphic editing tool to annotate ambulatory ECG records.

    PubMed

    Dorn, Roman; Jager, Franc

    2004-09-01

    We designed and developed Semia, a special-purpose semi-automatic interactive graphic editing tool, to annotate transient ischaemic ST segment episodes and other non-ischaemic ST segment events in 24 h ambulatory electrocardiogram (ECG) records. The tool allows representation and viewing of the data, interaction with the data globally and locally at different resolutions, examining data at any point, manual adjustment of heart-beat fiducial points, and manual and automatic editing of annotations. Efficient and fast display of ambulatory ECG signal waveforms, display of diagnostic and morphology feature-vector time-series, dynamic interface controls, and automated procedures to help annotate made the tool efficient, user friendly and usable. Human expert annotators used the Semia tool to successfully annotate the Long-Term ST database (LTST DB), a result of a multinational effort. The tool supported paperless editing of annotations at geographically dispersed sites. We present the design, characteristic "look and feel", functionality, and development of the Semia annotating tool.

  14. Improving the Mapping of Smith-Waterman Sequence Database Searches onto CUDA-Enabled GPUs

    PubMed Central

    Huang, Liang-Tsung; Wu, Chao-Chin; Lai, Lien-Fu; Li, Yun-Ju

    2015-01-01

    Sequence alignment lies at the heart of bioinformatics. The Smith-Waterman algorithm is one of the key sequence search algorithms and has gained popularity due to improved implementations and rapidly increasing compute power. Recently, the Smith-Waterman algorithm has been successfully mapped onto the emerging general-purpose graphics processing units (GPUs). In this paper, we focused on how to improve the mapping, especially for short query sequences, by better usage of shared memory. We implemented and evaluated the proposed method on two different platforms (Tesla C1060 and Tesla K20) and compared it with two classic methods in CUDASW++. Further, the performance on different numbers of threads and blocks has been analyzed. The results showed that the proposed method significantly improves the Smith-Waterman algorithm on CUDA-enabled GPUs given a proper allocation of block and thread numbers. PMID:26339591
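    For readers unfamiliar with the algorithm this record builds on, the Smith-Waterman recurrence can be sketched on the CPU as follows. This is a minimal Python illustration, not the paper's GPU mapping and not CUDASW++ code; it assumes a linear gap penalty for brevity, and the scoring values (match, mismatch, gap) are arbitrary illustrative choices.

```python
def smith_waterman_score(query, target, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between two sequences.

    Uses the recurrence
        H[i][j] = max(0, H[i-1][j-1] + s(q_i, t_j), H[i-1][j] + gap, H[i][j-1] + gap)
    with a linear gap penalty, keeping only the previous row in memory.
    """
    prev = [0] * (len(target) + 1)  # row i-1 of the score matrix
    best = 0
    for q in query:
        curr = [0]  # column 0 is always 0
        for j, t in enumerate(target, start=1):
            diag = prev[j - 1] + (match if q == t else mismatch)
            up = prev[j] + gap        # gap in the target
            left = curr[j - 1] + gap  # gap in the query
            cell = max(0, diag, up, left)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


# The shared substring "ACG" gives three matches at +2 each.
score = smith_waterman_score("TTACGTT", "GGACGGG")
```

    Each cell depends only on its upper, left, and upper-left neighbours, so cells along an anti-diagonal are independent of one another, and separate database sequences can be scored independently; GPU implementations exploit this kind of parallelism.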

  15. Apollo: a sequence annotation editor.

    PubMed

    Lewis, S E; Searle, S M J; Harris, N; Gibson, M; Lyer, V; Richter, J; Wiel, C; Bayraktaroglu, L; Birney, E; Crosby, M A; Kaminker, J S; Matthews, B B; Prochnik, S E; Smithy, C D; Tupy, J L; Rubin, G M; Misra, S; Mungall, C J; Clamp, M E

    2002-01-01

    The well-established inaccuracy of purely computational methods for annotating genome sequences necessitates an interactive tool to allow biological experts to refine these approximations by viewing and independently evaluating the data supporting each annotation. Apollo was developed to meet this need, enabling curators to inspect genome annotations closely and edit them. FlyBase biologists successfully used Apollo to annotate the Drosophila melanogaster genome and it is increasingly being used as a starting point for the development of customized annotation editing tools for other genome projects.

  16. Annotations of Mexican bullfighting videos for semantic index

    NASA Astrophysics Data System (ADS)

    Montoya Obeso, Abraham; Oropesa Morales, Lester Arturo; Fernando Vázquez, Luis; Cocolán Almeda, Sara Ivonne; Stoian, Andrei; García Vázquez, Mireya Saraí; Zamudio Fuentes, Luis Miguel; Montiel Perez, Jesús Yalja; de la O Torres, Saul; Ramírez Acosta, Alejandro Alvaro

    2015-09-01

    Video annotation is important for web indexing and browsing systems. Indeed, in order to evaluate the performance of video query and mining techniques, databases with concept annotations are required. Therefore, it is necessary to generate a database with a semantic indexing that represents the digital content of the Mexican bullfighting atmosphere. This paper proposes a scheme to make complex annotations in a video within the framework of a multimedia search engine project. Each video is partitioned using our segmentation algorithm, which creates shots of different lengths and different numbers of frames. In order to make complex annotations about the video, we use ELAN software. The annotations are done in two steps: First, we take note of the whole content in each shot. Second, we describe the actions as parameters of the camera such as direction, position and depth. As a consequence, we obtain a more complete descriptor of every action. In both cases we use the concepts of the TRECVid 2014 dataset. We also propose new concepts. This methodology makes it possible to generate a database with the necessary information to create descriptors and algorithms capable of detecting actions to automatically index and classify new bullfighting multimedia content.

  17. Fuzzy Emotional Semantic Analysis and Automated Annotation of Scene Images

    PubMed Central

    Cao, Jianfang; Chen, Lichao

    2015-01-01

    With the advances in electronic and imaging techniques, the production of digital images has rapidly increased, and the extraction and automated annotation of emotional semantics implied by images have become issues that must be urgently addressed. To better simulate human subjectivity and ambiguity for understanding scene images, the current study proposes an emotional semantic annotation method for scene images based on fuzzy set theory. A fuzzy membership degree was calculated to describe the emotional degree of a scene image and was implemented using the Adaboost algorithm and a back-propagation (BP) neural network. The automated annotation method was trained and tested using scene images from the SUN Database. The annotation results were then compared with those based on artificial annotation. Our method showed an annotation accuracy rate of 91.2% for basic emotional values and 82.4% after extended emotional values were added, which correspond to increases of 5.5% and 8.9%, respectively, compared with the results from using a single BP neural network algorithm. Furthermore, the retrieval accuracy rate based on our method reached approximately 89%. This study attempts to lay a solid foundation for the automated emotional semantic annotation of more types of images and therefore is of practical significance. PMID:25838818

  18. A Conceptual Model for Describing Processes of Crop Improvement in Database Structures

    Technology Transfer Automated Retrieval System (TEKTRAN)

    Rising research costs, broadening research goals, intellectual property rights, and other concerns have increased the need for robust approaches to manage data from crop improvement. In developing the International Crop Information System (ICIS), a flexible data model was developed to allow any conc...

  19. Managing and Querying Image Annotation and Markup in XML

    PubMed Central

    Wang, Fusheng; Pan, Tony; Sharma, Ashish; Saltz, Joel

    2010-01-01

    Proprietary approaches for representing annotations and image markup are serious barriers for researchers to share image data and knowledge. The Annotation and Image Markup (AIM) project is developing a standard based information model for image annotation and markup in health care and clinical trial environments. The complex hierarchical structures of AIM data model pose new challenges for managing such data in terms of performance and support of complex queries. In this paper, we present our work on managing AIM data through a native XML approach, and supporting complex image and annotation queries through native extension of XQuery language. Through integration with xService, AIM databases can now be conveniently shared through caGrid. PMID:21218167

  20. Managing and Querying Image Annotation and Markup in XML.

    PubMed

    Wang, Fusheng; Pan, Tony; Sharma, Ashish; Saltz, Joel

    2010-01-01

    Proprietary approaches for representing annotations and image markup are serious barriers for researchers to share image data and knowledge. The Annotation and Image Markup (AIM) project is developing a standard based information model for image annotation and markup in health care and clinical trial environments. The complex hierarchical structures of AIM data model pose new challenges for managing such data in terms of performance and support of complex queries. In this paper, we present our work on managing AIM data through a native XML approach, and supporting complex image and annotation queries through native extension of XQuery language. Through integration with xService, AIM databases can now be conveniently shared through caGrid.

  1. Analysis and Annotation of Nucleic Acid Sequence

    SciTech Connect

    States, David J.

    2004-07-28

    The aims of this project were to develop improved methods for computational genome annotation and to apply these methods to improve the annotation of genomic sequence data with a specific focus on human genome sequencing. The project resulted in a substantial body of published work. Notable contributions of this project were the identification of basecalling and lane tracking as error processes in genome sequencing and contributions to improved methods for these steps in genome sequencing. This technology improved the accuracy and throughput of genome sequence analysis. Probabilistic methods for physical map construction were developed. Improved methods for sequence alignment, alternative splicing analysis, promoter identification and NF kappa B response gene prediction were also developed.

  2. MIPS: analysis and annotation of proteins from whole genomes.

    PubMed

    Mewes, H W; Amid, C; Arnold, R; Frishman, D; Güldener, U; Mannhaupt, G; Münsterkötter, M; Pagel, P; Strack, N; Stümpflen, V; Warfsmann, J; Ruepp, A

    2004-01-01

    The Munich Information Center for Protein Sequences (MIPS-GSF), Neuherberg, Germany, provides protein sequence-related information based on whole-genome analysis. The main focus of the work is directed toward the systematic organization of sequence-related attributes as gathered by a variety of algorithms, primary information from experimental data together with information compiled from the scientific literature. MIPS maintains automatically generated and manually annotated genome-specific databases, develops systematic classification schemes for the functional annotation of protein sequences and provides tools for the comprehensive analysis of protein sequences. This report updates the information on the yeast genome (CYGD), the Neurospora crassa genome (MNCDB), the database of complete cDNAs (German Human Genome Project, NGFN), the database of mammalian protein-protein interactions (MPPI), the database of FASTA homologies (SIMAP), and the interface for the fast retrieval of protein-associated information (QUIPOS). The Arabidopsis thaliana database, the rice database, the plant EST databases (MATDB, MOsDB, SPUTNIK), as well as the databases for the comprehensive set of genomes (PEDANT genomes) are described elsewhere in the 2003 and 2004 NAR database issues, respectively. All databases described, and the detailed descriptions of our projects can be accessed through the MIPS web server (http://mips.gsf.de).

  3. AGeS: A Software System for Microbial Genome Sequence Annotation

    DTIC Science & Technology

    2011-03-01

    source genome browser GBrowse [4] and a link to download annotated genomes in GenBank [14] format; and N free availability of the source code. Methods... ORPHEUS [43], WUBLAST [44], and FASTA [45] for predicting protein-coding regions and some InterPro databases for function annotation. In addition, Y

  4. A Radiocarbon Database for Improved Understanding of Global Soil Carbon Dynamics: Part II

    NASA Astrophysics Data System (ADS)

    Trumbore, S.; Torn, M. S.; Sierra, C. A.; Smith, L. J.; Nave, L. E.; Workshop Participants, R.

    2011-12-01

    We report results of a workshop to initiate a global database of radiocarbon measurements in soil and other ecosystem compartments. Radiocarbon provides critical information for understanding the rate of exchange of soil carbon with the atmosphere and hydrosphere. For example, radiocarbon has been used to demonstrate the importance of short range order minerals in stabilizing organic carbon on millennial timescales in some soils. On decadal to centennial timescales, the infiltration of 'bomb' radiocarbon provides a measure of the amount and nature of soil carbon that responds on the timescale of most human impacts. The radiocarbon signature of chemically or physically fractionated soil, or even of specific organic compounds, can yield clues as to controls on organic matter cycling on a range of timescales. Radiocarbon in microbial biomass or respiration can be a sensitive indicator of shifts in substrate use with vegetation, nutrient availability or temperature change. Taken together, such measurements can provide critical tests for models of soil carbon dynamics, while patterns in soil C dynamics with edaphic factors can be used to help parameterize models at spatial scales ranging from profile to landscape to global. The advent and proliferation of accelerator mass spectrometry since the early 1990s has vastly increased the number of radiocarbon analyses carried out in soils. However, these studies have usually been carried out by individual investigators within specific sites or regions, and to date the results have not been assembled, interpreted or compared at larger spatial scales. Given the expense of radiocarbon measurements, and the need for global synthesis products to evaluate and/or develop models of soil carbon response to climate and land use changes across a range of spatial scales, our goals are to: (1) bring together in one place existing radiocarbon measurements and provide a continuing common repository for new analyses; (2) supply ancillary

  5. Reflective Database Access Control

    ERIC Educational Resources Information Center

    Olson, Lars E.

    2009-01-01

    "Reflective Database Access Control" (RDBAC) is a model in which a database privilege is expressed as a database query itself, rather than as a static privilege contained in an access control list. RDBAC aids the management of database access controls by improving the expressiveness of policies. However, such policies introduce new interactions…

  6. Rfam 12.0: updates to the RNA families database

    PubMed Central

    Nawrocki, Eric P.; Burge, Sarah W.; Bateman, Alex; Daub, Jennifer; Eberhardt, Ruth Y.; Eddy, Sean R.; Floden, Evan W.; Gardner, Paul P.; Jones, Thomas A.; Tate, John; Finn, Robert D.

    2015-01-01

    The Rfam database (available at http://rfam.xfam.org) is a collection of non-coding RNA families represented by manually curated sequence alignments, consensus secondary structures and annotation gathered from corresponding Wikipedia, taxonomy and ontology resources. In this article, we detail updates and improvements to the Rfam data and website for the Rfam 12.0 release. We describe the upgrade of our search pipeline to use Infernal 1.1 and demonstrate its improved homology detection ability by comparison with the previous version. The new pipeline is easier for users to apply to their own data sets, and we illustrate its ability to annotate RNAs in genomic and metagenomic data sets of various sizes. Rfam has been expanded to include 260 new families, including the well-studied large subunit ribosomal RNA family, and for the first time includes information on short sequence- and structure-based RNA motifs present within families. PMID:25392425

  7. Effects of Annotations and Homework on Learning Achievement: An Empirical Study of Scratch Programming Pedagogy

    ERIC Educational Resources Information Center

    Su, Addison Y. S.; Huang, Chester S. J.; Yang, Stephen J. H.; Ding, T. J.; Hsieh, Y. Z.

    2015-01-01

    In Taiwan elementary schools, Scratch programming has been taught for more than four years. Previous studies have shown that personal annotation is a useful learning method that improves learning performance. An annotation-based Scratch programming (ASP) system provides for the creation, sharing, and review of annotations and homework solutions in…

  8. Automated Update, Revision, and Quality Control of the Maize Genome Annotations Using MAKER-P Improves the B73 RefGen_v3 Gene Models and Identifies New Genes

    PubMed Central

    Law, MeiYee; Childs, Kevin L.; Campbell, Michael S.; Stein, Joshua C.; Olson, Andrew J.; Holt, Carson; Panchy, Nicholas; Lei, Jikai; Jiao, Dian; Andorf, Carson M.; Lawrence, Carolyn J.; Ware, Doreen; Shiu, Shin-Han; Sun, Yanni; Jiang, Ning; Yandell, Mark

    2015-01-01

    The large size and relative complexity of many plant genomes make creation, quality control, and dissemination of high-quality gene structure annotations challenging. In response, we have developed MAKER-P, a fast and easy-to-use genome annotation engine for plants. Here, we report the use of MAKER-P to update and revise the maize (Zea mays) B73 RefGen_v3 annotation build (5b+) in less than 3 h using the iPlant Cyberinfrastructure. MAKER-P identified and annotated 4,466 additional, well-supported protein-coding genes not present in the 5b+ annotation build, added additional untranslated regions to 1,393 5b+ gene models, identified 2,647 5b+ gene models that lack any supporting evidence (despite the use of large and diverse evidence data sets), identified 104,215 pseudogene fragments, and created an additional 2,522 noncoding gene annotations. We also describe a method for de novo training of MAKER-P for the annotation of newly sequenced grass genomes. Collectively, these results lead to the 6a maize genome annotation and demonstrate the utility of MAKER-P for rapid annotation, management, and quality control of grasses and other difficult-to-annotate plant genomes. PMID:25384563

  9. neXtA5: accelerating annotation of articles via automated approaches in neXtProt

    PubMed Central

    Mottin, Luc; Gobeill, Julien; Pasche, Emilie; Michel, Pierre-André; Cusin, Isabelle; Gaudet, Pascale; Ruch, Patrick

    2016-01-01

    The rapid increase in the number of published articles poses a challenge for curated databases to remain up-to-date. To help the scientific community and database curators deal with this issue, we have developed an application, neXtA5, which prioritizes the literature for specific curation requirements. Our system, neXtA5, is a curation service composed of three main elements. The first component is a named-entity recognition module, which annotates MEDLINE over some predefined axes. This report focuses on three axes: Diseases, and the Molecular Function and Biological Process sub-ontologies of the Gene Ontology (GO). The automatic annotations are then stored in a local database, BioMed, for each annotation axis. Additional entities such as species and chemical compounds are also identified. The second component is an existing search engine, which retrieves the most relevant MEDLINE records for any given query. The third component uses the content of BioMed to generate an axis-specific ranking, which takes into account the density of named entities as stored in the BioMed database. The two ranked lists are ultimately merged using a linear combination, which has been specifically tuned to support the annotation of each axis. The fine-tuning of the coefficients is formally reported for each axis-driven search. Compared with PubMed, which is the system used by most curators, the improvement is the following: +231% for Diseases, +236% for Molecular Functions and +3153% for Biological Process when measuring the precision of the top-returned PMID (P0 or mean reciprocal rank). The current search methods significantly improve the search effectiveness of curators for three important curation axes. Further experiments are being performed to extend the curation types, in particular protein–protein interactions, which require specific relationship extraction capabilities. In parallel, user-friendly interfaces powered with a set of JSON web services are currently being
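
    The merging step described in this abstract (two ranked lists combined linearly) can be illustrated with a minimal sketch. This is not the neXtA5 implementation: the function name, the score dictionaries, the PMIDs and the weight value below are all hypothetical, and the abstract does not state the tuned coefficients.

```python
def merge_rankings(search_scores, axis_scores, weight):
    """Merge two score tables by a linear combination.

    search_scores / axis_scores: dict mapping a PMID string to a score
    (hypothetical inputs standing in for the search-engine ranking and
    the axis-specific ranking).
    weight: share given to the search-engine score, between 0 and 1
    (an illustrative parameter, not a value from the paper).
    """
    pmids = set(search_scores) | set(axis_scores)
    combined = {
        pmid: weight * search_scores.get(pmid, 0.0)
        + (1.0 - weight) * axis_scores.get(pmid, 0.0)
        for pmid in pmids
    }
    # Highest combined score first; ties are broken by PMID for determinism.
    return sorted(pmids, key=lambda pmid: (-combined[pmid], pmid))


if __name__ == "__main__":
    search = {"111": 0.9, "222": 0.4}   # made-up scores
    axis = {"222": 1.0, "333": 0.5}     # made-up scores
    print(merge_rankings(search, axis, weight=0.5))
```

    With equal weights, a record that scores moderately in both lists can outrank one that scores highly in only one list, which is the general motivation for combining rankings.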

  10. iPathCons and iPathDB: an improved insect pathway construction tool and the database

    PubMed Central

    Zhang, Zan; Yin, Chuanlin; Liu, Ying; Jie, Wencai; Lei, Wenjie; Li, Fei

    2014-01-01

    Insects are one of the most successful animal groups on earth. Some insects, such as the silkworm and honeybee, are beneficial to humans, whereas others are notorious pests of crops. At present, the genomes of 38 insects have been sequenced and made publicly available. In addition, the transcriptomes of dozens of insects have been sequenced. As gene data rapidly accumulate, constructing the pathway of molecular interactions becomes increasingly important for entomological research. Here, we developed an improved tool, iPathCons, for knowledge-based construction of pathways from the transcriptomes or the official gene sets of genomes. Considering the high evolution diversity in insects, iPathCons uses a voting system for Kyoto Encyclopedia of Genes and Genomes Orthology assignment. Both stand-alone software and a web server of iPathCons are provided. Using iPathCons, we constructed the pathways of molecular interactions of 52 insects, including 37 genome-sequenced and 15 transcriptome-sequenced ones. These pathways are available in the iPathDB, which provides searches, web server, data downloads, etc. This database will be highly useful for the insect research community. Database URL: http://ento.njau.edu.cn/ipath/ PMID:25388589

  11. Evaluation of current methods used to analyze the expression profiles of ABC transporters yields an improved drug-discovery database

    PubMed Central

    Orina, Josiah N.; Calcagno, Anna Maria; Wu, Chung-Pu; Varma, Sudhir; Shih, Joanna; Lin, Min; Eichler, Gabriel; Weinstein, John N.; Pommier, Yves; Ambudkar, Suresh V.; Gottesman, Michael M.; Gillet, Jean-Pierre

    2009-01-01

    The development of multidrug resistance (MDR) to chemotherapy remains a major challenge in the treatment of cancer. Resistance exists against every effective anti-cancer drug and can develop by multiple mechanisms. These mechanisms can act individually or synergistically, leading to multidrug resistance (MDR), in which the cell becomes resistant to a variety of structurally and mechanistically unrelated drugs in addition to the drug initially administered. Although extensive work has been done to characterize MDR mechanisms in vitro, the translation of this knowledge to the clinic has not been successful. Therefore, identifying genes and mechanisms critical to the development of MDR in vivo and establishing a reliable method for analyzing highly homologous genes from small amounts of tissue is fundamental to achieving any significant enhancement in our understanding of multidrug resistance mechanisms and could lead to treatments designed to circumvent it. In this study, we use a previously established database that allows the identification of lead compounds in the early stages of drug discovery that are not ABC transporter substrates. We believe this can serve as a model for appraising the accuracy and sensitivity of current methods used to analyze the expression profiles of ABC transporters. We found two platforms to be superior methods for the analysis of expression profiles of highly homologous gene superfamilies. This study also led to an improved database by revealing previously unidentified substrates for ABCB1, ABCC1 and ABCG2, transporters that contribute to multidrug resistance. PMID:19584229

  12. MEPD: medaka expression pattern database, genes and more

    PubMed Central

    Alonso-Barba, Juan I.; Rahman, Raza-Ur; Wittbrodt, Joachim; Mateo, Juan L.

    2016-01-01

    The Medaka Expression Pattern Database (MEPD; http://mepd.cos.uni-heidelberg.de/) is designed as a repository of medaka expression data for the scientific community. In this update we present two main improvements. First, we have changed the previous clone-centric view for in situ data to a gene-centric view. This is possible because now we have linked all the data present in MEPD to the medaka gene annotation in ENSEMBL. In addition, we have also connected the medaka genes in MEPD to their corresponding orthologous gene in zebrafish, again using the ENSEMBL database. Based on this, we provide a link to the Zebrafish Model Organism Database (ZFIN) to allow researchers to compare expression data between these two fish model organisms. As a second major improvement, we have modified the design of the database to enable it to host regulatory elements, promoters or enhancers, expression patterns in addition to gene expression. The combination of gene expression, by traditional in situ, and regulatory element expression, typically by fluorescence reporter gene, within the same platform assures consistency in terms of annotation. In our opinion, this will allow researchers to uncover new insights between the expression domain of genes and their regulatory landscape. PMID:26450962

  13. The Male Sex Role: A Selected and Annotated Bibliography.

    ERIC Educational Resources Information Center

    Grady, Kathleen E.; And Others

    This bibliography, containing more than 250 entries, presents research and theoretical perspectives into the male sex role. Articles were chosen for their usefulness to researchers, with emphasis on scientific and data-based research literature. All the annotations use a standard format including subjects, method, findings and comments. Articles…

  14. Annotating images by mining image search results.

    PubMed

    Wang, Xin-Jing; Zhang, Lei; Li, Xirong; Ma, Wei-Ying

    2008-11-01

    Although it has been studied for years by the computer vision and machine learning communities, image annotation is still far from practical. In this paper, we propose a novel attempt at model-free image annotation, which is a data-driven approach that annotates images by mining their search results. Some 2.4 million images with their surrounding text are collected from a few photo forums to support this approach. The entire process is formulated in a divide-and-conquer framework where a query keyword is provided along with the uncaptioned image to improve both the effectiveness and efficiency. This is helpful when the collected data set is not dense everywhere. In this sense, our approach contains three steps: 1) the search process to discover visually and semantically similar search results, 2) the mining process to identify salient terms from textual descriptions of the search results, and 3) the annotation rejection process to filter out noisy terms yielded by Step 2. To ensure real-time annotation, two key techniques are leveraged: one is to map the high-dimensional image visual features into hash codes, the other is to implement it as a distributed system, of which the search and mining processes are provided as Web services. As a typical result, the entire process finishes in less than 1 second. Since no training data set is required, our approach enables annotating with unlimited vocabulary and is highly scalable and robust to outliers. Experimental results on both real Web images and a benchmark image data set show the effectiveness and efficiency of the proposed algorithm. It is also worth noting that, although the entire approach is illustrated within the divide-and-conquer framework, a query keyword is not crucial to our current implementation. We provide experimental results to prove this.

  15. An improved tropospheric ozone database retrieved from SCIAMACHY Limb-Nadir-Matching method

    NASA Astrophysics Data System (ADS)

    Jia, Jia; Rozanov, Alexei; Ladstätter-Weißenmayer, Annette; Ebojie, Felix; Rahpoe, Nabiz; Bötel, Stefan; Burrows, John

    2015-04-01

    Tropospheric ozone is one of the most important greenhouse gases and the main component of photochemical smog. It is either transported from the stratosphere or photochemically produced in the troposphere during pollution events that threaten the respiratory system. To investigate the sources and transport mechanisms of tropospheric ozone from a global perspective, the limb-nadir matching (LNM) technique applied with the SCIAMACHY instrument is used to retrieve tropospheric ozone. Because about 90% of ozone is located in the stratosphere and only about 10% can be observed in the troposphere, the use of satellite data requires high-quality nadir and limb data. In this study we show an improvement of SCIAMACHY limb data as well as its influence on tropospheric ozone results. The limb-nadir matching technique is also refined to increase the quality of the tropospheric ozone. The results are validated with ozone sonde measurements.

  16. GHOSTX: an improved sequence homology search algorithm using a query suffix array and a database suffix array.

    PubMed

    Suzuki, Shuji; Kakuta, Masanori; Ishida, Takashi; Akiyama, Yutaka

    2014-01-01

    DNA sequences are translated into protein coding sequences and then further assigned to protein families in metagenomic analyses, because of the need for sensitivity. However, huge amounts of sequence data create the problem that even general homology search analyses using BLASTX become difficult in terms of computational cost. We designed a new homology search algorithm that finds seed sequences based on the suffix arrays of a query and a database, and have implemented it as GHOSTX. GHOSTX achieved approximately 131-165 times acceleration over a BLASTX search at similar levels of sensitivity. GHOSTX is distributed under the BSD 2-clause license and is available for download at http://www.bi.cs.titech.ac.jp/ghostx/. Currently, sequencing technology continues to improve, and sequencers are increasingly producing larger and larger quantities of data. This explosion of sequence data makes computational analysis with contemporary tools more difficult. We offer this tool as a potential solution to this problem.
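
    The idea of locating seed matches through a suffix array can be illustrated with a minimal sketch. It is not the GHOSTX algorithm, which builds suffix arrays of both the query and the database; the version below indexes only a database string, uses a naive construction, and looks up an exact seed by binary search. The function names and the toy protein string are hypothetical.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of `text`,
    sorted lexicographically (naive construction, for illustration only)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_seed(text, suffix_array, seed):
    """Return every position in `text` where `seed` occurs exactly,
    found by two binary searches over the suffix array."""
    k = len(seed)

    def prefix(rank):
        start = suffix_array[rank]
        return text[start:start + k]

    # First rank whose length-k prefix is >= seed.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < seed:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # First rank whose length-k prefix is > seed.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= seed:
            lo = mid + 1
        else:
            hi = mid

    return sorted(suffix_array[first:lo])


if __name__ == "__main__":
    database = "MKVLAAGIVKVLA"  # toy sequence, not real data
    sa = build_suffix_array(database)
    print(find_seed(database, sa, "KVLA"))
```

    Each lookup costs a logarithmic number of string comparisons in the size of the database, which is why suffix arrays are attractive for seed finding; production tools use far more efficient construction methods than the sort shown here.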

  17. GANESH: Software for Customized Annotation of Genome Regions

    PubMed Central

    Huntley, Derek; Hummerich, Holger; Smedley, Damian; Kittivoravitkul, Sasivimol; McCarthy, Mark; Little, Peter; Sergot, Marek

    2003-01-01

    GANESH is a software package designed to support the genetic analysis of regions of human and other genomes. It provides a set of components that may be assembled to construct a self-updating database of DNA sequence, mapping data, and annotations of possible genome features. Once one or more remote sources of data for the target region have been identified, all sequences for that region are downloaded, assimilated, and subjected to a (configurable) set of standard database-searching and genome-analysis packages. The results are stored in compressed form in a relational database, and are updated automatically on a regular schedule so that they are always immediately available in their most up-to-date versions. A Java front-end, executed as a stand alone application or web applet, provides a graphical interface for navigating the database and for viewing the annotations. There are facilities for importing and exporting data in the format of the Distributed Annotation System (DAS), enabling a GANESH database to be used as a component of a DAS configuration. The system has been used to construct databases for about a dozen regions of human chromosomes and for three regions of mouse chromosomes. PMID:12952886

  18. Hawaii bibliographic database

    NASA Astrophysics Data System (ADS)

    Wright, Thomas L.; Takahashi, Taeko Jane

    The Hawaii bibliographic database has been created to contain all of the literature, from 1779 to the present, pertinent to the volcanological history of the Hawaiian-Emperor volcanic chain. References are entered in a PC- and Macintosh-compatible EndNote Plus bibliographic database with keywords and abstracts or (if no abstract) with annotations as to content. Keywords emphasize location, discipline, process, identification of new chemical data or age determinations, and type of publication. The database is updated approximately three times a year and is available to upload from an ftp site. The bibliography contained 8460 references at the time this paper was submitted for publication. Use of the database greatly enhances the power and completeness of library searches for anyone interested in Hawaiian volcanism.

  19. Hawaii bibliographic database

    USGS Publications Warehouse

    Wright, T.L.; Takahashi, T.J.

    1998-01-01

    The Hawaii bibliographic database has been created to contain all of the literature, from 1779 to the present, pertinent to the volcanological history of the Hawaiian-Emperor volcanic chain. References are entered in a PC- and Macintosh-compatible EndNote Plus bibliographic database with keywords and abstracts or (if no abstract) with annotations as to content. Keywords emphasize location, discipline, process, identification of new chemical data or age determinations, and type of publication. The database is updated approximately three times a year and is available to upload from an ftp site. The bibliography contained 8460 references at the time this paper was submitted for publication. Use of the database greatly enhances the power and completeness of library searches for anyone interested in Hawaiian volcanism.

  20. Evaluation of medication list completeness, safety, and annotations.

    PubMed

    Owen, Michael C; Chang, Nancy M; Chong, David H; Vawdrey, David K

    2011-01-01

    Clinical documents frequently contain a list of a patient's medications. Missing information about the dosage, route, or frequency of a medication impairs clinical communication and may harm patients. We examined 253 medication lists. There were 181 lists (72%) with at least one medication missing a dose, route, or frequency. Missing information was judged to be potentially harmful in 47 of the lists (19% of 253) by three physician reviewers (kappa=0.69). We also observed that many lists contained additional information included as annotations, prompting a secondary thematic analysis of the annotations. Fifty-five of the 253 lists (22%) contained one or more annotations. The most frequent types of annotations were comments about the patient's medical history, the clinician's treatment plan changes, and the patient's adherence to a medication. Future development of electronic medication reconciliation tools to improve medication list completeness should also support annotating the medication list in a flexible manner.
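
    The percentages quoted in this abstract can be checked against the stated counts with a few lines of arithmetic. The sketch below uses only the figures given above; the variable and function names are illustrative.

```python
TOTAL_LISTS = 253

# (count reported in the abstract, percentage reported in the abstract)
REPORTED = {
    "missing dose, route, or frequency": (181, 72),
    "judged potentially harmful": (47, 19),
    "containing annotations": (55, 22),
}


def percent(count, total=TOTAL_LISTS):
    """Share of `total` represented by `count`, rounded to a whole percent."""
    return round(100 * count / total)


if __name__ == "__main__":
    for label, (count, stated) in REPORTED.items():
        print(f"{label}: {count}/{TOTAL_LISTS} = {percent(count)}% (stated {stated}%)")
```

    All three rounded values agree with the percentages reported in the abstract.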

  1. Controlled annotations for systems biology.

    PubMed

    Juty, Nick; Laibe, Camille; Le Novère, Nicolas

    2013-01-01

    The aim of this chapter is to provide sufficient information to enable a reader, new to the subject of Systems Biology, to create and use effectively controlled annotations, using resolvable Identifiers.org Uniform Resource Identifiers (URIs). The text details the underlying requirements that have led to the development of such an identification scheme and infrastructure, the principles that underpin its syntax and the benefits derived through its use. It also places into context the relationship with other standardization efforts, how it differs from other pre-existing identification schemes, recent improvements to the system, as well as those that are planned in the future. Throughout, the reader is provided with explicit examples of use and directed to supplementary information where necessary.
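
    As a small illustration of a resolvable Identifiers.org URI, the sketch below assembles one from a namespace prefix and an accession. It assumes the compact `prefix:accession` form under `https://identifiers.org/`; the chapter itself may describe a different (path-style) form, and the helper name, the `uniprot` prefix and the accession shown are illustrative only.

```python
def identifiers_org_uri(prefix, accession):
    """Build an Identifiers.org URI from a namespace prefix and an accession.

    Assumes the compact-identifier form https://identifiers.org/<prefix>:<accession>.
    """
    prefix = prefix.strip()
    accession = accession.strip()
    if not prefix or not accession:
        raise ValueError("both prefix and accession are required")
    return f"https://identifiers.org/{prefix}:{accession}"


if __name__ == "__main__":
    # Example values only; any registered namespace could be used instead.
    print(identifiers_org_uri("uniprot", "P12345"))
```

    Keeping the prefix and accession separate until the last step makes it easier to annotate a model with the same entity under a different resolver if needed.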

  2. RATT: Rapid Annotation Transfer Tool

    PubMed Central

    Otto, Thomas D.; Dillon, Gary P.; Degrave, Wim S.; Berriman, Matthew

    2011-01-01

    Second-generation sequencing technologies have made large-scale sequencing projects commonplace. However, making use of these datasets often requires gene function to be ascribed genome wide. Although tool development has kept pace with the changes in sequence production, for tasks such as mapping, de novo assembly or visualization, genome annotation remains a challenge. We have developed a method to rapidly provide accurate annotation for new genomes using previously annotated genomes as a reference. The method, implemented in a tool called RATT (Rapid Annotation Transfer Tool), transfers annotations from a high-quality reference to a new genome on the basis of conserved synteny. We demonstrate that a Mycobacterium tuberculosis genome or a single 2.5 Mb chromosome from a malaria parasite can be annotated in less than five minutes with only modest computational resources. RATT is available at http://ratt.sourceforge.net. PMID:21306991

  3. Nonlinear Deep Kernel Learning for Image Annotation.

    PubMed

    Jiu, Mingyuan; Sahbi, Hichem

    2017-02-08

    Multiple kernel learning (MKL) is a widely used technique for kernel design. Its principle consists in learning, for a given support vector classifier, the most suitable convex (or sparse) linear combination of standard elementary kernels. However, these combinations are shallow and often powerless to capture the actual similarity between highly semantic data, especially for challenging classification tasks such as image annotation. In this paper, we redefine multiple kernels using deep multi-layer networks. In this new contribution, a deep multiple kernel is recursively defined as a multi-layered combination of nonlinear activation functions, each of which involves a combination of several elementary or intermediate kernels and results in a positive semi-definite deep kernel. We propose four different frameworks in order to learn the weights of these networks: supervised, unsupervised, kernel-based semi-supervised and Laplacian-based semi-supervised. When plugged into support vector machines (SVMs), the resulting deep kernel networks show a clear gain compared to several shallow kernels for the task of image annotation. Extensive experiments and analysis on the challenging ImageCLEF photo annotation benchmark, the COREL5k database and the Banana dataset validate the effectiveness of the proposed method.

  4. A Factor Graph Approach to Automated GO Annotation

    PubMed Central

    Spetale, Flavio E.; Tapia, Elizabeth; Krsticevic, Flavia; Roda, Fernando; Bulacio, Pilar

    2016-01-01

    As the volume of genomic data grows, computational methods become essential for providing a first glimpse into gene annotations. Automated Gene Ontology (GO) annotation methods based on hierarchical ensemble classification techniques are particularly interesting when interpretability of annotation results is a main concern. In these methods, raw GO-term predictions computed by base binary classifiers are leveraged by checking the consistency of predefined GO relationships. Both formal leveraging strategies, with main focus on annotation precision, and heuristic alternatives, with main focus on scalability issues, have been described in the literature. In this contribution, a factor graph approach to the hierarchical ensemble formulation of the automated GO annotation problem is presented. In this formal framework, a core factor graph is first built based on the GO structure and then enriched to take into account the noisy nature of GO-term predictions. Hence, starting from raw GO-term predictions, an iterative message passing algorithm between nodes of the factor graph is used to compute marginal probabilities of target GO-terms. Evaluations on Saccharomyces cerevisiae, Arabidopsis thaliana and Drosophila melanogaster protein sequences from the GO Molecular Function domain showed significant improvements over competing approaches, even when protein sequences were naively characterized by their physicochemical and secondary structure properties or when loose noisy annotation datasets were considered. Based on these promising results and using Arabidopsis thaliana annotation data, we extend our approach to the identification of the most promising molecular function annotations for a set of proteins of unknown function in Solanum lycopersicum. PMID:26771463
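
    To make the consistency idea in this abstract concrete, the sketch below takes noisy predictions for a parent and a child term, enforces the constraint that an annotated child implies an annotated parent, and computes the marginal probability of each term. It enumerates all joint assignments by brute force instead of running message passing; on a small tree-structured graph, sum-product message passing would yield the same exact marginals. The term names, the scores, and the factor definitions are hypothetical and are not the authors' model.

```python
from itertools import product

def consistent_marginals(scores, parent_of):
    """Return P(term = 1) for each term under the hierarchy constraint.

    scores: term -> raw classifier probability that the term applies.
    parent_of: child term -> parent term; child = 1 requires parent = 1.
    """
    terms = sorted(scores)
    totals = {t: 0.0 for t in terms}
    norm = 0.0
    for bits in product((0, 1), repeat=len(terms)):
        assign = dict(zip(terms, bits))
        # Hard factor: skip assignments where a child is on but its parent is off.
        if any(assign[c] and not assign[p] for c, p in parent_of.items()):
            continue
        # Soft factors: one per term, given by the raw prediction.
        weight = 1.0
        for t in terms:
            weight *= scores[t] if assign[t] else 1.0 - scores[t]
        norm += weight
        for t in terms:
            totals[t] += weight * assign[t]
    return {t: totals[t] / norm for t in terms}

# Hypothetical raw predictions: the child scores higher than its parent,
# which is inconsistent with the hierarchy.
raw = {"parent_term": 0.4, "child_term": 0.8}
marginals = consistent_marginals(raw, {"child_term": "parent_term"})
```

    After the constraint is applied, the parent's marginal is never lower than the child's, which is the kind of correction that hierarchy-aware methods aim for.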

  5. A Factor Graph Approach to Automated GO Annotation.

    PubMed

    Spetale, Flavio E; Tapia, Elizabeth; Krsticevic, Flavia; Roda, Fernando; Bulacio, Pilar

    2016-01-01

    As the volume of genomic data grows, computational methods become essential for providing a first glimpse into gene annotations. Automated Gene Ontology (GO) annotation methods based on hierarchical ensemble classification techniques are particularly interesting when interpretability of annotation results is a main concern. In these methods, raw GO-term predictions computed by base binary classifiers are leveraged by checking the consistency of predefined GO relationships. Both formal leveraging strategies, with main focus on annotation precision, and heuristic alternatives, with main focus on scalability issues, have been described in the literature. In this contribution, a factor graph approach to the hierarchical ensemble formulation of the automated GO annotation problem is presented. In this formal framework, a core factor graph is first built based on the GO structure and then enriched to take into account the noisy nature of GO-term predictions. Hence, starting from raw GO-term predictions, an iterative message passing algorithm between nodes of the factor graph is used to compute marginal probabilities of target GO-terms. Evaluations on Saccharomyces cerevisiae, Arabidopsis thaliana and Drosophila melanogaster protein sequences from the GO Molecular Function domain showed significant improvements over competing approaches, even when protein sequences were naively characterized by their physicochemical and secondary structure properties or when loose noisy annotation datasets were considered. Based on these promising results and using Arabidopsis thaliana annotation data, we extend our approach to the identification of the most promising molecular function annotations for a set of proteins of unknown function in Solanum lycopersicum.

  6. Technostress: Surviving a Database Crash.

    ERIC Educational Resources Information Center

    Dobb, Linda S.

    1990-01-01

    Discussion of technostress in libraries focuses on a database crash at California Polytechnic State University, San Luis Obispo. Steps taken to restore the data are explained, strategies for handling technological accidents are suggested, the impact on library staff is discussed, and a 10-item annotated bibliography on technostress is provided.…

  7. GOblet: a platform for Gene Ontology annotation of anonymous sequence data

    PubMed Central

    Groth, Detlef; Lehrach, Hans; Hennig, Steffen

    2004-01-01

    GOblet is a comprehensive web server application providing the annotation of anonymous sequence data with Gene Ontology (GO) terms. It uses a variety of different protein databases (human, murines, invertebrates, plants, sp-trembl) and their respective GO mappings. The user selects the appropriate database and alignment threshold and thereafter submits single or multiple nucleotide or protein sequences. Results are shown in different ways, e.g. as survey statistics for the main GO categories for all sequences or as detailed results for each single sequence that has been submitted. In its newest version, GOblet allows the batch submission of sequences and provides an improved display of results with the aid of Java applets. All output data, together with the Java applet, are packed to a downloadable archive for local installation and analysis. GOblet can be accessed freely at http://goblet.molgen.mpg.de. PMID:15215401

  8. EST-PAC a web package for EST annotation and protein sequence prediction.

    PubMed

    Strahm, Yvan; Powell, David; Lefèvre, Christophe

    2006-10-12

    With the decreasing cost of DNA sequencing technology and the vast diversity of biological resources, researchers increasingly face the basic challenge of annotating a larger number of expressed sequence tags (ESTs) from a variety of species. This typically consists of a series of repetitive tasks, which should be automated and easy to use. The results of these annotation tasks need to be stored and organized in a consistent way. All these operations should be self-installing, platform independent, easy to customize and amenable to using distributed bioinformatics resources available on the Internet. In order to address these issues, we present EST-PAC, a web-oriented multi-platform software package for expressed sequence tag (EST) annotation. EST-PAC provides a solution for the administration of EST and protein sequence annotations accessible through a web interface. Three aspects of EST annotation are automated: 1) searching local or remote biological databases for sequence similarities using Blast services, 2) predicting protein coding sequence from EST data and 3) annotating predicted protein sequences with functional domain predictions. In practice, EST-PAC integrates the BLASTALL suite, EST-Scan2 and HMMER in a relational database system accessible through a simple web interface. EST-PAC also takes advantage of the relational database to allow consistent storage, powerful queries of results and management of the annotation process. The system allows users to customize annotation strategies and provides an open-source data-management environment for research and education in bioinformatics.

  9. Evaluating Functional Annotations of Enzymes Using the Gene Ontology.

    PubMed

    Holliday, Gemma L; Davidson, Rebecca; Akiva, Eyal; Babbitt, Patricia C

    2017-01-01

    The Gene Ontology (GO) (Ashburner et al., Nat Genet 25(1):25-29, 2000) is a powerful tool in the informatics arsenal of methods for evaluating annotations in a protein dataset. Identifying the nearest well-annotated homologue of a protein of interest, predicting where misannotation has occurred, and knowing how confident you can be in the annotations assigned to those proteins are all critical. In this chapter we explore what makes an enzyme unique and how we can use GO to infer aspects of protein function based on sequence similarity. These can range from identification of misannotation or other errors in a predicted function to accurate function prediction for an enzyme of entirely unknown function. Although GO annotation applies to any gene products, we focus here on describing our approach for hierarchical classification of enzymes in the Structure-Function Linkage Database (SFLD) (Akiva et al., Nucleic Acids Res 42(Database issue):D521-530, 2014) as a guide for informed utilisation of annotation transfer based on GO terms.

  10. Dizeez: an online game for human gene-disease annotation.

    PubMed

    Loguercio, Salvatore; Good, Benjamin M; Su, Andrew I

    2013-01-01

    Structured gene annotations are a foundation upon which many bioinformatics and statistical analyses are built. However, the structured annotations available in public databases are a sparse representation of biological knowledge as a whole. The rate of biomedical data generation is such that centralized biocuration efforts struggle to keep up. New models for gene annotation need to be explored that expand the pace at which we are able to structure biomedical knowledge. Recently, online games have emerged as an effective way to recruit, engage and organize large numbers of volunteers to help address difficult biological challenges. For example, games have been successfully developed for protein folding (Foldit), multiple sequence alignment (Phylo) and RNA structure design (EteRNA). Here we present Dizeez, a simple online game built with the purpose of structuring knowledge of gene-disease associations. Preliminary results from game play online and at scientific conferences suggest that Dizeez is producing valid gene-disease annotations not yet present in any public database. These early results provide a basic proof of principle that online games can be successfully applied to the challenge of gene annotation. Dizeez is available at http://genegames.org.

  11. A legume specific protein database (LegProt) improves the number of identified peptides, confidence scores and overall protein identification success rates for legume proteomics.

    PubMed

    Lei, Zhentian; Dai, Xinbin; Watson, Bonnie S; Zhao, Patrick X; Sumner, Lloyd W

    2011-07-01

    A legume specific protein database (LegProt) has been created containing sequences from seven legume species, i.e., Glycine max, Lotus japonicus, Medicago sativa, Medicago truncatula, Lupinus albus, Phaseolus vulgaris, and Pisum sativum. The database consists of amino acid sequences translated from predicted gene models and 6-frame translations of tentative consensus (TC) sequences assembled from expressed sequence tags (ESTs) and singleton ESTs. This database was queried using mass spectral data for protein identification and identification success rates were compared to the NCBI nr database. Specifically, Mascot MS/MS ion searches of tandem nano-LC Q-TOFMS/MS mass spectral data showed that relative to the NCBI nr protein database, the LegProt database yielded a 54% increase in the average protein score (i.e., from NCBI nr 480 to LegProt 739) and a 50% increase in the average number of matched peptides (i.e., from NCBI nr 8 to LegProt 12). The overall identification success rate also increased from 88% (NCBI nr) to 93% (LegProt). Mascot peptide mass fingerprinting (PMF) searches of the LegProt database using MALDI-TOFMS data yielded a significant increase in the identification success rate from 19% (NCBI nr) to 34% (LegProt) while the average scores and average number of matched peptides showed insignificant changes. The results demonstrate that the LegProt database significantly increases legume protein identification success rates and the confidence levels compared to the commonly used NCBI nr. These improvements are primarily due to the presence of a large number of legume specific TC sequences in the LegProt database that were not found in NCBI nr. The LegProt database is freely available for download (http://bioinfo.noble.org/manuscript-support/legumedb) and will serve as a valuable resource for legume proteomics.
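
    The relative gains quoted in this abstract follow directly from the reported averages. A quick check, using only the figures given in the abstract (the helper name `percent_increase` is ours):

```python
def percent_increase(old, new):
    """Relative change from old to new, expressed in percent."""
    return 100.0 * (new - old) / old

score_gain = percent_increase(480, 739)   # average protein score, NCBI nr -> LegProt
peptide_gain = percent_increase(8, 12)    # average number of matched peptides

print(round(score_gain), round(peptide_gain))  # prints: 54 50
```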

  12. Phylogenetic molecular function annotation

    NASA Astrophysics Data System (ADS)

    Engelhardt, Barbara E.; Jordan, Michael I.; Repo, Susanna T.; Brenner, Steven E.

    2009-07-01

    It is now easier to discover thousands of protein sequences in a new microbial genome than it is to biochemically characterize the specific activity of a single protein of unknown function. The molecular functions of protein sequences have typically been predicted using homology-based computational methods, which rely on the principle that homologous proteins share a similar function. However, some protein families include groups of proteins with different molecular functions. A phylogenetic approach for predicting molecular function (sometimes called "phylogenomics") is an effective means to predict protein molecular function. These methods incorporate functional evidence from all members of a family that have functional characterizations using the evolutionary history of the protein family to make robust predictions for the uncharacterized proteins. However, they are often difficult to apply on a genome-wide scale because of the time-consuming step of reconstructing the phylogenies of each protein to be annotated. Our automated approach for function annotation using phylogeny, the SIFTER (Statistical Inference of Function Through Evolutionary Relationships) methodology, uses a statistical graphical model to compute the probabilities of molecular functions for unannotated proteins. Our benchmark tests showed that SIFTER provides accurate functional predictions on various protein families, outperforming other available methods.

  13. Visualizing GO Annotations.

    PubMed

    Supek, Fran; Škunca, Nives

    2017-01-01

    Contemporary techniques in biology produce readouts for large numbers of genes simultaneously, the typical example being differential gene expression measurements. Moreover, those genes are often richly annotated using GO terms that describe gene function and that can be used to summarize the results of the genome-scale experiments. However, making sense of such GO enrichment analyses may be challenging. For instance, overrepresented GO functions in a set of differentially expressed genes are typically output as a flat list, a format not adequate to capture the complexities of the hierarchical structure of the GO annotation labels. In this chapter, we survey various methods to visualize large, difficult-to-interpret lists of GO terms. We catalog their availability (Web-based or standalone), the main principles they employ in summarizing large lists of GO terms, and the visualization styles they support. These brief commentaries on each software are intended as a helpful inventory, rather than comprehensive descriptions of the underlying algorithms. Instead, we show examples of their use and suggest that the choice of an appropriate visualization tool may be crucial to the utility of GO in biological discovery.

  14. Phylogenetic molecular function annotation

    PubMed Central

    Engelhardt, Barbara E; Jordan, Michael I; Repo, Susanna T; Brenner, Steven E

    2010-01-01

    It is now easier to discover thousands of protein sequences in a new microbial genome than it is to biochemically characterize the specific activity of a single protein of unknown function. The molecular functions of protein sequences have typically been predicted using homology-based computational methods, which rely on the principle that homologous proteins share a similar function. However, some protein families include groups of proteins with different molecular functions. A phylogenetic approach for predicting molecular function (sometimes called “phylogenomics”) is an effective means to predict protein molecular function. These methods incorporate functional evidence from all members of a family that have functional characterizations using the evolutionary history of the protein family to make robust predictions for the uncharacterized proteins. However, they are often difficult to apply on a genome-wide scale because of the time-consuming step of reconstructing the phylogenies of each protein to be annotated. Our automated approach for function annotation using phylogeny, the SIFTER (Statistical Inference of Function Through Evolutionary Relationships) methodology, uses a statistical graphical model to compute the probabilities of molecular functions for unannotated proteins. Our benchmark tests showed that SIFTER provides accurate functional predictions on various protein families, outperforming other available methods. PMID:20664722

  15. A method for including protein flexibility in protein-ligand docking: improving tools for database mining and virtual screening.

    PubMed

    Broughton, H B

    2000-06-01

    Second-generation methods for docking ligands into their biological receptors, such as FLOG, provide for flexibility of the ligand but not of the receptor. Molecular dynamics based methods, such as free energy perturbation, account for flexibility, solvent effects, etc., but are very time consuming. We combined the use of statistical analysis of conformational samples from short-run protein molecular dynamics with grid-based docking protocols and demonstrated improved performance in two test cases. Our statistical analysis explores the importance of the average strength of a potential interaction with the biological target and optionally applies a weighting depending on the variability in the strength of the interaction seen during dynamics simulation. Using these methods, we improved the num-top-ranked 10% of a database of drug-like molecules, in searches based on the three-dimensional structure of the protein. These methods are able to match the ability of manual docking to assess likely inactivity on steric grounds and indeed to rank order ligands from a homologous series of cyclooxygenase-2 inhibitors with good correlation to their true activity. Furthermore, these methods reduce the need for human intervention in setting up molecular docking experiments.

  16. Human object annotation for surveillance video forensics

    NASA Astrophysics Data System (ADS)

    Fraz, Muhammad; Zafar, Iffat; Tzanidou, Giounona; Edirisinghe, Eran A.; Sarfraz, Muhammad Saquib

    2013-10-01

    A system that can automatically annotate surveillance video in a manner useful for locating a person with a given description of clothing is presented. Each human is annotated based on two appearance features: primary colors of clothes and the presence of text/logos on clothes. The annotation occurs after a robust foreground extraction stage employing a modified Gaussian mixture model-based approach. The proposed pipeline consists of a preprocessing stage where color appearance of an image is improved using a color constancy algorithm. In order to annotate color information for human clothes, we use the color histogram feature in HSV space and find local maxima to extract dominant colors for different parts of a segmented human object. To detect text/logos on clothes, we begin with the extraction of connected components of enhanced horizontal, vertical, and diagonal edges in the frames. These candidate regions are classified as text or nontext on the basis of their local energy-based shape histogram features. Further, to detect humans, a novel technique has been proposed that uses contourlet transform-based local binary pattern (CLBP) features. In the proposed method, we extract the uniform direction invariant LBP feature descriptor for contourlet transformed high-pass subimages from vertical and diagonal directional bands. In the final stage, extracted CLBP descriptors are classified by a trained support vector machine. Experimental results illustrate the superiority of our method on large-scale surveillance video data.
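
    The colour-annotation step in this abstract (a histogram in HSV space followed by a search for local maxima) can be sketched roughly as follows. This simplified version bins the hue channel only and treats the histogram as circular; the bin count, the `min_fraction` threshold, and the decision to ignore saturation and value are assumptions for illustration, not details from the paper.

```python
import colorsys

def dominant_hues(pixels, bins=12, min_fraction=0.1):
    """Return the centre hues (in degrees) of local maxima in a hue histogram.

    pixels: iterable of (r, g, b) tuples with components in 0..255.
    bins and min_fraction are illustrative choices, not values from the paper.
    """
    hist = [0] * bins
    count = 0
    for r, g, b in pixels:
        h, _s, _v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
        hist[min(int(h * bins), bins - 1)] += 1
        count += 1
    peaks = []
    for i, n in enumerate(hist):
        left, right = hist[i - 1], hist[(i + 1) % bins]  # hue wraps around
        # Keep bins that are populated enough and are local maxima.
        if n >= min_fraction * count and n > left and n >= right:
            peaks.append((i + 0.5) * 360.0 / bins)
    return peaks

# Hypothetical region: mostly orange-red pixels with some blue ones.
region = [(255, 64, 0)] * 6 + [(0, 64, 255)] * 4
```

    In a full pipeline this function would presumably be applied separately to each segmented body part, so that upper and lower clothing receive their own colour labels.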

  17. Fire-induced water-repellent soils, an annotated bibliography

    USGS Publications Warehouse

    Kalendovsky, M.A.; Cannon, S.H.

    1997-01-01

    The development and nature of water-repellent, or hydrophobic, soils are important issues in evaluating hillslope response to fire. The following annotated bibliography was compiled to consolidate existing published research on the topic. Emphasis was placed on the types, causes, effects and measurement techniques of water repellency, particularly with respect to wildfires and prescribed burns. Each annotation includes a general summary of the respective publication, as well as highlights of interest to this focus. Although some references on the development of water repellency without fires, the chemistry of hydrophobic substances, and remediation of water-repellent conditions are included, coverage of these topics is not intended to be comprehensive. To develop this database, the GeoRef, Agricola, and Water Resources Abstracts databases were searched for appropriate references, and the bibliographies of each reference were then reviewed for additional entries. Additional references will be added to this bibliography as they become available. The annotated bibliography can be accessed on the Web at http://geohazards.cr.usgs.gov/html_files/landslides/ofr97-720/biblio.html. A database consisting of the references and keywords is available through a link at the above address. This database was compiled using EndNote2 plus software by Niles and Associates, which is required to search the database.

  18. Unlimited Thirst for Genome Sequencing, Data Interpretation, and Database Usage in Genomic Era: The Road towards Fast-Track Crop Plant Improvement

    PubMed Central

    Govindaraj, Mahalingam

    2015-01-01

    The number of sequenced crop genomes and associated genomic resources is growing rapidly with the advent of inexpensive next generation sequencing methods. Databases have become an integral part of all aspects of science research, including basic and applied plant and animal sciences. The importance of databases keeps increasing as the volume of datasets from direct and indirect genomics, as well as other omics approaches, has kept expanding in recent years. The databases and associated web portals provide at a minimum a uniform set of tools and automated analysis across a wide range of crop plant genomes. This paper reviews some basic terms and considerations in dealing with crop plant database utilization in the advancing genomic era. The utilization of databases for variation analysis with other comparative genomics tools, and data interpretation platforms are well described. The major focus of this review is to provide knowledge on platforms and databases for genome-based investigations of agriculturally important crop plants. The utilization of these databases in applied crop improvement programs is still being achieved widely; otherwise, the end for sequencing is not far away. PMID:25874133

  19. Automated update, revision, and quality control of the maize genome annotations using MAKER-P improves the B73 RefGen_v3 gene models and identifies new genes

    Technology Transfer Automated Retrieval System (TEKTRAN)

    The large size and relative complexity of many plant genomes make creation, quality control, and dissemination of high-quality gene structure annotations challenging. In response, we have developed MAKER-P, a fast and easy-to-use genome annotation engine for plants. Here, we report the use of MAKER-...

  20. The TIGR Maize Database.

    PubMed

    Chan, Agnes P; Pertea, Geo; Cheung, Foo; Lee, Dan; Zheng, Li; Whitelaw, Cathy; Pontaroli, Ana C; SanMiguel, Phillip; Yuan, Yinan; Bennetzen, Jeffrey; Barbazuk, William Brad; Quackenbush, John; Rabinowicz, Pablo D

    2006-01-01

    Maize is a staple crop of the grass family and also an excellent model for plant genetics. Owing to the large size and repetitiveness of its genome, we previously investigated two approaches to accelerate gene discovery and genome analysis in maize: methylation filtration and high C(0)t selection. These techniques allow the construction of gene-enriched genomic libraries by minimizing repeat sequences due to either their methylation status or their copy number, yielding a 7-fold enrichment in genic sequences relative to a random genomic library. Approximately 900,000 gene-enriched reads from maize were generated and clustered into Assembled Zea mays (AZM) sequences. Here we report the current AZM release, which consists of approximately 298 Mb representing 243,807 sequence assemblies and singletons. In order to provide a repository of publicly available maize genomic sequences, we have created the TIGR Maize Database (http://maize.tigr.org). In this resource, we have assembled and annotated the AZMs and used available sequenced markers to anchor AZMs to maize chromosomes. We have constructed a maize repeat database and generated draft sequence assemblies of 287 maize bacterial artificial chromosome (BAC) clone sequences, which we annotated along with 172 additional publicly available BAC clones. All sequences, assemblies and annotations are available at the project website via web interfaces and FTP downloads.

  1. Semantic Annotations and Querying of Web Data Sources

    NASA Astrophysics Data System (ADS)

    Hornung, Thomas; May, Wolfgang

    A large part of the Web, actually holding a significant portion of the useful information throughout the Web, consists of views on hidden databases, provided by numerous heterogeneous interfaces that are partly human-oriented via Web forms ("Deep Web"), and partly based on Web Services (only machine accessible). In this paper we present an approach for annotating these sources in a way that makes them citizens of the Semantic Web. We illustrate how queries can be stated in terms of the ontology, and how the annotations are used to select and access appropriate sources and to answer the queries.

  2. An Annotated Bibliography on Children.

    ERIC Educational Resources Information Center

    Bureau of Libraries and Educational Technology (DHEW/OE), Washington, DC.

    This annotated bibliography is a highly selective list of materials published in the last five years on the major problems, trends, methodologies and achievements in the field of child development. It contains annotated references to approximately 500 books, periodicals, technical reports, government documents, legislative materials, professional…

  3. Sma3s: A Three-Step Modular Annotator for Large Sequence Datasets

    PubMed Central

    Muñoz-Mérida, Antonio; Viguera, Enrique; Claros, M. Gonzalo; Trelles, Oswaldo; Pérez-Pulido, Antonio J.

    2014-01-01

    Automatic sequence annotation is an essential component of modern ‘omics’ studies, which aim to extract information from large collections of sequence data. Most existing tools use sequence homology to establish evolutionary relationships and assign putative functions to sequences. However, it can be difficult to define a similarity threshold that achieves sufficient coverage without sacrificing annotation quality. Defining the correct configuration is critical and can be challenging for non-specialist users. Thus, the development of robust automatic annotation techniques that generate high-quality annotations without needing expert knowledge would be very valuable for the research community. We present Sma3s, a tool for automatically annotating very large collections of biological sequences from any kind of gene library or genome. Sma3s is composed of three modules that progressively annotate query sequences using either: (i) very similar homologues, (ii) orthologous sequences or (iii) terms enriched in groups of homologous sequences. We trained the system using several random sets of known sequences, demonstrating average sensitivity and specificity values of ∼85%. In conclusion, Sma3s is a versatile tool for high-throughput annotation of a wide variety of sequence datasets that outperforms the accuracy of other well-established annotation algorithms, and it can enrich existing database annotations and uncover previously hidden features. Importantly, Sma3s has already been used in the functional annotation of two published transcriptomes. PMID:24501397

  4. Sma3s: a three-step modular annotator for large sequence datasets.

    PubMed

    Muñoz-Mérida, Antonio; Viguera, Enrique; Claros, M Gonzalo; Trelles, Oswaldo; Pérez-Pulido, Antonio J

    2014-08-01

    Automatic sequence annotation is an essential component of modern 'omics' studies, which aim to extract information from large collections of sequence data. Most existing tools use sequence homology to establish evolutionary relationships and assign putative functions to sequences. However, it can be difficult to define a similarity threshold that achieves sufficient coverage without sacrificing annotation quality. Defining the correct configuration is critical and can be challenging for non-specialist users. Thus, the development of robust automatic annotation techniques that generate high-quality annotations without needing expert knowledge would be very valuable for the research community. We present Sma3s, a tool for automatically annotating very large collections of biological sequences from any kind of gene library or genome. Sma3s is composed of three modules that progressively annotate query sequences using either: (i) very similar homologues, (ii) orthologous sequences or (iii) terms enriched in groups of homologous sequences. We trained the system using several random sets of known sequences, demonstrating average sensitivity and specificity values of ~85%. In conclusion, Sma3s is a versatile tool for high-throughput annotation of a wide variety of sequence datasets that outperforms the accuracy of other well-established annotation algorithms, and it can enrich existing database annotations and uncover previously hidden features. Importantly, Sma3s has already been used in the functional annotation of two published transcriptomes.

  5. Cognition inspired framework for indoor scene annotation

    NASA Astrophysics Data System (ADS)

    Ye, Zhipeng; Liu, Peng; Zhao, Wei; Tang, Xianglong

    2015-09-01

    We present a simple yet effective scene annotation framework based on a combination of bag-of-visual words (BoVW), three-dimensional scene structure estimation, scene context, and cognitive theory. From a macroperspective, the proposed cognition-based hybrid motivation framework divides the annotation problem into empirical inference and real-time classification. Inspired by the inference ability of human beings, common objects of indoor scenes are defined for experience-based inference, while in the real-time classification stage, an improved BoVW-based multilayer abstract semantics labeling method is proposed by introducing abstract semantic hierarchies to narrow the semantic gap and improve the performance of object categorization. The proposed framework was evaluated on a variety of common data sets and experimental results proved its effectiveness.

  6. Genome Wide Re-Annotation of Caldicellulosiruptor saccharolyticus with New Insights into Genes Involved in Biomass Degradation and Hydrogen Production

    PubMed Central

    Chowdhary, Nupoor; Selvaraj, Ashok; KrishnaKumaar, Lakshmi; Kumar, Gopal Ramesh

    2015-01-01

    Caldicellulosiruptor saccharolyticus has proven itself to be an excellent candidate for biological hydrogen (H2) production, but it still has major drawbacks, such as sensitivity to high osmotic pressure and low volumetric H2 productivity, which should be considered before it can be used industrially. A whole genome re-annotation work has been carried out as an attempt to update the incomplete genome information that causes a gap in the knowledge, especially in the area of metabolic engineering, to improve the H2 producing capabilities of C. saccharolyticus. Whole genome re-annotation was performed through manual means for 2,682 Coding Sequences (CDSs). Bioinformatics tools based on sequence similarity, motif search, phylogenetic analysis and fold recognition were employed for re-annotation. Our methodology could successfully add functions for 409 hypothetical proteins (HPs), 46 proteins previously annotated as putative and assigned more accurate functions for the known protein sequences. Homology based gene annotation has been used as a standard method for assigning function to novel proteins, but over the past few years many non-homology based methods such as genomic context approaches for protein function prediction have been developed. Using non-homology based functional prediction methods, we were able to assign cellular processes or physical complexes for 249 hypothetical sequences. Our re-annotation pipeline highlights the addition of 231 new CDSs generated from MicroScope Platform, to the original genome with functional prediction for 49 of them. The re-annotation of HPs and new CDSs is stored in the relational database that is available on the MicroScope web-based platform. In parallel, comparative genome analyses were performed among the members of genus Caldicellulosiruptor to understand the function and evolutionary processes. Further, with results from integrated re-annotation studies (homology and genomic context approach), we strongly suggest that Csac

  7. Genome Wide Re-Annotation of Caldicellulosiruptor saccharolyticus with New Insights into Genes Involved in Biomass Degradation and Hydrogen Production.

    PubMed

    Chowdhary, Nupoor; Selvaraj, Ashok; KrishnaKumaar, Lakshmi; Kumar, Gopal Ramesh

    2015-01-01

    Caldicellulosiruptor saccharolyticus has proven itself to be an excellent candidate for biological hydrogen (H2) production, but it still has major drawbacks, such as sensitivity to high osmotic pressure and low volumetric H2 productivity, which should be considered before it can be used industrially. A whole genome re-annotation has been carried out in an attempt to update the incomplete genome information that causes gaps in knowledge, especially in the area of metabolic engineering, to improve the H2 producing capabilities of C. saccharolyticus. Whole genome re-annotation was performed through manual means for 2,682 Coding Sequences (CDSs). Bioinformatics tools based on sequence similarity, motif search, phylogenetic analysis and fold recognition were employed for re-annotation. Our methodology successfully added functions for 409 hypothetical proteins (HPs) and 46 proteins previously annotated as putative, and assigned more accurate functions for the known protein sequences. Homology-based gene annotation has been used as a standard method for assigning function to novel proteins, but over the past few years many non-homology-based methods, such as genomic context approaches for protein function prediction, have been developed. Using non-homology-based functional prediction methods, we were able to assign cellular processes or physical complexes for 249 hypothetical sequences. Our re-annotation pipeline added 231 new CDSs generated from the MicroScope Platform to the original genome, with functional predictions for 49 of them. The re-annotation of HPs and new CDSs is stored in the relational database that is available on the MicroScope web-based platform. In parallel, comparative genome analyses were performed among the members of the genus Caldicellulosiruptor to understand the function and evolutionary processes. Further, with results from integrated re-annotation studies (homology and genomic context approach), we strongly suggest that Csac

  8. The Society of Thoracic Surgeons Adult Cardiac Surgery Database: The Driving Force for Improvement in Cardiac Surgery.

    PubMed

    Winkley Shroyer, Annie Laurie; Bakaeen, Faisal; Shahian, David M; Carr, Brendan M; Prager, Richard L; Jacobs, Jeffrey P; Ferraris, Victor; Edwards, Fred; Grover, Frederick L

    2015-01-01

    Initiated in 1989, the Society of Thoracic Surgeons (STS) Adult Cardiac Surgery Database (ACSD) includes more than 1085 participating centers, representing 90%-95% of current US-based adult cardiac surgery hospitals. Since its inception, the primary goal of the STS ACSD has been to use clinical data to track and improve cardiac surgical outcomes. Patients' preoperative risk characteristics, procedure-related processes of care, and clinical outcomes data have been captured and analyzed, with timely risk-adjusted feedback reports to participating providers. In 2006, STS initiated an external audit process to evaluate STS ACSD completeness and accuracy. Given the extremely high inter-rater reliability and completeness rates of STS ACSD, it is widely regarded as the "gold standard" for benchmarking cardiac surgery risk-adjusted outcomes. Over time, STS ACSD has expanded its quality horizons beyond the traditional focus on isolated, risk-adjusted short-term outcomes such as perioperative morbidity and mortality. New quality indicators have evolved including composite measures of key processes of care and outcomes (risk-adjusted morbidity and risk-adjusted mortality), longer-term outcomes, and readmissions. Resource use and patient-reported outcomes would be added in the future. These additional metrics provide a more comprehensive perspective on quality as well as additional end points. Widespread acceptance and use of STS ACSD has led to a cultural transformation within cardiac surgery by providing nationally benchmarked data for internal quality assessment, aiding data-driven quality improvement activities, serving as the basis for a voluntary public reporting program, advancing cardiac surgery care through STS ACSD-based research, and facilitating data-driven informed consent dialogues and alternative treatment-related discussions.

  9. IDPredictor: predict database links in biomedical database.

    PubMed

    Mehlhorn, Hendrik; Lange, Matthias; Scholz, Uwe; Schreiber, Falk

    2012-06-26

    Knowledge found in biomedical databases, in particular in Web information systems, is a major bioinformatics resource. In general, this biological knowledge is represented worldwide in a network of databases. These data are spread among thousands of databases, which overlap in content but differ substantially with respect to content detail, interface, formats and data structure. To support functional annotation of lab data, such as protein sequences, metabolites or DNA sequences, as well as semi-automated data exploration in information retrieval environments, an integrated view of databases is essential. Search engines have the potential of assisting in data retrieval from these structured sources, but fall short of providing comprehensive knowledge except out of the interlinked databases. A prerequisite for supporting the concept of an integrated data view is to acquire insights into cross-references among database entities. This is hampered by the fact that only a fraction of all possible cross-references are explicitly tagged in the particular biomedical information systems. In this work, we investigate to what extent an automated construction of an integrated data network is possible. We propose a method that predicts and extracts cross-references from multiple life science databases and possible referenced data targets. We study the retrieval quality of our method and report on first, promising results. The method is implemented as the tool IDPredictor, which is published under the DOI 10.5447/IPK/2012/4 and is freely available using the URL: http://dx.doi.org/10.5447/IPK/2012/4.

  10. Fast T Wave Detection Calibrated by Clinical Knowledge with Annotation of P and T Waves

    PubMed Central

    Elgendi, Mohamed; Eskofier, Bjoern; Abbott, Derek

    2015-01-01

    Background There are limited studies on the automatic detection of T waves in arrhythmic electrocardiogram (ECG) signals. This is perhaps because there is no available arrhythmia dataset with annotated T waves. There is a growing need to develop numerically-efficient algorithms that can accommodate the new trend of battery-driven ECG devices. Moreover, there is also a need to analyze long-term recorded signals in a reliable and time-efficient manner, therefore improving the diagnostic ability of mobile devices and point-of-care technologies. Methods Here, the T wave annotation of the well-known MIT-BIH arrhythmia database is discussed and provided. Moreover, a simple fast method for detecting T waves is introduced. A typical T wave detection method has been reduced to a basic approach consisting of two moving averages and dynamic thresholds. The dynamic thresholds were calibrated using four clinically known types of sinus node response to atrial premature depolarization (compensation, reset, interpolation, and reentry). Results The determination of T wave peaks is performed and the proposed algorithm is evaluated on two well-known databases, the QT and MIT-BIH Arrhythmia databases. The detector obtained a sensitivity of 97.14% and a positive predictivity of 99.29% over the first lead of the validation databases (total of 221,186 beats). Conclusions We present a simple yet very reliable T wave detection algorithm that can be potentially implemented on mobile battery-driven devices. In contrast to complex methods, it can be easily implemented in a digital filter design. PMID:26197321
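
    The detection scheme described above (two moving averages plus a threshold) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the window lengths, the offset, the minimum block width, and the use of "long average plus offset" as the dynamic threshold are all assumptions, and the test signal is synthetic.

```python
def moving_average(signal, window):
    """Centered moving average; the window shrinks at the signal edges."""
    half = window // 2
    out = []
    for i in range(len(signal)):
        lo = max(0, i - half)
        hi = min(len(signal), i + half + 1)
        out.append(sum(signal[lo:hi]) / (hi - lo))
    return out


def detect_peaks(signal, short_win=5, long_win=21, offset=0.0, min_width=3):
    """Return one peak index per block where the short average exceeds the threshold.

    short_win, long_win, offset and min_width are illustrative values,
    not the clinically calibrated parameters from the paper.
    """
    short_ma = moving_average(signal, short_win)
    long_ma = moving_average(signal, long_win)
    peaks = []
    start = None
    # Iterate one step past the end so a block touching the end is closed.
    for i in range(len(signal) + 1):
        inside = i < len(signal) and short_ma[i] > long_ma[i] + offset
        if inside and start is None:
            start = i
        elif not inside and start is not None:
            if i - start >= min_width:
                # The peak is the largest sample inside the block.
                peaks.append(max(range(start, i), key=lambda k: signal[k]))
            start = None
    return peaks


# Synthetic signal (hypothetical data): flat baseline with two triangular bumps.
signal = [0.0] * 100
for center in (30, 70):
    for d in range(-4, 5):
        signal[center + d] = 1.0 - abs(d) * 0.2

peaks = detect_peaks(signal)
print(peaks)
```

    Because the method needs only running sums and comparisons, it is cheap enough for the battery-driven devices the abstract has in mind; real ECG data would additionally need filtering and QRS removal, which this sketch omits.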

  11. Processing sequence annotation data using the Lua programming language.

    PubMed

    Ueno, Yutaka; Arita, Masanori; Kumagai, Toshitaka; Asai, Kiyoshi

    2003-01-01

    The data processing language in a graphical software tool that manages sequence annotation data from genome databases should provide flexible functions for the tasks in molecular biology research. Among currently available languages we adopted the Lua programming language. It fulfills our requirements to perform computational tasks for sequence map layouts, i.e. the handling of data containers, symbolic reference to data, and a simple programming syntax. Upon importing a foreign file, the original data are first decomposed in the Lua language while maintaining the original data schema. The converted data are parsed by the Lua interpreter and the contents are stored in our data warehouse. Then, portions of annotations are selected and arranged into our catalog format to be depicted on the sequence map. Our sequence visualization program was successfully implemented, embedding the Lua language for processing of annotation data and layout script. The program is available at http://staff.aist.go.jp/yutaka.ueno/guppy/.

  12. Semantic annotation for live and posterity logging of video documents

    NASA Astrophysics Data System (ADS)

    Bertini, Marco; Del Bimbo, Alberto; Nunziati, W.

    2003-06-01

    Broadcasters usually envision two basic applications for video databases: Live Logging and Posterity Logging. The former aims at providing effective annotation of video in quasi-real time and supports extraction of meaningful clips from the live stream; it is usually performed by assistant producers working at the same location as the event. The latter provides annotation for later reuse of video material and is the prerequisite for retrieval by content from video digital libraries; it is performed by trained librarians. Both require that annotation be performed, to a great extent, automatically. Video information structure must encompass both low-intermediate level video organization and event relationships that define specific highlights and situations. Analysis of the visual data of the video stream makes it possible to extract hints, identify events and detect highlights. All of this must be supported by a priori knowledge of the video domain and effective reasoning engines capable of capturing the inherent semantics of the visual events.

  13. Is the Juice Worth the Squeeze? Costs and Benefits of Multiple Human Annotators for Clinical Text De-identification

    PubMed Central

    Carrell, D. S.; Cronkite, D. J.; Malin, B. A.; Aberdeen, J. S.; Hirschman, L.

    2016-01-01

    Summary Background Clinical text contains valuable information but must be de-identified before it can be used for secondary purposes. Accurate annotation of personally identifiable information (PII) is essential to the development of automated de-identification systems and to manual redaction of PII. Yet the accuracy of annotations may vary considerably across individual annotators and annotation is costly. As such, the marginal benefit of incorporating additional annotators has not been well characterized. Objectives This study models the costs and benefits of incorporating increasing numbers of independent human annotators to identify the instances of PII in a corpus. We used a corpus with gold standard annotations to evaluate the performance of teams of annotators of increasing size. Methods Four annotators independently identified PII in a 100-document corpus consisting of randomly selected clinical notes from Family Practice clinics in a large integrated health care system. These annotations were pooled and validated to generate a gold standard corpus for evaluation. Results Recall rates for all PII types ranged from 0.90 to 0.98 for individual annotators to 0.998 to 1.0 for teams of three, when measured against the gold standard. Median cost per PII instance discovered during corpus annotation ranged from $0.71 for an individual annotator to $377 for annotations discovered only by a fourth annotator. Conclusions Incorporating a second annotator into a PII annotation process reduces unredacted PII and improves the quality of annotations to 0.99 recall, yielding clear benefit at reasonable cost; the cost advantages of annotation teams larger than two diminish rapidly. PMID:27405787
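
    The team-size comparison above can be illustrated with a small recall calculation. This is a sketch under stated assumptions: a team's annotations are pooled by set union, recall is averaged over all possible teams of a given size, and the annotator names and toy data are hypothetical rather than taken from the study.

```python
from itertools import combinations


def recall(found, gold):
    """Fraction of gold-standard PII instances present in `found`."""
    return len(found & gold) / len(gold)


def team_recall(annotations, team_size, gold):
    """Mean recall over all teams of `team_size`, pooling each team by set union."""
    scores = []
    for team in combinations(sorted(annotations), team_size):
        pooled = set().union(*(annotations[name] for name in team))
        scores.append(recall(pooled, gold))
    return sum(scores) / len(scores)


# Hypothetical toy data: 10 gold PII instances, each annotator misses a few.
gold = set(range(10))
annotations = {
    "A": gold - {0},
    "B": gold - {1, 2},
    "C": gold - {0, 3},
    "D": gold - {4},
}

by_size = {k: team_recall(annotations, k, gold) for k in range(1, 5)}
for size, value in by_size.items():
    print(size, round(value, 3))
```

    In this toy data the gain from the second annotator is much larger than the gain from the third, which mirrors the diminishing returns the abstract reports; the actual magnitudes depend entirely on the corpus.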

  14. Algal Functional Annotation Tool: a web-based analysis suite to functionally interpret large gene lists using integrated annotation and expression data

    PubMed Central

    2011-01-01

    Background Progress in genome sequencing is proceeding at an exponential pace, and several new algal genomes are becoming available every year. One of the challenges facing the community is the association of protein sequences encoded in the genomes with biological function. While most genome assembly projects generate annotations for predicted protein sequences, they are usually limited and integrate functional terms from a limited number of databases. Another challenge is the use of annotations to interpret large lists of 'interesting' genes generated by genome-scale datasets. Previously, these gene lists had to be analyzed across several independent biological databases, often on a gene-by-gene basis. In contrast, several annotation databases, such as DAVID, integrate data from multiple functional databases and reveal underlying biological themes of large gene lists. While several such databases have been constructed for animals, none is currently available for the study of algae. Due to renewed interest in algae as potential sources of biofuels and the emergence of multiple algal genome sequences, a significant need has arisen for such a database to process the growing compendiums of algal genomic data. Description The Algal Functional Annotation Tool is a web-based comprehensive analysis suite integrating annotation data from several pathway, ontology, and protein family databases. The current version provides annotation for the model alga Chlamydomonas reinhardtii, and in the future will include additional genomes. The site allows users to interpret large gene lists by identifying associated functional terms, and their enrichment. Additionally, expression data for several experimental conditions were compiled and analyzed to provide an expression-based enrichment search. A tool to search for functionally-related genes based on gene expression across these conditions is also provided. 
    Other features include dynamic visualization of genes on KEGG pathway maps

  15. Developing a biocuration workflow for AgBase, a non-model organism database

    PubMed Central

    Pillai, Lakshmi; Chouvarine, Philippe; Tudor, Catalina O.; Schmidt, Carl J.; Vijay-Shanker, K.; McCarthy, Fiona M.

    2012-01-01

    AgBase provides annotation for agricultural gene products using the Gene Ontology (GO) and Plant Ontology, as appropriate. Unlike model organism species, agricultural species have a body of literature that does not just focus on gene function; to improve efficiency, we use text mining to identify literature for curation. The first component of our annotation interface is the gene prioritization interface that ranks gene products for annotation. Biocurators select the top-ranked gene and mark annotation for these genes as ‘in progress’ or ‘completed’; links enable biocurators to move directly to our biocuration interface (BI). Our BI includes all current GO annotation for gene products and is the main interface to add/modify AgBase curation data. The BI also displays Extracting Genic Information from Text (eGIFT) results for each gene product. eGIFT is a web-based, text-mining tool that associates ranked, informative terms (iTerms) and the articles and sentences containing them, with genes. Moreover, iTerms are linked to GO terms, where they match either a GO term name or a synonym. This enables AgBase biocurators to rapidly identify literature for further curation based on possible GO terms. Because most agricultural species do not have standardized literature, eGIFT searches all gene names and synonyms to associate articles with genes. As many of the gene names can be ambiguous, eGIFT applies a disambiguation step to remove matches that do not correspond to this gene, and filtering is applied to remove abstracts that mention a gene in passing. The BI is linked to our Journal Database (JDB) where corresponding journal citations are stored. Just as importantly, biocurators also add to the JDB citations that have no GO annotation. The AgBase BI also supports bulk annotation upload to facilitate our Inferred from electronic annotation of agricultural gene products. All annotations must pass standard GO Consortium quality checking before release in Ag

  16. Developing a biocuration workflow for AgBase, a non-model organism database.

    PubMed

    Pillai, Lakshmi; Chouvarine, Philippe; Tudor, Catalina O; Schmidt, Carl J; Vijay-Shanker, K; McCarthy, Fiona M

    2012-01-01

    AgBase provides annotation for agricultural gene products using the Gene Ontology (GO) and Plant Ontology, as appropriate. Unlike model organism species, agricultural species have a body of literature that does not just focus on gene function; to improve efficiency, we use text mining to identify literature for curation. The first component of our annotation interface is the gene prioritization interface that ranks gene products for annotation. Biocurators select the top-ranked gene and mark annotation for these genes as 'in progress' or 'completed'; links enable biocurators to move directly to our biocuration interface (BI). Our BI includes all current GO annotation for gene products and is the main interface to add/modify AgBase curation data. The BI also displays Extracting Genic Information from Text (eGIFT) results for each gene product. eGIFT is a web-based, text-mining tool that associates ranked, informative terms (iTerms) and the articles and sentences containing them, with genes. Moreover, iTerms are linked to GO terms, where they match either a GO term name or a synonym. This enables AgBase biocurators to rapidly identify literature for further curation based on possible GO terms. Because most agricultural species do not have standardized literature, eGIFT searches all gene names and synonyms to associate articles with genes. As many of the gene names can be ambiguous, eGIFT applies a disambiguation step to remove matches that do not correspond to this gene, and filtering is applied to remove abstracts that mention a gene in passing. The BI is linked to our Journal Database (JDB) where corresponding journal citations are stored. Just as importantly, biocurators also add to the JDB citations that have no GO annotation. The AgBase BI also supports bulk annotation upload to facilitate our Inferred from electronic annotation of agricultural gene products. All annotations must pass standard GO Consortium quality checking before release in AgBase. 
Database URL

  17. SelenoDB 2.0: annotation of selenoprotein genes in animals and their genetic diversity in humans.

    PubMed

    Romagné, Frédéric; Santesmasses, Didac; White, Louise; Sarangi, Gaurab K; Mariotti, Marco; Hübler, Ron; Weihmann, Antje; Parra, Genís; Gladyshev, Vadim N; Guigó, Roderic; Castellano, Sergi

    2014-01-01

    SelenoDB (http://www.selenodb.org) aims to provide high-quality annotations of selenoprotein genes, proteins and SECIS elements. Selenoproteins are proteins that contain the amino acid selenocysteine (Sec) and the first release of the database included annotations for eight species. Since the release of SelenoDB 1.0 many new animal genomes have been sequenced. The annotations of selenoproteins in new genomes usually contain many errors in major databases. For this reason, we have now fully annotated selenoprotein genes in 58 animal genomes. We provide manually curated annotations for human selenoproteins, whereas we use an automatic annotation pipeline to annotate selenoprotein genes in other animal genomes. In addition, we annotate the homologous genes containing cysteine (Cys) instead of Sec. Finally, we have surveyed genetic variation in the annotated genes in humans. We use exon capture and resequencing approaches to identify single-nucleotide polymorphisms in more than 50 human populations around the world. We thus present a detailed view of the genetic divergence of Sec- and Cys-containing genes in animals and their diversity in humans. The addition of these datasets into the second release of the database provides a valuable resource for addressing medical and evolutionary questions in selenium biology.

  18. Perioperative outcomes following radical prostatectomy for patients with disseminated cancer: An analysis of the National Surgical Quality Improvement Program database

    PubMed Central

    Satkunasivam, Raj; Wallis, Christopher J.D.; Byrne, James; Hoffman, Azik; Cheung, Douglas C.; Kulkarni, Girish S.; Nathens, Avery B.; Nam, Robert K.

    2016-01-01

    Introduction We sought to determine whether patients undergoing radical prostatectomy (RP) in the context of disseminated cancer have higher 30-day complications. Methods We conducted a retrospective cohort study of the National Surgical Quality Improvement Program (NSQIP) database. Men undergoing RP (from January 1, 2005 to December 31, 2014) for prostate cancer were identified and stratified by presence (n=97) or absence (n=27 868) of disseminated cancer. The primary outcome was major complications (death, re-operation, cardiac or neurologic events) within 30 days of surgery. Secondary outcomes included pulmonary, infectious, venous thromboembolic, and bleeding complications; prolonged length of stay; and concomitant procedures (bowel-related, cystectomy, urinary diversion, and major ureteric reconstruction). Odds ratios (OR) for each complication were calculated using univariable logistic regression. Results We did not identify a difference in major complication rates (OR 2.26, 95% confidence interval [CI] 0.71–7.16). Patients with disseminated cancer had increased risk of venous thromboembolic events (OR 3.30, 95% CI 1.04–10.48) and transfusion (OR 2.45, 95% CI 1.18–5.05), but similar odds of pulmonary and infectious complications and length of stay. Bowel procedures were rare; however, a significantly higher proportion of patients with disseminated cancer required bowel procedures (2.1% vs. 0.3%; p=0.03). Patients with disseminated cancer undergoing RP had greater comorbidities and higher predicted probability of morbidity and mortality. This study is limited by its retrospective design, lack of cancer-specific variables, and prostatectomy-specific complications. Conclusions RP in the context of disseminated cancer may be associated with increased perioperative complications. Caution should be exercised in embarking on this practice outside of clinical trials. PMID:28096918
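
    For readers unfamiliar with how an odds ratio and its confidence interval are obtained, the following sketch computes both from a 2x2 table. With a single binary predictor, a univariable logistic regression estimates the same odds ratio as this cross-product, and its Wald interval matches the formula used here. The function name and the counts are hypothetical; they are not taken from the NSQIP analysis.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald 95% CI for a 2x2 table.

    a, b: events / non-events in the exposed group
    c, d: events / non-events in the unexposed group
    """
    or_ = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # standard error of log(OR)
    lo = math.exp(math.log(or_) - z * se)
    hi = math.exp(math.log(or_) + z * se)
    return or_, lo, hi


# Hypothetical counts for illustration only.
or_, lo, hi = odds_ratio_ci(a=10, b=90, c=50, d=950)
print(round(or_, 2), round(lo, 2), round(hi, 2))
```

    An interval that includes 1, as for the major-complication result quoted above, means the data are compatible with no difference in odds; small exposed groups produce wide intervals because the 1/a and 1/b terms dominate the standard error.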

  19. Text mining effectively scores and ranks the literature for improving chemical-gene-disease curation at the comparative toxicogenomics database.

    PubMed

    Davis, Allan Peter; Wiegers, Thomas C; Johnson, Robin J; Lay, Jean M; Lennon-Hopkins, Kelley; Saraceni-Richards, Cynthia; Sciaky, Daniela; Murphy, Cynthia Grondin; Mattingly, Carolyn J

    2013-01-01

    The Comparative Toxicogenomics Database (CTD; http://ctdbase.org/) is a public resource that curates interactions between environmental chemicals and gene products, and their relationships to diseases, as a means of understanding the effects of environmental chemicals on human health. CTD provides a triad of core information in the form of chemical-gene, chemical-disease, and gene-disease interactions that are manually curated from scientific articles. To increase the efficiency, productivity, and data coverage of manual curation, we have leveraged text mining to help rank and prioritize the triaged literature. Here, we describe our text-mining process that computes and assigns each article a document relevancy score (DRS), wherein a high DRS suggests that an article is more likely to be relevant for curation at CTD. We evaluated our process by first text mining a corpus of 14,904 articles triaged for seven heavy metals (cadmium, cobalt, copper, lead, manganese, mercury, and nickel). Based upon initial analysis, a representative subset corpus of 3,583 articles was then selected from the 14,094 articles and sent to five CTD biocurators for review. The resulting curation of these 3,583 articles was analyzed for a variety of parameters, including article relevancy, novel data content, interaction yield rate, mean average precision, and biological and toxicological interpretability. We show that for all measured parameters, the DRS is an effective indicator for scoring and improving the ranking of literature for the curation of chemical-gene-disease information at CTD. Here, we demonstrate how fully incorporating text mining-based DRS scoring into our curation pipeline enhances manual curation by prioritizing more relevant articles, thereby increasing data content, productivity, and efficiency.
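
    Mean average precision, one of the evaluation parameters named above, rewards rankings that place relevant articles near the top. The sketch below shows the standard calculation on binary relevance labels; the function names and the two example rankings are hypothetical and do not reproduce CTD's scoring or data.

```python
def average_precision(ranked_relevance):
    """Average precision of one ranked list of 0/1 relevance labels."""
    hits = 0
    total = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            total += hits / rank  # precision at this relevant item's rank
    return total / hits if hits else 0.0


def mean_average_precision(rankings):
    """Mean of the average precision over several ranked lists."""
    return sum(average_precision(r) for r in rankings) / len(rankings)


# Hypothetical rankings: articles sorted by descending score,
# 1 = a curator judged the article relevant, 0 = not relevant.
good = [1, 1, 0, 1, 0]
poor = [0, 0, 1, 0, 1]

print(round(average_precision(good), 3))
print(round(average_precision(poor), 3))
print(round(mean_average_precision([good, poor]), 3))
```

    A relevancy score that is a useful triage signal should produce rankings closer to the first list than the second, so that curators reach most of the relevant articles early.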

  20. Caliper Context Annotation Library

    SciTech Connect

    2015-09-30

    To understand the performance of parallel programs, developers need to be able to relate performance measurement data to context information, such as the call path, line numbers, or iteration numbers where measurements were taken. Caliper provides a generic way to specify and collect multi-dimensional context information across the software stack, and provide it to third-party measurement tools or write it into a file or database in the form of context streams.

  1. Mouse genome annotation by the RefSeq project.

    PubMed

    McGarvey, Kelly M; Goldfarb, Tamara; Cox, Eric; Farrell, Catherine M; Gupta, Tripti; Joardar, Vinita S; Kodali, Vamsi K; Murphy, Michael R; O'Leary, Nuala A; Pujar, Shashikant; Rajput, Bhanu; Rangwala, Sanjida H; Riddick, Lillian D; Webb, David; Wright, Mathew W; Murphy, Terence D; Pruitt, Kim D

    2015-10-01

    Complete and accurate annotation of the mouse genome is critical to the advancement of research conducted on this important model organism. The National Center for Biotechnology Information (NCBI) develops and maintains many useful resources to assist the mouse research community. In particular, the reference sequence (RefSeq) database provides high-quality annotation of multiple mouse genome assemblies using a combinatorial approach that leverages computation, manual curation, and collaboration. Implementation of this conservative and rigorous approach, which focuses on representation of only full-length and non-redundant data, produces high-quality annotation products. RefSeq records explicitly link sequences to current knowledge in a timely manner, updating public records regularly and rapidly in response to nomenclature updates, addition of new relevant publications, collaborator discussion, and user feedback. Whole genome re-annotation is also conducted at least every 12-18 months, and often more frequently in response to assembly updates or availability of informative data. This article highlights key features and advantages of RefSeq genome annotation products and presents an overview of NCBI processes to generate these data. Further discussion of NCBI's resources highlights useful features and the best methods for accessing our data.

  2. APPRIS: annotation of principal and alternative splice isoforms.

    PubMed

    Rodriguez, Jose Manuel; Maietta, Paolo; Ezkurdia, Iakes; Pietrelli, Alessandro; Wesselink, Jan-Jaap; Lopez, Gonzalo; Valencia, Alfonso; Tress, Michael L

    2013-01-01

    Here, we present APPRIS (http://appris.bioinfo.cnio.es), a database that houses annotations of human splice isoforms. APPRIS has been designed to provide value to manual annotations of the human genome by adding reliable protein structural and functional data and information from cross-species conservation. The visual representation of the annotations provided by APPRIS for each gene allows annotators and researchers alike to easily identify functional changes brought about by splicing events. In addition to collecting, integrating and analyzing reliable predictions of the effect of splicing events, APPRIS also selects a single reference sequence for each gene, here termed the principal isoform, based on the annotations of structure, function and conservation for each transcript. APPRIS identifies a principal isoform for 85% of the protein-coding genes in the GENCODE 7 release for ENSEMBL. Analysis of the APPRIS data shows that at least 70% of the alternative (non-principal) variants would lose important functional or structural information relative to the principal isoform.

  3. Automated Gene Ontology annotation for anonymous sequence data.

    PubMed

    Hennig, Steffen; Groth, Detlef; Lehrach, Hans

    2003-07-01

    Gene Ontology (GO) is the most widely accepted attempt to construct a unified and structured vocabulary for the description of genes and their products in any organism. Annotation by GO terms is performed in most of the current genome projects, which besides generality has the advantage of being very convenient for computer based classification methods. However, direct use of GO in small sequencing projects is not easy, especially for species not commonly represented in public databases. We present a software package (GOblet), which performs annotation based on GO terms for anonymous cDNA or protein sequences. It uses the species independent GO structure and vocabulary together with a series of protein databases collected from various sites, to perform a detailed GO annotation by sequence similarity searches. The sensitivity and the reference protein sets can be selected by the user. GOblet runs automatically and is available as a public service on our web server. The paper also addresses the reliability of automated GO annotations by using a reference set of more than 6000 human proteins. The GOblet server is accessible at http://goblet.molgen.mpg.de.

  4. Enriching a biomedical event corpus with meta-knowledge annotation

    PubMed Central

    2011-01-01

    part of the search criteria. This can assist in a number of important tasks, e.g., finding new experimental knowledge to facilitate database curation, enabling textual inference to detect entailments and contradictions, etc. To our knowledge, our scheme is unique within the field with regard to the diversity of meta-knowledge aspects annotated for each event. PMID:21985429

  5. MaizeGDB, the maize model organism database

    Technology Transfer Automated Retrieval System (TEKTRAN)

    MaizeGDB is the maize research community's database for maize genetic and genomic information. In this seminar I will outline our current endeavors including a full website redesign, the status of maize genome assembly and annotation projects, and work toward genome functional annotation. Mechanis...

  6. Enhanced Acylcarnitine Annotation in High-Resolution Mass Spectrometry Data: Fragmentation Analysis for the Classification and Annotation of Acylcarnitines

    PubMed Central

    van der Hooft, Justin J. J.; Ridder, Lars; Barrett, Michael P.; Burgess, Karl E. V.

    2015-01-01

    Metabolite annotation and identification are primary challenges in untargeted metabolomics experiments. Rigorous workflows for reliable annotation of mass features with chemical structures or compound classes are needed to enhance the power of untargeted mass spectrometry. High-resolution mass spectrometry considerably improves the confidence in assigning elemental formulas to mass features in comparison to nominal mass spectrometry, and embedding of fragmentation methods enables more reliable metabolite annotations and facilitates metabolite classification. However, the analysis of mass fragmentation spectra can be a time-consuming step and requires expert knowledge. This study demonstrates how characteristic fragmentations, specific to compound classes, can be used to systematically analyze their presence in complex biological extracts like urine that have undergone untargeted mass spectrometry combined with data dependent or targeted fragmentation. Human urine extracts were analyzed using normal phase liquid chromatography (hydrophilic interaction chromatography) coupled to an Ion Trap-Orbitrap hybrid instrument. Subsequently, mass chromatograms and collision-induced dissociation and higher-energy collisional dissociation (HCD) fragments were annotated using the freely available MAGMa software. Acylcarnitines play a central role in energy metabolism by transporting fatty acids into the mitochondrial matrix. By filtering on a combination of a mass fragment and neutral loss designed based on the MAGMa fragment annotations, we were able to classify and annotate 50 acylcarnitines in human urine extracts, based on high-resolution mass spectrometry HCD fragmentation spectra at different energies for all of them. Of these annotated acylcarnitines, 31 are not described in HMDB yet and for only 4 annotated acylcarnitines the fragmentation spectra could be matched to reference spectra. 
Therefore, we conclude that the use of mass fragmentation filters within the context

  7. Patient Education: An Annotated Bibliography.

    ERIC Educational Resources Information Center

    Simmons, Jeannette

    Topics included in this annotated bibliography on patient education are (1) background on development of patient education programs, (2) patient education interventions, (3) references for health professionals, and (4) research and evaluation in patient education. (TA)

  8. Publication Production: An Annotated Bibliography.

    ERIC Educational Resources Information Center

    Firman, Anthony H.

    1994-01-01

    Offers brief annotations of 52 articles and papers on document production (from the Society for Technical Communication's journal and proceedings) on 9 topics: information processing, document design, using color, typography, tables, illustrations, photography, printing and binding, and production management. (SR)

  9. Comparative Omics-Driven Genome Annotation Refinement: Application across Yersiniae

    SciTech Connect

    Rutledge, Alexandra C.; Jones, Marcus B.; Chauhan, Sadhana; Purvine, Samuel O.; Sanford, James; Monroe, Matthew E.; Brewer, Heather M.; Payne, Samuel H.; Ansong, Charles; Frank, Bryan C.; Smith, Richard D.; Peterson, Scott; Motin, Vladimir L.; Adkins, Joshua N.

    2012-03-27

    Genome sequencing continues to be a rapidly evolving technology, yet most downstream aspects of genome annotation pipelines remain relatively stable or are even being abandoned. To date, the perceived value of manual curation for genome annotations is not offset by the real cost and time associated with the process. In order to balance the large number of sequences generated, the annotation process is now performed almost exclusively in an automated fashion for most genome sequencing projects. One possible way to reduce errors inherent to automated computational annotations is to apply data from 'omics' measurements (i.e. transcriptional and proteomic) to the un-annotated genome with a proteogenomic-based approach. This approach does require additional experimental and bioinformatics methods to include omics technologies; however, the approach is readily automatable and can benefit from rapid developments occurring in those research domains as well. The annotation process can be improved by experimental validation of transcription and translation and aid in the discovery of annotation errors. Here the concept of annotation refinement has been extended to include a comparative assessment of genomes across closely related species, as is becoming common in sequencing efforts. Transcriptomic and proteomic data derived from three highly similar pathogenic Yersiniae (Y. pestis CO92, Y. pestis pestoides F, and Y. pseudotuberculosis PB1/+) were used to demonstrate a comprehensive comparative omic-based annotation methodology. Peptide and oligo measurements experimentally validated the expression of nearly 40% of each strain's predicted proteome and revealed the identification of 28 novel and 68 previously incorrect protein-coding sequences (e.g., observed frameshifts, extended start sites, and translated pseudogenes) within the three current Yersinia genome annotations. Gene loss is presumed to play a major role in Y. pestis acquiring its niche as a virulent pathogen, thus

  10. NCBI prokaryotic genome annotation pipeline.

    PubMed

    Tatusova, Tatiana; DiCuccio, Michael; Badretdin, Azat; Chetvernin, Vyacheslav; Nawrocki, Eric P; Zaslavsky, Leonid; Lomsadze, Alexandre; Pruitt, Kim D; Borodovsky, Mark; Ostell, James

    2016-08-19

    Recent technological advances have opened unprecedented opportunities for large-scale sequencing and analysis of populations of pathogenic species in disease outbreaks, as well as for large-scale diversity studies aimed at expanding our knowledge across the whole domain of prokaryotes. To meet the challenge of timely interpretation of structure, function and meaning of this vast genetic information, a comprehensive approach to automatic genome annotation is critically needed. In collaboration with Georgia Tech, NCBI has developed a new approach to genome annotation that combines alignment-based methods with methods of predicting protein-coding and RNA genes and other functional elements directly from sequence. A new gene finding tool, GeneMarkS+, uses the combined evidence of protein and RNA placement by homology as an initial map of annotation to generate and modify ab initio gene predictions across the whole genome. Thus, NCBI's new Prokaryotic Genome Annotation Pipeline (PGAP) relies more on sequence similarity when confident comparative data are available, while it relies more on statistical predictions in the absence of external evidence. The pipeline provides a framework for generation and analysis of annotation on the full breadth of prokaryotic taxonomy. For additional information on PGAP see https://www.ncbi.nlm.nih.gov/genome/annotation_prok/ and the NCBI Handbook, https://www.ncbi.nlm.nih.gov/books/NBK174280/.

  11. The Pancreatic Expression database: 2011 update

    PubMed Central

    Cutts, Rosalind J.; Gadaleta, Emanuela; Hahn, Stephan A.; Crnogorac-Jurcevic, Tatjana; Lemoine, Nicholas R.; Chelala, Claude

    2011-01-01

    The Pancreatic Expression database (PED, http://www.pancreasexpression.org) has established itself as the main repository for pancreatic-derived -omics data. For the past 3 years, its data content and access have increased substantially. Here we describe several of its new and improved features, such as data content, which now includes over 60 000 measurements derived from transcriptomics, proteomics, genomics and miRNA profiles from various pancreas-centred reports on a broad range of specimen and experimental types. We also illustrate the capabilities of its interface, which allows integrative queries that can combine PED data with a growing number of biological resources such as NCBI, Ensembl, UniProt and Reactome. Thus, PED is capable of retrieving and integrating different types of -omics, annotations and clinical data. We also focus on the importance of data sharing and interoperability in the cancer field, and the integration of PED into the International Cancer Genome Consortium (ICGC) data portal. PMID:20959292

  12. The Pancreatic Expression database: 2011 update.

    PubMed

    Cutts, Rosalind J; Gadaleta, Emanuela; Hahn, Stephan A; Crnogorac-Jurcevic, Tatjana; Lemoine, Nicholas R; Chelala, Claude

    2011-01-01

    The Pancreatic Expression database (PED, http://www.pancreasexpression.org) has established itself as the main repository for pancreatic-derived -omics data. For the past 3 years, its data content and access have increased substantially. Here we describe several of its new and improved features, such as data content, which now includes over 60,000 measurements derived from transcriptomics, proteomics, genomics and miRNA profiles from various pancreas-centred reports on a broad range of specimen and experimental types. We also illustrate the capabilities of its interface, which allows integrative queries that can combine PED data with a growing number of biological resources such as NCBI, Ensembl, UniProt and Reactome. Thus, PED is capable of retrieving and integrating different types of -omics, annotations and clinical data. We also focus on the importance of data sharing and interoperability in the cancer field, and the integration of PED into the International Cancer Genome Consortium (ICGC) data portal.

  13. The Aerospace Database data element dictionary with issues and recommendations from the meetings of July 24-25, August 13-14, and September 24-25, 1991

    NASA Technical Reports Server (NTRS)

    1991-01-01

    The present volume contains descriptions of the individual fields (data elements) which comprise the bibliographic records of the Aerospace Database. Indexes by field name and field mnemonic are provided. In addition, the issues and recommendations defined by the NASA STI Database Upgrade Working Group are included as annotations to the individual field descriptions and are listed at the end of the volume. The activities of the Working Group were initiated by the NASA STI Program Coordinating Council as part of an effort to improve overall database quality.

  14. Uncertainty modeling for ontology-based mammography annotation with intelligent BI-RADS scoring.

    PubMed

    Bulu, Hakan; Alpkocak, Adil; Balci, Pinar

    2013-05-01

    This paper presents an ontology-based annotation system and BI-RADS (Breast Imaging Reporting and Data System) score reasoning with Semantic Web technologies in mammography. The annotation system is based on the Mammography Annotation Ontology (MAO) where the BI-RADS score reasoning works. However, ontologies are based on crisp logic and they cannot handle uncertainty. Consequently, we propose a Bayesian-based approach to model uncertainty in mammography ontology and make reasoning possible using BI-RADS scores with SQWRL (Semantic Query-enhanced Web Rule Language). First, we give general information about our system and present details of the mammography annotation ontology, its main concepts and relationships. Then, we express uncertainty in mammography and present approaches to handle uncertainty issues. The system is evaluated with a manually annotated dataset, DEMS (Dokuz Eylul University Mammography Set), and DDSM (Digital Database for Screening Mammography). We report the experimental results in terms of accuracy, sensitivity, precision and uncertainty level measures.

  15. Protein function prediction and annotation in an integrated environment powered by web services (AFAWE).

    PubMed

    Jöcker, Anika; Hoffmann, Fabian; Groscurth, Andreas; Schoof, Heiko

    2008-10-15

    Many sequenced genes are mainly annotated through automatic transfer of annotation from similar sequences. Manual comparison of results or intermediate results from different tools can help avoid wrong annotations and give hints to the function of a gene even if none of the automated tools could return any result. AFAWE simplifies the task of manual functional annotation by running different tools and workflows for automatic function prediction and displaying the results in a way that facilitates comparison. Because all programs are executed as web services, AFAWE is easily extensible and can directly query primary databases, thereby always using the most up-to-date data sources. Visual filters help to distinguish trustworthy results from non-significant results. Furthermore, an interface to add detailed manual annotation to each gene is provided, which can be displayed to other users.

  16. Genomic variant annotation workflow for clinical applications

    PubMed Central

    Thurnherr, Thomas; Singer, Franziska; Stekhoven, Daniel J.; Beerenwinkel, Niko

    2016-01-01

    Annotation and interpretation of DNA aberrations identified through next-generation sequencing is becoming an increasingly important task, even more so in the context of data analysis pipelines for medical applications, where genomic aberrations are associated with phenotypic and clinical features. Here we describe a workflow to identify potential gene targets in aberrated genes or pathways and their corresponding drugs. To this end, we provide the R/Bioconductor package rDGIdb, an R wrapper to query the drug-gene interaction database (DGIdb). DGIdb accumulates drug-gene interaction data from 15 different resources and allows filtering on different levels. The rDGIdb package makes these resources and tools available to R users. Moreover, rDGIdb queries can be automated through incorporation of the rDGIdb package into NGS pipelines. PMID:27990260

  17. VitisExpDB: A Database Resource for Grape Functional Genomics

    Technology Transfer Automated Retrieval System (TEKTRAN)

    VitisExpDB is an online MySQL-PHP driven relational database that houses annotated EST and gene expression data for Vitis vinifera and non-vinifera grape varieties. Currently, the database stores ~320,000 EST sequences derived from 8 species/hybrids, their annotation details and gene ontology based...

  18. HPIDB 2.0: a curated database for host-pathogen interactions.

    PubMed

    Ammari, Mais G; Gresham, Cathy R; McCarthy, Fiona M; Nanduri, Bindu

    2016-01-01

    Identification and analysis of host-pathogen interactions (HPI) is essential to study infectious diseases. However, HPI data are sparse in existing molecular interaction databases, especially for agricultural host-pathogen systems. Therefore, resources that annotate, predict and display the HPI that underpin infectious diseases are critical for developing novel intervention strategies. HPIDB 2.0 (http://www.agbase.msstate.edu/hpi/main.html) is a resource for HPI data, and contains 45,238 manually curated entries in the current release. Since the first description of the database in 2010, multiple enhancements to HPIDB data and interface services were made that are described here. Notably, HPIDB 2.0 now provides targeted biocuration of molecular interaction data. As a member of the International Molecular Exchange consortium, annotations provided by HPIDB 2.0 curators meet community standards to provide detailed contextual experimental information and facilitate data sharing. Moreover, HPIDB 2.0 provides access to rapidly available community annotations that capture minimum molecular interaction information to address immediate researcher needs for HPI network analysis. In addition to curation, HPIDB 2.0 integrates HPI from existing external sources and contains tools to infer additional HPI where annotated data are scarce. Compared to other interaction databases, our data collection approach ensures HPIDB 2.0 users access the most comprehensive HPI data from a wide range of pathogens and their hosts (594 pathogen and 70 host species, as of February 2016). Improvements also include enhanced search capacity, addition of Gene Ontology functional information, and implementation of network visualization. The changes made to HPIDB 2.0 content and interface ensure that users, especially agricultural researchers, are able to easily access and analyse high quality, comprehensive HPI data. All HPIDB 2.0 data are updated regularly, are publicly available for direct

  19. Protein Model Database

    SciTech Connect

    Fidelis, K; Adzhubej, A; Kryshtafovych, A; Daniluk, P

    2005-02-23

    The phenomenal success of the genome sequencing projects reveals the power of completeness in revolutionizing biological science. Currently it is possible to sequence entire organisms at a time, allowing for a systemic rather than fractional view of their organization and the various genome-encoded functions. There is an international plan to move towards a similar goal in the area of protein structure. This will not be achieved by experiment alone, but rather by a combination of efforts in crystallography, NMR spectroscopy, and computational modeling. Only a small fraction of structures are expected to be identified experimentally, the remainder to be modeled. Presently there is no organized infrastructure to critically evaluate and present these data to the biological community. The goal of the Protein Model Database project is to create such infrastructure, including (1) a public database of theoretically derived protein structures, (2) reliable annotation of protein model quality, (3) novel structure analysis tools, and (4) access to the highest quality modeling techniques available.

  20. An Annotated Guide and Interactive Database for Solo Horn Repertoire

    ERIC Educational Resources Information Center

    Schouten, Sarah

    2012-01-01

    Given the horn's lengthy history, it is not surprising that many scholars have examined the evolution of the instrument from the natural horn to the modern horn and its expansive repertoire. Numerous dissertations, theses, and treatises illuminate specific elements of the horn's solo repertoire; however, no scholar has produced a…

  1. Annotated EST database of Heliothis virescens hemocytic immune system transcripts

    Technology Transfer Automated Retrieval System (TEKTRAN)

    Genomic and proteomic approaches were applied to characterize the immunoproteome of Heliothis virescens. Larval hemocytic responses to bacterial and baculoviral infection were surveyed using expressed sequence tags (ESTs). 5349 ESTs formed 429 contigs, 258 singlets and 1104 singletons, totalling 1...

  2. Nutrient database improvement project: Separable components and proximate composition of retail cuts from the beef loin and round

    Technology Transfer Automated Retrieval System (TEKTRAN)

    Beef nutrition research has become increasingly important domestically and internationally for the beef industry and its consumers. The objective of this study was to analyze the nutrient composition of ten beef loin and round cuts to update the nutrient data in the USDA National Nutrient Database f...

  3. Improvement of the Cramer classification for oral exposure using the database TTC RepDose - A strategy description

    EPA Science Inventory

    The present report describes a strategy to refine the current Cramer classification of the TTC concept using a broad database (DB) termed TTC RepDose. Cramer classes 1-3 overlap to some extent, indicating a need for a better separation of structural classes likely to be toxic, mo...

  4. CTDB: An Integrated Chickpea Transcriptome Database for Functional and Applied Genomics

    PubMed Central

    Patel, Ravi K.; Garg, Rohini; Jain, Mukesh

    2015-01-01

    Chickpea is an important grain legume used as a rich source of protein in the human diet. The narrow genetic diversity and limited availability of genomic resources are the major constraints in implementing breeding strategies and biotechnological interventions for genetic enhancement of chickpea. We developed an integrated Chickpea Transcriptome Database (CTDB), which provides a comprehensive web interface for visualization and easy retrieval of transcriptome data in chickpea. The database features many tools for similarity search, functional annotation (putative function, PFAM domain and gene ontology) search and comparative gene expression analysis. The current release of CTDB (v2.0) hosts transcriptome datasets with high quality functional annotation from cultivated (desi and kabuli types) and wild chickpea. A catalog of transcription factor families and their expression profiles in chickpea is available in the database. The gene expression data have been integrated to study the expression profiles of chickpea transcripts in major tissues/organs and various stages of flower development. The utilities, such as similarity search, ortholog identification and comparative gene expression, have also been implemented in the database to facilitate comparative genomic studies among different legumes and Arabidopsis. Furthermore, the CTDB represents a resource for the discovery of functional molecular markers (microsatellites and single nucleotide polymorphisms) between different chickpea types. We anticipate that the integrated information content of this database will accelerate functional and applied genomic research for the improvement of chickpea. The CTDB web service is freely available at http://nipgr.res.in/ctdb.html. PMID:26322998

  5. The Genome Database for Rosaceae (GDR): year 10 update.

    PubMed

    Jung, Sook; Ficklin, Stephen P; Lee, Taein; Cheng, Chun-Huai; Blenda, Anna; Zheng, Ping; Yu, Jing; Bombarely, Aureliano; Cho, Ilhyung; Ru, Sushan; Evans, Kate; Peace, Cameron; Abbott, Albert G; Mueller, Lukas A; Olmstead, Mercy A; Main, Dorrie

    2014-01-01

    The Genome Database for Rosaceae (GDR, http://www.rosaceae.org), the long-standing central repository and data mining resource for Rosaceae research, has been enhanced with new genomic, genetic and breeding data, and improved functionality. Whole genome sequences of apple, peach and strawberry are available to browse or download with a range of annotations, including gene model predictions, aligned transcripts, repetitive elements, polymorphisms, mapped genetic markers, mapped NCBI Rosaceae genes, gene homologs and association of InterPro protein domains, GO terms and Kyoto Encyclopedia of Genes and Genomes pathway terms. Annotated sequences can be queried using search interfaces and visualized using GBrowse. New expressed sequence tag unigene sets are available for major genera, and Pathway data are available through FragariaCyc, AppleCyc and PeachCyc databases. Synteny among the three sequenced genomes can be viewed using GBrowse_Syn. New markers, genetic maps and extensively curated qualitative/Mendelian and quantitative trait loci are available. Phenotype and genotype data from breeding projects and genetic diversity projects are also included. Improved search pages are available for marker, trait locus, genetic diversity and publication data. New search tools for breeders enable selection comparison and assistance with breeding decision making.

  6. The Genome Database for Rosaceae (GDR): year 10 update

    PubMed Central

    Jung, Sook; Ficklin, Stephen P.; Lee, Taein; Cheng, Chun-Huai; Blenda, Anna; Zheng, Ping; Yu, Jing; Bombarely, Aureliano; Cho, Ilhyung; Ru, Sushan; Evans, Kate; Peace, Cameron; Abbott, Albert G.; Mueller, Lukas A.; Olmstead, Mercy A.; Main, Dorrie

    2014-01-01

    The Genome Database for Rosaceae (GDR, http://www.rosaceae.org), the long-standing central repository and data mining resource for Rosaceae research, has been enhanced with new genomic, genetic and breeding data, and improved functionality. Whole genome sequences of apple, peach and strawberry are available to browse or download with a range of annotations, including gene model predictions, aligned transcripts, repetitive elements, polymorphisms, mapped genetic markers, mapped NCBI Rosaceae genes, gene homologs and association of InterPro protein domains, GO terms and Kyoto Encyclopedia of Genes and Genomes pathway terms. Annotated sequences can be queried using search interfaces and visualized using GBrowse. New expressed sequence tag unigene sets are available for major genera, and Pathway data are available through FragariaCyc, AppleCyc and PeachCyc databases. Synteny among the three sequenced genomes can be viewed using GBrowse_Syn. New markers, genetic maps and extensively curated qualitative/Mendelian and quantitative trait loci are available. Phenotype and genotype data from breeding projects and genetic diversity projects are also included. Improved search pages are available for marker, trait locus, genetic diversity and publication data. New search tools for breeders enable selection comparison and assistance with breeding decision making. PMID:24225320

  7. Stackfile Database

    NASA Technical Reports Server (NTRS)

    deVarvalho, Robert; Desai, Shailen D.; Haines, Bruce J.; Kruizinga, Gerhard L.; Gilmer, Christopher

    2013-01-01

    This software provides storage, retrieval, and analysis functionality for managing satellite altimetry data. It improves the efficiency and analysis capabilities of existing database software with improved flexibility and documentation. It offers flexibility in the type of data that can be stored. There is efficient retrieval either across the spatial domain or the time domain. Built-in analysis tools are provided for frequently performed altimetry tasks. This software package is used for storing and manipulating satellite measurement data. It was developed with a focus on handling the requirements of repeat-track altimetry missions such as Topex and Jason. It was, however, designed to work with a wide variety of satellite measurement data (e.g., Gravity Recovery And Climate Experiment, GRACE). The software consists of several command-line tools for importing, retrieving, and analyzing satellite measurement data.

  8. MetaBase—the wiki-database of biological databases

    PubMed Central

    Bolser, Dan M.; Chibon, Pierre-Yves; Palopoli, Nicolas; Gong, Sungsam; Jacob, Daniel; Angel, Victoria Dominguez Del; Swan, Dan; Bassi, Sebastian; González, Virginia; Suravajhala, Prashanth; Hwang, Seungwoo; Romano, Paolo; Edwards, Rob; Bishop, Bryan; Eargle, John; Shtatland, Timur; Provart, Nicholas J.; Clements, Dave; Renfro, Daniel P.; Bhak, Daeui; Bhak, Jong

    2012-01-01

    Biology is generating more data than ever. As a result, there is an ever increasing number of publicly available databases that analyse, integrate and summarize the available data, providing an invaluable resource for the biological community. As this trend continues, there is a pressing need to organize, catalogue and rate these resources, so that the information they contain can be most effectively exploited. MetaBase (MB) (http://MetaDatabase.Org) is a community-curated database containing more than 2000 commonly used biological databases. Each entry is structured using templates and can carry various user comments and annotations. Entries can be searched, listed, browsed or queried. The database was created using the same MediaWiki technology that powers Wikipedia, allowing users to contribute on many different levels. The initial release of MB was derived from the content of the 2007 Nucleic Acids Research (NAR) Database Issue. Since then, approximately 100 databases have been manually collected from the literature, and users have added information for over 240 databases. MB is synchronized annually with the static Molecular Biology Database Collection provided by NAR. To date, there have been 19 significant contributors to the project; each one is listed as an author here to highlight the community aspect of the project. PMID:22139927

  9. Determining similarity of scientific entities in annotation datasets.

    PubMed

    Palma, Guillermo; Vidal, Maria-Esther; Haag, Eric; Raschid, Louiqa; Thor, Andreas

    2015-01-01

    Linked Open Data initiatives have made available a diversity of scientific collections where scientists have annotated entities in the datasets with controlled vocabulary terms from ontologies. Annotations encode scientific knowledge, which is captured in annotation datasets. Determining relatedness between annotated entities becomes a building block for pattern mining, e.g. identifying drug-drug relationships may depend on the similarity of the targets that interact with each drug. A diversity of similarity measures has been proposed in the literature to compute relatedness between a pair of entities. Each measure exploits some knowledge including the name, function, relationships with other entities, taxonomic neighborhood and semantic knowledge. We propose a novel general-purpose annotation similarity measure called 'AnnSim' that measures the relatedness between two entities based on the similarity of their annotations. We model AnnSim as a 1-1 maximum weight bipartite match and exploit properties of existing solvers to provide an efficient solution. We empirically study the performance of AnnSim on real-world datasets of drugs and disease associations from clinical trials and relationships between drugs and (genomic) targets. Using baselines that include a variety of measures, we identify where AnnSim can provide a deeper understanding of the semantics underlying the relatedness of a pair of entities or where it could lead to predicting new links or identifying potential novel patterns. Although AnnSim does not exploit knowledge or properties of a particular domain, its performance compares well with a variety of state-of-the-art domain-specific measures. Database URL: http://www.yeastgenome.org/
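
    The matching idea in the abstract above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract says AnnSim is modeled as a 1-1 maximum weight bipartite match solved with existing solvers, whereas this sketch substitutes a brute-force search that is only practical for very small annotation sets. The function name `annsim`, the pairwise similarity callback `sim` (assumed symmetric, with values in [0, 1]), and the normalization by the combined size of the two sets are assumptions made for illustration and are not stated in the abstract.

```python
from itertools import permutations


def annsim(annotations_a, annotations_b, sim):
    """Toy annotation-set similarity via a 1-1 maximum weight matching.

    annotations_a, annotations_b: annotation terms of the two entities.
    sim: hypothetical symmetric term-to-term similarity in [0, 1].
    The normalization below is an assumption, not taken from the abstract.
    """
    if not annotations_a or not annotations_b:
        return 0.0
    # Assign each term of the smaller set to a distinct term of the larger set.
    small, large = sorted((list(annotations_a), list(annotations_b)), key=len)
    best = 0.0
    # Brute force over all 1-1 assignments (stand-in for a real matching solver).
    for chosen in permutations(large, len(small)):
        total = sum(sim(x, y) for x, y in zip(small, chosen))
        best = max(best, total)
    # Assumed normalization: twice the matched weight over the total term count.
    return 2.0 * best / (len(small) + len(large))


# Hypothetical toy similarity: identical terms score 1.0, one related pair 0.5.
RELATED = {frozenset(("t2", "t3")): 0.5}


def toy_sim(x, y):
    if x == y:
        return 1.0
    return RELATED.get(frozenset((x, y)), 0.0)


score = annsim(["t1", "t2"], ["t1", "t3", "t4"], toy_sim)
print(round(score, 3))
```

    In this toy case the best matching pairs t1 with t1 and t2 with t3. For realistic set sizes the brute-force loop would be replaced by a polynomial-time assignment solver.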

  10. Determining similarity of scientific entities in annotation datasets

    PubMed Central

    Palma, Guillermo; Vidal, Maria-Esther; Haag, Eric; Raschid, Louiqa; Thor, Andreas

    2015-01-01

    Linked Open Data initiatives have made available a diversity of scientific collections where scientists have annotated entities in the datasets with controlled vocabulary terms from ontologies. Annotations encode scientific knowledge, which is captured in annotation datasets. Determining relatedness between annotated entities becomes a building block for pattern mining, e.g. identifying drug–drug relationships may depend on the similarity of the targets that interact with each drug. A diversity of similarity measures has been proposed in the literature to compute relatedness between a pair of entities. Each measure exploits some knowledge including the name, function, relationships with other entities, taxonomic neighborhood and semantic knowledge. We propose a novel general-purpose annotation similarity measure called ‘AnnSim’ that measures the relatedness between two entities based on the similarity of their annotations. We model AnnSim as a 1–1 maximum weight bipartite match and exploit properties of existing solvers to provide an efficient solution. We empirically study the performance of AnnSim on real-world datasets of drugs and disease associations from clinical trials and relationships between drugs and (genomic) targets. Using baselines that include a variety of measures, we identify where AnnSim can provide a deeper understanding of the semantics underlying the relatedness of a pair of entities or where it could lead to predicting new links or identifying potential novel patterns. Although AnnSim does not exploit knowledge or properties of a particular domain, its performance compares well with a variety of state-of-the-art domain-specific measures. Database URL: http://www.yeastgenome.org/ PMID:25725057

  11. VariOtator, a Software Tool for Variation Annotation with the Variation Ontology.

    PubMed

    Schaafsma, Gerard C P; Vihinen, Mauno

    2016-04-01

    The Variation Ontology (VariO) is used for describing and annotating types, effects, consequences, and mechanisms of variations. To facilitate easy and consistent annotations, the online application VariOtator was developed. For variation type annotations, VariOtator is fully automated, accepting variant descriptions in Human Genome Variation Society (HGVS) format, and generating VariO terms, either with or without full lineage, that is, all parent terms. When a coding DNA variant description with a reference sequence is provided, VariOtator checks the description first with Mutalyzer and then generates the predicted RNA and protein descriptions with their respective VariO annotations. For the other sublevels, function, structure, and property, annotations cannot be automated, and VariOtator generates annotation based on provided details. For VariO terms relating to structure and property, one can use attribute terms as modifiers and evidence code terms for annotating experimental evidence. There is an online batch version, and stand-alone batch versions to be used with a Leiden Open Variation Database (LOVD) download file. A SOAP Web service allows client programs to access VariOtator programmatically. Thus, systematic variation effect and type annotations can be efficiently generated to allow easy use and integration of variations and their consequences.

  12. Using comparative genome analysis to identify problems in annotated microbial genomes.

    PubMed

    Poptsova, Maria S; Gogarten, J Peter

    2010-07-01

    Genome annotation is a tedious task that is mostly done by automated methods; however, the accuracy of these approaches has been questioned since the beginning of the sequencing era. Genome annotation is a multilevel process, and errors can emerge at different stages: during sequencing, as a result of gene-calling procedures, and in the process of assigning gene functions. Missed or wrongly annotated genes differentially impact different types of analyses. Here we discuss and demonstrate how the methods of comparative genome analysis can refine annotations by locating missing orthologues. We also discuss possible reasons for errors and show that the second-generation annotation systems, which combine multiple gene-calling programs with similarity-based methods, perform much better than the first annotation tools. Since old errors may propagate to the newly sequenced genomes, we emphasize that the problem of continuously updating popular public databases is an urgent and unresolved one. Due to the progress in genome-sequencing technologies, automated annotation techniques will remain the main approach in the future. Researchers need to be aware of the existing errors in the annotation of even well-studied genomes, such as Escherichia coli, and consider additional quality control for their results.

  13. Adaptive clustering of image database (ACID) as an efficient tool for improving retrieval in a CBIR system.

    PubMed

    Reljin, Branimir; Zajić, Goran; Reljin, Nikola; Reljin, Irini

    2012-01-01

    The paper describes a content-based image retrieval (CBIR) system with relevance feedback (RF). Instead of the standard relevance feedback procedure, our system introduces adaptive clustering of the image database (ACID) according to the user's particular subjective needs. Images labeled by the user as relevant are collected in clusters, and their representative members, rather than all images in the database, are used in subsequent searches. In this way, some history of previous retrievals is embedded into the search process, enabling faster and more subjective retrieval. Moreover, clusters are adaptively updated after each retrieval session, following the user's actual needs. The efficiency of the proposed ACID system is tested with images from the Corel and MIT datasets.

  14. InterPro in 2017—beyond protein family and domain annotations

    PubMed Central

    Finn, Robert D.; Attwood, Teresa K.; Babbitt, Patricia C.; Bateman, Alex; Bork, Peer; Bridge, Alan J.; Chang, Hsin-Yu; Dosztányi, Zsuzsanna; El-Gebali, Sara; Fraser, Matthew; Gough, Julian; Haft, David; Holliday, Gemma L.; Huang, Hongzhan; Huang, Xiaosong; Letunic, Ivica; Lopez, Rodrigo; Lu, Shennan; Marchler-Bauer, Aron; Mi, Huaiyu; Mistry, Jaina; Natale, Darren A.; Necci, Marco; Nuka, Gift; Orengo, Christine A.; Park, Youngmi; Pesseat, Sebastien; Piovesan, Damiano; Potter, Simon C.; Rawlings, Neil D.; Redaschi, Nicole; Richardson, Lorna; Rivoire, Catherine; Sangrador-Vegas, Amaia; Sigrist, Christian; Sillitoe, Ian; Smithers, Ben; Squizzato, Silvano; Sutton, Granger; Thanki, Narmada; Thomas, Paul D; Tosatto, Silvio C. E.; Wu, Cathy H.; Xenarios, Ioannis; Yeh, Lai-Su; Young, Siew-Yit; Mitchell, Alex L.

    2017-01-01

    InterPro (http://www.ebi.ac.uk/interpro/) is a freely available database used to classify protein sequences into families and to predict the presence of important domains and sites. InterProScan is the underlying software that allows both protein and nucleic acid sequences to be searched against InterPro's predictive models, which are provided by its member databases. Here, we report recent developments with InterPro and its associated software, including the addition of two new databases (SFLD and CDD), and the functionality to include residue-level annotation and prediction of intrinsic disorder. These developments enrich the annotations provided by InterPro, increase the overall number of residues annotated and allow more specific functional inferences. PMID:27899635

  15. Enabling Ontology Based Semantic Queries in Biomedical Database Systems

    PubMed Central

    Zheng, Shuai; Wang, Fusheng; Lu, James; Saltz, Joel

    2013-01-01

    While current biomedical ontology repositories offer primitive query capabilities, it is difficult or cumbersome to support ontology-based semantic queries directly in semantically annotated biomedical databases. The problem may be largely attributed to the mismatch between the models of the ontologies and the databases, and the mismatch between the query interfaces of the two systems. To fully realize semantic query capabilities based on ontologies, we develop a system, DBOntoLink, to provide unified semantic query interfaces by extending database query languages. With DBOntoLink, semantic queries can be directly and naturally specified as extended functions of the database query languages without any programming needed. DBOntoLink is adaptable to different ontologies through customizations and supports major biomedical ontologies hosted at the NCBO BioPortal. We demonstrate the use of DBOntoLink in a real-world biomedical database with semantically annotated medical images. PMID:23404054

  16. DBGC: A Database of Human Gastric Cancer

    PubMed Central

    Wang, Chao; Zhang, Jun; Cai, Mingdeng; Zhu, Zhenggang; Gu, Wenjie; Yu, Yingyan; Zhang, Xiaoyan

    2015-01-01

    The Database of Human Gastric Cancer (DBGC) is a comprehensive database that integrates various human gastric cancer-related data resources. Human gastric cancer-related transcriptomics projects, proteomics projects, mutations, biomarkers and drug-sensitive genes from different sources were collected and unified in this database. Moreover, epidemiological statistics of gastric cancer patients in China and clinicopathological information annotated with gastric cancer cases were also integrated into the DBGC. We believe that this database will greatly facilitate research regarding human gastric cancer in many fields. DBGC is freely available at http://bminfor.tongji.edu.cn/dbgc/index.do PMID:26566288

  17. DBGC: A Database of Human Gastric Cancer.

    PubMed

    Wang, Chao; Zhang, Jun; Cai, Mingdeng; Zhu, Zhenggang; Gu, Wenjie; Yu, Yingyan; Zhang, Xiaoyan

    2015-01-01

    The Database of Human Gastric Cancer (DBGC) is a comprehensive database that integrates various human gastric cancer-related data resources. Human gastric cancer-related transcriptomics projects, proteomics projects, mutations, biomarkers and drug-sensitive genes from different sources were collected and unified in this database. Moreover, epidemiological statistics of gastric cancer patients in China and clinicopathological information annotated with gastric cancer cases were also integrated into the DBGC. We believe that this database will greatly facilitate research regarding human gastric cancer in many fields. DBGC is freely available at http://bminfor.tongji.edu.cn/dbgc/index.do.

  18. Extension to distributed annotation system: Summary and summaryplot commands.

    PubMed

    Chrysostomou, Charalambos; Brookes, Anthony J

    2015-01-01

    In recent years, the development of high-throughput sequencing technologies has provided an effective way to generate data from entire genomes and test variants from thousands of individuals. The information acquired from analysing the data generated by high-throughput sequencing technologies has provided useful insights into applications like whole-exome sequencing and targeted sequencing to discover the genetic cause of complex diseases and drug responses. The Distributed Annotation System (DAS) is one of the proposed solutions developed to share and unify biological data from multiple local and remote DAS annotation servers. Researchers can use DAS to request data from federated or centralised databases and integrate them into a unified view. Furthermore, with the use of Reference DAS servers, structural and sequence data can accompany annotation data in the pursuit of new knowledge about a particular feature or region. In this paper, two additional commands for the existing DAS protocol, summary and summaryplot, are proposed and implemented. The proposed commands give users the ability to request a summary of features for a particular region of interest. The summary command extends the capabilities of the current DAS protocol, while the summaryplot command provides a more user-friendly alternative to standard XML DAS responses. Finally, three examples are presented based on the GENCODE annotation data.
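
    To make the idea of a per-region feature summary concrete, here is a minimal Python sketch that counts how many features overlap each bin of a region. The function name, the tuple layout of a feature and the equal-width binning are assumptions for illustration; the abstract does not specify the request or response format of the proposed DAS commands, and nothing here reproduces them.

```python
def summarize_features(features, start, end, bins):
    """Count features overlapping each of `bins` equal-width bins in [start, end].

    `features` is an iterable of (feature_start, feature_end, feature_type)
    tuples with 1-based inclusive coordinates (a hypothetical layout).
    """
    width = (end - start + 1) / bins
    counts = [0] * bins
    for f_start, f_end, _ftype in features:
        # Skip features that lie entirely outside the region.
        if f_end < start or f_start > end:
            continue
        # Clip the feature to the region, then find the bins it touches.
        first = int((max(f_start, start) - start) // width)
        last = int((min(f_end, end) - start) // width)
        for b in range(first, min(last, bins - 1) + 1):
            counts[b] += 1
    return counts
```

    A list of counts like this is the kind of data a plotting command could render as a histogram instead of returning every feature as XML.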

  19. Resolving the problem of multiple accessions of the same transcript deposited across various public databases.

    PubMed

    Weirick, Tyler; John, David; Uchida, Shizuka

    2016-02-26

    Maintaining the consistency of genomic annotations is an increasingly complex task because of the iterative and dynamic nature of assembly and annotation, growing numbers of biological databases and insufficient integration of annotations across databases. As information exchange among databases is poor, a 'novel' sequence from one reference annotation could be annotated in another. Furthermore, relationships to nearby or overlapping annotated transcripts are even more complicated when using different genome assemblies. To better understand these problems, we surveyed current and previous versions of genomic assemblies and annotations across a number of public databases containing long noncoding RNA. We identified numerous discrepancies of transcripts regarding their genomic locations, transcript lengths and identifiers. Further investigation showed that the positional differences between reference annotations of essentially the same transcript could lead to differences in its measured expression at the RNA level. To aid in resolving these problems, we present the algorithm 'Universal Genomic Accession Hash (UGAHash)' and created an open source web tool to encourage the usage of the UGAHash algorithm. The UGAHash web tool (http://ugahash.uni-frankfurt.de) can be accessed freely without registration. The web tool allows researchers to generate Universal Genomic Accessions for genomic features or to explore annotations deposited in the public databases of the past and present versions. We anticipate that the UGAHash web tool will be a valuable tool to check for the existence of transcripts before judging the newly discovered transcripts as novel.

  20. ProbMetab: an R package for Bayesian probabilistic annotation of LC–MS-based metabolomics

    PubMed Central

    Silva, Ricardo R.; Jourdan, Fabien; Salvanha, Diego M.; Letisse, Fabien; Jamin, Emilien L.; Guidetti-Gonzalez, Simone; Labate, Carlos A.; Vêncio, Ricardo Z. N.

    2014-01-01

    Summary: We present ProbMetab, an R package that promotes substantial improvement in automatic probabilistic liquid chromatography–mass spectrometry-based metabolome annotation. The inference engine core is based on a Bayesian model implemented to (i) allow diverse sources of experimental data and metadata to be systematically incorporated into the model with alternative ways to calculate the likelihood function and (ii) allow sensitive selection of biologically meaningful biochemical reaction databases as Dirichlet-categorical prior distribution. Additionally, to ensure result interpretation by systems biologists, we display the annotation in a network where observed mass peaks are connected if their candidate metabolites are substrate/product of known biochemical reactions. This graph can be overlaid with other graph-based analyses, such as partial correlation networks, in a visualization scheme exported to Cytoscape, with web and stand-alone versions. Availability and implementation: ProbMetab was implemented in a modular manner to fit together with established upstream (xcms, CAMERA, AStream, mzMatch.R, etc.) and downstream R package tools (GeneNet, RCytoscape, DiffCorr, etc.). ProbMetab, along with extensive documentation and case studies, is freely available under GNU license at: http://labpib.fmrp.usp.br/methods/probmetab/. Contact: rvencio@usp.br Supplementary information: Supplementary data are available at Bioinformatics online. PMID:24443383
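
    As a rough illustration of the Bayesian idea in the abstract above (posterior proportional to likelihood times prior), the Python sketch below scores candidate metabolites for one observed mass peak. The Gaussian mass-error likelihood, the `sigma_ppm` default, the function name and the input layout are assumptions made for illustration; this is not ProbMetab's model or API (ProbMetab itself is an R package).

```python
import math


def candidate_posterior(observed_mass, candidate_masses, prior, sigma_ppm=5.0):
    """Posterior probability of each candidate for one observed mass peak.

    candidate_masses: dict of candidate name -> theoretical mass.
    prior: dict of candidate name -> prior weight (e.g. derived from a
    reaction database); weights need not be normalised.
    """
    weights = {}
    for name, mass in candidate_masses.items():
        # Mass error in parts per million, scored with a Gaussian likelihood.
        ppm_error = (observed_mass - mass) / mass * 1e6
        likelihood = math.exp(-0.5 * (ppm_error / sigma_ppm) ** 2)
        weights[name] = likelihood * prior[name]
    total = sum(weights.values())
    if total == 0.0:
        raise ValueError("no candidate has non-zero posterior weight")
    return {name: w / total for name, w in weights.items()}
```

    With a flat prior the candidate with the smallest mass error wins; a prior that strongly favours another candidate can reverse that ranking, which is the role the abstract assigns to the reaction-database prior.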

  1. Expression profiling and functional annotation of noncoding genes across 11 distinct organs in rat development

    PubMed Central

    Wen, Zhuo; Chen, Geng; Zhu, Sibo; Zhu, Jinhang; Li, Bin; Song, Yunjie; Li, Suqing; Shi, Leming; Zheng, Yuanting; Li, Menglong

    2016-01-01

    Accumulating evidence suggests that noncoding RNAs (ncRNAs) have important regulatory functions. However, the lack of functional annotations for ncRNAs has hampered subsequent functional and predictive research. Here we dissected the expression profiles of 3,458 rat noncoding genes using rat bodymap RNA-sequencing data consisting of 11 solid organs over four developmental stages (juvenile, adolescent, adult and aged) from both sexes, and conducted a comprehensive analysis of differentially expressed noncoding genes (DEnGs) between various conditions. We then constructed a co-expression network between protein-coding and noncoding genes to infer biological functions of noncoding genes. Modules of interest were linked to online databases including DAVID for functional annotation and pathway analysis. Our results indicated that noncoding genes are functionally enriched through pathways similar to those of protein-coding genes. Terms about development of the immune system were enriched with genes from age-related modules, whereas terms about sexual reproduction were enriched with genes in sex-related modules. We also built connection networks on some significant modules to visualize the interactions and regulatory relationships between protein-coding and noncoding genes. Our study could improve our understanding and facilitate a deeper investigation of organ/age/sex-related regulatory events of noncoding genes, which may lead to a superior preclinical model for drug development and translational medicine. PMID:27934932

  2. Structure and functional annotation of hypothetical proteins having putative Rubisco activase function from Vitis vinifera.

    PubMed

    Kumar, Suresh

    2015-01-01

    Rubisco is a very large and complex protein and one of the most abundant proteins in the world, comprising up to 50% of all soluble protein in plants. The activity of Rubisco, the enzyme that catalyzes CO2 assimilation in photosynthesis, is regulated by Rubisco activase (Rca). In the present study, we searched for hypothetical proteins of Vitis vinifera that have putative Rubisco activase function. The Arabidopsis and tobacco Rubisco activase protein sequences were used as seed sequences to search against Vitis vinifera in the UniProtKB database. The selected hypothetical proteins of Vitis vinifera were subjected to sequence, structural and functional annotation. Subcellular localization predictions suggested that they are cytoplasmic proteins. Homology modelling was used to define the three-dimensional (3D) structure of the selected hypothetical proteins. A template search revealed that all the hypothetical proteins share more than 80% sequence identity with the structure of green-type Rubisco activase from tobacco, indicating that the proteins are evolutionarily conserved. The homology models were generated using SWISS-MODEL. Several quality assessment and validation parameters indicated that the homology models are reliable. Further, functional annotation through Pfam, CATH, SUPERFAMILY and CDART suggested that the selected hypothetical proteins of Vitis vinifera contain an ATPase family associated with various cellular activities (AAA) domain and belong to the AAA+ superfamily of ring-shaped P-loop containing nucleoside triphosphate hydrolases. This study will support research on optimizing the functionality of Rubisco, which has large implications for improving plant productivity and resource use efficiency.

  3. Oral History: An Annotated Bibliography.

    ERIC Educational Resources Information Center

    Friedman, Paul G.

    Defining oral history as a method of inquiry by which the memories of individuals are elicited, preserved in interview transcripts or on tape recordings, and then used to enrich understanding of individuals' lives and the events in which they participated, this annotated bibliography provides a broad overview and a sampling of the resources…

  4. Music Analysis: An Annotated Bibliography.

    ERIC Educational Resources Information Center

    Fink, Michael

    One hundred and forty citations comprise this annotated bibliography of books, articles, and selected dissertations that encompass trends in music theory and K-16 music education since the late 19th century. Special emphasis is upon writings since the 1950's. During earlier development, music analysts concentrated upon the elements of music (i.e.,…

  5. Annotated Selected Puerto Rican Bibliography.

    ERIC Educational Resources Information Center

    Bravo, Enrique R., Comp.

    This work represents an effort on the part of The Urban Center to come one step closer to the realization of its goal to further the growth of ethnic studies. After extensive consultation with educationists from within and without the Puerto Rican community, it was decided that an annotated bilingual bibliography should be published to assist and…

  6. Radiocarbon Dating: An Annotated Bibliography.

    ERIC Educational Resources Information Center

    Fortine, Suellen

    This selective annotated bibliography covers various sources of information on the radiocarbon dating method, including journal articles, conference proceedings, and reports, reflecting the most important and useful sources of the last 25 years. The bibliography is divided into five parts--general background on radiocarbon, radiocarbon dating,…

  7. Hispanic Heritage. An Annotated Bibliography.

    ERIC Educational Resources Information Center

    Denver Univ., CO. School of Education.

    This annotated bibliography of a wide range of materials for the social studies teacher is concerned with the Hispano heritage. The sections are introduced by a brief description. The sections are: 1) general materials, 2) the land and the people, 3) the European background, 4) Spain's colonial system, 5) the Spanish borderlands, 6) the Anglo…

  8. Annotated Bibliography on Humanistic Education

    ERIC Educational Resources Information Center

    Ganung, Cynthia

    1975-01-01

    Part I of this annotated bibliography deals with books and articles on such topics as achievement motivation, process education, transactional analysis, discipline without punishment, role-playing, interpersonal skills, self-acceptance, moral education, self-awareness, values clarification, and non-verbal communication. Part II focuses on…

  9. Annotated Videography. Part 3. [Revised].

    ERIC Educational Resources Information Center

    United States Holocaust Memorial Museum, Washington, DC.

    This annotated videography has been designed to identify videotapes addressing Holocaust history that have been used effectively in classrooms and are available readily to most communities. The guide is divided into 15 topical categories, including: life before the Holocaust; perpetrators; propaganda; racism; antisemitism; mosaic of victims;…

  10. English Language Learners: Annotated Bibliography

    ERIC Educational Resources Information Center

    Hector-Mason, Anestine; Bardack, Sarah

    2010-01-01

    This annotated bibliography represents a first step toward compiling a comprehensive overview of current research on issues related to English language learners (ELLs). It is intended to be a resource for researchers, policymakers, administrators, and educators who are engaged in efforts to bridge the divide between research, policy, and practice…

  11. ANNOTATED BIBLIOGRAPHY OF GEOLOGICAL EDUCATION.

    ERIC Educational Resources Information Center

    BERG, J. ROBERT; AND OTHERS

    ARTICLES ABOUT GEOLOGICAL EDUCATION WRITTEN DURING THE PERIOD 1919-62 ARE INCLUDED IN THIS ANNOTATED BIBLIOGRAPHY. RECOMMENDATIONS OF INDIVIDUAL EDUCATORS AND PROFESSIONAL GROUPS FOR THE UNDERGRADUATE AND GRADUATE PREPARATION OF GEOLOGISTS ARE CONTAINED IN MOST OF THE ITEMS. THE ARTICLES WERE ORIGINALLY PUBLISHED IN PROFESSIONAL JOURNALS OR…

  12. Vietnamese Amerasians: An Annotated Bibliography.

    ERIC Educational Resources Information Center

    Johnson, Mark C.; And Others

    This annotated bibliography on Vietnamese Amerasians includes primary and secondary sources as well as reviews of three documentary films. Sources were selected in order to provide an overview of the historical and political context of Amerasian resettlement and a review of the scant available research on coping and adaptation with this…

  13. Child Development: An Annotated Bibliography.

    ERIC Educational Resources Information Center

    Dickerson, LaVerne Thornton, Comp.

    This annotated bibliography focuses on recent publications dealing with factors that influence child growth and development, rather than the developmental processes themselves. Topics include: general sources on child development; physical and perceptual-motor development; cognitive development; social and personality development; and play.…

  14. MSDAC Resource Library Annotated Bibliography.

    ERIC Educational Resources Information Center

    Watson, Cristel; And Others

    This annotated bibliography lists books, films, filmstrips, recordings, and booklets on sex equity. Entries are arranged according to the following topics: career resources, curriculum resources, management, sex equity, sex roles, women's studies, student activities, and sex-fair fiction. Included in each entry are name of author, editor or…

  15. Migrant Education: An Annotated Bibliography.

    ERIC Educational Resources Information Center

    Palmer, Barbara C., Comp.

    Materials selected for inclusion in the annotated bibliography of 139 publications from 1970 to 1980 give a general understanding of the lives of migrant children, their educational needs and problems, and various attempts made to meet those needs. The bibliography, a valuable tool for researchers and teachers in migrant education, includes books,…

  16. Nikos Kazantzakis: An Annotated Bibliography.

    ERIC Educational Resources Information Center

    Qiu, Kui

    This research paper consists of an annotated bibliography about Nikos Kazantzakis, one of the major modern Greek writers and author of "The Last Temptation of Christ," "Zorba the Greek," and many other works. Because of Kazantzakis' position in world literature there are many critical works about him; however, bibliographical…

  17. Meaningful Assessment: An Annotated Bibliography.

    ERIC Educational Resources Information Center

    Thrond, Mary A.

    The annotated bibliography contains citations of nine references on alternative student assessment methods in second language programs, particularly at the secondary school level. The references include a critique of conventional reading comprehension assessment, a discussion of performance assessment, a proposal for a multi-trait, multi-method…

  18. Workforce Reductions. An Annotated Bibliography.

    ERIC Educational Resources Information Center

    Hickok, Thomas A.

    This report, which is based on a review of practitioner-oriented sources and scholarly journals, uses a three-part framework to organize annotated bibliographies that, together, list a total of 104 sources that provide the following three perspectives on work force reduction issues: organizational, organizational-individual relationship, and…

  19. Appalachian Women. An Annotated Bibliography.

    ERIC Educational Resources Information Center

    Hamm, Mary Margo

    This bibliography compiles annotations of 178 books, journal articles, ERIC documents, and dissertations on Appalachian women and their social, cultural, and economic environment. Entries were published 1966-93 and are listed in the following categories: (1) authors and literary criticism; (2) bibliographies and resource guides; (3) economics,…

  20. Teacher Evaluation: An Annotated Bibliography.

    ERIC Educational Resources Information Center

    McKenna, Bernard H.; And Others

    In his introduction to the 86-item annotated bibliography by Mueller and Poliakoff, McKenna discusses his views on teacher evaluation and his impressions of the documents cited. He observes, in part, that the current concern is with the process of evaluation and that most researchers continue to believe that student achievement is the most…

  1. A new Holocene sea-level database for the US Gulf Coast: Improving constraints for past and future sea levels

    NASA Astrophysics Data System (ADS)

    Hijma, M.; Tornqvist, T. E.; Hu, P.; Gonzalez, J.; Hill, D. F.; Horton, B. P.; Engelhart, S. E.

    2011-12-01

    The interpretation of present-day sea-level change, as well as the prediction of future relative sea-level (RSL) rise and its spatial variability, depend increasingly on the ability of glacial isostatic adjustment (GIA) models to reveal non-eustatic components of RSL change. GIA results from the redistribution of mass due to the growth and decay of ice sheets. As a consequence, formerly ice-covered areas are still rebounding and currently experience RSL fall, while in other areas the rate of RSL rise is enhanced due to glacial forebulge collapse. The development of GIA models relies to a large extent on the availability of quality-controlled Holocene RSL data. There is thus an urgent need for systematically compiled and publicly available databases of geological RSL data that can be used not only for the purposes mentioned above, but can also serve to underpin coastal management and policy decisions. We have focused our efforts on developing a Holocene sea-level database for the Atlantic and Gulf coasts of the US. Many of the research problems that can be addressed with this sea-level database revolve around the identification of crustal motions due to glacial forebulge collapse that affects the entire region and likely extends beyond South Florida. For the east coast, GIA-related subsidence rates have been calculated with unprecedented precision: <0.8 mm a⁻¹ in Maine, increasing to rates of 1.7 mm a⁻¹ in Delaware, and a return to rates <0.9 mm a⁻¹ in the Carolinas. Here, we first define our methodology to reconstruct RSL, with particular reference to the quantification of age and elevation errors. Many sea-level indicators are related to a specific tide level (e.g., peat that formed between highest astronomical tide and mean high water level). We use paleotidal modeling to account for any changes during the Holocene. We furthermore highlight a number of errors associated with 14C dating that have rarely, if ever, been considered in previous studies of this nature

  2. Systems Theory and Communication. Annotated Bibliography.

    ERIC Educational Resources Information Center

    Covington, William G., Jr.

    This annotated bibliography presents annotations of 31 books and journal articles dealing with systems theory and its relation to organizational communication, marketing, information theory, and cybernetics. Materials were published between 1963 and 1992 and are listed alphabetically by author. (RS)

  3. Video annotations of Mexican nature in a collaborative environment

    NASA Astrophysics Data System (ADS)

    Oropesa Morales, Lester Arturo; Montoya Obeso, Abraham; Hernández García, Rosaura; Cocolán Almeda, Sara Ivonne; García Vázquez, Mireya Saraí; Benois-Pineau, Jenny; Zamudio Fuentes, Luis Miguel; Martinez Nuño, Jesús A.; Ramírez Acosta, Alejandro Alvaro

    2015-09-01

    Multimedia content production and storage in repositories are now an increasingly widespread practice. Indexing concepts for search in multimedia libraries is very useful for repository users. However, the search tools for content-based retrieval and automatic video tagging still lack consistency. Regardless of how these systems are implemented, it is vitally important to have many videos whose concepts are tagged with ground truth (training and testing sets). This paper describes a novel methodology for making complex annotations on video resources with the ELAN software. The concepts related to Mexican nature are annotated as High Level Features (HLF) from the development set of TRECVID 2014 in a collaborative environment. Based on this set, each nature concept observed is tagged on each video shot using concepts of the TRECVID 2014 dataset. We also propose new concepts, such as tropical settings, urban scenes, actions, events, weather and places, to name a few, as well as specific concepts that best describe video content of Mexican culture. We have taken care to tag the database with nature concepts and ground truth. It is evident that a collaborative environment is more suitable for annotating concepts related to ground truth and nature. As a result, a Mexican nature database was built. It also serves as the basis for testing and training sets used to automatically classify new multimedia content of Mexican nature.

  4. Biofuel Database

    National Institute of Standards and Technology Data Gateway

    Biofuel Database (Web, free access)   This database brings together structural, biological, and thermodynamic data for enzymes that are either in current use or are being considered for use in the production of biofuels.

  5. Use of Annotations for Component and Framework Interoperability

    NASA Astrophysics Data System (ADS)

    David, O.; Lloyd, W.; Carlson, J.; Leavesley, G. H.; Geter, F.

    2009-12-01

    The popular programming languages Java and C# provide annotations, a form of meta-data construct. Software frameworks for web integration, web services, database access, and unit testing now take advantage of annotations to reduce the complexity of APIs and the quantity of integration code between the application and framework infrastructure. Adopting annotation features in frameworks has been observed to lead to cleaner and leaner application code. The USDA Object Modeling System (OMS) version 3.0 fully embraces the annotation approach and additionally defines a meta-data standard for components and models. In version 3.0, framework/model integration previously accomplished using API calls is now achieved using descriptive annotations. This enables the framework to provide additional functionality non-invasively, such as implicit multithreading and auto-documenting capabilities, while achieving a significant reduction in the size of the model source code. Using a non-invasive methodology leads to models and modeling components with only minimal dependencies on the modeling framework. Since models and modeling components are not directly bound to the framework by the use of specific APIs and/or data types, they can more easily be reused both within the framework and outside of it. To compare the effectiveness of an annotation-based framework approach with other modeling frameworks, a framework-invasiveness study was conducted to evaluate the effects of framework design on model code quality. A monthly water balance model was implemented across several modeling frameworks and several software metrics were collected. The metrics selected were measures of non-invasive design methods for modeling frameworks from a software engineering perspective. It appears that the use of annotations positively impacts several software quality measures. In a next step, the PRMS model was implemented in OMS 3.0 and is currently being implemented for water supply forecasting in the
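
    The abstract concerns Java and C# annotations; as a loose analogue in Python, the sketch below uses a decorator to attach declarative metadata that a framework can discover by introspection, so the component never calls a framework API. All names here (`role`, `find_methods`, `WaterBalance`, `_component_meta`) are hypothetical and are not part of OMS.

```python
def role(kind, **meta):
    """Attach declarative metadata to a function without changing its behaviour."""
    def mark(func):
        func._component_meta = {"role": kind, **meta}
        return func
    return mark


class WaterBalance:
    """A toy component: it imports nothing from any framework."""

    @role("execute", description="Monthly water balance step")
    def run(self, precip, pet):
        # Surplus water is precipitation minus potential evapotranspiration.
        return max(precip - pet, 0.0)


def find_methods(component, kind):
    """Framework side: discover methods tagged with the given role."""
    found = []
    for name in dir(component):
        meta = getattr(getattr(component, name), "_component_meta", None)
        if meta and meta["role"] == kind:
            found.append(name)
    return found
```

    Because the component only carries metadata, it can be run directly or handed to any framework that understands the tags, which is the reuse argument the abstract makes for the non-invasive approach.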

  6. BioC viewer: a web-based tool for displaying and merging annotations in BioC.

    PubMed

    Shin, Soo-Yong; Kim, Sun; Wilbur, W John; Kwon, Dongseop

    2016-01-01

    BioC is an XML-based format designed to provide interoperability for text mining tools and manual curation results. A challenge of BioC as a standard format is to align annotations from multiple systems. Ideally, this should not be a major problem if users follow guidelines given by BioC key files. Nevertheless, the misalignment between text and annotations happens quite often because different systems tend to use different software development environments, e.g. ASCII vs. Unicode. We first implemented the BioC Viewer to assist BioGRID curators as a part of the BioCreative V BioC track (Collaborative Biocurator Assistant Task). For the BioC track, the BioC Viewer helped curate protein-protein interaction and genetic interaction pairs appearing in full-text articles. Here, we describe the BioC Viewer itself as well as improvements made to the BioC Viewer since the BioCreative V Workshop to address the misalignment issue of BioC annotations. While uploading BioC files, a BioC merge process is offered when there are files from the same full-text article. If there is a mismatch between an annotated offset and text, the BioC Viewer adjusts the offset to correctly align with the text. The BioC Viewer has a user-friendly interface, where most operations can be performed within a few mouse clicks. The feedback from BioGRID curators has been positive for the web interface, particularly for its usability and learnability. Database URL: http://viewer.bioqrator.org.
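
    The offset adjustment mentioned above can be sketched in a few lines of Python. This assumes one plausible strategy, moving a mismatched annotation to the nearest exact occurrence of its text; the abstract does not say how the BioC Viewer actually realigns offsets, and the function name is hypothetical.

```python
def realign_offset(text, offset, surface):
    """Return an offset at which `surface` really occurs in `text`.

    If the annotated offset already matches, it is returned unchanged.
    Otherwise the occurrence closest to the annotated offset is chosen.
    Returns None when `surface` does not occur in `text` at all.
    """
    if text[offset:offset + len(surface)] == surface:
        return offset
    best = None
    pos = text.find(surface)
    while pos != -1:
        if best is None or abs(pos - offset) < abs(best - offset):
            best = pos
        pos = text.find(surface, pos + 1)
    return best
```

    A typical cause of such mismatches is one tool counting bytes and another counting characters: every multi-byte character before the annotation shifts a byte-based offset by one or more positions.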

  7. BioC viewer: a web-based tool for displaying and merging annotations in BioC

    PubMed Central

    Shin, Soo-Yong; Kim, Sun; Wilbur, W. John; Kwon, Dongseop

    2016-01-01

    BioC is an XML-based format designed to provide interoperability for text mining tools and manual curation results. A challenge of BioC as a standard format is to align annotations from multiple systems. Ideally, this should not be a major problem if users follow guidelines given by BioC key files. Nevertheless, the misalignment between text and annotations happens quite often because different systems tend to use different software development environments, e.g. ASCII vs. Unicode. We first implemented the BioC Viewer to assist BioGRID curators as a part of the BioCreative V BioC track (Collaborative Biocurator Assistant Task). For the BioC track, the BioC Viewer helped curate protein-protein interaction and genetic interaction pairs appearing in full-text articles. Here, we describe the BioC Viewer itself as well as improvements made to the BioC Viewer since the BioCreative V Workshop to address the misalignment issue of BioC annotations. While uploading BioC files, a BioC merge process is offered when there are files from the same full-text article. If there is a mismatch between an annotated offset and text, the BioC Viewer adjusts the offset to correctly align with the text. The BioC Viewer has a user-friendly interface, where most operations can be performed within a few mouse clicks. The feedback from BioGRID curators has been positive for the web interface, particularly for its usability and learnability. Database URL: http://viewer.bioqrator.org PMID:27515823

  8. Param-Medic: A Tool for Improving MS/MS Database Search Yield by Optimizing Parameter Settings.

    PubMed

    May, Damon H; Tamura, Kaipo; Noble, William S

    2017-03-13

    In shotgun proteomics analysis, user-specified parameters are critical to database search performance and therefore to the yield of confident peptide-spectrum matches (PSMs). Two of the most important parameters are related to the accuracy of the mass spectrometer. Precursor mass tolerance defines the peptide candidates considered for each spectrum. Fragment mass tolerance or bin size determines how close observed and theoretical fragments must be to be considered a match. For either of these two parameters, too wide a setting yields randomly high-scoring false PSMs, whereas too narrow a setting erroneously excludes true PSMs, in both cases, lowering the yield of peptides detected at a given false discovery rate. We describe a strategy for inferring optimal search parameters by assembling and analyzing pairs of spectra that are likely to have been generated by the same peptide ion to infer precursor and fragment mass error. This strategy does not rely on a database search, making it usable in a wide variety of settings. In our experiments on data from a variety of instruments including Orbitrap and Q-TOF acquisitions, this strategy yields more high-confidence PSMs than using settings based on instrument defaults or determined by experts. Param-Medic is open-source and cross-platform. It is available as a standalone tool ( http://noble.gs.washington.edu/proj/param-medic/ ) and has been integrated into the Crux proteomics toolkit ( http://crux.ms ), providing automatic parameter selection for the Comet and Tide search engines.
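
    The role of a mass tolerance can be sketched in a few lines of Python. This is only an illustration of the matching rule the abstract describes, not Param-Medic's inference procedure. Expressing the tolerance in parts per million (ppm) is an assumption, and the function names `ppm_error` and `within_tolerance` are hypothetical.

```python
def ppm_error(observed_mz, theoretical_mz):
    """Signed mass error in parts per million (assumed unit)."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6


def within_tolerance(observed_mz, theoretical_mz, tol_ppm):
    """True if the observed value lies within +/- tol_ppm of the theoretical one."""
    return abs(ppm_error(observed_mz, theoretical_mz)) <= tol_ppm


# An observed m/z of 500.005 against a theoretical 500.0 is off by about 10 ppm.
error = ppm_error(500.005, 500.0)
print(round(error, 3))
print(within_tolerance(500.005, 500.0, tol_ppm=20))  # wide setting accepts it
print(within_tolerance(500.005, 500.0, tol_ppm=5))   # narrow setting rejects it
```

    The same candidate is accepted under the wide setting and rejected under the narrow one, which is the trade-off the abstract describes between admitting false matches and excluding true ones.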

  9. Database Administrator

    ERIC Educational Resources Information Center

    Moore, Pam

    2010-01-01

    The Internet and electronic commerce (e-commerce) generate lots of data. Data must be stored, organized, and managed. Database administrators, or DBAs, work with database software to find ways to do this. They identify user needs, set up computer databases, and test systems. They ensure that systems perform as they should and add people to the…

  10. Annotation and Classification of Argumentative Writing Revisions

    ERIC Educational Resources Information Center

    Zhang, Fan; Litman, Diane

    2015-01-01

    This paper explores the annotation and classification of students' revision behaviors in argumentative writing. A sentence-level revision schema is proposed to capture why and how students make revisions. Based on the proposed schema, a small corpus of student essays and revisions was annotated. Studies show that manual annotation is reliable with…

  11. [Analysis of the Cochrane Review: Interventions for Improving Upper Limb Function after Stroke. Cochrane Database Syst Rev. 2014,11:CD010820].

    PubMed

    Sousa Nanji, Liliana; Torres Cardoso, André; Costa, João; Vaz-Carneiro, António

    2015-01-01

    Impairment of the upper limbs is quite frequent after stroke, making rehabilitation an essential step towards clinical recovery and patient empowerment. This review aimed to synthesize existing evidence regarding interventions for upper limb function improvement after stroke and to assess which would bring some benefit. The Cochrane Database of Systematic Reviews, the Database of Reviews of Effects and PROSPERO databases were searched up to June 2013, and 40 reviews were included, covering 503 studies, 18 078 participants and 18 interventions, as well as different doses and settings of interventions. The main results were: 1- Information currently available is insufficient to assess effectiveness of each intervention and to enable comparison of interventions; 2- Transcranial direct current stimulation brings no benefit for outcomes of activities of daily living; 3- Moderate-quality evidence showed a beneficial effect of constraint-induced movement therapy, mental practice, mirror therapy, interventions for sensory impairment, virtual reality and repetitive task practice; 4- Unilateral arm training may be more effective than bilateral arm training; 5- Moderate-quality evidence showed a beneficial effect of robotics on measures of impairment and ADLs; 6- There is no evidence of benefit or harm for techniques such as repetitive transcranial magnetic stimulation, music therapy, pharmacological interventions, electrical stimulation and other therapies. Currently available evidence is insufficient and of low quality, not supporting clear clinical decisions. High-quality studies are still needed.

  12. SABER: The Searchable Annotated Bibliography of Education Research in Astronomy

    NASA Astrophysics Data System (ADS)

    Bruning, David; Bailey, Janelle M.; Brissenden, Gina

    Starting a new research project can be a challenge, but especially so in education research because the literature is scattered throughout many journals. Relevant astronomy education research may be in psychology journals, science education journals, physics education journals, or even in science journals. Tracking the vast realm of literature is difficult, especially because libraries frequently do not subscribe to many of the relevant journals and abstracting services. The Searchable Annotated Bibliography of Education Research (SABER) is an online resource that was started to service the needs of the astronomy education community, specifically to reduce this "scatter" by compiling an annotated bibliography of education research articles in one electronic location. Although SABER started in 2001, the database has a new URL (http://astronomy.uwp.edu/saber/) and has recently undergone a major update.

  13. Effects of Teaching Strategies in Annotated Bibliography Writing

    ERIC Educational Resources Information Center

    Tan-de Ramos, Jennifer

    2015-01-01

    The study examines the effect of teaching strategies on improving the writing of students at the tertiary level. Specifically, three teaching approaches--the use of modelling, grammar-based, and information element-focused--were tested on their effect on the writing of annotated bibliography in three research classes at a university in Manila.…

  14. Transcript Annotation in FANTOM3: Mouse Gene Catalog Based on Physical cDNAs

    PubMed Central

    Maeda, Norihiro; Kasukawa, Takeya; Oyama, Rieko; Gough, Julian; Frith, Martin; Engström, Pär G; Lenhard, Boris; Aturaliya, Rajith N; Batalov, Serge; Beisel, Kirk W; Bult, Carol J; Fletcher, Colin F; Forrest, Alistair R. R; Furuno, Masaaki; Hill, David; Itoh, Masayoshi; Kanamori-Katayama, Mutsumi; Katayama, Shintaro; Katoh, Masaru; Kawashima, Tsugumi; Quackenbush, John; Ravasi, Timothy; Ring, Brian Z; Shibata, Kazuhiro; Sugiura, Koji; Takenaka, Yoichi; Teasdale, Rohan D; Wells, Christine A; Zhu, Yunxia; Kai, Chikatoshi; Kawai, Jun; Hume, David A; Carninci, Piero; Hayashizaki, Yoshihide

    2006-01-01

    The international FANTOM consortium aims to produce a comprehensive picture of the mammalian transcriptome, based upon an extensive cDNA collection and functional annotation of full-length enriched cDNAs. The previous dataset, FANTOM2, comprised 60,770 full-length enriched cDNAs. Functional annotation revealed that this cDNA dataset contained only about half of the estimated number of mouse protein-coding genes, indicating that a number of cDNAs still remained to be collected and identified. To pursue the complete gene catalog that covers all predicted mouse genes, cloning and sequencing of full-length enriched cDNAs has been continued since FANTOM2. In FANTOM3, 42,031 newly isolated cDNAs were subjected to functional annotation, and the annotation of 4,347 FANTOM2 cDNAs was updated. To accomplish accurate functional annotation, we improved our automated annotation pipeline by introducing new coding sequence prediction programs and developed a Web-based annotation interface for simplifying the annotation procedures to reduce manual annotation errors. Automated coding sequence and function prediction was followed with manual curation and review by expert curators. A total of 102,801 full-length enriched mouse cDNAs were annotated. Out of 102,801 transcripts, 56,722 were functionally annotated as protein coding (including partial or truncated transcripts), providing to our knowledge the greatest current coverage of the mouse proteome by full-length cDNAs. The total number of distinct non-protein-coding transcripts increased to 34,030. The FANTOM3 annotation system, consisting of automated computational prediction, manual curation, and final expert curation, facilitated the comprehensive characterization of the mouse transcriptome, and could be applied to the transcriptomes of other species. PMID:16683036

  15. ParsEval: parallel comparison and analysis of gene structure annotations

    PubMed Central

    2012-01-01

    Background: Accurate gene structure annotation is a fundamental but somewhat elusive goal of genome projects, as witnessed by the fact that (model) genomes typically undergo several cycles of re-annotation. In many cases, it is not only different versions of annotations that need to be compared but also different sources of annotation of the same genome, derived from distinct gene prediction workflows. Such comparisons are of interest to annotation providers, prediction software developers, and end-users, who all need to assess what is common and what is different among distinct annotation sources. We developed ParsEval, a software application for pairwise comparison of sets of gene structure annotations. ParsEval calculates several statistics that highlight the similarities and differences between the two sets of annotations provided. These statistics are presented in an aggregate summary report, with additional details provided as individual reports specific to non-overlapping, gene-model-centric genomic loci. Genome browser styled graphics embedded in these reports help visualize the genomic context of the annotations. Output from ParsEval is both easily read and parsed, enabling systematic identification of problematic gene models for subsequent focused analysis. Results: ParsEval is capable of analyzing annotations for large eukaryotic genomes on typical desktop or laptop hardware. In comparison to existing methods, ParsEval exhibits a considerable performance improvement, both in terms of runtime and memory consumption. Reports from ParsEval can provide relevant biological insights into the gene structure annotations being compared. Conclusions: Implemented in C, ParsEval provides the quickest and most feature-rich solution for genome annotation comparison to date. The source code is freely available (under an ISC license) at http://parseval.sourceforge.net/. PMID:22852583

  16. IMG ER: A System for Microbial Genome Annotation Expert Review and Curation

    SciTech Connect

    Markowitz, Victor M.; Mavromatis, Konstantinos; Ivanova, Natalia N.; Chen, I-Min A.; Chu, Ken; Kyrpides, Nikos C.

    2009-05-25

    A rapidly increasing number of microbial genomes are sequenced by organizations worldwide and are eventually included into various public genome data resources. The quality of the annotations depends largely on the original dataset providers, with erroneous or incomplete annotations often carried over into the public resources and difficult to correct. We have developed an Expert Review (ER) version of the Integrated Microbial Genomes (IMG) system, with the goal of supporting systematic and efficient revision of microbial genome annotations. IMG ER provides tools for the review and curation of annotations of both new and publicly available microbial genomes within IMG's rich integrated genome framework. New genome datasets are included into IMG ER prior to their public release either with their native annotations or with annotations generated by IMG ER's annotation pipeline. IMG ER tools allow addressing annotation problems detected with IMG's comparative analysis tools, such as genes missed by gene prediction pipelines or genes without an associated function. Over the past year, IMG ER was used for improving the annotations of about 150 microbial genomes.

  17. Incorporating Functional Annotations for Fine-Mapping Causal Variants in a Bayesian Framework Using Summary Statistics.

    PubMed

    Chen, Wenan; McDonnell, Shannon K; Thibodeau, Stephen N; Tillmans, Lori S; Schaid, Daniel J

    2016-11-01

    Functional annotations have been shown to improve both the discovery power and fine-mapping accuracy in genome-wide association studies. However, the optimal strategy to incorporate the large number of existing annotations is still not clear. In this study, we propose a Bayesian framework to incorporate functional annotations in a systematic manner. We compute the maximum a posteriori solution and use cross validation to find the optimal penalty parameters. By extending our previous fine-mapping method CAVIARBF into this framework, we require only summary statistics as input. We also derived an exact calculation of Bayes factors using summary statistics for quantitative traits, which is necessary when a large proportion of trait variance is explained by the variants of interest, such as in fine mapping expression quantitative trait loci (eQTL). We compared the proposed method with PAINTOR using different strategies to combine annotations. Simulation results show that the proposed method achieves the best accuracy in identifying causal variants among the different strategies and methods compared. We also find that for annotations with moderate effects from a large annotation pool, screening annotations individually and then combining the top annotations can produce overly optimistic results. We applied these methods on two real data sets: a meta-analysis result of lipid traits and a cis-eQTL study of normal prostate tissues. For the eQTL data, incorporating annotations significantly increased the number of potential causal variants with high probabilities.

  18. VHLdb: A database of von Hippel-Lindau protein interactors and mutations

    PubMed Central

    Tabaro, Francesco; Minervini, Giovanni; Sundus, Faiza; Quaglia, Federica; Leonardi, Emanuela; Piovesan, Damiano; Tosatto, Silvio C. E.

    2016-01-01

    Mutations in von Hippel-Lindau tumor suppressor protein (pVHL) predispose to develop tumors affecting specific target organs, such as the retina, epididymis, adrenal glands, pancreas and kidneys. Currently, more than 400 pVHL interacting proteins are either described in the literature or predicted in public databases. This data is scattered among several different sources, slowing down the comprehension of pVHL’s biological role. Here we present VHLdb, a novel database collecting available interaction and mutation data on pVHL to provide novel integrated annotations. In VHLdb, pVHL interactors are organized according to two annotation levels, manual and automatic. Mutation data are easily accessible and a novel visualization tool has been implemented. A user-friendly feedback function to improve database content through community-driven curation is also provided. VHLdb presently contains 478 interactors, of which 117 have been manually curated, and 1,074 mutations. This makes it the largest available database for pVHL-related information. VHLdb is available from URL: http://vhldb.bio.unipd.it/. PMID:27511743

  19. The Rat Genome Database, update 2007--easing the path from disease to data and back again.

    PubMed

    Twigger, Simon N; Shimoyama, Mary; Bromberg, Susan; Kwitek, Anne E; Jacob, Howard J

    2007-01-01

    The Rat Genome Database (RGD, http://rgd.mcw.edu) is one of the core resources for rat genomics and recent developments have focused on providing support for disease-based research using the rat model. Recognizing the importance of the rat as a disease model we have employed targeted curation strategies to curate genes, QTL and strain data for neurological and cardiovascular disease areas. This work has centered on rat but also includes data for mouse and human to create 'disease portals' that provide a unified view of the genes, QTL and strain models for these diseases across the three species. The disease curation efforts combined with normal curation activities have served to greatly increase the content of the database, particularly for biological information, including gene ontology, disease, pathway and phenotype ontology annotations. In addition to improving the features and database content, community outreach has been expanded to demonstrate how investigators can leverage the resources at RGD to facilitate their research and to elicit suggestions and needs for future developments. We have published a number of papers that provide additional information on the ontology annotations and the tools at RGD for data mining and analysis to better enable researchers to fully utilize the database.

  20. The Biofuel Feedstock Genomics Resource: a web-based portal and database to enable functional genomics of plant biofuel feedstock species.

    PubMed

    Childs, Kevin L; Konganti, Kranti; Buell, C Robin

    2012-01-01

    Major feedstock sources for future biofuel production are likely to be high biomass producing plant species such as poplar, pine, switchgrass, sorghum and maize. One active area of research in these species is genome-enabled improvement of lignocellulosic biofuel feedstock quality and yield. To facilitate genomic-based investigations in these species, we developed the Biofuel Feedstock Genomic Resource (BFGR), a database and web-portal that provides high-quality, uniform and integrated functional annotation of gene and transcript assembly sequences from species of interest to lignocellulosic biofuel feedstock researchers. The BFGR includes sequence data from 54 species and permits researchers to view, analyze and obtain annotation at the gene, transcript, protein and genome level. Annotation of biochemical pathways permits the identification of key genes and transcripts central to the improvement of lignocellulosic properties in these species. The integrated nature of the BFGR in terms of annotation methods, orthologous/paralogous relationships and linkage to seven species with complete genome sequences allows comparative analyses for biofuel feedstock species with limited sequence resources. Database URL: http://bfgr.plantbiology.msu.edu.

  1. Cancer Stem Cells Therapeutic Target Database: The First Comprehensive Database for Therapeutic Targets of Cancer Stem Cells.

    PubMed

    Hu, Xiaoqing; Cong, Ye; Luo, Huizhe Howard; Wu, Sijin; Zhao, Liyuan Eric; Liu, Quentin; Yang, Yongliang

    2016-09-02

    Summary: Cancer stem cells (CSCs) are a subpopulation of tumor cells that have strong self-renewal capabilities and may contribute to the failure of conventional cancer therapies. Hence, therapeutics homing in on CSCs represent a novel and promising approach that may eradicate malignant tumors. However, the lack of information on validated targets of CSCs has greatly hindered the development of CSC-directed therapeutics. Herein, we describe the Cancer Stem Cells Therapeutic Target Database (CSCTT), the first online database to provide a rich bioinformatics resource for the display, search, and analysis of structure, function, and related annotation for therapeutic targets of cancer stem cells. CSCTT contains 135 proteins that are potential targets of CSCs, with validated experimental evidence manually curated from the existing literature. Proteins are carefully annotated with a detailed description of protein families, biological process, related diseases, and experimental evidence. In addition, CSCTT has compiled 213 documented therapeutic methods for cancer stem cells, including 118 small molecules and 20 biotherapy methods. The CSCTT may serve as a useful platform for the development of CSC-directed therapeutics against various malignant tumors. The CSCTT database is freely available to the public at http://www.csctt.org/. Significance: Although the definition and role of cancer stem cells (CSCs, also called tumor-initiating cells) remain a topic of much debate, increasing evidence suggests that CSCs may be the driving force behind chemotherapy/radiotherapy resistance, as well as metastasis. Consequently, the elimination or differentiation of CSCs is critical for treating malignant tumors and improving clinical outcomes. Unfortunately, the progress of research into the development of anti-CSC therapeutics has been rather slow, and no anti-CSC drugs are yet in clinical use. Hence, there is an urgent need to develop a database that compiles useful information for

  2. Automatic annotation of organellar genomes with DOGMA

    SciTech Connect

    Wyman, Stacia; Jansen, Robert K.; Boore, Jeffrey L.

    2004-06-01

    Dual Organellar GenoMe Annotator (DOGMA) automates the annotation of extra-nuclear organellar (chloroplast and animal mitochondrial) genomes. It is a web-based package that allows the use of comparative BLAST searches to identify and annotate genes in a genome. DOGMA presents a list of putative genes to the user in a graphical format for viewing and editing. Annotations are stored on our password-protected server. Complete annotations can be extracted for direct submission to GenBank. Furthermore, intergenic regions of specified length can be extracted, as well as the nucleotide sequences and amino acid sequences of the genes.

  3. snpGeneSets: An R Package for Genome-Wide Study Annotation

    PubMed Central

    Mei, Hao; Li, Lianna; Jiang, Fan; Simino, Jeannette; Griswold, Michael; Mosley, Thomas; Liu, Shijian

    2016-01-01

    Genome-wide studies (GWS) of SNP associations and differential gene expressions have generated abundant results; next-generation sequencing technology has further boosted the number of variants and genes identified. Effective interpretation requires massive annotation and downstream analysis of these genome-wide results, a computationally challenging task. We developed the snpGeneSets package to simplify annotation and analysis of GWS results. Our package integrates local copies of knowledge bases for SNPs, genes, and gene sets, and implements wrapper functions in the R language to enable transparent access to low-level databases for efficient annotation of large genomic data. The package contains functions that execute three types of annotations: (1) genomic mapping annotation for SNPs and genes and functional annotation for gene sets; (2) bidirectional mapping between SNPs and genes, and genes and gene sets; and (3) calculation of gene effect measures from SNP associations and performance of gene set enrichment analyses to identify functional pathways. We applied snpGeneSets to type 2 diabetes (T2D) results from the NHGRI genome-wide association study (GWAS) catalog, a Finnish GWAS, and a genome-wide expression study (GWES). These studies demonstrate the usefulness of snpGeneSets for annotating and performing enrichment analysis of GWS results. The package is open-source, free, and can be downloaded at: https://www.umc.edu/biostats_software/. PMID:27807048

  4. snpGeneSets: An R Package for Genome-Wide Study Annotation.

    PubMed

    Mei, Hao; Li, Lianna; Jiang, Fan; Simino, Jeannette; Griswold, Michael; Mosley, Thomas; Liu, Shijian

    2016-12-07

    Genome-wide studies (GWS) of SNP associations and differential gene expressions have generated abundant results; next-generation sequencing technology has further boosted the number of variants and genes identified. Effective interpretation requires massive annotation and downstream analysis of these genome-wide results, a computationally challenging task. We developed the snpGeneSets package to simplify annotation and analysis of GWS results. Our package integrates local copies of knowledge bases for SNPs, genes, and gene sets, and implements wrapper functions in the R language to enable transparent access to low-level databases for efficient annotation of large genomic data. The package contains functions that execute three types of annotations: (1) genomic mapping annotation for SNPs and genes and functional annotation for gene sets; (2) bidirectional mapping between SNPs and genes, and genes and gene sets; and (3) calculation of gene effect measures from SNP associations and performance of gene set enrichment analyses to identify functional pathways. We applied snpGeneSets to type 2 diabetes (T2D) results from the NHGRI genome-wide association study (GWAS) catalog, a Finnish GWAS, and a genome-wide expression study (GWES). These studies demonstrate the usefulness of snpGeneSets for annotating and performing enrichment analysis of GWS results. The package is open-source, free, and can be downloaded at: https://www.umc.edu/biostats_software/.

  5. VCF-Miner: GUI-based application for mining variants and annotations stored in VCF files.

    PubMed

    Hart, Steven N; Duffy, Patrick; Quest, Daniel J; Hossain, Asif; Meiners, Mike A; Kocher, Jean-Pierre

    2016-03-01

    Next-generation sequencing platforms are widely used to discover variants associated with disease. The processing of sequencing data involves read alignment, variant calling, variant annotation and variant filtering. The standard file format to hold variant calls is the variant call format (VCF) file. According to the format specifications, any arbitrary annotation can be added to the VCF file for downstream processing. However, most downstream analysis programs disregard annotations already present in the VCF and re-annotate variants using the annotation provided by that particular program. This precludes investigators who have collected information on variants from literature or other sources from including these annotations in the filtering and mining of variants. We have developed VCF-Miner, a graphical user interface-based stand-alone tool, to mine variants and annotation stored in the VCF. Powered by a MongoDB database engine, VCF-Miner enables the stepwise trimming of non-relevant variants. The grouping feature implemented in VCF-Miner can be used to identify somatic variants by contrasting variants in tumor and in normal samples or to identify recessive/dominant variants in family studies. It is not limited to human data, but can also be extended to include non-diploid organisms. It also supports copy number or any other variant type supported by the VCF specification. VCF-Miner can be used on a personal computer or large institutional servers and is freely available for download from http://bioinformaticstools.mayo.edu/research/vcf-miner/.
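
    The point about annotations already stored in a VCF file can be illustrated with a short Python sketch that reads the INFO column of one record and filters on it. This is not VCF-Miner code (the tool is a MongoDB-backed GUI application); it only assumes the standard tab-separated VCF layout, in which INFO is the eighth column and holds semicolon-separated `key=value` pairs or bare flags. The function names and the `MY_NOTE` key are hypothetical.

```python
def parse_info(info_field):
    """Turn a VCF INFO string into a dict; bare flags map to True."""
    info = {}
    if info_field == ".":  # "." marks an empty INFO column
        return info
    for item in info_field.split(";"):
        key, sep, value = item.partition("=")
        info[key] = value if sep else True
    return info


def keep_variant(vcf_line, key, predicate):
    """Keep a record only if INFO contains `key` and `predicate` accepts its value."""
    columns = vcf_line.rstrip("\n").split("\t")
    info = parse_info(columns[7])  # INFO is the eighth column
    return key in info and predicate(info[key])


# One illustrative record; MY_NOTE stands in for a user-supplied annotation.
line = "1\t12345\trs1\tA\tG\t50\tPASS\tDP=100;AF=0.01;MY_NOTE=literature_hit;DB"
print(parse_info(line.split("\t")[7]))
print(keep_variant(line, "AF", lambda v: float(v) < 0.05))
print(keep_variant(line, "MY_NOTE", lambda v: v == "literature_hit"))
```

    Filtering on whatever keys the file already carries, instead of re-annotating, is what lets a custom annotation such as the hypothetical `MY_NOTE` take part in variant selection.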

  6. The standard operating procedure of the DOE-JGI Metagenome Annotation Pipeline (MAP v.4).

    PubMed

    Huntemann, Marcel; Ivanova, Natalia N; Mavromatis, Konstantinos; Tripp, H James; Paez-Espino, David; Tennessen, Kristin; Palaniappan, Krishnaveni; Szeto, Ernest; Pillay, Manoj; Chen, I-Min A; Pati, Amrita; Nielsen, Torben; Markowitz, Victor M; Kyrpides, Nikos C

    2016-01-01

    The DOE-JGI Metagenome Annotation Pipeline (MAP v.4) performs structural and functional annotation for metagenomic sequences that are submitted to the Integrated Microbial Genomes with Microbiomes (IMG/M) system for comparative analysis. The pipeline runs on nucleotide sequences provided via the IMG submission site. Users must first define their analysis projects in GOLD and then submit the associated sequence datasets consisting of scaffolds/contigs with optional coverage information and/or unassembled reads in fasta and fastq file formats. The MAP processing consists of feature prediction including identification of protein-coding genes, non-coding RNAs and regulatory RNAs, as well as CRISPR elements. Structural annotation is followed by functional annotation including assignment of protein product names and connection to various protein family databases.

  7. Oncotator: cancer variant annotation tool.

    PubMed

    Ramos, Alex H; Lichtenstein, Lee; Gupta, Manaswi; Lawrence, Michael S; Pugh, Trevor J; Saksena, Gordon; Meyerson, Matthew; Getz, Gad

    2015-04-01

    Oncotator is a tool for annotating genomic point mutations and short nucleotide insertions/deletions (indels) with variant- and gene-centric information relevant to cancer researchers. This information is drawn from 14 different publicly available resources that have been pooled and indexed, and we provide an extensible framework to add additional data sources. Annotations linked to variants range from basic information, such as gene names and functional classification (e.g. missense), to cancer-specific data from resources such as the Catalogue of Somatic Mutations in Cancer (COSMIC), the Cancer Gene Census, and The Cancer Genome Atlas (TCGA). For local use, Oncotator is freely available as a python module hosted on Github (https://github.com/broadinstitute/oncotator). Furthermore, Oncotator is also available as a web service and web application at http://www.broadinstitute.org/oncotator/.

  8. PvTFDB: a Phaseolus vulgaris transcription factors database for expediting functional genomics in legumes

    PubMed Central

    Bhawna; Bonthala, V.S.; Gajula, MNV Prasad

    2016-01-01

    The common bean [Phaseolus vulgaris (L.)] is one of the essential proteinaceous vegetables grown in developing countries. However, its production is challenged by low yields caused by numerous biotic and abiotic stress conditions. Regulatory transcription factors (TFs) are a key component of the genome and are the most significant targets for producing stress-tolerant crops; hence, functional genomic studies of these TFs are important. Therefore, here we have constructed a web-accessible TF database for P. vulgaris, called PvTFDB, which contains 2370 putative TF gene models in 49 TF families. This database provides comprehensive information for each identified TF, including sequence data, functional annotation, SSRs with their primer sets, protein physical properties, chromosomal location, phylogeny, tissue-specific gene expression data, orthologues, cis-regulatory elements and gene ontology (GO) assignment. Altogether, this information could be used to expedite functional genomic studies of specific TFs of interest. The objectives of this database are to support functional genomics studies of common bean TFs and to help recognize the regulatory mechanisms underlying various stress responses, easing breeding strategies for variety production, through search interfaces (by gene ID and functional annotation) and browsing interfaces (by family and by chromosome). This database will also serve as a promising central repository for researchers as well as breeders who are working towards the improvement of legume crops. In addition, this database provides unrestricted public access, and users can freely download all data present in the database. Database URL: http://www.multiomics.in/PvTFDB/ PMID:27465131

  9. PvTFDB: a Phaseolus vulgaris transcription factors database for expediting functional genomics in legumes.

    PubMed

    Bhawna; Bonthala, V S; Gajula, Mnv Prasad

    2016-01-01

    The common bean [Phaseolus vulgaris (L.)] is one of the essential proteinaceous vegetables grown in developing countries. However, its production is challenged by low yields caused by numerous biotic and abiotic stress conditions. Regulatory transcription factors (TFs) represent a key component of the genome and are the most significant targets for producing stress-tolerant crops; hence, functional genomic studies of these TFs are important. Therefore, here we have constructed a web-accessible TF database for P. vulgaris, called PvTFDB, which contains 2370 putative TF gene models in 49 TF families. This database provides comprehensive information for each identified TF, including sequence data, functional annotation, SSRs with their primer sets, protein physical properties, chromosomal location, phylogeny, tissue-specific gene expression data, orthologues, cis-regulatory elements and gene ontology (GO) assignment. Altogether, this information can be used to expedite functional genomic studies of specific TFs of interest. The objectives of this database are to support functional genomics studies of common bean TFs and to help identify the regulatory mechanisms underlying various stress responses, easing breeding strategies for variety production. It offers search interfaces (by gene ID and functional annotation) and browsing interfaces (by family and by chromosome). This database will also serve as a promising central repository for researchers as well as breeders who are working towards the improvement of legume crops. In addition, this database provides unrestricted public access, and users can freely download all data present in the database. Database URL: http://www.multiomics.in/PvTFDB/.

  10. AERONET Version 3 Release: Providing Significant Improvements for Multi-Decadal Global Aerosol Database and Near Real-Time Validation

    NASA Technical Reports Server (NTRS)

    Holben, Brent; Slutsker, Ilya; Giles, David; Eck, Thomas; Smirnov, Alexander; Sinyuk, Aliaksandr; Schafer, Joel; Sorokin, Mikhail; Rodriguez, Jon; Kraft, Jason; Scully, Amy

    2016-01-01

    Aerosols are highly variable in space, time and properties. Global assessment from satellite platforms and model predictions rely on validation from AERONET, a highly accurate ground-based network. Ver. 3 represents a significant improvement in accuracy and quality.

  11. Ribosomal Database Project II

    DOE Data Explorer

    The Ribosomal Database Project (RDP) provides ribosome related data and services to the scientific community, including online data analysis and aligned and annotated Bacterial small-subunit 16S rRNA sequences. As of March 2008, RDP Release 10 is available and currently (August 2009) contains 1,074,075 aligned 16S rRNA sequences. Data that can be downloaded include zipped GenBank and FASTA alignment files, a histogram (in Excel) of the number of RDP sequences spanning each base position, data in the Functional Gene Pipeline Repository, and various user submitted data. The RDP-II website also provides numerous analysis tools. [From the RDP-II home page at http://rdp.cme.msu.edu/index.jsp]

  12. Algal Functional Annotation Tool from the DOE-UCLA Institute for Genomics and Proteomics

    DOE Data Explorer

    Lopez, David

    The Algal Functional Annotation Tool is a bioinformatics resource to visualize pathway maps, identify enriched biological terms, or convert gene identifiers to elucidate biological function in silico. These types of analysis are tailored to lists of gene identifiers, such as those coming from transcriptome gene expression analysis. By analyzing the functional annotation of an interesting set of genes, common biological motifs may be elucidated and a first-pass analysis can point further research in the right direction. Currently, the following databases have been parsed, processed, and added to the tool: 1) Kyoto Encyclopedia of Genes and Genomes (KEGG) Pathways Database, 2) MetaCyc Encyclopedia of Metabolic Pathways, 3) Panther Pathways Database, 4) Reactome Pathways Database, 5) Gene Ontology, 6) MapMan Ontology, 7) KOG (Eukaryotic Clusters of Orthologous Groups), 8) Pfam, 9) InterPro.

  13. What's in a Name? A Coordinated Approach toward the Correct Use of a Uniform Nomenclature to Improve Patient Reports and Databases.

    PubMed

    Tack, Véronique; Deans, Zandra C; Wolstenholme, Nicola; Patton, Simon; Dequeker, Elisabeth M C

    2016-06-01

    The Human Genome Variation Society (HGVS) recommendations provide standardized nomenclature for reporting variants. This should be encouraged in molecular pathology, both for issuing diagnostic reports and for correct data recording in electronic databases. Many providers of external quality assessment (EQA) promote the correct use of HGVS nomenclature by scoring variant descriptions used in EQA reports. This study focuses on the type and impact of variant nomenclature errors. An assessment was made of EGFR gene variant nomenclature by four EQA providers (European Society of Pathology [ESP], European Molecular Genetics Quality Network [EMQN], United Kingdom National External Quality Assessment Service for Molecular Genetics, and the French national Gen&Tiss EQA scheme) for two EQA distributions. Laboratories testing for oncology biomarkers make different errors when describing EGFR gene variants. Significant differences were observed regarding inclusion of the correct reference sequence: EMQN participants made fewer errors compared to ESP EQA participants (P-value = 0.015). The analysis of ESP EQA participants showed significant improvement over 2 years (P-value = 0.016). Results demonstrate the need for improvement of variant reporting according to HGVS guidelines. Consequences of using incorrect mutation nomenclature are currently perceived as low by many laboratories, but the impact will rise with an increased reliance on databases to assist in result analysis.

  14. Studying Oogenesis in a Non-model Organism Using Transcriptomics: Assembling, Annotating, and Analyzing Your Data.

    PubMed

    Carter, Jean-Michel; Gibbs, Melanie; Breuker, Casper J

    2016-01-01

    This chapter provides a guide to processing and analyzing RNA-Seq data in a non-model organism. This approach was implemented for studying oogenesis in the Speckled Wood Butterfly Pararge aegeria. We focus in particular on how to perform a more informative primary annotation of your non-model organism by implementing our multi-BLAST annotation strategy. We also provide a general guide to other essential steps in the next-generation sequencing analysis workflow. Before undertaking these methods, we recommend you familiarize yourself with command line usage and fundamental concepts of database handling. Most of the operations in the primary annotation pipeline can be performed in Galaxy (or equivalent standalone versions of the tools) and through the use of common database operations (e.g. to remove duplicates) but other equivalent programs and/or custom scripts can be implemented for further automation.

  15. An editing environment for DNA sequence analysis and annotation

    SciTech Connect

    Uberbacher, E.C.; Xu, Y.; Shah, M.B.; Olman, V.; Parang, M.; Mural, R.

    1998-12-31

    This paper presents a computer system for analyzing and annotating large-scale genomic sequences. The core of the system is a multiple-gene structure identification program, which predicts the most probable gene structures based on the given evidence, including pattern recognition, EST and protein homology information. A graphics-based user interface provides an environment which allows the user to interactively control the evidence to be used in the gene identification process. To overcome the computational bottleneck in the database similarity search used in the gene identification process, the authors have developed an effective way to partition a database into a set of sub-databases of related sequences, and reduced the search problem on a large database to a signature identification problem and a search problem on a much smaller sub-database. This reduces the number of sequences to be searched from N to O(√N) on average, and hence greatly reduces the search time, where N is the number of sequences in the original database. The system provides the user with the ability to facilitate and modify the analysis and modeling in real time.
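
    The two-level search described in this abstract can be illustrated with a minimal sketch. The abstract does not say how signatures or sub-databases are built, so everything below is an assumption made for illustration: sub-databases are formed by crude lexicographic grouping, a signature is the union of a group's k-mers, and Jaccard similarity stands in for a real alignment score. With about √N groups of about √N sequences each, a query is compared with roughly √N signatures and then roughly √N sequences.

```python
import math


def kmers(seq, k=3):
    """Return the set of overlapping k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similarity(a, b, k=3):
    """Jaccard similarity of k-mer sets (a stand-in for an alignment score)."""
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    return len(ka & kb) / len(union) if union else 0.0


def partition(database, k=3):
    """Split the database into about sqrt(N) groups of about sqrt(N) sequences.

    Grouping by sorted order is a hypothetical placeholder for clustering
    related sequences. Each group is stored with its signature, here the
    union of the k-mers of its members.
    """
    size = max(1, math.isqrt(len(database)))
    ordered = sorted(database)
    groups = [ordered[i:i + size] for i in range(0, len(ordered), size)]
    return [(set().union(*(kmers(s, k) for s in g)), g) for g in groups]


def search(query, partitions, k=3):
    """Pick the best-matching signature, then scan only that sub-database."""
    qk = kmers(query, k)
    _, members = max(partitions, key=lambda p: len(qk & p[0]))
    return max(members, key=lambda s: similarity(query, s, k))


database = [
    "ACGTACGT", "ACGTACGA", "TTTTGGGG", "TTTTGGGC", "CCCCAAAA",
    "CCCCAAAT", "GGGGCCCC", "GGGGCCCA", "ATATATAT",
]
partitions = partition(database)
best = search("TTTTGGGA", partitions)
```

    Here nine sequences are split into three groups of three, so the query is scored against three signatures and three sequences instead of all nine.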

  16. BIOFILTER AS A FUNCTIONAL ANNOTATION PIPELINE FOR COMMON AND RARE COPY NUMBER BURDEN

    PubMed Central

    KIM, DOKYOON; LUCAS, ANASTASIA; GLESSNER, JOSEPH; VERMA, SHEFALI S.; BRADFORD, YUKI; LI, RUOWANG; FRASE, ALEX T.; HAKONARSON, HAKON; PEISSIG, PEGGY; BRILLIANT, MURRAY; RITCHIE, MARYLYN D.

    2015-01-01

    Recent studies on copy number variation (CNV) have suggested that an increasing burden of CNVs is associated with susceptibility or resistance to disease. A large number of genes or genomic loci contribute to complex diseases such as autism. Thus, total genomic copy number burden, as an accumulation of copy number change, is a meaningful measure of genomic instability to identify the association between global genetic effects and phenotypes of interest. However, no systematic annotation pipeline has been developed to interpret biological meaning based on the accumulation of copy number change across the genome associated with a phenotype of interest. In this study, we develop a comprehensive and systematic pipeline for annotating copy number variants into genes/genomic regions and subsequently pathways and other gene groups using Biofilter – a bioinformatics tool that aggregates over a dozen publicly available databases of prior biological knowledge. Next we conduct enrichment tests of biologically defined groupings of CNVs including genes, pathways, Gene Ontology, or protein families. We applied the proposed pipeline to a CNV dataset from the Marshfield Clinic Personalized Medicine Research Project (PMRP) in a quantitative trait phenotype derived from the electronic health record – total cholesterol. We identified several significant pathways such as toll-like receptor signaling pathway and hepatitis C pathway, gene ontologies (GOs) of nucleoside triphosphatase activity (NTPase) and response to virus, and protein families such as cell morphogenesis that are associated with the total cholesterol phenotype based on CNV profiles (permutation p-value < 0.01). Based on the copy number burden analysis, it follows that the more and larger the copy number changes, the more likely that one or more target genes that influence disease risk and phenotypic severity will be affected. Thus, our study suggests the proposed enrichment pipeline could improve the

  17. BIOFILTER AS A FUNCTIONAL ANNOTATION PIPELINE FOR COMMON AND RARE COPY NUMBER BURDEN.

    PubMed

    Kim, Dokyoon; Lucas, Anastasia; Glessner, Joseph; Verma, Shefali S; Bradford, Yuki; Li, Ruowang; Frase, Alex T; Hakonarson, Hakon; Peissig, Peggy; Brilliant, Murray; Ritchie, Marylyn D

    2016-01-01

    Recent studies on copy number variation (CNV) have suggested that an increasing burden of CNVs is associated with susceptibility or resistance to disease. A large number of genes or genomic loci contribute to complex diseases such as autism. Thus, total genomic copy number burden, as an accumulation of copy number change, is a meaningful measure of genomic instability to identify the association between global genetic effects and phenotypes of interest. However, no systematic annotation pipeline has been developed to interpret biological meaning based on the accumulation of copy number change across the genome associated with a phenotype of interest. In this study, we develop a comprehensive and systematic pipeline for annotating copy number variants into genes/genomic regions and subsequently pathways and other gene groups using Biofilter - a bioinformatics tool that aggregates over a dozen publicly available databases of prior biological knowledge. Next we conduct enrichment tests of biologically defined groupings of CNVs including genes, pathways, Gene Ontology, or protein families. We applied the proposed pipeline to a CNV dataset from the Marshfield Clinic Personalized Medicine Research Project (PMRP) in a quantitative trait phenotype derived from the electronic health record - total cholesterol. We identified several significant pathways such as toll-like receptor signaling pathway and hepatitis C pathway, gene ontologies (GOs) of nucleoside triphosphatase activity (NTPase) and response to virus, and protein families such as cell morphogenesis that are associated with the total cholesterol phenotype based on CNV profiles (permutation p-value < 0.01). Based on the copy number burden analysis, it follows that the more and larger the copy number changes, the more likely that one or more target genes that influence disease risk and phenotypic severity will be affected. Thus, our study suggests the proposed enrichment pipeline could improve the interpretability of
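
    A permutation p-value of the kind reported above can be sketched as follows. The abstract does not specify the test statistic or the permutation scheme, so this is only an assumed illustration: the statistic is the absolute covariance between a per-sample CNV burden within one gene group and a quantitative trait, and the trait values are shuffled across samples. The function name, parameters, and data are hypothetical and are not part of Biofilter.

```python
import random


def permutation_p_value(burden, trait, n_perm=1000, seed=0):
    """Two-sided permutation p-value for burden/trait association.

    burden: per-sample CNV burden within one gene group (assumed input).
    trait:  per-sample quantitative trait, e.g. total cholesterol.
    """
    def abs_covariance(x, y):
        mean_x = sum(x) / len(x)
        mean_y = sum(y) / len(y)
        return abs(sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)))

    rng = random.Random(seed)          # fixed seed keeps the sketch reproducible
    observed = abs_covariance(burden, trait)
    shuffled = list(trait)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)          # break any burden/trait link
        if abs_covariance(burden, shuffled) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


burden = list(range(10))               # made-up burden values
trait = [2 * b for b in burden]        # made-up, perfectly associated trait
p_associated = permutation_p_value(burden, trait)
p_constant = permutation_p_value(burden, [5] * 10)
```

    A group would be called significant at the threshold quoted in the abstract when the returned value is below 0.01; with 1000 permutations the smallest attainable value is 1/1001.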

  18. LocDB: experimental annotations of localization for Homo sapiens and Arabidopsis thaliana.

    PubMed

    Rastogi, Shruti; Rost, Burkhard

    2011-01-01

    LocDB is a manually curated database with experimental annotations for the subcellular localizations of proteins in Homo sapiens (HS, human) and Arabidopsis thaliana (AT, thale cress). Currently, it contains entries for 19,604 UniProt proteins (HS: 13,342; AT: 6262). Each database entry contains the experimentally derived localization in Gene Ontology (GO) terminology, the experimental annotation of localization, localization predictions by state-of-the-art methods and, where available, the type of experimental information. LocDB is searchable by keyword, protein name and subcellular compartment, as well as by identifiers from UniProt, Ensembl and TAIR resources. In comparison to other public databases, LocDB as a resource adds about 10,000 experimental localization annotations for HS proteins and ∼900 for AT proteins. Over 40% of the proteins in LocDB have multiple localization annotations providing a better platform for development of new multiple localization prediction methods with higher coverage and accuracy. Links to all referenced databases are provided. LocDB will be updated regularly by our group (available at: http://www.rostlab.org/services/locDB).

  19. Adding Value to Large Multimedia Collections through Annotation Technologies and Tools: Serving Communities of Interest.

    ERIC Educational Resources Information Center

    Shabajee, Paul; Miller, Libby; Dingley, Andy

    A group of research projects based at HP-Labs Bristol, the University of Bristol (England) and ARKive (a new large multimedia database project focused on the world's biodiversity based in the United Kingdom) are working to develop a flexible model for the indexing of multimedia collections that allows users to annotate content utilizing extensible…

  20. CBMAR: a comprehensive β-lactamase molecular annotation resource

    PubMed Central

    Srivastava, Abhishikha; Singhal, Neelja; Goel, Manisha; Virdi, Jugsharan Singh; Kumar, Manish

    2014-01-01

    β-Lactam antibiotics are among the most widely used antibiotics against microbial pathogens. However, enzymatic hydrolysis of these antibiotics by bacterial β-lactamases is increasingly compromising their efficiency. Although new generation β-lactam antibiotics have been developed to combat antibiotic resistance, β-lactamases have also evolved along with the new variants of the substrate. A strong selection pressure from the newer generation of β-lactam antibiotics has resulted in evolution of different families within each class of β-lactamase. To facilitate detailed characterization of different families of β-lactamases, we have created a database, CBMAR, which facilitates comprehensive molecular annotation and discovery of novel β-lactamases. As against the limited scope of other existing similar databases, CBMAR provides information useful for molecular and biochemical characterization of each family of β-lactamase. The basic architecture of CBMAR is based on Ambler classification, which divides β-lactamases as serine (Classes A, C and D) and metallo-β-lactamases (Class B). Each class is further divided into several families on the basis of their hydrolytic character. In CBMAR, each family is annotated with (i) sequence variability, (ii) antibiotic resistance profile, (iii) inhibitor susceptibility, (iv) active site, (v) family fingerprints, (vi) mutational profile, (vii) variants, (viii) gene location, (ix) phylogenetic tree and several other features. Each entry also has external links to the relevant protein/nucleotide sequence and structure databases. The database also supports sequence similarity searches using BLAST and assigns a new β-lactamase protein to its respective family on the basis of family-specific fingerprint. Database URL: http://14.139.227.92/mkumar/lactamasedb PMID:25475113

  1. A Novel Quality Measure and Correction Procedure for the Annotation of Microbial Translation Initiation Sites

    PubMed Central

    Overmars, Lex; Siezen, Roland J.; Francke, Christof

    2015-01-01

    The identification of translation initiation sites (TISs) constitutes an important aspect of sequence-based genome analysis. An erroneous TIS annotation can impair the identification of regulatory elements and N-terminal signal peptides, and also may flaw the determination of descent, for any particular gene. We have formulated a reference-free method to score the TIS annotation quality. The method is based on a comparison of the observed and expected distribution of all TISs in a particular genome given prior gene-calling. We have assessed the TIS annotations for all available NCBI RefSeq microbial genomes and found that approximately 87% is of appropriate quality, whereas 13% needs substantial improvement. We have analyzed a number of factors that could affect TIS annotation quality such as GC-content, taxonomy, the fraction of genes with a Shine-Dalgarno sequence and the year of publication. The analysis showed that only the first factor has a clear effect. We have then formulated a straightforward Principal Component Analysis-based TIS identification strategy to self-organize and score potential TISs. The strategy is independent of reference data and a priori calculations. A representative set of 277 genomes was subjected to the analysis and we found a clear increase in TIS annotation quality for the genomes with a low quality score. The PCA-based annotation was also compared with annotation with the current tool of reference, Prodigal. The comparison for the model genome of Escherichia coli K12 showed that both methods supplement each other and that prediction agreement can be used as an indicator of a correct TIS annotation. Importantly, the data suggest that the addition of a PCA-based strategy to a Prodigal prediction can be used to ‘flag’ TIS annotations for re-evaluation and in addition can be used to evaluate a given annotation in case a Prodigal annotation is lacking. PMID:26204119

  2. Mining a database of single amplified genomes from Red Sea brine pool extremophiles-improving reliability of gene function prediction using a profile and pattern matching algorithm (PPMA).

    PubMed

    Grötzinger, Stefan W; Alam, Intikhab; Ba Alawi, Wail; Bajic, Vladimir B; Stingl, Ulrich; Eppinger, Jörg

    2014-01-01

    Reliable functional annotation of genomic data is the key step in the discovery of novel enzymes. Intrinsic sequencing data quality problems of single amplified genomes (SAGs) and poor homology of novel extremophiles' genomes pose significant challenges for the attribution of functions to the coding sequences identified. The anoxic deep-sea brine pools of the Red Sea are a promising source of novel enzymes with unique evolutionary adaptation. Sequencing data from Red Sea brine pool cultures and SAGs are annotated and stored in the Integrated Data Warehouse of Microbial Genomes (INDIGO) data warehouse. Low sequence homology of annotated genes (no similarity for 35% of these genes) may translate into false positives when searching for specific functions. The Profile and Pattern Matching (PPM) strategy described here was developed to eliminate false positive annotations of enzyme function before progressing to labor-intensive hyper-saline gene expression and characterization. It utilizes InterPro-derived Gene Ontology (GO)-terms (which represent enzyme function profiles) and annotated relevant PROSITE IDs (which are linked to an amino acid consensus pattern). The PPM algorithm was tested on 15 protein families, which were selected based on scientific and commercial potential. An initial list of 2577 enzyme commission (E.C.) numbers was translated into 171 GO-terms and 49 consensus patterns. A subset of INDIGO-sequences consisting of 58 SAGs from six different taxa of bacteria and archaea was selected from six different brine pool environments. Those SAGs code for 74,516 genes, which were independently scanned for the GO-terms (profile filter) and PROSITE IDs (pattern filter). Following stringent reliability filtering, the non-redundant hits (106 profile hits and 147 pattern hits) are classified as reliable if at least two relevant descriptors (GO-terms and/or consensus patterns) are present. Scripts for annotation, as well as for the PPM algorithm, are available
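
    The reliability rule stated in this abstract (a hit counts as reliable when at least two relevant descriptors, GO-terms and/or consensus patterns, are present) can be sketched as below. This is not the authors' script: the function name, the input layout as (gene, descriptor) pairs, and the gene and descriptor identifiers are all hypothetical placeholders chosen only to show the counting step.

```python
from collections import defaultdict


def reliable_genes(profile_hits, pattern_hits, min_descriptors=2):
    """Return genes supported by at least min_descriptors distinct descriptors.

    profile_hits: iterable of (gene_id, go_term) pairs from the profile filter.
    pattern_hits: iterable of (gene_id, prosite_id) pairs from the pattern filter.
    """
    descriptors = defaultdict(set)
    for gene, go_term in profile_hits:
        descriptors[gene].add(("GO", go_term))
    for gene, prosite_id in pattern_hits:
        descriptors[gene].add(("PROSITE", prosite_id))
    # Sets remove redundant hits; keep only genes with enough distinct support.
    return {
        gene: sorted(found)
        for gene, found in descriptors.items()
        if len(found) >= min_descriptors
    }


# Placeholder identifiers, used for their format only.
profile_hits = [("gene_1", "GO:A"), ("gene_2", "GO:B"), ("gene_1", "GO:A")]
pattern_hits = [("gene_1", "PS_X"), ("gene_3", "PS_Y")]
reliable = reliable_genes(profile_hits, pattern_hits)
```

    In this toy input only gene_1 passes, because it is the only gene with two distinct descriptors; the repeated GO hit for gene_1 is counted once.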

  3. Mining a database of single amplified genomes from Red Sea brine pool extremophiles—improving reliability of gene function prediction using a profile and pattern matching algorithm (PPMA)

    PubMed Central

    Grötzinger, Stefan W.; Alam, Intikhab; Ba Alawi, Wail; Bajic, Vladimir B.; Stingl, Ulrich; Eppinger, Jörg

    2014-01-01

    Reliable functional annotation of genomic data is the key step in the discovery of novel enzymes. Intrinsic sequencing data quality problems of single amplified genomes (SAGs) and poor homology of novel extremophiles' genomes pose significant challenges for the attribution of functions to the coding sequences identified. The anoxic deep-sea brine pools of the Red Sea are a promising source of novel enzymes with unique evolutionary adaptation. Sequencing data from Red Sea brine pool cultures and SAGs are annotated and stored in the Integrated Data Warehouse of Microbial Genomes (INDIGO) data warehouse. Low sequence homology of annotated genes (no similarity for 35% of these genes) may translate into false positives when searching for specific functions. The Profile and Pattern Matching (PPM) strategy described here was developed to eliminate false positive annotations of enzyme function before progressing to labor-intensive hyper-saline gene expression and characterization. It utilizes InterPro-derived Gene Ontology (GO)-terms (which represent enzyme function profiles) and annotated relevant PROSITE IDs (which are linked to an amino acid consensus pattern). The PPM algorithm was tested on 15 protein families, which were selected based on scientific and commercial potential. An initial list of 2577 enzyme commission (E.C.) numbers was translated into 171 GO-terms and 49 consensus patterns. A subset of INDIGO-sequences consisting of 58 SAGs from six different taxa of bacteria and archaea was selected from six different brine pool environments. Those SAGs code for 74,516 genes, which were independently scanned for the GO-terms (profile filter) and PROSITE IDs (pattern filter). Following stringent reliability filtering, the non-redundant hits (106 profile hits and 147 pattern hits) are classified as reliable if at least two relevant descriptors (GO-terms and/or consensus patterns) are present. Scripts for annotation, as well as for the PPM algorithm, are available

  4. Image Databases.

    ERIC Educational Resources Information Center

    Pettersson, Rune

    Different kinds of pictorial databases are described with respect to aims, user groups, search possibilities, storage, and distribution. Some specific examples are given for databases used for the following purposes: (1) labor markets for artists; (2) document management; (3) telling a story; (4) preservation (archives and museums); (5) research;…

  5. Maize databases

    Technology Transfer Automated Retrieval System (TEKTRAN)

    This chapter is a succinct overview of maize data held in the species-specific database MaizeGDB (the Maize Genomics and Genetics Database), and selected multi-species data repositories, such as Gramene/Ensembl Plants, Phytozome, UniProt and the National Center for Biotechnology Information (NCBI), ...

  6. Towards understanding the first genome sequence of a crenarchaeon by genome annotation using clusters of orthologous groups of proteins (COGs)

    PubMed Central

    Natale, Darren A; Shankavaram, Uma T; Galperin, Michael Y; Wolf, Yuri I; Aravind, L; Koonin, Eugene V

    2000-01-01

    Background: Standard archival sequence databases have not been designed as tools for genome annotation and are far from being optimal for this purpose. We used the database of Clusters of Orthologous Groups of proteins (COGs) to reannotate the genomes of two archaea, Aeropyrum pernix, the first member of the Crenarchaea to be sequenced, and Pyrococcus abyssi. Results: A. pernix and P. abyssi proteins were assigned to COGs using the COGNITOR program; the results were verified on a case-by-case basis and augmented by additional database searches using the PSI-BLAST and TBLASTN programs. Functions were predicted for over 300 proteins from A. pernix, which could not be assigned a function using conventional methods with a conservative sequence similarity threshold, an approximately 50% increase compared to the original annotation. A. pernix shares most of the conserved core of proteins that were previously identified in the Euryarchaeota. Cluster analysis or distance matrix tree construction based on the co-occurrence of genomes in COGs showed that A. pernix forms a distinct group within the archaea, although grouping with the two species of Pyrococci, indicative of similar repertoires of conserved genes, was observed. No indication of a specific relationship between Crenarchaeota and eukaryotes was obtained in these analyses. Several proteins that are conserved in Euryarchaeota and most bacteria are unexpectedly missing in A. pernix, including the entire set of de novo purine biosynthesis enzymes, the GTPase FtsZ (a key component of the bacterial and euryarchaeal cell-division machinery), and the tRNA-specific pseudouridine synthase, previously considered universal. A. pernix is represented in 48 COGs that do not contain any euryarchaeal members. Many of these proteins are TCA cycle and electron transport chain enzymes, reflecting the aerobic lifestyle of A. pernix. Conclusions: Special-purpose databases organized on the basis of phylogenetic analysis and carefully

  7. High-throughput comparison, functional annotation, and metabolic modeling of plant genomes using the PlantSEED resource.

    PubMed

    Seaver, Samuel M D; Gerdes, Svetlana; Frelin, Océane; Lerma-Ortiz, Claudia; Bradbury, Louis M T; Zallot, Rémi; Hasnain, Ghulam; Niehaus, Thomas D; El Yacoubi, Basma; Pasternak, Shiran; Olson, Robert; Pusch, Gordon; Overbeek, Ross; Stevens, Rick; de Crécy-Lagard, Valérie; Ware, Doreen; Hanson, Andrew D; Henry, Christopher S

    2014-07-01

    The increasing number of sequenced plant genomes is placing new demands on the methods applied to analyze, annotate, and model these genomes. Today's annotation pipelines result in inconsistent gene assignments that complicate comparative analyses and prevent efficient construction of metabolic models. To overcome these problems, we have developed the PlantSEED, an integrated, metabolism-centric database to support subsystems-based annotation and metabolic model reconstruction for plant genomes. PlantSEED combines SEED subsystems technology, first developed for microbial genomes, with refined protein families and biochemical data to assign fully consistent functional annotations to orthologous genes, particularly those encoding primary metabolic pathways. Seamless integration with its parent, the prokaryotic SEED database, makes PlantSEED a unique environment for cross-kingdom comparative analysis of plant and bacterial genomes. The consistent annotations imposed by PlantSEED permit rapid reconstruction and modeling of primary metabolism for all plant genomes in the database. This feature opens the unique possibility of model-based assessment of the completeness and accuracy of gene annotation and thus allows computational identification of genes and pathways that are restricted to certain genomes or need better curation. We demonstrate the PlantSEED system by producing consistent annotations for 10 reference genomes. We also produce a functioning metabolic model for each genome, gapfilling to identify missing annotations and proposing gene candidates for missing annotations. Models are built around an extended biomass composition representing the most comprehensive published to date. To our knowledge, our models are the first to be published for seven of the genomes analyzed.

  8. Hydrogen Leak Detection Sensor Database

    NASA Technical Reports Server (NTRS)

    Baker, Barton D.

    2010-01-01

    This slide presentation reviews the characteristics of the Hydrogen Sensor database. The database is the result of NASA's continuing interest in and improvement of its ability to detect and assess gas leaks in space applications. The database specifics and a snapshot of an entry in the database are reviewed. Attempts were made to determine the applicability of each of the 65 sensors for ground and/or vehicle use.

  9. A computational interactome and functional annotation for the human proteome

    PubMed Central

    Garzón, José Ignacio; Deng, Lei; Murray, Diana; Shapira, Sagi; Petrey, Donald; Honig, Barry

    2016-01-01

    We present a database, PrePPI (Predicting Protein-Protein Interactions), of more than 1.35 million predicted protein-protein interactions (PPIs). Of these at least 127,000 are expected to constitute direct physical interactions although the actual number may be much larger (~500,000). The current PrePPI, which contains predicted interactions for about 85% of the human proteome, is related to an earlier version but is based on additional sources of interaction evidence and is far larger in scope. The use of structural relationships allows PrePPI to infer numerous previously unreported interactions. PrePPI has been subjected to a series of validation tests including reproducing known interactions, recapitulating multi-protein complexes, analysis of disease associated SNPs, and identifying functional relationships between interacting proteins. We show, using Gene Set Enrichment Analysis (GSEA), that predicted interaction partners can be used to annotate a protein’s function. We provide annotations for most human proteins, including many annotated as having unknown function. DOI: http://dx.doi.org/10.7554/eLife.18715.001 PMID:27770567

  10. Rice Glycosyltransferase (GT) Phylogenomic Database

    DOE Data Explorer

    Ronald, Pamela

    The Ronald Laboratory staff at the University of California-Davis has a primary research focus on the genes of the rice plant. They study the role that genetics plays in the way rice plants respond to their environment. They created the Rice GT Database in order to integrate functional genomic information for putative rice Glycosyltransferases (GTs). This database contains information on nearly 800 putative rice GTs (gene models) identified by sequence similarity searches based on the Carbohydrate Active enZymes (CAZy) database. The Rice GT Database provides a platform to display user-selected functional genomic data on a phylogenetic tree. This includes sequence information, mutant line information, expression data, etc. An interactive chromosomal map shows the position of all rice GTs, and links to rice annotation databases are included. The format is intended to "facilitate the comparison of closely related GTs within different families, as well as perform global comparisons between sets of related families." [From http://ricephylogenomics.ucdavis.edu/cellwalls/gt/genInfo.shtml] See also the primary paper discussing this work: Peijian Cao, Laura E. Bartley, Ki-Hong Jung and Pamela C. Ronald. Construction of a Rice Glycosyltransferase Phylogenomic Database and Identification of Rice-Diverged Glycosyltransferases. Molecular Plant, 2008, 1(5): 858-877.

  11. Automatically Annotating Topics in Transcripts of Patient-Provider Interactions via Machine Learning

    PubMed Central

    Wallace, Byron C.; Laws, M. Barton; Small, Kevin; Wilson, Ira B.; Trikalinos, Thomas A.

    2013-01-01

    Background: Annotated patient-provider encounters can provide important insights into clinical communication, ultimately suggesting how it might be improved to effect better health outcomes. But annotating outpatient transcripts with Roter or General Medical Interaction Analysis System (GMIAS) codes is expensive, limiting the scope of such analyses. We propose automatically annotating transcripts of patient-provider interactions with topic codes via machine learning. Methods: We use a conditional random field (CRF) to model utterance topic probabilities. The model accounts for the sequential structure of conversations and the words comprising utterances. We assess predictive performance via 10-fold cross-validation over GMIAS-annotated transcripts of 360 outpatient visits (over 230,000 utterances). We then used automated in place of manual annotations to reproduce an analysis of 116 additional visits from a randomized trial that used GMIAS to assess the efficacy of an intervention aimed at improving communication around antiretroviral (ARV) adherence. Results: With respect to six topic codes, the CRF achieved a mean pairwise kappa compared with human annotators of 0.49 (range: 0.47, 0.53) and a mean overall accuracy of 0.64 (range: 0.62, 0.66). With respect to the RCT re-analysis, results using automated annotations agreed with those obtained using manual ones. According to the manual annotations, the median number of ARV-related utterances without and with the intervention was 49.5 versus 76, respectively (paired sign test p=0.07). Using automated annotations, the respective numbers were 39 versus 55 (p=0.04). Limitations: While moderately accurate, the predicted annotations are far from perfect. Conversational topics are intermediate outcomes; their utility is still being researched. Conclusions: This foray into automated topic inference suggests that machine learning methods can classify utterances comprising patient-provider interactions into clinically relevant
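
    The record above reports agreement between automated and human annotations as a pairwise kappa and an overall accuracy. As a rough illustration of what those two numbers measure, here is a minimal Python sketch of Cohen's kappa for one pair of annotators. It is not the authors' code: the function names and the topic labels in the usage note are hypothetical placeholders, and the abstract does not say which kappa variant was used, so Cohen's kappa is an assumption.

```python
from collections import Counter


def accuracy(labels_a, labels_b):
    """Fraction of utterances on which the two annotators agree."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement corrected for chance agreement.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected if each annotator assigned labels
    independently according to their own label frequencies.
    """
    observed = accuracy(labels_a, labels_b)
    n = len(labels_a)
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    # Counter returns 0 for labels that the second annotator never used.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

    For example, with the placeholder sequences ["x", "x", "y", "y"] for a human annotator and ["x", "y", "y", "y"] for a model, the observed agreement is 0.75, the chance agreement is 0.5, and kappa is 0.5. A mean pairwise kappa would average this quantity over annotator pairs.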

  12. Annotating Socio-Cultural Structures in Text

    DTIC Science & Technology

    2012-10-31

    from the traditional k-Nearest Neighbor (kNN) algorithm. Using experiments on three different multi-label learning problems, i.e. Yeast gene ...annotated NP/ VP Pane: Shows the sentence parsed using the Parts of Speech tagger Document View Pane: Specifies the document (being annotated) in three...used to annotate the document. In the current application we use the Level 1, Level 2 taxonomy. New concepts may be added to or deleted from the

  13. Survival improvement in patients with pancreatic cancer by decade: A period analysis of the SEER database, 1981–2010

    PubMed Central

    Sun, Huanhuan; Ma, Haiqing; Hong, Guobin; Sun, Hongliu; Wang, Jin

    2014-01-01

    Pancreatic cancer (PaCa) is an aggressive malignancy with a high mortality rate and a poor prognos