Science.gov

Sample records for genome annotation expert

  1. IMG ER: A System for Microbial Genome Annotation Expert Review and Curation

    SciTech Connect

    Markowitz, Victor M.; Mavromatis, Konstantinos; Ivanova, Natalia N.; Chen, I-Min A.; Chu, Ken; Kyrpides, Nikos C.

    2009-05-25

    A rapidly increasing number of microbial genomes are sequenced by organizations worldwide and are eventually included into various public genome data resources. The quality of the annotations depends largely on the original dataset providers, with erroneous or incomplete annotations often carried over into the public resources and difficult to correct. We have developed an Expert Review (ER) version of the Integrated Microbial Genomes (IMG) system, with the goal of supporting systematic and efficient revision of microbial genome annotations. IMG ER provides tools for the review and curation of annotations of both new and publicly available microbial genomes within IMG's rich integrated genome framework. New genome datasets are included into IMG ER prior to their public release either with their native annotations or with annotations generated by IMG ER's annotation pipeline. IMG ER tools allow addressing annotation problems detected with IMG's comparative analysis tools, such as genes missed by gene prediction pipelines or genes without an associated function. Over the past year, IMG ER was used for improving the annotations of about 150 microbial genomes.

  2. An Introduction to Genome Annotation.

    PubMed

    Campbell, Michael S; Yandell, Mark

    2015-12-17

    Genome projects have evolved from large international undertakings to tractable endeavors for a single lab. Accurate genome annotation is critical for successful genomic, genetic, and molecular biology experiments. These annotations can be generated using a number of approaches and available software tools. This unit describes methods for genome annotation and a number of software tools commonly used in gene annotation.

  3. Human Genome Annotation

    NASA Astrophysics Data System (ADS)

    Gerstein, Mark

    A central problem for 21st century science is annotating the human genome and making this annotation useful for the interpretation of personal genomes. My talk will focus on annotating the 99% of the genome that does not code for canonical genes, concentrating on intergenic features such as structural variants (SVs), pseudogenes (protein fossils), binding sites, and novel transcribed RNAs (ncRNAs). In particular, I will describe how we identify regulatory sites and variable blocks (SVs) based on processing next-generation sequencing experiments. I will further explain how we cluster together groups of sites to create larger annotations. Next, I will discuss a comprehensive pseudogene identification pipeline, which has enabled us to identify >10K pseudogenes in the genome and analyze their distribution with respect to age, protein family, and chromosomal location. Throughout, I will try to introduce some of the computational algorithms and approaches that are required for genome annotation. Much of this work has been carried out in the framework of the ENCODE, modENCODE, and 1000 genomes projects.

  4. NCBI prokaryotic genome annotation pipeline.

    PubMed

    Tatusova, Tatiana; DiCuccio, Michael; Badretdin, Azat; Chetvernin, Vyacheslav; Nawrocki, Eric P; Zaslavsky, Leonid; Lomsadze, Alexandre; Pruitt, Kim D; Borodovsky, Mark; Ostell, James

    2016-08-19

    Recent technological advances have opened unprecedented opportunities for large-scale sequencing and analysis of populations of pathogenic species in disease outbreaks, as well as for large-scale diversity studies aimed at expanding our knowledge across the whole domain of prokaryotes. To meet the challenge of timely interpretation of structure, function and meaning of this vast genetic information, a comprehensive approach to automatic genome annotation is critically needed. In collaboration with Georgia Tech, NCBI has developed a new approach to genome annotation that combines alignment based methods with methods of predicting protein-coding and RNA genes and other functional elements directly from sequence. A new gene finding tool, GeneMarkS+, uses the combined evidence of protein and RNA placement by homology as an initial map of annotation to generate and modify ab initio gene predictions across the whole genome. Thus, the new NCBI's Prokaryotic Genome Annotation Pipeline (PGAP) relies more on sequence similarity when confident comparative data are available, while it relies more on statistical predictions in the absence of external evidence. The pipeline provides a framework for generation and analysis of annotation on the full breadth of prokaryotic taxonomy. For additional information on PGAP see https://www.ncbi.nlm.nih.gov/genome/annotation_prok/ and the NCBI Handbook, https://www.ncbi.nlm.nih.gov/books/NBK174280/. PMID:27342282

  5. NCBI prokaryotic genome annotation pipeline.

    PubMed

    Tatusova, Tatiana; DiCuccio, Michael; Badretdin, Azat; Chetvernin, Vyacheslav; Nawrocki, Eric P; Zaslavsky, Leonid; Lomsadze, Alexandre; Pruitt, Kim D; Borodovsky, Mark; Ostell, James

    2016-08-19

    Recent technological advances have opened unprecedented opportunities for large-scale sequencing and analysis of populations of pathogenic species in disease outbreaks, as well as for large-scale diversity studies aimed at expanding our knowledge across the whole domain of prokaryotes. To meet the challenge of timely interpretation of structure, function and meaning of this vast genetic information, a comprehensive approach to automatic genome annotation is critically needed. In collaboration with Georgia Tech, NCBI has developed a new approach to genome annotation that combines alignment based methods with methods of predicting protein-coding and RNA genes and other functional elements directly from sequence. A new gene finding tool, GeneMarkS+, uses the combined evidence of protein and RNA placement by homology as an initial map of annotation to generate and modify ab initio gene predictions across the whole genome. Thus, the new NCBI's Prokaryotic Genome Annotation Pipeline (PGAP) relies more on sequence similarity when confident comparative data are available, while it relies more on statistical predictions in the absence of external evidence. The pipeline provides a framework for generation and analysis of annotation on the full breadth of prokaryotic taxonomy. For additional information on PGAP see https://www.ncbi.nlm.nih.gov/genome/annotation_prok/ and the NCBI Handbook, https://www.ncbi.nlm.nih.gov/books/NBK174280/.

  6. Automatic annotation of organellar genomes with DOGMA

    SciTech Connect

    Wyman, Stacia; Jansen, Robert K.; Boore, Jeffrey L.

    2004-06-01

    Dual Organellar GenoMe Annotator (DOGMA) automates the annotation of extra-nuclear organellar (chloroplast and animal mitochondrial) genomes. It is a web-based package that allows the use of comparative BLAST searches to identify and annotate genes in a genome. DOGMA presents a list of putative genes to the user in a graphical format for viewing and editing. Annotations are stored on our password-protected server. Complete annotations can be extracted for direct submission to GenBank. Furthermore, intergenic regions of specified length can be extracted, as well the nucleotide sequences and amino acid sequences of the genes.

  7. Hymenoptera Genome Database: integrating genome annotations in HymenopteraMine

    PubMed Central

    Elsik, Christine G.; Tayal, Aditi; Diesh, Colin M.; Unni, Deepak R.; Emery, Marianne L.; Nguyen, Hung N.; Hagen, Darren E.

    2016-01-01

    We report an update of the Hymenoptera Genome Database (HGD) (http://HymenopteraGenome.org), a model organism database for insect species of the order Hymenoptera (ants, bees and wasps). HGD maintains genomic data for 9 bee species, 10 ant species and 1 wasp, including the versions of genome and annotation data sets published by the genome sequencing consortiums and those provided by NCBI. A new data-mining warehouse, HymenopteraMine, based on the InterMine data warehousing system, integrates the genome data with data from external sources and facilitates cross-species analyses based on orthology. New genome browsers and annotation tools based on JBrowse/WebApollo provide easy genome navigation, and viewing of high throughput sequence data sets and can be used for collaborative genome annotation. All of the genomes and annotation data sets are combined into a single BLAST server that allows users to select and combine sequence data sets to search. PMID:26578564

  8. Hymenoptera Genome Database: integrating genome annotations in HymenopteraMine.

    PubMed

    Elsik, Christine G; Tayal, Aditi; Diesh, Colin M; Unni, Deepak R; Emery, Marianne L; Nguyen, Hung N; Hagen, Darren E

    2016-01-01

    We report an update of the Hymenoptera Genome Database (HGD) (http://HymenopteraGenome.org), a model organism database for insect species of the order Hymenoptera (ants, bees and wasps). HGD maintains genomic data for 9 bee species, 10 ant species and 1 wasp, including the versions of genome and annotation data sets published by the genome sequencing consortiums and those provided by NCBI. A new data-mining warehouse, HymenopteraMine, based on the InterMine data warehousing system, integrates the genome data with data from external sources and facilitates cross-species analyses based on orthology. New genome browsers and annotation tools based on JBrowse/WebApollo provide easy genome navigation, and viewing of high throughput sequence data sets and can be used for collaborative genome annotation. All of the genomes and annotation data sets are combined into a single BLAST server that allows users to select and combine sequence data sets to search. PMID:26578564

  9. Hymenoptera Genome Database: integrating genome annotations in HymenopteraMine.

    PubMed

    Elsik, Christine G; Tayal, Aditi; Diesh, Colin M; Unni, Deepak R; Emery, Marianne L; Nguyen, Hung N; Hagen, Darren E

    2016-01-01

    We report an update of the Hymenoptera Genome Database (HGD) (http://HymenopteraGenome.org), a model organism database for insect species of the order Hymenoptera (ants, bees and wasps). HGD maintains genomic data for 9 bee species, 10 ant species and 1 wasp, including the versions of genome and annotation data sets published by the genome sequencing consortiums and those provided by NCBI. A new data-mining warehouse, HymenopteraMine, based on the InterMine data warehousing system, integrates the genome data with data from external sources and facilitates cross-species analyses based on orthology. New genome browsers and annotation tools based on JBrowse/WebApollo provide easy genome navigation, and viewing of high throughput sequence data sets and can be used for collaborative genome annotation. All of the genomes and annotation data sets are combined into a single BLAST server that allows users to select and combine sequence data sets to search.

  10. JGI Plant Genomics Gene Annotation Pipeline

    SciTech Connect

    Shu, Shengqiang; Rokhsar, Dan; Goodstein, David; Hayes, David; Mitros, Therese

    2014-07-14

    Plant genomes vary in size and are highly complex with a high amount of repeats, genome duplication and tandem duplication. Gene encodes a wealth of information useful in studying organism and it is critical to have high quality and stable gene annotation. Thanks to advancement of sequencing technology, many plant species genomes have been sequenced and transcriptomes are also sequenced. To use these vastly large amounts of sequence data to make gene annotation or re-annotation in a timely fashion, an automatic pipeline is needed. JGI plant genomics gene annotation pipeline, called integrated gene call (IGC), is our effort toward this aim with aid of a RNA-seq transcriptome assembly pipeline. It utilizes several gene predictors based on homolog peptides and transcript ORFs. See Methods for detail. Here we present genome annotation of JGI flagship green plants produced by this pipeline plus Arabidopsis and rice except for chlamy which is done by a third party. The genome annotations of these species and others are used in our gene family build pipeline and accessible via JGI Phytozome portal whose URL and front page snapshot are shown below.

  11. MicroScope: a platform for microbial genome annotation and comparative genomics.

    PubMed

    Vallenet, D; Engelen, S; Mornico, D; Cruveiller, S; Fleury, L; Lajus, A; Rouy, Z; Roche, D; Salvignol, G; Scarpelli, C; Médigue, C

    2009-01-01

    The initial outcome of genome sequencing is the creation of long text strings written in a four letter alphabet. The role of in silico sequence analysis is to assist biologists in the act of associating biological knowledge with these sequences, allowing investigators to make inferences and predictions that can be tested experimentally. A wide variety of software is available to the scientific community, and can be used to identify genomic objects, before predicting their biological functions. However, only a limited number of biologically interesting features can be revealed from an isolated sequence. Comparative genomics tools, on the other hand, by bringing together the information contained in numerous genomes simultaneously, allow annotators to make inferences based on the idea that evolution and natural selection are central to the definition of all biological processes. We have developed the MicroScope platform in order to offer a web-based framework for the systematic and efficient revision of microbial genome annotation and comparative analysis (http://www.genoscope.cns.fr/agc/microscope). Starting with the description of the flow chart of the annotation processes implemented in the MicroScope pipeline, and the development of traditional and novel microbial annotation and comparative analysis tools, this article emphasizes the essential role of expert annotation as a complement of automatic annotation. Several examples illustrate the use of implemented tools for the review and curation of annotations of both new and publicly available microbial genomes within MicroScope's rich integrated genome framework. The platform is used as a viewer in order to browse updated annotation information of available microbial genomes (more than 440 organisms to date), and in the context of new annotation projects (117 bacterial genomes). The human expertise gathered in the MicroScope database (about 280,000 independent annotations) contributes to improve the quality of

  12. Genepi: a blackboard framework for genome annotation

    PubMed Central

    Descorps-Declère, Stéphane; Ziébelin, Danielle; Rechenmann, François; Viari, Alain

    2006-01-01

    Background Genome annotation can be viewed as an incremental, cooperative, data-driven, knowledge-based process that involves multiple methods to predict gene locations and structures. This process might have to be executed more than once and might be subjected to several revisions as the biological (new data) or methodological (new methods) knowledge evolves. In this context, although a lot of annotation platforms already exist, there is still a strong need for computer systems which take in charge, not only the primary annotation, but also the update and advance of the associated knowledge. In this paper, we propose to adopt a blackboard architecture for designing such a system Results We have implemented a blackboard framework (called Genepi) for developing automatic annotation systems. The system is not bound to any specific annotation strategy. Instead, the user will specify a blackboard structure in a configuration file and the system will instantiate and run this particular annotation strategy. The characteristics of this framework are presented and discussed. Specific adaptations to the classical blackboard architecture have been required, such as the description of the activation patterns of the knowledge sources by using an extended set of Allen's temporal relations. Although the system is robust enough to be used on real-size applications, it is of primary use to bioinformatics researchers who want to experiment with blackboard architectures. Conclusion In the context of genome annotation, blackboards have several interesting features related to the way methodological and biological knowledge can be updated. They can readily handle the cooperative (several methods are implied) and opportunistic (the flow of execution depends on the state of our knowledge) aspects of the annotation process. PMID:17038181

  13. Challenges in Whole-Genome Annotation of Pyrosequenced Eukaryotic Genomes

    SciTech Connect

    Kuo, Alan; Grigoriev, Igor

    2009-04-17

    Pyrosequencing technologies such as 454/Roche and Solexa/Illumina vastly lower the cost of nucleotide sequencing compared to the traditional Sanger method, and thus promise to greatly expand the number of sequenced eukaryotic genomes. However, the new technologies also bring new challenges such as shorter reads and new kinds and higher rates of sequencing errors, which complicate genome assembly and gene prediction. At JGI we are deploying 454 technology for the sequencing and assembly of ever-larger eukaryotic genomes. Here we describe our first whole-genome annotation of a purely 454-sequenced fungal genome that is larger than a yeast (>30 Mbp). The pezizomycotine (filamentous ascomycote) Aspergillus carbonarius belongs to the Aspergillus section Nigri species complex, members of which are significant as platforms for bioenergy and bioindustrial technology, as members of soil microbial communities and players in the global carbon cycle, and as agricultural toxigens. Application of a modified version of the standard JGI Annotation Pipeline has so far predicted ~;;10k genes. ~;;12percent of these preliminary annotations suffer a potential frameshift error, which is somewhat higher than the ~;;9percent rate in the Sanger-sequenced and conventionally assembled and annotated genome of fellow Aspergillus section Nigri member A. niger. Also,>90percent of A. niger genes have potential homologs in the A. carbonarius preliminary annotation. Weconclude, and with further annotation and comparative analysis expect to confirm, that 454 sequencing strategies provide a promising substrate for annotation of modestly sized eukaryotic genomes. We will also present results of annotation of a number of other pyrosequenced fungal genomes of bioenergy interest.

  14. Functional Annotation Analytics of Rhodopseudomonas palustris Genomes

    PubMed Central

    Simmons, Shaneka S.; Isokpehi, Raphael D.; Brown, Shyretha D.; McAllister, Donee L.; Hall, Charnia C.; McDuffy, Wanaki M.; Medley, Tamara L.; Udensi, Udensi K.; Rajnarayanan, Rajendram V.; Ayensu, Wellington K.; Cohly, Hari H.P.

    2011-01-01

    Rhodopseudomonas palustris, a nonsulphur purple photosynthetic bacteria, has been extensively investigated for its metabolic versatility including ability to produce hydrogen gas from sunlight and biomass. The availability of the finished genome sequences of six R. palustris strains (BisA53, BisB18, BisB5, CGA009, HaA2 and TIE-1) combined with online bioinformatics software for integrated analysis presents new opportunities to determine the genomic basis of metabolic versatility and ecological lifestyles of the bacteria species. The purpose of this investigation was to compare the functional annotations available for multiple R. palustris genomes to identify annotations that can be further investigated for strain-specific or uniquely shared phenotypic characteristics. A total of 2,355 protein family Pfam domain annotations were clustered based on presence or absence in the six genomes. The clustering process identified groups of functional annotations including those that could be verified as strain-specific or uniquely shared phenotypes. For example, genes encoding water/glycerol transport were present in the genome sequences of strains CGA009 and BisB5, but absent in strains BisA53, BisB18, HaA2 and TIE-1. Protein structural homology modeling predicted that the two orthologous 240 aa R. palustris aquaporins have water-specific transport function. Based on observations in other microbes, the presence of aquaporin in R. palustris strains may improve freeze tolerance in natural conditions of rapid freezing such as nitrogen fixation at low temperatures where access to liquid water is a limiting factor for nitrogenase activation. In the case of adaptive loss of aquaporin genes, strains may be better adapted to survive in conditions of high-sugar content such as fermentation of biomass for biohydrogen production. Finally, web-based resources were developed to allow for interactive, user-defined selection of the relationship between protein family annotations and the R

  15. The DOE-JGI Standard Operating Procedure for the Annotations of the Microbial Genomes

    SciTech Connect

    Mavromatis, Konstantinos; Ivanova, Natalia; Chen, I-Min A.; Szeto, Ernest; Markowitz, Victor; Kyrpides, Nikos C.

    2009-05-20

    The DOE-JGI Microbial Annotation Pipeline (DOE-JGI MAP) supports gene prediction and/or functional annotation of microbial genomes towards comparative analysis with the Integrated Microbial Genome (IMG) system. DOE-JGI MAP annotation is applied on nucleotide sequence datasets included in the IMG-ER (Expert Review) version of IMG via the IMG ER submission site. Users can submit the sequence datasets consisting of one or more contigs in a multi-fasta file. DOE-JGI MAP annotation includes prediction of protein coding and RNA genes, as well as repeats and assignment of product names to these genes.

  16. Genomic Data and Annotation from the SEED

    DOE Data Explorer

    Fonstein, Michael; Kogan, Yakov; Osterman, Andrei; Overbeek, Ross; Vonstein, Veronika The Fellowship for Interpretation of Genomes (FIG)

    The SEED Project is a cooperative effort to annotate ever-expanding genomic data so researchers can conduct effective comparative analyses of genomes. Launched in 2003 by the Fellowship for Interpretation of Genomes (FIG), the project is one of several initiatives in ongoing development of data curation systems. SEED is designed to be used by scientists from numerous centers and with varied research objectives. As such, several institutions have since joined FIG in a consortium, including the University of Chicago, DOE’s Argonne National Laboratory (ANL), the University of Illinois at Urbana-Champaign, and others. As one example, ANL has used SEED to develop the National Microbial Pathogen Data Resource. Other agencies and institutions have used the project to discover genome components and clarify gene functions such as metabolism. SEED also has enabled researchers to conduct comparative analyses of closely related genomes and has supported derivation of stoichiometric models to understand metabolic processes. The SEED Project has been extended to support metagenomic samples and concomitant analytical tools. Moreover, the number of genomes being introduced into SEED is growing very rapidly. Building a framework to support this growth while providing highly accurate annotations is centrally important to SEED. The project’s subsystem-based annotation strategy has become the technological foundation for addressing these challenges.(copied from Appendix 7 of Systems Biology Knowledgebase for a New Era in Biology, A Genomics:GTL Report from the May 2008 Workshop, DOE/SC-0113, Grequrick, S; Fredrickson, J.K.; Stevens, R., Pub March 1, 2009.)

  17. Comprehensive Annotation of the Parastagonospora nodorum Reference Genome Using Next-Generation Genomics, Transcriptomics and Proteogenomics

    PubMed Central

    Dodhia, Kejal; Stoll, Thomas; Hastie, Marcus; Furuki, Eiko; Ellwood, Simon R.; Williams, Angela H.; Tan, Yew-Foon; Testa, Alison C.; Gorman, Jeffrey J.; Oliver, Richard P.

    2016-01-01

    Parastagonospora nodorum, the causal agent of Septoria nodorum blotch (SNB), is an economically important pathogen of wheat (Triticum spp.), and a model for the study of necrotrophic pathology and genome evolution. The reference P. nodorum strain SN15 was the first Dothideomycete with a published genome sequence, and has been used as the basis for comparison within and between species. Here we present an updated reference genome assembly with corrections of SNP and indel errors in the underlying genome assembly from deep resequencing data as well as extensive manual annotation of gene models using transcriptomic and proteomic sources of evidence (https://github.com/robsyme/Parastagonospora_nodorum_SN15). The updated assembly and annotation includes 8,366 genes with modified protein sequence and 866 new genes. This study shows the benefits of using a wide variety of experimental methods allied to expert curation to generate a reliable set of gene models. PMID:26840125

  18. Comprehensive Annotation of the Parastagonospora nodorum Reference Genome Using Next-Generation Genomics, Transcriptomics and Proteogenomics.

    PubMed

    Syme, Robert A; Tan, Kar-Chun; Hane, James K; Dodhia, Kejal; Stoll, Thomas; Hastie, Marcus; Furuki, Eiko; Ellwood, Simon R; Williams, Angela H; Tan, Yew-Foon; Testa, Alison C; Gorman, Jeffrey J; Oliver, Richard P

    2016-01-01

    Parastagonospora nodorum, the causal agent of Septoria nodorum blotch (SNB), is an economically important pathogen of wheat (Triticum spp.), and a model for the study of necrotrophic pathology and genome evolution. The reference P. nodorum strain SN15 was the first Dothideomycete with a published genome sequence, and has been used as the basis for comparison within and between species. Here we present an updated reference genome assembly with corrections of SNP and indel errors in the underlying genome assembly from deep resequencing data as well as extensive manual annotation of gene models using transcriptomic and proteomic sources of evidence (https://github.com/robsyme/Parastagonospora_nodorum_SN15). The updated assembly and annotation includes 8,366 genes with modified protein sequence and 866 new genes. This study shows the benefits of using a wide variety of experimental methods allied to expert curation to generate a reliable set of gene models.

  19. TriAnnot: A Versatile and High Performance Pipeline for the Automated Annotation of Plant Genomes

    PubMed Central

    Leroy, Philippe; Guilhot, Nicolas; Sakai, Hiroaki; Bernard, Aurélien; Choulet, Frédéric; Theil, Sébastien; Reboux, Sébastien; Amano, Naoki; Flutre, Timothée; Pelegrin, Céline; Ohyanagi, Hajime; Seidel, Michael; Giacomoni, Franck; Reichstadt, Mathieu; Alaux, Michael; Gicquello, Emmanuelle; Legeai, Fabrice; Cerutti, Lorenzo; Numa, Hisataka; Tanaka, Tsuyoshi; Mayer, Klaus; Itoh, Takeshi; Quesneville, Hadi; Feuillet, Catherine

    2012-01-01

    In support of the international effort to obtain a reference sequence of the bread wheat genome and to provide plant communities dealing with large and complex genomes with a versatile, easy-to-use online automated tool for annotation, we have developed the TriAnnot pipeline. Its modular architecture allows for the annotation and masking of transposable elements, the structural, and functional annotation of protein-coding genes with an evidence-based quality indexing, and the identification of conserved non-coding sequences and molecular markers. The TriAnnot pipeline is parallelized on a 712 CPU computing cluster that can run a 1-Gb sequence annotation in less than 5 days. It is accessible through a web interface for small scale analyses or through a server for large scale annotations. The performance of TriAnnot was evaluated in terms of sensitivity, specificity, and general fitness using curated reference sequence sets from rice and wheat. In less than 8 h, TriAnnot was able to predict more than 83% of the 3,748 CDS from rice chromosome 1 with a fitness of 67.4%. On a set of 12 reference Mb-sized contigs from wheat chromosome 3B, TriAnnot predicted and annotated 93.3% of the genes among which 54% were perfectly identified in accordance with the reference annotation. It also allowed the curation of 12 genes based on new biological evidences, increasing the percentage of perfect gene prediction to 63%. TriAnnot systematically showed a higher fitness than other annotation pipelines that are not improved for wheat. As it is easily adaptable to the annotation of other plant genomes, TriAnnot should become a useful resource for the annotation of large and complex genomes in the future. PMID:22645565

  20. Towards a Library of Standard Operating Procedures (SOPs) for (meta)genomic annotation

    SciTech Connect

    Kyrpides, Nikos; Angiuoli, Samuel V.; Cochrane, Guy; Field, Dawn; Garrity, George; Gussman, Aaron; Kodira, Chinnappa D.; Klimke, William; Kyrpides, Nikos; Madupu, Ramana; Markowitz, Victor; Tatusova, Tatiana; Thomson, Nick; White, Owen

    2008-04-01

    Genome annotations describe the features of genomes and accompany sequences in genome databases. The methodologies used to generate genome annotation are diverse and typically vary amongst groups. Descriptions of the annotation procedure are helpful in interpreting genome annotation data. Standard Operating Procedures (SOPs) for genome annotation describe the processes that generate genome annotations. Some groups are currently documenting procedures but standards are lacking for structure and content of annotation SOPs. In addition, there is no central repository to store and disseminate procedures and protocols for genome annotation. We highlight the importance of SOPs for genome annotation and endorse a central online repository of SOPs.

  1. The Vertebrate Genome Annotation browser 10 years on

    PubMed Central

    Harrow, Jennifer L.; Steward, Charles A.; Frankish, Adam; Gilbert, James G.; Gonzalez, Jose M.; Loveland, Jane E.; Mudge, Jonathan; Sheppard, Dan; Thomas, Mark; Trevanion, Stephen; Wilming, Laurens G.

    2014-01-01

    The Vertebrate Genome Annotation (VEGA) database (http://vega.sanger.ac.uk), initially designed as a community resource for browsing manual annotation of the human genome project, now contains five reference genomes (human, mouse, zebrafish, pig and rat). Its introduction pages have been redesigned to enable the user to easily navigate between whole genomes and smaller multi-species haplotypic regions of interest such as the major histocompatibility complex. The VEGA browser is unique in that annotation is updated via the Human And Vertebrate Analysis aNd Annotation (HAVANA) update track every 2 weeks, allowing single gene updates to be made publicly available to the research community quickly. The user can now access different haplotypic subregions more easily, such as those from the non-obese diabetic mouse, and display them in a more intuitive way using the comparative tools. We also highlight how the user can browse manually annotated updated patches from the Genome Reference Consortium (GRC). PMID:24316575

  2. Correction of the Caulobacter crescentus NA1000 genome annotation.

    PubMed

    Ely, Bert; Scott, LaTia Etheredge

    2014-01-01

    Bacterial genome annotations are accumulating rapidly in the GenBank database and the use of automated annotation technologies to create these annotations has become the norm. However, these automated methods commonly result in a small, but significant percentage of genome annotation errors. To improve accuracy and reliability, we analyzed the Caulobacter crescentus NA1000 genome utilizing computer programs Artemis and MICheck to manually examine the third codon position GC content, alignment to a third codon position GC frame plot peak, and matches in the GenBank database. We identified 11 new genes, modified the start site of 113 genes, and changed the reading frame of 38 genes that had been incorrectly annotated. Furthermore, our manual method of identifying protein-coding genes allowed us to remove 112 non-coding regions that had been designated as coding regions. The improved NA1000 genome annotation resulted in a reduction in the use of rare codons since noncoding regions with atypical codon usage were removed from the annotation and 49 new coding regions were added to the annotation. Thus, a more accurate codon usage table was generated as well. These results demonstrate that a comparison of the location of peaks third codon position GC content to the location of protein coding regions could be used to verify the annotation of any genome that has a GC content that is greater than 60%.

  3. Towards Automated Annotation of Benthic Survey Images: Variability of Human Experts and Operational Modes of Automation

    PubMed Central

    Beijbom, Oscar; Edmunds, Peter J.; Roelfsema, Chris; Smith, Jennifer; Kline, David I.; Neal, Benjamin P.; Dunlap, Matthew J.; Moriarty, Vincent; Fan, Tung-Yung; Tan, Chih-Jui; Chan, Stephen; Treibitz, Tali; Gamst, Anthony; Mitchell, B. Greg; Kriegman, David

    2015-01-01

    Global climate change and other anthropogenic stressors have heightened the need to rapidly characterize ecological changes in marine benthic communities across large scales. Digital photography enables rapid collection of survey images to meet this need, but the subsequent image annotation is typically a time consuming, manual task. We investigated the feasibility of using automated point-annotation to expedite cover estimation of the 17 dominant benthic categories from survey-images captured at four Pacific coral reefs. Inter- and intra- annotator variability among six human experts was quantified and compared to semi- and fully- automated annotation methods, which are made available at coralnet.ucsd.edu. Our results indicate high expert agreement for identification of coral genera, but lower agreement for algal functional groups, in particular between turf algae and crustose coralline algae. This indicates the need for unequivocal definitions of algal groups, careful training of multiple annotators, and enhanced imaging technology. Semi-automated annotation, where 50% of the annotation decisions were performed automatically, yielded cover estimate errors comparable to those of the human experts. Furthermore, fully-automated annotation yielded rapid, unbiased cover estimates but with increased variance. These results show that automated annotation can increase spatial coverage and decrease time and financial outlay for image-based reef surveys. PMID:26154157

  4. Assessing the Drosophila melanogaster and Anopheles gambiae Genome Annotations Using Genome-Wide Sequence Comparisons

    PubMed Central

    Jaillon, Olivier; Dossat, Carole; Eckenberg, Ralph; Eiglmeier, Karin; Segurens, Béatrice; Aury, Jean-Marc; Roth, Charles W.; Scarpelli, Claude; Brey, Paul T.; Weissenbach, Jean; Wincker, Patrick

    2003-01-01

    We performed genome-wide sequence comparisons at the protein coding level between the genome sequences of Drosophila melanogaster and Anopheles gambiae. Such comparisons detect evolutionarily conserved regions (ecores) that can be used for a qualitative and quantitative evaluation of the available annotations of both genomes. They also provide novel candidate features for annotation. The percentage of ecores mapping outside annotations in the A. gambiae genome is about fourfold higher than in D. melanogaster. The A. gambiae genome assembly also contains a high proportion of duplicated ecores, possibly resulting from artefactual sequence duplications in the genome assembly. The occurrence of 4063 ecores in the D. melanogaster genome outside annotations suggests that some genes are not yet or only partially annotated. The present work illustrates the power of comparative genomics approaches towards an exhaustive and accurate establishment of gene models and gene catalogues in insect genomes. PMID:12840038

  5. Comparative Omics-Driven Genome Annotation Refinement: Application across Yersiniae

    SciTech Connect

    Rutledge, Alexandra C.; Jones, Marcus B.; Chauhan, Sadhana; Purvine, Samuel O.; Sanford, James; Monroe, Matthew E.; Brewer, Heather M.; Payne, Samuel H.; Ansong, Charles; Frank, Bryan C.; Smith, Richard D.; Peterson, Scott; Motin, Vladimir L.; Adkins, Joshua N.

    2012-03-27

    Genome sequencing continues to be a rapidly evolving technology, yet most downstream aspects of genome annotation pipelines remain relatively stable or are even being abandoned. To date, the perceived value of manual curation for genome annotations is not offset by the real cost and time associated with the process. In order to balance the large number of sequences generated, the annotation process is now performed almost exclusively in an automated fashion for most genome sequencing projects. One possible way to reduce errors inherent to automated computational annotations is to apply data from 'omics' measurements (i.e. transcriptional and proteomic) to the un-annotated genome with a proteogenomic-based approach. This approach does require additional experimental and bioinformatics methods to include omics technologies; however, the approach is readily automatable and can benefit from rapid developments occurring in those research domains as well. The annotation process can be improved by experimental validation of transcription and translation and aid in the discovery of annotation errors. Here the concept of annotation refinement has been extended to include a comparative assessment of genomes across closely related species, as is becoming common in sequencing efforts. Transcriptomic and proteomic data derived from three highly similar pathogenic Yersiniae (Y. pestis CO92, Y. pestis pestoides F, and Y. pseudotuberculosis PB1/+) was used to demonstrate a comprehensive comparative omic-based annotation methodology. Peptide and oligo measurements experimentally validated the expression of nearly 40% of each strain's predicted proteome and revealed the identification of 28 novel and 68 previously incorrect protein-coding sequences (e.g., observed frameshifts, extended start sites, and translated pseudogenes) within the three current Yersinia genome annotations. Gene loss is presumed to play a major role in Y. pestis acquiring its niche as a virulent pathogen, thus

  6. The 2008 update of the Aspergillus nidulans genome annotation: a community effort.

    PubMed

    Wortman, Jennifer Russo; Gilsenan, Jane Mabey; Joardar, Vinita; Deegan, Jennifer; Clutterbuck, John; Andersen, Mikael R; Archer, David; Bencina, Mojca; Braus, Gerhard; Coutinho, Pedro; von Döhren, Hans; Doonan, John; Driessen, Arnold J M; Durek, Pawel; Espeso, Eduardo; Fekete, Erzsébet; Flipphi, Michel; Estrada, Carlos Garcia; Geysens, Steven; Goldman, Gustavo; de Groot, Piet W J; Hansen, Kim; Harris, Steven D; Heinekamp, Thorsten; Helmstaedt, Kerstin; Henrissat, Bernard; Hofmann, Gerald; Homan, Tim; Horio, Tetsuya; Horiuchi, Hiroyuki; James, Steve; Jones, Meriel; Karaffa, Levente; Karányi, Zsolt; Kato, Masashi; Keller, Nancy; Kelly, Diane E; Kiel, Jan A K W; Kim, Jung-Mi; van der Klei, Ida J; Klis, Frans M; Kovalchuk, Andriy; Krasevec, Nada; Kubicek, Christian P; Liu, Bo; Maccabe, Andrew; Meyer, Vera; Mirabito, Pete; Miskei, Márton; Mos, Magdalena; Mullins, Jonathan; Nelson, David R; Nielsen, Jens; Oakley, Berl R; Osmani, Stephen A; Pakula, Tiina; Paszewski, Andrzej; Paulsen, Ian; Pilsyk, Sebastian; Pócsi, István; Punt, Peter J; Ram, Arthur F J; Ren, Qinghu; Robellet, Xavier; Robson, Geoff; Seiboth, Bernhard; van Solingen, Piet; Specht, Thomas; Sun, Jibin; Taheri-Talesh, Naimeh; Takeshita, Norio; Ussery, Dave; vanKuyk, Patricia A; Visser, Hans; van de Vondervoort, Peter J I; de Vries, Ronald P; Walton, Jonathan; Xiang, Xin; Xiong, Yi; Zeng, An Ping; Brandt, Bernd W; Cornell, Michael J; van den Hondel, Cees A M J J; Visser, Jacob; Oliver, Stephen G; Turner, Geoffrey

    2009-03-01

    The identification and annotation of protein-coding genes is one of the primary goals of whole-genome sequencing projects, and the accuracy of predicting the primary protein products of gene expression is vital to the interpretation of the available data and the design of downstream functional applications. Nevertheless, the comprehensive annotation of eukaryotic genomes remains a considerable challenge. Many genomes submitted to public databases, including those of major model organisms, contain significant numbers of wrong and incomplete gene predictions. We present a community-based reannotation of the Aspergillus nidulans genome with the primary goal of increasing the number and quality of protein functional assignments through the careful review of experts in the field of fungal biology.

  7. The 2008 update of the Aspergillus nidulans genome annotation: a community effort

    PubMed Central

    Wortman, Jennifer Russo; Gilsenan, Jane Mabey; Joardar, Vinita; Deegan, Jennifer; Clutterbuck, John; Andersen, Mikael R.; Archer, David; Bencina, Mojca; Braus, Gerhard; Coutinho, Pedro; von Döhren, Hans; Doonan, John; Driessen, Arnold J.M.; Durek, Pawel; Espeso, Eduardo; Fekete, Erzsébet; Flipphi, Michel; Estrada, Carlos Garcia; Geysens, Steven; Goldman, Gustavo; de Groot, Piet W.J.; Hansen, Kim; Harris, Steven D.; Heinekamp, Thorsten; Helmstaedt, Kerstin; Henrissat, Bernard; Hofmann, Gerald; Homan, Tim; Horio, Tetsuya; Horiuchi, Hiroyuki; James, Steve; Jones, Meriel; Karaffa, Levente; Karányi, Zsolt; Kato, Masashi; Keller, Nancy; Kelly, Diane E.; Kiel, Jan A.K.W.; Kim, Jung-Mi; van der Klei, Ida J.; Klis, Frans M.; Kovalchuk, Andriy; Kraševec, Nada; Kubicek, Christian P.; Liu, Bo; MacCabe, Andrew; Meyer, Vera; Mirabito, Pete; Miskei, Márton; Mos, Magdalena; Mullins, Jonathan; Nelson, David R.; Nielsen, Jens; Oakley, Berl R.; Osmani, Stephen A.; Pakula, Tiina; Paszewski, Andrzej; Paulsen, Ian; Pilsyk, Sebastian; Pócsi, István; Punt, Peter J.; Ram, Arthur F.J.; Ren, Qinghu; Robellet, Xavier; Robson, Geoff; Seiboth, Bernhard; Solingen, Piet van; Specht, Thomas; Sun, Jibin; Taheri-Talesh, Naimeh; Takeshita, Norio; Ussery, Dave; vanKuyk, Patricia A.; Visser, Hans; van de Vondervoort, Peter J.I.; de Vries, Ronald P.; Walton, Jonathan; Xiang, Xin; Xiong, Yi; Zeng, An Ping; Brandt, Bernd W.; Cornell, Michael J.; van den Hondel, Cees A.M.J.J.; Visser, Jacob; Oliver, Stephen G.; Turner, Geoffrey

    2010-01-01

    The identification and annotation of protein-coding genes is one of the primary goals of whole-genome sequencing projects, and the accuracy of predicting the primary protein products of gene expression is vital to the interpretation of the available data and the design of downstream functional applications. Nevertheless, the comprehensive annotation of eukaryotic genomes remains a considerable challenge. Many genomes submitted to public databases, including those of major model organisms, contain significant numbers of wrong and incomplete gene predictions. We present a community-based reannotation of the Aspergillus nidulans genome with the primary goal of increasing the number and quality of protein functional assignments through the careful review of experts in the field of fungal biology. PMID:19146970

  8. RASTtk: A modular and extensible implementation of the RAST algorithm for building custom annotation pipelines and annotating batches of genomes

    PubMed Central

    Brettin, Thomas; Davis, James J.; Disz, Terry; Edwards, Robert A.; Gerdes, Svetlana; Olsen, Gary J.; Olson, Robert; Overbeek, Ross; Parrello, Bruce; Pusch, Gordon D.; Shukla, Maulik; Thomason, James A.; Stevens, Rick; Vonstein, Veronika; Wattam, Alice R.; Xia, Fangfang

    2015-01-01

    The RAST (Rapid Annotation using Subsystem Technology) annotation engine was built in 2008 to annotate bacterial and archaeal genomes. It works by offering a standard software pipeline for identifying genomic features (i.e., protein-encoding genes and RNA) and annotating their functions. Recently, in order to make RAST a more useful research tool and to keep pace with advancements in bioinformatics, it has become desirable to build a version of RAST that is both customizable and extensible. In this paper, we describe the RAST tool kit (RASTtk), a modular version of RAST that enables researchers to build custom annotation pipelines. RASTtk offers a choice of software for identifying and annotating genomic features as well as the ability to add custom features to an annotation job. RASTtk also accommodates the batch submission of genomes and the ability to customize annotation protocols for batch submissions. This is the first major software restructuring of RAST since its inception. PMID:25666585

  9. RASTtk: A modular and extensible implementation of the RAST algorithm for building custom annotation pipelines and annotating batches of genomes

    SciTech Connect

    Brettin, Thomas; Davis, James J.; Disz, Terry; Edwards, Robert A.; Gerdes, Svetlana; Olsen, Gary J.; Olson, Robert; Overbeek, Ross; Parrello, Bruce; Pusch, Gordon D.; Shukla, Maulik; Thomason, III, James A.; Stevens, Rick; Vonstein, Veronika; Wattam, Alice R.; Xia, Fangfang

    2015-02-10

    The RAST (Rapid Annotation using Subsystem Technology) annotation engine was built in 2008 to annotate bacterial and archaeal genomes. It works by offering a standard software pipeline for identifying genomic features (i.e., protein-encoding genes and RNA) and annotating their functions. Recently, in order to make RAST a more useful research tool and to keep pace with advancements in bioinformatics, it has become desirable to build a version of RAST that is both customizable and extensible. In this paper, we describe the RAST tool kit (RASTtk), a modular version of RAST that enables researchers to build custom annotation pipelines. RASTtk offers a choice of software for identifying and annotating genomic features as well as the ability to add custom features to an annotation job. RASTtk also accommodates the batch submission of genomes and the ability to customize annotation protocols for batch submissions. This is the first major software restructuring of RAST since its inception.

  10. xGDBvm: A Web GUI-Driven Workflow for Annotating Eukaryotic Genomes in the Cloud.

    PubMed

    Duvick, Jon; Standage, Daniel S; Merchant, Nirav; Brendel, Volker P

    2016-04-01

    Genome-wide annotation of gene structure requires the integration of numerous computational steps. Currently, annotation is arguably best accomplished through collaboration of bioinformatics and domain experts, with broad community involvement. However, such a collaborative approach is not scalable at today's pace of sequence generation. To address this problem, we developed the xGDBvm software, which uses an intuitive graphical user interface to access a number of common genome analysis and gene structure tools, preconfigured in a self-contained virtual machine image. Once their virtual machine instance is deployed through iPlant's Atmosphere cloud services, users access the xGDBvm workflow via a unified Web interface to manage inputs, set program parameters, configure links to high-performance computing (HPC) resources, view and manage output, apply analysis and editing tools, or access contextual help. The xGDBvm workflow will mask the genome, compute spliced alignments from transcript and/or protein inputs (locally or on a remote HPC cluster), predict gene structures and gene structure quality, and display output in a public or private genome browser complete with accessory tools. Problematic gene predictions are flagged and can be reannotated using the integrated yrGATE annotation tool. xGDBvm can also be configured to append or replace existing data or load precomputed data. Multiple genomes can be annotated and displayed, and outputs can be archived for sharing or backup. xGDBvm can be adapted to a variety of use cases including de novo genome annotation, reannotation, comparison of different annotations, and training or teaching.

  11. xGDBvm: A Web GUI-Driven Workflow for Annotating Eukaryotic Genomes in the Cloud.

    PubMed

    Duvick, Jon; Standage, Daniel S; Merchant, Nirav; Brendel, Volker P

    2016-04-01

    Genome-wide annotation of gene structure requires the integration of numerous computational steps. Currently, annotation is arguably best accomplished through collaboration of bioinformatics and domain experts, with broad community involvement. However, such a collaborative approach is not scalable at today's pace of sequence generation. To address this problem, we developed the xGDBvm software, which uses an intuitive graphical user interface to access a number of common genome analysis and gene structure tools, preconfigured in a self-contained virtual machine image. Once their virtual machine instance is deployed through iPlant's Atmosphere cloud services, users access the xGDBvm workflow via a unified Web interface to manage inputs, set program parameters, configure links to high-performance computing (HPC) resources, view and manage output, apply analysis and editing tools, or access contextual help. The xGDBvm workflow will mask the genome, compute spliced alignments from transcript and/or protein inputs (locally or on a remote HPC cluster), predict gene structures and gene structure quality, and display output in a public or private genome browser complete with accessory tools. Problematic gene predictions are flagged and can be reannotated using the integrated yrGATE annotation tool. xGDBvm can also be configured to append or replace existing data or load precomputed data. Multiple genomes can be annotated and displayed, and outputs can be archived for sharing or backup. xGDBvm can be adapted to a variety of use cases including de novo genome annotation, reannotation, comparison of different annotations, and training or teaching. PMID:27020957

  12. Scripps Genome ADVISER: Annotation and Distributed Variant Interpretation SERver

    PubMed Central

    Pham, Phillip H.; Shipman, William J.; Erikson, Galina A.; Schork, Nicholas J.; Torkamani, Ali

    2015-01-01

    Interpretation of human genomes is a major challenge. We present the Scripps Genome ADVISER (SG-ADVISER) suite, which aims to fill the gap between data generation and genome interpretation by performing holistic, in-depth, annotations and functional predictions on all variant types and effects. The SG-ADVISER suite includes a de-identification tool, a variant annotation web-server, and a user interface for inheritance and annotation-based filtration. SG-ADVISER allows users with no bioinformatics expertise to manipulate large volumes of variant data with ease – without the need to download large reference databases, install software, or use a command line interface. SG-ADVISER is freely available at genomics.scripps.edu/ADVISER. PMID:25706643

  13. Combined evidence annotation of transposable elements in genome sequences.

    PubMed

    Quesneville, Hadi; Bergman, Casey M; Andrieu, Olivier; Autard, Delphine; Nouaud, Danielle; Ashburner, Michael; Anxolabehere, Dominique

    2005-07-01

    Transposable elements (TEs) are mobile, repetitive sequences that make up significant fractions of metazoan genomes. Despite their near ubiquity and importance in genome and chromosome biology, most efforts to annotate TEs in genome sequences rely on the results of a single computational program, RepeatMasker. In contrast, recent advances in gene annotation indicate that high-quality gene models can be produced from combining multiple independent sources of computational evidence. To elevate the quality of TE annotations to a level comparable to that of gene models, we have developed a combined evidence-model TE annotation pipeline, analogous to systems used for gene annotation, by integrating results from multiple homology-based and de novo TE identification methods. As proof of principle, we have annotated "TE models" in Drosophila melanogaster Release 4 genomic sequences using the combined computational evidence derived from RepeatMasker, BLASTER, TBLASTX, all-by-all BLASTN, RECON, TE-HMM and the previous Release 3.1 annotation. Our system is designed for use with the Apollo genome annotation tool, allowing automatic results to be curated manually to produce reliable annotations. The euchromatic TE fraction of D. melanogaster is now estimated at 5.3% (cf. 3.86% in Release 3.1), and we found a substantially higher number of TEs (n = 6,013) than previously identified (n = 1,572). Most of the new TEs derive from small fragments of a few hundred nucleotides long and highly abundant families not previously annotated (e.g., INE-1). We also estimated that 518 TE copies (8.6%) are inserted into at least one other TE, forming a nest of elements. The pipeline allows rapid and thorough annotation of even the most complex TE models, including highly deleted and/or nested elements such as those often found in heterochromatic sequences. Our pipeline can be easily adapted to other genome sequences, such as those of the D. melanogaster heterochromatin or other species in the

  14. OMIGA: Optimized Maker-Based Insect Genome Annotation.

    PubMed

    Liu, Jinding; Xiao, Huamei; Huang, Shuiqing; Li, Fei

    2014-08-01

    Insects are one of the largest classes of animals on Earth and constitute more than half of all living species. The i5k initiative has begun sequencing of more than 5,000 insect genomes, which should greatly help in exploring insect resource and pest control. Insect genome annotation remains challenging because many insects have high levels of heterozygosity. To improve the quality of insect genome annotation, we developed a pipeline, named Optimized Maker-Based Insect Genome Annotation (OMIGA), to predict protein-coding genes from insect genomes. We first mapped RNA-Seq reads to genomic scaffolds to determine transcribed regions using Bowtie, and the putative transcripts were assembled using Cufflink. We then selected highly reliable transcripts with intact coding sequences to train de novo gene prediction software, including Augustus. The re-trained software was used to predict genes from insect genomes. Exonerate was used to refine gene structure and to determine near exact exon/intron boundary in the genome. Finally, we used the software Maker to integrate data from RNA-Seq, de novo gene prediction, and protein alignment to produce an official gene set. The OMIGA pipeline was used to annotate the draft genome of an important insect pest, Chilo suppressalis, yielding 12,548 genes. Different strategies were compared, which demonstrated that OMIGA had the best performance. In summary, we present a comprehensive pipeline for identifying genes in insect genomes that can be widely used to improve the annotation quality in insects. OMIGA is provided at http://ento.njau.edu.cn/omiga.html . PMID:24609470

  15. Annotated features of domestic cat – Felis catus genome

    PubMed Central

    2014-01-01

    Background Domestic cats enjoy an extensive veterinary medical surveillance which has described nearly 250 genetic diseases analogous to human disorders. Feline infectious agents offer powerful natural models of deadly human diseases, which include feline immunodeficiency virus, feline sarcoma virus and feline leukemia virus. A rich veterinary literature of feline disease pathogenesis and the demonstration of a highly conserved ancestral mammal genome organization make the cat genome annotation a highly informative resource that facilitates multifaceted research endeavors. Findings Here we report a preliminary annotation of the whole genome sequence of Cinnamon, a domestic cat living in Columbia (MO, USA), bisulfite sequencing of Boris, a male cat from St. Petersburg (Russia), and light 30× sequencing of Sylvester, a European wildcat progenitor of cat domestication. The annotation includes 21,865 protein-coding genes identified by a comparative approach, 217 loci of endogenous retrovirus-like elements, repetitive elements which comprise about 55.7% of the whole genome, 99,494 new SNVs, 8,355 new indels, 743,326 evolutionary constrained elements, and 3,182 microRNA homologues. The methylation sites study shows that 10.5% of cat genome cytosines are methylated. An assisted assembly of a European wildcat, Felis silvestris silvestris, was performed; variants between F. silvestris and F. catus genomes were derived and compared to F. catus. Conclusions The presented genome annotation extends beyond earlier ones by closing gaps of sequence that were unavoidable with previous low-coverage shotgun genome sequencing. The assembly and its annotation offer an important resource for connecting the rich veterinary and natural history of cats to genome discovery. PMID:25143822

  16. Mouse genome annotation by the RefSeq project.

    PubMed

    McGarvey, Kelly M; Goldfarb, Tamara; Cox, Eric; Farrell, Catherine M; Gupta, Tripti; Joardar, Vinita S; Kodali, Vamsi K; Murphy, Michael R; O'Leary, Nuala A; Pujar, Shashikant; Rajput, Bhanu; Rangwala, Sanjida H; Riddick, Lillian D; Webb, David; Wright, Mathew W; Murphy, Terence D; Pruitt, Kim D

    2015-10-01

    Complete and accurate annotation of the mouse genome is critical to the advancement of research conducted on this important model organism. The National Center for Biotechnology Information (NCBI) develops and maintains many useful resources to assist the mouse research community. In particular, the reference sequence (RefSeq) database provides high-quality annotation of multiple mouse genome assemblies using a combinatorial approach that leverages computation, manual curation, and collaboration. Implementation of this conservative and rigorous approach, which focuses on representation of only full-length and non-redundant data, produces high-quality annotation products. RefSeq records explicitly link sequences to current knowledge in a timely manner, updating public records regularly and rapidly in response to nomenclature updates, addition of new relevant publications, collaborator discussion, and user feedback. Whole genome re-annotation is also conducted at least every 12-18 months, and often more frequently in response to assembly updates or availability of informative data. This article highlights key features and advantages of RefSeq genome annotation products and presents an overview of NCBI processes to generate these data. Further discussion of NCBI's resources highlights useful features and the best methods for accessing our data.

  17. GenomeTools: a comprehensive software library for efficient processing of structured genome annotations.

    PubMed

    Gremme, Gordon; Steinbiss, Sascha; Kurtz, Stefan

    2013-01-01

    Genome annotations are often published as plain text files describing genomic features and their subcomponents by an implicit annotation graph. In this paper, we present the GenomeTools, a convenient and efficient software library and associated software tools for developing bioinformatics software intended to create, process or convert annotation graphs. The GenomeTools strictly follow the annotation graph approach, offering a unified graph-based representation. This gives the developer intuitive and immediate access to genomic features and tools for their manipulation. To process large annotation sets with low memory overhead, we have designed and implemented an efficient pull-based approach for sequential processing of annotations. This allows to handle even the largest annotation sets, such as a complete catalogue of human variations. Our object-oriented C-based software library enables a developer to conveniently implement their own functionality on annotation graphs and to integrate it into larger workflows, simultaneously accessing compressed sequence data if required. The careful C implementation of the GenomeTools does not only ensure a light-weight memory footprint while allowing full sequential as well as random access to the annotation graph, but also facilitates the creation of bindings to a variety of script programming languages (like Python and Ruby) sharing the same interface. PMID:24091398

  18. Genome annotation of a Saccharomyces sp. lager brewer's yeast.

    PubMed

    De León-Medina, Patricia Marcela; Elizondo-González, Ramiro; Damas-Buenrostro, Luis Cástulo; Geertman, Jan-Maarten; Van den Broek, Marcel; Galán-Wong, Luis Jesús; Ortiz-López, Rocío; Pereyra-Alférez, Benito

    2016-09-01

    The genome of lager brewer's yeast is a hybrid, with Saccharomyces eubayanus and Saccharomyces cerevisiae as sub-genomes. Due to their specific use in the beer industry, relatively little information is available. The genome of brewing yeast was sequenced and annotated in this study. We obtained a genome size of 22.7 Mbp that consisted of 133 scaffolds, with 65 scaffolds larger than 10 kbp. With respect to the annotation, 9939 genes were obtained, and when they were submitted to a local alignment, we found that 53.93% of these genes corresponded to S. cerevisiae, while another 42.86% originated from S. eubayanus. Our results confirm that our strain is a hybrid of at least two different genomes.

  19. Genome annotation of a Saccharomyces sp. lager brewer's yeast.

    PubMed

    De León-Medina, Patricia Marcela; Elizondo-González, Ramiro; Damas-Buenrostro, Luis Cástulo; Geertman, Jan-Maarten; Van den Broek, Marcel; Galán-Wong, Luis Jesús; Ortiz-López, Rocío; Pereyra-Alférez, Benito

    2016-09-01

    The genome of lager brewer's yeast is a hybrid, with Saccharomyces eubayanus and Saccharomyces cerevisiae as sub-genomes. Due to their specific use in the beer industry, relatively little information is available. The genome of brewing yeast was sequenced and annotated in this study. We obtained a genome size of 22.7 Mbp that consisted of 133 scaffolds, with 65 scaffolds larger than 10 kbp. With respect to the annotation, 9939 genes were obtained, and when they were submitted to a local alignment, we found that 53.93% of these genes corresponded to S. cerevisiae, while another 42.86% originated from S. eubayanus. Our results confirm that our strain is a hybrid of at least two different genomes. PMID:27330999

  20. Expert System for Computer-assisted Annotation of MS/MS Spectra*

    PubMed Central

    Neuhauser, Nadin; Michalski, Annette; Cox, Jürgen; Mann, Matthias

    2012-01-01

    An important step in mass spectrometry (MS)-based proteomics is the identification of peptides by their fragment spectra. Regardless of the identification score achieved, almost all tandem-MS (MS/MS) spectra contain remaining peaks that are not assigned by the search engine. These peaks may be explainable by human experts but the scale of modern proteomics experiments makes this impractical. In computer science, Expert Systems are a mature technology to implement a list of rules generated by interviews with practitioners. We here develop such an Expert System, making use of literature knowledge as well as a large body of high mass accuracy and pure fragmentation spectra. Interestingly, we find that even with high mass accuracy data, rule sets can quickly become too complex, leading to over-annotation. Therefore we establish a rigorous false discovery rate, calculated by random insertion of peaks from a large collection of other MS/MS spectra, and use it to develop an optimized knowledge base. This rule set correctly annotates almost all peaks of medium or high abundance. For high resolution HCD data, median intensity coverage of fragment peaks in MS/MS spectra increases from 58% by search engine annotation alone to 86%. The resulting annotation performance surpasses a human expert, especially on complex spectra such as those of larger phosphorylated peptides. Our system is also applicable to high resolution collision-induced dissociation data. It is available both as a part of MaxQuant and via a webserver that only requires an MS/MS spectrum and the corresponding peptides sequence, and which outputs publication quality, annotated MS/MS spectra (www.biochem.mpg.de/mann/tools/). It provides expert knowledge to beginners in the field of MS-based proteomics and helps advanced users to focus on unusual and possibly novel types of fragment ions. PMID:22888147

  1. Expert system for computer-assisted annotation of MS/MS spectra.

    PubMed

    Neuhauser, Nadin; Michalski, Annette; Cox, Jürgen; Mann, Matthias

    2012-11-01

    An important step in mass spectrometry (MS)-based proteomics is the identification of peptides by their fragment spectra. Regardless of the identification score achieved, almost all tandem-MS (MS/MS) spectra contain remaining peaks that are not assigned by the search engine. These peaks may be explainable by human experts but the scale of modern proteomics experiments makes this impractical. In computer science, Expert Systems are a mature technology to implement a list of rules generated by interviews with practitioners. We here develop such an Expert System, making use of literature knowledge as well as a large body of high mass accuracy and pure fragmentation spectra. Interestingly, we find that even with high mass accuracy data, rule sets can quickly become too complex, leading to over-annotation. Therefore we establish a rigorous false discovery rate, calculated by random insertion of peaks from a large collection of other MS/MS spectra, and use it to develop an optimized knowledge base. This rule set correctly annotates almost all peaks of medium or high abundance. For high resolution HCD data, median intensity coverage of fragment peaks in MS/MS spectra increases from 58% by search engine annotation alone to 86%. The resulting annotation performance surpasses a human expert, especially on complex spectra such as those of larger phosphorylated peptides. Our system is also applicable to high resolution collision-induced dissociation data. It is available both as a part of MaxQuant and via a webserver that only requires an MS/MS spectrum and the corresponding peptides sequence, and which outputs publication quality, annotated MS/MS spectra (www.biochem.mpg.de/mann/tools/). It provides expert knowledge to beginners in the field of MS-based proteomics and helps advanced users to focus on unusual and possibly novel types of fragment ions.

  2. GENCODE: the reference human genome annotation for The ENCODE Project.

    PubMed

    Harrow, Jennifer; Frankish, Adam; Gonzalez, Jose M; Tapanari, Electra; Diekhans, Mark; Kokocinski, Felix; Aken, Bronwen L; Barrell, Daniel; Zadissa, Amonida; Searle, Stephen; Barnes, If; Bignell, Alexandra; Boychenko, Veronika; Hunt, Toby; Kay, Mike; Mukherjee, Gaurab; Rajan, Jeena; Despacio-Reyes, Gloria; Saunders, Gary; Steward, Charles; Harte, Rachel; Lin, Michael; Howald, Cédric; Tanzer, Andrea; Derrien, Thomas; Chrast, Jacqueline; Walters, Nathalie; Balasubramanian, Suganthi; Pei, Baikang; Tress, Michael; Rodriguez, Jose Manuel; Ezkurdia, Iakes; van Baren, Jeltje; Brent, Michael; Haussler, David; Kellis, Manolis; Valencia, Alfonso; Reymond, Alexandre; Gerstein, Mark; Guigó, Roderic; Hubbard, Tim J

    2012-09-01

    The GENCODE Consortium aims to identify all gene features in the human genome using a combination of computational analysis, manual annotation, and experimental validation. Since the first public release of this annotation data set, few new protein-coding loci have been added, yet the number of alternative splicing transcripts annotated has steadily increased. The GENCODE 7 release contains 20,687 protein-coding and 9640 long noncoding RNA loci and has 33,977 coding transcripts not represented in UCSC genes and RefSeq. It also has the most comprehensive annotation of long noncoding RNA (lncRNA) loci publicly available with the predominant transcript form consisting of two exons. We have examined the completeness of the transcript annotation and found that 35% of transcriptional start sites are supported by CAGE clusters and 62% of protein-coding genes have annotated polyA sites. Over one-third of GENCODE protein-coding genes are supported by peptide hits derived from mass spectrometry spectra submitted to Peptide Atlas. New models derived from the Illumina Body Map 2.0 RNA-seq data identify 3689 new loci not currently in GENCODE, of which 3127 consist of two exon models indicating that they are possibly unannotated long noncoding loci. GENCODE 7 is publicly available from gencodegenes.org and via the Ensembl and UCSC Genome Browsers.

  3. Intra-species sequence comparisons for annotating genomes

    SciTech Connect

    Boffelli, Dario; Weer, Claire V.; Weng, Li; Lewis, Keith D.; Shoukry, Malak I.; Pachter, Lior; Keys, David N.; Rubin, Edward M.

    2004-07-15

    Analysis of sequence variation among members of a single species offers a potential approach to identify functional DNA elements responsible for biological features unique to that species. Due to its high rate of allelic polymorphism and ease of genetic manipulability, we chose the sea squirt, Ciona intestinalis, to explore intra-species sequence comparisons for genome annotation. A large number of C. intestinalis specimens were collected from four continents and a set of genomic intervals amplified, resequenced and analyzed to determine the mutation rates at each nucleotide in the sequence. We found that regions with low mutation rates efficiently demarcated functionally constrained sequences: these include a set of noncoding elements, which we showed in C intestinalis transgenic assays to act as tissue-specific enhancers, as well as the location of coding sequences. This illustrates that comparisons of multiple members of a species can be used for genome annotation, suggesting a path for the annotation of the sequenced genomes of organisms occupying uncharacterized phylogenetic branches of the animal kingdom and raises the possibility that the resequencing of a large number of Homo sapiens individuals might be used to annotate the human genome and identify sequences defining traits unique to our species. The sequence data from this study has been submitted to GenBank under accession nos. AY667278-AY667407.

  4. Functional annotations in bacterial genomes based on small RNA signatures.

    PubMed

    Sridhar, Jayavel; Rafi, Ziauddin Ahamed

    2008-04-04

    One of the key challenges in computational genomics is annotating coding genes and identification of regulatory RNAs in complete genomes. An attempt is made in this study which uses the regulatory RNA locations and their conserved flanking genes identified within the genomic backbone of template genome to search for similar RNA locations in query genomes. The search is based on recently reported coexistence of small RNAs and their conserved flanking genes in related genomes. Based on our study, 54 additional sRNA locations and functions of 96 uncharacterized genes are predicted in two draft genomes viz., Serratia marcesens Db1 and Yersinia enterocolitica 8081. Although most of the identified additional small RNA regions and their corresponding flanking genes are homologous in nature, the proposed anchoring technique could successfully identify four non-homologous small RNA regions in Y. enterocolitica genome also. The KEGG Orthology (KO) based automated functional predictions confirms the predicted functions of 65 flanking genes having defined KO numbers, out of the total 96 predictions made by this method. This coexistence based method shows more sensitivity than controlled vocabularies in locating orthologous gene pairs even in the absence of defined Orthology numbers. All functional predictions made by this study in Y. enterocolitica 8081 were confirmed by the recently published complete genome sequence and annotations. This study also reports the possible regions of gene rearrangements in these two genomes and further characterization of such RNA regions could shed more light on their possible role in genome evolution.

  5. Functional annotations in bacterial genomes based on small RNA signatures

    PubMed Central

    Sridhar, Jayavel; Rafi, Ziauddin Ahamed

    2008-01-01

    One of the key challenges in computational genomics is annotating coding genes and identification of regulatory RNAs in complete genomes. An attempt is made in this study which uses the regulatory RNA locations and their conserved flanking genes identified within the genomic backbone of template genome to search for similar RNA locations in query genomes. The search is based on recently reported coexistence of small RNAs and their conserved flanking genes in related genomes. Based on our study, 54 additional sRNA locations and functions of 96 uncharacterized genes are predicted in two draft genomes viz., Serratia marcesens Db1 and Yersinia enterocolitica 8081. Although most of the identified additional small RNA regions and their corresponding flanking genes are homologous in nature, the proposed anchoring technique could successfully identify four non-homologous small RNA regions in Y. enterocolitica genome also. The KEGG Orthology (KO) based automated functional predictions confirms the predicted functions of 65 flanking genes having defined KO numbers, out of the total 96 predictions made by this method. This coexistence based method shows more sensitivity than controlled vocabularies in locating orthologous gene pairs even in the absence of defined Orthology numbers. All functional predictions made by this study in Y. enterocolitica 8081 were confirmed by the recently published complete genome sequence and annotations. This study also reports the possible regions of gene rearrangements in these two genomes and further characterization of such RNA regions could shed more light on their possible role in genome evolution. PMID:18478081

  6. Web Apollo: a web-based genomic annotation editing platform

    PubMed Central

    2013-01-01

    Web Apollo is the first instantaneous, collaborative genomic annotation editor available on the web. One of the natural consequences following from current advances in sequencing technology is that there are more and more researchers sequencing new genomes. These researchers require tools to describe the functional features of their newly sequenced genomes. With Web Apollo researchers can use any of the common browsers (for example, Chrome or Firefox) to jointly analyze and precisely describe the features of a genome in real time, whether they are in the same room or working from opposite sides of the world. PMID:24000942

  7. Improving Genome Assemblies and Annotations for Nonhuman Primates

    PubMed Central

    Norgren, Robert B.

    2013-01-01

    The study of nonhuman primates (NHP) is key to understanding human evolution, in addition to being an important model for biomedical research. NHPs are especially important for translational medicine. There are now exciting opportunities to greatly increase the utility of these models by incorporating Next Generation (NextGen) sequencing into study design. Unfortunately, the draft status of nonhuman genomes greatly constrains what can currently be accomplished with available technology. Although all genomes contain errors, draft assemblies and annotations contain so many mistakes that they make currently available nonhuman primate genomes misleading to investigators conducting evolutionary studies; and these genomes are of insufficient quality to serve as references for NextGen studies. Fortunately, NextGen sequencing can be used in the production of greatly improved genomes. Existing Sanger sequences can be supplemented with NextGen whole genome, and exomic genomic sequences to create new, more complete and correct assemblies. Additional physical mapping, and an incorporation of information about gene structure, can be used to improve assignment of scaffolds to chromosomes. In addition, mRNA-sequence data can be used to economically acquire transcriptome information, which can be used for annotation. Some highly polymorphic and complex regions, for example MHC class I and immunoglobulin loci, will require extra effort to properly assemble and annotate. However, for the vast majority of genes, a modest investment in money, and a somewhat greater investment in time, can greatly improve assemblies and annotations sufficient to produce true, reference grade nonhuman primate genomes. Such resources can reasonably be expected to transform nonhuman primate research. PMID:24174438

  8. Genome Annotation in a Community College Cell Biology Lab

    ERIC Educational Resources Information Center

    Beagley, C. Timothy

    2013-01-01

    The Biology Department at Salt Lake Community College has used the IMG-ACT toolbox to introduce a genome mapping and annotation exercise into the laboratory portion of its Cell Biology course. This project provides students with an authentic inquiry-based learning experience while introducing them to computational biology and contemporary learning…

  9. Annotation of the Clostridium Acetobutylicum Genome

    SciTech Connect

    Daly, M. J.

    2004-06-09

    The genome sequence of the solvent producing bacterium Clostridium acetobutylicum ATCC824, has been determined by the shotgun approach. The genome consists of a 3.94 Mb chromosome and a 192 kb megaplasmid that contains the majority of genes responsible for solvent production. Comparison of C. acetobutylicum to Bacillus subtilis reveals significant local conservation of gene order, which has not been seen in comparisons of other genomes with similar, or, in some cases, closer, phylogenetic proximity. This conservation allows the prediction of many previously undetected operons in both bacteria.

  10. MITOS: improved de novo metazoan mitochondrial genome annotation.

    PubMed

    Bernt, Matthias; Donath, Alexander; Jühling, Frank; Externbrink, Fabian; Florentz, Catherine; Fritzsch, Guido; Pütz, Joern; Middendorf, Martin; Stadler, Peter F

    2013-11-01

    About 2000 completely sequenced mitochondrial genomes are available from the NCBI RefSeq data base together with manually curated annotations of their protein-coding genes, rRNAs, and tRNAs. This annotation information, which has accumulated over two decades, has been obtained with a diverse set of computational tools and annotation strategies. Despite all efforts of manual curation it is still plagued by misassignments of reading directions, erroneous gene names, and missing as well as false positive annotations in particular for the RNA genes. Taken together, this causes substantial problems for fully automatic pipelines that aim to use these data comprehensively for studies of animal phylogenetics and the molecular evolution of mitogenomes. The MITOS pipeline is designed to compute a consistent de novo annotation of the mitogenomic sequences. We show that the results of MITOS match RefSeq and MitoZoa in terms of annotation coverage and quality. At the same time we avoid biases, inconsistencies of nomenclature, and typos originating from manual curation strategies. The MITOS pipeline is accessible online at http://mitos.bioinf.uni-leipzig.de.

  11. Enhanced annotations and features for comparing thousands of Pseudomonas genomes in the Pseudomonas genome database.

    PubMed

    Winsor, Geoffrey L; Griffiths, Emma J; Lo, Raymond; Dhillon, Bhavjinder K; Shay, Julie A; Brinkman, Fiona S L

    2016-01-01

    The Pseudomonas Genome Database (http://www.pseudomonas.com) is well known for the application of community-based annotation approaches for producing a high-quality Pseudomonas aeruginosa PAO1 genome annotation, and facilitating whole-genome comparative analyses with other Pseudomonas strains. To aid analysis of potentially thousands of complete and draft genome assemblies, this database and analysis platform was upgraded to integrate curated genome annotations and isolate metadata with enhanced tools for larger scale comparative analysis and visualization. Manually curated gene annotations are supplemented with improved computational analyses that help identify putative drug targets and vaccine candidates or assist with evolutionary studies by identifying orthologs, pathogen-associated genes and genomic islands. The database schema has been updated to integrate isolate metadata that will facilitate more powerful analysis of genomes across datasets in the future. We continue to place an emphasis on providing high-quality updates to gene annotations through regular review of the scientific literature and using community-based approaches including a major new Pseudomonas community initiative for the assignment of high-quality gene ontology terms to genes. As we further expand from thousands of genomes, we plan to provide enhancements that will aid data visualization and analysis arising from whole-genome comparative studies including more pan-genome and population-based approaches. PMID:26578582

  12. Enhanced annotations and features for comparing thousands of Pseudomonas genomes in the Pseudomonas genome database

    PubMed Central

    Winsor, Geoffrey L.; Griffiths, Emma J.; Lo, Raymond; Dhillon, Bhavjinder K.; Shay, Julie A.; Brinkman, Fiona S. L.

    2016-01-01

    The Pseudomonas Genome Database (http://www.pseudomonas.com) is well known for the application of community-based annotation approaches for producing a high-quality Pseudomonas aeruginosa PAO1 genome annotation, and facilitating whole-genome comparative analyses with other Pseudomonas strains. To aid analysis of potentially thousands of complete and draft genome assemblies, this database and analysis platform was upgraded to integrate curated genome annotations and isolate metadata with enhanced tools for larger scale comparative analysis and visualization. Manually curated gene annotations are supplemented with improved computational analyses that help identify putative drug targets and vaccine candidates or assist with evolutionary studies by identifying orthologs, pathogen-associated genes and genomic islands. The database schema has been updated to integrate isolate metadata that will facilitate more powerful analysis of genomes across datasets in the future. We continue to place an emphasis on providing high-quality updates to gene annotations through regular review of the scientific literature and using community-based approaches including a major new Pseudomonas community initiative for the assignment of high-quality gene ontology terms to genes. As we further expand from thousands of genomes, we plan to provide enhancements that will aid data visualization and analysis arising from whole-genome comparative studies including more pan-genome and population-based approaches. PMID:26578582

  13. An Improved microRNA Annotation of the Canine Genome.

    PubMed

    Penso-Dolfin, Luca; Swofford, Ross; Johnson, Jeremy; Alföldi, Jessica; Lindblad-Toh, Kerstin; Swarbreck, David; Moxon, Simon; Di Palma, Federica

    2016-01-01

    The domestic dog, Canis familiaris, is a valuable model for studying human diseases. The publication of the latest Canine genome build and annotation, CanFam3.1 provides an opportunity to enhance our understanding of gene regulation across tissues in the dog model system. In this study, we used the latest dog genome assembly and small RNA sequencing data from 9 different dog tissues to predict novel miRNAs in the dog genome, as well as to annotate conserved miRNAs from the miRBase database that were missing from the current dog annotation. We used both miRCat and miRDeep2 algorithms to computationally predict miRNA loci. The resulting, putative hairpin sequences were analysed in order to discard false positives, based on predicted secondary structures and patterns of small RNA read alignments. Results were further divided into high and low confidence miRNAs, using the same criteria. We generated tissue specific expression profiles for the resulting set of 811 loci: 720 conserved miRNAs, (207 of which had not been previously annotated in the dog genome) and 91 novel miRNA loci. Comparative analyses revealed 8 putative homologues of some novel miRNA in ferret, and one in microbat. All miRNAs were also classified into the genic and intergenic categories, based on the Ensembl RefSeq gene annotation for CanFam3.1. This additionally allowed us to identify four previously undescribed MiRtrons among our total set of miRNAs. We additionally annotated piRNAs, using proTRAC on the same input data. We thus identified 263 putative clusters, most of which (211 clusters) were found to be expressed in testis. Our results represent an important improvement of the dog genome annotation, paving the way to further research on the evolution of gene regulation, as well as on the contribution of post-transcriptional regulation to pathological conditions. PMID:27119849

  14. An Improved microRNA Annotation of the Canine Genome

    PubMed Central

    Swofford, Ross; Johnson, Jeremy; Alföldi, Jessica; Lindblad-Toh, Kerstin; Swarbreck, David; Moxon, Simon; Di Palma, Federica

    2016-01-01

    The domestic dog, Canis familiaris, is a valuable model for studying human diseases. The publication of the latest Canine genome build and annotation, CanFam3.1 provides an opportunity to enhance our understanding of gene regulation across tissues in the dog model system. In this study, we used the latest dog genome assembly and small RNA sequencing data from 9 different dog tissues to predict novel miRNAs in the dog genome, as well as to annotate conserved miRNAs from the miRBase database that were missing from the current dog annotation. We used both miRCat and miRDeep2 algorithms to computationally predict miRNA loci. The resulting, putative hairpin sequences were analysed in order to discard false positives, based on predicted secondary structures and patterns of small RNA read alignments. Results were further divided into high and low confidence miRNAs, using the same criteria. We generated tissue specific expression profiles for the resulting set of 811 loci: 720 conserved miRNAs, (207 of which had not been previously annotated in the dog genome) and 91 novel miRNA loci. Comparative analyses revealed 8 putative homologues of some novel miRNA in ferret, and one in microbat. All miRNAs were also classified into the genic and intergenic categories, based on the Ensembl RefSeq gene annotation for CanFam3.1. This additionally allowed us to identify four previously undescribed MiRtrons among our total set of miRNAs. We additionally annotated piRNAs, using proTRAC on the same input data. We thus identified 263 putative clusters, most of which (211 clusters) were found to be expressed in testis. Our results represent an important improvement of the dog genome annotation, paving the way to further research on the evolution of gene regulation, as well as on the contribution of post-transcriptional regulation to pathological conditions. PMID:27119849

  15. GBshape: a genome browser database for DNA shape annotations.

    PubMed

    Chiu, Tsu-Pei; Yang, Lin; Zhou, Tianyin; Main, Bradley J; Parker, Stephen C J; Nuzhdin, Sergey V; Tullius, Thomas D; Rohs, Remo

    2015-01-01

    Many regulatory mechanisms require a high degree of specificity in protein-DNA binding. Nucleotide sequence does not provide an answer to the question of why a protein binds only to a small subset of the many putative binding sites in the genome that share the same core motif. Whereas higher-order effects, such as chromatin accessibility, cooperativity and cofactors, have been described, DNA shape recently gained attention as another feature that fine-tunes the DNA binding specificities of some transcription factor families. Our Genome Browser for DNA shape annotations (GBshape; freely available at http://rohslab.cmb.usc.edu/GBshape/) provides minor groove width, propeller twist, roll, helix twist and hydroxyl radical cleavage predictions for the entire genomes of 94 organisms. Additional genomes can easily be added using the GBshape framework. GBshape can be used to visualize DNA shape annotations qualitatively in a genome browser track format, and to download quantitative values of DNA shape features as a function of genomic position at nucleotide resolution. As biological applications, we illustrate the periodicity of DNA shape features that are present in nucleosome-occupied sequences from human, fly and worm, and we demonstrate structural similarities between transcription start sites in the genomes of four Drosophila species.

  16. MIPS: analysis and annotation of proteins from whole genomes.

    PubMed

    Mewes, H W; Amid, C; Arnold, R; Frishman, D; Güldener, U; Mannhaupt, G; Münsterkötter, M; Pagel, P; Strack, N; Stümpflen, V; Warfsmann, J; Ruepp, A

    2004-01-01

    The Munich Information Center for Protein Sequences (MIPS-GSF), Neuherberg, Germany, provides protein sequence-related information based on whole-genome analysis. The main focus of the work is directed toward the systematic organization of sequence-related attributes as gathered by a variety of algorithms, primary information from experimental data together with information compiled from the scientific literature. MIPS maintains automatically generated and manually annotated genome-specific databases, develops systematic classification schemes for the functional annotation of protein sequences and provides tools for the comprehensive analysis of protein sequences. This report updates the information on the yeast genome (CYGD), the Neurospora crassa genome (MNCDB), the database of complete cDNAs (German Human Genome Project, NGFN), the database of mammalian protein-protein interactions (MPPI), the database of FASTA homologies (SIMAP), and the interface for the fast retrieval of protein-associated information (QUIPOS). The Arabidopsis thaliana database, the rice database, the plant EST databases (MATDB, MOsDB, SPUTNIK), as well as the databases for the comprehensive set of genomes (PEDANT genomes) are described elsewhere in the 2003 and 2004 NAR database issues, respectively. All databases described, and the detailed descriptions of our projects can be accessed through the MIPS web server (http://mips.gsf.de). PMID:14681354

  17. nGASP - the nematode genome annotation assessment project

    SciTech Connect

    Coghlan, A; Fiedler, T J; McKay, S J; Flicek, P; Harris, T W; Blasiar, D; Allen, J; Stein, L D

    2008-12-19

    While the C. elegans genome is extensively annotated, relatively little information is available for other Caenorhabditis species. The nematode genome annotation assessment project (nGASP) was launched to objectively assess the accuracy of protein-coding gene prediction software in C. elegans, and to apply this knowledge to the annotation of the genomes of four additional Caenorhabditis species and other nematodes. Seventeen groups worldwide participated in nGASP, and submitted 47 prediction sets for 10 Mb of the C. elegans genome. Predictions were compared to reference gene sets consisting of confirmed or manually curated gene models from WormBase. The most accurate gene-finders were 'combiner' algorithms, which made use of transcript- and protein-alignments and multi-genome alignments, as well as gene predictions from other gene-finders. Gene-finders that used alignments of ESTs, mRNAs and proteins came in second place. There was a tie for third place between gene-finders that used multi-genome alignments and ab initio gene-finders. The median gene level sensitivity of combiners was 78% and their specificity was 42%, which is nearly the same accuracy as reported for combiners in the human genome. C. elegans genes with exons of unusual hexamer content, as well as those with many exons, short exons, long introns, a weak translation start signal, weak splice sites, or poorly conserved orthologs were the most challenging for gene-finders. While the C. elegans genome is extensively annotated, relatively little information is available for other Caenorhabditis species. The nematode genome annotation assessment project (nGASP) was launched to objectively assess the accuracy of protein-coding gene prediction software in C. elegans, and to apply this knowledge to the annotation of the genomes of four additional Caenorhabditis species and other nematodes. Seventeen groups worldwide participated in nGASP, and submitted 47 prediction sets for 10 Mb of the C. elegans genome

  18. Assembly, Annotation, and Analysis of Multiple Mycorrhizal Fungal Genomes

    SciTech Connect

    Initiative Consortium, Mycorrhizal Genomics; Kuo, Alan; Grigoriev, Igor; Kohler, Annegret; Martin, Francis

    2013-03-08

    Mycorrhizal fungi play critical roles in host plant health, soil community structure and chemistry, and carbon and nutrient cycling, all areas of intense interest to the US Dept. of Energy (DOE) Joint Genome Institute (JGI). To this end we are building on our earlier sequencing of the Laccaria bicolor genome by partnering with INRA-Nancy and the mycorrhizal research community in the MGI to sequence and analyze dozens of mycorrhizal genomes of all Basidiomycota and Ascomycota orders and multiple ecological types (ericoid, orchid, and ectomycorrhizal). JGI has developed and deployed high-throughput sequencing techniques, and Assembly, RNASeq, and Annotation Pipelines. In 2012 alone we sequenced, assembled, and annotated 12 draft or improved genomes of mycorrhizae, and predicted ~;;232831 genes and ~;;15011 multigene families, All of this data is publicly available on JGI MycoCosm (http://jgi.doe.gov/fungi/), which provides access to both the genome data and tools with which to analyze the data. Preliminary comparisons of the current total of 14 public mycorrhizal genomes suggest that 1) short secreted proteins potentially involved in symbiosis are more enriched in some orders than in others amongst the mycorrhizal Agaricomycetes, 2) there are wide ranges of numbers of genes involved in certain functional categories, such as signal transduction and post-translational modification, and 3) novel gene families are specific to some ecological types.

  19. GEMINI: Integrative Exploration of Genetic Variation and Genome Annotations

    PubMed Central

    Paila, Umadevi; Chapman, Brad A.; Kirchner, Rory; Quinlan, Aaron R.

    2013-01-01

    Modern DNA sequencing technologies enable geneticists to rapidly identify genetic variation among many human genomes. However, isolating the minority of variants underlying disease remains an important, yet formidable challenge for medical genetics. We have developed GEMINI (GEnome MINIng), a flexible software package for exploring all forms of human genetic variation. Unlike existing tools, GEMINI integrates genetic variation with a diverse and adaptable set of genome annotations (e.g., dbSNP, ENCODE, UCSC, ClinVar, KEGG) into a unified database to facilitate interpretation and data exploration. Whereas other methods provide an inflexible set of variant filters or prioritization methods, GEMINI allows researchers to compose complex queries based on sample genotypes, inheritance patterns, and both pre-installed and custom genome annotations. GEMINI also provides methods for ad hoc queries and data exploration, a simple programming interface for custom analyses that leverage the underlying database, and both command line and graphical tools for common analyses. We demonstrate GEMINI's utility for exploring variation in personal genomes and family based genetic studies, and illustrate its ability to scale to studies involving thousands of human samples. GEMINI is designed for reproducibility and flexibility and our goal is to provide researchers with a standard framework for medical genomics. PMID:23874191

  20. Scaling up genome annotation using MAKER and work queue.

    PubMed

    Thrasher, Andrew; Musgrave, Zachary; Kachmarck, Brian; Thain, Douglas; Emrich, Scott

    2014-01-01

    Next generation sequencing technologies have enabled sequencing many genomes. Because of the overall increasing demand and the inherent parallelism available in many required analyses, these bioinformatics applications should ideally run on clusters, clouds and/or grids. We present a modified annotation framework that achieves a speed-up of 45x using 50 workers using a Caenorhabditis japonica test case. We also evaluate these modifications within the Amazon EC2 cloud framework. The underlying genome annotation (MAKER) is parallelised as an MPI application. Our framework enables it to now run without MPI while utilising a wide variety of distributed computing resources. This parallel framework also allows easy explicit data transfer, which helps overcome a major limitation of bioinformatics tools that often rely on shared file systems. Combined, our proposed framework can be used, even during early stages of development, to easily run sequence analysis tools on clusters, grids and clouds.

  1. Improved white spruce (Picea glauca) genome assemblies and annotation of large gene families of conifer terpenoid and phenolic defense metabolism.

    PubMed

    Warren, René L; Keeling, Christopher I; Yuen, Macaire Man Saint; Raymond, Anthony; Taylor, Greg A; Vandervalk, Benjamin P; Mohamadi, Hamid; Paulino, Daniel; Chiu, Readman; Jackman, Shaun D; Robertson, Gordon; Yang, Chen; Boyle, Brian; Hoffmann, Margarete; Weigel, Detlef; Nelson, David R; Ritland, Carol; Isabel, Nathalie; Jaquish, Barry; Yanchuk, Alvin; Bousquet, Jean; Jones, Steven J M; MacKay, John; Birol, Inanc; Bohlmann, Joerg

    2015-07-01

    White spruce (Picea glauca), a gymnosperm tree, has been established as one of the models for conifer genomics. We describe the draft genome assemblies of two white spruce genotypes, PG29 and WS77111, innovative tools for the assembly of very large genomes, and the conifer genomics resources developed in this process. The two white spruce genotypes originate from distant geographic regions of western (PG29) and eastern (WS77111) North America, and represent elite trees in two Canadian tree-breeding programs. We present an update (V3 and V4) for a previously reported PG29 V2 draft genome assembly and introduce a second white spruce genome assembly for genotype WS77111. Assemblies of the PG29 and WS77111 genomes confirm the reconstructed white spruce genome size in the 20 Gbp range, and show broad synteny. Using the PG29 V3 assembly and additional white spruce genomics and transcriptomics resources, we performed MAKER-P annotation and meticulous expert annotation of very large gene families of conifer defense metabolism, the terpene synthases and cytochrome P450s. We also comprehensively annotated the white spruce mevalonate, methylerythritol phosphate and phenylpropanoid pathways. These analyses highlighted the large extent of gene and pseudogene duplications in a conifer genome, in particular for genes of secondary (i.e. specialized) metabolism, and the potential for gain and loss of function for defense and adaptation. PMID:26017574

  2. Improved white spruce (Picea glauca) genome assemblies and annotation of large gene families of conifer terpenoid and phenolic defense metabolism.

    PubMed

    Warren, René L; Keeling, Christopher I; Yuen, Macaire Man Saint; Raymond, Anthony; Taylor, Greg A; Vandervalk, Benjamin P; Mohamadi, Hamid; Paulino, Daniel; Chiu, Readman; Jackman, Shaun D; Robertson, Gordon; Yang, Chen; Boyle, Brian; Hoffmann, Margarete; Weigel, Detlef; Nelson, David R; Ritland, Carol; Isabel, Nathalie; Jaquish, Barry; Yanchuk, Alvin; Bousquet, Jean; Jones, Steven J M; MacKay, John; Birol, Inanc; Bohlmann, Joerg

    2015-07-01

    White spruce (Picea glauca), a gymnosperm tree, has been established as one of the models for conifer genomics. We describe the draft genome assemblies of two white spruce genotypes, PG29 and WS77111, innovative tools for the assembly of very large genomes, and the conifer genomics resources developed in this process. The two white spruce genotypes originate from distant geographic regions of western (PG29) and eastern (WS77111) North America, and represent elite trees in two Canadian tree-breeding programs. We present an update (V3 and V4) for a previously reported PG29 V2 draft genome assembly and introduce a second white spruce genome assembly for genotype WS77111. Assemblies of the PG29 and WS77111 genomes confirm the reconstructed white spruce genome size in the 20 Gbp range, and show broad synteny. Using the PG29 V3 assembly and additional white spruce genomics and transcriptomics resources, we performed MAKER-P annotation and meticulous expert annotation of very large gene families of conifer defense metabolism, the terpene synthases and cytochrome P450s. We also comprehensively annotated the white spruce mevalonate, methylerythritol phosphate and phenylpropanoid pathways. These analyses highlighted the large extent of gene and pseudogene duplications in a conifer genome, in particular for genes of secondary (i.e. specialized) metabolism, and the potential for gain and loss of function for defense and adaptation.

  3. Sequencing and annotated analysis of an Estonian human genome.

    PubMed

    Lilleoja, Rutt; Sarapik, Aili; Reimann, Ene; Reemann, Paula; Jaakma, Ülle; Vasar, Eero; Kõks, Sulev

    2012-02-01

    In present study we describe the sequencing and annotated analysis of the individual genome of Estonian. Using SOLID technology we generated 2,449,441,916 of 50-bp reads. The Bioscope version 1.3 was used for mapping and pairing of reads to the NCBI human genome reference (build 36, hg18). Bioscope enables also the annotation of the results of variant (tertiary) analysis. The average mapping of reads was 75.5% with total coverage of 107.72 Gb. resulting in mean fold coverage of 34.6. We found 3,482,975 SNPs out of which 352,492 were novel. 21,222 SNPs were in coding region: 10,649 were synonymous SNPs, 10,360 were nonsynonymous missense SNPs, 155 were nonsynonymous nonsense SNPs and 58 were nonsynonymous frameshifts. We identified 219 CNVs with total base pair coverage of 37,326,300 bp and 87,451 large insertion/deletion polymorphisms covering 10,152,256 bp of the genome. In addition, we found 285,864 small size insertion/deletion polymorphisms out of which 133,969 were novel. Finally, we identified 53 inversions, 19 overlapped genes and 2 overlapped exons. Interestingly, we found the region in chromosome 6 to be enriched with the coding SNPs and CNVs. This study confirms previous findings, that our genomes are more complex and variable as thought before. Therefore, sequencing of the personal genomes followed by annotation would improve the analysis of heritability of phenotypes and our understandings on the functions of genome.

  4. Organellar Genomes of White Spruce (Picea glauca): Assembly and Annotation.

    PubMed

    Jackman, Shaun D; Warren, René L; Gibb, Ewan A; Vandervalk, Benjamin P; Mohamadi, Hamid; Chu, Justin; Raymond, Anthony; Pleasance, Stephen; Coope, Robin; Wildung, Mark R; Ritland, Carol E; Bousquet, Jean; Jones, Steven J M; Bohlmann, Joerg; Birol, Inanç

    2015-12-08

    The genome sequences of the plastid and mitochondrion of white spruce (Picea glauca) were assembled from whole-genome shotgun sequencing data using ABySS. The sequencing data contained reads from both the nuclear and organellar genomes, and reads of the organellar genomes were abundant in the data as each cell harbors hundreds of mitochondria and plastids. Hence, assembly of the 123-kb plastid and 5.9-Mb mitochondrial genomes were accomplished by analyzing data sets primarily representing low coverage of the nuclear genome. The assembled organellar genomes were annotated for their coding genes, ribosomal RNA, and transfer RNA. Transcript abundances of the mitochondrial genes were quantified in three developmental tissues and five mature tissues using data from RNA-seq experiments. C-to-U RNA editing was observed in the majority of mitochondrial genes, and in four genes, editing events were noted to modify ACG codons to create cryptic AUG start codons. The informatics methodology presented in this study should prove useful to assemble organellar genomes of other plant species using whole-genome shotgun sequencing data.

  5. Sequencing and annotation of mitochondrial genomes from individual parasitic helminths.

    PubMed

    Jex, Aaron R; Littlewood, D Timothy; Gasser, Robin B

    2015-01-01

    Mitochondrial (mt) genomics has significant implications in a range of fundamental areas of parasitology, including evolution, systematics, and population genetics as well as explorations of mt biochemistry, physiology, and function. Mt genomes also provide a rich source of markers to aid molecular epidemiological and ecological studies of key parasites. However, there is still a paucity of information on mt genomes for many metazoan organisms, particularly parasitic helminths, which has often related to challenges linked to sequencing from tiny amounts of material. The advent of next-generation sequencing (NGS) technologies has paved the way for low cost, high-throughput mt genomic research, but there have been obstacles, particularly in relation to post-sequencing assembly and analyses of large datasets. In this chapter, we describe protocols for the efficient amplification and sequencing of mt genomes from small portions of individual helminths, and highlight the utility of NGS platforms to expedite mt genomics. In addition, we recommend approaches for manual or semi-automated bioinformatic annotation and analyses to overcome the bioinformatic "bottleneck" to research in this area. Taken together, these approaches have demonstrated applicability to a range of parasites and provide prospects for using complete mt genomic sequence datasets for large-scale molecular systematic and epidemiological studies. In addition, these methods have broader utility and might be readily adapted to a range of other medium-sized molecular regions (i.e., 10-100 kb), including large genomic operons, and other organellar (e.g., plastid) and viral genomes.

  6. Organellar Genomes of White Spruce (Picea glauca): Assembly and Annotation

    PubMed Central

    Jackman, Shaun D.; Warren, René L.; Gibb, Ewan A.; Vandervalk, Benjamin P.; Mohamadi, Hamid; Chu, Justin; Raymond, Anthony; Pleasance, Stephen; Coope, Robin; Wildung, Mark R.; Ritland, Carol E.; Bousquet, Jean; Jones, Steven J. M.; Bohlmann, Joerg; Birol, Inanç

    2016-01-01

    The genome sequences of the plastid and mitochondrion of white spruce (Picea glauca) were assembled from whole-genome shotgun sequencing data using ABySS. The sequencing data contained reads from both the nuclear and organellar genomes, and reads of the organellar genomes were abundant in the data as each cell harbors hundreds of mitochondria and plastids. Hence, assembly of the 123-kb plastid and 5.9-Mb mitochondrial genomes were accomplished by analyzing data sets primarily representing low coverage of the nuclear genome. The assembled organellar genomes were annotated for their coding genes, ribosomal RNA, and transfer RNA. Transcript abundances of the mitochondrial genes were quantified in three developmental tissues and five mature tissues using data from RNA-seq experiments. C-to-U RNA editing was observed in the majority of mitochondrial genes, and in four genes, editing events were noted to modify ACG codons to create cryptic AUG start codons. The informatics methodology presented in this study should prove useful to assemble organellar genomes of other plant species using whole-genome shotgun sequencing data. PMID:26645680

  7. RASTtk: A modular and extensible implementation of the RAST algorithm for building custom annotation pipelines and annotating batches of genomes

    DOE PAGES

    Brettin, Thomas; Davis, James J.; Disz, Terry; Edwards, Robert A.; Gerdes, Svetlana; Olsen, Gary J.; Olson, Robert; Overbeek, Ross; Parrello, Bruce; Pusch, Gordon D.; et al

    2015-02-10

    The RAST (Rapid Annotation using Subsystem Technology) annotation engine was built in 2008 to annotate bacterial and archaeal genomes. It works by offering a standard software pipeline for identifying genomic features (i.e., protein-encoding genes and RNA) and annotating their functions. Recently, in order to make RAST a more useful research tool and to keep pace with advancements in bioinformatics, it has become desirable to build a version of RAST that is both customizable and extensible. In this paper, we describe the RAST tool kit (RASTtk), a modular version of RAST that enables researchers to build custom annotation pipelines. RASTtk offersmore » a choice of software for identifying and annotating genomic features as well as the ability to add custom features to an annotation job. RASTtk also accommodates the batch submission of genomes and the ability to customize annotation protocols for batch submissions. This is the first major software restructuring of RAST since its inception.« less

  8. MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects

    PubMed Central

    2011-01-01

    Background Second-generation sequencing technologies are precipitating major shifts with regards to what kinds of genomes are being sequenced and how they are annotated. While the first generation of genome projects focused on well-studied model organisms, many of today's projects involve exotic organisms whose genomes are largely terra incognita. This complicates their annotation, because unlike first-generation projects, there are no pre-existing 'gold-standard' gene-models with which to train gene-finders. Improvements in genome assembly and the wide availability of mRNA-seq data are also creating opportunities to update and re-annotate previously published genome annotations. Today's genome projects are thus in need of new genome annotation tools that can meet the challenges and opportunities presented by second-generation sequencing technologies. Results We present MAKER2, a genome annotation and data management tool designed for second-generation genome projects. MAKER2 is a multi-threaded, parallelized application that can process second-generation datasets of virtually any size. We show that MAKER2 can produce accurate annotations for novel genomes where training-data are limited, of low quality or even non-existent. MAKER2 also provides an easy means to use mRNA-seq data to improve annotation quality; and it can use these data to update legacy annotations, significantly improving their quality. We also show that MAKER2 can evaluate the quality of genome annotations, and identify and prioritize problematic annotations for manual review. Conclusions MAKER2 is the first annotation engine specifically designed for second-generation genome projects. MAKER2 scales to datasets of any size, requires little in the way of training data, and can use mRNA-seq data to improve annotation quality. It can also update and manage legacy genome annotation datasets. PMID:22192575

  9. Design and implementation of a database for Brucella melitensis genome annotation.

    PubMed

    De Hertogh, Benoît; Lahlimi, Leïla; Lambert, Christophe; Letesson, Jean-Jacques; Depiereux, Eric

    2008-03-18

    The genome sequences of three Brucella biovars and of some species close to Brucella sp. have become available, leading to new relationship analysis. Moreover, the automatic genome annotation of the pathogenic bacteria Brucella melitensis has been manually corrected by a consortium of experts, leading to 899 modifications of start sites predictions among the 3198 open reading frames (ORFs) examined. This new annotation, coupled with the results of automatic annotation tools of the complete genome sequences of the B. melitensis genome (including BLASTs to 9 genomes close to Brucella), provides numerous data sets related to predicted functions, biochemical properties and phylogenic comparisons. To made these results available, alphaPAGe, a functional auto-updatable database of the corrected sequence genome of B. melitensis, has been built, using the entity-relationship (ER) approach and a multi-purpose database structure. A friendly graphical user interface has been designed, and users can carry out different kinds of information by three levels of queries: (1) the basic search use the classical keywords or sequence identifiers; (2) the original advanced search engine allows to combine (by using logical operators) numerous criteria: (a) keywords (textual comparison) related to the pCDS's function, family domains and cellular localization; (b) physico-chemical characteristics (numerical comparison) such as isoelectric point or molecular weight and structural criteria such as the nucleic length or the number of transmembrane helix (TMH); (c) similarity scores with Escherichia coli and 10 species phylogenetically close to B. melitensis; (3) complex queries can be performed by using a SQL field, which allows all queries respecting the database's structure. The database is publicly available through a Web server at the following url: http://www.fundp.ac.be/urbm/bioinfo/aPAGe.

  10. The power of EST sequence data: Relation to Acyrthosiphon pisum genome annotation and functional genomics initiatives

    Technology Transfer Automated Retrieval System (TEKTRAN)

    Genes important to aphid biology, survival and reproduction were successfully identified by use of a genomics approach. We created and described the Sequencing, compilation, and annotation of the approxiamtely 525Mb nuclear genome of the pea aphid, Acyrthosiphon pisum, which represents an important ...

  11. The Saccharomyces Genome Database: Exploring Genome Features and Their Annotations.

    PubMed

    Cherry, J Michael

    2015-12-01

    Genomic-scale assays result in data that provide information over the entire genome. Such base pair resolution data cannot be summarized easily except via a graphical viewer. A genome browser is a tool that displays genomic data and experimental results as horizontal tracks. Genome browsers allow searches for a chromosomal coordinate or a feature, such as a gene name, but they do not allow searches by function or upstream binding site. Entry into a genome browser requires that you identify the gene name or chromosomal coordinates for a region of interest. A track provides a representation for genomic results and is displayed as a row of data shown as line segments to indicate regions of the chromosome with a feature. Another type of track presents a graph or wiggle plot that indicates the processed signal intensity computed for a particular experiment or set of experiments. Wiggle plots are typical for genomic assays such as the various next-generation sequencing methods (e.g., chromatin immunoprecipitation [ChIP]-seq or RNA-seq), where it represents a peak of DNA binding, histone modification, or the mapping of an RNA sequence. Here we explore the browser that has been built into the Saccharomyces Genome Database (SGD).

  12. Gene3D: comprehensive structural and functional annotation of genomes.

    PubMed

    Yeats, Corin; Lees, Jonathan; Reid, Adam; Kellam, Paul; Martin, Nigel; Liu, Xinhui; Orengo, Christine

    2008-01-01

    Gene3D provides comprehensive structural and functional annotation of most available protein sequences, including the UniProt, RefSeq and Integr8 resources. The main structural annotation is generated through scanning these sequences against the CATH structural domain database profile-HMM library. CATH is a database of manually derived PDB-based structural domains, placed within a hierarchy reflecting topology, homology and conservation and is able to infer more ancient and divergent homology relationships than sequence-based approaches. This data is supplemented with Pfam-A, other non-domain structural predictions (i.e. coiled coils) and experimental data from UniProt. In order to enhance the investigations possible with this data, we have also incorporated a variety of protein annotation resources, including protein-protein interaction data, GO functional assignments, KEGG pathways, FUNCAT functional descriptions and links to microarray expression data. All of this data can be accessed through a newly re-designed website that has a focus on flexibility and clarity, with searches that can be restricted to a single genome or across the entire sequence database. Currently Gene3D contains over 3.5 million domain assignments for nearly 5 million proteins including 527 completed genomes. This is available at: http://gene3d.biochem.ucl.ac.uk/ PMID:18032434

  13. The National Microbial Pathogen Database Resource (NMPDR): a genomics platform based on subsystem annotation.

    PubMed

    McNeil, Leslie Klis; Reich, Claudia; Aziz, Ramy K; Bartels, Daniela; Cohoon, Matthew; Disz, Terry; Edwards, Robert A; Gerdes, Svetlana; Hwang, Kaitlyn; Kubal, Michael; Margaryan, Gohar Rem; Meyer, Folker; Mihalo, William; Olsen, Gary J; Olson, Robert; Osterman, Andrei; Paarmann, Daniel; Paczian, Tobias; Parrello, Bruce; Pusch, Gordon D; Rodionov, Dmitry A; Shi, Xinghua; Vassieva, Olga; Vonstein, Veronika; Zagnitko, Olga; Xia, Fangfang; Zinner, Jenifer; Overbeek, Ross; Stevens, Rick

    2007-01-01

    The National Microbial Pathogen Data Resource (NMPDR) (http://www.nmpdr.org) is a National Institute of Allergy and Infections Disease (NIAID)-funded Bioinformatics Resource Center that supports research in selected Category B pathogens. NMPDR contains the complete genomes of approximately 50 strains of pathogenic bacteria that are the focus of our curators, as well as >400 other genomes that provide a broad context for comparative analysis across the three phylogenetic Domains. NMPDR integrates complete, public genomes with expertly curated biological subsystems to provide the most consistent genome annotations. Subsystems are sets of functional roles related by a biologically meaningful organizing principle, which are built over large collections of genomes; they provide researchers with consistent functional assignments in a biologically structured context. Investigators can browse subsystems and reactions to develop accurate reconstructions of the metabolic networks of any sequenced organism. NMPDR provides a comprehensive bioinformatics platform, with tools and viewers for genome analysis. Results of precomputed gene clustering analyses can be retrieved in tabular or graphic format with one-click tools. NMPDR tools include Signature Genes, which finds the set of genes in common or that differentiates two groups of organisms. Essentiality data collated from genome-wide studies have been curated. Drug target identification and high-throughput, in silico, compound screening are in development. PMID:17145713

  14. The National Microbial Pathogen Database Resource (NMPDR): a genomics platform based on subsystem annotation

    PubMed Central

    McNeil, Leslie Klis; Reich, Claudia; Aziz, Ramy K.; Bartels, Daniela; Cohoon, Matthew; Disz, Terry; Edwards, Robert A.; Gerdes, Svetlana; Hwang, Kaitlyn; Kubal, Michael; Margaryan, Gohar Rem; Meyer, Folker; Mihalo, William; Olsen, Gary J.; Olson, Robert; Osterman, Andrei; Paarmann, Daniel; Paczian, Tobias; Parrello, Bruce; Pusch, Gordon D.; Rodionov, Dmitry A.; Shi, Xinghua; Vassieva, Olga; Vonstein, Veronika; Zagnitko, Olga; Xia, Fangfang; Zinner, Jenifer; Overbeek, Ross; Stevens, Rick

    2007-01-01

    The National Microbial Pathogen Data Resource (NMPDR) () is a National Institute of Allergy and Infections Disease (NIAID)-funded Bioinformatics Resource Center that supports research in selected Category B pathogens. NMPDR contains the complete genomes of ∼50 strains of pathogenic bacteria that are the focus of our curators, as well as >400 other genomes that provide a broad context for comparative analysis across the three phylogenetic Domains. NMPDR integrates complete, public genomes with expertly curated biological subsystems to provide the most consistent genome annotations. Subsystems are sets of functional roles related by a biologically meaningful organizing principle, which are built over large collections of genomes; they provide researchers with consistent functional assignments in a biologically structured context. Investigators can browse subsystems and reactions to develop accurate reconstructions of the metabolic networks of any sequenced organism. NMPDR provides a comprehensive bioinformatics platform, with tools and viewers for genome analysis. Results of precomputed gene clustering analyses can be retrieved in tabular or graphic format with one-click tools. NMPDR tools include Signature Genes, which finds the set of genes in common or that differentiates two groups of organisms. Essentiality data collated from genome-wide studies have been curated. Drug target identification and high-throughput, in silico, compound screening are in development. PMID:17145713

  15. xGDBvm: A Web GUI-Driven Workflow for Annotating Eukaryotic Genomes in the Cloud[OPEN

    PubMed Central

    Merchant, Nirav

    2016-01-01

    Genome-wide annotation of gene structure requires the integration of numerous computational steps. Currently, annotation is arguably best accomplished through collaboration of bioinformatics and domain experts, with broad community involvement. However, such a collaborative approach is not scalable at today’s pace of sequence generation. To address this problem, we developed the xGDBvm software, which uses an intuitive graphical user interface to access a number of common genome analysis and gene structure tools, preconfigured in a self-contained virtual machine image. Once their virtual machine instance is deployed through iPlant’s Atmosphere cloud services, users access the xGDBvm workflow via a unified Web interface to manage inputs, set program parameters, configure links to high-performance computing (HPC) resources, view and manage output, apply analysis and editing tools, or access contextual help. The xGDBvm workflow will mask the genome, compute spliced alignments from transcript and/or protein inputs (locally or on a remote HPC cluster), predict gene structures and gene structure quality, and display output in a public or private genome browser complete with accessory tools. Problematic gene predictions are flagged and can be reannotated using the integrated yrGATE annotation tool. xGDBvm can also be configured to append or replace existing data or load precomputed data. Multiple genomes can be annotated and displayed, and outputs can be archived for sharing or backup. xGDBvm can be adapted to a variety of use cases including de novo genome annotation, reannotation, comparison of different annotations, and training or teaching. PMID:27020957

  16. Genome annotation in a community college cell biology lab.

    PubMed

    Beagley, C Timothy

    2013-01-01

    The Biology Department at Salt Lake Community College has used the IMG-ACT toolbox to introduce a genome mapping and annotation exercise into the laboratory portion of its Cell Biology course. This project provides students with an authentic inquiry-based learning experience while introducing them to computational biology and contemporary learning skills. Additionally, the project strengthens student understanding of the scientific method and contributes to student learning gains in curricular objectives centered around basic molecular biology, specifically, the Central Dogma. Importantly, inclusion of this project in the laboratory course provides students with a positive learning environment and allows for the use of cooperative learning strategies to increase overall student success.

  17. Crowdsourcing image annotation for nucleus detection and segmentation in computational pathology: evaluating experts, automated methods, and the crowd.

    PubMed

    Irshad, H; Montaser-Kouhsari, L; Waltz, G; Bucur, O; Nowak, J A; Dong, F; Knoblauch, N W; Beck, A H

    2015-01-01

    The development of tools in computational pathology to assist physicians and biomedical scientists in the diagnosis of disease requires access to high-quality annotated images for algorithm learning and evaluation. Generating high-quality expert-derived annotations is time-consuming and expensive. We explore the use of crowdsourcing for rapidly obtaining annotations for two core tasks in com- putational pathology: nucleus detection and nucleus segmentation. We designed and implemented crowdsourcing experiments using the CrowdFlower platform, which provides access to a large set of labor channel partners that accesses and manages millions of contributors worldwide. We obtained annotations from four types of annotators and compared concordance across these groups. We obtained: crowdsourced annotations for nucleus detection and segmentation on a total of 810 images; annotations using automated methods on 810 images; annotations from research fellows for detection and segmentation on 477 and 455 images, respectively; and expert pathologist-derived annotations for detection and segmentation on 80 and 63 images, respectively. For the crowdsourced annotations, we evaluated performance across a range of contributor skill levels (1, 2, or 3). The crowdsourced annotations (4,860 images in total) were completed in only a fraction of the time and cost required for obtaining annotations using traditional methods. For the nucleus detection task, the research fellow-derived annotations showed the strongest concordance with the expert pathologist- derived annotations (F-M =93.68%), followed by the crowd-sourced contributor levels 1,2, and 3 and the automated method, which showed relatively similar performance (F-M = 87.84%, 88.49%, 87.26%, and 86.99%, respectively). For the nucleus segmentation task, the crowdsourced contributor level 3-derived annotations, research fellow-derived annotations, and automated method showed the strongest concordance with the expert pathologist

  18. MEGAnnotator: a user-friendly pipeline for microbial genomes assembly and annotation.

    PubMed

    Lugli, Gabriele Andrea; Milani, Christian; Mancabelli, Leonardo; van Sinderen, Douwe; Ventura, Marco

    2016-04-01

    Genome annotation is one of the key actions that must be undertaken in order to decipher the genetic blueprint of organisms. Thus, a correct and reliable annotation is essential in rendering genomic data valuable. Here, we describe a bioinformatics pipeline based on freely available software programs coordinated by a multithreaded script named MEGAnnotator (Multithreaded Enhanced prokaryotic Genome Annotator). This pipeline allows the generation of multiple annotated formats fulfilling the NCBI guidelines for assembled microbial genome submission, based on DNA shotgun sequencing reads, and minimizes manual intervention, while also reducing waiting times between software program executions and improving final quality of both assembly and annotation outputs. MEGAnnotator provides an efficient way to pre-arrange the assembly and annotation work required to process NGS genome sequence data. The script improves the final quality of microbial genome annotation by reducing ambiguous annotations. Moreover, the MEGAnnotator platform allows the user to perform a partial annotation of pre-assembled genomes and includes an option to accomplish metagenomic data set assemblies. MEGAnnotator platform will be useful for microbiologists interested in genome analyses of bacteria as well as those investigating the complexity of microbial communities that do not possess the necessary skills to prepare their own bioinformatics pipeline.

  19. The Genomic CDS Sandbox: An Assessment Among Domain Experts

    PubMed Central

    Aziz, Ayesha; Kawamoto, Kensaku; Eilbeck, Karen; Williams, Marc S.; Freimuth, Robert R.; Hoffman, Mark A.; Rasmussen, Luke V.; Overby, Casey L.; Shirts, Brian H.; Hoffman, James M.; Welch, Brandon M.

    2016-01-01

    Genomics is a promising tool that is becoming more widely available to improve the care and treatment of individuals. While there is much assertion, genomics will most certainly require the use of clinical decision support (CDS) to be fully realized in the routine clinical setting. The National Human Genome Research Institute (NHGRI) of the National Institutes of Health recently convened an in-person, multi-day meeting on this topic. It was widely recognized that there is a need to promote the innovation and development of resources for genomic CDS such as a CDS sandbox. The purpose of this study was to evaluate a proposed approach for such a genomic CDS sandbox among domain experts and potential users. Survey results indicate a significant interest and desire for a genomic CDS sandbox environment among domain experts. These results will be used to guide the development of a genomic CDS sandbox. PMID:26778834

  20. The genomic CDS sandbox: An assessment among domain experts.

    PubMed

    Aziz, Ayesha; Kawamoto, Kensaku; Eilbeck, Karen; Williams, Marc S; Freimuth, Robert R; Hoffman, Mark A; Rasmussen, Luke V; Overby, Casey L; Shirts, Brian H; Hoffman, James M; Welch, Brandon M

    2016-04-01

    Genomics is a promising tool that is becoming more widely available to improve the care and treatment of individuals. While there is much assertion, genomics will most certainly require the use of clinical decision support (CDS) to be fully realized in the routine clinical setting. The National Human Genome Research Institute (NHGRI) of the National Institutes of Health recently convened an in-person, multi-day meeting on this topic. It was widely recognized that there is a need to promote the innovation and development of resources for genomic CDS such as a CDS sandbox. The purpose of this study was to evaluate a proposed approach for such a genomic CDS sandbox among domain experts and potential users. Survey results indicate a significant interest and desire for a genomic CDS sandbox environment among domain experts. These results will be used to guide the development of a genomic CDS sandbox.

  1. AGeS: A Software System for Microbial Genome Sequence Annotation

    PubMed Central

    Kumar, Kamal; Desai, Valmik; Cheng, Li; Khitrov, Maxim; Grover, Deepak; Satya, Ravi Vijaya; Yu, Chenggang; Zavaljevski, Nela; Reifman, Jaques

    2011-01-01

    Background The annotation of genomes from next-generation sequencing platforms needs to be rapid, high-throughput, and fully integrated and automated. Although a few Web-based annotation services have recently become available, they may not be the best solution for researchers that need to annotate a large number of genomes, possibly including proprietary data, and store them locally for further analysis. To address this need, we developed a standalone software application, the Annotation of microbial Genome Sequences (AGeS) system, which incorporates publicly available and in-house-developed bioinformatics tools and databases, many of which are parallelized for high-throughput performance. Methodology The AGeS system supports three main capabilities. The first is the storage of input contig sequences and the resulting annotation data in a central, customized database. The second is the annotation of microbial genomes using an integrated software pipeline, which first analyzes contigs from high-throughput sequencing by locating genomic regions that code for proteins, RNA, and other genomic elements through the Do-It-Yourself Annotation (DIYA) framework. The identified protein-coding regions are then functionally annotated using the in-house-developed Pipeline for Protein Annotation (PIPA). The third capability is the visualization of annotated sequences using GBrowse. To date, we have implemented these capabilities for bacterial genomes. AGeS was evaluated by comparing its genome annotations with those provided by three other methods. Our results indicate that the software tools integrated into AGeS provide annotations that are in general agreement with those provided by the compared methods. This is demonstrated by a >94% overlap in the number of identified genes, a significant number of identical annotated features, and a >90% agreement in enzyme function predictions. PMID:21408217

  2. MaGe: a microbial genome annotation system supported by synteny results.

    PubMed

    Vallenet, David; Labarre, Laurent; Rouy, Zoé; Barbe, Valérie; Bocs, Stéphanie; Cruveiller, Stéphane; Lajus, Aurélie; Pascal, Géraldine; Scarpelli, Claude; Médigue, Claudine

    2006-01-01

    Magnifying Genomes (MaGe) is a microbial genome annotation system based on a relational database containing information on bacterial genomes, as well as a web interface to achieve genome annotation projects. Our system allows one to initiate the annotation of a genome at the early stage of the finishing phase. MaGe's main features are (i) integration of annotation data from bacterial genomes enhanced by a gene coding re-annotation process using accurate gene models, (ii) integration of results obtained with a wide range of bioinformatics methods, among which exploration of gene context by searching for conserved synteny and reconstruction of metabolic pathways, (iii) an advanced web interface allowing multiple users to refine the automatic assignment of gene product functions. MaGe is also linked to numerous well-known biological databases and systems. Our system has been thoroughly tested during the annotation of complete bacterial genomes (Acinetobacter baylyi ADP1, Pseudoalteromonas haloplanktis, Frankia alni) and is currently used in the context of several new microbial genome annotation projects. In addition, MaGe allows for annotation curation and exploration of already published genomes from various genera (e.g. Yersinia, Bacillus and Neisseria). MaGe can be accessed at http://www.genoscope.cns.fr/agc/mage. PMID:16407324

  3. MitoFish and MitoAnnotator: A Mitochondrial Genome Database of Fish with an Accurate and Automatic Annotation Pipeline

    PubMed Central

    Iwasaki, Wataru; Fukunaga, Tsukasa; Isagozawa, Ryota; Yamada, Koichiro; Maeda, Yasunobu; Satoh, Takashi P.; Sado, Tetsuya; Mabuchi, Kohji; Takeshima, Hirohiko; Miya, Masaki; Nishida, Mutsumi

    2013-01-01

    Mitofish is a database of fish mitochondrial genomes (mitogenomes) that includes powerful and precise de novo annotations for mitogenome sequences. Fish occupy an important position in the evolution of vertebrates and the ecology of the hydrosphere, and mitogenomic sequence data have served as a rich source of information for resolving fish phylogenies and identifying new fish species. The importance of a mitogenomic database continues to grow at a rapid pace as massive amounts of mitogenomic data are generated with the advent of new sequencing technologies. A severe bottleneck seems likely to occur with regard to mitogenome annotation because of the overwhelming pace of data accumulation and the intrinsic difficulties in annotating sequences with degenerating transfer RNA structures, divergent start/stop codons of the coding elements, and the overlapping of adjacent elements. To ease this data backlog, we developed an annotation pipeline named MitoAnnotator. MitoAnnotator automatically annotates a fish mitogenome with a high degree of accuracy in approximately 5 min; thus, it is readily applicable to data sets of dozens of sequences. MitoFish also contains re-annotations of previously sequenced fish mitogenomes, enabling researchers to refer to them when they find annotations that are likely to be erroneous or while conducting comparative mitogenomic analyses. For users who need more information on the taxonomy, habitats, phenotypes, or life cycles of fish, MitoFish provides links to related databases. MitoFish and MitoAnnotator are freely available at http://mitofish.aori.u-tokyo.ac.jp/ (last accessed August 28, 2013); all of the data can be batch downloaded, and the annotation pipeline can be used via a web interface. PMID:23955518

  4. MitoFish and MitoAnnotator: a mitochondrial genome database of fish with an accurate and automatic annotation pipeline.

    PubMed

    Iwasaki, Wataru; Fukunaga, Tsukasa; Isagozawa, Ryota; Yamada, Koichiro; Maeda, Yasunobu; Satoh, Takashi P; Sado, Tetsuya; Mabuchi, Kohji; Takeshima, Hirohiko; Miya, Masaki; Nishida, Mutsumi

    2013-11-01

    Mitofish is a database of fish mitochondrial genomes (mitogenomes) that includes powerful and precise de novo annotations for mitogenome sequences. Fish occupy an important position in the evolution of vertebrates and the ecology of the hydrosphere, and mitogenomic sequence data have served as a rich source of information for resolving fish phylogenies and identifying new fish species. The importance of a mitogenomic database continues to grow at a rapid pace as massive amounts of mitogenomic data are generated with the advent of new sequencing technologies. A severe bottleneck seems likely to occur with regard to mitogenome annotation because of the overwhelming pace of data accumulation and the intrinsic difficulties in annotating sequences with degenerating transfer RNA structures, divergent start/stop codons of the coding elements, and the overlapping of adjacent elements. To ease this data backlog, we developed an annotation pipeline named MitoAnnotator. MitoAnnotator automatically annotates a fish mitogenome with a high degree of accuracy in approximately 5 min; thus, it is readily applicable to data sets of dozens of sequences. MitoFish also contains re-annotations of previously sequenced fish mitogenomes, enabling researchers to refer to them when they find annotations that are likely to be erroneous or while conducting comparative mitogenomic analyses. For users who need more information on the taxonomy, habitats, phenotypes, or life cycles of fish, MitoFish provides links to related databases. MitoFish and MitoAnnotator are freely available at http://mitofish.aori.u-tokyo.ac.jp/ (last accessed August 28, 2013); all of the data can be batch downloaded, and the annotation pipeline can be used via a web interface.

  5. Functional annotation of introns in mitochondrial genome--a brief review.

    PubMed

    Anandakumar, Shanmugam; Ravindran, Suda Parimala; Shanmughavel, Piramanayagam

    2016-01-01

    The present study is to decipher the non-coding regions present in mitochondrial genomes that cause diseases in humans and predict their functional roles through comparative genomics approach followed by functional annotation of these segments.

  6. Biological database of images and genomes: tools for community annotations linking image and genomic information.

    PubMed

    Oberlin, Andrew T; Jurkovic, Dominika A; Balish, Mitchell F; Friedberg, Iddo

    2013-01-01

    Genomic data and biomedical imaging data are undergoing exponential growth. However, our understanding of the phenotype-genotype connection linking the two types of data is lagging behind. While there are many types of software that enable the manipulation and analysis of image data and genomic data as separate entities, there is no framework established for linking the two. We present a generic set of software tools, BioDIG, that allows linking of image data to genomic data. BioDIG tools can be applied to a wide range of research problems that require linking images to genomes. BioDIG features the following: rapid construction of web-based workbenches, community-based annotation, user management and web services. By using BioDIG to create websites, researchers and curators can rapidly annotate a large number of images with genomic information. Here we present the BioDIG software tools that include an image module, a genome module and a user management module. We also introduce a BioDIG-based website, MyDIG, which is being used to annotate images of mycoplasmas. PMID:23550062

  7. Biological Database of Images and Genomes: tools for community annotations linking image and genomic information

    PubMed Central

    Oberlin, Andrew T; Jurkovic, Dominika A; Balish, Mitchell F; Friedberg, Iddo

    2013-01-01

    Genomic data and biomedical imaging data are undergoing exponential growth. However, our understanding of the phenotype–genotype connection linking the two types of data is lagging behind. While there are many types of software that enable the manipulation and analysis of image data and genomic data as separate entities, there is no framework established for linking the two. We present a generic set of software tools, BioDIG, that allows linking of image data to genomic data. BioDIG tools can be applied to a wide range of research problems that require linking images to genomes. BioDIG features the following: rapid construction of web-based workbenches, community-based annotation, user management and web services. By using BioDIG to create websites, researchers and curators can rapidly annotate a large number of images with genomic information. Here we present the BioDIG software tools that include an image module, a genome module and a user management module. We also introduce a BioDIG-based website, MyDIG, which is being used to annotate images of mycoplasmas. Database URL: BioDIG website: http://biodig.org BioDIG source code repository: http://github.com/FriedbergLab/BioDIG The MyDIG database: http://mydig.biodig.org/ PMID:23550062

  8. Coordinated international action to accelerate genome-to-phenome with FAANG, The Functional Annotation of Animal Genomes project

    Technology Transfer Automated Retrieval System (TEKTRAN)

    We describe the organization of a nascent international effort - the "Functional Annotation of ANimal Genomes" project - whose aim is to produce comprehensive maps of functional elements in the genomes of domesticated animal species....

  9. The standard operating procedure of the DOE-JGI Microbial Genome Annotation Pipeline (MGAP v.4)

    DOE PAGES

    Huntemann, Marcel; Ivanova, Natalia N.; Mavromatis, Konstantinos; Tripp, H. James; Paez-Espino, David; Palaniappan, Krishnaveni; Szeto, Ernest; Pillay, Manoj; Chen, I-Min A.; Pati, Amrita; et al

    2015-10-26

    The DOE-JGI Microbial Genome Annotation Pipeline performs structural and functional annotation of microbial genomes that are further included into the Integrated Microbial Genome comparative analysis system. MGAP is applied to assembled nucleotide sequence datasets that are provided via the IMG submission site. Dataset submission for annotation first requires project and associated metadata description in GOLD. The MGAP sequence data processing consists of feature prediction including identification of protein-coding genes, non-coding RNAs and regulatory RNA features, as well as CRISPR elements. In conclusion, structural annotation is followed by assignment of protein product names and functions.

  10. The discrepancies in the results of bioinformatics tools for genomic structural annotation

    NASA Astrophysics Data System (ADS)

    Pawełkowicz, Magdalena; Nowak, Robert; Osipowski, Paweł; Rymuszka, Jacek; Świerkula, Katarzyna; Wojcieszek, Michał; Przybecki, Zbigniew

    2014-11-01

    A major focus of sequencing project is to identify genes in genomes. However it is necessary to define the variety of genes and the criteria for identifying them. In this work we present discrepancies and dependencies from the application of different bioinformatic programs for structural annotation performed on the cucumber data set from Polish Consortium of Cucumber Genome Sequencing. We use Fgenesh, GenScan and GeneMark to automated structural annotation, the results have been compared to reference annotation.

  11. The SEED and the Rapid Annotation of microbial genomes using Subsystems Technology (RAST)

    PubMed Central

    Overbeek, Ross; Olson, Robert; Pusch, Gordon D.; Olsen, Gary J.; Davis, James J.; Disz, Terry; Edwards, Robert A.; Gerdes, Svetlana; Parrello, Bruce; Shukla, Maulik; Vonstein, Veronika; Wattam, Alice R.; Xia, Fangfang; Stevens, Rick

    2014-01-01

    In 2004, the SEED (http://pubseed.theseed.org/) was created to provide consistent and accurate genome annotations across thousands of genomes and as a platform for discovering and developing de novo annotations. The SEED is a constantly updated integration of genomic data with a genome database, web front end, API and server scripts. It is used by many scientists for predicting gene functions and discovering new pathways. In addition to being a powerful database for bioinformatics research, the SEED also houses subsystems (collections of functionally related protein families) and their derived FIGfams (protein families), which represent the core of the RAST annotation engine (http://rast.nmpdr.org/). When a new genome is submitted to RAST, genes are called and their annotations are made by comparison to the FIGfam collection. If the genome is made public, it is then housed within the SEED and its proteins populate the FIGfam collection. This annotation cycle has proven to be a robust and scalable solution to the problem of annotating the exponentially increasing number of genomes. To date, >12 000 users worldwide have annotated >60 000 distinct genomes using RAST. Here we describe the interconnectedness of the SEED database and RAST, the RAST annotation pipeline and updates to both resources. PMID:24293654

  12. GO-FAANG meeting: A gathering on functional annotation of animal genomes

    Technology Transfer Automated Retrieval System (TEKTRAN)

    The FAANG (Functional Annotation of Animal Genomes) Consortium recently held a Gathering On FAANG (GO-FAANG) Workshop in Washington, DC on October 7-8, 2015. This consortium is a grass-roots organization formed to advance the annotation of newly assembled genomes of non-model organisms (www.faang.or...

  13. Draft Genome Sequence and Gene Annotation of the Entomopathogenic Fungus Verticillium hemipterigenum

    PubMed Central

    Horn, Fabian; Habel, Andreas; Scharf, Daniel H.; Dworschak, Jan; Brakhage, Axel A.; Guthke, Reinhard

    2015-01-01

    Verticillium hemipterigenum (anamorph Torrubiella hemipterigena) is an entomopathogenic fungus and produces a broad range of secondary metabolites. Here, we present the draft genome sequence of the fungus, including gene structure and functional annotation. Genes were predicted incorporating RNA-Seq data and functionally annotated to provide the basis for further genome studies. PMID:25614560

  14. Australian sea-floor survey data, with images and expert annotations

    PubMed Central

    Bewley, Michael; Friedman, Ariell; Ferrari, Renata; Hill, Nicole; Hovey, Renae; Barrett, Neville; Pizarro, Oscar; Figueira, Will; Meyer, Lisa; Babcock, Russ; Bellchambers, Lynda; Byrne, Maria; Williams, Stefan B.

    2015-01-01

    This Australian benthic data set (BENTHOZ-2015) consists of an expert-annotated set of georeferenced benthic images and associated sensor data, captured by an autonomous underwater vehicle (AUV) around Australia. This type of data is of interest to marine scientists studying benthic habitats and organisms. AUVs collect georeferenced images over an area with consistent illumination and altitude, and make it possible to generate broad scale, photo-realistic 3D maps. Marine scientists then typically spend several minutes on each of thousands of images, labeling substratum type and biota at a subset of points. Labels from four Australian research groups were combined using the CATAMI classification scheme, a hierarchical classification scheme based on taxonomy and morphology for scoring marine imagery. This data set consists of 407,968 expert labeled points from around the Australian coast, with associated images, geolocation and other sensor data. The robotic surveys that collected this data form part of Australia's Integrated Marine Observing System (IMOS) ongoing benthic monitoring program. There is reuse potential in marine science, robotics, and computer vision research. PMID:26528396

  15. Australian sea-floor survey data, with images and expert annotations.

    PubMed

    Bewley, Michael; Friedman, Ariell; Ferrari, Renata; Hill, Nicole; Hovey, Renae; Barrett, Neville; Pizarro, Oscar; Figueira, Will; Meyer, Lisa; Babcock, Russ; Bellchambers, Lynda; Byrne, Maria; Williams, Stefan B

    2015-01-01

    This Australian benthic data set (BENTHOZ-2015) consists of an expert-annotated set of georeferenced benthic images and associated sensor data, captured by an autonomous underwater vehicle (AUV) around Australia. This type of data is of interest to marine scientists studying benthic habitats and organisms. AUVs collect georeferenced images over an area with consistent illumination and altitude, and make it possible to generate broad scale, photo-realistic 3D maps. Marine scientists then typically spend several minutes on each of thousands of images, labeling substratum type and biota at a subset of points. Labels from four Australian research groups were combined using the CATAMI classification scheme, a hierarchical classification scheme based on taxonomy and morphology for scoring marine imagery. This data set consists of 407,968 expert labeled points from around the Australian coast, with associated images, geolocation and other sensor data. The robotic surveys that collected this data form part of Australia's Integrated Marine Observing System (IMOS) ongoing benthic monitoring program. There is reuse potential in marine science, robotics, and computer vision research. PMID:26528396

  16. Australian sea-floor survey data, with images and expert annotations.

    PubMed

    Bewley, Michael; Friedman, Ariell; Ferrari, Renata; Hill, Nicole; Hovey, Renae; Barrett, Neville; Pizarro, Oscar; Figueira, Will; Meyer, Lisa; Babcock, Russ; Bellchambers, Lynda; Byrne, Maria; Williams, Stefan B

    2015-01-01

    This Australian benthic data set (BENTHOZ-2015) consists of an expert-annotated set of georeferenced benthic images and associated sensor data, captured by an autonomous underwater vehicle (AUV) around Australia. This type of data is of interest to marine scientists studying benthic habitats and organisms. AUVs collect georeferenced images over an area with consistent illumination and altitude, and make it possible to generate broad scale, photo-realistic 3D maps. Marine scientists then typically spend several minutes on each of thousands of images, labeling substratum type and biota at a subset of points. Labels from four Australian research groups were combined using the CATAMI classification scheme, a hierarchical classification scheme based on taxonomy and morphology for scoring marine imagery. This data set consists of 407,968 expert labeled points from around the Australian coast, with associated images, geolocation and other sensor data. The robotic surveys that collected this data form part of Australia's Integrated Marine Observing System (IMOS) ongoing benthic monitoring program. There is reuse potential in marine science, robotics, and computer vision research.

  17. Expressed Peptide Tags: An additional layer of data for genome annotation

    SciTech Connect

    Savidor, Alon; Donahoo, Ryan S; Hurtado-Gonzales, Oscar; Verberkmoes, Nathan C; Shah, Manesh B; Lamour, Kurt H; McDonald, W Hayes

    2006-01-01

    While genome sequencing is becoming ever more routine, genome annotation remains a challenging process. Identification of the coding sequences within the genomic milieu presents a tremendous challenge, especially for eukaryotes with their complex gene architectures. Here we present a method to assist the annotation process through the use of proteomic data and bioinformatics. Mass spectra of digested protein preparations of the organism of interest were acquired and searched against a protein database created by a six frame translation of the genome. The identified peptides were mapped back to the genome, compared to the current annotation, and then categorized as supporting or extending the current genome annotation. We named the classified peptides Expressed Peptide Tags (EPTs). The well annotated bacterium Rhodopseudomonas palustris was used as a control for the method and showed high degree of correlation between EPT mapping and the current annotation, with 86% of the EPTs confirming existing gene calls and less than 1% of the EPTs expanding on the current annotation. The eukaryotic plant pathogens Phytophthora ramorum and Phytophthora sojae, whose genomes have been recently sequenced and are much less well annotated, were also subjected to this method. A series of algorithmic steps were taken to increase the confidence of EPT identification for these organisms, including generation of smaller sub-databases to be searched against, and definition of EPT criteria that accommodates the more complex eukaryotic gene architecture. As expected, the analysis of the Phytophthora species showed less correlation between EPT mapping and their current annotation. While ~77% of Phytophthora EPTs supported the current annotation, a portion of them (7.2% and 12.6% for P. ramorum and P. sojae, respectively) suggested modification to current gene calls or identified novel genes that were missed by the current genome annotation of these organisms.

  18. MEGANTE: a web-based system for integrated plant genome annotation.

    PubMed

    Numa, Hisataka; Itoh, Takeshi

    2014-01-01

    The recent advancement of high-throughput genome sequencing technologies has resulted in a considerable increase in demands for large-scale genome annotation. While annotation is a crucial step for downstream data analyses and experimental studies, this process requires substantial expertise and knowledge of bioinformatics. Here we present MEGANTE, a web-based annotation system that makes plant genome annotation easy for researchers unfamiliar with bioinformatics. Without any complicated configuration, users can perform genomic sequence annotations simply by uploading a sequence and selecting the species to query. MEGANTE automatically runs several analysis programs and integrates the results to select the appropriate consensus exon-intron structures and to predict open reading frames (ORFs) at each locus. Functional annotation, including a similarity search against known proteins and a functional domain search, are also performed for the predicted ORFs. The resultant annotation information is visualized with a widely used genome browser, GBrowse. For ease of analysis, the results can be downloaded in Microsoft Excel format. All of the query sequences and annotation results are stored on the server side so that users can access their own data from virtually anywhere on the web. The current release of MEGANTE targets 24 plant species from the Brassicaceae, Fabaceae, Musaceae, Poaceae, Salicaceae, Solanaceae, Rosaceae and Vitaceae families, and it allows users to submit a sequence up to 10 Mb in length and to save up to 100 sequences with the annotation information on the server. The MEGANTE web service is available at https://megante.dna.affrc.go.jp/.

  19. Genome-enabled prediction of quantitative traits in chickens using genomic annotation

    PubMed Central

    2014-01-01

    Background Genome-wide association studies have been deemed successful for identifying statistically associated genetic variants of large effects on complex traits. Past studies have found enrichment of trait-associated SNPs in functionally annotated regions, while depletion was reported for intergenic regions (IGR). However, no systematic examination of connections between genomic regions and predictive ability of complex phenotypes has been carried out. Results In this study, we partitioned SNPs based on their annotation to characterize genomic regions that deliver low and high predictive power for three broiler traits in chickens using a whole-genome approach. Additive genomic relationship kernels were constructed for each of the genic regions considered, and a kernel-based Bayesian ridge regression was employed as prediction machine. We found that the predictive performance for ultrasound area of breast meat from using genic regions marked by SNPs was consistently better than that from SNPs in IGR, while IGR tagged by SNPs were better than the genic regions for body weight and hen house egg production. We also noted that predictive ability delivered by the whole battery of markers was close to the best prediction achieved by one of the genomic regions. Conclusions Whole-genome regression methods use all available quality filtered SNPs into a model, contrary to accommodating only validated SNPs from exonic or coding regions. Our results suggest that, while differences among genomic regions in terms of predictive ability were observed, the whole-genome approach remains as a promising tool if interest is on prediction of complex traits. PMID:24502227

  20. Assessment of features for automatic CTG analysis based on expert annotation.

    PubMed

    Chudácek, Vacláv; Spilka, Jirí; Lhotská, Lenka; Janku, Petr; Koucký, Michal; Huptych, Michal; Bursa, Miroslav

    2011-01-01

    Cardiotocography (CTG) is the monitoring of fetal heart rate (FHR) and uterine contractions (TOCO) since 1960's used routinely by obstetricians to detect fetal hypoxia. The evaluation of the FHR in clinical settings is based on an evaluation of macroscopic morphological features and so far has managed to avoid adopting any achievements from the HRV research field. In this work, most of the ever-used features utilized for FHR characterization, including FIGO, HRV, nonlinear, wavelet, and time and frequency domain features, are investigated and the features are assessed based on their statistical significance in the task of distinguishing the FHR into three FIGO classes. Annotation derived from the panel of experts instead of the commonly utilized pH values was used for evaluation of the features on a large data set (552 records). We conclude the paper by presenting the best uncorrelated features and their individual rank of importance according to the meta-analysis of three different ranking methods. Number of acceleration and deceleration, interval index, as well as Lempel-Ziv complexity and Higuchi's fractal dimension are among the top five features. PMID:22255719

  1. Toward a standard in structural genome annotation for prokaryotes

    DOE PAGES

    Tripp, H. James; Sutton, Granger; White, Owen; Wortman, Jennifer; Pati, Amrita; Mikhailova, Natalia; Ovchinnikova, Galina; Payne, Samuel H.; Kyrpides, Nikos C.; Ivanova, Natalia

    2015-07-25

    In an effort to identify the best practice for finding genes in prokaryotic genomes and propose it as a standard for automated annotation pipelines, we collected 1,004,576 peptides from various publicly available resources, and these were used as a basis to evaluate various gene-calling methods. The peptides came from 45 bacterial replicons with an average GC content from 31 % to 74 %, biased toward higher GC content genomes. Automated, manual, and semi-manual methods were used to tally errors in three widely used gene calling methods, as evidenced by peptides mapped outside the boundaries of called genes. We found thatmore » the consensus set of identical genes predicted by the three methods constitutes only about 70 % of the genes predicted by each individual method (with start and stop required to coincide). Peptide data was useful for evaluating some of the differences between gene callers, but not reliable enough to make the results conclusive, due to limitations inherent in any proteogenomic study. A single, unambiguous, unanimous best practice did not emerge from this analysis, since the available proteomics data were not adequate to provide an objective measurement of differences in the accuracy between these methods. However, as a result of this study, software, reference data, and procedures have been better matched among participants, representing a step toward a much-needed standard. In the absence of sufficient amount of experimental data to achieve a universal standard, our recommendation is that any of these methods can be used by the community, as long as a single method is employed across all datasets to be compared.« less

  2. Toward a standard in structural genome annotation for prokaryotes

    SciTech Connect

    Tripp, H. James; Sutton, Granger; White, Owen; Wortman, Jennifer; Pati, Amrita; Mikhailova, Natalia; Ovchinnikova, Galina; Payne, Samuel H.; Kyrpides, Nikos C.; Ivanova, Natalia

    2015-07-25

    In an effort to identify the best practice for finding genes in prokaryotic genomes and propose it as a standard for automated annotation pipelines, we collected 1,004,576 peptides from various publicly available resources, and these were used as a basis to evaluate various gene-calling methods. The peptides came from 45 bacterial replicons with an average GC content from 31 % to 74 %, biased toward higher GC content genomes. Automated, manual, and semi-manual methods were used to tally errors in three widely used gene calling methods, as evidenced by peptides mapped outside the boundaries of called genes. We found that the consensus set of identical genes predicted by the three methods constitutes only about 70 % of the genes predicted by each individual method (with start and stop required to coincide). Peptide data was useful for evaluating some of the differences between gene callers, but not reliable enough to make the results conclusive, due to limitations inherent in any proteogenomic study. A single, unambiguous, unanimous best practice did not emerge from this analysis, since the available proteomics data were not adequate to provide an objective measurement of differences in the accuracy between these methods. However, as a result of this study, software, reference data, and procedures have been better matched among participants, representing a step toward a much-needed standard. In the absence of sufficient amount of experimental data to achieve a universal standard, our recommendation is that any of these methods can be used by the community, as long as a single method is employed across all datasets to be compared.

  3. Improvement of barley genome annotations by deciphering the Haruna Nijo genome

    PubMed Central

    Sato, Kazuhiro; Tanaka, Tsuyoshi; Shigenobu, Shuji; Motoi, Yuka; Wu, Jianzhong; Itoh, Takeshi

    2016-01-01

    Full-length (FL) cDNA sequences provide the most reliable evidence for the presence of genes in genomes. In this report, detailed gene structures of barley, whole genome shotgun (WGS) and additional transcript data of the cultivar Haruna Nijo were quality controlled and compared with the published Morex genome information. Haruna Nijo scaffolds have longer total sequence length with much higher N50 and fewer sequences than those in Morex WGS contigs. The longer Haruna Nijo scaffolds provided efficient FLcDNA mapping, resulting in high coverage and detection of the transcription start sites. In combination with FLcDNAs and RNA-Seq data from four different tissue samples of Haruna Nijo, we identified 51,249 gene models on 30,606 loci. Overall sequence similarity between Haruna Nijo and Morex genome was 95.99%, while that of exon regions was higher (99.71%). These sequence and annotation data of Haruna Nijo are combined with Morex genome data and released from a genome browser. The genome sequence of Haruna Nijo may provide detailed gene structures in addition to the current Morex barley genome information. PMID:26622062

  4. Proteogenomics: the needs and roles to be filled by proteomics in genome annotation

    SciTech Connect

    Ansong, Charles; Purvine, Samuel O.; Adkins, Joshua N.; Lipton, Mary S.; Smith, Richard D.

    2008-01-01

    While genome sequencing efforts reveal the basic building blocks of life, a genome sequence alone is insufficient for elucidating biological function. Genome annotation – the process of identifying genes and assigning function to each gene in a genome sequence – provides the means to elucidate biological function from sequence. Current state-of-the-art high throughput genome annotation uses a combination of comparative (sequence similarity data) and non-comparative (ab initio gene prediction algorithms) methods to identify open reading frames in genome sequences. Because approaches used to validate the presence of these open reading frames are typically based on the information derived from the annotated genomes, they cannot independently and unequivocally determine whether a predicted open reading frame is translated into a protein. With the ability to directly measure peptides arising from expressed proteins, high throughput liquid chromatography-tandem mass spectrometry-based proteomics, approaches can be used to verify coding regions of a genomic sequence. Here, we highlight several ways in which high throughput tandem mass spectrometry-based proteomics can improve the quality of genome annotations and suggest that it could be efficiently applied during the initial gene calling process so that the improvements are propagated through the subsequent functional annotation process.

  5. Integration of RNA-seq and proteomics data with genomics for improved genome annotation in Apicomplexan parasites.

    PubMed

    Silmon de Monerri, Natalie C; Weiss, Louis M

    2015-08-01

    While high quality genomic sequence data is available for many pathogenic organisms, the corresponding gene annotations are often plagued with inaccuracies that can hinder research that utilizes such genomic data. Experimental validation of gene models is clearly crucial in improving such gene annotations; the field of proteogenomics is an emerging area of research wherein proteomic data is applied to testing and improving genetic models. Krishna et al. [Proteomics 2015, 15, 2618-2628] investigated whether incorporation of RNA-seq data into proteogenomics analyses can contribute significantly to validation studies of genome annotation, in two important parasitic organisms Toxoplasma gondii and Neospora caninum. They applied a systematic approach to combine new and previously published proteomics data from T. gondii and N. caninum with transcriptomics data, leading to substantially improved gene models for these organisms. This study illustrates the importance of incorporating experimental data from both proteomics and RNA-seq studies into routine genome annotation protocols.

  6. BG7: A New Approach for Bacterial Genome Annotation Designed for Next Generation Sequencing Data

    PubMed Central

    Pareja-Tobes, Pablo; Manrique, Marina; Pareja-Tobes, Eduardo; Pareja, Eduardo; Tobes, Raquel

    2012-01-01

    BG7 is a new system for de novo bacterial, archaeal and viral genome annotation based on a new approach specifically designed for annotating genomes sequenced with next generation sequencing technologies. The system is versatile and able to annotate genes even in the step of preliminary assembly of the genome. It is especially efficient detecting unexpected genes horizontally acquired from bacterial or archaeal distant genomes, phages, plasmids, and mobile elements. From the initial phases of the gene annotation process, BG7 exploits the massive availability of annotated protein sequences in databases. BG7 predicts ORFs and infers their function based on protein similarity with a wide set of reference proteins, integrating ORF prediction and functional annotation phases in just one step. BG7 is especially tolerant to sequencing errors in start and stop codons, to frameshifts, and to assembly or scaffolding errors. The system is also tolerant to the high level of gene fragmentation which is frequently found in not fully assembled genomes. BG7 current version – which is developed in Java, takes advantage of Amazon Web Services (AWS) cloud computing features, but it can also be run locally in any operating system. BG7 is a fast, automated and scalable system that can cope with the challenge of analyzing the huge amount of genomes that are being sequenced with NGS technologies. Its capabilities and efficiency were demonstrated in the 2011 EHEC Germany outbreak in which BG7 was used to get the first annotations right the next day after the first entero-hemorrhagic E. coli genome sequences were made publicly available. The suitability of BG7 for genome annotation has been proved for Illumina, 454, Ion Torrent, and PacBio sequencing technologies. Besides, thanks to its plasticity, our system could be very easily adapted to work with new technologies in the future. PMID:23185310

  7. Coding exon-structure aware realigner (CESAR) utilizes genome alignments for accurate comparative gene annotation.

    PubMed

    Sharma, Virag; Elghafari, Anas; Hiller, Michael

    2016-06-20

    Identifying coding genes is an essential step in genome annotation. Here, we utilize existing whole genome alignments to detect conserved coding exons and then map gene annotations from one genome to many aligned genomes. We show that genome alignments contain thousands of spurious frameshifts and splice site mutations in exons that are truly conserved. To overcome these limitations, we have developed CESAR (Coding Exon-Structure Aware Realigner) that realigns coding exons, while considering reading frame and splice sites of each exon. CESAR effectively avoids spurious frameshifts in conserved genes and detects 91% of shifted splice sites. This results in the identification of thousands of additional conserved exons and 99% of the exons that lack inactivating mutations match real exons. Finally, to demonstrate the potential of using CESAR for comparative gene annotation, we applied it to 188 788 exons of 19 865 human genes to annotate human genes in 99 other vertebrates. These comparative gene annotations are available as a resource (http://bds.mpi-cbg.de/hillerlab/CESAR/). CESAR (https://github.com/hillerlab/CESAR/) can readily be applied to other alignments to accurately annotate coding genes in many other vertebrate and invertebrate genomes. PMID:27016733

  8. [Experience in determining workload standards for the forensic medical expert in a genomic fingerprinting laboratory].

    PubMed

    Novoselov, V P; Sharonova, D A

    1994-01-01

    Working time expenditures needed for expert evaluation of material evidences by analysis of polymorphic genome sites using polymerase chain reaction and for establishment of a doubtful parentage by genomic "dactyloscopy" method were assessed. The measurements helped establish monthly and yearly standard load, that is, the number of expert evaluations per forensic medical expert of genomic "dactyloscopy" laboratory.

  9. Annotating the Function of the Human Genome with Gene Ontology and Disease Ontology.

    PubMed

    Hu, Yang; Zhou, Wenyang; Ren, Jun; Dong, Lixiang; Wang, Yadong; Jin, Shuilin; Cheng, Liang

    2016-01-01

    Increasing evidences indicated that function annotation of human genome in molecular level and phenotype level is very important for systematic analysis of genes. In this study, we presented a framework named Gene2Function to annotate Gene Reference into Functions (GeneRIFs), in which each functional description of GeneRIFs could be annotated by a text mining tool Open Biomedical Annotator (OBA), and each Entrez gene could be mapped to Human Genome Organisation Gene Nomenclature Committee (HGNC) gene symbol. After annotating all the records about human genes of GeneRIFs, 288,869 associations between 13,148 mRNAs and 7,182 terms, 9,496 associations between 948 microRNAs and 533 terms, and 901 associations between 139 long noncoding RNAs (lncRNAs) and 297 terms were obtained as a comprehensive annotation resource of human genome. High consistency of term frequency of individual gene (Pearson correlation = 0.6401, p = 2.2e - 16) and gene frequency of individual term (Pearson correlation = 0.1298, p = 3.686e - 14) in GeneRIFs and GOA shows our annotation resource is very reliable. PMID:27635398

  10. Annotating the Function of the Human Genome with Gene Ontology and Disease Ontology

    PubMed Central

    Hu, Yang; Zhou, Wenyang; Ren, Jun; Dong, Lixiang

    2016-01-01

    Increasing evidences indicated that function annotation of human genome in molecular level and phenotype level is very important for systematic analysis of genes. In this study, we presented a framework named Gene2Function to annotate Gene Reference into Functions (GeneRIFs), in which each functional description of GeneRIFs could be annotated by a text mining tool Open Biomedical Annotator (OBA), and each Entrez gene could be mapped to Human Genome Organisation Gene Nomenclature Committee (HGNC) gene symbol. After annotating all the records about human genes of GeneRIFs, 288,869 associations between 13,148 mRNAs and 7,182 terms, 9,496 associations between 948 microRNAs and 533 terms, and 901 associations between 139 long noncoding RNAs (lncRNAs) and 297 terms were obtained as a comprehensive annotation resource of human genome. High consistency of term frequency of individual gene (Pearson correlation = 0.6401, p = 2.2e − 16) and gene frequency of individual term (Pearson correlation = 0.1298, p = 3.686e − 14) in GeneRIFs and GOA shows our annotation resource is very reliable. PMID:27635398

  11. Annotating the Function of the Human Genome with Gene Ontology and Disease Ontology

    PubMed Central

    Hu, Yang; Zhou, Wenyang; Ren, Jun; Dong, Lixiang

    2016-01-01

    Increasing evidences indicated that function annotation of human genome in molecular level and phenotype level is very important for systematic analysis of genes. In this study, we presented a framework named Gene2Function to annotate Gene Reference into Functions (GeneRIFs), in which each functional description of GeneRIFs could be annotated by a text mining tool Open Biomedical Annotator (OBA), and each Entrez gene could be mapped to Human Genome Organisation Gene Nomenclature Committee (HGNC) gene symbol. After annotating all the records about human genes of GeneRIFs, 288,869 associations between 13,148 mRNAs and 7,182 terms, 9,496 associations between 948 microRNAs and 533 terms, and 901 associations between 139 long noncoding RNAs (lncRNAs) and 297 terms were obtained as a comprehensive annotation resource of human genome. High consistency of term frequency of individual gene (Pearson correlation = 0.6401, p = 2.2e − 16) and gene frequency of individual term (Pearson correlation = 0.1298, p = 3.686e − 14) in GeneRIFs and GOA shows our annotation resource is very reliable.

  12. Using Microbial Genome Annotation as a Foundation for Collaborative Student Research

    ERIC Educational Resources Information Center

    Reed, Kelynne E.; Richardson, John M.

    2013-01-01

    We used the Integrated Microbial Genomes Annotation Collaboration Toolkit as a framework to incorporate microbial genomics research into a microbiology and biochemistry course in a way that promoted student learning of bioinformatics and research skills and emphasized teamwork and collaboration as evidenced through multiple assessment mechanisms.…

  13. Accurate Transposable Element Annotation Is Vital When Analyzing New Genome Assemblies

    PubMed Central

    Platt, Roy N.; Blanco-Berdugo, Laura; Ray, David A.

    2016-01-01

    Transposable elements (TEs) are mobile genetic elements with the ability to replicate themselves throughout the host genome. In some taxa TEs reach copy numbers in hundreds of thousands and can occupy more than half of the genome. The increasing number of reference genomes from nonmodel species has begun to outpace efforts to identify and annotate TE content and methods that are used vary significantly between projects. Here, we demonstrate variation that arises in TE annotations when less than optimal methods are used. We found that across a variety of taxa, the ability to accurately identify TEs based solely on homology decreased as the phylogenetic distance between the queried genome and a reference increased. Next we annotated repeats using homology alone, as is often the case in new genome analyses, and a combination of homology and de novo methods as well as an additional manual curation step. Reannotation using these methods identified a substantial number of new TE subfamilies in previously characterized genomes, recognized a higher proportion of the genome as repetitive, and decreased the average genetic distance within TE families, implying recent TE accumulation. Finally, these finding—increased recognition of younger TEs—were confirmed via an analysis of the postman butterfly (Heliconius melpomene). These observations imply that complete TE annotation relies on a combination of homology and de novo–based repeat identification, manual curation, and classification and that relying on simple, homology-based methods is insufficient to accurately describe the TE landscape of a newly sequenced genome. PMID:26802115

  14. A statistical framework to predict functional non-coding regions in the human genome through integrated analysis of annotation data.

    PubMed

    Lu, Qiongshi; Hu, Yiming; Sun, Jiehuan; Cheng, Yuwei; Cheung, Kei-Hoi; Zhao, Hongyu

    2015-05-27

    Identifying functional regions in the human genome is a major goal in human genetics. Great efforts have been made to functionally annotate the human genome either through computational predictions, such as genomic conservation, or high-throughput experiments, such as the ENCODE project. These efforts have resulted in a rich collection of functional annotation data of diverse types that need to be jointly analyzed for integrated interpretation and annotation. Here we present GenoCanyon, a whole-genome annotation method that performs unsupervised statistical learning using 22 computational and experimental annotations thereby inferring the functional potential of each position in the human genome. With GenoCanyon, we are able to predict many of the known functional regions. The ability of predicting functional regions as well as its generalizable statistical framework makes GenoCanyon a unique and powerful tool for whole-genome annotation. The GenoCanyon web server is available at http://genocanyon.med.yale.edu.

  15. Discovery and Characterization of Chromatin States for Systematic Annotation of the Human Genome

    NASA Astrophysics Data System (ADS)

    Ernst, Jason; Kellis, Manolis

    A plethora of epigenetic modifications have been described in the human genome and shown to play diverse roles in gene regulation, cellular differentiation and the onset of disease. Although individual modifications have been linked to the activity levels of various genetic functional elements, their combinatorial patterns are still unresolved and their potential for systematic de novo genome annotation remains untapped. Here, we use a multivariate Hidden Markov Model to reveal chromatin states in human T cells, based on recurrent and spatially coherent combinations of chromatin marks.We define 51 distinct chromatin states, including promoter-associated, transcription-associated, active intergenic, largescale repressed and repeat-associated states. Each chromatin state shows specific enrichments in functional annotations, sequence motifs and specific experimentally observed characteristics, suggesting distinct biological roles. This approach provides a complementary functional annotation of the human genome that reveals the genome-wide locations of diverse classes of epigenetic function.

  16. Annotation methods to develop and evaluate an expert system based on natural language processing in electronic medical records.

    PubMed

    Gicquel, Quentin; Tvardik, Nastassia; Bouvry, Côme; Kergourlay, Ivan; Bittar, André; Segond, Frédérique; Darmoni, Stefan; Metzger, Marie-Hélène

    2015-01-01

    The objective of the SYNODOS collaborative project was to develop a generic IT solution, combining a medical terminology server, a semantic analyser and a knowledge base. The goal of the project was to generate meaningful epidemiological data for various medical domains from the textual content of French medical records. In the context of this project, we built a care pathway oriented conceptual model and corresponding annotation method to develop and evaluate an expert system's knowledge base. The annotation method is based on a semi-automatic process, using a software application (MedIndex). This application exchanges with a cross-lingual multi-termino-ontology portal. The annotator selects the most appropriate medical code proposed for the medical concept in question by the multi-termino-ontology portal and temporally labels the medical concept according to the course of the medical event. This choice of conceptual model and annotation method aims to create a generic database of facts for the secondary use of electronic health records data.

  17. GenColors: annotation and comparative genomics of prokaryotes made easy.

    PubMed

    Romualdi, Alessandro; Felder, Marius; Rose, Dominic; Gausmann, Ulrike; Schilhabel, Markus; Glöckner, Gernot; Platzer, Matthias; Sühnel, Jürgen

    2007-01-01

    GenColors (gencolors.fli-leibniz.de) is a new web-based software/database system aimed at an improved and accelerated annotation of prokaryotic genomes considering information on related genomes and making extensive use of genome comparison. It offers a seamless integration of data from ongoing sequencing projects and annotated genomic sequences obtained from GenBank. A variety of export/import filters manages an effective data flow from sequence assembly and manipulation programs (e.g., GAP4) to GenColors and back as well as to standard GenBank file(s). The genome comparison tools include best bidirectional hits, gene conservation, syntenies, and gene core sets. Precomputed UniProt matches allow annotation and analysis in an effective manner. In addition to these analysis options, base-specific quality data (coverage and confidence) can also be handled if available. The GenColors system can be used both for annotation purposes in ongoing genome projects and as an analysis tool for finished genomes. GenColors comes in two types, as dedicated genome browsers and as the Jena Prokaryotic Genome Viewer (JPGV). Dedicated genome browsers contain genomic information on a set of related genomes and offer a large number of options for genome comparison. The system has been efficiently used in the genomic sequencing of Borrelia garinii and is currently applied to various ongoing genome projects on Borrelia, Legionella, Escherichia, and Pseudomonas genomes. One of these dedicated browsers, the Spirochetes Genome Browser (sgb.fli-leibniz.de) with Borrelia, Leptospira, and Treponema genomes, is freely accessible. The others will be released after finalization of the corresponding genome projects. JPGV (jpgv.fli-leibniz.de) offers information on almost all finished bacterial genomes, as compared to the dedicated browsers with reduced genome comparison functionality, however. As of January 2006, this viewer includes 632 genomic elements (e.g., chromosomes and plasmids) of 293

  18. Improvement of whole-genome annotation of cereals through comparative analyses

    PubMed Central

    Zhu, Wei; Buell, C. Robin

    2007-01-01

    Rice is an important model species for the Poaceae and other monocotyledonous plants. With the availability of a near-complete, finished, and annotated rice genome, we performed genome level comparisons between rice and all plant species in which large genomic or transcriptomic data sets are available to determine the utility of cross-species sequence for structural and functional annotation of the rice genome. Through comparative analyses with four plant genome sequence data sets and transcript assemblies from 185 plant species, we were able to confirm and improve the structural annotation of the rice genome. Support for 38,109 (89.3%) of the total 42,653 nontransposable element-related genes in the rice genome in the form of a rice expressed sequence tag, full-length cDNA, or plant homolog from our comparative analyses could be found. Although the majority of the putative homologs were obtained from Poaceae species, putative homologs were identified in dicotyledonous angiosperms, gymnosperms, and other plants such as algae, moss, and fern. A set of rice genes (7669) lacking a putative homolog was identified which may be lineage-specific genes that evolved after speciation and have a role in species diversity. Improvements to the current rice gene structural annotation could be identified from our comparative alignments and we were able to identify 487 genes which were mostly likely missed in the current rice genome annotation and another 500 genes for structural annotation review. We were able to demonstrate the utility of cross-species comparative alignments in the identification of noncoding sequences and in confirmation of gene nesting in rice. PMID:17284677

  19. A SPECTRAL APPROACH INTEGRATING FUNCTIONAL GENOMIC ANNOTATIONS FOR CODING AND NONCODING VARIANTS

    PubMed Central

    IONITA-LAZA, IULIANA; MCCALLUM, KENNETH; XU, BIN; BUXBAUM, JOSEPH

    2015-01-01

    Over the past few years, substantial effort has been put into the functional annotation of variation in human genome sequence. Such annotations can play a critical role in identifying putatively causal variants among the abundant natural variation that occurs at a locus of interest. The main challenges in using these various annotations include their large numbers, and their diversity. Here we develop an unsupervised approach to integrate these different annotations into one measure of functional importance (Eigen), that, unlike most existing methods, is not based on any labeled training data. We show that the resulting meta-score has better discriminatory ability using disease associated and putatively benign variants from published studies (in both coding and noncoding regions) compared with the recently proposed CADD score. Across varied scenarios, the Eigen score performs generally better than any single individual annotation, representing a powerful single functional score that can be incorporated in fine-mapping studies. PMID:26727659

  20. Insect genome content phylogeny and functional annotation of core insect genomes.

    PubMed

    Rosenfeld, Jeffrey A; Foox, Jonathan; DeSalle, Rob

    2016-04-01

    Twenty-one fully sequenced and well annotated insect genomes were examined for genome content in a phylogenetic context. Gene presence/absence matrices and phylogenetic trees were constructed using several phylogenetic criteria. The role of e-value on phylogenetic analysis and genome content characterization is examined using scaled e-value cutoffs and a single linkage clustering approach to orthology determination. Previous studies have focused on the role of gene loss in terminals in the insect tree of life. The present study examines several common ancestral nodes in the insect tree. We suggest that the common ancestors of major insect groups like Diptera, Hymenoptera, Hemiptera and Holometabola experience more gene gain than gene loss. This suggests that as major insect groups arose, their genomic repertoire expanded through gene duplication (segmental duplications), followed by contraction by gene loss in specific terminal lineages. In addition, we examine the functional significance of the loss and gain of genes in the divergence of some of the major insect groups. PMID:26549428

  1. Extraction and annotation of human mitochondrial genomes from 1000 Genomes Whole Exome Sequencing data

    PubMed Central

    2014-01-01

    Background Whole Exome Sequencing (WES) is one of the most used and cost-effective next generation technologies that allows sequencing of all nuclear exons. Off-target regions may be captured if they present high sequence similarity with baits. Bioinformatics tools have been optimized to retrieve a large amount of WES off-target mitochondrial DNA (mtDNA), by exploiting the aspecificity of probes, partially overlapping to Nuclear mitochondrial Sequences (NumtS). The 1000 Genomes project represents one of the widest resources to extract mtDNA sequences from WES data, considering the large effort the scientific community is undertaking to reconstruct human population history using mtDNA as marker, and the involvement of mtDNA in pathology. Results A previously published pipeline aimed at assembling mitochondrial genomes from off-target WES reads and further improved to detect insertions and deletions (indels) and heteroplasmy in a dataset of 1242 samples from the 1000 Genomes project, enabled to obtain a nearly complete mitochondrial genome from 943 samples (76% analyzed exomes). The robustness of our computational strategy was highlighted by the reduction of reads amount recognized as mitochondrial in the original annotation produced by the Consortium, due to NumtS filtering. An accurate survey was carried out on 1242 individuals. 215 indels, mostly heteroplasmic, and 3407 single base variants were mapped. A homogeneous mismatches distribution was observed along the whole mitochondrial genome, while a lower frequency of indels was found within protein-coding regions, where frameshift mutations may be deleterious. The majority of indels and mismatches found were not previously annotated in mitochondrial databases since conventional sequencing methods were limited to homoplasmy or quasi-homoplasmy detection. Intriguingly, upon filtering out non haplogroup-defining variants, we detected a widespread population occurrence of rare events predicted to be damaging

  2. Genome-wide functional annotation of Phomopsis longicolla isolate MSPL 10-6.

    PubMed

    Darwish, Omar; Li, Shuxian; Matthews, Benjamin; Alkharouf, Nadim

    2016-06-01

    Phomopsis seed decay of soybean is caused primarily by the seed-borne fungal pathogen Phomopsis longicolla (syn. Diaporthe longicolla). This disease severely decreases soybean seed quality, reduces seedling vigor and stand establishment, and suppresses yield. It is one of the most economically important soybean diseases. In this study we annotated the entire genome of P. longicolla isolate MSPL 10-6, which was isolated from field-grown soybean seed in Mississippi, USA. This study represents the first reported genome-wide functional annotation of a seed borne fungal pathogen in the Diaporthe-Phomopsis complex. The P. longicolla genome annotation will enable research into the genetic basis of fungal infection of soybean seed and provide information for the study of soybean-fungal interactions. The genome annotation will also be a valuable resource for the research and agricultural communities. It will aid in the development of new control strategies for this pathogen. The annotations can be found from: http://bioinformatics.towson.edu/phomopsis_longicolla/download.html. NCBI accession number is: AYRD00000000. PMID:27222801

  3. Data for constructing insect genome content matrices for phylogenetic analysis and functional annotation

    PubMed Central

    Rosenfeld, Jeffrey; Foox, Jonathan; DeSalle, Rob

    2015-01-01

    Twenty one fully sequenced and well annotated insect genomes were used to construct genome content matrices for phylogenetic analysis and functional annotation of insect genomes. To examine the role of e-value cutoff in ortholog determination we used scaled e-value cutoffs and a single linkage clustering approach.. The present communication includes (1) a list of the genomes used to construct the genome content phylogenetic matrices, (2) a nexus file with the data matrices used in phylogenetic analysis, (3) a nexus file with the Newick trees generated by phylogenetic analysis, (4) an excel file listing the Core (CORE) genes and Unique (UNI) genes found in five insect groups, and (5) a figure showing a plot of consistency index (CI) versus percent of unannotated genes that are apomorphies in the data set for gene losses and gains and bar plots of gains and losses for four consistency index (CI) cutoffs. PMID:26862572

  4. Data for constructing insect genome content matrices for phylogenetic analysis and functional annotation.

    PubMed

    Rosenfeld, Jeffrey; Foox, Jonathan; DeSalle, Rob

    2016-03-01

    Twenty one fully sequenced and well annotated insect genomes were used to construct genome content matrices for phylogenetic analysis and functional annotation of insect genomes. To examine the role of e-value cutoff in ortholog determination we used scaled e-value cutoffs and a single linkage clustering approach.. The present communication includes (1) a list of the genomes used to construct the genome content phylogenetic matrices, (2) a nexus file with the data matrices used in phylogenetic analysis, (3) a nexus file with the Newick trees generated by phylogenetic analysis, (4) an excel file listing the Core (CORE) genes and Unique (UNI) genes found in five insect groups, and (5) a figure showing a plot of consistency index (CI) versus percent of unannotated genes that are apomorphies in the data set for gene losses and gains and bar plots of gains and losses for four consistency index (CI) cutoffs. PMID:26862572

  5. Whole genome sequence and genome annotation of Colletotrichum acutatum, causal agent of anthracnose in pepper plants in South Korea.

    PubMed

    Han, Joon-Hee; Chon, Jae-Kyung; Ahn, Jong-Hwa; Choi, Ik-Young; Lee, Yong-Hwan; Kim, Kyoung Su

    2016-06-01

    Colletotrichum acutatum is a destructive fungal pathogen which causes anthracnose in a wide range of crops. Here we report the whole genome sequence and annotation of C. acutatum strain KC05, isolated from an infected pepper in Kangwon, South Korea. Genomic DNA from the KC05 strain was used for the whole genome sequencing using a PacBio sequencer and the MiSeq system. The KC05 genome was determined to be 52,190,760 bp in size with a G + C content of 51.73% in 27 scaffolds and to contain 13,559 genes with an average length of 1516 bp. Gene prediction and annotation were performed by incorporating RNA-Seq data. The genome sequence of the KC05 was deposited at DDBJ/ENA/GenBank under the accession number LUXP00000000. PMID:27114908

  6. GAMOLA: a new local solution for sequence annotation and analyzing draft and finished prokaryotic genomes.

    PubMed

    Altermann, Eric; Klaenhammer, Todd R

    2003-01-01

    Laboratories working with draft phase genomes have specific software needs, such as the unattended processing of hundreds of single scaffolds and subsequent sequence annotation. In addition, it is critical to follow the "movement" and the manual annotation of single open reading frames (ORFs) within the successive sequence updates. Even with finished genomes, regular database updates can lead to significant changes in the annotation of single ORFs. In functional genomics it is important to mine data and identify new genetic targets rapidly and easily. Often there is no need for sophisticated relational databases (RDB) that greatly reduce the system-independent access of the results. Another aspect is the internet dependency of most software packages. If users are working with confidential data, this dependency poses a security issue. GAMOLA was designed to handle the numerous scaffolds and changing contents of draft phase genomes in an automated process and stores the results for each predicted ORF in flatfile databases. In addition, annotation transfers, ORF designation tracking, Blast comparisons, and primer design for whole genome microarrays have been implemented. The software is available under the license of North Carolina State University. A website and a downloadable example are accessible under (http://fsweb2.schaub. ncsu.edu/TRKwebsite/index.htm). PMID:14506845

  7. High-throughput comparison, functional annotation, and metabolic modeling of plant genomes using the PlantSEED resource

    Technology Transfer Automated Retrieval System (TEKTRAN)

    The increasing number of sequenced plant genomes is placing new demands on the methods applied to analyze, annotate, and model these genomes. Today's annotation pipelines result in inconsistent gene assignments that complicate comparative analyses and prevent efficient construction of metabolic mode...

  8. Genome sequencing and annotation of Cellulomonas sp. HZM.

    PubMed

    Chua, Patric; Har, Zi Mei; Austin, Christopher M; Yule, Catherine M; Dykes, Gary A; Lee, Sui Mae

    2015-09-01

    We report the draft genome sequence of Cellulomonas sp. HZM, isolated from a tropical peat swamp forest. The draft genome size is 3,559,280 bp with a G + C content of 73% and contains 3 rRNA sequences (single copies of 5S, 16S and 23S rRNA).

  9. Genome sequencing and annotation of Cellulomonas sp. HZM

    PubMed Central

    Chua, Patric; Har, Zi Mei; Austin, Christopher M.; Yule, Catherine M.; Dykes, Gary A.; Lee, Sui Mae

    2015-01-01

    We report the draft genome sequence of Cellulomonas sp. HZM, isolated from a tropical peat swamp forest. The draft genome size is 3,559,280 bp with a G + C content of 73% and contains 3 rRNA sequences (single copies of 5S, 16S and 23S rRNA). PMID:26484221

  10. Whole genome de novo sequencing and genome annotation of the world popular cultivated edible mushroom, Lentinula edodes.

    PubMed

    Shim, Donghwan; Park, Sin-Gi; Kim, Kangmin; Bae, Wonsil; Lee, Gir Won; Ha, Byeong-Suk; Ro, Hyeon-Su; Kim, Myungkil; Ryoo, Rhim; Rhee, Sung-Keun; Nou, Ill-Sup; Koo, Chang-Duck; Hong, Chang Pyo; Ryu, Hojin

    2016-04-10

    Lentinula edodes, the popular shiitake mushroom, is one of the most important cultivated edible mushrooms. It is used as a food and for medicinal purposes. Here, we present the 46.1 Mb draft genome of L. edodes, comprising 13,028 predicted gene models. The genome assembly consists of 31 scaffolds. Gene annotation provides key information about various signaling pathways and secondary metabolites. This genomic information should help establish the molecular genetic markers for MAS/MAB and increase our understanding of the genome structure and function.

  11. The draft genome sequence and annotation of the desert woodrat Neotoma lepida.

    PubMed

    Campbell, Michael; Oakeson, Kelly F; Yandell, Mark; Halpert, James R; Dearing, Denise

    2016-09-01

    We present the de novo draft genome sequence for a vertebrate mammalian herbivore, the desert woodrat (Neotoma lepida). This species is of ecological and evolutionary interest with respect to ingestion, microbial detoxification and hepatic metabolism of toxic plant secondary compounds from the highly toxic creosote bush (Larrea tridentata) and the juniper shrub (Juniperus monosperma). The draft genome sequence and annotation have been deposited at GenBank under the accession LZPO01000000. PMID:27408812

  12. ArrayPlex: distributed, interactive and programmatic access to genome sequence, annotation, ontology, and analytical toolsets

    PubMed Central

    Killion, Patrick J; Iyer, Vishwanath R

    2008-01-01

    ArrayPlex is a software package that centrally provides a large number of flexible toolsets useful for functional genomics, including microarray data storage, quality assessments, data visualization, gene annotation retrieval, statistical tests, genomic sequence retrieval and motif analysis. It uses a client-server architecture based on open source components, provides graphical, command-line, and programmatic access to all needed resources, and is extensible by virtue of a documented application programming interface. ArrayPlex is available at . PMID:19014503

  13. The draft genome sequence and annotation of the desert woodrat Neotoma lepida.

    PubMed

    Campbell, Michael; Oakeson, Kelly F; Yandell, Mark; Halpert, James R; Dearing, Denise

    2016-09-01

    We present the de novo draft genome sequence for a vertebrate mammalian herbivore, the desert woodrat (Neotoma lepida). This species is of ecological and evolutionary interest with respect to ingestion, microbial detoxification and hepatic metabolism of toxic plant secondary compounds from the highly toxic creosote bush (Larrea tridentata) and the juniper shrub (Juniperus monosperma). The draft genome sequence and annotation have been deposited at GenBank under the accession LZPO01000000.

  14. Draft Genome Sequence and Gene Annotation of Stemphylium lycopersici Strain CIDEFI-216

    PubMed Central

    Franco, Mario E. E.; López, Silvina; Medina, Rocio; Saparrat, Mario C. N.

    2015-01-01

    Stemphylium lycopersici is a plant-pathogenic fungus that is widely distributed throughout the world. In tomatoes, it is one of the etiological agents of gray leaf spot disease. Here, we report the first draft genome sequence of S. lycopersici, including its gene structure and functional annotation. PMID:26404600

  15. Genome Sequence and Annotation of Colletotrichum higginsianum, a Causal Agent of Crucifer Anthracnose Disease

    PubMed Central

    Zampounis, Antonios; Pigné, Sandrine; Dallery, Jean-Félix; Wittenberg, Alexander H. J.; Zhou, Shiguo; Schwartz, David C.

    2016-01-01

    Colletotrichum higginsianum is an ascomycete fungus causing anthracnose disease on numerous cultivated plants in the family Brassicaceae, as well as the model plant Arabidopsis thaliana. We report an assembly of the nuclear genome and gene annotation of this pathogen, which was obtained using a combination of PacBio long-read sequencing and optical mapping. PMID:27540062

  16. Draft Genome Sequence and Gene Annotation of the Uropathogenic Bacterium Proteus mirabilis Pr2921

    PubMed Central

    Giorello, F. M.; Romero, V.; Farias, J.; Scavone, P.; Umpiérrez, A.; Zunino, P.

    2016-01-01

    Here, we report the genome sequence of Proteus mirabilis Pr2921, a uropathogenic bacterium that can cause severe complicated urinary tract infections. After gene annotation, we identified two additional copies of ucaA, one of the most studied fimbrial protein genes, and other fimbriae related-proteins that are not present in P. mirabilis HI4320. PMID:27340058

  17. Genome Sequence and Annotation of Colletotrichum higginsianum, a Causal Agent of Crucifer Anthracnose Disease.

    PubMed

    Zampounis, Antonios; Pigné, Sandrine; Dallery, Jean-Félix; Wittenberg, Alexander H J; Zhou, Shiguo; Schwartz, David C; Thon, Michael R; O'Connell, Richard J

    2016-01-01

    Colletotrichum higginsianum is an ascomycete fungus causing anthracnose disease on numerous cultivated plants in the family Brassicaceae, as well as the model plant Arabidopsis thaliana We report an assembly of the nuclear genome and gene annotation of this pathogen, which was obtained using a combination of PacBio long-read sequencing and optical mapping. PMID:27540062

  18. Genome sequencing and annotation of Aeromonas sp. HZM

    PubMed Central

    Chua, Patric; Har, Zi Mei; Austin, Christopher M.; Yule, Catherine M.; Dykes, Gary A.; Lee, Sui Mae

    2015-01-01

    We report the draft genome sequence of Aeromonas sp. strain HZM, isolated from tropical peat swamp forest soil. The draft genome size is 4,451,364 bp with a G + C content of 61.7% and contains 10 rRNA sequences (eight copies of 5S rRNA genes, single copy of 16S and 23S rRNA each). The genome sequence can be accessed at DDBJ/EMBL/GenBank under the accession no. JEMQ00000000. PMID:26484220

  19. Resources for Biological Annotation of the Drosophila Genome

    SciTech Connect

    Gerald M. Rubin

    2005-08-08

    This project supported seed money for the development of cDNA and genetic resources to support studies of the Drosophila melanogaster genome. Key publications supported by this work that provide additional detail: (1) ''The Drosophila gene collection: identification of putative full-length cDNAs for 70% of D. melanogaster genes''; and (2) ''The Berkeley Drosophila Genome Project gene disruption project: Single P-element insertions mutating 25% of vital Drosophila genes''.

  20. Genome sequencing and annotation of Acinetobacter junii strain MTCC 11364.

    PubMed

    Khatri, Indu; Singh, Nitin Kumar; Subramanian, Srikrishna; Mayilraj, Shanmugam

    2014-12-01

    The genus Acinetobacter consists of 31 validly published species ubiquitously distributed in nature and primarily associated with nosocomial infection. We report the 3.5 Mb draft genome of the Acinetobacter junii strain MTCC 11364. The genome has a G + C content of 38.0% and includes 3 rRNA genes (5S, 23S, 16S) and 64 aminoacyl-tRNA synthetase genes.

  1. Genome sequencing, annotation of Citrobacter freundii strain GTC 09479.

    PubMed

    Kimura, Kazuyuki; Kumar, Shailesh; Takeo, Masahiro; Mayilraj, Shanmugam

    2014-12-01

    We report the 4.9-Mb genome sequence of Citrobacter freundii strain GTC 09479, isolated from urine sample collected during the year 1983 at Gifu University Graduate School of Medicine, Japan. This draft genome consist of 4,899,578 bp with 51.62% G + C, 4,574 predicted CDSs, 72 tRNAs and 10 rRNAs.

  2. Rice Annotation Project Database (RAP-DB): an integrative and interactive database for rice genomics.

    PubMed

    Sakai, Hiroaki; Lee, Sung Shin; Tanaka, Tsuyoshi; Numa, Hisataka; Kim, Jungsok; Kawahara, Yoshihiro; Wakimoto, Hironobu; Yang, Ching-chia; Iwamoto, Masao; Abe, Takashi; Yamada, Yuko; Muto, Akira; Inokuchi, Hachiro; Ikemura, Toshimichi; Matsumoto, Takashi; Sasaki, Takuji; Itoh, Takeshi

    2013-02-01

    The Rice Annotation Project Database (RAP-DB, http://rapdb.dna.affrc.go.jp/) has been providing a comprehensive set of gene annotations for the genome sequence of rice, Oryza sativa (japonica group) cv. Nipponbare. Since the first release in 2005, RAP-DB has been updated several times along with the genome assembly updates. Here, we present our newest RAP-DB based on the latest genome assembly, Os-Nipponbare-Reference-IRGSP-1.0 (IRGSP-1.0), which was released in 2011. We detected 37,869 loci by mapping transcript and protein sequences of 150 monocot species. To provide plant researchers with highly reliable and up to date rice gene annotations, we have been incorporating literature-based manually curated data, and 1,626 loci currently incorporate literature-based annotation data, including commonly used gene names or gene symbols. Transcriptional activities are shown at the nucleotide level by mapping RNA-Seq reads derived from 27 samples. We also mapped the Illumina reads of a Japanese leading japonica cultivar, Koshihikari, and a Chinese indica cultivar, Guangluai-4, to the genome and show alignments together with the single nucleotide polymorphisms (SNPs) and gene functional annotations through a newly developed browser, Short-Read Assembly Browser (S-RAB). We have developed two satellite databases, Plant Gene Family Database (PGFD) and Integrative Database of Cereal Gene Phylogeny (IDCGP), which display gene family and homologous gene relationships among diverse plant species. RAP-DB and the satellite databases offer simple and user-friendly web interfaces, enabling plant and genome researchers to access the data easily and facilitating a broad range of plant research topics.

  3. Mercator: a fast and simple web server for genome scale functional annotation of plant sequence data.

    PubMed

    Lohse, Marc; Nagel, Axel; Herter, Thomas; May, Patrick; Schroda, Michael; Zrenner, Rita; Tohge, Takayuki; Fernie, Alisdair R; Stitt, Mark; Usadel, Björn

    2014-05-01

    Next-generation technologies generate an overwhelming amount of gene sequence data. Efficient annotation tools are required to make these data amenable to functional genomics analyses. The Mercator pipeline automatically assigns functional terms to protein or nucleotide sequences. It uses the MapMan 'BIN' ontology, which is tailored for functional annotation of plant 'omics' data. The classification procedure performs parallel sequence searches against reference databases, compiles the results and computes the most likely MapMan BINs for each query. In the current version, the pipeline relies on manually curated reference classifications originating from the three reference organisms (Arabidopsis, Chlamydomonas, rice), various other plant species that have a reviewed SwissProt annotation, and more than 2000 protein domain and family profiles at InterPro, CDD and KOG. Functional annotations predicted by Mercator achieve accuracies above 90% when benchmarked against manual annotation. In addition to mapping files for direct use in the visualization software MapMan, Mercator provides graphical overview charts, detailed annotation information in a convenient web browser interface and a MapMan-to-GO translation table to export results as GO terms. Mercator is available free of charge via http://mapman.gabipd.org/web/guest/app/Mercator.

  4. Genome sequencing and annotation of Amycolatopsis azurea DSM 43854(T).

    PubMed

    Khatri, Indu; Subramanian, Srikrishna; Mayilraj, Shanmugam

    2014-12-01

    We report the 9.2 Mb genome of the azureomycin A and B antibiotic producing strain Amycolatopsis azurea isolated from a Japanese soil sample. The draft genome of strain DSM 43854(T) consists of 9,223,451 bp with a G + C content of 69.0% and the genome contains 3 rRNA genes (5S-23S-16S) and 58 aminoacyl-tRNA synthetase genes. The homology searches revealed that the PKS gene clusters are supposed to be responsible for the biosynthesis of naptomycin, macbecin, rifamycin, mitomycin, maduropeptin enediyne, neocarzinostatin enediyne, C-1027 enediyne, calicheamicin enediyne, landomycin, simocyclinone, medermycin, granaticin, polyketomycin, teicoplanin, balhimycin, vancomycin, staurosporine, rubradirin and complestatin. PMID:26484067

  5. Genome sequencing and annotation of Amycolatopsis azurea DSM 43854(T).

    PubMed

    Khatri, Indu; Subramanian, Srikrishna; Mayilraj, Shanmugam

    2014-12-01

    We report the 9.2 Mb genome of the azureomycin A and B antibiotic producing strain Amycolatopsis azurea isolated from a Japanese soil sample. The draft genome of strain DSM 43854(T) consists of 9,223,451 bp with a G + C content of 69.0% and the genome contains 3 rRNA genes (5S-23S-16S) and 58 aminoacyl-tRNA synthetase genes. The homology searches revealed that the PKS gene clusters are supposed to be responsible for the biosynthesis of naptomycin, macbecin, rifamycin, mitomycin, maduropeptin enediyne, neocarzinostatin enediyne, C-1027 enediyne, calicheamicin enediyne, landomycin, simocyclinone, medermycin, granaticin, polyketomycin, teicoplanin, balhimycin, vancomycin, staurosporine, rubradirin and complestatin.

  6. Citrus sinensis annotation project (CAP): a comprehensive database for sweet orange genome.

    PubMed

    Wang, Jia; Chen, Dijun; Lei, Yang; Chang, Ji-Wei; Hao, Bao-Hai; Xing, Feng; Li, Sen; Xu, Qiang; Deng, Xiu-Xin; Chen, Ling-Ling

    2014-01-01

    Citrus is one of the most important and widely grown fruit crop with global production ranking firstly among all the fruit crops in the world. Sweet orange accounts for more than half of the Citrus production both in fresh fruit and processed juice. We have sequenced the draft genome of a double-haploid sweet orange (C. sinensis cv. Valencia), and constructed the Citrus sinensis annotation project (CAP) to store and visualize the sequenced genomic and transcriptome data. CAP provides GBrowse-based organization of sweet orange genomic data, which integrates ab initio gene prediction, EST, RNA-seq and RNA-paired end tag (RNA-PET) evidence-based gene annotation. Furthermore, we provide a user-friendly web interface to show the predicted protein-protein interactions (PPIs) and metabolic pathways in sweet orange. CAP provides comprehensive information beneficial to the researchers of sweet orange and other woody plants, which is freely available at http://citrus.hzau.edu.cn/.

  7. BambooGDB: a bamboo genome database with functional annotation and an analysis platform

    PubMed Central

    Zhao, Hansheng; Peng, Zhenhua; Fei, Benhua; Li, Lubin; Hu, Tao; Gao, Zhimin; Jiang, Zehui

    2014-01-01

    Bamboo, as one of the most important non-timber forest products and fastest-growing plants in the world, represents the only major lineage of grasses that is native to forests. Recent success on the first high-quality draft genome sequence of moso bamboo (Phyllostachys edulis) provides new insights on bamboo genetics and evolution. To further extend our understanding on bamboo genome and facilitate future studies on the basis of previous achievements, here we have developed BambooGDB, a bamboo genome database with functional annotation and analysis platform. The de novo sequencing data, together with the full-length complementary DNA and RNA-seq data of moso bamboo composed the main contents of this database. Based on these sequence data, a comprehensively functional annotation for bamboo genome was made. Besides, an analytical platform composed of comparative genomic analysis, protein–protein interactions network, pathway analysis and visualization of genomic data was also constructed. As discovery tools to understand and identify biological mechanisms of bamboo, the platform can be used as a systematic framework for helping and designing experiments for further validation. Moreover, diverse and powerful search tools and a convenient browser were incorporated to facilitate the navigation of these data. As far as we know, this is the first genome database for bamboo. Through integrating high-throughput sequencing data, a full functional annotation and several analysis modules, BambooGDB aims to provide worldwide researchers with a central genomic resource and an extensible analysis platform for bamboo genome. BambooGDB is freely available at http://www.bamboogdb.org/. Database URL: http://www.bamboogdb.org PMID:24602877

  8. High-throughput comparison, functional annotation, and metabolic modeling of plant genomes using the PlantSEED resource.

    PubMed

    Seaver, Samuel M D; Gerdes, Svetlana; Frelin, Océane; Lerma-Ortiz, Claudia; Bradbury, Louis M T; Zallot, Rémi; Hasnain, Ghulam; Niehaus, Thomas D; El Yacoubi, Basma; Pasternak, Shiran; Olson, Robert; Pusch, Gordon; Overbeek, Ross; Stevens, Rick; de Crécy-Lagard, Valérie; Ware, Doreen; Hanson, Andrew D; Henry, Christopher S

    2014-07-01

    The increasing number of sequenced plant genomes is placing new demands on the methods applied to analyze, annotate, and model these genomes. Today's annotation pipelines result in inconsistent gene assignments that complicate comparative analyses and prevent efficient construction of metabolic models. To overcome these problems, we have developed the PlantSEED, an integrated, metabolism-centric database to support subsystems-based annotation and metabolic model reconstruction for plant genomes. PlantSEED combines SEED subsystems technology, first developed for microbial genomes, with refined protein families and biochemical data to assign fully consistent functional annotations to orthologous genes, particularly those encoding primary metabolic pathways. Seamless integration with its parent, the prokaryotic SEED database, makes PlantSEED a unique environment for cross-kingdom comparative analysis of plant and bacterial genomes. The consistent annotations imposed by PlantSEED permit rapid reconstruction and modeling of primary metabolism for all plant genomes in the database. This feature opens the unique possibility of model-based assessment of the completeness and accuracy of gene annotation and thus allows computational identification of genes and pathways that are restricted to certain genomes or need better curation. We demonstrate the PlantSEED system by producing consistent annotations for 10 reference genomes. We also produce a functioning metabolic model for each genome, gapfilling to identify missing annotations and proposing gene candidates for missing annotations. Models are built around an extended biomass composition representing the most comprehensive published to date. To our knowledge, our models are the first to be published for seven of the genomes analyzed. PMID:24927599

  9. Disentangling the Effects of Colocalizing Genomic Annotations to Functionally Prioritize Non-coding Variants within Complex-Trait Loci.

    PubMed

    Trynka, Gosia; Westra, Harm-Jan; Slowikowski, Kamil; Hu, Xinli; Xu, Han; Stranger, Barbara E; Klein, Robert J; Han, Buhm; Raychaudhuri, Soumya

    2015-07-01

    Identifying genomic annotations that differentiate causal from trait-associated variants is essential to fine mapping disease loci. Although many studies have identified non-coding functional annotations that overlap disease-associated variants, these annotations often colocalize, complicating the ability to use these annotations for fine mapping causal variation. We developed a statistical approach (Genomic Annotation Shifter [GoShifter]) to assess whether enriched annotations are able to prioritize causal variation. GoShifter defines the null distribution of an annotation overlapping an allele by locally shifting annotations; this approach is less sensitive to biases arising from local genomic structure than commonly used enrichment methods that depend on SNP matching. Local shifting also allows GoShifter to identify independent causal effects from colocalizing annotations. Using GoShifter, we confirmed that variants in expression quantitative trail loci drive gene-expression changes though DNase-I hypersensitive sites (DHSs) near transcription start sites and independently through 3' UTR regulation. We also showed that (1) 15%-36% of trait-associated loci map to DHSs independently of other annotations; (2) loci associated with breast cancer and rheumatoid arthritis harbor potentially causal variants near the summits of histone marks rather than full peak bodies; (3) variants associated with height are highly enriched in embryonic stem cell DHSs; and (4) we can effectively prioritize causal variation at specific loci.

  10. Genome sequencing and annotation of multidrug resistant Mycobacterium tuberculosis (MDR-TB) PR10 strain.

    PubMed

    Halim, Mohd Zakihalani A; Jaafar, Mohammad Maaruf; Teh, Lay Kek; Ismail, Mohamad Izwan; Lee, Lian Shien; Ngeow, Yun Fong; Nor, Norazmi Mohd; Zainuddin, Zainul Fadziruddin; Tang, Thean Hock; Najimudin, Mohd Nazalan Mohd; Salleh, Mohd Zaki

    2016-03-01

    Here, we report the draft genome sequence and annotation of a multidrug resistant Mycobacterium tuberculosis strain PR10 (MDR-TB PR10) isolated from a patient diagnosed with tuberculosis. The size of the draft genome MDR-TB PR10 is 4.34 Mbp with 65.6% of G + C content and consists of 4637 predicted genes. The determinants were categorized by RAST into 400 subsystems with 4286 coding sequences and 50 RNAs. The whole genome shotgun project has been deposited at DDBJ/EMBL/GenBank under the accession number CP010968.

  11. An Updated genome annotation for the model marine bacterium Ruegeria pomeroyi DSS-3

    PubMed Central

    2014-01-01

    When the genome of Ruegeria pomeroyi DSS-3 was published in 2004, it represented the first sequence from a heterotrophic marine bacterium. Over the last ten years, the strain has become a valuable model for understanding the cycling of sulfur and carbon in the ocean. To ensure that this genome remains useful, we have updated 69 genes to incorporate functional annotations based on new experimental data, and improved the identification of 120 protein-coding regions based on proteomic and transcriptomic data. We review the progress made in understanding the biology of R. pomeroyi DSS-3 and list the changes made to the genome. PMID:25780504

  12. An Updated genome annotation for the model marine bacterium Ruegeria pomeroyi DSS-3.

    PubMed

    Rivers, Adam R; Smith, Christa B; Moran, Mary Ann

    2014-01-01

    When the genome of Ruegeria pomeroyi DSS-3 was published in 2004, it represented the first sequence from a heterotrophic marine bacterium. Over the last ten years, the strain has become a valuable model for understanding the cycling of sulfur and carbon in the ocean. To ensure that this genome remains useful, we have updated 69 genes to incorporate functional annotations based on new experimental data, and improved the identification of 120 protein-coding regions based on proteomic and transcriptomic data. We review the progress made in understanding the biology of R. pomeroyi DSS-3 and list the changes made to the genome.

  13. Genome sequencing and annotation of multidrug resistant Mycobacterium tuberculosis (MDR-TB) PR10 strain

    PubMed Central

    Halim, Mohd Zakihalani A.; Jaafar, Mohammad Maaruf; Teh, Lay Kek; Ismail, Mohamad Izwan; Lee, Lian Shien; Ngeow, Yun Fong; Nor, Norazmi Mohd; Zainuddin, Zainul Fadziruddin; Tang, Thean Hock; Najimudin, Mohd Nazalan Mohd; Salleh, Mohd Zaki

    2016-01-01

    Here, we report the draft genome sequence and annotation of a multidrug resistant Mycobacterium tuberculosis strain PR10 (MDR-TB PR10) isolated from a patient diagnosed with tuberculosis. The size of the draft genome MDR-TB PR10 is 4.34 Mbp with 65.6% of G + C content and consists of 4637 predicted genes. The determinants were categorized by RAST into 400 subsystems with 4286 coding sequences and 50 RNAs. The whole genome shotgun project has been deposited at DDBJ/EMBL/GenBank under the accession number CP010968. PMID:26981419

  14. Genome sequencing and annotation of multidrug resistant Mycobacterium tuberculosis (MDR-TB) PR10 strain.

    PubMed

    Halim, Mohd Zakihalani A; Jaafar, Mohammad Maaruf; Teh, Lay Kek; Ismail, Mohamad Izwan; Lee, Lian Shien; Ngeow, Yun Fong; Nor, Norazmi Mohd; Zainuddin, Zainul Fadziruddin; Tang, Thean Hock; Najimudin, Mohd Nazalan Mohd; Salleh, Mohd Zaki

    2016-03-01

    Here, we report the draft genome sequence and annotation of a multidrug resistant Mycobacterium tuberculosis strain PR10 (MDR-TB PR10) isolated from a patient diagnosed with tuberculosis. The size of the draft genome MDR-TB PR10 is 4.34 Mbp with 65.6% of G + C content and consists of 4637 predicted genes. The determinants were categorized by RAST into 400 subsystems with 4286 coding sequences and 50 RNAs. The whole genome shotgun project has been deposited at DDBJ/EMBL/GenBank under the accession number CP010968. PMID:26981419

  15. Curation of the genome annotation of Pichia pastoris (Komagataella phaffii) CBS7435 from gene level to protein function.

    PubMed

    Valli, Minoska; Tatto, Nadine E; Peymann, Armin; Gruber, Clemens; Landes, Nils; Ekker, Heinz; Thallinger, Gerhard G; Mattanovich, Diethard; Gasser, Brigitte; Graf, Alexandra B

    2016-09-01

    As manually curated and non-automated BLAST analysis of the published Pichia pastoris genome sequences revealed many differences between the gene annotations of the strains GS115 and CBS7435, RNA-Seq analysis, supported by proteomics, was performed to improve the genome annotation. Detailed analysis of sequence alignment and protein domain predictions were made to extend the functional genome annotation to all P. pastoris sequences. This allowed the identification of 492 new ORFs, 4916 hypothetical UTRs and the correction of 341 incorrect ORF predictions, which were mainly due to the presence of upstream ATG or erroneous intron predictions. Moreover, 175 previously erroneously annotated ORFs need to be removed from the annotation. In total, we have annotated 5325 ORFs. Regarding the functionality of those genes, we improved all gene and protein descriptions. Thereby, the percentage of ORFs with functional annotation was increased from 48% to 73%. Furthermore, we defined functional groups, covering 25 biological cellular processes of interest, by grouping all genes that are part of the defined process. All data are presented in the newly launched genome browser and database available at www.pichiagenome.org In summary, we present a wide spectrum of curation of the P. pastoris genome annotation from gene level to protein function. PMID:27388471

  16. Woods: A fast and accurate functional annotator and classifier of genomic and metagenomic sequences.

    PubMed

    Sharma, Ashok K; Gupta, Ankit; Kumar, Sanjiv; Dhakan, Darshan B; Sharma, Vineet K

    2015-07-01

    Functional annotation of the gigantic metagenomic data is one of the major time-consuming and computationally demanding tasks, which is currently a bottleneck for the efficient analysis. The commonly used homology-based methods to functionally annotate and classify proteins are extremely slow. Therefore, to achieve faster and accurate functional annotation, we have developed an orthology-based functional classifier 'Woods' by using a combination of machine learning and similarity-based approaches. Woods displayed a precision of 98.79% on independent genomic dataset, 96.66% on simulated metagenomic dataset and >97% on two real metagenomic datasets. In addition, it performed >87 times faster than BLAST on the two real metagenomic datasets. Woods can be used as a highly efficient and accurate classifier with high-throughput capability which facilitates its usability on large metagenomic datasets. PMID:25863333

  17. Genome Annotation of Burkholderia sp. SJ98 with Special Focus on Chemotaxis Genes

    PubMed Central

    Kumar, Shailesh; Vikram, Surendra; Raghava, Gajendra Pal Singh

    2013-01-01

    Burkholderia sp. strain SJ98 has the chemotactic activity towards nitroaromatic and chloronitroaromatic compounds. Recently our group published draft genome of strain SJ98. In this study, we further sequence and annotate the genome of stain SJ98 to exploit the potential of this bacterium. We specifically annotate its chemotaxis genes and methyl accepting chemotaxis proteins. Genome of Burkholderia sp. SJ98 was annotated using PGAAP pipeline that predicts 7,268 CDSs, 52 tRNAs and 3 rRNAs. Our analysis based on phylogenetic and comparative genomics suggest that Burkholderia sp. YI23 is closest neighbor of the strain SJ98. The genes involved in the chemotaxis of strain SJ98 were compared with genes of closely related Burkholderia strains (i.e. YI23, CCGE 1001, CCGE 1002, CCGE 1003) and with well characterized bacterium E. coli K12. It was found that strain SJ98 has 37 che genes including 19 methyl accepting chemotaxis proteins that involved in sensing of different attractants. Chemotaxis genes have been found in a cluster along with the flagellar motor proteins. We also developed a web resource that provides comprehensive information on strain SJ98 that includes all analysis data (http://crdd.osdd.net/raghava/genomesrs/burkholderia/). PMID:23940608

  18. Genome annotation of Burkholderia sp. SJ98 with special focus on chemotaxis genes.

    PubMed

    Kumar, Shailesh; Vikram, Surendra; Raghava, Gajendra Pal Singh

    2013-01-01

    Burkholderia sp. strain SJ98 has the chemotactic activity towards nitroaromatic and chloronitroaromatic compounds. Recently our group published draft genome of strain SJ98. In this study, we further sequence and annotate the genome of stain SJ98 to exploit the potential of this bacterium. We specifically annotate its chemotaxis genes and methyl accepting chemotaxis proteins. Genome of Burkholderia sp. SJ98 was annotated using PGAAP pipeline that predicts 7,268 CDSs, 52 tRNAs and 3 rRNAs. Our analysis based on phylogenetic and comparative genomics suggest that Burkholderia sp. YI23 is closest neighbor of the strain SJ98. The genes involved in the chemotaxis of strain SJ98 were compared with genes of closely related Burkholderia strains (i.e. YI23, CCGE 1001, CCGE 1002, CCGE 1003) and with well characterized bacterium E. coli K12. It was found that strain SJ98 has 37 che genes including 19 methyl accepting chemotaxis proteins that involved in sensing of different attractants. Chemotaxis genes have been found in a cluster along with the flagellar motor proteins. We also developed a web resource that provides comprehensive information on strain SJ98 that includes all analysis data (http://crdd.osdd.net/raghava/genomesrs/burkholderia/).

  19. Whole genome profiling physical map and ancestral annotation of tobacco Hicks Broadleaf.

    PubMed

    Sierro, Nicolas; van Oeveren, Jan; van Eijk, Michiel J T; Martin, Florian; Stormo, Keith E; Peitsch, Manuel C; Ivanov, Nikolai V

    2013-09-01

    Genomics-based breeding of economically important crops such as banana, coffee, cotton, potato, tobacco and wheat is often hampered by genome size, polyploidy and high repeat content. We adapted sequence-based whole-genome profiling (WGP™) technology to obtain insight into the polyploidy of the model plant Nicotiana tabacum (tobacco). N. tabacum is assumed to originate from a hybridization event between ancestors of Nicotiana sylvestris and Nicotiana tomentosiformis approximately 200,000 years ago. This resulted in tobacco having a haploid genome size of 4500 million base pairs, approximately four times larger than the related tomato (Solanum lycopersicum) and potato (Solanum tuberosum) genomes. In this study, a physical map containing 9750 contigs of bacterial artificial chromosomes (BACs) was constructed. The mean contig size was 462 kbp, and the calculated genome coverage equaled the estimated tobacco genome size. We used a method for determination of the ancestral origin of the genome by annotation of WGP sequence tags. This assignment agreed with the ancestral annotation available from the tobacco genetic map, and may be used to investigate the evolution of homoeologous genome segments after polyploidization. The map generated is an essential scaffold for the tobacco genome. We propose the combination of WGP physical mapping technology and tag profiling of ancestral lines as a generally applicable method to elucidate the ancestral origin of genome segments of polyploid species. The physical mapping of genes and their origins will enable application of biotechnology to polyploid plants aimed at accelerating and increasing the precision of breeding for abiotic and biotic stress resistance.

  20. Whole genome profiling physical map and ancestral annotation of tobacco Hicks Broadleaf

    PubMed Central

    Sierro, Nicolas; van Oeveren, Jan; van Eijk, Michiel J T; Martin, Florian; Stormo, Keith E; Peitsch, Manuel C; Ivanov, Nikolai V

    2013-01-01

    Genomics-based breeding of economically important crops such as banana, coffee, cotton, potato, tobacco and wheat is often hampered by genome size, polyploidy and high repeat content. We adapted sequence-based whole-genome profiling (WGP™) technology to obtain insight into the polyploidy of the model plant Nicotiana tabacum (tobacco). N. tabacum is assumed to originate from a hybridization event between ancestors of Nicotiana sylvestris and Nicotiana tomentosiformis approximately 200 000 years ago. This resulted in tobacco having a haploid genome size of 4500 million base pairs, approximately four times larger than the related tomato (Solanum lycopersicum) and potato (Solanum tuberosum) genomes. In this study, a physical map containing 9750 contigs of bacterial artificial chromosomes (BACs) was constructed. The mean contig size was 462 kbp, and the calculated genome coverage equaled the estimated tobacco genome size. We used a method for determination of the ancestral origin of the genome by annotation of WGP sequence tags. This assignment agreed with the ancestral annotation available from the tobacco genetic map, and may be used to investigate the evolution of homoeologous genome segments after polyploidization. The map generated is an essential scaffold for the tobacco genome. We propose the combination of WGP physical mapping technology and tag profiling of ancestral lines as a generally applicable method to elucidate the ancestral origin of genome segments of polyploid species. The physical mapping of genes and their origins will enable application of biotechnology to polyploid plants aimed at accelerating and increasing the precision of breeding for abiotic and biotic stress resistance. PMID:23672264

  1. Functional annotation from the genome sequence of the giant panda.

    PubMed

    Huo, Tong; Zhang, Yinjie; Lin, Jianping

    2012-08-01

    The giant panda is one of the most critically endangered species due to the fragmentation and loss of its habitat. Studying the functions of proteins in this animal, especially specific trait-related proteins, is therefore necessary to protect the species. In this work, the functions of these proteins were investigated using the genome sequence of the giant panda. Data on 21,001 proteins and their functions were stored in the Giant Panda Protein Database, in which the proteins were divided into two groups: 20,179 proteins whose functions can be predicted by GeneScan formed the known-function group, whereas 822 proteins whose functions cannot be predicted by GeneScan comprised the unknown-function group. For the known-function group, we further classified the proteins by molecular function, biological process, cellular component, and tissue specificity. For the unknown-function group, we developed a strategy in which the proteins were filtered by cross-Blast to identify panda-specific proteins under the assumption that proteins related to the panda-specific traits in the unknown-function group exist. After this filtering procedure, we identified 32 proteins (2 of which are membrane proteins) specific to the giant panda genome as compared against the dog and horse genomes. Based on their amino acid sequences, these 32 proteins were further analyzed by functional classification using SVM-Prot, motif prediction using MyHits, and interacting protein prediction using the Database of Interacting Proteins. Nineteen proteins were predicted to be zinc-binding proteins, thus affecting the activities of nucleic acids. The 32 panda-specific proteins will be further investigated by structural and functional analysis.

  2. Genomic variant annotation and prioritization with ANNOVAR and wANNOVAR

    PubMed Central

    Yang, Hui; Wang, Kai

    2016-01-01

    Recent developments in sequencing techniques have enabled rapid and high-throughput generation of sequence data, democratizing the ability to compile information on large amounts of genetic variations in individual laboratories. However, there is a growing gap between the generation of raw sequencing data and the extraction of meaningful biological information. Here, we describe a protocol to use the ANNOVAR (ANNOtate VARiation) software to facilitate fast and easy variant annotations, including gene-based, region-based and filter-based annotations on a variant call format (VCF) file generated from human genomes. We further describe a protocol for gene-based annotation of a newly sequenced nonhuman species. Finally, we describe how to use a user-friendly and easily accessible web server called wANNOVAR to prioritize candidate genes for a Mendelian disease. The variant annotation protocols take 5–30 min of computer time, depending on the size of the variant file, and 5–10 min of hands-on time. In summary, through the command-line tool and the web server, these protocols provide a convenient means to analyze genetic variants generated in humans and other species. PMID:26379229

  3. Polymorphism Identification and Improved Genome Annotation of Brassica rapa Through Deep RNA Sequencing

    PubMed Central

    Devisetty, Upendra Kumar; Covington, Michael F.; Tat, An V.; Lekkala, Saradadevi; Maloof, Julin N.

    2014-01-01

    The mapping and functional analysis of quantitative traits in Brassica rapa can be greatly improved with the availability of physically positioned, gene-based genetic markers and accurate genome annotation. In this study, deep transcriptome RNA sequencing (RNA-Seq) of Brassica rapa was undertaken with two objectives: SNP detection and improved transcriptome annotation. We performed SNP detection on two varieties that are parents of a mapping population to aid in development of a marker system for this population and subsequent development of high-resolution genetic map. An improved Brassica rapa transcriptome was constructed to detect novel transcripts and to improve the current genome annotation. This is useful for accurate mRNA abundance and detection of expression QTL (eQTLs) in mapping populations. Deep RNA-Seq of two Brassica rapa genotypes—R500 (var. trilocularis, Yellow Sarson) and IMB211 (a rapid cycling variety)—using eight different tissues (root, internode, leaf, petiole, apical meristem, floral meristem, silique, and seedling) grown across three different environments (growth chamber, greenhouse and field) and under two different treatments (simulated sun and simulated shade) generated 2.3 billion high-quality Illumina reads. A total of 330,995 SNPs were identified in transcribed regions between the two genotypes with an average frequency of one SNP in every 200 bases. The deep RNA-Seq reassembled Brassica rapa transcriptome identified 44,239 protein-coding genes. Compared with current gene models of B. rapa, we detected 3537 novel transcripts, 23,754 gene models had structural modifications, and 3655 annotated proteins changed. Gaps in the current genome assembly of B. rapa are highlighted by our identification of 780 unmapped transcripts. All the SNPs, annotations, and predicted transcripts can be viewed at http://phytonetworks.ucdavis.edu/. PMID:25122667

  4. Polymorphism identification and improved genome annotation of Brassica rapa through Deep RNA sequencing.

    PubMed

    Devisetty, Upendra Kumar; Covington, Michael F; Tat, An V; Lekkala, Saradadevi; Maloof, Julin N

    2014-08-12

    The mapping and functional analysis of quantitative traits in Brassica rapa can be greatly improved with the availability of physically positioned, gene-based genetic markers and accurate genome annotation. In this study, deep transcriptome RNA sequencing (RNA-Seq) of Brassica rapa was undertaken with two objectives: SNP detection and improved transcriptome annotation. We performed SNP detection on two varieties that are parents of a mapping population to aid in development of a marker system for this population and subsequent development of high-resolution genetic map. An improved Brassica rapa transcriptome was constructed to detect novel transcripts and to improve the current genome annotation. This is useful for accurate mRNA abundance and detection of expression QTL (eQTLs) in mapping populations. Deep RNA-Seq of two Brassica rapa genotypes-R500 (var. trilocularis, Yellow Sarson) and IMB211 (a rapid cycling variety)-using eight different tissues (root, internode, leaf, petiole, apical meristem, floral meristem, silique, and seedling) grown across three different environments (growth chamber, greenhouse and field) and under two different treatments (simulated sun and simulated shade) generated 2.3 billion high-quality Illumina reads. A total of 330,995 SNPs were identified in transcribed regions between the two genotypes with an average frequency of one SNP in every 200 bases. The deep RNA-Seq reassembled Brassica rapa transcriptome identified 44,239 protein-coding genes. Compared with current gene models of B. rapa, we detected 3537 novel transcripts, 23,754 gene models had structural modifications, and 3655 annotated proteins changed. Gaps in the current genome assembly of B. rapa are highlighted by our identification of 780 unmapped transcripts. All the SNPs, annotations, and predicted transcripts can be viewed at http://phytonetworks.ucdavis.edu/.

  5. Genome-wide functional annotation and structural verification of metabolic ORFeome of Chlamydomonas reinhardtii

    PubMed Central

    2011-01-01

    Background Recent advances in the field of metabolic engineering have been expedited by the availability of genome sequences and metabolic modelling approaches. The complete sequencing of the C. reinhardtii genome has made this unicellular alga a good candidate for metabolic engineering studies; however, the annotation of the relevant genes has not been validated and the much-needed metabolic ORFeome is currently unavailable. We describe our efforts on the functional annotation of the ORF models released by the Joint Genome Institute (JGI), prediction of their subcellular localizations, and experimental verification of their structural annotation at the genome scale. Results We assigned enzymatic functions to the translated JGI ORF models of C. reinhardtii by reciprocal BLAST searches of the putative proteome against the UniProt and AraCyc enzyme databases. The best match for each translated ORF was identified and the EC numbers were transferred onto the ORF models. Enzymatic functional assignment was extended to the paralogs of the ORFs by clustering ORFs using BLASTCLUST. In total, we assigned 911 enzymatic functions, including 886 EC numbers, to 1,427 transcripts. We further annotated the enzymatic ORFs by prediction of their subcellular localization. The majority of the ORFs are predicted to be compartmentalized in the cytosol and chloroplast. We verified the structure of the metabolism-related ORF models by reverse transcription-PCR of the functionally annotated ORFs. Following amplification and cloning, we carried out 454FLX and Sanger sequencing of the ORFs. Based on alignment of the 454FLX reads to the ORF predicted sequences, we obtained more than 90% coverage for more than 80% of the ORFs. In total, 1,087 ORF models were verified by 454 and Sanger sequencing methods. We obtained expression evidence for 98% of the metabolic ORFs in the algal cells grown under constant light in the presence of acetate. Conclusions We functionally annotated approximately 1

  6. Identification of novel biomass-degrading enzymes from genomic dark matter: Populating genomic sequence space with functional annotation.

    PubMed

    Piao, Hailan; Froula, Jeff; Du, Changbin; Kim, Tae-Wan; Hawley, Erik R; Bauer, Stefan; Wang, Zhong; Ivanova, Nathalia; Clark, Douglas S; Klenk, Hans-Peter; Hess, Matthias

    2014-08-01

    Although recent nucleotide sequencing technologies have significantly enhanced our understanding of microbial genomes, the function of ∼35% of genes identified in a genome currently remains unknown. To improve the understanding of microbial genomes and consequently of microbial processes it will be crucial to assign a function to this "genomic dark matter." Due to the urgent need for additional carbohydrate-active enzymes for improved production of transportation fuels from lignocellulosic biomass, we screened the genomes of more than 5,500 microorganisms for hypothetical proteins that are located in the proximity of already known cellulases. We identified, synthesized and expressed a total of 17 putative cellulase genes with insufficient sequence similarity to currently known cellulases to be identified as such using traditional sequence annotation techniques that rely on significant sequence similarity. The recombinant proteins of the newly identified putative cellulases were subjected to enzymatic activity assays to verify their hydrolytic activity towards cellulose and lignocellulosic biomass. Eleven (65%) of the tested enzymes had significant activity towards at least one of the substrates. This high success rate highlights that a gene context-based approach can be used to assign function to genes that are otherwise categorized as "genomic dark matter" and to identify biomass-degrading enzymes that have little sequence similarity to already known cellulases. The ability to assign function to genes that have no related sequence representatives with functional annotation will be important to enhance our understanding of microbial processes and to identify microbial proteins for a wide range of applications.

  7. Functional phylogenomics analysis of bacteria and archaea using consistent genome annotation with UniFam

    DOE PAGES

    Chai, Juanjuan; Kora, Guruprasad; Ahn, Tae-Hyuk; Hyatt, Doug; Pan, Chongle

    2014-10-09

    To supply some background, phylogenetic studies have provided detailed knowledge on the evolutionary mechanisms of genes and species in Bacteria and Archaea. However, the evolution of cellular functions, represented by metabolic pathways and biological processes, has not been systematically characterized. Many clades in the prokaryotic tree of life have now been covered by sequenced genomes in GenBank. This enables a large-scale functional phylogenomics study of many computationally inferred cellular functions across all sequenced prokaryotes. Our results show a total of 14,727 GenBank prokaryotic genomes were re-annotated using a new protein family database, UniFam, to obtain consistent functional annotations for accuratemore » comparison. The functional profile of a genome was represented by the biological process Gene Ontology (GO) terms in its annotation. The GO term enrichment analysis differentiated the functional profiles between selected archaeal taxa. 706 prokaryotic metabolic pathways were inferred from these genomes using Pathway Tools and MetaCyc. The consistency between the distribution of metabolic pathways in the genomes and the phylogenetic tree of the genomes was measured using parsimony scores and retention indices. The ancestral functional profiles at the internal nodes of the phylogenetic tree were reconstructed to track the gains and losses of metabolic pathways in evolutionary history. In conclusion, our functional phylogenomics analysis shows divergent functional profiles of taxa and clades. Such function-phylogeny correlation stems from a set of clade-specific cellular functions with low parsimony scores. On the other hand, many cellular functions are sparsely dispersed across many clades with high parsimony scores. These different types of cellular functions have distinct evolutionary patterns reconstructed from the prokaryotic tree.« less

  8. Functional phylogenomics analysis of bacteria and archaea using consistent genome annotation with UniFam

    SciTech Connect

    Chai, Juanjuan; Kora, Guruprasad; Ahn, Tae-Hyuk; Hyatt, Doug; Pan, Chongle

    2014-10-09

    To supply some background, phylogenetic studies have provided detailed knowledge on the evolutionary mechanisms of genes and species in Bacteria and Archaea. However, the evolution of cellular functions, represented by metabolic pathways and biological processes, has not been systematically characterized. Many clades in the prokaryotic tree of life have now been covered by sequenced genomes in GenBank. This enables a large-scale functional phylogenomics study of many computationally inferred cellular functions across all sequenced prokaryotes. Our results show a total of 14,727 GenBank prokaryotic genomes were re-annotated using a new protein family database, UniFam, to obtain consistent functional annotations for accurate comparison. The functional profile of a genome was represented by the biological process Gene Ontology (GO) terms in its annotation. The GO term enrichment analysis differentiated the functional profiles between selected archaeal taxa. 706 prokaryotic metabolic pathways were inferred from these genomes using Pathway Tools and MetaCyc. The consistency between the distribution of metabolic pathways in the genomes and the phylogenetic tree of the genomes was measured using parsimony scores and retention indices. The ancestral functional profiles at the internal nodes of the phylogenetic tree were reconstructed to track the gains and losses of metabolic pathways in evolutionary history. In conclusion, our functional phylogenomics analysis shows divergent functional profiles of taxa and clades. Such function-phylogeny correlation stems from a set of clade-specific cellular functions with low parsimony scores. On the other hand, many cellular functions are sparsely dispersed across many clades with high parsimony scores. These different types of cellular functions have distinct evolutionary patterns reconstructed from the prokaryotic tree.

  9. Expanded microbial genome coverage and improved protein family annotation in the COG database

    PubMed Central

    Galperin, Michael Y.; Makarova, Kira S.; Wolf, Yuri I.; Koonin, Eugene V.

    2015-01-01

    Microbial genome sequencing projects produce numerous sequences of deduced proteins, only a small fraction of which have been or will ever be studied experimentally. This leaves sequence analysis as the only feasible way to annotate these proteins and assign to them tentative functions. The Clusters of Orthologous Groups of proteins (COGs) database (http://www.ncbi.nlm.nih.gov/COG/), first created in 1997, has been a popular tool for functional annotation. Its success was largely based on (i) its reliance on complete microbial genomes, which allowed reliable assignment of orthologs and paralogs for most genes; (ii) orthology-based approach, which used the function(s) of the characterized member(s) of the protein family (COG) to assign function(s) to the entire set of carefully identified orthologs and describe the range of potential functions when there were more than one; and (iii) careful manual curation of the annotation of the COGs, aimed at detailed prediction of the biological function(s) for each COG while avoiding annotation errors and overprediction. Here we present an update of the COGs, the first since 2003, and a comprehensive revision of the COG annotations and expansion of the genome coverage to include representative complete genomes from all bacterial and archaeal lineages down to the genus level. This re-analysis of the COGs shows that the original COG assignments had an error rate below 0.5% and allows an assessment of the progress in functional genomics in the past 12 years. During this time, functions of many previously uncharacterized COGs have been elucidated and tentative functional assignments of many COGs have been validated, either by targeted experiments or through the use of high-throughput methods. A particularly important development is the assignment of functions to several widespread, conserved proteins many of which turned out to participate in translation, in particular rRNA maturation and tRNA modification. The new version of the

  10. Advancing Trypanosoma brucei genome annotation through ribosome profiling and spliced leader mapping.

    PubMed

    Parsons, Marilyn; Ramasamy, Gowthaman; Vasconcelos, Elton J R; Jensen, Bryan C; Myler, Peter J

    2015-08-01

    Since the initial publication of the trypanosomatid genomes, curation has been ongoing. Here we make use of existing Trypanosoma brucei ribosome profiling data to provide evidence of ribosome occupancy (and likely translation) of mRNAs from 225 currently unannotated coding sequences (CDSs). A small number of these putative genes correspond to extra copies of previously annotated genes, but 85% are novel. The median size of these novels CDSs is small (81 aa), indicating that past annotation work has excelled at detecting large CDSs. Of the unique CDSs confirmed here, over half have candidate orthologues in other trypanosomatid genomes, most of which were not yet annotated as protein-coding genes. Nonetheless, approximately one-third of the new CDSs were found only in T. brucei subspecies. Using ribosome footprints, RNA-Seq and spliced leader mapping data, we updated previous work to definitively revise the start sites for 414 CDSs as compared to the current gene models. The data pointed to several regions of the genome that had sequence errors that altered coding region boundaries. Finally, we consolidated this data with our previous work to propose elimination of 683 putative genes as protein-coding and arrive at a view of the translatome of slender bloodstream and procyclic culture form T. brucei.

  11. Genome sequencing and annotation of Amycolatopsis vancoresmycina strain DSM 44592(T).

    PubMed

    Kaur, Navjot; Kumar, Shailesh; Mayilraj, Shanmugam

    2014-12-01

    We report the 9.0-Mb draft genome of Amycolatopsis vancoresmycina strain DSM 44592(T), isolated from Indian soil sample; produces antibiotic vancoresmycin. Draft genome of strain DSM44592T consists of 9,037,069 bp with a G+C content of 71.79% and 8340 predicted protein coding genes and 57 RNAs. RAST annotation indicates that strains Streptomyces sp. AA4 (score 521), Saccharomonospora viridis DSM 43017 (score 400) and Actinosynnema mirum DSM 43827 (score 372) are the closest neighbors of the strain DSM 44592(T).

  12. The physics of DNA and the annotation of the Plasmodium falciparum genome.

    PubMed

    Yeramian, E

    2000-09-19

    A gene identification procedure is formulated, based on large-scale structural analyses of genomic sequences. The structural property is the physical - thermal - stability of the DNA double-helix, as described by the classical helix-coil model. The analyses are detailed for the Plasmodium falciparum genome, which represents one of the most difficult cases for the gene identification problem (notably because of the extreme AT-richness of the genome). In this genome, the coding domains (either uninterrupted genes or exons in split genes) are accurately identified as regions of high thermal stability. The conclusion is based on the study of the available cloned genes, of which 17 examples are described in detail. These examples demonstrate that the physical criterion is valid for the detection of coding regions whose lengths extend from a few base pairs up to several thousand base pairs. Accordingly, the structural analyses can provide a powerful and convenient tool for the identification of complex genes in the P. falciparum genome. The limits of such a scheme are discussed. The gene identification procedure is applied to the completely sequenced chromosomes (2 and 3), and the results are compared with the database annotations. The structural analyses suggest more or less extensive revision to the annotations, and also allow new putative genes to be identified in the chromosome sequences. Several examples of such new genes are described in detail.

  13. A bioinformatics approach to reanalyze the genome annotation of kinetoplastid protozoan parasite Leishmania donovani.

    PubMed

    Pawar, Harsh; Kulkarni, Aditi; Dixit, Tanwi; Chaphekar, Deepa; Patole, Milind S

    2014-12-01

    Leishmania donovani is a kinetoplastid protozoan parasite which causes the fatal disease visceral leishmaniasis in humans. Genome sequencing of L. donovani revealed information about the arrangement of genes and genome architecture. After curation of the genome sequence, many genes in L. donovani were assigned as truncated or "partial" genes by the genome sequencing group. In the present study, we have carried out an extensive analysis and attempted to improve the gene models of these partial genes. Our analysis resulted in the identification of 308 partial genes in L. donovani, which were further categorized as C-terminal extensions, joining of genes, tandemly repeated paralogs and wrong chromosomal assignments. We have analyzed each of these genes from these categories and have improved the annotation of existing gene models in L. donovani. Some of these corrections have been confirmed by mass spectrometry derived peptide data from our previous comparative proteogenomics study in L. donovani.

  14. Manual Gene Ontology annotation workflow at the Mouse Genome Informatics Database.

    PubMed

    Drabkin, Harold J; Blake, Judith A

    2012-01-01

    The Mouse Genome Database, the Gene Expression Database and the Mouse Tumor Biology database are integrated components of the Mouse Genome Informatics (MGI) resource (http://www.informatics.jax.org). The MGI system presents both a consensus view and an experimental view of the knowledge concerning the genetics and genomics of the laboratory mouse. From genotype to phenotype, this information resource integrates information about genes, sequences, maps, expression analyses, alleles, strains and mutant phenotypes. Comparative mammalian data are also presented particularly in regards to the use of the mouse as a model for the investigation of molecular and genetic components of human diseases. These data are collected from literature curation as well as downloads of large datasets (SwissProt, LocusLink, etc.). MGI is one of the founding members of the Gene Ontology (GO) and uses the GO for functional annotation of genes. Here, we discuss the workflow associated with manual GO annotation at MGI, from literature collection to display of the annotations. Peer-reviewed literature is collected mostly from a set of journals available electronically. Selected articles are entered into a master bibliography and indexed to one of eight areas of interest such as 'GO' or 'homology' or 'phenotype'. Each article is then either indexed to a gene already contained in the database or funneled through a separate nomenclature database to add genes. The master bibliography and associated indexing provide information for various curator-reports such as 'papers selected for GO that refer to genes with NO GO annotation'. Once indexed, curators who have expertise in appropriate disciplines enter pertinent information. MGI makes use of several controlled vocabularies that ensure uniform data encoding, enable robust analysis and support the construction of complex queries. These vocabularies range from pick-lists to structured vocabularies such as the GO. All data associations are supported

  15. Bacillus pumilus SAFR-032 Genome Revisited: Sequence Update and Re-Annotation

    PubMed Central

    Stepanov, Victor G.; Tirumalai, Madhan R.; Montazari, Saied; Checinska, Aleksandra; Venkateswaran, Kasthuri

    2016-01-01

    Bacillus pumilus strain SAFR-032 is a non-pathogenic spore-forming bacterium exhibiting an anomalously high persistence in bactericidal environments. In its dormant state, it is capable of withstanding doses of ultraviolet (UV) radiation or hydrogen peroxide, which are lethal for the vast majority of microorganisms. This unusual resistance profile has made SAFR-032 a reference strain for studies of bacterial spore resistance. The complete genome sequence of B. pumilus SAFR-032 was published in 2007 early in the genomics era. Since then, the SAFR-032 strain has frequently been used as a source of genetic/genomic information that was regarded as representative of the entire B. pumilus species group. Recently, our ongoing studies of conservation of gene distribution patterns in the complete genomes of various B. pumilus strains revealed indications of misassembly in the B. pumilus SAFR-032 genome. Synteny-driven local genome resequencing confirmed that the original SAFR-032 sequence contained assembly errors associated with long sequence repeats. The genome sequence was corrected according to the new findings. In addition, a significantly improved annotation is now available. Gene orders were compared and portions of the genome arrangement were found to be similar in a wide spectrum of Bacillus strains. PMID:27351589

  16. A field guide to whole-genome sequencing, assembly and annotation

    PubMed Central

    Ekblom, Robert; Wolf, Jochen B W

    2014-01-01

    Genome sequencing projects were long confined to biomedical model organisms and required the concerted effort of large consortia. Rapid progress in high-throughput sequencing technology and the simultaneous development of bioinformatic tools have democratized the field. It is now within reach for individual research groups in the eco-evolutionary and conservation community to generate de novo draft genome sequences for any organism of choice. Because of the cost and considerable effort involved in such an endeavour, the important first step is to thoroughly consider whether a genome sequence is necessary for addressing the biological question at hand. Once this decision is taken, a genome project requires careful planning with respect to the organism involved and the intended quality of the genome draft. Here, we briefly review the state of the art within this field and provide a step-by-step introduction to the workflow involved in genome sequencing, assembly and annotation with particular reference to large and complex genomes. This tutorial is targeted at scientists with a background in conservation genetics, but more generally, provides useful practical guidance for researchers engaging in whole-genome sequencing projects. PMID:25553065

  17. Re-annotation of the genome sequence of Helicobacter pylori 26695.

    PubMed

    Resende, Tiago; Correia, Daniela M; Rocha, Miguel; Rocha, Isabel

    2013-11-15

    Helicobacter pylori is a pathogenic bacterium that colonizes the human epithelia, causing duodenal and gastric ulcers, and gastric cancer. The genome of H. pylori 26695 has been previously sequenced and annotated. In addition, two genome-scale metabolic models have been developed. In order to maintain accurate and relevant information on coding sequences (CDS) and to retrieve new information, the assignment of new functions to Helicobacter pylori 26695s genes was performed in this work. The use of software tools, on-line databases and an annotation pipeline for inspecting each gene allowed the attribution of validated EC numbers and TC numbers to metabolic genes encoding enzymes and transport proteins, respectively. 1212 genes encoding proteins were identified in this annotation, being 712 metabolic genes and 500 non-metabolic, while 191 new functions were assignment to the CDS of this bacterium. This information provides relevant biological information for the scientific community dealing with this organism and can be used as the basis for a new metabolic model reconstruction.

  18. Genome sequencing and annotation of a Campylobacter coli strain isolated from milk with multidrug resistance.

    PubMed

    Liu, Kun C; Jinneman, Karen C; Neal-McKinney, Jason; Wu, Wen-Hsin; Rice, Daniel H

    2016-06-01

    As the most prevalent bacterial cause of human gastroenteritis, food-borne Campylobacter infections pose a serious threat to public health. Whole Genome Sequencing (WGS) is a tool providing quick and inexpensive approaches for analysis of food-borne pathogen epidemics. Here we report the WGS and annotation of a Campylobacter coli strain, FNW20G12, which was isolated from milk in the United States in 1997 and carries multidrug resistance. The draft genome of FNW20G12 (DDBJ/ENA/GenBank accession number LWIH00000000) contains 1, 855,435 bp (GC content 31.4%) with 1902 annotated coding regions, 48 RNAs and resistance to aminoglycoside, beta-lactams, tetracycline, as well as fluoroquinolones. There are very few genome reports of C. coli from dairy products with multidrug resistance. Here the draft genome of FNW20G12, a C. coli strain isolated from raw milk, is presented to aid in the epidemiology study of C. coli antimicrobial resistance and role in foodborne outbreak. PMID:27257607

  19. The Fast Changing Landscape of Sequencing Technologies and Their Impact on Microbial Genome Assemblies and Annotation

    SciTech Connect

    Mavromatis, K; Land, Miriam L; Brettin, Thomas S; Quest, Daniel J; Copeland, A; Clum, Alicia; Goodwin, Lynne A.; Woyke, Tanja; Lapidus, Alla L.; Klenk, Hans-Peter; Cottingham, Robert W; Kyrpides, Nikos C

    2012-01-01

    Background: The emergence of next generation sequencing (NGS) has provided the means for rapid and high throughput sequencing and data generation at low cost, while concomitantly creating a new set of challenges. The number of available assembled microbial genomes continues to grow rapidly and their quality reflects the quality of the sequencing technology used, but also of the analysis software employed for assembly and annotation. Methodology/Principal Findings: In this work, we have explored the quality of the microbial draft genomes across various sequencing technologies. We have compared the draft and finished assemblies of 133 microbial genomes sequenced at the Department of Energy-Joint Genome Institute and finished at the Los Alamos National Laboratory using a variety of combinations of sequencing technologies, reflecting the transition of the institute from Sanger-based sequencing platforms to NGS platforms. The quality of the public assemblies and of the associated gene annotations was evaluated using various metrics. Results obtained with the different sequencing technologies, as well as their effects on downstream processes, were analyzed. Our results demonstrate that the Illumina HiSeq 2000 sequencing system, the primary sequencing technology currently used for de novo genome sequencing and assembly at JGI, has various advantages in terms of total sequence throughput and cost, but it also introduces challenges for the downstream analyses. In all cases assembly results although on average are of high quality, need to be viewed critically and consider sources of errors in them prior to analysis. Conclusion: These data follow the evolution of microbial sequencing and downstream processing at the JGI from draft genome sequences with large gaps corresponding to missing genes of significant biological role to assemblies with multiple small gaps (Illumina) and finally to assemblies that generate almost complete genomes (Illumina+PacBio).

  20. SigmoID: a user-friendly tool for improving bacterial genome annotation through analysis of transcription control signals

    PubMed Central

    Damienikan, Aliaksandr U.

    2016-01-01

    The majority of bacterial genome annotations are currently automated and based on a ‘gene by gene’ approach. Regulatory signals and operon structures are rarely taken into account which often results in incomplete and even incorrect gene function assignments. Here we present SigmoID, a cross-platform (OS X, Linux and Windows) open-source application aiming at simplifying the identification of transcription regulatory sites (promoters, transcription factor binding sites and terminators) in bacterial genomes and providing assistance in correcting annotations in accordance with regulatory information. SigmoID combines a user-friendly graphical interface to well known command line tools with a genome browser for visualising regulatory elements in genomic context. Integrated access to online databases with regulatory information (RegPrecise and RegulonDB) and web-based search engines speeds up genome analysis and simplifies correction of genome annotation. We demonstrate some features of SigmoID by constructing a series of regulatory protein binding site profiles for two groups of bacteria: Soft Rot Enterobacteriaceae (Pectobacterium and Dickeya spp.) and Pseudomonas spp. Furthermore, we inferred over 900 transcription factor binding sites and alternative sigma factor promoters in the annotated genome of Pectobacterium atrosepticum. These regulatory signals control putative transcription units covering about 40% of the P. atrosepticum chromosome. Reviewing the annotation in cases where it didn’t fit with regulatory information allowed us to correct product and gene names for over 300 loci. PMID:27257541

  1. SigmoID: a user-friendly tool for improving bacterial genome annotation through analysis of transcription control signals.

    PubMed

    Nikolaichik, Yevgeny; Damienikan, Aliaksandr U

    2016-01-01

    The majority of bacterial genome annotations are currently automated and based on a 'gene by gene' approach. Regulatory signals and operon structures are rarely taken into account which often results in incomplete and even incorrect gene function assignments. Here we present SigmoID, a cross-platform (OS X, Linux and Windows) open-source application aiming at simplifying the identification of transcription regulatory sites (promoters, transcription factor binding sites and terminators) in bacterial genomes and providing assistance in correcting annotations in accordance with regulatory information. SigmoID combines a user-friendly graphical interface to well known command line tools with a genome browser for visualising regulatory elements in genomic context. Integrated access to online databases with regulatory information (RegPrecise and RegulonDB) and web-based search engines speeds up genome analysis and simplifies correction of genome annotation. We demonstrate some features of SigmoID by constructing a series of regulatory protein binding site profiles for two groups of bacteria: Soft Rot Enterobacteriaceae (Pectobacterium and Dickeya spp.) and Pseudomonas spp. Furthermore, we inferred over 900 transcription factor binding sites and alternative sigma factor promoters in the annotated genome of Pectobacterium atrosepticum. These regulatory signals control putative transcription units covering about 40% of the P. atrosepticum chromosome. Reviewing the annotation in cases where it didn't fit with regulatory information allowed us to correct product and gene names for over 300 loci.

  2. SigmoID: a user-friendly tool for improving bacterial genome annotation through analysis of transcription control signals.

    PubMed

    Nikolaichik, Yevgeny; Damienikan, Aliaksandr U

    2016-01-01

    The majority of bacterial genome annotations are currently automated and based on a 'gene by gene' approach. Regulatory signals and operon structures are rarely taken into account which often results in incomplete and even incorrect gene function assignments. Here we present SigmoID, a cross-platform (OS X, Linux and Windows) open-source application aiming at simplifying the identification of transcription regulatory sites (promoters, transcription factor binding sites and terminators) in bacterial genomes and providing assistance in correcting annotations in accordance with regulatory information. SigmoID combines a user-friendly graphical interface to well known command line tools with a genome browser for visualising regulatory elements in genomic context. Integrated access to online databases with regulatory information (RegPrecise and RegulonDB) and web-based search engines speeds up genome analysis and simplifies correction of genome annotation. We demonstrate some features of SigmoID by constructing a series of regulatory protein binding site profiles for two groups of bacteria: Soft Rot Enterobacteriaceae (Pectobacterium and Dickeya spp.) and Pseudomonas spp. Furthermore, we inferred over 900 transcription factor binding sites and alternative sigma factor promoters in the annotated genome of Pectobacterium atrosepticum. These regulatory signals control putative transcription units covering about 40% of the P. atrosepticum chromosome. Reviewing the annotation in cases where it didn't fit with regulatory information allowed us to correct product and gene names for over 300 loci. PMID:27257541

  3. Sequencing and annotated analysis of full genome of Holstein breed bull.

    PubMed

    Kõks, Sulev; Reimann, Ene; Lilleoja, Rutt; Lättekivi, Freddy; Salumets, Andres; Reemann, Paula; Jaakma, Ülle

    2014-08-01

    In the present study, we describe the deep sequencing and structural analysis of the Holstein breed bull genome. Our aim was to receive a high-quality Holstein bull genome reference sequence and to describe different types of variations in its genome compared to Hereford breed as a reference. We generated four mate-paired libraries and one fragment library from 30 μg of genomic DNA. Colour space fasta were mapped and paired to the reference cow (Bos taurus) genome assembly from Oct. 2011 (Baylor 4.6.1/bosTau7). Initial sequencing resulted in the 4,864,054,296 of 50-bp reads. Average mapping efficiency was 71.7 % and altogether 3,494,534,136 reads and 157,928,163,086 bp were successfully mapped, resulting in 60 × coverage. This is the highest coverage for bovine genome published so far. Tertiary analysis found 6,362,988 SNPs in the bull's genome, 4,045,889 heterozygous and 2,317,099 homozygous variants. Annotation revealed that 4,330,337 of all discovered SNPs were annotated in the dbSNP database (build 137) and therefore 2,032,651 SNPs were novel. Large indel variations accounted for the 245,947,845 bp of the variation in entire genome and their number was 312,879. We also found that small indels (number was 633,310) accounted for the total variation of 2,542,552 nucleotides in the genome. Only 106,768 small indels were listed in the dbSNP. Finally, we identified 2,758 inversions in the genome of the bull covering in total 23,099,054 bp of genome's variation. The largest inversion was 87,440 bp in size. In conclusion, the present study discovered different types of novel variants in bull's genome after high-coverage sequencing. Better knowledge of the functions of these variations is needed.

  4. Genome, Functional Gene Annotation, and Nuclear Transformation of the Heterokont Oleaginous Alga Nannochloropsis oceanica CCMP1779

    PubMed Central

    Tsai, Chia-Hong; Bullard, Blair; Cornish, Adam J.; Harvey, Christopher; Reca, Ida-Barbara; Thornburg, Chelsea; Achawanantakun, Rujira; Buehl, Christopher J.; Campbell, Michael S.; Cavalier, David; Childs, Kevin L.; Clark, Teresa J.; Deshpande, Rahul; Erickson, Erika; Armenia Ferguson, Ann; Handee, Witawas; Kong, Que; Li, Xiaobo; Liu, Bensheng; Lundback, Steven; Peng, Cheng; Roston, Rebecca L.; Sanjaya; Simpson, Jeffrey P.; TerBush, Allan; Warakanont, Jaruswan; Zäuner, Simone; Farre, Eva M.; Hegg, Eric L.; Jiang, Ning; Kuo, Min-Hao; Lu, Yan; Niyogi, Krishna K.; Ohlrogge, John; Osteryoung, Katherine W.; Shachar-Hill, Yair; Sears, Barbara B.; Sun, Yanni; Takahashi, Hideki; Yandell, Mark; Shiu, Shin-Han; Benning, Christoph

    2012-01-01

    Unicellular marine algae have promise for providing sustainable and scalable biofuel feedstocks, although no single species has emerged as a preferred organism. Moreover, adequate molecular and genetic resources prerequisite for the rational engineering of marine algal feedstocks are lacking for most candidate species. Heterokonts of the genus Nannochloropsis naturally have high cellular oil content and are already in use for industrial production of high-value lipid products. First success in applying reverse genetics by targeted gene replacement makes Nannochloropsis oceanica an attractive model to investigate the cell and molecular biology and biochemistry of this fascinating organism group. Here we present the assembly of the 28.7 Mb genome of N. oceanica CCMP1779. RNA sequencing data from nitrogen-replete and nitrogen-depleted growth conditions support a total of 11,973 genes, of which in addition to automatic annotation some were manually inspected to predict the biochemical repertoire for this organism. Among others, more than 100 genes putatively related to lipid metabolism, 114 predicted transcription factors, and 109 transcriptional regulators were annotated. Comparison of the N. oceanica CCMP1779 gene repertoire with the recently published N. gaditana genome identified 2,649 genes likely specific to N. oceanica CCMP1779. Many of these N. oceanica–specific genes have putative orthologs in other species or are supported by transcriptional evidence. However, because similarity-based annotations are limited, functions of most of these species-specific genes remain unknown. Aside from the genome sequence and its analysis, protocols for the transformation of N. oceanica CCMP1779 are provided. The availability of genomic and transcriptomic data for Nannochloropsis oceanica CCMP1779, along with efficient transformation protocols, provides a blueprint for future detailed gene functional analysis and genetic engineering of Nannochloropsis species by a growing

  5. Discovery and annotation of small proteins using genomics, proteomics and computational approaches

    SciTech Connect

    Yang, Xiaohan; Tschaplinski, Timothy J.; Hurst, Gregory B.; Jawdy, Sara; Abraham, Paul E.; Lankford, Patricia K.; Adams, Rachel M.; Shah, Manesh B.; Hettich, Robert L.; Lindquist, Erika; Kalluri, Udaya C.; Gunter, Lee E.; Pennacchio, Christa; Tuskan, Gerald A.

    2011-03-02

    Small proteins (10 200 amino acids aa in length) encoded by short open reading frames (sORF) play important regulatory roles in various biological processes, including tumor progression, stress response, flowering, and hormone signaling. However, ab initio discovery of small proteins has been relatively overlooked. Recent advances in deep transcriptome sequencing make it possible to efficiently identify sORFs at the genome level. In this study, we obtained 2.6 million expressed sequence tag (EST) reads from Populus deltoides leaf transcriptome and reconstructed full-length transcripts from the EST sequences. We identified an initial set of 12,852 sORFs encoding proteins of 10 200 aa in length. Three computational approaches were then used to enrich for bona fide protein-coding sORFs from the initial sORF set: (1) codingpotential prediction, (2) evolutionary conservation between P. deltoides and other plant species, and (3) gene family clustering within P. deltoides. As a result, a high-confidence sORF candidate set containing 1469 genes was obtained. Analysis of the protein domains, non-protein-coding RNA motifs, sequence length distribution, and protein mass spectrometry data supported this high-confidence sORF set. In the high-confidence sORF candidate set, known protein domains were identified in 1282 genes (higher-confidence sORF candidate set), out of which 611 genes, designated as highest-confidence candidate sORF set, were supported by proteomics data. Of the 611 highest-confidence candidate sORF genes, 56 were new to the current Populus genome annotation. This study not only demonstrates that there are potential sORF candidates to be annotated in sequenced genomes, but also presents an efficient strategy for discovery of sORFs in species with no genome annotation yet available.

  6. An integrated approach for genome annotation of the eukaryotic thermophile Chaetomium thermophilum

    PubMed Central

    Bock, Thomas; Chen, Wei-Hua; Ori, Alessandro; Malik, Nayab; Silva-Martin, Noella; Huerta-Cepas, Jaime; Powell, Sean T.; Kastritis, Panagiotis L.; Smyshlyaev, Georgy; Vonkova, Ivana; Kirkpatrick, Joanna; Doerks, Tobias; Nesme, Leo; Baßler, Jochen; Kos, Martin; Hurt, Ed; Carlomagno, Teresa; Gavin, Anne-Claude; Barabas, Orsolya; Müller, Christoph W.; van Noort, Vera; Beck, Martin; Bork, Peer

    2014-01-01

    The thermophilic fungus Chaetomium thermophilum holds great promise for structural biology. To increase the efficiency of its biochemical and structural characterization and to explore its thermophilic properties beyond those of individual proteins, we obtained transcriptomics and proteomics data, and integrated them with computational annotation methods and a multitude of biochemical experiments conducted by the structural biology community. We considerably improved the genome annotation of Chaetomium thermophilum and characterized the transcripts and expression of thousands of genes. We furthermore show that the composition and structure of the expressed proteome of Chaetomium thermophilum is similar to its mesophilic relatives. Data were deposited in a publicly available repository and provide a rich source to the structural biology community. PMID:25398899

  7. Algal Functional Annotation Tool from the DOE-UCLA Institute for Genomics and Proteomics

    DOE Data Explorer

    Lopez, David

    The Algal Functional Annotation Tool is a bioinformatics resource to visualize pathway maps, identify enriched biological terms, or convert gene identifiers to elucidate biological function in silico. These types of analysis have been catered to support lists of gene identifiers, such as those coming from transcriptome gene expression analysis. By analyzing the functional annotation of an interesting set of genes, common biological motifs may be elucidated and a first-pass analysis can point further research in the right direction. Currently, the following databases have been parsed, processed, and added to the tool: 1( Kyoto Encyclopedia of Genes and Genomes (KEGG) Pathways Database, 2) MetaCyc Encyclopedia of Metabolic Pathways, 3) Panther Pathways Database, 4) Reactome Pathways Database, 5) Gene Ontology, 6) MapMan Ontology, 7) KOG (Eukaryotic Clusters of Orthologous Groups), 5)Pfam, 6) InterPro.

  8. Functional annotations for the Saccharomyces cerevisiae genome: the knowns and the known unknowns

    PubMed Central

    Christie, Karen R.; Hong, Eurie L.; Cherry, J. Michael

    2011-01-01

    The quest to characterize each of the genes of the yeast Saccharomyces cerevisiae has propelled the development and application of novel high-throughput (HTP) experimental techniques. To handle the enormous amount of information generated by these techniques, new bioinformatics tools and resources are needed. Gene Ontology (GO) annotations curated by the Saccharomyces Genome Database (SGD) have facilitated the development of algorithms that analyze HTP data and help predict functions for poorly characterized genes in S. cerevisiae and other organisms. Here, we describe how published results are incorporated into GO annotations at SGD and why researchers can benefit from using these resources wisely to analyze their HTP data and predict gene functions. PMID:19577472

  9. GO-FAANG meeting: a Gathering On Functional Annotation of Animal Genomes.

    PubMed

    Tuggle, Christopher K; Giuffra, Elisabetta; White, Stephen N; Clarke, Laura; Zhou, Huaijun; Ross, Pablo J; Acloque, Hervé; Reecy, James M; Archibald, Alan; Bellone, Rebecca R; Boichard, Michèle; Chamberlain, Amanda; Cheng, Hans; Crooijmans, Richard P M A; Delany, Mary E; Finno, Carrie J; Groenen, Martien A M; Hayes, Ben; Lunney, Joan K; Petersen, Jessica L; Plastow, Graham S; Schmidt, Carl J; Song, Jiuzhou; Watson, Mick

    2016-10-01

    The Functional Annotation of Animal Genomes (FAANG) Consortium recently held a Gathering On FAANG (GO-FAANG) Workshop in Washington, DC on October 7-8, 2015. This consortium is a grass-roots organization formed to advance the annotation of newly assembled genomes of domesticated and non-model organisms (www.faang.org). The workshop gathered together from around the world a group of 100+ genome scientists, administrators, representatives of funding agencies and commodity groups to discuss the latest advancements of the consortium, new perspectives, next steps and implementation plans. The workshop was streamed live and recorded, and all talks, along with speaker slide presentations, are available at www.faang.org. In this report, we describe the major activities and outcomes of this meeting. We also provide updates on ongoing efforts to implement discussions and decisions taken at GO-FAANG to guide future FAANG activities. In summary, reference datasets are being established under pilot projects; plans for tissue sets, morphological classification and methods of sample collection for different tissues were organized; and core assays and data and meta-data analysis standards were established.

  10. GO-FAANG meeting: a Gathering On Functional Annotation of Animal Genomes.

    PubMed

    Tuggle, Christopher K; Giuffra, Elisabetta; White, Stephen N; Clarke, Laura; Zhou, Huaijun; Ross, Pablo J; Acloque, Hervé; Reecy, James M; Archibald, Alan; Bellone, Rebecca R; Boichard, Michèle; Chamberlain, Amanda; Cheng, Hans; Crooijmans, Richard P M A; Delany, Mary E; Finno, Carrie J; Groenen, Martien A M; Hayes, Ben; Lunney, Joan K; Petersen, Jessica L; Plastow, Graham S; Schmidt, Carl J; Song, Jiuzhou; Watson, Mick

    2016-10-01

    The Functional Annotation of Animal Genomes (FAANG) Consortium recently held a Gathering On FAANG (GO-FAANG) Workshop in Washington, DC on October 7-8, 2015. This consortium is a grass-roots organization formed to advance the annotation of newly assembled genomes of domesticated and non-model organisms (www.faang.org). The workshop gathered together from around the world a group of 100+ genome scientists, administrators, representatives of funding agencies and commodity groups to discuss the latest advancements of the consortium, new perspectives, next steps and implementation plans. The workshop was streamed live and recorded, and all talks, along with speaker slide presentations, are available at www.faang.org. In this report, we describe the major activities and outcomes of this meeting. We also provide updates on ongoing efforts to implement discussions and decisions taken at GO-FAANG to guide future FAANG activities. In summary, reference datasets are being established under pilot projects; plans for tissue sets, morphological classification and methods of sample collection for different tissues were organized; and core assays and data and meta-data analysis standards were established. PMID:27453069

  11. The SOFG Anatomy Entry List (SAEL): An Annotation Tool for Functional Genomics Data

    PubMed Central

    Parkinson, Helen; Aitken, Stuart; Baldock, Richard A.; Bard, Jonathan B. L.; Burger, Albert; Hayamizu, Terry F.; Rector, Alan; Ringwald, Martin; Rogers, Jeremy; Rosse, Cornelius; Stoeckert, Christian J.

    2004-01-01

    A great deal of data in functional genomics studies needs to be annotated with low-resolution anatomical terms. For example, gene expression assays based on manually dissected samples (microarray, SAGE, etc.) need high-level anatomical terms to describe sample origin. First-pass annotation in high-throughput assays (e.g. large-scale in situ gene expression screens or phenotype screens) and bibliographic applications, such as selection of keywords, would also benefit from a minimum set of standard anatomical terms. Although only simple terms are required, the researcher faces serious practical problems of inconsistency and confusion, given the different aims and the range of complexity of existing anatomy ontologies. A Standards and Ontologies for Functional Genomics (SOFG) group therefore initiated discussions between several of the major anatomical ontologies for higher vertebrates. As we report here, one result of these discussions is a simple, accessible, controlled vocabulary of gross anatomical terms, the SOFG Anatomy Entry List (SAEL). The SAEL is available from http://www.sofg.org and is intended as a resource for biologists, curators, bioinformaticians and developers of software supporting functional genomics. It can be used directly for annotation in the contexts described above. Importantly, each term is linked to the corresponding term in each of the major anatomy ontologies. Where the simple list does not provide enough detail or sophistication, therefore, the researcher can use the SAEL to choose the appropriate ontology and move directly to the relevant term as an entry point. The SAEL links will also be used to support computational access to the respective ontologies. PMID:18629134

  12. The Saccharomyces Genome Database: Gene Product Annotation of Function, Process, and Component.

    PubMed

    Cherry, J Michael

    2015-12-01

    An ontology is a highly structured form of controlled vocabulary. Each entry in the ontology is commonly called a term. These terms are used when talking about an annotation. However, each term has a definition that, like the definition of a word found within a dictionary, provides the complete usage and detailed explanation of the term. It is critical to consult a term's definition because the distinction between terms can be subtle. The use of ontologies in biology started as a way of unifying communication between scientific communities and to provide a standard dictionary for different topics, including molecular functions, biological processes, mutant phenotypes, chemical properties and structures. The creation of ontology terms and their definitions often requires debate to reach agreement but the result has been a unified descriptive language used to communicate knowledge. In addition to terms and definitions, ontologies require a relationship used to define the type of connection between terms. In an ontology, a term can have more than one parent term, the term above it in an ontology, as well as more than one child, the term below it in the ontology. Many ontologies are used to construct annotations in the Saccharomyces Genome Database (SGD), as in all modern biological databases; however, Gene Ontology (GO), a descriptive system used to categorize gene function, is the most extensively used ontology in SGD annotations. Examples included in this protocol illustrate the structure and features of this ontology.

  13. ISsaga is an ensemble of web-based methods for high throughput identification and semi-automatic annotation of insertion sequences in prokaryotic genomes.

    PubMed

    Varani, Alessandro M; Siguier, Patricia; Gourbeyre, Edith; Charneau, Vincent; Chandler, Mick

    2011-01-01

    Insertion sequences (ISs) play a key role in prokaryotic genome evolution but are seldom well annotated. We describe a web application pipeline, ISsaga (http://issaga.biotoul.fr/ISsaga/issaga_index.php), that provides computational tools and methods for high-quality IS annotation. It uses established ISfinder annotation standards and permits rapid processing of single or multiple prokaryote genomes. ISsaga provides general prediction and annotation tools, information on genome context of individual ISs and a graphical overview of IS distribution around the genome of interest.

  14. The development of PIPA: an integrated and automated pipeline for genome-wide protein function annotation

    PubMed Central

    Yu, Chenggang; Zavaljevski, Nela; Desai, Valmik; Johnson, Seth; Stevens, Fred J; Reifman, Jaques

    2008-01-01

    low recall (33.0%). Our consensus algorithm for GO annotation is based on the computation and propagation of likelihood scores associated with GO terms. The test results suggest that, for a given recall, the application of the consensus algorithm yields higher precision than when consensus is not used. Conclusion The algorithms implemented in PIPA provide automated genome-wide protein function annotation based on reconciled predictions from multiple resources. PMID:18221520

  15. SNP annotation-based whole genomic prediction and selection: an application to feed efficiency and its component traits in pigs.

    PubMed

    Do, D N; Janss, L L G; Jensen, J; Kadarmideen, H N

    2015-05-01

    The study investigated genetic architecture and predictive ability using genomic annotation of residual feed intake (RFI) and its component traits (daily feed intake [DFI], ADG, and back fat [BF]). A total of 1,272 Duroc pigs had both genotypic and phenotypic records, and the records were split into a training (968 pigs) and a validation dataset (304 pigs) by assigning records as before and after January 1, 2012, respectively. SNP were annotated by 14 different classes using Ensembl variant effect prediction. Predictive accuracy and prediction bias were calculated using Bayesian Power LASSO, Bayesian A, B, and Cπ, and genomic BLUP (GBLUP) methods. Predictive accuracy ranged from 0.508 to 0.531, 0.506 to 0.532, 0.276 to 0.357, and 0.308 to 0.362 for DFI, RFI, ADG, and BF, respectively. BayesCπ100.1 increased accuracy slightly compared to the GBLUP model and other methods. The contribution per SNP to total genomic variance was similar among annotated classes across different traits. Predictive performance of SNP classes did not significantly differ from randomized SNP groups. Genomic prediction has accuracy comparable to observed phenotype, and use of genomic prediction can be cost effective by replacing feed intake measurement. Genomic annotation had less impact on predictive accuracy traits considered here but may be different for other traits. It is the first study to provide useful insights into biological classes of SNP driving the whole genomic prediction for complex traits in pigs.

  16. Integrative annotation of variants from 1092 humans: application to cancer genomics.

    PubMed

    Khurana, Ekta; Fu, Yao; Colonna, Vincenza; Mu, Xinmeng Jasmine; Kang, Hyun Min; Lappalainen, Tuuli; Sboner, Andrea; Lochovsky, Lucas; Chen, Jieming; Harmanci, Arif; Das, Jishnu; Abyzov, Alexej; Balasubramanian, Suganthi; Beal, Kathryn; Chakravarty, Dimple; Challis, Daniel; Chen, Yuan; Clarke, Declan; Clarke, Laura; Cunningham, Fiona; Evani, Uday S; Flicek, Paul; Fragoza, Robert; Garrison, Erik; Gibbs, Richard; Gümüs, Zeynep H; Herrero, Javier; Kitabayashi, Naoki; Kong, Yong; Lage, Kasper; Liluashvili, Vaja; Lipkin, Steven M; MacArthur, Daniel G; Marth, Gabor; Muzny, Donna; Pers, Tune H; Ritchie, Graham R S; Rosenfeld, Jeffrey A; Sisu, Cristina; Wei, Xiaomu; Wilson, Michael; Xue, Yali; Yu, Fuli; Dermitzakis, Emmanouil T; Yu, Haiyuan; Rubin, Mark A; Tyler-Smith, Chris; Gerstein, Mark

    2013-10-01

    Interpreting variants, especially noncoding ones, in the increasing number of personal genomes is challenging. We used patterns of polymorphisms in functionally annotated regions in 1092 humans to identify deleterious variants; then we experimentally validated candidates. We analyzed both coding and noncoding regions, with the former corroborating the latter. We found regions particularly sensitive to mutations ("ultrasensitive") and variants that are disruptive because of mechanistic effects on transcription-factor binding (that is, "motif-breakers"). We also found variants in regions with higher network centrality tend to be deleterious. Insertions and deletions followed a similar pattern to single-nucleotide variants, with some notable exceptions (e.g., certain deletions and enhancers). On the basis of these patterns, we developed a computational tool (FunSeq), whose application to ~90 cancer genomes reveals nearly a hundred candidate noncoding drivers. PMID:24092746

  17. Integrative annotation of variants from 1092 humans: application to cancer genomics.

    PubMed

    Khurana, Ekta; Fu, Yao; Colonna, Vincenza; Mu, Xinmeng Jasmine; Kang, Hyun Min; Lappalainen, Tuuli; Sboner, Andrea; Lochovsky, Lucas; Chen, Jieming; Harmanci, Arif; Das, Jishnu; Abyzov, Alexej; Balasubramanian, Suganthi; Beal, Kathryn; Chakravarty, Dimple; Challis, Daniel; Chen, Yuan; Clarke, Declan; Clarke, Laura; Cunningham, Fiona; Evani, Uday S; Flicek, Paul; Fragoza, Robert; Garrison, Erik; Gibbs, Richard; Gümüs, Zeynep H; Herrero, Javier; Kitabayashi, Naoki; Kong, Yong; Lage, Kasper; Liluashvili, Vaja; Lipkin, Steven M; MacArthur, Daniel G; Marth, Gabor; Muzny, Donna; Pers, Tune H; Ritchie, Graham R S; Rosenfeld, Jeffrey A; Sisu, Cristina; Wei, Xiaomu; Wilson, Michael; Xue, Yali; Yu, Fuli; Dermitzakis, Emmanouil T; Yu, Haiyuan; Rubin, Mark A; Tyler-Smith, Chris; Gerstein, Mark

    2013-10-01

    Interpreting variants, especially noncoding ones, in the increasing number of personal genomes is challenging. We used patterns of polymorphisms in functionally annotated regions in 1092 humans to identify deleterious variants; then we experimentally validated candidates. We analyzed both coding and noncoding regions, with the former corroborating the latter. We found regions particularly sensitive to mutations ("ultrasensitive") and variants that are disruptive because of mechanistic effects on transcription-factor binding (that is, "motif-breakers"). We also found variants in regions with higher network centrality tend to be deleterious. Insertions and deletions followed a similar pattern to single-nucleotide variants, with some notable exceptions (e.g., certain deletions and enhancers). On the basis of these patterns, we developed a computational tool (FunSeq), whose application to ~90 cancer genomes reveals nearly a hundred candidate noncoding drivers.

  18. New local potential useful for genome annotation and 3D modeling

    SciTech Connect

    Chandonia, John-Marc; Cohen, Fred E.

    2003-07-17

    A new potential energy function representing the conformational preferences of sequentially local regions of a protein backbone is presented. This potential is derived from secondary structure probabilities such as those produced by neural network-based prediction methods. The potential is applied to the problem of remote homolog identification, in combination with a distance dependent inter-residue potential and position-based scoring matrices. This fold recognition jury is implemented in a Java application called JThread. These methods are benchmarked on several test sets, including one released entirely after development and parameterization of JThread. In benchmark tests to identify known folds structurally similar (but not identical) to the native structure of a sequence, JThread performs significantly better than PSI-BLAST, with 10 percent more structures correctly identified as the most likely structural match in a fold library, and 20 percent more structures correctly narrowed down to a set of five possible candidates. JThread also significantly improves the average sequence alignment accuracy, from 53 percent to 62 percent of residues correctly aligned. Reliable fold assignments and alignments are identified, making the method useful for genome annotation. JThread is applied to predicted open reading frames (ORFs) from the genomes of Mycoplasma genitalium and Drosophila melanogaster, identifying 20 new structural annotations in the former and 801 in the latter.

  19. Phydbac (phylogenomic display of bacterial genes): An interactive resource for the annotation of bacterial genomes.

    PubMed

    Enault, François; Suhre, Karsten; Poirot, Olivier; Abergel, Chantal; Claverie, Jean-Michel

    2003-07-01

    Phydbac is a web interactive resource based on phylogenomic profiling, designed to help microbiologists to annotate bacterial proteins. Phylogenomic annotation is based on the assumption that functionally linked protein-coding genes must evolve in a coordinated manner. The detection of subsets of co-evolving genes within a given genome involves the computation of protein sequence conservation profiles across a spectrum of microbial species, followed by the identification of significant pairwise correlations between them. Many ongoing studies are devoted to the problem of computing the most biologically significant phylogenomic profiles and how best identifying clusters of 'functionally interacting' genes. Here we introduce a web tool, Phydbac, allowing the dynamic construction of phylogenomic profiles of protein sequences of interest and their interactive display. In addition, Phydbac can identify Escherichia coli proteins exhibiting the evolution pattern most similar to arbitrary query protein sequences, hence providing functional hints for open reading frames (ORFs) of hypothetical or unknown function. The phylogenomic profiles of all E.coli K-12 protein-coding genes are pre-computed, allowing queries about E.coli genes to be answered instantaneously. The profiles and phylogenomic neighborhoods are computed using an original method shown to perform better than previous ones. An extension of Phydbac, including precomputed profiles for all available bacterial genomes (including major pathogens) will soon be available. Phydbac can be accessed at: http://igs-server.cnrs-mrs.fr/phydbac/.

  20. Construction and Annotation of a High Density SNP Linkage Map of the Atlantic Salmon (Salmo salar) Genome.

    PubMed

    Tsai, Hsin Y; Robledo, Diego; Lowe, Natalie R; Bekaert, Michael; Taggart, John B; Bron, James E; Houston, Ross D

    2016-07-07

    High density linkage maps are useful tools for fine-scale mapping of quantitative trait loci, and characterization of the recombination landscape of a species' genome. Genomic resources for Atlantic salmon (Salmo salar) include a well-assembled reference genome, and high density single nucleotide polymorphism (SNP) arrays. Our aim was to create a high density linkage map, and to align it with the reference genome assembly. Over 96,000 SNPs were mapped and ordered on the 29 salmon linkage groups using a pedigreed population comprising 622 fish from 60 nuclear families, all genotyped with the 'ssalar01' high density SNP array. The number of SNPs per group showed a high positive correlation with physical chromosome length (r = 0.95). While the order of markers on the genetic and physical maps was generally consistent, areas of discrepancy were identified. Approximately 6.5% of the previously unmapped reference genome sequence was assigned to chromosomes using the linkage map. Male recombination rate was lower than females across the vast majority of the genome, but with a notable peak in subtelomeric regions. Finally, using RNA-Seq data to annotate the reference genome, the mapped SNPs were categorized according to their predicted function, including annotation of ∼2500 putative nonsynonymous variants. The highest density SNP linkage map for any salmonid species has been created, annotated, and integrated with the Atlantic salmon reference genome assembly. This map highlights the marked heterochiasmy of salmon, and provides a useful resource for salmonid genetics and genomics research.

  1. Decoding the oak genome: public release of sequence data, assembly, annotation and publication strategies.

    PubMed

    Plomion, Christophe; Aury, Jean-Marc; Amselem, Joëlle; Alaeitabar, Tina; Barbe, Valérie; Belser, Caroline; Bergès, Hélène; Bodénès, Catherine; Boudet, Nathalie; Boury, Christophe; Canaguier, Aurélie; Couloux, Arnaud; Da Silva, Corinne; Duplessis, Sébastien; Ehrenmann, François; Estrada-Mairey, Barbara; Fouteau, Stéphanie; Francillonne, Nicolas; Gaspin, Christine; Guichard, Cécile; Klopp, Christophe; Labadie, Karine; Lalanne, Céline; Le Clainche, Isabelle; Leplé, Jean-Charles; Le Provost, Grégoire; Leroy, Thibault; Lesur, Isabelle; Martin, Francis; Mercier, Jonathan; Michotey, Célia; Murat, Florent; Salin, Franck; Steinbach, Delphine; Faivre-Rampant, Patricia; Wincker, Patrick; Salse, Jérôme; Quesneville, Hadi; Kremer, Antoine

    2016-01-01

    The 1.5 Gbp/2C genome of pedunculate oak (Quercus robur) has been sequenced. A strategy was established for dealing with the challenges imposed by the sequencing of such a large, complex and highly heterozygous genome by a whole-genome shotgun (WGS) approach, without the use of costly and time-consuming methods, such as fosmid or BAC clone-based hierarchical sequencing methods. The sequencing strategy combined short and long reads. Over 49 million reads provided by Roche 454 GS-FLX technology were assembled into contigs and combined with shorter Illumina sequence reads from paired-end and mate-pair libraries of different insert sizes, to build scaffolds. Errors were corrected and gaps filled with Illumina paired-end reads and contaminants detected, resulting in a total of 17,910 scaffolds (>2 kb) corresponding to 1.34 Gb. Fifty per cent of the assembly was accounted for by 1468 scaffolds (N50 of 260 kb). Initial comparison with the phylogenetically related Prunus persica gene model indicated that genes for 84.6% of the proteins present in peach (mean protein coverage of 90.5%) were present in our assembly. The second and third steps in this project are genome annotation and the assignment of scaffolds to the oak genetic linkage map. In accordance with the Bermuda and Fort Lauderdale agreements and the more recent Toronto Statement, the oak genome data have been released into public sequence repositories in advance of publication. In this presubmission paper, the oak genome consortium describes its principal lines of work and future directions for analyses of the nature, function and evolution of the oak genome. PMID:25944057

  2. Decoding the oak genome: public release of sequence data, assembly, annotation and publication strategies.

    PubMed

    Plomion, Christophe; Aury, Jean-Marc; Amselem, Joëlle; Alaeitabar, Tina; Barbe, Valérie; Belser, Caroline; Bergès, Hélène; Bodénès, Catherine; Boudet, Nathalie; Boury, Christophe; Canaguier, Aurélie; Couloux, Arnaud; Da Silva, Corinne; Duplessis, Sébastien; Ehrenmann, François; Estrada-Mairey, Barbara; Fouteau, Stéphanie; Francillonne, Nicolas; Gaspin, Christine; Guichard, Cécile; Klopp, Christophe; Labadie, Karine; Lalanne, Céline; Le Clainche, Isabelle; Leplé, Jean-Charles; Le Provost, Grégoire; Leroy, Thibault; Lesur, Isabelle; Martin, Francis; Mercier, Jonathan; Michotey, Célia; Murat, Florent; Salin, Franck; Steinbach, Delphine; Faivre-Rampant, Patricia; Wincker, Patrick; Salse, Jérôme; Quesneville, Hadi; Kremer, Antoine

    2016-01-01

    The 1.5 Gbp/2C genome of pedunculate oak (Quercus robur) has been sequenced. A strategy was established for dealing with the challenges imposed by the sequencing of such a large, complex and highly heterozygous genome by a whole-genome shotgun (WGS) approach, without the use of costly and time-consuming methods, such as fosmid or BAC clone-based hierarchical sequencing methods. The sequencing strategy combined short and long reads. Over 49 million reads provided by Roche 454 GS-FLX technology were assembled into contigs and combined with shorter Illumina sequence reads from paired-end and mate-pair libraries of different insert sizes, to build scaffolds. Errors were corrected and gaps filled with Illumina paired-end reads and contaminants detected, resulting in a total of 17,910 scaffolds (>2 kb) corresponding to 1.34 Gb. Fifty per cent of the assembly was accounted for by 1468 scaffolds (N50 of 260 kb). Initial comparison with the phylogenetically related Prunus persica gene model indicated that genes for 84.6% of the proteins present in peach (mean protein coverage of 90.5%) were present in our assembly. The second and third steps in this project are genome annotation and the assignment of scaffolds to the oak genetic linkage map. In accordance with the Bermuda and Fort Lauderdale agreements and the more recent Toronto Statement, the oak genome data have been released into public sequence repositories in advance of publication. In this presubmission paper, the oak genome consortium describes its principal lines of work and future directions for analyses of the nature, function and evolution of the oak genome.

  3. Plant Gene and Alternatively Spliced Variant Annotator. A plant genome annotation pipeline for rice gene and alternatively spliced variant identification with cross-species expressed sequence tag conservation from seven plant species.

    PubMed

    Chen, Feng-Chi; Wang, Sheng-Shun; Chaw, Shu-Miaw; Huang, Yao-Ting; Chuang, Trees-Juen

    2007-03-01

    The completion of the rice (Oryza sativa) genome draft has brought unprecedented opportunities for genomic studies of the world's most important food crop. Previous rice gene annotations have relied mainly on ab initio methods, which usually yield a high rate of false-positive predictions and give only limited information regarding alternative splicing in rice genes. Comparative approaches based on expressed sequence tags (ESTs) can compensate for the drawbacks of ab initio methods because they can simultaneously identify experimental data-supported genes and alternatively spliced transcripts. Furthermore, cross-species EST information can be used to not only offset the insufficiency of same-species ESTs but also derive evolutionary implications. In this study, we used ESTs from seven plant species, rice, wheat (Triticum aestivum), maize (Zea mays), barley (Hordeum vulgare), sorghum (Sorghum bicolor), soybean (Glycine max), and Arabidopsis (Arabidopsis thaliana), to annotate the rice genome. We developed a plant genome annotation pipeline, Plant Gene and Alternatively Spliced Variant Annotator (PGAA). Using this approach, we identified 852 genes (931 isoforms) not annotated in other widely used databases (i.e. the Institute for Genomic Research, National Center for Biotechnology Information, and Rice Annotation Project) and found 87% of them supported by both rice and nonrice EST evidence. PGAA also identified more than 44,000 alternatively spliced events, of which approximately 20% are not observed in the other three annotations. These novel annotations represent rich opportunities for rice genome research, because the functions of most of our annotated genes are currently unknown. Also, in the PGAA annotation, the isoforms with non-rice-EST-supported exons are significantly enriched in transporter activity but significantly underrepresented in transcription regulator activity. We have also identified potential lineage-specific and conserved isoforms, which are

  4. Discovery and annotation of small proteins using genomics, proteomics, and computational approaches

    SciTech Connect

    Yang, Xiaohan; Tschaplinski, Timothy J; Hurst, Gregory {Greg} B; Jawdy, Sara; Abraham, Paul E; Lankford, Patricia K; Adams, Rachel M; Shah, Manesh B; Hettich, Robert {Bob} L; Kalluri, Udaya C; Gunter, Lee E; Pennacchio, Christa; Tuskan, Gerald A

    2011-01-01

    Small proteins (10 200 amino acids (AA) in length) encoded by short open reading frames (sORF) play important regulatory roles in various biological processes, including tumor progression, stress response, flowering and hormone signaling. However, ab initio discovery of small proteins has been relatively overlooked. Recent advances in deep transcriptome sequencing make it possible to efficiently identify sORFs at the genome level. In this study, we obtained ~2.6 million expressed sequence tag (EST) reads from Populus deltoides leaf transcriptome and reconstructed full-length transcripts from the EST sequences. We identified an initial set of 12,852 sORFs encoding proteins of 10 200 AA in length. Three computational approaches were then used to enrich for bona fide protein-coding sORFs from the initial sORF set: 1) coding-potential prediction, 2) evolutionary conservation between P. deltoides and other plant species, and 3) gene family clustering within P. deltoides. As a result, a high-confidence sORF candidate set containing 1,469 genes was obtained. Analysis of the protein domains, non-protein-coding RNA motifs, sequence length distribution, and protein mass spectrometry data supported this high-confidence sORF set. In the high-confidence sORF candidate set, known protein domains were identified in 1,282 genes (higher-confidence sORF candidate set), out of which 611 genes, designated as highest-confidence candidate sORF set, were also supported by proteomics data. This study not only demonstrates that there are potential sORF candidates to be annotated in sequenced genomes, but also presents an efficient strategy for discovery of sORFs in species with no genome annotation yet available.

  5. Integrative Tissue-Specific Functional Annotations in the Human Genome Provide Novel Insights on Many Complex Traits and Improve Signal Prioritization in Genome Wide Association Studies

    PubMed Central

    Wang, Qian; He, Beixin Julie; Zhao, Hongyu

    2016-01-01

    Extensive efforts have been made to understand genomic function through both experimental and computational approaches, yet proper annotation still remains challenging, especially in non-coding regions. In this manuscript, we introduce GenoSkyline, an unsupervised learning framework to predict tissue-specific functional regions through integrating high-throughput epigenetic annotations. GenoSkyline successfully identified a variety of non-coding regulatory machinery including enhancers, regulatory miRNA, and hypomethylated transposable elements in extensive case studies. Integrative analysis of GenoSkyline annotations and results from genome-wide association studies (GWAS) led to novel biological insights on the etiologies of a number of human complex traits. We also explored using tissue-specific functional annotations to prioritize GWAS signals and predict relevant tissue types for each risk locus. Brain and blood-specific annotations led to better prioritization performance for schizophrenia than standard GWAS p-values and non-tissue-specific annotations. As for coronary artery disease, heart-specific functional regions was highly enriched of GWAS signals, but previously identified risk loci were found to be most functional in other tissues, suggesting a substantial proportion of still undetected heart-related loci. In summary, GenoSkyline annotations can guide genetic studies at multiple resolutions and provide valuable insights in understanding complex diseases. GenoSkyline is available at http://genocanyon.med.yale.edu/GenoSkyline. PMID:27058395

  6. Semantic Assembly and Annotation of Draft RNAseq Transcripts without a Reference Genome

    PubMed Central

    Ptitsyn, Andrey; Temanni, Ramzi; Bouchard, Christelle; Anderson, Peter A. V.

    2015-01-01

    Transcriptomes are one of the first sources of high-throughput genomic data that have benefitted from the introduction of Next-Gen Sequencing. As sequencing technology becomes more accessible, transcriptome sequencing is applicable to multiple organisms for which genome sequences are unavailable. Currently all methods for de novo assembly are based on the concept of matching the nucleotide context overlapping between short fragments-reads. However, even short reads may still contain biologically relevant information which can be used as hints in guiding the assembly process. We propose a computational workflow for the reconstruction and functional annotation of expressed gene transcripts that does not require a reference genome sequence and can be tolerant to low coverage, high error rates and other issues that often lead to poor results of de novo assembly in studies of non-model organisms. We start with either raw sequences or the output of a context-based de novo transcriptome assembly. Instead of mapping reads to a reference genome or creating a completely unsupervised clustering of reads, we assemble the unknown transcriptome using nearest homologs from a public database as seeds. We consider even distant relations, indirectly linking protein-coding fragments to entire gene families in multiple distantly related genomes. The intended application of the proposed method is an additional step of semantic (based on relations between protein-coding fragments) scaffolding following traditional (i.e. based on sequence overlap) de novo assembly. The method we developed was effective in analysis of the jellyfish Cyanea capillata transcriptome and may be applicable in other studies of gene expression in species lacking a high quality reference genome sequence. Our algorithms are implemented in C and designed for parallel computation using a high-performance computer. The software is available free of charge via an open source license. PMID:26393794

  7. Semantic Assembly and Annotation of Draft RNAseq Transcripts without a Reference Genome.

    PubMed

    Ptitsyn, Andrey; Temanni, Ramzi; Bouchard, Christelle; Anderson, Peter A V

    2015-01-01

    Transcriptomes are one of the first sources of high-throughput genomic data that have benefitted from the introduction of Next-Gen Sequencing. As sequencing technology becomes more accessible, transcriptome sequencing is applicable to multiple organisms for which genome sequences are unavailable. Currently all methods for de novo assembly are based on the concept of matching the nucleotide context overlapping between short fragments-reads. However, even short reads may still contain biologically relevant information which can be used as hints in guiding the assembly process. We propose a computational workflow for the reconstruction and functional annotation of expressed gene transcripts that does not require a reference genome sequence and can be tolerant to low coverage, high error rates and other issues that often lead to poor results of de novo assembly in studies of non-model organisms. We start with either raw sequences or the output of a context-based de novo transcriptome assembly. Instead of mapping reads to a reference genome or creating a completely unsupervised clustering of reads, we assemble the unknown transcriptome using nearest homologs from a public database as seeds. We consider even distant relations, indirectly linking protein-coding fragments to entire gene families in multiple distantly related genomes. The intended application of the proposed method is an additional step of semantic (based on relations between protein-coding fragments) scaffolding following traditional (i.e. based on sequence overlap) de novo assembly. The method we developed was effective in analysis of the jellyfish Cyanea capillata transcriptome and may be applicable in other studies of gene expression in species lacking a high quality reference genome sequence. Our algorithms are implemented in C and designed for parallel computation using a high-performance computer. The software is available free of charge via an open source license. PMID:26393794

  8. Genome Wide Re-Annotation of Caldicellulosiruptor saccharolyticus with New Insights into Genes Involved in Biomass Degradation and Hydrogen Production

    PubMed Central

    Chowdhary, Nupoor; Selvaraj, Ashok; KrishnaKumaar, Lakshmi; Kumar, Gopal Ramesh

    2015-01-01

    Caldicellulosiruptor saccharolyticus has proven itself to be an excellent candidate for biological hydrogen (H2) production, but still it has major drawbacks like sensitivity to high osmotic pressure and low volumetric H2 productivity, which should be considered before it can be used industrially. A whole genome re-annotation work has been carried out as an attempt to update the incomplete genome information that causes gap in the knowledge especially in the area of metabolic engineering, to improve the H2 producing capabilities of C. saccharolyticus. Whole genome re-annotation was performed through manual means for 2,682 Coding Sequences (CDSs). Bioinformatics tools based on sequence similarity, motif search, phylogenetic analysis and fold recognition were employed for re-annotation. Our methodology could successfully add functions for 409 hypothetical proteins (HPs), 46 proteins previously annotated as putative and assigned more accurate functions for the known protein sequences. Homology based gene annotation has been used as a standard method for assigning function to novel proteins, but over the past few years many non-homology based methods such as genomic context approaches for protein function prediction have been developed. Using non-homology based functional prediction methods, we were able to assign cellular processes or physical complexes for 249 hypothetical sequences. Our re-annotation pipeline highlights the addition of 231 new CDSs generated from MicroScope Platform, to the original genome with functional prediction for 49 of them. The re-annotation of HPs and new CDSs is stored in the relational database that is available on the MicroScope web-based platform. In parallel, a comparative genome analyses were performed among the members of genus Caldicellulosiruptor to understand the function and evolutionary processes. Further, with results from integrated re-annotation studies (homology and genomic context approach), we strongly suggest that Csac

  9. Genome Wide Re-Annotation of Caldicellulosiruptor saccharolyticus with New Insights into Genes Involved in Biomass Degradation and Hydrogen Production.

    PubMed

    Chowdhary, Nupoor; Selvaraj, Ashok; KrishnaKumaar, Lakshmi; Kumar, Gopal Ramesh

    2015-01-01

    Caldicellulosiruptor saccharolyticus has proven itself to be an excellent candidate for biological hydrogen (H2) production, but still it has major drawbacks like sensitivity to high osmotic pressure and low volumetric H2 productivity, which should be considered before it can be used industrially. A whole genome re-annotation work has been carried out as an attempt to update the incomplete genome information that causes gap in the knowledge especially in the area of metabolic engineering, to improve the H2 producing capabilities of C. saccharolyticus. Whole genome re-annotation was performed through manual means for 2,682 Coding Sequences (CDSs). Bioinformatics tools based on sequence similarity, motif search, phylogenetic analysis and fold recognition were employed for re-annotation. Our methodology could successfully add functions for 409 hypothetical proteins (HPs), 46 proteins previously annotated as putative and assigned more accurate functions for the known protein sequences. Homology based gene annotation has been used as a standard method for assigning function to novel proteins, but over the past few years many non-homology based methods such as genomic context approaches for protein function prediction have been developed. Using non-homology based functional prediction methods, we were able to assign cellular processes or physical complexes for 249 hypothetical sequences. Our re-annotation pipeline highlights the addition of 231 new CDSs generated from MicroScope Platform, to the original genome with functional prediction for 49 of them. The re-annotation of HPs and new CDSs is stored in the relational database that is available on the MicroScope web-based platform. In parallel, a comparative genome analyses were performed among the members of genus Caldicellulosiruptor to understand the function and evolutionary processes. Further, with results from integrated re-annotation studies (homology and genomic context approach), we strongly suggest that Csac

  10. The evolutionary analysis of "orphans" from the Drosophila genome identifies rapidly diverging and incorrectly annotated genes.

    PubMed

    Schmid, K J; Aquadro, C F

    2001-10-01

    In genome projects of eukaryotic model organisms, a large number of novel genes of unknown function and evolutionary history ("orphans") are being identified. Since many orphans have no known homologs in distant species, it is unclear whether they are restricted to certain taxa or evolve rapidly, either because of a lack of constraints or positive Darwinian selection. Here we use three criteria for the selection of putatively rapidly evolving genes from a single sequence of Drosophila melanogaster. Thirteen candidate genes were chosen from the Adh region on the second chromosome and 1 from the tip of the X chromosome. We succeeded in obtaining sequence from 6 of these in the closely related species D. simulans and D. yakuba. Only 1 of the 6 genes showed a large number of amino acid replacements and in-frame insertions/deletions. A population survey of this gene suggests that its rapid evolution is due to the fixation of many neutral or nearly neutral mutations. Two other genes showed "normal" levels of divergence between species. Four genes had insertions/deletions that destroy the putative reading frame within exons, suggesting that these exons have been incorrectly annotated. The evolutionary analysis of orphan genes in closely related species is useful for the identification of both rapidly evolving and incorrectly annotated genes.

  11. PeerGAD: a peer-review-based and community-centric web application for viewing and annotating prokaryotic genome sequences

    PubMed Central

    D’Ascenzo, Mark D.; Collmer, Alan; Martin, Gregory B.

    2004-01-01

    PeerGAD is a web-based database-driven application that allows community-wide peer-reviewed annotation of prokaryotic genome sequences. The application was developed to support the annotation of the Pseudomonas syringae pv. tomato strain DC3000 genome sequence and is easily portable to other genome sequence annotation projects. PeerGAD incorporates several innovative design and operation features and accepts annotations pertaining to gene naming, role classification, gene translation and annotation derivation. The annotator tool in PeerGAD is built around a genome browser that offers users the ability to search and navigate the genome sequence. Because the application encourages annotation of the genome sequence directly by researchers and relies on peer review, it circumvents the need for an annotation curator while providing added value to the annotation data. Support for the Gene Ontology™ vocabulary, a structured and controlled vocabulary used in classification of gene roles, is emphasized throughout the system. Here we present the underlying concepts integral to the functionality of PeerGAD. PMID:15184545

  12. An innovative plant genomics and gene annotation program for high school, community college, and university faculty.

    PubMed

    Hacisalihoglu, Gokhan; Hilgert, Uwe; Nash, E Bruce; Micklos, David A

    2008-01-01

    Today's biology educators face the challenge of training their students in modern molecular biology techniques including genomics and bioinformatics. The Dolan DNA Learning Center (DNALC) of Cold Spring Harbor Laboratory has developed and disseminated a bench- and computer-based plant genomics curriculum for biology faculty. In 2007, a five-day "Plant Genomics and Gene Annotation" workshop was held at Florida A&M University in Tallahassee, FL, to enhance participants' knowledge and understanding of plant molecular genetics and assist them in developing and honing their laboratory and computer skills. Florida A&M University is a historically black university with over 95% African-American student enrollment. Sixteen participants, including high school (56%) and community college faculty (25%), attended the workshop. Participants carried out in vitro and in silico experiments with maize, Arabidopsis, soybean, and food products to determine the genotype of the samples. Benefits of the workshop included increased awareness of plant biology research for high school and college level students. Participants completed pre- and postworkshop evaluations for the measurement of effectiveness. Participants demonstrated an overall improvement in their postworkshop evaluation scores. This article provides a detailed description of workshop activities, as well as assessment and long-term support for broad classroom implementation.

  13. Refined annotation of the Arabidopsis genome by complete expressed sequence tag mapping.

    PubMed

    Zhu, Wei; Schlueter, Shannon D; Brendel, Volker

    2003-06-01

    Expressed sequence tags (ESTs) currently encompass more entries in the public databases than any other form of sequence data. Thus, EST data sets provide a vast resource for gene identification and expression profiling. We have mapped the complete set of 176,915 publicly available Arabidopsis EST sequences onto the Arabidopsis genome using GeneSeqer, a spliced alignment program incorporating sequence similarity and splice site scoring. About 96% of the available ESTs could be properly aligned with a genomic locus, with the remaining ESTs deriving from organelle genomes and non-Arabidopsis sources or displaying insufficient sequence quality for alignment. The mapping provides verified sets of EST clusters for evaluation of EST clustering programs. Analysis of the spliced alignments suggests corrections to current gene structure annotation and provides examples of alternative and non-canonical pre-mRNA splicing. All results of this study were parsed into a database and are accessible via a flexible Web interface at http://www.plantgdb.org/AtGDB/.

  14. AnnoTALE: bioinformatics tools for identification, annotation, and nomenclature of TALEs from Xanthomonas genomic sequences.

    PubMed

    Grau, Jan; Reschke, Maik; Erkes, Annett; Streubel, Jana; Morgan, Richard D; Wilson, Geoffrey G; Koebnik, Ralf; Boch, Jens

    2016-01-01

    Transcription activator-like effectors (TALEs) are virulence factors, produced by the bacterial plant-pathogen Xanthomonas, that function as gene activators inside plant cells. Although the contribution of individual TALEs to infectivity has been shown, the specific roles of most TALEs, and the overall TALE diversity in Xanthomonas spp. is not known. TALEs possess a highly repetitive DNA-binding domain, which is notoriously difficult to sequence. Here, we describe an improved method for characterizing TALE genes by the use of PacBio sequencing. We present 'AnnoTALE', a suite of applications for the analysis and annotation of TALE genes from Xanthomonas genomes, and for grouping similar TALEs into classes. Based on these classes, we propose a unified nomenclature for Xanthomonas TALEs that reveals similarities pointing to related functionalities. This new classification enables us to compare related TALEs and to identify base substitutions responsible for the evolution of TALE specificities. PMID:26876161

  15. Genome-wide Annotation, Identification, and Global Transcriptomic Analysis of Regulatory or Small RNA Gene Expression in Staphylococcus aureus

    PubMed Central

    Weiss, Andy; Broach, William H.; Wiemels, Richard E.; Mogen, Austin B.; Rice, Kelly C.

    2016-01-01

    ABSTRACT In Staphylococcus aureus, hundreds of small regulatory or small RNAs (sRNAs) have been identified, yet this class of molecule remains poorly understood and severely understudied. sRNA genes are typically absent from genome annotation files, and as a consequence, their existence is often overlooked, particularly in global transcriptomic studies. To facilitate improved detection and analysis of sRNAs in S. aureus, we generated updated GenBank files for three commonly used S. aureus strains (MRSA252, NCTC 8325, and USA300), in which we added annotations for >260 previously identified sRNAs. These files, the first to include genome-wide annotation of sRNAs in S. aureus, were then used as a foundation to identify novel sRNAs in the community-associated methicillin-resistant strain USA300. This analysis led to the discovery of 39 previously unidentified sRNAs. Investigating the genomic loci of the newly identified sRNAs revealed a surprising degree of inconsistency in genome annotation in S. aureus, which may be hindering the analysis and functional exploration of these elements. Finally, using our newly created annotation files as a reference, we perform a global analysis of sRNA gene expression in S. aureus and demonstrate that the newly identified tsr25 is the most highly upregulated sRNA in human serum. This study provides an invaluable resource to the S. aureus research community in the form of our newly generated annotation files, while at the same time presenting the first examination of differential sRNA expression in pathophysiologically relevant conditions. PMID:26861020

  16. Quantitative frame analysis and the annotation of GC-rich (and other) prokaryotic genomes. An application to Anaeromyxobacter dehalogenans

    PubMed Central

    Oden, Steve; Brocchieri, Luciano

    2015-01-01

    Motivation: Graphical representations of contrasts in GC usage among codon frame positions (frame analysis) provide evidence of genes missing from the annotations of prokaryotic genomes of high GC content but the qualitative approach of visual frame analysis prevents its applicability on a genomic scale. Results: We developed two quantitative methods for the identification and statistical characterization in sequence regions of three-base periodicity (hits) associated with open reading frame structures. The methods were implemented in the N-Profile Analysis Computational Tool (NPACT), which highlights in graphical representations inconsistencies between newly identified ORFs and pre-existing annotations of coding-regions. We applied the NPACT procedures to two recently annotated strains of the deltaproteobacterium Anaeromyxobacter dehalogenans, identifying in both genomes numerous conserved ORFs not included in the published annotation of coding regions. Availability and implementation: NPACT is available as a web-based service and for download at http://genome.ufl.edu/npact. Contact: lucianob@ufl.edu Supplementary information: Supplementary data are available at Bioinformatics online. PMID:26048600

  17. Blobology: exploring raw genome data for contaminants, symbionts and parasites using taxon-annotated GC-coverage plots.

    PubMed

    Kumar, Sujai; Jones, Martin; Koutsovoulos, Georgios; Clarke, Michael; Blaxter, Mark

    2013-01-01

    Generating the raw data for a de novo genome assembly project for a target eukaryotic species is relatively easy. This democratization of access to large-scale data has allowed many research teams to plan to assemble the genomes of non-model organisms. These new genome targets are very different from the traditional, inbred, laboratory-reared model organisms. They are often small, and cannot be isolated free of their environment - whether ingested food, the surrounding host organism of parasites, or commensal and symbiotic organisms attached to or within the individuals sampled. Preparation of pure DNA originating from a single species can be technically impossible, but assembly of mixed-organism DNA can be difficult, as most genome assemblers perform poorly when faced with multiple genomes in different stoichiometries. This class of problem is common in metagenomic datasets that deliberately try to capture all the genomes present in an environment, but replicon assembly is not often the goal of such programs. Here we present an approach to extracting, from mixed DNA sequence data, subsets that correspond to single species' genomes and thus improving genome assembly. We use both numerical (proportion of GC bases and read coverage) and biological (best-matching sequence in annotated databases) indicators to aid partitioning of draft assembly contigs, and the reads that contribute to those contigs, into distinct bins that can then be subjected to rigorous, optimized assembly, through the use of taxon-annotated GC-coverage plots (TAGC plots). We also present Blobsplorer, a tool that aids exploration and selection of subsets from TAGC-annotated data. Partitioning the data in this way can rescue poorly assembled genomes, and reveal unexpected symbionts and commensals in eukaryotic genome projects. The TAGC plot pipeline script is available from https://github.com/blaxterlab/blobology, and the Blobsplorer tool from https://github.com/mojones/Blobsplorer.

  18. Discovery of germline-related genes in Cephalochordate amphioxus: A genome wide survey using genome annotation and transcriptome data.

    PubMed

    Yue, Jia-Xing; Li, Kun-Lung; Yu, Jr-Kai

    2015-12-01

    The generation of germline cells is a critical process in the reproduction of multicellular organisms. Studies in animal models have identified a common repertoire of genes that play essential roles in primordial germ cell (PGC) formation. However, comparative studies also indicate that the timing and regulation of this core genetic program vary considerably in different animals, raising the intriguing questions regarding the evolution of PGC developmental mechanisms in metazoans. Cephalochordates (commonly called amphioxus or lancelets) represent one of the invertebrate chordate groups and can provide important information about the evolution of developmental mechanisms in the chordate lineage. In this study, we used genome and transcriptome data to identify germline-related genes in two distantly related cephalochordate species, Branchiostoma floridae and Asymmetron lucayanum. Branchiostoma and Asymmetron diverged more than 120 MYA, and the most conspicuous difference between them is their gonadal morphology. We used important germline developmental genes in several model animals to search the amphioxus genome and transcriptome dataset for conserved homologs. We also annotated the assembled transcriptome data using Gene Ontology (GO) terms to facilitate the discovery of putative genes associated with germ cell development and reproductive functions in amphioxus. We further confirmed the expression of 14 genes in developing oocytes or mature eggs using whole mount in situ hybridization, suggesting their potential functions in amphioxus germ cell development. The results of this global survey provide a useful resource for testing potential functions of candidate germline-related genes in cephalochordates and for investigating differences in gonad developmental mechanisms between Branchiostoma and Asymmetron species.

  19. Use of Modern Chemical Protein Synthesis and Advanced Fluorescent Assay Techniques to Experimentally Validate the Functional Annotation of Microbial Genomes

    SciTech Connect

    Kent, Stephen

    2012-07-20

    The objective of this research program was to prototype methods for the chemical synthesis of predicted protein molecules in annotated microbial genomes. High throughput chemical methods were to be used to make large numbers of predicted proteins and protein domains, based on microbial genome sequences. Microscale chemical synthesis methods for the parallel preparation of peptide-thioester building blocks were developed; these peptide segments are used for the parallel chemical synthesis of proteins and protein domains. Ultimately, it is envisaged that these synthetic molecules would be ‘printed’ in spatially addressable arrays. The unique ability of total synthesis to precision label protein molecules with dyes and with chemical or biochemical ‘tags’ can be used to facilitate novel assay technologies adapted from state-of-the art single molecule fluorescence detection techniques. In the future, in conjunction with modern laboratory automation this integrated set of techniques will enable high throughput experimental validation of the functional annotation of microbial genomes.

  20. Putative drug and vaccine target protein identification using comparative genomic analysis of KEGG annotated metabolic pathways of Mycoplasma hyopneumoniae.

    PubMed

    Damte, Dereje; Suh, Joo-Won; Lee, Seung-Jin; Yohannes, Sileshi Belew; Hossain, Md Akil; Park, Seung-Chun

    2013-07-01

    In the present study, a computational comparative and subtractive genomic/proteomic analysis aimed at the identification of putative therapeutic target and vaccine candidate proteins from Kyoto Encyclopedia of Genes and Genomes (KEGG) annotated metabolic pathways of Mycoplasma hyopneumoniae was performed for drug design and vaccine production pipelines against M.hyopneumoniae. The employed comparative genomic and metabolic pathway analysis with a predefined computational systemic workflow extracted a total of 41 annotated metabolic pathways from KEGG among which five were unique to M. hyopneumoniae. A total of 234 proteins were identified to be involved in these metabolic pathways. Although 125 non homologous and predicted essential proteins were found from the total that could serve as potential drug targets and vaccine candidates, additional prioritizing parameters characterize 21 proteins as vaccine candidate while druggability of each of the identified proteins evaluated by the DrugBank database prioritized 42 proteins suitable for drug targets.

  1. ChIP-Seq-Annotated Heliconius erato Genome Highlights Patterns of cis-Regulatory Evolution in Lepidoptera.

    PubMed

    Lewis, James J; van der Burg, Karin R L; Mazo-Vargas, Anyi; Reed, Robert D

    2016-09-13

    Uncovering phylogenetic patterns of cis-regulatory evolution remains a fundamental goal for evolutionary and developmental biology. Here, we characterize the evolution of regulatory loci in butterflies and moths using chromatin immunoprecipitation sequencing (ChIP-seq) annotation of regulatory elements across three stages of head development. In the process we provide a high-quality, functionally annotated genome assembly for the butterfly, Heliconius erato. Comparing cis-regulatory element conservation across six lepidopteran genomes, we find that regulatory sequences evolve at a pace similar to that of protein-coding regions. We also observe that elements active at multiple developmental stages are markedly more conserved than elements with stage-specific activity. Surprisingly, we also find that stage-specific proximal and distal regulatory elements evolve at nearly identical rates. Our study provides a benchmark for genome-wide patterns of regulatory element evolution in insects, and it shows that developmental timing of activity strongly predicts patterns of regulatory sequence evolution.

  2. Construction and Annotation of a High Density SNP Linkage Map of the Atlantic Salmon (Salmo salar) Genome

    PubMed Central

    Tsai, Hsin Y.; Robledo, Diego; Lowe, Natalie R.; Bekaert, Michael; Taggart, John B.; Bron, James E.; Houston, Ross D.

    2016-01-01

    High density linkage maps are useful tools for fine-scale mapping of quantitative trait loci, and characterization of the recombination landscape of a species’ genome. Genomic resources for Atlantic salmon (Salmo salar) include a well-assembled reference genome, and high density single nucleotide polymorphism (SNP) arrays. Our aim was to create a high density linkage map, and to align it with the reference genome assembly. Over 96,000 SNPs were mapped and ordered on the 29 salmon linkage groups using a pedigreed population comprising 622 fish from 60 nuclear families, all genotyped with the ‘ssalar01’ high density SNP array. The number of SNPs per group showed a high positive correlation with physical chromosome length (r = 0.95). While the order of markers on the genetic and physical maps was generally consistent, areas of discrepancy were identified. Approximately 6.5% of the previously unmapped reference genome sequence was assigned to chromosomes using the linkage map. Male recombination rate was lower than females across the vast majority of the genome, but with a notable peak in subtelomeric regions. Finally, using RNA-Seq data to annotate the reference genome, the mapped SNPs were categorized according to their predicted function, including annotation of ∼2500 putative nonsynonymous variants. The highest density SNP linkage map for any salmonid species has been created, annotated, and integrated with the Atlantic salmon reference genome assembly. This map highlights the marked heterochiasmy of salmon, and provides a useful resource for salmonid genetics and genomics research. PMID:27194803

  3. Leveraging Genomic Annotations and Pleiotropic Enrichment for Improved Replication Rates in Schizophrenia GWAS

    PubMed Central

    Wang, Yunpeng; Thompson, Wesley K.; Schork, Andrew J.; Holland, Dominic; Chen, Chi-Hua; Bettella, Francesco; Desikan, Rahul S.; Li, Wen; Witoelar, Aree; Zuber, Verena; Devor, Anna; Nöthen, Markus M.; Rietschel, Marcella; Chen, Qiang; Werge, Thomas; Cichon, Sven; Weinberger, Daniel R.; Djurovic, Srdjan; O’Donovan, Michael; Visscher, Peter M.; Andreassen, Ole A.; Dale, Anders M.

    2016-01-01

    Most of the genetic architecture of schizophrenia (SCZ) has not yet been identified. Here, we apply a novel statistical algorithm called Covariate-Modulated Mixture Modeling (CM3), which incorporates auxiliary information (heterozygosity, total linkage disequilibrium, genomic annotations, pleiotropy) for each single nucleotide polymorphism (SNP) to enable more accurate estimation of replication probabilities, conditional on the observed test statistic (“z-score”) of the SNP. We use a multiple logistic regression on z-scores to combine information from auxiliary information to derive a “relative enrichment score” for each SNP. For each stratum of these relative enrichment scores, we obtain nonparametric estimates of posterior expected test statistics and replication probabilities as a function of discovery z-scores, using a resampling-based approach that repeatedly and randomly partitions meta-analysis sub-studies into training and replication samples. We fit a scale mixture of two Gaussians model to each stratum, obtaining parameter estimates that minimize the sum of squared differences of the scale-mixture model with the stratified nonparametric estimates. We apply this approach to the recent genome-wide association study (GWAS) of SCZ (n = 82,315), obtaining a good fit between the model-based and observed effect sizes and replication probabilities. We observed that SNPs with low enrichment scores replicate with a lower probability than SNPs with high enrichment scores even when both they are genome-wide significant (p < 5x10-8). There were 693 and 219 independent loci with model-based replication rates ≥80% and ≥90%, respectively. Compared to analyses not incorporating relative enrichment scores, CM3 increased out-of-sample yield for SNPs that replicate at a given rate. This demonstrates that replication probabilities can be more accurately estimated using prior enrichment information with CM3. PMID:26808560

  4. Estimating gene gain and loss rates in the presence of error in genome assembly and annotation using CAFE 3.

    PubMed

    Han, Mira V; Thomas, Gregg W C; Lugo-Martinez, Jose; Hahn, Matthew W

    2013-08-01

    Current sequencing methods produce large amounts of data, but genome assemblies constructed from these data are often fragmented and incomplete. Incomplete and error-filled assemblies result in many annotation errors, especially in the number of genes present in a genome. This means that methods attempting to estimate rates of gene duplication and loss often will be misled by such errors and that rates of gene family evolution will be consistently overestimated. Here, we present a method that takes these errors into account, allowing one to accurately infer rates of gene gain and loss among genomes even with low assembly and annotation quality. The method is implemented in the newest version of the software package CAFE, along with several other novel features. We demonstrate the accuracy of the method with extensive simulations and reanalyze several previously published data sets. Our results show that errors in genome annotation do lead to higher inferred rates of gene gain and loss but that CAFE 3 sufficiently accounts for these errors to provide accurate estimates of important evolutionary parameters.

  5. Gene prediction and annotation in Penstemon (Plantaginaceae): A workflow for marker development from extremely low-coverage genome sequencing1

    PubMed Central

    Blischak, Paul D.; Wenzel, Aaron J.; Wolfe, Andrea D.

    2014-01-01

    • Premise of the study: Penstemon (Plantaginaceae) is a large and diverse genus endemic to North America. However, determining the phylogenetic relationships among its 280 species has been difficult due to its recent evolutionary radiation. The development of a large, multilocus data set can help to resolve this challenge. • Methods: Using both previously sequenced genomic libraries and our own low-coverage whole-genome shotgun sequencing libraries, we used the MAKER2 Annotation Pipeline to identify gene regions for the development of sequencing loci from six extremely low-coverage Penstemon genomes (∼0.005×–0.007×). We also compared this approach to BLAST searches, and conducted analyses to characterize sequence divergence across the species sequenced. • Results: Annotations and gene predictions were successfully added to more than 10,000 contigs for potential use in downstream primer design. Primers were then designed for chloroplast, mitochondrial, and nuclear loci from these annotated sequences. MAKER2 identified longer gene regions in all six Penstemon genomes when compared with BLASTN and BLASTX searches. The average level of sequence divergence among the six species was 7.14%. • Discussion: Combining bioinformatics tools into a workflow that produces annotations can be useful for creating potential phylogenetic markers from thousands of sequences even when genome coverage is extremely low and reference data are only available from distant relatives. Furthermore, the output from MAKER2 contains information about important gene features, such as exon boundaries, and can be easily integrated with visualization tools to facilitate the process of marker development. PMID:25506519

  6. Comparative annotation of functional regions in the human genome using epigenomic data.

    PubMed

    Won, Kyoung-Jae; Zhang, Xian; Wang, Tao; Ding, Bo; Raha, Debasish; Snyder, Michael; Ren, Bing; Wang, Wei

    2013-04-01

    Epigenetic regulation is dynamic and cell-type dependent. The recently available epigenomic data in multiple cell types provide an unprecedented opportunity for a comparative study of epigenetic landscape. We developed a machine-learning method called ChroModule to annotate the epigenetic states in eight ENCyclopedia Of DNA Elements cell types. The trained model successfully captured the characteristic histone-modification patterns associated with regulatory elements, such as promoters and enhancers, and showed superior performance on identifying enhancers compared with the state-of-art methods. In addition, given the fixed number of epigenetic states in the model, ChroModule allows straightforward illustration of epigenetic variability in multiple cell types. Using this feature, we found that invariable and variable epigenetic states across cell types correspond to housekeeping functions and stimulus response, respectively. Especially, we observed that enhancers, but not the other regulatory elements, dictate cell specificity, as similar cell types share common enhancers, and cell-type-specific enhancers are often bound by transcription factors playing critical roles in that cell type. More interestingly, we found some genomic regions are dormant in cell type but primed to become active in other cell types. These observations highlight the usefulness of ChroModule in comparative analysis and interpretation of multiple epigenomes.

  7. The genome sequence of Leishmania (Leishmania) amazonensis: functional annotation and extended analysis of gene models.

    PubMed

    Real, Fernando; Vidal, Ramon Oliveira; Carazzolle, Marcelo Falsarella; Mondego, Jorge Maurício Costa; Costa, Gustavo Gilson Lacerda; Herai, Roberto Hirochi; Würtele, Martin; de Carvalho, Lucas Miguel; Carmona e Ferreira, Renata; Mortara, Renato Arruda; Barbiéri, Clara Lucia; Mieczkowski, Piotr; da Silveira, José Franco; Briones, Marcelo Ribeiro da Silva; Pereira, Gonçalo Amarante Guimarães; Bahia, Diana

    2013-12-01

    We present the sequencing and annotation of the Leishmania (Leishmania) amazonensis genome, an etiological agent of human cutaneous leishmaniasis in the Amazon region of Brazil. L. (L.) amazonensis shares features with Leishmania (L.) mexicana but also exhibits unique characteristics regarding geographical distribution and clinical manifestations of cutaneous lesions (e.g. borderline disseminated cutaneous leishmaniasis). Predicted genes were scored for orthologous gene families and conserved domains in comparison with other human pathogenic Leishmania spp. Carboxypeptidase, aminotransferase, and 3'-nucleotidase genes and ATPase, thioredoxin, and chaperone-related domains were represented more abundantly in L. (L.) amazonensis and L. (L.) mexicana species. Phylogenetic analysis revealed that these two species share groups of amastin surface proteins unique to the genus that could be related to specific features of disease outcomes and host cell interactions. Additionally, we describe a hypothetical hybrid interactome of potentially secreted L. (L.) amazonensis proteins and host proteins under the assumption that parasite factors mimic their mammalian counterparts. The model predicts an interaction between an L. (L.) amazonensis heat-shock protein and mammalian Toll-like receptor 9, which is implicated in important immune responses such as cytokine and nitric oxide production. The analysis presented here represents valuable information for future studies of leishmaniasis pathogenicity and treatment. PMID:23857904

  8. The Genome Sequence of Leishmania (Leishmania) amazonensis: Functional Annotation and Extended Analysis of Gene Models

    PubMed Central

    Real, Fernando; Vidal, Ramon Oliveira; Carazzolle, Marcelo Falsarella; Mondego, Jorge Maurício Costa; Costa, Gustavo Gilson Lacerda; Herai, Roberto Hirochi; Würtele, Martin; de Carvalho, Lucas Miguel; e Ferreira, Renata Carmona; Mortara, Renato Arruda; Barbiéri, Clara Lucia; Mieczkowski, Piotr; da Silveira, José Franco; Briones, Marcelo Ribeiro da Silva; Pereira, Gonçalo Amarante Guimarães; Bahia, Diana

    2013-01-01

    We present the sequencing and annotation of the Leishmania (Leishmania) amazonensis genome, an etiological agent of human cutaneous leishmaniasis in the Amazon region of Brazil. L. (L.) amazonensis shares features with Leishmania (L.) mexicana but also exhibits unique characteristics regarding geographical distribution and clinical manifestations of cutaneous lesions (e.g. borderline disseminated cutaneous leishmaniasis). Predicted genes were scored for orthologous gene families and conserved domains in comparison with other human pathogenic Leishmania spp. Carboxypeptidase, aminotransferase, and 3′-nucleotidase genes and ATPase, thioredoxin, and chaperone-related domains were represented more abundantly in L. (L.) amazonensis and L. (L.) mexicana species. Phylogenetic analysis revealed that these two species share groups of amastin surface proteins unique to the genus that could be related to specific features of disease outcomes and host cell interactions. Additionally, we describe a hypothetical hybrid interactome of potentially secreted L. (L.) amazonensis proteins and host proteins under the assumption that parasite factors mimic their mammalian counterparts. The model predicts an interaction between an L. (L.) amazonensis heat-shock protein and mammalian Toll-like receptor 9, which is implicated in important immune responses such as cytokine and nitric oxide production. The analysis presented here represents valuable information for future studies of leishmaniasis pathogenicity and treatment. PMID:23857904

  9. Narrowing and genomic annotation of the commonly deleted region of the 5q- syndrome

    SciTech Connect

    Boultwood, Jacqueline; Fidler, Carrie; Strickson, Amanda J.; Watkins, Fiona; Gama, Susana; Kearney, Lyndal; Tosi, Sabrina; Kasprzyk, Arek; Cheng, Jan-Fang; Jaju, Rina J.; Wainscoat, James S.

    2002-01-15

    The 5q syndrome is the most distinct of the myelodysplastic syndromes, and the molecular basis for this disorder remains unknown. We describe the narrowing of the common deleted region (CDR) of the 5q syndrome to the approximately 1.5-megabases interval at 5q32 flanked by D5S413 and the GLRA1 gene. The Ensemblgene prediction program has been used for the complete genomic annotation of the CDR. The CDR is gene rich and contains 24 known genes and 16 novel (predicted) genes. Of 40 genes in the CDR, 33 are expressed in CD34 cells and, therefore, represent candidate genes since they are expressed within the hematopoietic stem/progenitor cell compartment. A number of the genes assigned to the CDR represent good candidates for the 5q syndrome, including MEGF1, G3BP, and several of the novel gene predictions. These data now afford a comprehensive mutational/expression analysis of all candidate genes assigned to the CDR.

  10. Discovery of germline-related genes in Cephalochordate amphioxus: A genome wide survey using genome annotation and transcriptome data.

    PubMed

    Yue, Jia-Xing; Li, Kun-Lung; Yu, Jr-Kai

    2015-12-01

    The generation of germline cells is a critical process in the reproduction of multicellular organisms. Studies in animal models have identified a common repertoire of genes that play essential roles in primordial germ cell (PGC) formation. However, comparative studies also indicate that the timing and regulation of this core genetic program vary considerably in different animals, raising the intriguing questions regarding the evolution of PGC developmental mechanisms in metazoans. Cephalochordates (commonly called amphioxus or lancelets) represent one of the invertebrate chordate groups and can provide important information about the evolution of developmental mechanisms in the chordate lineage. In this study, we used genome and transcriptome data to identify germline-related genes in two distantly related cephalochordate species, Branchiostoma floridae and Asymmetron lucayanum. Branchiostoma and Asymmetron diverged more than 120 MYA, and the most conspicuous difference between them is their gonadal morphology. We used important germline developmental genes in several model animals to search the amphioxus genome and transcriptome dataset for conserved homologs. We also annotated the assembled transcriptome data using Gene Ontology (GO) terms to facilitate the discovery of putative genes associated with germ cell development and reproductive functions in amphioxus. We further confirmed the expression of 14 genes in developing oocytes or mature eggs using whole mount in situ hybridization, suggesting their potential functions in amphioxus germ cell development. The results of this global survey provide a useful resource for testing potential functions of candidate germline-related genes in cephalochordates and for investigating differences in gonad developmental mechanisms between Branchiostoma and Asymmetron species. PMID:25847029

  11. Likelihood-based gene annotations for gap filling and quality assessment in genome-scale metabolic models

    SciTech Connect

    Benedict, Matthew N.; Mundy, Michael B.; Henry, Christopher S.; Chia, Nicholas; Price, Nathan D.; Maranas, Costas D.

    2014-10-16

    Genome-scale metabolic models provide a powerful means to harness information from genomes to deepen biological insights. With exponentially increasing sequencing capacity, there is an enormous need for automated reconstruction techniques that can provide more accurate models in a short time frame. Current methods for automated metabolic network reconstruction rely on gene and reaction annotations to build draft metabolic networks and algorithms to fill gaps in these networks. However, automated reconstruction is hampered by database inconsistencies, incorrect annotations, and gap filling largely without considering genomic information. Here we develop an approach for applying genomic information to predict alternative functions for genes and estimate their likelihoods from sequence homology. We show that computed likelihood values were significantly higher for annotations found in manually curated metabolic networks than those that were not. We then apply these alternative functional predictions to estimate reaction likelihoods, which are used in a new gap filling approach called likelihood-based gap filling to predict more genomically consistent solutions. To validate the likelihood-based gap filling approach, we applied it to models where essential pathways were removed, finding that likelihood-based gap filling identified more biologically relevant solutions than parsimony-based gap filling approaches. We also demonstrate that models gap filled using likelihood-based gap filling provide greater coverage and genomic consistency with metabolic gene functions compared to parsimony-based approaches. Interestingly, despite these findings, we found that likelihoods did not significantly affect consistency of gap filled models with Biolog and knockout lethality data. This indicates that the phenotype data alone cannot necessarily be used to discriminate between alternative solutions for gap filling and therefore, that the use of other information is necessary to

  12. Likelihood-based gene annotations for gap filling and quality assessment in genome-scale metabolic models

    DOE PAGES

    Benedict, Matthew N.; Mundy, Michael B.; Henry, Christopher S.; Chia, Nicholas; Price, Nathan D.; Maranas, Costas D.

    2014-10-16

    Genome-scale metabolic models provide a powerful means to harness information from genomes to deepen biological insights. With exponentially increasing sequencing capacity, there is an enormous need for automated reconstruction techniques that can provide more accurate models in a short time frame. Current methods for automated metabolic network reconstruction rely on gene and reaction annotations to build draft metabolic networks and algorithms to fill gaps in these networks. However, automated reconstruction is hampered by database inconsistencies, incorrect annotations, and gap filling largely without considering genomic information. Here we develop an approach for applying genomic information to predict alternative functions for genesmore » and estimate their likelihoods from sequence homology. We show that computed likelihood values were significantly higher for annotations found in manually curated metabolic networks than those that were not. We then apply these alternative functional predictions to estimate reaction likelihoods, which are used in a new gap filling approach called likelihood-based gap filling to predict more genomically consistent solutions. To validate the likelihood-based gap filling approach, we applied it to models where essential pathways were removed, finding that likelihood-based gap filling identified more biologically relevant solutions than parsimony-based gap filling approaches. We also demonstrate that models gap filled using likelihood-based gap filling provide greater coverage and genomic consistency with metabolic gene functions compared to parsimony-based approaches. Interestingly, despite these findings, we found that likelihoods did not significantly affect consistency of gap filled models with Biolog and knockout lethality data. This indicates that the phenotype data alone cannot necessarily be used to discriminate between alternative solutions for gap filling and therefore, that the use of other information is necessary

  13. Likelihood-based gene annotations for gap filling and quality assessment in genome-scale metabolic models.

    PubMed

    Benedict, Matthew N; Mundy, Michael B; Henry, Christopher S; Chia, Nicholas; Price, Nathan D

    2014-10-01

    Genome-scale metabolic models provide a powerful means to harness information from genomes to deepen biological insights. With exponentially increasing sequencing capacity, there is an enormous need for automated reconstruction techniques that can provide more accurate models in a short time frame. Current methods for automated metabolic network reconstruction rely on gene and reaction annotations to build draft metabolic networks and algorithms to fill gaps in these networks. However, automated reconstruction is hampered by database inconsistencies, incorrect annotations, and gap filling largely without considering genomic information. Here we develop an approach for applying genomic information to predict alternative functions for genes and estimate their likelihoods from sequence homology. We show that computed likelihood values were significantly higher for annotations found in manually curated metabolic networks than those that were not. We then apply these alternative functional predictions to estimate reaction likelihoods, which are used in a new gap filling approach called likelihood-based gap filling to predict more genomically consistent solutions. To validate the likelihood-based gap filling approach, we applied it to models where essential pathways were removed, finding that likelihood-based gap filling identified more biologically relevant solutions than parsimony-based gap filling approaches. We also demonstrate that models gap filled using likelihood-based gap filling provide greater coverage and genomic consistency with metabolic gene functions compared to parsimony-based approaches. Interestingly, despite these findings, we found that likelihoods did not significantly affect consistency of gap filled models with Biolog and knockout lethality data. This indicates that the phenotype data alone cannot necessarily be used to discriminate between alternative solutions for gap filling and therefore, that the use of other information is necessary to

  14. Draft genome sequence and annotation of Lactobacillus acetotolerans BM-LA14527, a beer-spoilage bacteria.

    PubMed

    Liu, Junyan; Li, Lin; Peters, Brian M; Li, Bing; Deng, Yang; Xu, Zhenbo; Shirtliff, Mark E

    2016-09-01

    Lactobacillus acetotolerans is a hard-to-culture beer-spoilage bacterium capable of entering into the viable putative nonculturable (VPNC) state. As part of an initial strategy to investigate the phenotypic behavior of L. acetotolerans, draft genome sequencing was performed. Results demonstrated a total of 1824 predicted annotated genes, with several potential VPNC- and beer-spoilage-associated genes identified. Importantly, this is the first genome sequence of L. acetotolerans as beer-spoilage bacteria and it may aid in further analysis of L. acetotolerans and other beer-spoilage bacteria, with direct implications for food safety control in the beer brewing industry.

  15. Draft genome sequence and annotation of Lactobacillus acetotolerans BM-LA14527, a beer-spoilage bacteria.

    PubMed

    Liu, Junyan; Li, Lin; Peters, Brian M; Li, Bing; Deng, Yang; Xu, Zhenbo; Shirtliff, Mark E

    2016-09-01

    Lactobacillus acetotolerans is a hard-to-culture beer-spoilage bacterium capable of entering into the viable putative nonculturable (VPNC) state. As part of an initial strategy to investigate the phenotypic behavior of L. acetotolerans, draft genome sequencing was performed. Results demonstrated a total of 1824 predicted annotated genes, with several potential VPNC- and beer-spoilage-associated genes identified. Importantly, this is the first genome sequence of L. acetotolerans as beer-spoilage bacteria and it may aid in further analysis of L. acetotolerans and other beer-spoilage bacteria, with direct implications for food safety control in the beer brewing industry. PMID:27559043

  16. CpGAVAS, an integrated web server for the annotation, visualization, analysis, and GenBank submission of completely sequenced chloroplast genome sequences

    PubMed Central

    2012-01-01

    Background The complete sequences of chloroplast genomes provide wealthy information regarding the evolutionary history of species. With the advance of next-generation sequencing technology, the number of completely sequenced chloroplast genomes is expected to increase exponentially, powerful computational tools annotating the genome sequences are in urgent need. Results We have developed a web server CPGAVAS. The server accepts a complete chloroplast genome sequence as input. First, it predicts protein-coding and rRNA genes based on the identification and mapping of the most similar, full-length protein, cDNA and rRNA sequences by integrating results from Blastx, Blastn, protein2genome and est2genome programs. Second, tRNA genes and inverted repeats (IR) are identified using tRNAscan, ARAGORN and vmatch respectively. Third, it calculates the summary statistics for the annotated genome. Fourth, it generates a circular map ready for publication. Fifth, it can create a Sequin file for GenBank submission. Last, it allows the extractions of protein and mRNA sequences for given list of genes and species. The annotation results in GFF3 format can be edited using any compatible annotation editing tools. The edited annotations can then be uploaded to CPGAVAS for update and re-analyses repeatedly. Using known chloroplast genome sequences as test set, we show that CPGAVAS performs comparably to another application DOGMA, while having several superior functionalities. Conclusions CPGAVAS allows the semi-automatic and complete annotation of a chloroplast genome sequence, and the visualization, editing and analysis of the annotation results. It will become an indispensible tool for researchers studying chloroplast genomes. The software is freely accessible from http://www.herbalgenomics.org/cpgavas. PMID:23256920

  17. Generation, functional annotation and comparative analysis of black spruce (Picea mariana) ESTs: an important conifer genomic resource

    PubMed Central

    2013-01-01

    Background EST (expressed sequence tag) sequences and their annotation provide a highly valuable resource for gene discovery, genome sequence annotation, and other genomics studies that can be applied in genetics, breeding and conservation programs for non-model organisms. Conifers are long-lived plants that are ecologically and economically important globally, and have a large genome size. Black spruce (Picea mariana), is a transcontinental species of the North American boreal and temperate forests. However, there are limited transcriptomic and genomic resources for this species. The primary objective of our study was to develop a black spruce transcriptomic resource to facilitate on-going functional genomics projects related to growth and adaptation to climate change. Results We conducted bidirectional sequencing of cDNA clones from a standard cDNA library constructed from black spruce needle tissues. We obtained 4,594 high quality (2,455 5' end and 2,139 3' end) sequence reads, with an average read-length of 532 bp. Clustering and assembly of ESTs resulted in 2,731 unique sequences, consisting of 2,234 singletons and 497 contigs. Approximately two-thirds (63%) of unique sequences were functionally annotated. Genes involved in 36 molecular functions and 90 biological processes were discovered, including 24 putative transcription factors and 232 genes involved in photosynthesis. Most abundantly expressed transcripts were associated with photosynthesis, growth factors, stress and disease response, and transcription factors. A total of 216 full-length genes were identified. About 18% (493) of the transcripts were novel, representing an important addition to the Genbank EST database (dbEST). Fifty-seven di-, tri-, tetra- and penta-nucleotide simple sequence repeats were identified. Conclusions We have developed the first high quality EST resource for black spruce and identified 493 novel transcripts, which may be species-specific related to life history and

  18. Computational prediction of over-annotated protein-coding genes in the genome of Agrobacterium tumefaciens strain C58

    NASA Astrophysics Data System (ADS)

    Yu, Jia-Feng; Sui, Tian-Xiang; Wang, Hong-Mei; Wang, Chun-Ling; Jing, Li; Wang, Ji-Hua

    2015-12-01

    Agrobacterium tumefaciens strain C58 is a type of pathogen that can cause tumors in some dicotyledonous plants. Ever since the genome of A. tumefaciens strain C58 was sequenced, the quality of annotation of its protein-coding genes has been queried continually, because the annotation varies greatly among different databases. In this paper, the questionable hypothetical genes were re-predicted by integrating the TN curve and Z curve methods. As a result, 30 genes originally annotated as “hypothetical” were discriminated as being non-coding sequences. By testing the re-prediction program 10 times on data sets composed of the function-known genes, the mean accuracy of 99.99% and mean Matthews correlation coefficient value of 0.9999 were obtained. Further sequence analysis and COG analysis showed that the re-annotation results were very reliable. This work can provide an efficient tool and data resources for future studies of A. tumefaciens strain C58. Project supported by the National Natural Science Foundation of China (Grant Nos. 61302186 and 61271378) and the Funding from the State Key Laboratory of Bioelectronics of Southeast University.

  19. Initiating the mollusk genomics annotation community: toward creating the complete curated gene-set of the Japanese Pearl Oyster, Pinctada fucata.

    PubMed

    Kawashima, Takeshi; Takeuchi, Takeshi; Koyanagi, Ryo; Kinoshita, Shigeharu; Endo, Hirotoshi; Endo, Kazuyoshi

    2013-10-01

    The genome sequence of the Japanese pearl oyster, the first draft genome from a mollusk, was published in February 2012. In order to curate the draft genome assemblies and annotate the predicted gene models, two annotation Jamborees were held in Okinawa and Tokyo. To date, 761 genes have been surveyed and curated. A preparatory meeting and a debriefing were held at the Misaki Marine Biological Station before and after the Jamborees. These four events, in conjunction with the sequence-decoding project, have facilitated the first series of gene annotations. Genome annotators among the Jamboree participants added 22 functional categories to the annotation system to date. Of these, 17 are included in Generic Gene Ontology. The other five categories are specific to molluskan biology, such as "Byssus Formation" and "Shell Formation", including Biomineralization and Acidic Proteins. A total of 731 genes from our latest version of gene models are annotated and classified into these 22 categories. The resulting data will serve as a useful reference for future genomic analyses of this species as well as comparative analyses among mollusks.

  20. ChIP-Seq-Annotated Heliconius erato Genome Highlights Patterns of cis-Regulatory Evolution in Lepidoptera.

    PubMed

    Lewis, James J; van der Burg, Karin R L; Mazo-Vargas, Anyi; Reed, Robert D

    2016-09-13

    Uncovering phylogenetic patterns of cis-regulatory evolution remains a fundamental goal for evolutionary and developmental biology. Here, we characterize the evolution of regulatory loci in butterflies and moths using chromatin immunoprecipitation sequencing (ChIP-seq) annotation of regulatory elements across three stages of head development. In the process we provide a high-quality, functionally annotated genome assembly for the butterfly, Heliconius erato. Comparing cis-regulatory element conservation across six lepidopteran genomes, we find that regulatory sequences evolve at a pace similar to that of protein-coding regions. We also observe that elements active at multiple developmental stages are markedly more conserved than elements with stage-specific activity. Surprisingly, we also find that stage-specific proximal and distal regulatory elements evolve at nearly identical rates. Our study provides a benchmark for genome-wide patterns of regulatory element evolution in insects, and it shows that developmental timing of activity strongly predicts patterns of regulatory sequence evolution. PMID:27626657

  1. On genome annotation of Brucellaphage Gadvasu (BpG): discovery of ORFans for integrated systems biology approaches.

    PubMed

    Chachra, Deepti; Kaur, Pushpinder; Siddavatam, Prasad; Suravajhala, Prashanth; Saxena, Hari Mohan

    2015-12-01

    Brucellaphage Gadvasu (BpG) is a lytic phage infecting Brucella spp. Brucellaphages contain dsDNA as genetic material and are short-tailed particles with host-specificity. Here, we report the challenges on annotation in the complete genome sequence of BpG when compared with that of a recent broad host-range brucellaphage Pr, an original reference genome. The extracted DNA was subjected to genome sequencing with Illumina technology and assembled using SSAKE/Velvet. A significant number of genes were found to be similar between the phages with sequence analysis revealing conserved open reading frames that correspond to 33 gene ontology classifiers, transcriptional terminators and a few putative transcriptional promoters. The analyses revealed that the genome constitutes 1269 contigs and 275 genes encoding 260 proteins. The sequence comparison from the reference data indicated that the genome shares an approximately 70 % nucleotide similarity and differs mainly in the region encoding proteins. We bring this commentary providing an overview of how this exemplar genome can allow us to understand these known unknown regions in brucellaphages.

  2. The new modern era of yeast genomics: community sequencing and the resulting annotation of multiple Saccharomyces cerevisiae strains at the Saccharomyces Genome Database

    PubMed Central

    Engel, Stacia R.; Cherry, J. Michael

    2013-01-01

    The first completed eukaryotic genome sequence was that of the yeast Saccharomyces cerevisiae, and the Saccharomyces Genome Database (SGD; http://www.yeastgenome.org/) is the original model organism database. SGD remains the authoritative community resource for the S. cerevisiae reference genome sequence and its annotation, and continues to provide comprehensive biological information correlated with S. cerevisiae genes and their products. A diverse set of yeast strains have been sequenced to explore commercial and laboratory applications, and a brief history of those strains is provided. The publication of these new genomes has motivated the creation of new tools, and SGD will annotate and provide comparative analyses of these sequences, correlating changes with variations in strain phenotypes and protein function. We are entering a new era at SGD, as we incorporate these new sequences and make them accessible to the scientific community, all in an effort to continue in our mission of educating researchers and facilitating discovery. Database URL: http://www.yeastgenome.org/ PMID:23487186

  3. SNPMeta: SNP annotation and SNP metadata collection without a reference genome

    Technology Transfer Automated Retrieval System (TEKTRAN)

    The increase in availability of resequencing data is greatly accelerating SNP discovery and has facilitated the development of SNP genotyping assays. This, in turn, is increasing interest in annotation of individual SNPs. Currently, these data are only available through curation, or comparison to a ...

  4. KSHV 2.0: A Comprehensive Annotation of the Kaposi's Sarcoma-Associated Herpesvirus Genome Using Next-Generation Sequencing Reveals Novel Genomic and Functional Features

    PubMed Central

    Arias, Carolina; Weisburd, Ben; Stern-Ginossar, Noam; Mercier, Alexandre; Madrid, Alexis S.; Bellare, Priya; Holdorf, Meghan; Weissman, Jonathan S.; Ganem, Don

    2014-01-01

    Productive herpesvirus infection requires a profound, time-controlled remodeling of the viral transcriptome and proteome. To gain insights into the genomic architecture and gene expression control in Kaposi's sarcoma-associated herpesvirus (KSHV), we performed a systematic genome-wide survey of viral transcriptional and translational activity throughout the lytic cycle. Using mRNA-sequencing and ribosome profiling, we found that transcripts encoding lytic genes are promptly bound by ribosomes upon lytic reactivation, suggesting their regulation is mainly transcriptional. Our approach also uncovered new genomic features such as ribosome occupancy of viral non-coding RNAs, numerous upstream and small open reading frames (ORFs), and unusual strategies to expand the virus coding repertoire that include alternative splicing, dynamic viral mRNA editing, and the use of alternative translation initiation codons. Furthermore, we provide a refined and expanded annotation of transcription start sites, polyadenylation sites, splice junctions, and initiation/termination codons of known and new viral features in the KSHV genomic space which we have termed KSHV 2.0. Our results represent a comprehensive genome-scale image of gene regulation during lytic KSHV infection that substantially expands our understanding of the genomic architecture and coding capacity of the virus. PMID:24453964

  5. Comparative Life Cycle Transcriptomics Revises Leishmania mexicana Genome Annotation and Links a Chromosome Duplication with Parasitism of Vertebrates

    PubMed Central

    Fiebig, Michael; Kelly, Steven; Gluenz, Eva

    2015-01-01

    Leishmania spp. are protozoan parasites that have two principal life cycle stages: the motile promastigote forms that live in the alimentary tract of the sandfly and the amastigote forms, which are adapted to survive and replicate in the harsh conditions of the phagolysosome of mammalian macrophages. Here, we used Illumina sequencing of poly-A selected RNA to characterise and compare the transcriptomes of L. mexicana promastigotes, axenic amastigotes and intracellular amastigotes. These data allowed the production of the first transcriptome evidence-based annotation of gene models for this species, including genome-wide mapping of trans-splice sites and poly-A addition sites. The revised genome annotation encompassed 9,169 protein-coding genes including 936 novel genes as well as modifications to previously existing gene models. Comparative analysis of gene expression across promastigote and amastigote forms revealed that 3,832 genes are differentially expressed between promastigotes and intracellular amastigotes. A large proportion of genes that were downregulated during differentiation to amastigotes were associated with the function of the motile flagellum. In contrast, those genes that were upregulated included cell surface proteins, transporters, peptidases and many uncharacterized genes, including 293 of the 936 novel genes. Genome-wide distribution analysis of the differentially expressed genes revealed that the tetraploid chromosome 30 is highly enriched for genes that were upregulated in amastigotes, providing the first evidence of a link between this whole chromosome duplication event and adaptation to the vertebrate host in this group. Peptide evidence for 42 proteins encoded by novel transcripts supports the idea of an as yet uncharacterised set of small proteins in Leishmania spp. with possible implications for host-pathogen interactions. PMID:26452044

  6. Automated update, revision, and quality control of the maize genome annotations using MAKER-P improves the B73 RefGen_v3 gene models and identifies new genes

    Technology Transfer Automated Retrieval System (TEKTRAN)

    The large size and relative complexity of many plant genomes make creation, quality control, and dissemination of high-quality gene structure annotations challenging. In response, we have developed MAKER-P, a fast and easy-to-use genome annotation engine for plants. Here, we report the use of MAKER-...

  7. The Disease Portals, disease-gene annotation and the RGD disease ontology at the Rat Genome Database.

    PubMed

    Hayman, G Thomas; Laulederkind, Stanley J F; Smith, Jennifer R; Wang, Shur-Jen; Petri, Victoria; Nigam, Rajni; Tutaj, Marek; De Pons, Jeff; Dwinell, Melinda R; Shimoyama, Mary

    2016-01-01

    The Rat Genome Database (RGD;http://rgd.mcw.edu/) provides critical datasets and software tools to a diverse community of rat and non-rat researchers worldwide. To meet the needs of the many users whose research is disease oriented, RGD has created a series of Disease Portals and has prioritized its curation efforts on the datasets important to understanding the mechanisms of various diseases. Gene-disease relationships for three species, rat, human and mouse, are annotated to capture biomarkers, genetic associations, molecular mechanisms and therapeutic targets. To generate gene-disease annotations more effectively and in greater detail, RGD initially adopted the MEDIC disease vocabulary from the Comparative Toxicogenomics Database and adapted it for use by expanding this framework with the addition of over 1000 terms to create the RGD Disease Ontology (RDO). The RDO provides the foundation for, at present, 10 comprehensive disease area-related dataset and analysis platforms at RGD, the Disease Portals. Two major disease areas are the focus of data acquisition and curation efforts each year, leading to the release of the related Disease Portals. Collaborative efforts to realize a more robust disease ontology are underway. Database URL:http://rgd.mcw.edu.

  8. The Disease Portals, disease–gene annotation and the RGD disease ontology at the Rat Genome Database

    PubMed Central

    Hayman, G. Thomas; Laulederkind, Stanley J. F.; Smith, Jennifer R.; Wang, Shur-Jen; Petri, Victoria; Nigam, Rajni; Tutaj, Marek; De Pons, Jeff; Dwinell, Melinda R.; Shimoyama, Mary

    2016-01-01

    The Rat Genome Database (RGD; http://rgd.mcw.edu/) provides critical datasets and software tools to a diverse community of rat and non-rat researchers worldwide. To meet the needs of the many users whose research is disease oriented, RGD has created a series of Disease Portals and has prioritized its curation efforts on the datasets important to understanding the mechanisms of various diseases. Gene-disease relationships for three species, rat, human and mouse, are annotated to capture biomarkers, genetic associations, molecular mechanisms and therapeutic targets. To generate gene–disease annotations more effectively and in greater detail, RGD initially adopted the MEDIC disease vocabulary from the Comparative Toxicogenomics Database and adapted it for use by expanding this framework with the addition of over 1000 terms to create the RGD Disease Ontology (RDO). The RDO provides the foundation for, at present, 10 comprehensive disease area-related dataset and analysis platforms at RGD, the Disease Portals. Two major disease areas are the focus of data acquisition and curation efforts each year, leading to the release of the related Disease Portals. Collaborative efforts to realize a more robust disease ontology are underway. Database URL: http://rgd.mcw.edu PMID:27009807

  9. An Innovative Plant Genomics and Gene Annotation Program for High School, Community College, and University Faculty

    ERIC Educational Resources Information Center

    Hacisalihoglu, Gokhan; Hilgert, Uwe; Nash, E. Bruce; Micklos, David A.

    2008-01-01

    Today's biology educators face the challenge of training their students in modern molecular biology techniques including genomics and bioinformatics. The Dolan DNA Learning Center (DNALC) of Cold Spring Harbor Laboratory has developed and disseminated a bench- and computer-based plant genomics curriculum for biology faculty. In 2007, a five-day…

  10. VibrioBase: a model for next-generation genome and annotation database development.

    PubMed

    Choo, Siew Woh; Heydari, Hamed; Tan, Tze King; Siow, Cheuk Chuen; Beh, Ching Yew; Wee, Wei Yee; Mutha, Naresh V R; Wong, Guat Jah; Ang, Mia Yang; Yazdi, Amir Hessam

    2014-01-01

    To facilitate the ongoing research of Vibrio spp., a dedicated platform for the Vibrio research community is needed to host the fast-growing amount of genomic data and facilitate the analysis of these data. We present VibrioBase, a useful resource platform, providing all basic features of a sequence database with the addition of unique analysis tools which could be valuable for the Vibrio research community. VibrioBase currently houses a total of 252 Vibrio genomes developed in a user-friendly manner and useful to enable the analysis of these genomic data, particularly in the field of comparative genomics. Besides general data browsing features, VibrioBase offers analysis tools such as BLAST interfaces and JBrowse genome browser. Other important features of this platform include our newly developed in-house tools, the pairwise genome comparison (PGC) tool, and pathogenomics profiling tool (PathoProT). The PGC tool is useful in the identification and comparative analysis of two genomes, whereas PathoProT is designed for comparative pathogenomics analysis of Vibrio strains. Both of these tools will enable researchers with little experience in bioinformatics to get meaningful information from Vibrio genomes with ease. We have tested the validity and suitability of these tools and features for use in the next-generation database development.

  11. Carbohydrate catabolic flexibility in the mammalian intestinal commensal Lactobacillus ruminis revealed by fermentation studies aligned to genome annotations

    PubMed Central

    2011-01-01

    Background Lactobacillus ruminis is a poorly characterized member of the Lactobacillus salivarius clade that is part of the intestinal microbiota of pigs, humans and other mammals. Its variable abundance in human and animals may be linked to historical changes over time and geographical differences in dietary intake of complex carbohydrates. Results In this study, we investigated the ability of nine L. ruminis strains of human and bovine origin to utilize fifty carbohydrates including simple sugars, oligosaccharides, and prebiotic polysaccharides. The growth patterns were compared with metabolic pathways predicted by annotation of a high quality draft genome sequence of ATCC 25644 (human isolate) and the complete genome of ATCC 27782 (bovine isolate). All of the strains tested utilized prebiotics including fructooligosaccharides (FOS), soybean-oligosaccharides (SOS) and 1,3:1,4-β-D-gluco-oligosaccharides to varying degrees. Six strains isolated from humans utilized FOS-enriched inulin, as well as FOS. In contrast, three strains isolated from cows grew poorly in FOS-supplemented medium. In general, carbohydrate utilisation patterns were strain-dependent and also varied depending on the degree of polymerisation or complexity of structure. Six putative operons were identified in the genome of the human isolate ATCC 25644 for the transport and utilisation of the prebiotics FOS, galacto-oligosaccharides (GOS), SOS, and 1,3:1,4-β-D-Gluco-oligosaccharides. One of these comprised a novel FOS utilisation operon with predicted capacity to degrade chicory-derived FOS. However, only three of these operons were identified in the ATCC 27782 genome that might account for the utilisation of only SOS and 1,3:1,4-β-D-Gluco-oligosaccharides. Conclusions This study has provided definitive genome-based evidence to support the fermentation patterns of nine strains of Lactobacillus ruminis, and has linked it to gene distribution patterns in strains from different sources. Furthermore

  12. DAS Writeback: A Collaborative Annotation System

    PubMed Central

    2011-01-01

    Background Centralised resources such as GenBank and UniProt are perfect examples of the major international efforts that have been made to integrate and share biological information. However, additional data that adds value to these resources needs a simple and rapid route to public access. The Distributed Annotation System (DAS) provides an adequate environment to integrate genomic and proteomic information from multiple sources, making this information accessible to the community. DAS offers a way to distribute and access information but it does not provide domain experts with the mechanisms to participate in the curation process of the available biological entities and their annotations. Results We designed and developed a Collaborative Annotation System for proteins called DAS Writeback. DAS writeback is a protocol extension of DAS to provide the functionalities of adding, editing and deleting annotations. We implemented this new specification as extensions of both a DAS server and a DAS client. The architecture was designed with the involvement of the DAS community and it was improved after performing usability experiments emulating a real annotation task. Conclusions We demonstrate that DAS Writeback is effective, usable and will provide the appropriate environment for the creation and evolution of community protein annotation. PMID:21569281

  13. Genome-Wide Functional Annotation of Human Protein-Coding Splice Variants Using Multiple Instance Learning.

    PubMed

    Panwar, Bharat; Menon, Rajasree; Eksi, Ridvan; Li, Hong-Dong; Omenn, Gilbert S; Guan, Yuanfang

    2016-06-01

    The vast majority of human multiexon genes undergo alternative splicing and produce a variety of splice variant transcripts and proteins, which can perform different functions. These protein-coding splice variants (PCSVs) greatly increase the functional diversity of proteins. Most functional annotation algorithms have been developed at the gene level; the lack of isoform-level gold standards is an important intellectual limitation for currently available machine learning algorithms. The accumulation of a large amount of RNA-seq data in the public domain greatly increases our ability to examine the functional annotation of genes at isoform level. In the present study, we used a multiple instance learning (MIL)-based approach for predicting the function of PCSVs. We used transcript-level expression values and gene-level functional associations from the Gene Ontology database. A support vector machine (SVM)-based 5-fold cross-validation technique was applied. Comparatively, genes with multiple PCSVs performed better than single PCSV genes, and performance also improved when more examples were available to train the models. We demonstrated our predictions using literature evidence of ADAM15, LMNA/C, and DMXL2 genes. All predictions have been implemented in a web resource called "IsoFunc", which is freely available for the global scientific community through http://guanlab.ccmb.med.umich.edu/isofunc . PMID:27142340

  14. Reconciliation of Sequence Data and Updated Annotation of the Genome of Agrobacterium tumefaciens C58, and Distribution of a Linear Chromosome in the Genus Agrobacterium

    PubMed Central

    Slater, Steven; Setubal, João C.; Houmiel, Kathryn; Sun, Jian; Kaul, Rajinder; Goldman, Barry S.; Farrand, Stephen K.; Almeida, Nalvo; Burr, Thomas; Nester, Eugene; Rhoads, David M.; Kadoi, Ryosuke; Ostheimer, Trucian; Pride, Nicole; Sabo, Allison; Henry, Erin; Telepak, Erin; Cromes, Lindsey; Harkleroad, Alana; Oliphant, Louis; Pratt-Szegila, Phil; Welch, Roy; Wood, Derek

    2013-01-01

    Two groups independently sequenced the Agrobacterium tumefaciens C58 genome in 2001. We report here consolidation of these sequences, updated annotation, and additional analysis of the evolutionary history of the linear chromosome, which is apparently limited to the biovar I group of Agrobacterium. PMID:23241979

  15. Assembly and annotation of full mitochondrial genomes for the corn rootworm species, Diabrotica virgifera virgifera and D. barberi (Insecta: Coleoptera: Chrysomelidae), using Next Generation Sequence data

    Technology Transfer Automated Retrieval System (TEKTRAN)

    Complete mitochondrial genomes for two corn rootworm species, Diabrotica v. virgifera (16,747 bp) and D. barberi (16,632; Insecta: Coleoptera: Chrysomelidae), were assembled from Illumina HiSeq2000 read data. Annotation indicated that the order and orientation of 13 protein coding genes (PCGs), and...

  16. Functional Annotation Analytics of Bacillus Genomes Reveals Stress Responsive Acetate Utilization and Sulfate Uptake in the Biotechnologically Relevant Bacillus megaterium.

    PubMed

    Williams, Baraka S; Isokpehi, Raphael D; Mbah, Andreas N; Hollman, Antoinesha L; Bernard, Christina O; Simmons, Shaneka S; Ayensu, Wellington K; Garner, Bianca L

    2012-01-01

    Bacillus species form an heterogeneous group of Gram-positive bacteria that include members that are disease-causing, biotechnologically-relevant, and can serve as biological research tools. A common feature of Bacillus species is their ability to survive in harsh environmental conditions by formation of resistant endospores. Genes encoding the universal stress protein (USP) domain confer cellular and organismal survival during unfavorable conditions such as nutrient depletion. As of February 2012, the genome sequences and a variety of functional annotations for at least 123 Bacillus isolates including 45 Bacillus cereus isolates were available in public domain bioinformatics resources. Additionally, the genome sequencing status of 10 of the B. cereus isolates were annotated as finished with each genome encoded 3 USP genes. The conservation of gene neighborhood of the 140 aa universal stress protein in the B. cereus genomes led to the identification of a predicted plasmid-encoded transcriptional unit that includes a USP gene and a sulfate uptake gene in the soil-inhabiting Bacillus megaterium. Gene neighborhood analysis combined with visual analytics of chemical ligand binding sites data provided knowledge-building biological insights on possible cellular functions of B. megaterium universal stress proteins. These functions include sulfate and potassium uptake, acid extrusion, cellular energy-level sensing, survival in high oxygen conditions and acetate utilization. Of particular interest was a two-gene transcriptional unit that consisted of genes for a universal stress protein and a sirtuin Sir2 (deacetylase enzyme for NAD+-dependent acetate utilization). The predicted transcriptional units for stress responsive inorganic sulfate uptake and acetate utilization could explain biological mechanisms for survival of soil-inhabiting Bacillus species in sulfate and acetate limiting conditions. Considering the key role of sirtuins in mammalian physiology additional

  17. Functional Annotation Analytics of Bacillus Genomes Reveals Stress Responsive Acetate Utilization and Sulfate Uptake in the Biotechnologically Relevant Bacillus megaterium

    PubMed Central

    Williams, Baraka S.; Isokpehi, Raphael D.; Mbah, Andreas N.; Hollman, Antoinesha L.; Bernard, Christina O.; Simmons, Shaneka S.; Ayensu, Wellington K.; Garner, Bianca L.

    2012-01-01

    Bacillus species form an heterogeneous group of Gram-positive bacteria that include members that are disease-causing, biotechnologically-relevant, and can serve as biological research tools. A common feature of Bacillus species is their ability to survive in harsh environmental conditions by formation of resistant endospores. Genes encoding the universal stress protein (USP) domain confer cellular and organismal survival during unfavorable conditions such as nutrient depletion. As of February 2012, the genome sequences and a variety of functional annotations for at least 123 Bacillus isolates including 45 Bacillus cereus isolates were available in public domain bioinformatics resources. Additionally, the genome sequencing status of 10 of the B. cereus isolates were annotated as finished with each genome encoded 3 USP genes. The conservation of gene neighborhood of the 140 aa universal stress protein in the B. cereus genomes led to the identification of a predicted plasmid-encoded transcriptional unit that includes a USP gene and a sulfate uptake gene in the soil-inhabiting Bacillus megaterium. Gene neighborhood analysis combined with visual analytics of chemical ligand binding sites data provided knowledge-building biological insights on possible cellular functions of B. megaterium universal stress proteins. These functions include sulfate and potassium uptake, acid extrusion, cellular energy-level sensing, survival in high oxygen conditions and acetate utilization. Of particular interest was a two-gene transcriptional unit that consisted of genes for a universal stress protein and a sirtuin Sir2 (deacetylase enzyme for NAD+-dependent acetate utilization). The predicted transcriptional units for stress responsive inorganic sulfate uptake and acetate utilization could explain biological mechanisms for survival of soil-inhabiting Bacillus species in sulfate and acetate limiting conditions. Considering the key role of sirtuins in mammalian physiology additional

  18. Genome sequencing and annotation of Acinetobacter gerneri strain MTCC 9824(T).

    PubMed

    Singh, Nitin Kumar; Khatri, Indu; Subramanian, Srikrishna; Mayilraj, Shanmugam

    2014-12-01

    The genus Acinetobacter consists of 31 validly published species ubiquitously distributed in nature and primarily associated with nosocomial infection. We report the 4.4 Mb genome of Acinetobacter gerneri strain MTCC 9824(T). The genome has a G + C content of 38.0% and includes 3 rRNA genes (5S, 23S16S) and 64 aminoacyl-tRNA synthetase genes.

  19. Genome sequencing and annotation of Acinetobacter gyllenbergii strain MTCC 11365(T).

    PubMed

    Singh, Nitin Kumar; Khatri, Indu; Subramanian, Srikrishna; Mayilraj, Shanmugam

    2014-12-01

    The genus Acinetobacter consists of 31 validly published species ubiquitously distributed in nature and primarily associated with nosocomial infection. We report 4.3 Mb genome of the Acinetobacter gyllenbergii strain MTCC 11365(T). The draft genome of A. gyllenbergii has a G + C content of 41.0% and includes 3 rRNA genes (5S, 23S, 16S) and 67 aminoacyl-tRNA synthetase genes.

  20. Genome sequencing and annotation of Acinetobacter haemolyticus strain MTCC 9819(T).

    PubMed

    Khatri, Indu; Singh, Nitin Kumar; Subramanian, Srikrishna; Mayilraj, Shanmugam

    2014-12-01

    The genus Acinetobacter consists of 31 validly published species ubiquitously distributed in nature and primarily associated with nosocomial infection. We report the 3.4 Mb genome of Acinetobacter haemolyticus strain MTCC 9819(T). The genome has a G + C content of 40.0% and includes 3 rRNA genes (5S, 23S, 16S) and 65 aminoacyl-tRNA synthetase genes.

  1. Genome Annotation Provides Insight into Carbon Monoxide and Hydrogen Metabolism in Rubrivivax gelatinosus

    PubMed Central

    Wawrousek, Karen; Noble, Scott; Korlach, Jonas; Chen, Jin; Eckert, Carrie; Yu, Jianping; Maness, Pin-Ching

    2014-01-01

    We report here the sequencing and analysis of the genome of the purple non-sulfur photosynthetic bacterium Rubrivivax gelatinosus CBS. This microbe is a model for studies of its carboxydotrophic life style under anaerobic condition, based on its ability to utilize carbon monoxide (CO) as the sole carbon substrate and water as the electron acceptor, yielding CO2 and H2 as the end products. The CO-oxidation reaction is known to be catalyzed by two enzyme complexes, the CO dehydrogenase and hydrogenase. As expected, analysis of the genome of Rx. gelatinosus CBS reveals the presence of genes encoding both enzyme complexes. The CO-oxidation reaction is CO-inducible, which is consistent with the presence of two putative CO-sensing transcription factors in its genome. Genome analysis also reveals the presence of two additional hydrogenases, an uptake hydrogenase that liberates the electrons in H2 in support of cell growth, and a regulatory hydrogenase that senses H2 and relays the signal to a two-component system that ultimately controls synthesis of the uptake hydrogenase. The genome also contains two sets of hydrogenase maturation genes which are known to assemble the catalytic metallocluster of the hydrogenase NiFe active site. Collectively, the genome sequence and analysis information reveals the blueprint of an intricate network of signal transduction pathways and its underlying regulation that enables Rx. gelatinosus CBS to thrive on CO or H2 in support of cell growth. PMID:25479613

  2. KEGG orthology-based annotation of the predicted proteome of Acropora digitifera: ZoophyteBase - an open access and searchable database of a coral genome

    PubMed Central

    2013-01-01

    Background Contemporary coral reef research has firmly established that a genomic approach is urgently needed to better understand the effects of anthropogenic environmental stress and global climate change on coral holobiont interactions. Here we present KEGG orthology-based annotation of the complete genome sequence of the scleractinian coral Acropora digitifera and provide the first comprehensive view of the genome of a reef-building coral by applying advanced bioinformatics. Description Sequences from the KEGG database of protein function were used to construct hidden Markov models. These models were used to search the predicted proteome of A. digitifera to establish complete genomic annotation. The annotated dataset is published in ZoophyteBase, an open access format with different options for searching the data. A particularly useful feature is the ability to use a Google-like search engine that links query words to protein attributes. We present features of the annotation that underpin the molecular structure of key processes of coral physiology that include (1) regulatory proteins of symbiosis, (2) planula and early developmental proteins, (3) neural messengers, receptors and sensory proteins, (4) calcification and Ca2+-signalling proteins, (5) plant-derived proteins, (6) proteins of nitrogen metabolism, (7) DNA repair proteins, (8) stress response proteins, (9) antioxidant and redox-protective proteins, (10) proteins of cellular apoptosis, (11) microbial symbioses and pathogenicity proteins, (12) proteins of viral pathogenicity, (13) toxins and venom, (14) proteins of the chemical defensome and (15) coral epigenetics. Conclusions We advocate that providing annotation in an open-access searchable database available to the public domain will give an unprecedented foundation to interrogate the fundamental molecular structure and interactions of coral symbiosis and allow critical questions to be addressed at the genomic level based on combined aspects of

  3. Xenopus tropicalis Genome Re-Scaffolding and Re-Annotation Reach the Resolution Required for In Vivo ChIA-PET Analysis

    PubMed Central

    Buisine, Nicolas; Ruan, Xiaoan; Bilesimo, Patrice; Grimaldi, Alexis; Alfama, Gladys; Ariyaratne, Pramila; Mulawadi, Fabianus; Chen, Jieqi; Sung, Wing-Kin; Liu, Edison T.; Demeneix, Barbara A.; Ruan, Yijun; Sachs, Laurent M.

    2015-01-01

    Genome-wide functional analyses require high-resolution genome assembly and annotation. We applied ChIA-PET to analyze gene regulatory networks, including 3D chromosome interactions, underlying thyroid hormone (TH) signaling in the frog Xenopus tropicalis. As the available versions of Xenopus tropicalis assembly and annotation lacked the resolution required for ChIA-PET we improve the genome assembly version 4.1 and annotations using data derived from the paired end tag (PET) sequencing technologies and approaches (e.g., DNA-PET [gPET], RNA-PET etc.). The large insert (~10Kb, ~17Kb) paired end DNA-PET with high throughput NGS sequencing not only significantly improved genome assembly quality, but also strongly reduced genome “fragmentation”, reducing total scaffold numbers by ~60%. Next, RNA-PET technology, designed and developed for the detection of full-length transcripts and fusion mRNA in whole transcriptome studies (ENCODE consortia), was applied to capture the 5' and 3' ends of transcripts. These amendments in assembly and annotation were essential prerequisites for the ChIA-PET analysis of TH transcription regulation. Their application revealed complex regulatory configurations of target genes and the structures of the regulatory networks underlying physiological responses. Our work allowed us to improve the quality of Xenopus tropicalis genomic resources, reaching the standard required for ChIA-PET analysis of transcriptional networks. We consider that the workflow proposed offers useful conceptual and methodological guidance and can readily be applied to other non-conventional models that have low-resolution genome data. PMID:26348928

  4. Xenopus tropicalis Genome Re-Scaffolding and Re-Annotation Reach the Resolution Required for In Vivo ChIA-PET Analysis.

    PubMed

    Buisine, Nicolas; Ruan, Xiaoan; Bilesimo, Patrice; Grimaldi, Alexis; Alfama, Gladys; Ariyaratne, Pramila; Mulawadi, Fabianus; Chen, Jieqi; Sung, Wing-Kin; Liu, Edison T; Demeneix, Barbara A; Ruan, Yijun; Sachs, Laurent M

    2015-01-01

    Genome-wide functional analyses require high-resolution genome assembly and annotation. We applied ChIA-PET to analyze gene regulatory networks, including 3D chromosome interactions, underlying thyroid hormone (TH) signaling in the frog Xenopus tropicalis. As the available versions of Xenopus tropicalis assembly and annotation lacked the resolution required for ChIA-PET we improve the genome assembly version 4.1 and annotations using data derived from the paired end tag (PET) sequencing technologies and approaches (e.g., DNA-PET [gPET], RNA-PET etc.). The large insert (~10 Kb, ~17 Kb) paired end DNA-PET with high throughput NGS sequencing not only significantly improved genome assembly quality, but also strongly reduced genome "fragmentation", reducing total scaffold numbers by ~60%. Next, RNA-PET technology, designed and developed for the detection of full-length transcripts and fusion mRNA in whole transcriptome studies (ENCODE consortia), was applied to capture the 5' and 3' ends of transcripts. These amendments in assembly and annotation were essential prerequisites for the ChIA-PET analysis of TH transcription regulation. Their application revealed complex regulatory configurations of target genes and the structures of the regulatory networks underlying physiological responses. Our work allowed us to improve the quality of Xenopus tropicalis genomic resources, reaching the standard required for ChIA-PET analysis of transcriptional networks. We consider that the workflow proposed offers useful conceptual and methodological guidance and can readily be applied to other non-conventional models that have low-resolution genome data.

  5. Xenopus tropicalis Genome Re-Scaffolding and Re-Annotation Reach the Resolution Required for In Vivo ChIA-PET Analysis.

    PubMed

    Buisine, Nicolas; Ruan, Xiaoan; Bilesimo, Patrice; Grimaldi, Alexis; Alfama, Gladys; Ariyaratne, Pramila; Mulawadi, Fabianus; Chen, Jieqi; Sung, Wing-Kin; Liu, Edison T; Demeneix, Barbara A; Ruan, Yijun; Sachs, Laurent M

    2015-01-01

    Genome-wide functional analyses require high-resolution genome assembly and annotation. We applied ChIA-PET to analyze gene regulatory networks, including 3D chromosome interactions, underlying thyroid hormone (TH) signaling in the frog Xenopus tropicalis. As the available versions of Xenopus tropicalis assembly and annotation lacked the resolution required for ChIA-PET we improve the genome assembly version 4.1 and annotations using data derived from the paired end tag (PET) sequencing technologies and approaches (e.g., DNA-PET [gPET], RNA-PET etc.). The large insert (~10 Kb, ~17 Kb) paired end DNA-PET with high throughput NGS sequencing not only significantly improved genome assembly quality, but also strongly reduced genome "fragmentation", reducing total scaffold numbers by ~60%. Next, RNA-PET technology, designed and developed for the detection of full-length transcripts and fusion mRNA in whole transcriptome studies (ENCODE consortia), was applied to capture the 5' and 3' ends of transcripts. These amendments in assembly and annotation were essential prerequisites for the ChIA-PET analysis of TH transcription regulation. Their application revealed complex regulatory configurations of target genes and the structures of the regulatory networks underlying physiological responses. Our work allowed us to improve the quality of Xenopus tropicalis genomic resources, reaching the standard required for ChIA-PET analysis of transcriptional networks. We consider that the workflow proposed offers useful conceptual and methodological guidance and can readily be applied to other non-conventional models that have low-resolution genome data. PMID:26348928

  6. Annotation of genomics data using bidirectional hidden Markov models unveils variations in Pol II transcription cycle

    PubMed Central

    Zacher, Benedikt; Lidschreiber, Michael; Cramer, Patrick; Gagneur, Julien; Tresch, Achim

    2014-01-01

    DNA replication, transcription and repair involve the recruitment of protein complexes that change their composition as they progress along the genome in a directed or strand-specific manner. Chromatin immunoprecipitation in conjunction with hidden Markov models (HMMs) has been instrumental in understanding these processes, as they segment the genome into discrete states that can be related to DNA-associated protein complexes. However, current HMM-based approaches are not able to assign forward or reverse direction to states or properly integrate strand-specific (e.g., RNA expression) with non-strand-specific (e.g., ChIP) data, which is indispensable to accurately characterize directed processes. To overcome these limitations, we introduce bidirectional HMMs which infer directed genomic states from occupancy profiles de novo. Application to RNA polymerase II-associated factors in yeast and chromatin modifications in human T cells recovers the majority of transcribed loci, reveals gene-specific variations in the yeast transcription cycle and indicates the existence of directed chromatin state patterns at transcribed, but not at repressed, regions in the human genome. In yeast, we identify 32 new transcribed loci, a regulated initiation–elongation transition, the absence of elongation factors Ctk1 and Paf1 from a class of genes, a distinct transcription mechanism for highly expressed genes and novel DNA sequence motifs associated with transcription termination. We anticipate bidirectional HMMs to significantly improve the analyses of genome-associated directed processes. PMID:25527639

  7. Genome annotation past, present, and future: how to define an ORF at each locus.

    PubMed

    Brent, Michael R

    2005-12-01

    Driven by competition, automation, and technology, the genomics community has far exceeded its ambition to sequence the human genome by 2005. By analyzing mammalian genomes, we have shed light on the history of our DNA sequence, determined that alternatively spliced RNAs and retroposed pseudogenes are incredibly abundant, and glimpsed the apparently huge number of non-coding RNAs that play significant roles in gene regulation. Ultimately, genome science is likely to provide comprehensive catalogs of these elements. However, the methods we have been using for most of the last 10 years will not yield even one complete open reading frame (ORF) for every gene--the first plateau on the long climb toward a comprehensive catalog. These strategies--sequencing randomly selected cDNA clones, aligning protein sequences identified in other organisms, sequencing more genomes, and manual curation--will have to be supplemented by large-scale amplification and sequencing of specific predicted mRNAs. The steady improvements in gene prediction that have occurred over the last 10 years have increased the efficacy of this approach and decreased its cost. In this Perspective, I review the state of gene prediction roughly 10 years ago, summarize the progress that has been made since, argue that the primary ORF identification methods we have relied on so far are inadequate, and recommend a path toward completing the Catalog of Protein Coding Genes, Version 1.0. PMID:16339376

  8. Next-Generation High-Throughput Functional Annotation of Microbial Genomes

    PubMed Central

    Baric, Ralph S.; Damania, Blossom; Miller, Samuel I.; Rubin, Eric J.

    2016-01-01

    ABSTRACT Host infection by microbial pathogens cues global changes in microbial and host cell biology that facilitate microbial replication and disease. The complete maps of thousands of bacterial and viral genomes have recently been defined; however, the rate at which physiological or biochemical functions have been assigned to genes has greatly lagged. The National Institute of Allergy and Infectious Diseases (NIAID) addressed this gap by creating functional genomics centers dedicated to developing high-throughput approaches to assign gene function. These centers require broad-based and collaborative research programs to generate and integrate diverse data to achieve a comprehensive understanding of microbial pathogenesis. High-throughput functional genomics can lead to new therapeutics and better understanding of the next generation of emerging pathogens by rapidly defining new general mechanisms by which organisms cause disease and replicate in host tissues and by facilitating the rate at which functional data reach the scientific community. PMID:27703071

  9. Genome annotation provides insight into carbon monoxide and hydrogen metabolism in Rubrivivax gelatinosus

    DOE PAGES

    Wawrousek, Karen; Noble, Scott; Korlach, Jonas; Chen, Jin; Eckert, Carrie; Yu, Jianping; Maness, Pin -Ching

    2014-12-05

    In this article, we report here the sequencing and analysis of the genome of the purple non-sulfur photosynthetic bacterium Rubrivivax gelatinosus CBS. This microbe is a model for studies of its carboxydotrophic life style under anaerobic condition, based on its ability to utilize carbon monoxide (CO) as the sole carbon substrate and water as the electron acceptor, yielding CO2 and H2 as the end products. The CO-oxidation reaction is known to be catalyzed by two enzyme complexes, the CO dehydrogenase and hydrogenase. As expected, analysis of the genome of Rx. gelatinosus CBS reveals the presence of genes encoding both enzymemore » complexes. The CO-oxidation reaction is CO-inducible, which is consistent with the presence of two putative CO-sensing transcription factors in its genome. Genome analysis also reveals the presence of two additional hydrogenases, an uptake hydrogenase that liberates the electrons in H2 in support of cell growth, and a regulatory hydrogenase that senses H2 and relays the signal to a two-component system that ultimately controls synthesis of the uptake hydrogenase. The genome also contains two sets of hydrogenase maturation genes which are known to assemble the catalytic metallocluster of the hydrogenase NiFe active site. Finally and collectively, the genome sequence and analysis information reveals the blueprint of an intricate network of signal transduction pathways and its underlying regulation that enables Rx. gelatinosus CBS to thrive on CO or H2 in support of cell growth.« less

  10. Whole genome sequences and annotation of Micrococcus luteus SUBG006, a novel phytopathogen of mango.

    PubMed

    Rakhashiya, Purvi M; Patel, Pooja P; Thaker, Vrinda S

    2015-12-01

    Actinobaceria, Micrococcus luteus SUBG006 was isolated from infected leaves of Mangifera indica L. vr. Nylon in Rajkot, (22.30°N, 70.78°E), Gujarat, India. The genome size is 3.86 Mb with G + C content of 69.80% and contains 112 rRNA sequences (5S, 16S and 23S). The whole genome sequencing has been deposited in DDBJ/EMBL/GenBank under the accession number JOKP00000000. PMID:26697318

  11. Genome sequencing and annotation of Acinetobacter guillouiae strain MSP 4-18.

    PubMed

    Singh, Nitin Kumar; Khatri, Indu; Subramanian, Srikrishna; Mayilraj, Shanmugam

    2014-12-01

    The genus Acinetobacter consists of 31 validly published species ubiquitously distributed in nature and primarily associated with nosocomial infection. We report the 4.8 Mb genome of Acinetobacter guillouiae MSP 4-18, isolated from a mangrove soil sample from Parangipettai (11°30'N, 79°47'E), Tamil Nadu, India. The draft genome of A. guillouiae MSP 4-18 has a G + C content of 38.0% and includes 3 rRNA genes (5S, 23S, 16S) and 69 aminoacyl-tRNA synthetase genes.

  12. Efficient assembly and annotation of the transcriptome of catfish by RNA-seq analysis of a doubled haploid homozygote

    Technology Transfer Automated Retrieval System (TEKTRAN)

    Upon the completion of whole genome sequencing, thorough genome annotation that associates genome sequences with biological meanings is essential. Genome annotation depends on the availability of transcript information as well as orthology information. In teleost fish, genome annotation is seriousl...

  13. The Complete Genome Sequence and Updated Annotation of Desulfovibrio alaskensis G20

    SciTech Connect

    Hauser, Loren John; Wall, Judy D.; Brown, Steven D; Land, Miriam L; Bruce, David; Detter, J. Chris; Frank, Larimer; Goodwin, Lynne A.; Han, Cliff; Lapidus, Alla L.; Nolan, Matt; Palumbo, Anthony Vito; Pitluck, Samual; Keller, Kimberly L; Rapp-Giles, Barbara J; Price, Morgan N.; Lin, Monica; Tapia, Roxanne; Copeland, A; Cheng, Jan-Fang

    2011-01-01

    Desulfovibrio alaskensis G20 (formerly desulfuricans G20) is a Gram-negative mesophilic sulfate-reducing bacterium (SRB), known to corrode ferrous metals and to reduce toxic radionuclides and metals such as uranium and chromium to sparingly soluble and less toxic forms. We present the 3.7 Mb genome sequence to provide insights into its physiology.

  14. Complete Genome Sequence and Updated Annotation of Desulfovibrio alaskensis G20

    DOE PAGES

    Hauser, Loren J.; Land, Miriam L.; Brown, Steven D.; Larimer, Frank L; Keller, Kimberly L.; Rapp-Giles, Barbara J.; Price, Morgan N.; Lin, Monica A.; Bruce, David C.; Detter, John C.; et al

    2011-06-17

    Desulfovibrio alaskensis G20 (formerly desulfuricans G20) is a Gram-negative mesophilic sulfate-reducing bacterium (SRB), known to corrode ferrous metals and to reduce toxic radionuclides and metals such as uranium and chromium to sparingly soluble and less toxic forms. We present the 3.7 Mb genome sequence to provide insights into its physiology.

  15. Genome sequence and annotation of Trichoderma parareesei, the ancestor of the cellulase producer Trichoderma reesei

    DOE PAGES

    Yang, Dongqing; Pomraning, Kyle; Kopchinskiy, Alexey; Karimi, Aghcheh Razieh; Atanasova, Lea; Chenthamara, Komal; Baker, Scott E.; Zhang, Ruifu; Shen, Qirong; Freitag, Michael; et al

    2015-08-13

    The filamentous fungus Trichoderma parareesei is the asexually reproducing ancestor of Trichoderma reesei, the holomorphic industrial producer of cellulase and hemicellulase. Here, we present the genome sequence of the T. parareesei type strain CBS 125925, which contains genes for 9,318 proteins.

  16. Functional annotation of rare gene aberration drivers of pancreatic cancer | Office of Cancer Genomics

    Cancer.gov

    As we enter the era of precision medicine, characterization of cancer genomes will directly influence therapeutic decisions in the clinic. Here we describe a platform enabling functionalization of rare gene mutations through their high-throughput construction, molecular barcoding and delivery to cancer models for in vivo tumour driver screens. We apply these technologies to identify oncogenic drivers of pancreatic ductal adenocarcinoma (PDAC).

  17. Mapping and annotating obesity-related genes in pig and human genomes.

    PubMed

    Martelli, Pier Luigi; Fontanesi, Luca; Piovesan, Damiano; Fariselli, Piero; Casadio, Rita

    2014-01-01

    Background. Obesity is a major health problem in both developed and emerging countries. Obesity is a complex disease whose etiology involves genetic factors in strong interplay with environmental determinants and lifestyle. The discovery of genetic factors and biological pathways underlying human obesity is hampered by the difficulty in controlling the genetic background of human cohorts. Animal models are then necessary to further dissect the genetics of obesity. Pig has emerged as one of the most attractive models, because of the similarity with humans in the mechanisms regulating the fat deposition. Results. We collected the genes related to obesity in humans and to fat deposition traits in pig. We localized them on both human and pig genomes, building a map useful to interpret comparative studies on obesity. We characterized the collected genes structurally and functionally with BAR+ and mapped them on KEGG pathways and on STRING protein interaction network. Conclusions. The collected set consists of 361 obesity related genes in human and pig genomes. All genes were mapped on the human genome, and 54 could not be localized on the pig genome (release 2012). Only for 3 human genes there is no counterpart in pig, confirming that this animal is a good model for human obesity studies. Obesity related genes are mostly involved in regulation and signaling processes/pathways and relevant connection emerges between obesity-related genes and diseases such as cancer and infectious diseases.

  18. Next Generation Models for Storage and Representation of Microbial Biological Annotation

    SciTech Connect

    Quest, Daniel J; Land, Miriam L; Brettin, Thomas S; Cottingham, Robert W

    2010-01-01

    Background Traditional genome annotation systems were developed in a very different computing era, one where the World Wide Web was just emerging. Consequently, these systems are built as centralized black boxes focused on generating high quality annotation submissions to GenBank/EMBL supported by expert manual curation. The exponential growth of sequence data drives a growing need for increasingly higher quality and automatically generated annotation. Typical annotation pipelines utilize traditional database technologies, clustered computing resources, Perl, C, and UNIX file systems to process raw sequence data, identify genes, and predict and categorize gene function. These technologies tightly couple the annotation software system to hardware and third party software (e.g. relational database systems and schemas). This makes annotation systems hard to reproduce, inflexible to modification over time, difficult to assess, difficult to partition across multiple geographic sites, and difficult to understand for those who are not domain experts. These systems are not readily open to scrutiny and therefore not scientifically tractable. The advent of Semantic Web standards such as Resource Description Framework (RDF) and OWL Web Ontology Language (OWL) enables us to construct systems that address these challenges in a new comprehensive way. Results Here, we develop a framework for linking traditional data to OWL-based ontologies in genome annotation. We show how data standards can decouple hardware and third party software tools from annotation pipelines, thereby making annotation pipelines easier to reproduce and assess. An illustrative example shows how TURTLE (Terse RDF Triple Language) can be used as a human readable, but also semantically-aware, equivalent to GenBank/EMBL files. Conclusions The power of this approach lies in its ability to assemble annotation data from multiple databases across multiple locations into a representation that is understandable to

  19. Integration of multiethnic fine-mapping and genomic annotation to prioritize candidate functional SNPs at prostate cancer susceptibility regions

    PubMed Central

    Han, Ying; Hazelett, Dennis J.; Wiklund, Fredrik; Schumacher, Fredrick R.; Stram, Daniel O.; Berndt, Sonja I.; Wang, Zhaoming; Rand, Kristin A.; Hoover, Robert N.; Machiela, Mitchell J.; Yeager, Merideth; Burdette, Laurie; Chung, Charles C.; Hutchinson, Amy; Yu, Kai; Xu, Jianfeng; Travis, Ruth C.; Key, Timothy J.; Siddiq, Afshan; Canzian, Federico; Takahashi, Atsushi; Kubo, Michiaki; Stanford, Janet L.; Kolb, Suzanne; Gapstur, Susan M.; Diver, W. Ryan; Stevens, Victoria L.; Strom, Sara S.; Pettaway, Curtis A.; Al Olama, Ali Amin; Kote-Jarai, Zsofia; Eeles, Rosalind A.; Yeboah, Edward D.; Tettey, Yao; Biritwum, Richard B.; Adjei, Andrew A.; Tay, Evelyn; Truelove, Ann; Niwa, Shelley; Chokkalingam, Anand P.; Isaacs, William B.; Chen, Constance; Lindstrom, Sara; Le Marchand, Loic; Giovannucci, Edward L.; Pomerantz, Mark; Long, Henry; Li, Fugen; Ma, Jing; Stampfer, Meir; John, Esther M.; Ingles, Sue A.; Kittles, Rick A.; Murphy, Adam B.; Blot, William J.; Signorello, Lisa B.; Zheng, Wei; Albanes, Demetrius; Virtamo, Jarmo; Weinstein, Stephanie; Nemesure, Barbara; Carpten, John; Leske, M. Cristina; Wu, Suh-Yuh; Hennis, Anselm J. M.; Rybicki, Benjamin A.; Neslund-Dudas, Christine; Hsing, Ann W.; Chu, Lisa; Goodman, Phyllis J.; Klein, Eric A.; Zheng, S. Lilly; Witte, John S.; Casey, Graham; Riboli, Elio; Li, Qiyuan; Freedman, Matthew L.; Hunter, David J.; Gronberg, Henrik; Cook, Michael B.; Nakagawa, Hidewaki; Kraft, Peter; Chanock, Stephen J.; Easton, Douglas F.; Henderson, Brian E.; Coetzee, Gerhard A.; Conti, David V.; Haiman, Christopher A.

    2015-01-01

    Interpretation of biological mechanisms underlying genetic risk associations for prostate cancer is complicated by the relatively large number of risk variants (n = 100) and the thousands of surrogate SNPs in linkage disequilibrium. Here, we combined three distinct approaches: multiethnic fine-mapping, putative functional annotation (based upon epigenetic data and genome-encoded features), and expression quantitative trait loci (eQTL) analyses, in an attempt to reduce this complexity. We examined 67 risk regions using genotyping and imputation-based fine-mapping in populations of European (cases/controls: 8600/6946), African (cases/controls: 5327/5136), Japanese (cases/controls: 2563/4391) and Latino (cases/controls: 1034/1046) ancestry. Markers at 55 regions passed a region-specific significance threshold (P-value cutoff range: 3.9 × 10−4–5.6 × 10−3) and in 30 regions we identified markers that were more significantly associated with risk than the previously reported variants in the multiethnic sample. Novel secondary signals (P < 5.0 × 10−6) were also detected in two regions (rs13062436/3q21 and rs17181170/3p12). Among 666 variants in the 55 regions with P-values within one order of magnitude of the most-associated marker, 193 variants (29%) in 48 regions overlapped with epigenetic or other putative functional marks. In 11 of the 55 regions, cis-eQTLs were detected with nearby genes. For 12 of the 55 regions (22%), the most significant region-specific, prostate-cancer associated variant represented the strongest candidate functional variant based on our annotations; the number of regions increased to 20 (36%) and 27 (49%) when examining the 2 and 3 most significantly associated variants in each region, respectively. These results have prioritized subsets of candidate variants for downstream functional evaluation. PMID:26162851

  20. Integration of multiethnic fine-mapping and genomic annotation to prioritize candidate functional SNPs at prostate cancer susceptibility regions.

    PubMed

    Han, Ying; Hazelett, Dennis J; Wiklund, Fredrik; Schumacher, Fredrick R; Stram, Daniel O; Berndt, Sonja I; Wang, Zhaoming; Rand, Kristin A; Hoover, Robert N; Machiela, Mitchell J; Yeager, Merideth; Burdette, Laurie; Chung, Charles C; Hutchinson, Amy; Yu, Kai; Xu, Jianfeng; Travis, Ruth C; Key, Timothy J; Siddiq, Afshan; Canzian, Federico; Takahashi, Atsushi; Kubo, Michiaki; Stanford, Janet L; Kolb, Suzanne; Gapstur, Susan M; Diver, W Ryan; Stevens, Victoria L; Strom, Sara S; Pettaway, Curtis A; Al Olama, Ali Amin; Kote-Jarai, Zsofia; Eeles, Rosalind A; Yeboah, Edward D; Tettey, Yao; Biritwum, Richard B; Adjei, Andrew A; Tay, Evelyn; Truelove, Ann; Niwa, Shelley; Chokkalingam, Anand P; Isaacs, William B; Chen, Constance; Lindstrom, Sara; Le Marchand, Loic; Giovannucci, Edward L; Pomerantz, Mark; Long, Henry; Li, Fugen; Ma, Jing; Stampfer, Meir; John, Esther M; Ingles, Sue A; Kittles, Rick A; Murphy, Adam B; Blot, William J; Signorello, Lisa B; Zheng, Wei; Albanes, Demetrius; Virtamo, Jarmo; Weinstein, Stephanie; Nemesure, Barbara; Carpten, John; Leske, M Cristina; Wu, Suh-Yuh; Hennis, Anselm J M; Rybicki, Benjamin A; Neslund-Dudas, Christine; Hsing, Ann W; Chu, Lisa; Goodman, Phyllis J; Klein, Eric A; Zheng, S Lilly; Witte, John S; Casey, Graham; Riboli, Elio; Li, Qiyuan; Freedman, Matthew L; Hunter, David J; Gronberg, Henrik; Cook, Michael B; Nakagawa, Hidewaki; Kraft, Peter; Chanock, Stephen J; Easton, Douglas F; Henderson, Brian E; Coetzee, Gerhard A; Conti, David V; Haiman, Christopher A

    2015-10-01

    Interpretation of biological mechanisms underlying genetic risk associations for prostate cancer is complicated by the relatively large number of risk variants (n = 100) and the thousands of surrogate SNPs in linkage disequilibrium. Here, we combined three distinct approaches: multiethnic fine-mapping, putative functional annotation (based upon epigenetic data and genome-encoded features), and expression quantitative trait loci (eQTL) analyses, in an attempt to reduce this complexity. We examined 67 risk regions using genotyping and imputation-based fine-mapping in populations of European (cases/controls: 8600/6946), African (cases/controls: 5327/5136), Japanese (cases/controls: 2563/4391) and Latino (cases/controls: 1034/1046) ancestry. Markers at 55 regions passed a region-specific significance threshold (P-value cutoff range: 3.9 × 10(-4)-5.6 × 10(-3)) and in 30 regions we identified markers that were more significantly associated with risk than the previously reported variants in the multiethnic sample. Novel secondary signals (P < 5.0 × 10(-6)) were also detected in two regions (rs13062436/3q21 and rs17181170/3p12). Among 666 variants in the 55 regions with P-values within one order of magnitude of the most-associated marker, 193 variants (29%) in 48 regions overlapped with epigenetic or other putative functional marks. In 11 of the 55 regions, cis-eQTLs were detected with nearby genes. For 12 of the 55 regions (22%), the most significant region-specific, prostate-cancer associated variant represented the strongest candidate functional variant based on our annotations; the number of regions increased to 20 (36%) and 27 (49%) when examining the 2 and 3 most significantly associated variants in each region, respectively. These results have prioritized subsets of candidate variants for downstream functional evaluation.

  1. Annotation of loci from genome-wide association studies using tissue-specific quantitative interaction proteomics.

    PubMed

    Lundby, Alicia; Rossin, Elizabeth J; Steffensen, Annette B; Acha, Moshe Rav; Newton-Cheh, Christopher; Pfeufer, Arne; Lynch, Stacey N; Olesen, Søren-Peter; Brunak, Søren; Ellinor, Patrick T; Jukema, J Wouter; Trompet, Stella; Ford, Ian; Macfarlane, Peter W; Krijthe, Bouwe P; Hofman, Albert; Uitterlinden, André G; Stricker, Bruno H; Nathoe, Hendrik M; Spiering, Wilko; Daly, Mark J; Asselbergs, Folkert W; van der Harst, Pim; Milan, David J; de Bakker, Paul I W; Lage, Kasper; Olsen, Jesper V

    2014-08-01

    Genome-wide association studies (GWAS) have identified thousands of loci associated with complex traits, but it is challenging to pinpoint causal genes in these loci and to exploit subtle association signals. We used tissue-specific quantitative interaction proteomics to map a network of five genes involved in the Mendelian disorder long QT syndrome (LQTS). We integrated the LQTS network with GWAS loci from the corresponding common complex trait, QT-interval variation, to identify candidate genes that were subsequently confirmed in Xenopus laevis oocytes and zebrafish. We used the LQTS protein network to filter weak GWAS signals by identifying single-nucleotide polymorphisms (SNPs) in proximity to genes in the network supported by strong proteomic evidence. Three SNPs passing this filter reached genome-wide significance after replication genotyping. Overall, we present a general strategy to propose candidates in GWAS loci for functional studies and to systematically filter subtle association signals using tissue-specific quantitative interaction proteomics. PMID:24952909

  2. Annotation of loci from genome-wide association studies using tissue-specific quantitative interaction proteomics

    PubMed Central

    Lundby, Alicia; Rossin, Elizabeth J.; Steffensen, Annette B.; Rav Acha, Moshe; Newton-Cheh, Christopher; Pfeufer, Arne; Lynch, Stacey N.; Olesen, Søren-Peter; Brunak, Søren; Ellinor, Patrick T.; Jukema, J.Wouter; Trompet, Stella; Ford, Ian; Macfarlane, Peter W.; Krijthe, Bouwe P.; Hofman, Albert; Uitterlinden, Andre G.; Stricker, Bruno H.; Nathoe, Hendrik M.; Spiering, Wilko; Daly, Mark J.; Asselbergs, Folkert W.; van der Harst, Pim; Milan, David J.; de Bakker, Paul I.W.; Lage, Kasper; Olsen, Jesper V.

    2014-01-01

    Genome-wide association studies (GWAS) have identified thousands of loci associated wtih complex traits, but it is challenging to pinpoint causal genes in these loci and to exploit subtle association signals. We used tissue-specific quantitative interaction proteomics to map a network of five genes involved in the Mendelian disorder long QT syndrome (LQTS). We integrated the LQTS network with GWAS loci from the corresponding common complex trait, QT interval variation, to identify candidate genes that were subsequently confirmed in Xenopus laevis oocytes and zebrafish. We used the LQTS protein network to filter weak GWAS signals by identifying single nucleotide polymorphisms (SNPs) in proximity to genes in the network supported by strong proteomic evidence. Three SNPs passing this filter reached genome-wide significance after replication genotyping. Overall, we present a general strategy to propose candidates in GWAS loci for functional studies and to systematically filter subtle association signals using tissue-specific quantitative interaction proteomics. PMID:24952909

  3. Partitioning heritability by functional annotation using genome-wide association summary statistics.

    PubMed

    Finucane, Hilary K; Bulik-Sullivan, Brendan; Gusev, Alexander; Trynka, Gosia; Reshef, Yakir; Loh, Po-Ru; Anttila, Verneri; Xu, Han; Zang, Chongzhi; Farh, Kyle; Ripke, Stephan; Day, Felix R; Purcell, Shaun; Stahl, Eli; Lindstrom, Sara; Perry, John R B; Okada, Yukinori; Raychaudhuri, Soumya; Daly, Mark J; Patterson, Nick; Neale, Benjamin M; Price, Alkes L

    2015-11-01

    Recent work has demonstrated that some functional categories of the genome contribute disproportionately to the heritability of complex diseases. Here we analyze a broad set of functional elements, including cell type-specific elements, to estimate their polygenic contributions to heritability in genome-wide association studies (GWAS) of 17 complex diseases and traits with an average sample size of 73,599. To enable this analysis, we introduce a new method, stratified LD score regression, for partitioning heritability from GWAS summary statistics while accounting for linked markers. This new method is computationally tractable at very large sample sizes and leverages genome-wide information. Our findings include a large enrichment of heritability in conserved regions across many traits, a very large immunological disease-specific enrichment of heritability in FANTOM5 enhancers and many cell type-specific enrichments, including significant enrichment of central nervous system cell types in the heritability of body mass index, age at menarche, educational attainment and smoking behavior. PMID:26414678

  4. Comparisons of Shewanella strains based on genome annotations, modeling and experiments

    SciTech Connect

    Ong, Wai Kit; Vu, Trang; Lovendahl, Klaus N.; Llull, Jenna; Serres, Margaret; Romine, Margaret F.; Reed, Jennifer L.

    2014-01-01

    Shewanella is a genus of facultatively anaerobic, Gram-negative bacteria that have highly adaptable metabolism which allows them to thrive in diverse environments. This quality makes them attractive target bacteria for research in bioremediation and microbial fuel cell applications. Constraint-based modeling is a useful tool for helping researchers gain insights into the metabolic capabilities of these bacteria. However, Shewanella oneidensis MR-1 is the only strain with a genome-scale metabolic model constructed out of the 22 sequenced Shewanella strains.

  5. Functional Annotation, Genome Organization and Phylogeny of the Grapevine (Vitis vinifera) Terpene Synthase Gene Family Based on Genome Assembly, FLcDNA Cloning, and Enzyme Assays

    PubMed Central

    2010-01-01

    Background Terpenoids are among the most important constituents of grape flavour and wine bouquet, and serve as useful metabolite markers in viticulture and enology. Based on the initial 8-fold sequencing of a nearly homozygous Pinot noir inbred line, 89 putative terpenoid synthase genes (VvTPS) were predicted by in silico analysis of the grapevine (Vitis vinifera) genome assembly [1]. The finding of this very large VvTPS family, combined with the importance of terpenoid metabolism for the organoleptic properties of grapevine berries and finished wines, prompted a detailed examination of this gene family at the genomic level as well as an investigation into VvTPS biochemical functions. Results We present findings from the analysis of the up-dated 12-fold sequencing and assembly of the grapevine genome that place the number of predicted VvTPS genes at 69 putatively functional VvTPS, 20 partial VvTPS, and 63 VvTPS probable pseudogenes. Gene discovery and annotation included information about gene architecture and chromosomal location. A dense cluster of 45 VvTPS is localized on chromosome 18. Extensive FLcDNA cloning, gene synthesis, and protein expression enabled functional characterization of 39 VvTPS; this is the largest number of functionally characterized TPS for any species reported to date. Of these enzymes, 23 have unique functions and/or phylogenetic locations within the plant TPS gene family. Phylogenetic analyses of the TPS gene family showed that while most VvTPS form species-specific gene clusters, there are several examples of gene orthology with TPS of other plant species, representing perhaps more ancient VvTPS, which have maintained functions independent of speciation. Conclusions The highly expanded VvTPS gene family underpins the prominence of terpenoid metabolism in grapevine. We provide a detailed experimental functional annotation of 39 members of this important gene family in grapevine and comprehensive information about gene structure and

  6. GELBANK : A database of annotated two-dimensional gel electrophoresis patterns of biological systems with completed genomes.

    SciTech Connect

    Babnigg, G.; Giometti, C. S.; Biosciences Division

    2004-01-01

    GELBANK is a publicly available database of two-dimensional gel electrophoresis (2DE) gel patterns of proteomes from organisms with known genome information (available at and ftp://bioinformatics.anl.gov/gelbank/). Currently it includes 131 completed, mostly microbial proteomes available from the National Center for Biotechnology Information. A web interface allows the upload of 2D gel patterns and their annotation for registered users. The images are organized by species, tissue type, separation method, sample type and staining method. The database can be queried based on protein or 2DE-pattern attributes. A web interface allows registered users to assign molecular weight and pH gradient profiles to their own 2D gel patterns as well as to link protein identifications to a given spot on the pattern. The website presents all of the submitted 2D gel patterns where the end-user can dynamically display the images or parts of images along with molecular weight, pH profile information and linked protein identification. A collection of images can be selected for the creation of animations from which the user can select sub-regions of interest and unlimited 2D gel patterns for visualization. The website currently presents 233 identifications for 81 gel patterns for Homo sapiens, Methanococcus jannaschii, Pyro coccus furiosus, Shewanella oneidensis, Escherichia coli and Deinococcus radiodurans.

  7. Integrative bioinformatics for functional genome annotation: trawling for G protein-coupled receptors.

    PubMed

    Flower, Darren R; Attwood, Teresa K

    2004-12-01

    G protein-coupled receptors (GPCR) are amongst the best studied and most functionally diverse types of cell-surface protein. The importance of GPCRs as mediates or cell function and organismal developmental underlies their involvement in key physiological roles and their prominence as targets for pharmacological therapeutics. In this review, we highlight the requirement for integrated protocols which underline the different perspectives offered by different sequence analysis methods. BLAST and FastA offer broad brush strokes. Motif-based search methods add the fine detail. Structural modelling offers another perspective which allows us to elucidate the physicochemical properties that underlie ligand binding. Together, these different views provide a more informative and a more detailed picture of GPCR structure and function. Many GPCRs remain orphan receptors with no identified ligand, yet as computer-driven functional genomics starts to elaborate their functions, a new understanding of their roles in cell and developmental biology will follow. PMID:15561589

  8. Improved systematic tRNA gene annotation allows new insights into the evolution of mitochondrial tRNA structures and into the mechanisms of mitochondrial genome rearrangements

    PubMed Central

    Jühling, Frank; Pütz, Joern; Bernt, Matthias; Donath, Alexander; Middendorf, Martin; Florentz, Catherine; Stadler, Peter F.

    2012-01-01

    Transfer RNAs (tRNAs) are present in all types of cells as well as in organelles. tRNAs of animal mitochondria show a low level of primary sequence conservation and exhibit ‘bizarre’ secondary structures, lacking complete domains of the common cloverleaf. Such sequences are hard to detect and hence frequently missed in computational analyses and mitochondrial genome annotation. Here, we introduce an automatic annotation procedure for mitochondrial tRNA genes in Metazoa based on sequence and structural information in manually curated covariance models. The method, applied to re-annotate 1876 available metazoan mitochondrial RefSeq genomes, allows to distinguish between remaining functional genes and degrading ‘pseudogenes’, even at early stages of divergence. The subsequent analysis of a comprehensive set of mitochondrial tRNA genes gives new insights into the evolution of structures of mitochondrial tRNA sequences as well as into the mechanisms of genome rearrangements. We find frequent losses of tRNA genes concentrated in basal Metazoa, frequent independent losses of individual parts of tRNA genes, particularly in Arthropoda, and wide-spread conserved overlaps of tRNAs in opposite reading direction. Direct evidence for several recent Tandem Duplication-Random Loss events is gained, demonstrating that this mechanism has an impact on the appearance of new mitochondrial gene orders. PMID:22139921

  9. Genetic fine mapping and genomic annotation defines causal mechanisms at type 2 diabetes susceptibility loci.

    PubMed

    Gaulton, Kyle J; Ferreira, Teresa; Lee, Yeji; Raimondo, Anne; Mägi, Reedik; Reschen, Michael E; Mahajan, Anubha; Locke, Adam; Rayner, N William; Robertson, Neil; Scott, Robert A; Prokopenko, Inga; Scott, Laura J; Green, Todd; Sparso, Thomas; Thuillier, Dorothee; Yengo, Loic; Grallert, Harald; Wahl, Simone; Frånberg, Mattias; Strawbridge, Rona J; Kestler, Hans; Chheda, Himanshu; Eisele, Lewin; Gustafsson, Stefan; Steinthorsdottir, Valgerdur; Thorleifsson, Gudmar; Qi, Lu; Karssen, Lennart C; van Leeuwen, Elisabeth M; Willems, Sara M; Li, Man; Chen, Han; Fuchsberger, Christian; Kwan, Phoenix; Ma, Clement; Linderman, Michael; Lu, Yingchang; Thomsen, Soren K; Rundle, Jana K; Beer, Nicola L; van de Bunt, Martijn; Chalisey, Anil; Kang, Hyun Min; Voight, Benjamin F; Abecasis, Gonçalo R; Almgren, Peter; Baldassarre, Damiano; Balkau, Beverley; Benediktsson, Rafn; Blüher, Matthias; Boeing, Heiner; Bonnycastle, Lori L; Bottinger, Erwin P; Burtt, Noël P; Carey, Jason; Charpentier, Guillaume; Chines, Peter S; Cornelis, Marilyn C; Couper, David J; Crenshaw, Andrew T; van Dam, Rob M; Doney, Alex S F; Dorkhan, Mozhgan; Edkins, Sarah; Eriksson, Johan G; Esko, Tonu; Eury, Elodie; Fadista, João; Flannick, Jason; Fontanillas, Pierre; Fox, Caroline; Franks, Paul W; Gertow, Karl; Gieger, Christian; Gigante, Bruna; Gottesman, Omri; Grant, George B; Grarup, Niels; Groves, Christopher J; Hassinen, Maija; Have, Christian T; Herder, Christian; Holmen, Oddgeir L; Hreidarsson, Astradur B; Humphries, Steve E; Hunter, David J; Jackson, Anne U; Jonsson, Anna; Jørgensen, Marit E; Jørgensen, Torben; Kao, Wen-Hong L; Kerrison, Nicola D; Kinnunen, Leena; Klopp, Norman; Kong, Augustine; Kovacs, Peter; Kraft, Peter; Kravic, Jasmina; Langford, Cordelia; Leander, Karin; Liang, Liming; Lichtner, Peter; Lindgren, Cecilia M; Lindholm, Eero; Linneberg, Allan; Liu, Ching-Ti; Lobbens, Stéphane; Luan, Jian'an; Lyssenko, Valeriya; Männistö, Satu; McLeod, Olga; Meyer, Julia; Mihailov, Evelin; Mirza, Ghazala; Mühleisen, Thomas W; Müller-Nurasyid, Martina; Navarro, Carmen; Nöthen, Markus M; Oskolkov, Nikolay N; Owen, Katharine R; Palli, Domenico; Pechlivanis, Sonali; Peltonen, Leena; Perry, John R B; Platou, Carl G P; Roden, Michael; Ruderfer, Douglas; Rybin, Denis; van der Schouw, Yvonne T; Sennblad, Bengt; Sigurðsson, Gunnar; Stančáková, Alena; Steinbach, Gerald; Storm, Petter; Strauch, Konstantin; Stringham, Heather M; Sun, Qi; Thorand, Barbara; Tikkanen, Emmi; Tonjes, Anke; Trakalo, Joseph; Tremoli, Elena; Tuomi, Tiinamaija; Wennauer, Roman; Wiltshire, Steven; Wood, Andrew R; Zeggini, Eleftheria; Dunham, Ian; Birney, Ewan; Pasquali, Lorenzo; Ferrer, Jorge; Loos, Ruth J F; Dupuis, Josée; Florez, Jose C; Boerwinkle, Eric; Pankow, James S; van Duijn, Cornelia; Sijbrands, Eric; Meigs, James B; Hu, Frank B; Thorsteinsdottir, Unnur; Stefansson, Kari; Lakka, Timo A; Rauramaa, Rainer; Stumvoll, Michael; Pedersen, Nancy L; Lind, Lars; Keinanen-Kiukaanniemi, Sirkka M; Korpi-Hyövälti, Eeva; Saaristo, Timo E; Saltevo, Juha; Kuusisto, Johanna; Laakso, Markku; Metspalu, Andres; Erbel, Raimund; Jöcke, Karl-Heinz; Moebus, Susanne; Ripatti, Samuli; Salomaa, Veikko; Ingelsson, Erik; Boehm, Bernhard O; Bergman, Richard N; Collins, Francis S; Mohlke, Karen L; Koistinen, Heikki; Tuomilehto, Jaakko; Hveem, Kristian; Njølstad, Inger; Deloukas, Panagiotis; Donnelly, Peter J; Frayling, Timothy M; Hattersley, Andrew T; de Faire, Ulf; Hamsten, Anders; Illig, Thomas; Peters, Annette; Cauchi, Stephane; Sladek, Rob; Froguel, Philippe; Hansen, Torben; Pedersen, Oluf; Morris, Andrew D; Palmer, Collin N A; Kathiresan, Sekar; Melander, Olle; Nilsson, Peter M; Groop, Leif C; Barroso, Inês; Langenberg, Claudia; Wareham, Nicholas J; O'Callaghan, Christopher A; Gloyn, Anna L; Altshuler, David; Boehnke, Michael; Teslovich, Tanya M; McCarthy, Mark I; Morris, Andrew P

    2015-12-01

    We performed fine mapping of 39 established type 2 diabetes (T2D) loci in 27,206 cases and 57,574 controls of European ancestry. We identified 49 distinct association signals at these loci, including five mapping in or near KCNQ1. 'Credible sets' of the variants most likely to drive each distinct signal mapped predominantly to noncoding sequence, implying that association with T2D is mediated through gene regulation. Credible set variants were enriched for overlap with FOXA2 chromatin immunoprecipitation binding sites in human islet and liver cells, including at MTNR1B, where fine mapping implicated rs10830963 as driving T2D association. We confirmed that the T2D risk allele for this SNP increases FOXA2-bound enhancer activity in islet- and liver-derived cells. We observed allele-specific differences in NEUROD1 binding in islet-derived cells, consistent with evidence that the T2D risk allele increases islet MTNR1B expression. Our study demonstrates how integration of genetic and genomic information can define molecular mechanisms through which variants underlying association signals exert their effects on disease. PMID:26551672

  10. Genetic fine-mapping and genomic annotation defines causal mechanisms at type 2 diabetes susceptibility loci

    PubMed Central

    Mahajan, Anubha; Locke, Adam; Rayner, N William; Robertson, Neil; Scott, Robert A; Prokopenko, Inga; Scott, Laura J; Green, Todd; Sparso, Thomas; Thuillier, Dorothee; Yengo, Loic; Grallert, Harald; Wahl, Simone; Frånberg, Mattias; Strawbridge, Rona J; Kestler, Hans; Chheda, Himanshu; Eisele, Lewin; Gustafsson, Stefan; Steinthorsdottir, Valgerdur; Thorleifsson, Gudmar; Qi, Lu; Karssen, Lennart C; van Leeuwen, Elisabeth M; Willems, Sara M; Li, Man; Chen, Han; Fuchsberger, Christian; Kwan, Phoenix; Ma, Clement; Linderman, Michael; Lu, Yingchang; Thomsen, Soren K; Rundle, Jana K; Beer, Nicola L; van de Bunt, Martijn; Chalisey, Anil; Kang, Hyun Min; Voight, Benjamin F; Abecasis, Goncalo R; Almgren, Peter; Baldassarre, Damiano; Balkau, Beverley; Benediktsson, Rafn; Blüher, Matthias; Boeing, Heiner; Bonnycastle, Lori L; Borringer, Erwin P; Burtt, Noël P; Carey, Jason; Charpentier, Guillaume; Chines, Peter S; Cornelis, Marilyn C; Couper, David J; Crenshaw, Andrew T; van Dam, Rob M; Doney, Alex SF; Dorkhan, Mozhgan; Edkins, Sarah; Eriksson, Johan G; Esko, Tonu; Eury, Elodie; Fadista, João; Flannick, Jason; Fontanillas, Pierre; Fox, Caroline; Franks, Paul W; Gertow, Karl; Gieger, Christian; Gigante, Bruna; Gottesman, Omri; Grant, George B; Grarup, Niels; Groves, Christopher J; Hassinen, Maija; Have, Christian T; Herder, Christian; Holmen, Oddgeir L; Hreidarsson, Astradur B; Humphries, Steve E; Hunter, David J; Jackson, Anne U; Jonsson, Anna; Jørgensen, Marit E; Jørgensen, Torben; Kerrison, Nicola D; Kinnunen, Leena; Klopp, Norman; Kong, Augustine; Kovacs, Peter; Kraft, Peter; Kravic, Jasmina; Langford, Cordelia; Leander, Karin; Liang, Liming; Lichtner, Peter; Lindgren, Cecilia M; Lindholm, Eero; Linneberg, Allan; Liu, Ching-Ti; Lobbens, Stéphane; Luan, Jian’an; Lyssenko, Valeriya; Männistö, Satu; McLeod, Olga; Meyer, Julia; Mihailov, Evelin; Mirza, Ghazala; Mühleisen, Thomas W; Müller-Nurasyid, Martina; Navarro, Carmen; Nöthen, Markus M; Oskolkov, Nikolay N; Owen, Katharine R; Palli, Domenico; Pechlivanis, Sonali; Perry, John RB; Platou, Carl GP; Roden, Michael; Ruderfer, Douglas; Rybin, Denis; van der Schouw, Yvonne T; Sennblad, Bengt; Sigurðsson, Gunnar; Stančáková, Alena; Steinbach, Gerald; Storm, Petter; Strauch, Konstantin; Stringham, Heather M; Sun, Qi; Thorand, Barbara; Tikkanen, Emmi; Tonjes, Anke; Trakalo, Joseph; Tremoli, Elena; Tuomi, Tiinamaija; Wennauer, Roman; Wood, Andrew R; Zeggini, Eleftheria; Dunham, Ian; Birney, Ewan; Pasquali, Lorenzo; Ferrer, Jorge; Loos, Ruth JF; Dupuis, Josée; Florez, Jose C; Boerwinkle, Eric; Pankow, James S; van Duijn, Cornelia; Sijbrands, Eric; Meigs, James B; Hu, Frank B; Thorsteinsdottir, Unnur; Stefansson, Kari; Lakka, Timo A; Rauramaa, Rainer; Stumvoll, Michael; Pedersen, Nancy L; Lind, Lars; Keinanen-Kiukaanniemi, Sirkka M; Korpi-Hyövälti, Eeva; Saaristo, Timo E; Saltevo, Juha; Kuusisto, Johanna; Laakso, Markku; Metspalu, Andres; Erbel, Raimund; Jöckel, Karl-Heinz; Moebus, Susanne; Ripatti, Samuli; Salomaa, Veikko; Ingelsson, Erik; Boehm, Bernhard O; Bergman, Richard N; Collins, Francis S; Mohlke, Karen L; Koistinen, Heikki; Tuomilehto, Jaakko; Hveem, Kristian; Njølstad, Inger; Deloukas, Panagiotis; Donnelly, Peter J; Frayling, Timothy M; Hattersley, Andrew T; de Faire, Ulf; Hamsten, Anders; Illig, Thomas; Peters, Annette; Cauchi, Stephane; Sladek, Rob; Froguel, Philippe; Hansen, Torben; Pedersen, Oluf; Morris, Andrew D; Palmer, Collin NA; Kathiresan, Sekar; Melander, Olle; Nilsson, Peter M; Groop, Leif C; Barroso, Inês; Langenberg, Claudia; Wareham, Nicholas J; O’Callaghan, Christopher A; Gloyn, Anna L; Altshuler, David; Boehnke, Michael; Teslovich, Tanya M; McCarthy, Mark I; Morris, Andrew P

    2015-01-01

    We performed fine-mapping of 39 established type 2 diabetes (T2D) loci in 27,206 cases and 57,574 controls of European ancestry. We identified 49 distinct association signals at these loci, including five mapping in/near KCNQ1. “Credible sets” of variants most likely to drive each distinct signal mapped predominantly to non-coding sequence, implying that T2D association is mediated through gene regulation. Credible set variants were enriched for overlap with FOXA2 chromatin immunoprecipitation binding sites in human islet and liver cells, including at MTNR1B, where fine-mapping implicated rs10830963 as driving T2D association. We confirmed that this T2D-risk allele increases FOXA2-bound enhancer activity in islet- and liver-derived cells. We observed allele-specific differences in NEUROD1 binding in islet-derived cells, consistent with evidence that the T2D-risk allele increases islet MTNR1B expression. Our study demonstrates how integration of genetic and genomic information can define molecular mechanisms through which variants underlying association signals exert their effects on disease. PMID:26551672

  11. Functional annotation of HOT regions in the human genome: implications for human disease and cancer.

    PubMed

    Li, Hao; Chen, Hebing; Liu, Feng; Ren, Chao; Wang, Shengqi; Bo, Xiaochen; Shu, Wenjie

    2015-06-26

    Advances in genome-wide association studies (GWAS) and large-scale sequencing studies have resulted in an impressive and growing list of disease- and trait-associated genetic variants. Most studies have emphasised the discovery of genetic variation in coding sequences, however, the noncoding regulatory effects responsible for human disease and cancer biology have been substantially understudied. To better characterise the cis-regulatory effects of noncoding variation, we performed a comprehensive analysis of the genetic variants in HOT (high-occupancy target) regions, which are considered to be one of the most intriguing findings of recent large-scale sequencing studies. We observed that GWAS variants that map to HOT regions undergo a substantial net decrease and illustrate development-specific localisation during haematopoiesis. Additionally, genetic risk variants are disproportionally enriched in HOT regions compared with LOT (low-occupancy target) regions in both disease-relevant and cancer cells. Importantly, this enrichment is biased toward disease- or cancer-specific cell types. Furthermore, we observed that cancer cells generally acquire cancer-specific HOT regions at oncogenes through diverse mechanisms of cancer pathogenesis. Collectively, our findings demonstrate the key roles of HOT regions in human disease and cancer and represent a critical step toward further understanding disease biology, diagnosis, and therapy.

  12. CrAgDb--a database of annotated chaperone repertoire in archaeal genomes.

    PubMed

    Rani, Shikha; Srivastava, Abhishikha; Kumar, Manish; Goel, Manisha

    2016-03-01

    Chaperones are a diverse class of ubiquitous proteins that assist other cellular proteins in folding correctly and maintaining their native structure. Many different chaperones cooperate to constitute the 'proteostasis' machinery in the cells. It has been proposed earlier that archaeal organisms could be ideal model systems for deciphering the basic functioning of the 'protein folding machinery' in higher eukaryotes. Several chaperone families have been characterized in archaea over the years but mostly one protein at a time, making it difficult to decipher the composition and mechanistics of the protein folding system as a whole. In order to deal with these lacunae, we have developed a database of all archaeal chaperone proteins, CrAgDb (Chaperone repertoire in Archaeal genomes). The data have been presented in a systematic way with intuitive browse and search facilities for easy retrieval of information. Access to these curated datasets should expedite large-scale analysis of archaeal chaperone networks and significantly advance our understanding of operation and regulation of the protein folding machinery in archaea. Researchers could then translate this knowledge to comprehend the more complex protein folding pathways in eukaryotic systems. The database is freely available at http://14.139.227.92/mkumar/cragdb/. PMID:26862144

  13. Genetic fine mapping and genomic annotation defines causal mechanisms at type 2 diabetes susceptibility loci.

    PubMed

    Gaulton, Kyle J; Ferreira, Teresa; Lee, Yeji; Raimondo, Anne; Mägi, Reedik; Reschen, Michael E; Mahajan, Anubha; Locke, Adam; Rayner, N William; Robertson, Neil; Scott, Robert A; Prokopenko, Inga; Scott, Laura J; Green, Todd; Sparso, Thomas; Thuillier, Dorothee; Yengo, Loic; Grallert, Harald; Wahl, Simone; Frånberg, Mattias; Strawbridge, Rona J; Kestler, Hans; Chheda, Himanshu; Eisele, Lewin; Gustafsson, Stefan; Steinthorsdottir, Valgerdur; Thorleifsson, Gudmar; Qi, Lu; Karssen, Lennart C; van Leeuwen, Elisabeth M; Willems, Sara M; Li, Man; Chen, Han; Fuchsberger, Christian; Kwan, Phoenix; Ma, Clement; Linderman, Michael; Lu, Yingchang; Thomsen, Soren K; Rundle, Jana K; Beer, Nicola L; van de Bunt, Martijn; Chalisey, Anil; Kang, Hyun Min; Voight, Benjamin F; Abecasis, Gonçalo R; Almgren, Peter; Baldassarre, Damiano; Balkau, Beverley; Benediktsson, Rafn; Blüher, Matthias; Boeing, Heiner; Bonnycastle, Lori L; Bottinger, Erwin P; Burtt, Noël P; Carey, Jason; Charpentier, Guillaume; Chines, Peter S; Cornelis, Marilyn C; Couper, David J; Crenshaw, Andrew T; van Dam, Rob M; Doney, Alex S F; Dorkhan, Mozhgan; Edkins, Sarah; Eriksson, Johan G; Esko, Tonu; Eury, Elodie; Fadista, João; Flannick, Jason; Fontanillas, Pierre; Fox, Caroline; Franks, Paul W; Gertow, Karl; Gieger, Christian; Gigante, Bruna; Gottesman, Omri; Grant, George B; Grarup, Niels; Groves, Christopher J; Hassinen, Maija; Have, Christian T; Herder, Christian; Holmen, Oddgeir L; Hreidarsson, Astradur B; Humphries, Steve E; Hunter, David J; Jackson, Anne U; Jonsson, Anna; Jørgensen, Marit E; Jørgensen, Torben; Kao, Wen-Hong L; Kerrison, Nicola D; Kinnunen, Leena; Klopp, Norman; Kong, Augustine; Kovacs, Peter; Kraft, Peter; Kravic, Jasmina; Langford, Cordelia; Leander, Karin; Liang, Liming; Lichtner, Peter; Lindgren, Cecilia M; Lindholm, Eero; Linneberg, Allan; Liu, Ching-Ti; Lobbens, Stéphane; Luan, Jian'an; Lyssenko, Valeriya; Männistö, Satu; McLeod, Olga; Meyer, Julia; Mihailov, Evelin; Mirza, Ghazala; Mühleisen, Thomas W; Müller-Nurasyid, Martina; Navarro, Carmen; Nöthen, Markus M; Oskolkov, Nikolay N; Owen, Katharine R; Palli, Domenico; Pechlivanis, Sonali; Peltonen, Leena; Perry, John R B; Platou, Carl G P; Roden, Michael; Ruderfer, Douglas; Rybin, Denis; van der Schouw, Yvonne T; Sennblad, Bengt; Sigurðsson, Gunnar; Stančáková, Alena; Steinbach, Gerald; Storm, Petter; Strauch, Konstantin; Stringham, Heather M; Sun, Qi; Thorand, Barbara; Tikkanen, Emmi; Tonjes, Anke; Trakalo, Joseph; Tremoli, Elena; Tuomi, Tiinamaija; Wennauer, Roman; Wiltshire, Steven; Wood, Andrew R; Zeggini, Eleftheria; Dunham, Ian; Birney, Ewan; Pasquali, Lorenzo; Ferrer, Jorge; Loos, Ruth J F; Dupuis, Josée; Florez, Jose C; Boerwinkle, Eric; Pankow, James S; van Duijn, Cornelia; Sijbrands, Eric; Meigs, James B; Hu, Frank B; Thorsteinsdottir, Unnur; Stefansson, Kari; Lakka, Timo A; Rauramaa, Rainer; Stumvoll, Michael; Pedersen, Nancy L; Lind, Lars; Keinanen-Kiukaanniemi, Sirkka M; Korpi-Hyövälti, Eeva; Saaristo, Timo E; Saltevo, Juha; Kuusisto, Johanna; Laakso, Markku; Metspalu, Andres; Erbel, Raimund; Jöcke, Karl-Heinz; Moebus, Susanne; Ripatti, Samuli; Salomaa, Veikko; Ingelsson, Erik; Boehm, Bernhard O; Bergman, Richard N; Collins, Francis S; Mohlke, Karen L; Koistinen, Heikki; Tuomilehto, Jaakko; Hveem, Kristian; Njølstad, Inger; Deloukas, Panagiotis; Donnelly, Peter J; Frayling, Timothy M; Hattersley, Andrew T; de Faire, Ulf; Hamsten, Anders; Illig, Thomas; Peters, Annette; Cauchi, Stephane; Sladek, Rob; Froguel, Philippe; Hansen, Torben; Pedersen, Oluf; Morris, Andrew D; Palmer, Collin N A; Kathiresan, Sekar; Melander, Olle; Nilsson, Peter M; Groop, Leif C; Barroso, Inês; Langenberg, Claudia; Wareham, Nicholas J; O'Callaghan, Christopher A; Gloyn, Anna L; Altshuler, David; Boehnke, Michael; Teslovich, Tanya M; McCarthy, Mark I; Morris, Andrew P

    2015-12-01

    We performed fine mapping of 39 established type 2 diabetes (T2D) loci in 27,206 cases and 57,574 controls of European ancestry. We identified 49 distinct association signals at these loci, including five mapping in or near KCNQ1. 'Credible sets' of the variants most likely to drive each distinct signal mapped predominantly to noncoding sequence, implying that association with T2D is mediated through gene regulation. Credible set variants were enriched for overlap with FOXA2 chromatin immunoprecipitation binding sites in human islet and liver cells, including at MTNR1B, where fine mapping implicated rs10830963 as driving T2D association. We confirmed that the T2D risk allele for this SNP increases FOXA2-bound enhancer activity in islet- and liver-derived cells. We observed allele-specific differences in NEUROD1 binding in islet-derived cells, consistent with evidence that the T2D risk allele increases islet MTNR1B expression. Our study demonstrates how integration of genetic and genomic information can define molecular mechanisms through which variants underlying association signals exert their effects on disease.

  14. Sma3s: a three-step modular annotator for large sequence datasets.

    PubMed

    Muñoz-Mérida, Antonio; Viguera, Enrique; Claros, M Gonzalo; Trelles, Oswaldo; Pérez-Pulido, Antonio J

    2014-08-01

    Automatic sequence annotation is an essential component of modern 'omics' studies, which aim to extract information from large collections of sequence data. Most existing tools use sequence homology to establish evolutionary relationships and assign putative functions to sequences. However, it can be difficult to define a similarity threshold that achieves sufficient coverage without sacrificing annotation quality. Defining the correct configuration is critical and can be challenging for non-specialist users. Thus, the development of robust automatic annotation techniques that generate high-quality annotations without needing expert knowledge would be very valuable for the research community. We present Sma3s, a tool for automatically annotating very large collections of biological sequences from any kind of gene library or genome. Sma3s is composed of three modules that progressively annotate query sequences using either: (i) very similar homologues, (ii) orthologous sequences or (iii) terms enriched in groups of homologous sequences. We trained the system using several random sets of known sequences, demonstrating average sensitivity and specificity values of ~85%. In conclusion, Sma3s is a versatile tool for high-throughput annotation of a wide variety of sequence datasets that outperforms the accuracy of other well-established annotation algorithms, and it can enrich existing database annotations and uncover previously hidden features. Importantly, Sma3s has already been used in the functional annotation of two published transcriptomes. PMID:24501397

  15. Algal functional annotation tool

    SciTech Connect

    2012-07-12

    Abstract BACKGROUND: Progress in genome sequencing is proceeding at an exponential pace, and several new algal genomes are becoming available every year. One of the challenges facing the community is the association of protein sequences encoded in the genomes with biological function. While most genome assembly projects generate annotations for predicted protein sequences, they are usually limited and integrate functional terms from a limited number of databases. Another challenge is the use of annotations to interpret large lists of 'interesting' genes generated by genome-scale datasets. Previously, these gene lists had to be analyzed across several independent biological databases, often on a gene-by-gene basis. In contrast, several annotation databases, such as DAVID, integrate data from multiple functional databases and reveal underlying biological themes of large gene lists. While several such databases have been constructed for animals, none is currently available for the study of algae. Due to renewed interest in algae as potential sources of biofuels and the emergence of multiple algal genome sequences, a significant need has arisen for such a database to process the growing compendiums of algal genomic data. DESCRIPTION: The Algal Functional Annotation Tool is a web-based comprehensive analysis suite integrating annotation data from several pathway, ontology, and protein family databases. The current version provides annotation for the model alga Chlamydomonas reinhardtii, and in the future will include additional genomes. The site allows users to interpret large gene lists by identifying associated functional terms, and their enrichment. Additionally, expression data for several experimental conditions were compiled and analyzed to provide an expression-based enrichment search. A tool to search for functionally-related genes based on gene expression across these conditions is also provided. Other features include dynamic visualization of genes on KEGG

  16. Algal functional annotation tool

    2012-07-12

    Abstract BACKGROUND: Progress in genome sequencing is proceeding at an exponential pace, and several new algal genomes are becoming available every year. One of the challenges facing the community is the association of protein sequences encoded in the genomes with biological function. While most genome assembly projects generate annotations for predicted protein sequences, they are usually limited and integrate functional terms from a limited number of databases. Another challenge is the use of annotations tomore » interpret large lists of 'interesting' genes generated by genome-scale datasets. Previously, these gene lists had to be analyzed across several independent biological databases, often on a gene-by-gene basis. In contrast, several annotation databases, such as DAVID, integrate data from multiple functional databases and reveal underlying biological themes of large gene lists. While several such databases have been constructed for animals, none is currently available for the study of algae. Due to renewed interest in algae as potential sources of biofuels and the emergence of multiple algal genome sequences, a significant need has arisen for such a database to process the growing compendiums of algal genomic data. DESCRIPTION: The Algal Functional Annotation Tool is a web-based comprehensive analysis suite integrating annotation data from several pathway, ontology, and protein family databases. The current version provides annotation for the model alga Chlamydomonas reinhardtii, and in the future will include additional genomes. The site allows users to interpret large gene lists by identifying associated functional terms, and their enrichment. Additionally, expression data for several experimental conditions were compiled and analyzed to provide an expression-based enrichment search. A tool to search for functionally-related genes based on gene expression across these conditions is also provided. Other features include dynamic visualization of genes on

  17. Re-Annotator: Annotation Pipeline for Microarray Probe Sequences.

    PubMed

    Arloth, Janine; Bader, Daniel M; Röh, Simone; Altmann, Andre

    2015-01-01

    Microarray technologies are established approaches for high throughput gene expression, methylation and genotyping analysis. An accurate mapping of the array probes is essential to generate reliable biological findings. However, manufacturers of the microarray platforms typically provide incomplete and outdated annotation tables, which often rely on older genome and transcriptome versions that differ substantially from up-to-date sequence databases. Here, we present the Re-Annotator, a re-annotation pipeline for microarray probe sequences. It is primarily designed for gene expression microarrays but can also be adapted to other types of microarrays. The Re-Annotator uses a custom-built mRNA reference database to identify the positions of gene expression array probe sequences. We applied Re-Annotator to the Illumina Human-HT12 v4 microarray platform and found that about one quarter (25%) of the probes differed from the manufacturer's annotation. In further computational experiments on experimental gene expression data, we compared Re-Annotator to another probe re-annotation tool, ReMOAT, and found that Re-Annotator provided an improved re-annotation of microarray probes. A thorough re-annotation of probe information is crucial to any microarray analysis. The Re-Annotator pipeline is freely available at http://sourceforge.net/projects/reannotator along with re-annotated files for Illumina microarrays HumanHT-12 v3/v4 and MouseRef-8 v2.

  18. An integrated pipeline for next generation sequencing and annotation of the complete mitochondrial genome of the giant intestinal fluke, Fasciolopsis buski (Lankester, 1857) Looss, 1899.

    PubMed

    Biswal, Devendra Kumar; Ghatani, Sudeep; Shylla, Jollin A; Sahu, Ranjana; Mullapudi, Nandita; Bhattacharya, Alok; Tandon, Veena

    2013-01-01

    Helminths include both parasitic nematodes (roundworms) and platyhelminths (trematode and cestode flatworms) that are abundant, and are of clinical importance. The genetic characterization of parasitic flatworms using advanced molecular tools is central to the diagnosis and control of infections. Although the nuclear genome houses suitable genetic markers (e.g., in ribosomal (r) DNA) for species identification and molecular characterization, the mitochondrial (mt) genome consistently provides a rich source of novel markers for informative systematics and epidemiological studies. In the last decade, there have been some important advances in mtDNA genomics of helminths, especially lung flukes, liver flukes and intestinal flukes. Fasciolopsis buski, often called the giant intestinal fluke, is one of the largest digenean trematodes infecting humans and found primarily in Asia, in particular the Indian subcontinent. Next-generation sequencing (NGS) technologies now provide opportunities for high throughput sequencing, assembly and annotation within a short span of time. Herein, we describe a high-throughput sequencing and bioinformatics pipeline for mt genomics for F. buski that emphasizes the utility of short read NGS platforms such as Ion Torrent and Illumina in successfully sequencing and assembling the mt genome using innovative approaches for PCR primer design as well as assembly. We took advantage of our NGS whole genome sequence data (unpublished so far) for F. buski and its comparison with available data for the Fasciola hepatica mtDNA as the reference genome for design of precise and specific primers for amplification of mt genome sequences from F. buski. A long-range PCR was carried out to create an NGS library enriched in mt DNA sequences. Two different NGS platforms were employed for complete sequencing, assembly and annotation of the F. buski mt genome. The complete mt genome sequences of the intestinal fluke comprise 14,118 bp and is thus the shortest

  19. The Ensembl gene annotation system.

    PubMed

    Aken, Bronwen L; Ayling, Sarah; Barrell, Daniel; Clarke, Laura; Curwen, Valery; Fairley, Susan; Fernandez Banet, Julio; Billis, Konstantinos; García Girón, Carlos; Hourlier, Thibaut; Howe, Kevin; Kähäri, Andreas; Kokocinski, Felix; Martin, Fergal J; Murphy, Daniel N; Nag, Rishi; Ruffier, Magali; Schuster, Michael; Tang, Y Amy; Vogel, Jan-Hinnerk; White, Simon; Zadissa, Amonida; Flicek, Paul; Searle, Stephen M J

    2016-01-01

    The Ensembl gene annotation system has been used to annotate over 70 different vertebrate species across a wide range of genome projects. Furthermore, it generates the automatic alignment-based annotation for the human and mouse GENCODE gene sets. The system is based on the alignment of biological sequences, including cDNAs, proteins and RNA-seq reads, to the target genome in order to construct candidate transcript models. Careful assessment and filtering of these candidate transcripts ultimately leads to the final gene set, which is made available on the Ensembl website. Here, we describe the annotation process in detail.Database URL: http://www.ensembl.org/index.html. PMID:27337980

  20. The Ensembl gene annotation system

    PubMed Central

    Aken, Bronwen L.; Ayling, Sarah; Barrell, Daniel; Clarke, Laura; Curwen, Valery; Fairley, Susan; Fernandez Banet, Julio; Billis, Konstantinos; García Girón, Carlos; Hourlier, Thibaut; Howe, Kevin; Kähäri, Andreas; Kokocinski, Felix; Martin, Fergal J.; Murphy, Daniel N.; Nag, Rishi; Ruffier, Magali; Schuster, Michael; Tang, Y. Amy; Vogel, Jan-Hinnerk; White, Simon; Zadissa, Amonida; Flicek, Paul

    2016-01-01

    The Ensembl gene annotation system has been used to annotate over 70 different vertebrate species across a wide range of genome projects. Furthermore, it generates the automatic alignment-based annotation for the human and mouse GENCODE gene sets. The system is based on the alignment of biological sequences, including cDNAs, proteins and RNA-seq reads, to the target genome in order to construct candidate transcript models. Careful assessment and filtering of these candidate transcripts ultimately leads to the final gene set, which is made available on the Ensembl website. Here, we describe the annotation process in detail. Database URL: http://www.ensembl.org/index.html PMID:27337980

  1. The Vigna Genome Server, 'VigGS': A Genomic Knowledge Base of the Genus Vigna Based on High-Quality, Annotated Genome Sequence of the Azuki Bean, Vigna angularis (Willd.) Ohwi & Ohashi.

    PubMed

    Sakai, Hiroaki; Naito, Ken; Takahashi, Yu; Sato, Toshiyuki; Yamamoto, Toshiya; Muto, Isamu; Itoh, Takeshi; Tomooka, Norihiko

    2016-01-01

    The genus Vigna includes legume crops such as cowpea, mungbean and azuki bean, as well as >100 wild species. A number of the wild species are highly tolerant to severe environmental conditions including high-salinity, acid or alkaline soil; drought; flooding; and pests and diseases. These features of the genus Vigna make it a good target for investigation of genetic diversity in adaptation to stressful environments; however, a lack of genomic information has hindered such research in this genus. Here, we present a genome database of the genus Vigna, Vigna Genome Server ('VigGS', http://viggs.dna.affrc.go.jp), based on the recently sequenced azuki bean genome, which incorporates annotated exon-intron structures, along with evidence for transcripts and proteins, visualized in GBrowse. VigGS also facilitates user construction of multiple alignments between azuki bean genes and those of six related dicot species. In addition, the database displays sequence polymorphisms between azuki bean and its wild relatives and enables users to design primer sequences targeting any variant site. VigGS offers a simple keyword search in addition to sequence similarity searches using BLAST and BLAT. To incorporate up to date genomic information, VigGS automatically receives newly deposited mRNA sequences of pre-set species from the public database once a week. Users can refer to not only gene structures mapped on the azuki bean genome on GBrowse but also relevant literature of the genes. VigGS will contribute to genomic research into plant biotic and abiotic stresses and to the future development of new stress-tolerant crops. PMID:26644460

  2. Evidence-Based Annotation of Gene Function in Shewanella oneidensis MR-1 Using Genome-Wide Fitness Profiling across 121 Conditions

    PubMed Central

    Deutschbauer, Adam; Price, Morgan N.; Wetmore, Kelly M.; Shao, Wenjun; Baumohl, Jason K.; Xu, Zhuchen; Nguyen, Michelle; Tamse, Raquel; Davis, Ronald W.; Arkin, Adam P.

    2011-01-01

    Most genes in bacteria are experimentally uncharacterized and cannot be annotated with a specific function. Given the great diversity of bacteria and the ease of genome sequencing, high-throughput approaches to identify gene function experimentally are needed. Here, we use pools of tagged transposon mutants in the metal-reducing bacterium Shewanella oneidensis MR-1 to probe the mutant fitness of 3,355 genes in 121 diverse conditions including different growth substrates, alternative electron acceptors, stresses, and motility. We find that 2,350 genes have a pattern of fitness that is significantly different from random and 1,230 of these genes (37% of our total assayed genes) have enough signal to show strong biological correlations. We find that genes in all functional categories have phenotypes, including hundreds of hypotheticals, and that potentially redundant genes (over 50% amino acid identity to another gene in the genome) are also likely to have distinct phenotypes. Using fitness patterns, we were able to propose specific molecular functions for 40 genes or operons that lacked specific annotations or had incomplete annotations. In one example, we demonstrate that the previously hypothetical gene SO_3749 encodes a functional acetylornithine deacetylase, thus filling a missing step in S. oneidensis metabolism. Additionally, we demonstrate that the orphan histidine kinase SO_2742 and orphan response regulator SO_2648 form a signal transduction pathway that activates expression of acetyl-CoA synthase and is required for S. oneidensis to grow on acetate as a carbon source. Lastly, we demonstrate that gene expression and mutant fitness are poorly correlated and that mutant fitness generates more confident predictions of gene function than does gene expression. The approach described here can be applied generally to create large-scale gene-phenotype maps for evidence-based annotation of gene function in prokaryotes. PMID:22125499

  3. Genomic research, publics and experts in Latin America: Nation, race and body.

    PubMed

    Wade, Peter; López-Beltrán, Carlos; Restrepo, Eduardo; Santos, Ricardo Ventura

    2015-12-01

    The articles in this issue highlight contributions that studies of Latin America can make to wider debates about the effects of genomic science on public ideas about race and nation. We argue that current ideas about the power of genomics to transfigure and transform existing ways of thinking about human diversity are often overstated. If a range of social contexts are examined, the effects are uneven. Our data show that genomic knowledge can unsettle and reinforce ideas of nation and race; it can be both banal and highly politicized. In this introduction, we outline concepts of genetic knowledge in society; theories of genetics, nation and race; approaches to public understandings of science; and the Latin American contexts of transnational ideas of nation and race. PMID:27479996

  4. Genomic research, publics and experts in Latin America: Nation, race and body

    PubMed Central

    Wade, Peter; López-Beltrán, Carlos; Restrepo, Eduardo; Santos, Ricardo Ventura

    2015-01-01

    The articles in this issue highlight contributions that studies of Latin America can make to wider debates about the effects of genomic science on public ideas about race and nation. We argue that current ideas about the power of genomics to transfigure and transform existing ways of thinking about human diversity are often overstated. If a range of social contexts are examined, the effects are uneven. Our data show that genomic knowledge can unsettle and reinforce ideas of nation and race; it can be both banal and highly politicized. In this introduction, we outline concepts of genetic knowledge in society; theories of genetics, nation and race; approaches to public understandings of science; and the Latin American contexts of transnational ideas of nation and race. PMID:27479996

  5. Genome-wide genetic interaction analysis of glaucoma using expert knowledge derived from human phenotype networks.

    PubMed

    Hu, Ting; Darabos, Christian; Cricco, Maria E; Kong, Emily; Moore, Jason H

    2015-01-01

    The large volume of GWAS data poses great computational challenges for analyzing genetic interactions associated with common human diseases. We propose a computational framework for characterizing epistatic interactions among large sets of genetic attributes in GWAS data. We build the human phenotype network (HPN) and focus around a disease of interest. In this study, we use the GLAUGEN glaucoma GWAS dataset and apply the HPN as a biological knowledge-based filter to prioritize genetic variants. Then, we use the statistical epistasis network (SEN) to identify a significant connected network of pairwise epistatic interactions among the prioritized SNPs. These clearly highlight the complex genetic basis of glaucoma. Furthermore, we identify key SNPs by quantifying structural network characteristics. Through functional annotation of these key SNPs using Biofilter, a software accessing multiple publicly available human genetic data sources, we find supporting biomedical evidences linking glaucoma to an array of genetic diseases, proving our concept. We conclude by suggesting hypotheses for a better understanding of the disease.

  6. Functional Annotation of the Ophiostoma novo-ulmi Genome: Insights into the Phytopathogenicity of the Fungal Agent of Dutch Elm Disease

    PubMed Central

    Comeau, André M.; Dufour, Josée; Bouvet, Guillaume F.; Jacobi, Volker; Nigg, Martha; Henrissat, Bernard; Laroche, Jérôme; Levesque, Roger C.; Bernier, Louis

    2015-01-01

    The ascomycete fungus Ophiostoma novo-ulmi is responsible for the pandemic of Dutch elm disease that has been ravaging Europe and North America for 50 years. We proceeded to annotate the genome of the O. novo-ulmi strain H327 that was sequenced in 2012. The 31.784-Mb nuclear genome (50.1% GC) is organized into 8 chromosomes containing a total of 8,640 protein-coding genes that we validated with RNA sequencing analysis. Approximately 53% of these genes have their closest match to Grosmannia clavigera kw1407, followed by 36% in other close Sordariomycetes, 5% in other Pezizomycotina, and surprisingly few (5%) orphans. A relatively small portion (∼3.4%) of the genome is occupied by repeat sequences; however, the mechanism of repeat-induced point mutation appears active in this genome. Approximately 76% of the proteins could be assigned functions using Gene Ontology analysis; we identified 311 carbohydrate-active enzymes, 48 cytochrome P450s, and 1,731 proteins potentially involved in pathogen–host interaction, along with 7 clusters of fungal secondary metabolites. Complementary mating-type locus sequencing, mating tests, and culturing in the presence of elm terpenes were conducted. Our analysis identified a specific genetic arsenal impacting the sexual and vegetative growth, phytopathogenicity, and signaling/plant–defense–degradation relationship between O. novo-ulmi and its elm host and insect vectors. PMID:25539722

  7. Annotation of a hybrid partial genome of the coffee rust (Hemileia vastatrix) contributes to the gene repertoire catalog of the Pucciniales.

    PubMed

    Cristancho, Marco A; Botero-Rozo, David Octavio; Giraldo, William; Tabima, Javier; Riaño-Pachón, Diego Mauricio; Escobar, Carolina; Rozo, Yomara; Rivera, Luis F; Durán, Andrés; Restrepo, Silvia; Eilam, Tamar; Anikster, Yehoshua; Gaitán, Alvaro L

    2014-01-01

    Coffee leaf rust caused by the fungus Hemileia vastatrix is the most damaging disease to coffee worldwide. The pathogen has recently appeared in multiple outbreaks in coffee producing countries resulting in significant yield losses and increases in costs related to its control. New races/isolates are constantly emerging as evidenced by the presence of the fungus in plants that were previously resistant. Genomic studies are opening new avenues for the study of the evolution of pathogens, the detailed description of plant-pathogen interactions and the development of molecular techniques for the identification of individual isolates. For this purpose we sequenced 8 different H. vastatrix isolates using NGS technologies and gathered partial genome assemblies due to the large repetitive content in the coffee rust hybrid genome; 74.4% of the assembled contigs harbor repetitive sequences. A hybrid assembly of 333 Mb was built based on the 8 isolates; this assembly was used for subsequent analyses. Analysis of the conserved gene space showed that the hybrid H. vastatrix genome, though highly fragmented, had a satisfactory level of completion with 91.94% of core protein-coding orthologous genes present. RNA-Seq from urediniospores was used to guide the de novo annotation of the H. vastatrix gene complement. In total, 14,445 genes organized in 3921 families were uncovered; a considerable proportion of the predicted proteins (73.8%) were homologous to other Pucciniales species genomes. Several gene families related to the fungal lifestyle were identified, particularly 483 predicted secreted proteins that represent candidate effector genes and will provide interesting hints to decipher virulence in the coffee rust fungus. The genome sequence of Hva will serve as a template to understand the molecular mechanisms used by this fungus to attack the coffee plant, to study the diversity of this species and for the development of molecular markers to distinguish races/isolates.

  8. Genome annotation of a 1.5 Mb region of human chromosome 6q23 encompassing a quantitative trait locus for fetal hemoglobin expression in adults

    PubMed Central

    Close, James; Game, Laurence; Clark, Barnaby; Bergounioux, Jean; Gerovassili, Ageliki; Thein, Swee Lay

    2004-01-01

    Background Heterocellular hereditary persistence of fetal hemoglobin (HPFH) is a common multifactorial trait characterized by a modest increase of fetal hemoglobin levels in adults. We previously localized a Quantitative Trait Locus for HPFH in an extensive Asian-Indian kindred to chromosome 6q23. As part of the strategy of positional cloning and a means towards identification of the specific genetic alteration in this family, a thorough annotation of the candidate interval based on a strategy of in silico / wet biology approach with comparative genomics was conducted. Results The ~1.5 Mb candidate region was shown to contain five protein-coding genes. We discovered a very large uncharacterized gene containing WD40 and SH3 domains (AHI1), and extended the annotation of four previously characterized genes (MYB, ALDH8A1, HBS1L and PDE7B). We also identified several genes that do not appear to be protein coding, and generated 17 kb of novel transcript sequence data from re-sequencing 97 EST clones. Conclusion Detailed and thorough annotation of this 1.5 Mb interval in 6q confirms a high level of aberrant transcripts in testicular tissue. The candidate interval was shown to exhibit an extraordinary level of alternate splicing – 19 transcripts were identified for the 5 protein coding genes, but it appears that a significant portion (14/19) of these alternate transcripts did not have an open reading frame, hence their functional role is questionable. These transcripts may result from aberrant rather than regulated splicing. PMID:15169551

  9. IMG 4 version of the integrated microbial genomes comparative analysis system.

    PubMed

    Markowitz, Victor M; Chen, I-Min A; Palaniappan, Krishna; Chu, Ken; Szeto, Ernest; Pillay, Manoj; Ratner, Anna; Huang, Jinghua; Woyke, Tanja; Huntemann, Marcel; Anderson, Iain; Billis, Konstantinos; Varghese, Neha; Mavromatis, Konstantinos; Pati, Amrita; Ivanova, Natalia N; Kyrpides, Nikos C

    2014-01-01

    The Integrated Microbial Genomes (IMG) data warehouse integrates genomes from all three domains of life, as well as plasmids, viruses and genome fragments. IMG provides tools for analyzing and reviewing the structural and functional annotations of genomes in a comparative context. IMG's data content and analytical capabilities have increased continuously since its first version released in 2005. Since the last report published in the 2012 NAR Database Issue, IMG's annotation and data integration pipelines have evolved while new tools have been added for recording and analyzing single cell genomes, RNA Seq and biosynthetic cluster data. Different IMG datamarts provide support for the analysis of publicly available genomes (IMG/W: http://img.jgi.doe.gov/w), expert review of genome annotations (IMG/ER: http://img.jgi.doe.gov/er) and teaching and training in the area of microbial genome analysis (IMG/EDU: http://img.jgi.doe.gov/edu). PMID:24165883

  10. IMG 4 version of the integrated microbial genomes comparative analysis system

    SciTech Connect

    Markowitz, Victor M.; Chen, I-Min A.; Palaniappan, Krishna; Chu, Ken; Szeto, Ernest; Pillay, Manoj; Ratner, Anna; Huang, Jinghua; Woyke, Tanja; Huntemann, Marcel; Anderson, Iain; Billis, Konstantinos; Varghese, Neha; Mavromatis, Konstantinos; Pati, Amrita; Ivanova, Natalia N.; Kyrpides, Nikos C.

    2013-10-27

    The Integrated Microbial Genomes (IMG) data warehouse integrates genomes from all three domains of life, as well as plasmids, viruses and genome fragments. IMG provides tools for analyzing and reviewing the structural and functional annotations of genomes in a comparative context. IMG’s data content and analytical capabilities have increased continuously since its first version released in 2005. Since the last report published in the 2012 NAR Database Issue, IMG’s annotation and data integration pipelines have evolved while new tools have been added for recording and analyzing single cell genomes, RNA Seq and biosynthetic cluster data. Finally, different IMG datamarts provide support for the analysis of publicly available genomes (IMG/W: http://img.jgi.doe.gov/w), expert review of genome annotations (IMG/ER: http://img.jgi.doe.gov/er) and teaching and training in the area of microbial genome analysis (IMG/EDU: http://img.jgi.doe.gov/edu).

  11. All SNPs are not created equal: genome-wide association studies reveal a consistent pattern of enrichment among functionally annotated SNPs.

    PubMed

    Schork, Andrew J; Thompson, Wesley K; Pham, Phillip; Torkamani, Ali; Roddey, J Cooper; Sullivan, Patrick F; Kelsoe, John R; O'Donovan, Michael C; Furberg, Helena; Schork, Nicholas J; Andreassen, Ole A; Dale, Anders M

    2013-04-01

    Recent results indicate that genome-wide association studies (GWAS) have the potential to explain much of the heritability of common complex phenotypes, but methods are lacking to reliably identify the remaining associated single nucleotide polymorphisms (SNPs). We applied stratified False Discovery Rate (sFDR) methods to leverage genic enrichment in GWAS summary statistics data to uncover new loci likely to replicate in independent samples. Specifically, we use linkage disequilibrium-weighted annotations for each SNP in combination with nominal p-values to estimate the True Discovery Rate (TDR = 1-FDR) for strata determined by different genic categories. We show a consistent pattern of enrichment of polygenic effects in specific annotation categories across diverse phenotypes, with the greatest enrichment for SNPs tagging regulatory and coding genic elements, little enrichment in introns, and negative enrichment for intergenic SNPs. Stratified enrichment directly leads to increased TDR for a given p-value, mirrored by increased replication rates in independent samples. We show this in independent Crohn's disease GWAS, where we find a hundredfold variation in replication rate across genic categories. Applying a well-established sFDR methodology we demonstrate the utility of stratification for improving power of GWAS in complex phenotypes, with increased rejection rates from 20% in height to 300% in schizophrenia with traditional FDR and sFDR both fixed at 0.05. Our analyses demonstrate an inherent stratification among GWAS SNPs with important conceptual implications that can be leveraged by statistical methods to improve the discovery of loci.

  12. All SNPs are not created equal: genome-wide association studies reveal a consistent pattern of enrichment among functionally annotated SNPs.

    PubMed

    Schork, Andrew J; Thompson, Wesley K; Pham, Phillip; Torkamani, Ali; Roddey, J Cooper; Sullivan, Patrick F; Kelsoe, John R; O'Donovan, Michael C; Furberg, Helena; Schork, Nicholas J; Andreassen, Ole A; Dale, Anders M

    2013-04-01

    Recent results indicate that genome-wide association studies (GWAS) have the potential to explain much of the heritability of common complex phenotypes, but methods are lacking to reliably identify the remaining associated single nucleotide polymorphisms (SNPs). We applied stratified False Discovery Rate (sFDR) methods to leverage genic enrichment in GWAS summary statistics data to uncover new loci likely to replicate in independent samples. Specifically, we use linkage disequilibrium-weighted annotations for each SNP in combination with nominal p-values to estimate the True Discovery Rate (TDR = 1-FDR) for strata determined by different genic categories. We show a consistent pattern of enrichment of polygenic effects in specific annotation categories across diverse phenotypes, with the greatest enrichment for SNPs tagging regulatory and coding genic elements, little enrichment in introns, and negative enrichment for intergenic SNPs. Stratified enrichment directly leads to increased TDR for a given p-value, mirrored by increased replication rates in independent samples. We show this in independent Crohn's disease GWAS, where we find a hundredfold variation in replication rate across genic categories. Applying a well-established sFDR methodology we demonstrate the utility of stratification for improving power of GWAS in complex phenotypes, with increased rejection rates from 20% in height to 300% in schizophrenia with traditional FDR and sFDR both fixed at 0.05. Our analyses demonstrate an inherent stratification among GWAS SNPs with important conceptual implications that can be leveraged by statistical methods to improve the discovery of loci. PMID:23637621

  13. Genome Annotation by Shotgun Inactivation of a Native Gene in Hemizygous Cells: Application to BRCA2 with Implication of Hypomorphic Variants

    PubMed Central

    Ghosh, Soma; Bhunia, Anil K.; Paun, Bogdan C.; Gilbert, Samuel F.; Dhru, Urmil; Patel, Kalpesh; Kern, Scott E.

    2015-01-01

    The greatest interpretive challenge of modern medicine may be to functionally annotate the vast variation of human genomes. Demonstrating a proposed approach, we created a library of BRCA2 exon 27 shotgun-mutant plasmids including solitary and multiplex mutations to generate human knockin clones using homologous recombination. This 55-mutation, 13-clone syngeneic variance library (SyVaL) comprised severely affected clones having early-stop nonsense mutations, functionally hypomorphic clones having multiple missense mutations emphasizing the potential to identify and assess hypomorphic mutations in novel proteomic and epidemiologic studies, and neutral clones having multiple missense mutations. Efficient coverage of nonessential amino acids was provided by mutation multiplexing. Severe mutations were distinguished from hypomorphic or neutral changes by chemosensitivity assays (hypersensitivity to mitomycin C and acetaldehyde), by analysis of RAD51 focus formation, and by mitotic multipolarity. A multiplex unbiased approach of generating all-human SyVaLs in medically important genes, with random mutations in native genes, would provide databases of variants that could be functionally annotated without concerns arising from exogenous cDNA constructs or interspecies interactions, as a basis for subsequent proteomic domain mapping or clinical calibration if desired. Such gene-irrelevant approaches could be scaled up for multiple genes of clinical interest, providing distributable cellular libraries linked to public-shared functional databases. PMID:25451944

  14. OrthoVenn: a web server for genome wide comparison and annotation of orthologous clusters across multiple species

    Technology Transfer Automated Retrieval System (TEKTRAN)

    Genome wide analysis of orthologous clusters is an important component of comparative genomics studies. Identifying the overlap among orthologous clusters can enable us to elucidate the function and evolution of proteins across multiple species. Here, we report a web platform named OrthoVenn that i...

  15. Algal functional annotation tool

    SciTech Connect

    Lopez, D.; Casero, D.; Cokus, S. J.; Merchant, S. S.; Pellegrini, M.

    2012-07-01

    The Algal Functional Annotation Tool is a web-based comprehensive analysis suite integrating annotation data from several pathway, ontology, and protein family databases. The current version provides annotation for the model alga Chlamydomonas reinhardtii, and in the future will include additional genomes. The site allows users to interpret large gene lists by identifying associated functional terms, and their enrichment. Additionally, expression data for several experimental conditions were compiled and analyzed to provide an expression-based enrichment search. A tool to search for functionally-related genes based on gene expression across these conditions is also provided. Other features include dynamic visualization of genes on KEGG pathway maps and batch gene identifier conversion.

  16. Assembly and annotation of full mitochondrial genomes for the corn rootworm species, Diabrotica virgifera virgifera and Diabrotica barberi (Insecta: Coleoptera: Chrysomelidae), using Next Generation Sequence data.

    PubMed

    Coates, Brad S

    2014-06-01

    Complete mitochondrial genomes for two corn rootworm species, Diabrotica virgifera virgifera (16,747 bp) and Diabrotica barberi (16,632; Insecta: Coleoptera: Chrysomelidae), were assembled from Illumina HiSeq2000 read data. Annotation indicated that the order and orientation of 13 protein coding genes (PCGs), and 22 tRNA and 2 rRNA sequences were in typical of insect mitochondrial genomes. Non-standard nad4 and cox3 stop codons were composed of single T nucleotides and likely completed by adenylation, and atypical TTT start codons was predicted for both D. v. virgifera and D. barberinad1 genes. The D. v. virgifera and D. barberi haplotypes showed 819 variable nucleotide positions within PCG regions (7.36% divergence), which suggest that speciation may have occurred ~3.68 million years ago assuming a linear rate of short-term substitution. Phylogenetic analyses of Coleopteran MtD genome show clustering based on family level, and may have the capacity to resolve the evolutionary history within this Order of insects. PMID:24657060

  17. Analysis of the multi-copied genes and the impact of the redundant protein coding sequences on gene annotation in prokaryotic genomes.

    PubMed

    Yu, Jia-Feng; Chen, Qing-Li; Ren, Jing; Yang, Yan-Ling; Wang, Ji-Hua; Sun, Xiao

    2015-07-01

    The important roles of duplicated genes in evolutional process have been recognized in bacteria, archaebacteria and eukaryotes, while there is very little study on the multi-copied protein coding genes that share sequence identity of 100%. In this paper, the multi-copied protein coding genes in a number of prokaryotic genomes are comprehensively analyzed firstly. The results show that 0-15.93% of the protein coding genes in each genome are multi-copied genes and 0-16.49% of the protein coding genes in each genome are highly similar with the sequence identity ≥ 80%. Function and COG (Clusters of Orthologous Groups of proteins) analysis shows that 64.64% of multi-copied genes concentrate on the function of transposase and 86.28% of the COG assigned multi-copied genes concentrate on the COG code of 'L'. Furthermore, the impact of redundant protein coding sequences on the gene prediction results is studied. The results show that the problem of protein coding sequence redundancies cannot be ignored and the consistency of the gene annotation results before and after excluding the redundant sequences is negatively related with the sequences redundancy degree of the protein coding sequences in the training set.

  18. Guitar: An R/Bioconductor Package for Gene Annotation Guided Transcriptomic Analysis of RNA-Related Genomic Features.

    PubMed

    Cui, Xiaodong; Wei, Zhen; Zhang, Lin; Liu, Hui; Sun, Lei; Zhang, Shao-Wu; Huang, Yufei; Meng, Jia

    2016-01-01

    Biological features, such as genes and transcription factor binding sites, are often denoted with genome-based coordinates as the genomic features. While genome-based representation is usually very effective in correlating various biological features, it can be tedious to examine the relationship between RNA-related genomic features and the landmarks of RNA transcripts with existing tools due to the difficulty in the conversion between genome-based coordinates and RNA-based coordinates. We developed here an open source Guitar R/Bioconductor package for sketching the transcriptomic view of RNA-related biological features represented by genome based coordinates. Internally, Guitar package extracts the standardized RNA coordinates with respect to the landmarks of RNA transcripts, with which hundreds of millions of RNA-related genomic features can then be efficiently analyzed within minutes. We demonstrated the usage of Guitar package in analyzing posttranscriptional RNA modifications (5-methylcytosine and N6-methyladenosine) derived from high-throughput sequencing approaches (MeRIP-Seq and RNA BS-Seq) and show that RNA 5-methylcytosine (m(5)C) is enriched in 5'UTR. The newly developed Guitar R/Bioconductor package achieves stable performance on the data tested and revealed novel biological insights. It will effectively facilitate the analysis of RNA methylation data and other RNA-related biological features in the future.

  19. Guitar: An R/Bioconductor Package for Gene Annotation Guided Transcriptomic Analysis of RNA-Related Genomic Features

    PubMed Central

    Cui, Xiaodong; Wei, Zhen; Zhang, Lin; Liu, Hui; Sun, Lei; Zhang, Shao-Wu; Huang, Yufei; Meng, Jia

    2016-01-01

    Biological features, such as genes and transcription factor binding sites, are often denoted with genome-based coordinates as the genomic features. While genome-based representation is usually very effective in correlating various biological features, it can be tedious to examine the relationship between RNA-related genomic features and the landmarks of RNA transcripts with existing tools due to the difficulty in the conversion between genome-based coordinates and RNA-based coordinates. We developed here an open source Guitar R/Bioconductor package for sketching the transcriptomic view of RNA-related biological features represented by genome based coordinates. Internally, Guitar package extracts the standardized RNA coordinates with respect to the landmarks of RNA transcripts, with which hundreds of millions of RNA-related genomic features can then be efficiently analyzed within minutes. We demonstrated the usage of Guitar package in analyzing posttranscriptional RNA modifications (5-methylcytosine and N6-methyladenosine) derived from high-throughput sequencing approaches (MeRIP-Seq and RNA BS-Seq) and show that RNA 5-methylcytosine (m5C) is enriched in 5′UTR. The newly developed Guitar R/Bioconductor package achieves stable performance on the data tested and revealed novel biological insights. It will effectively facilitate the analysis of RNA methylation data and other RNA-related biological features in the future. PMID:27239475

  20. Guitar: An R/Bioconductor Package for Gene Annotation Guided Transcriptomic Analysis of RNA-Related Genomic Features.

    PubMed

    Cui, Xiaodong; Wei, Zhen; Zhang, Lin; Liu, Hui; Sun, Lei; Zhang, Shao-Wu; Huang, Yufei; Meng, Jia

    2016-01-01

    Biological features, such as genes and transcription factor binding sites, are often denoted with genome-based coordinates as the genomic features. While genome-based representation is usually very effective in correlating various biological features, it can be tedious to examine the relationship between RNA-related genomic features and the landmarks of RNA transcripts with existing tools due to the difficulty in the conversion between genome-based coordinates and RNA-based coordinates. We developed here an open source Guitar R/Bioconductor package for sketching the transcriptomic view of RNA-related biological features represented by genome based coordinates. Internally, Guitar package extracts the standardized RNA coordinates with respect to the landmarks of RNA transcripts, with which hundreds of millions of RNA-related genomic features can then be efficiently analyzed within minutes. We demonstrated the usage of Guitar package in analyzing posttranscriptional RNA modifications (5-methylcytosine and N6-methyladenosine) derived from high-throughput sequencing approaches (MeRIP-Seq and RNA BS-Seq) and show that RNA 5-methylcytosine (m(5)C) is enriched in 5'UTR. The newly developed Guitar R/Bioconductor package achieves stable performance on the data tested and revealed novel biological insights. It will effectively facilitate the analysis of RNA methylation data and other RNA-related biological features in the future. PMID:27239475

  1. Whole-Genome Sequencing and Annotation of Bacillus safensis RIT372 and Pseudomonas oryzihabitans RIT370 from Capsicum annuum (Bird's Eye Chili) and Capsicum chinense (Yellow Lantern Chili), Respectively.

    PubMed

    Gan, Huan You; Gan, Han Ming; Savka, Michael A; Triassi, Alexander J; Wheatley, Matthew S; Naqvi, Kubra F; Foxhall, Taylor E; Anauo, Michael J; Baldwin, Mariah L; Burkhardt, Russell N; O'Bryon, Isabelle G; Dailey, Lucas K; Busairi, Nurfatini Idayu; Keith, Robert C; Khair, Megat Hazmah Megat Mazhar; Rasul, Muhammad Zamir Mohd; Rosdi, Nur Aiman Mohd; Mountzouros, James R; Rhoads, Aleigha C; Selochan, Melissa A; Tautanov, Timur B; Polter, Steven J; Marks, Kayla D; Caraballo, Alexander A; Hudson, André O

    2015-01-01

    Here, we report the genome sequences of Bacillus safensis RIT372 and Pseudomonas oryzihabitans RIT370 from Capsicum spp. Annotation revealed gene clusters for the synthesis of bacilysin, lichensin, and bacillibactin and sporulation killing factor (skfA) in Bacillus safensis RIT372 and turnerbactin and carotenoid in Pseudomonas oryzihabitans RIT370. PMID:25883290

  2. Whole-Genome Sequencing and Annotation of Bacillus safensis RIT372 and Pseudomonas oryzihabitans RIT370 from Capsicum annuum (Bird’s Eye Chili) and Capsicum chinense (Yellow Lantern Chili), Respectively

    PubMed Central

    Gan, Huan You; Gan, Han Ming; Savka, Michael A.; Triassi, Alexander J.; Wheatley, Matthew S.; Naqvi, Kubra F.; Foxhall, Taylor E.; Anauo, Michael J.; Baldwin, Mariah L.; Burkhardt, Russell N.; O’Bryon, Isabelle G.; Dailey, Lucas K.; Busairi, Nurfatini Idayu; Keith, Robert C.; Khair, Megat Hazmah Megat Mazhar; Rasul, Muhammad Zamir Mohd; Rosdi, Nur Aiman Mohd; Mountzouros, James R.; Rhoads, Aleigha C.; Selochan, Melissa A.; Tautanov, Timur B.; Polter, Steven J.; Marks, Kayla D.; Caraballo, Alexander A.

    2015-01-01

    Here, we report the genome sequences of Bacillus safensis RIT372 and Pseudomonas oryzihabitans RIT370 from Capsicum spp. Annotation revealed gene clusters for the synthesis of bacilysin, lichensin, and bacillibactin and sporulation killing factor (skfA) in Bacillus safensis RIT372 and turnerbactin and carotenoid in Pseudomonas oryzihabitans RIT370. PMID:25883290

  3. Whole-Genome Sequencing and Annotation of Bacillus safensis RIT372 and Pseudomonas oryzihabitans RIT370 from Capsicum annuum (Bird's Eye Chili) and Capsicum chinense (Yellow Lantern Chili), Respectively.

    PubMed

    Gan, Huan You; Gan, Han Ming; Savka, Michael A; Triassi, Alexander J; Wheatley, Matthew S; Naqvi, Kubra F; Foxhall, Taylor E; Anauo, Michael J; Baldwin, Mariah L; Burkhardt, Russell N; O'Bryon, Isabelle G; Dailey, Lucas K; Busairi, Nurfatini Idayu; Keith, Robert C; Khair, Megat Hazmah Megat Mazhar; Rasul, Muhammad Zamir Mohd; Rosdi, Nur Aiman Mohd; Mountzouros, James R; Rhoads, Aleigha C; Selochan, Melissa A; Tautanov, Timur B; Polter, Steven J; Marks, Kayla D; Caraballo, Alexander A; Hudson, André O

    2015-04-16

    Here, we report the genome sequences of Bacillus safensis RIT372 and Pseudomonas oryzihabitans RIT370 from Capsicum spp. Annotation revealed gene clusters for the synthesis of bacilysin, lichensin, and bacillibactin and sporulation killing factor (skfA) in Bacillus safensis RIT372 and turnerbactin and carotenoid in Pseudomonas oryzihabitans RIT370.

  4. RNA-seq-Based Gene Annotation and Comparative Genomics of Four Fungal Grass Pathogens in the Genus Zymoseptoria Identify Novel Orphan Genes and Species-Specific Invasions of Transposable Elements.

    PubMed

    Grandaubert, Jonathan; Bhattacharyya, Amitava; Stukenbrock, Eva H

    2015-04-27

    The fungal pathogen Zymoseptoria tritici (synonym Mycosphaerella graminicola) is a prominent pathogen of wheat. The reference genome of the isolate IPO323 is one of the best-assembled eukaryotic genomes and encodes more than 10,000 predicted genes. However, a large proportion of the previously annotated gene models are incomplete, with either no start or no stop codons. The availability of RNA-seq data allows better predictions of gene structure. We here used two different RNA-seq datasets, de novo transcriptome assemblies, homology-based comparisons, and trained ab initio gene callers to generate a new gene annotation of Z. tritici IPO323. The annotation pipeline was also applied to re-sequenced genomes of three closely related species of Z. tritici: Z. pseudotritici, Z. ardabiliae, and Z. brevis. Comparative analyses of the predicted gene models using the four Zymoseptoria species revealed sets of species-specific orphan genes enriched with putative pathogenicity-related genes encoding small secreted proteins that may play essential roles in virulence and host specificity. De novo repeat identification allowed us to show that few families of transposable elements are shared between Zymoseptoria species while we observe many species-specific invasions and expansions. The annotation data presented here provide a high-quality resource for future studies of Z. tritici and its sister species and provide detailed insight into gene and genome evolution of fungal plant pathogens.

  5. Genome annotation provides insight into carbon monoxide and hydrogen metabolism in Rubrivivax gelatinosus

    SciTech Connect

    Wawrousek, Karen; Noble, Scott; Korlach, Jonas; Eckert, Carrie; Yu, Jianping; Maness, Pin -Ching

    2014-12-05

    In this article, we report here the sequencing and analysis of the genome of the purple non-sulfur photosynthetic bacterium Rubrivivax gelatinosus CBS. This microbe is a model for studies of its carboxydotrophic life style under anaerobic condition, based on its ability to utilize carbon monoxide (CO) as the sole carbon substrate and water as the electron acceptor, yielding CO2 and H2 as the end products. The CO-oxidation reaction is known to be catalyzed by two enzyme complexes, the CO dehydrogenase and hydrogenase. As expected, analysis of the genome of Rx. gelatinosus CBS reveals the presence of genes encoding both enzyme complexes. The CO-oxidation reaction is CO-inducible, which is consistent with the presence of two putative CO-sensing transcription factors in its genome. Genome analysis also reveals the presence of two additional hydrogenases, an uptake hydrogenase that liberates the electrons in H2 in support of cell growth, and a regulatory hydrogenase that senses H2 and relays the signal to a two-component system that ultimately controls synthesis of the uptake hydrogenase. The genome also contains two sets of hydrogenase maturation genes which are known to assemble the catalytic metallocluster of the hydrogenase NiFe active site. Finally and collectively, the genome sequence and analysis information reveals the blueprint of an intricate network of signal transduction pathways and its underlying regulation that enables Rx. gelatinosus CBS to thrive on CO or H2 in support of cell growth.

  6. Genome sequencing and annotation of Laceyella sacchari strain GS 1-1, isolated from hot spring, Chumathang, Leh, India.

    PubMed

    Kaur, Navjot; Arora, Amit; Kumar, Narender; Mayilraj, Shanmugam

    2014-12-01

    We report the 3.3-Mb draft genome of Laceyella sacchari strain GS 1-1, isolated from hot spring water sample, Chumathang, Leh, India. Draft genome of strain GS 1-1 consists of 3, 324, 316 bp with a G + C content of 48.8% and 3429 predicted protein coding genes and 75 RNAs. Geobacillus thermodenitrificans strain NG80-2, Geobacillus kaustophilus strain HTA426 and Geobacillus sp. Strain G11MC16 are the closest neighbors of the strain GS 1-1.

  7. De novo assembly and annotation of the Asian tiger mosquito (Aedes albopictus) repeatome with dnaPipeTE from raw genomic reads and comparative analysis with the yellow fever mosquito (Aedes aegypti).

    PubMed

    Goubert, Clément; Modolo, Laurent; Vieira, Cristina; ValienteMoro, Claire; Mavingui, Patrick; Boulesteix, Matthieu

    2015-04-01

    Repetitive DNA, including transposable elements (TEs), is found throughout eukaryotic genomes. Annotating and assembling the "repeatome" during genome-wide analysis often poses a challenge. To address this problem, we present dnaPipeTE-a new bioinformatics pipeline that uses a sample of raw genomic reads. It produces precise estimates of repeated DNA content and TE consensus sequences, as well as the relative ages of TE families. We shows that dnaPipeTE performs well using very low coverage sequencing in different genomes, losing accuracy only with old TE families. We applied this pipeline to the genome of the Asian tiger mosquito Aedes albopictus, an invasive species of human health interest, for which the genome size is estimated to be over 1 Gbp. Using dnaPipeTE, we showed that this species harbors a large (50% of the genome) and potentially active repeatome with an overall TE class and order composition similar to that of Aedes aegypti, the yellow fever mosquito. However, intraorder dynamics show clear distinctions between the two species, with differences at the TE family level. Our pipeline's ability to manage the repeatome annotation problem will make it helpful for new or ongoing assembly projects, and our results will benefit future genomic studies of A. albopictus. PMID:25767248

  8. Interspecific Comparison and annotation of two complete mitochondrial genome sequences from the plant pathogenic fungus Mycosphaerella graminicola

    SciTech Connect

    Millenbaugh, Bonnie A; Pangilinan, Jasmyn L.; Torriani, Stefano F.F.; Goodwin, Stephen B.; Kema, Gert H.J.; McDonald, Bruce A.

    2007-12-07

    The mitochondrial genomes of two isolates of the wheat pathogen Mycosphaerella graminicola were sequenced completely and compared to identify polymorphic regions. This organism is of interest because it is phylogenetically distant from other fungi with sequenced mitochondrial genomes and it has shown discordant patterns of nuclear and mitochondrial diversity. The mitochondrial genome of M. graminicola is a circular molecule of approximately 43,960 bp containing the typical genes coding for 14 proteins related to oxidative phosphorylation, one RNA polymerase, two rRNA genes and a set of 27 tRNAs. The mitochondrial DNA of M. graminicola lacks the gene encoding the putative ribosomal protein (rps5-like), commonly found in fungal mitochondrial genomes. Most of the tRNA genes were clustered with a gene order conserved with many other ascomycetes. A sample of thirty-five additional strains representing the known global mt diversity was partially sequenced to measure overall mitochondrial variability within the species. Little variation was found, confirming previous RFLP-based findings of low mitochondrial diversity. The mitochondrial sequence of M. graminicola is the first reported from the family Mycosphaerellaceae or the order Capnodiales. The sequence also provides a tool to better understand the development of fungicide resistance and the conflicting pattern of high nuclear and low mitochondrial diversity in global populations of this fungus.

  9. An atlas of bovine gene expression reveals novel distinctive tissue characteristics and evidence for improving genome annotation

    Technology Transfer Automated Retrieval System (TEKTRAN)

    Background A comprehensive transcriptome survey, or gene atlas, provides information essential for a complete understanding of the genomic biology of an organism. We present an atlas of RNA abundance for 92 adult, juvenile and fetal cattle tissues and three cattle cell lines. Results The Bovine Gene...

  10. All SNPs Are Not Created Equal: Genome-Wide Association Studies Reveal a Consistent Pattern of Enrichment among Functionally Annotated SNPs

    PubMed Central

    Schork, Andrew J.; Thompson, Wesley K.; Pham, Phillip; Torkamani, Ali; Roddey, J. Cooper; Sullivan, Patrick F.; Kelsoe, John R.; O'Donovan, Michael C.; Furberg, Helena; Schork, Nicholas J.; Andreassen, Ole A.; Dale, Anders M.

    2013-01-01

    Recent results indicate that genome-wide association studies (GWAS) have the potential to explain much of the heritability of common complex phenotypes, but methods are lacking to reliably identify the remaining associated single nucleotide polymorphisms (SNPs). We applied stratified False Discovery Rate (sFDR) methods to leverage genic enrichment in GWAS summary statistics data to uncover new loci likely to replicate in independent samples. Specifically, we use linkage disequilibrium-weighted annotations for each SNP in combination with nominal p-values to estimate the True Discovery Rate (TDR = 1−FDR) for strata determined by different genic categories. We show a consistent pattern of enrichment of polygenic effects in specific annotation categories across diverse phenotypes, with the greatest enrichment for SNPs tagging regulatory and coding genic elements, little enrichment in introns, and negative enrichment for intergenic SNPs. Stratified enrichment directly leads to increased TDR for a given p-value, mirrored by increased replication rates in independent samples. We show this in independent Crohn's disease GWAS, where we find a hundredfold variation in replication rate across genic categories. Applying a well-established sFDR methodology we demonstrate the utility of stratification for improving power of GWAS in complex phenotypes, with increased rejection rates from 20% in height to 300% in schizophrenia with traditional FDR and sFDR both fixed at 0.05. Our analyses demonstrate an inherent stratification among GWAS SNPs with important conceptual implications that can be leveraged by statistical methods to improve the discovery of loci. PMID:23637621

  11. Gene Ontology annotations and resources.

    PubMed

    Blake, J A; Dolan, M; Drabkin, H; Hill, D P; Li, Ni; Sitnikov, D; Bridges, S; Burgess, S; Buza, T; McCarthy, F; Peddinti, D; Pillai, L; Carbon, S; Dietze, H; Ireland, A; Lewis, S E; Mungall, C J; Gaudet, P; Chrisholm, R L; Fey, P; Kibbe, W A; Basu, S; Siegele, D A; McIntosh, B K; Renfro, D P; Zweifel, A E; Hu, J C; Brown, N H; Tweedie, S; Alam-Faruque, Y; Apweiler, R; Auchinchloss, A; Axelsen, K; Bely, B; Blatter, M -C; Bonilla, C; Bouguerleret, L; Boutet, E; Breuza, L; Bridge, A; Chan, W M; Chavali, G; Coudert, E; Dimmer, E; Estreicher, A; Famiglietti, L; Feuermann, M; Gos, A; Gruaz-Gumowski, N; Hieta, R; Hinz, C; Hulo, C; Huntley, R; James, J; Jungo, F; Keller, G; Laiho, K; Legge, D; Lemercier, P; Lieberherr, D; Magrane, M; Martin, M J; Masson, P; Mutowo-Muellenet, P; O'Donovan, C; Pedruzzi, I; Pichler, K; Poggioli, D; Porras Millán, P; Poux, S; Rivoire, C; Roechert, B; Sawford, T; Schneider, M; Stutz, A; Sundaram, S; Tognolli, M; Xenarios, I; Foulgar, R; Lomax, J; Roncaglia, P; Khodiyar, V K; Lovering, R C; Talmud, P J; Chibucos, M; Giglio, M Gwinn; Chang, H -Y; Hunter, S; McAnulla, C; Mitchell, A; Sangrador, A; Stephan, R; Harris, M A; Oliver, S G; Rutherford, K; Wood, V; Bahler, J; Lock, A; Kersey, P J; McDowall, D M; Staines, D M; Dwinell, M; Shimoyama, M; Laulederkind, S; Hayman, T; Wang, S -J; Petri, V; Lowry, T; D'Eustachio, P; Matthews, L; Balakrishnan, R; Binkley, G; Cherry, J M; Costanzo, M C; Dwight, S S; Engel, S R; Fisk, D G; Hitz, B C; Hong, E L; Karra, K; Miyasato, S R; Nash, R S; Park, J; Skrzypek, M S; Weng, S; Wong, E D; Berardini, T Z; Huala, E; Mi, H; Thomas, P D; Chan, J; Kishore, R; Sternberg, P; Van Auken, K; Howe, D; Westerfield, M

    2013-01-01

    The Gene Ontology (GO) Consortium (GOC, http://www.geneontology.org) is a community-based bioinformatics resource that classifies gene product function through the use of structured, controlled vocabularies. Over the past year, the GOC has implemented several processes to increase the quantity, quality and specificity of GO annotations. First, the number of manual, literature-based annotations has grown at an increasing rate. Second, as a result of a new 'phylogenetic annotation' process, manually reviewed, homology-based annotations are becoming available for a broad range of species. Third, the quality of GO annotations has been improved through a streamlined process for, and automated quality checks of, GO annotations deposited by different annotation groups. Fourth, the consistency and correctness of the ontology itself has increased by using automated reasoning tools. Finally, the GO has been expanded not only to cover new areas of biology through focused interaction with experts, but also to capture greater specificity in all areas of the ontology using tools for adding new combinatorial terms. The GOC works closely with other ontology developers to support integrated use of terminologies. The GOC supports its user community through the use of e-mail lists, social media and web-based resources.

  12. Gene Ontology annotations and resources.

    PubMed

    Blake, J A; Dolan, M; Drabkin, H; Hill, D P; Li, Ni; Sitnikov, D; Bridges, S; Burgess, S; Buza, T; McCarthy, F; Peddinti, D; Pillai, L; Carbon, S; Dietze, H; Ireland, A; Lewis, S E; Mungall, C J; Gaudet, P; Chrisholm, R L; Fey, P; Kibbe, W A; Basu, S; Siegele, D A; McIntosh, B K; Renfro, D P; Zweifel, A E; Hu, J C; Brown, N H; Tweedie, S; Alam-Faruque, Y; Apweiler, R; Auchinchloss, A; Axelsen, K; Bely, B; Blatter, M -C; Bonilla, C; Bouguerleret, L; Boutet, E; Breuza, L; Bridge, A; Chan, W M; Chavali, G; Coudert, E; Dimmer, E; Estreicher, A; Famiglietti, L; Feuermann, M; Gos, A; Gruaz-Gumowski, N; Hieta, R; Hinz, C; Hulo, C; Huntley, R; James, J; Jungo, F; Keller, G; Laiho, K; Legge, D; Lemercier, P; Lieberherr, D; Magrane, M; Martin, M J; Masson, P; Mutowo-Muellenet, P; O'Donovan, C; Pedruzzi, I; Pichler, K; Poggioli, D; Porras Millán, P; Poux, S; Rivoire, C; Roechert, B; Sawford, T; Schneider, M; Stutz, A; Sundaram, S; Tognolli, M; Xenarios, I; Foulgar, R; Lomax, J; Roncaglia, P; Khodiyar, V K; Lovering, R C; Talmud, P J; Chibucos, M; Giglio, M Gwinn; Chang, H -Y; Hunter, S; McAnulla, C; Mitchell, A; Sangrador, A; Stephan, R; Harris, M A; Oliver, S G; Rutherford, K; Wood, V; Bahler, J; Lock, A; Kersey, P J; McDowall, D M; Staines, D M; Dwinell, M; Shimoyama, M; Laulederkind, S; Hayman, T; Wang, S -J; Petri, V; Lowry, T; D'Eustachio, P; Matthews, L; Balakrishnan, R; Binkley, G; Cherry, J M; Costanzo, M C; Dwight, S S; Engel, S R; Fisk, D G; Hitz, B C; Hong, E L; Karra, K; Miyasato, S R; Nash, R S; Park, J; Skrzypek, M S; Weng, S; Wong, E D; Berardini, T Z; Huala, E; Mi, H; Thomas, P D; Chan, J; Kishore, R; Sternberg, P; Van Auken, K; Howe, D; Westerfield, M

    2013-01-01

    The Gene Ontology (GO) Consortium (GOC, http://www.geneontology.org) is a community-based bioinformatics resource that classifies gene product function through the use of structured, controlled vocabularies. Over the past year, the GOC has implemented several processes to increase the quantity, quality and specificity of GO annotations. First, the number of manual, literature-based annotations has grown at an increasing rate. Second, as a result of a new 'phylogenetic annotation' process, manually reviewed, homology-based annotations are becoming available for a broad range of species. Third, the quality of GO annotations has been improved through a streamlined process for, and automated quality checks of, GO annotations deposited by different annotation groups. Fourth, the consistency and correctness of the ontology itself has increased by using automated reasoning tools. Finally, the GO has been expanded not only to cover new areas of biology through focused interaction with experts, but also to capture greater specificity in all areas of the ontology using tools for adding new combinatorial terms. The GOC works closely with other ontology developers to support integrated use of terminologies. The GOC supports its user community through the use of e-mail lists, social media and web-based resources. PMID:23161678

  13. Preserving sequence annotations across reference sequences

    PubMed Central

    2014-01-01

    Background Matching and comparing sequence annotations of different reference sequences is vital to genomics research, yet many annotation formats do not specify the reference sequence types or versions used. This makes the integration of annotations from different sources difficult and error prone. Results As part of our effort to create linked data for interoperable sequence annotations, we present an RDF data model for sequence annotation using the ontological framework established by the OBO Foundry ontologies and the Basic Formal Ontology (BFO). We defined reference sequences as the common domain of integration for sequence annotations, and identified three semantic relationships between sequence annotations. In doing so, we created the Reference Sequence Annotation to compensate for gaps in the SO and in its mapping to BFO, particularly for annotations that refer to versions of consensus reference sequences. Moreover, we present three integration models for sequence annotations using different reference assemblies. Conclusions We demonstrated a working example of a sequence annotation instance, and how this instance can be linked to other annotations on different reference sequences. Sequence annotations in this format are semantically rich and can be integrated easily with different assemblies. We also identify other challenges of modeling reference sequences with the BFO. PMID:25093075

  14. Genome-wide annotation, expression profiling, and protein interaction studies of the core cell-cycle genes in Phalaenopsis aphrodite.

    PubMed

    Lin, Hsiang-Yin; Chen, Jhun-Chen; Wei, Miao-Ju; Lien, Yi-Chen; Li, Huang-Hsien; Ko, Swee-Suak; Liu, Zin-Huang; Fang, Su-Chiung

    2014-01-01

    Orchidaceae is one of the most abundant and diverse families in the plant kingdom and its unique developmental patterns have drawn the attention of many evolutionary biologists. Particular areas of interest have included the co-evolution of pollinators and distinct floral structures, and symbiotic relationships with mycorrhizal flora. However, comprehensive studies to decipher the molecular basis of growth and development in orchids remain scarce. Cell proliferation governed by cell-cycle regulation is fundamental to growth and development of the plant body. We took advantage of recently released transcriptome information to systematically isolate and annotate the core cell-cycle regulators in the moth orchid Phalaenopsis aphrodite. Our data verified that Phalaenopsis cyclin-dependent kinase A (CDKA) is an evolutionarily conserved CDK. Expression profiling studies suggested that core cell-cycle genes functioning during the G1/S, S, and G2/M stages were preferentially enriched in the meristematic tissues that have high proliferation activity. In addition, subcellular localization and pairwise interaction analyses of various combinations of CDKs and cyclins, and of E2 promoter-binding factors and dimerization partners confirmed interactions of the functional units. Furthermore, our data showed that expression of the core cell-cycle genes was coordinately regulated during pollination-induced reproductive development. The data obtained establish a fundamental framework for study of the cell-cycle machinery in Phalaenopsis orchids.

  15. OrthoVenn: a web server for genome wide comparison and annotation of orthologous clusters across multiple species

    PubMed Central

    Wang, Yi; Coleman-Derr, Devin; Chen, Guoping; Gu, Yong Q.

    2015-01-01

    Genome wide analysis of orthologous clusters is an important component of comparative genomics studies. Identifying the overlap among orthologous clusters can enable us to elucidate the function and evolution of proteins across multiple species. Here, we report a web platform named OrthoVenn that is useful for genome wide comparisons and visualization of orthologous clusters. OrthoVenn provides coverage of vertebrates, metazoa, protists, fungi, plants and bacteria for the comparison of orthologous clusters and also supports uploading of customized protein sequences from user-defined species. An interactive Venn diagram, summary counts, and functional summaries of the disjunction and intersection of clusters shared between species are displayed as part of the OrthoVenn result. OrthoVenn also includes in-depth views of the clusters using various sequence analysis tools. Furthermore, OrthoVenn identifies orthologous clusters of single copy genes and allows for a customized search of clusters of specific genes through key words or BLAST. OrthoVenn is an efficient and user-friendly web server freely accessible at http://probes.pw.usda.gov/OrthoVenn or http://aegilops.wheat.ucdavis.edu/OrthoVenn. PMID:25964301

  16. OrthoVenn: a web server for genome wide comparison and annotation of orthologous clusters across multiple species.

    PubMed

    Wang, Yi; Coleman-Derr, Devin; Chen, Guoping; Gu, Yong Q

    2015-07-01

    Genome wide analysis of orthologous clusters is an important component of comparative genomics studies. Identifying the overlap among orthologous clusters can enable us to elucidate the function and evolution of proteins across multiple species. Here, we report a web platform named OrthoVenn that is useful for genome wide comparisons and visualization of orthologous clusters. OrthoVenn provides coverage of vertebrates, metazoa, protists, fungi, plants and bacteria for the comparison of orthologous clusters and also supports uploading of customized protein sequences from user-defined species. An interactive Venn diagram, summary counts, and functional summaries of the disjunction and intersection of clusters shared between species are displayed as part of the OrthoVenn result. OrthoVenn also includes in-depth views of the clusters using various sequence analysis tools. Furthermore, OrthoVenn identifies orthologous clusters of single copy genes and allows for a customized search of clusters of specific genes through key words or BLAST. OrthoVenn is an efficient and user-friendly web server freely accessible at http://probes.pw.usda.gov/OrthoVenn or http://aegilops.wheat.ucdavis.edu/OrthoVenn.

  17. Metabolic potential of the organic-solvent tolerant Pseudomonas putida DOT-T1E deduced from its annotated genome

    PubMed Central

    Udaondo, Zulema; Molina, Lazaro; Daniels, Craig; Gómez, Manuel J; Molina-Henares, María A; Matilla, Miguel A; Roca, Amalia; Fernández, Matilde; Duque, Estrella; Segura, Ana; Ramos, Juan Luis

    2013-01-01

    Summary Pseudomonas putida DOT-T1E is an organic solvent tolerant strain capable of degrading aromatic hydrocarbons. Here we report the DOT-T1E genomic sequence (6 394 153 bp) and its metabolic atlas based on the classification of enzyme activities. The genome encodes for at least 1751 enzymatic reactions that account for the known pattern of C, N, P and S utilization by this strain. Based on the potential of this strain to thrive in the presence of organic solvents and the subclasses of enzymes encoded in the genome, its metabolic map can be drawn and a number of potential biotransformation reactions can be deduced. This information may prove useful for adapting desired reactions to create value-added products. This bioengineering potential may be realized via direct transformation of substrates, or may require genetic engineering to block an existing pathway, or to re-organize operons and genes, as well as possibly requiring the recruitment of enzymes from other sources to achieve the desired transformation. Funding Information Work in our laboratory was supported by Fondo Social Europeo and Fondos FEDER from the European Union, through several projects (BIO2010-17227, Consolider-Ingenio CSD2007-00005, Excelencia 2007 CVI-3010, Excelencia 2011 CVI-7391 and EXPLORA BIO2011-12776-E). PMID:23815283

  18. Functional annotation of hypothetical proteins - A review.

    PubMed

    Sivashankari, Selvarajan; Shanmughavel, Piramanayagam

    2006-12-29

    The complete human genome sequences in the public database provide ways to understand the blue print of life. As of June 29, 2006, 27 archaeal, 326 bacterial and 21 eukaryotes is complete genomes are available and the sequencing for 316 bacterial, 24 archaeal, 126 eukaryotic genomes are in progress. The traditional biochemical/molecular experiments can assign accurate functions for genes in these genomes. However, the process is time-consuming and costly. Despite several efforts, only 50-60 % of genes have been annotated in most completely sequenced genomes. Automated genome sequence analysis and annotation may provide ways to understand genomes. Thus, determination of protein function is one of the challenging problems of the post-genome era. This demands bioinformatics to predict functions of un-annotated protein sequences by developing efficient tools. Here, we discuss some of the recent and popular approaches developed in Bioinformatics to predict functions for hypothetical proteins.

  19. Annotated ESTs from various tissues of the brown planthopper Nilaparvata lugens: A genomic resource for studying agricultural pests

    PubMed Central

    Noda, Hiroaki; Kawai, Sawako; Koizumi, Yoko; Matsui, Kageaki; Zhang, Qiang; Furukawa, Shigetoyo; Shimomura, Michihiko; Mita, Kazuei

    2008-01-01

    Background The brown planthopper (BPH), Nilaparvata lugens (Hemiptera, Delphacidae), is a serious insect pests of rice plants. Major means of BPH control are application of agricultural chemicals and cultivation of BPH resistant rice varieties. Nevertheless, BPH strains that are resistant to agricultural chemicals have developed, and BPH strains have appeared that are virulent against the resistant rice varieties. Expressed sequence tag (EST) analysis and related applications are useful to elucidate the mechanisms of resistance and virulence and to reveal physiological aspects of this non-model insect, with its poorly understood genetic background. Results More than 37,000 high-quality ESTs, excluding sequences of mitochondrial genome, microbial genomes, and rDNA, have been produced from 18 libraries of various BPH tissues and stages. About 10,200 clusters have been made from whole EST sequences, with average EST size of 627 bp. Among the top ten most abundantly expressed genes, three are unique and show no homology in BLAST searches. The actin gene was highly expressed in BPH, especially in the thorax. Tissue-specifically expressed genes were extracted based on the expression frequency among the libraries. An EST database is available at our web site. Conclusion The EST library will provide useful information for transcriptional analyses, proteomic analyses, and gene functional analyses of BPH. Moreover, specific genes for hemimetabolous insects will be identified. The microarray fabricated based on the EST information will be useful for finding genes related to agricultural and biological problems related to this pest. PMID:18315884

  20. Annotation of differentially expressed genes in the somatic embryogenesis of musa and their location in the banana genome.

    PubMed

    Maldonado-Borges, Josefina Ines; Ku-Cauich, José Roberto; Escobedo-Graciamedrano, Rosa Maria

    2013-01-01

    Analysis of cDNA-AFLP was used to study the genes expressed in zygotic and somatic embryogenesis of Musa acuminata Colla ssp. malaccensis, and a comparison was made between their differential transcribed fragments (TDFs) and the sequenced genome of the double haploid- (DH-) Pahang of the malaccensis subspecies that is available in the network. A total of 253 transcript-derived fragments (TDFs) were detected with apparent size of 100-4000 bp using 5 pairs of AFLP primers, of which 21 were differentially expressed during the different stages of banana embryogenesis; 15 of the sequences have matched DH-Pahang chromosomes, with 7 of them being homologous to gene sequences encoding either known or putative protein domains of higher plants. Four TDF sequences were located in all Musa chromosomes, while the rest were located in one or two chromosomes. Their putative individual function is briefly reviewed based on published information, and the potential roles of these genes in embryo development are discussed. Thus the availability of the genome of Musa and the information of TDFs sequences presented here opens new possibilities for an in-depth study of the molecular and biochemical research of zygotic and somatic embryogenesis of Musa.

  1. Genome-wide prediction and annotation of Burkholderia pseudomallei AraC/XylS family transcription regulator.

    PubMed

    Lim, Boon-San; Chong, Chan-Eng; Zamrod, Zulkeflie; Nathan, Sheila; Mohamed, Rahmah

    2007-01-01

    Many members of the AraC/XylS family transcription regulator have been proven to play a critical role in regulating bacterial virulence factors in response to environmental stress. By using the Hidden Markov Model (HMM) profile built from the alignment of a 99 amino acid conserved domain sequence of 273 AraC/XylS family transcription regulators, we detected a total of 45 AraC/XylS family transcription regulators in the genome of the Gram-negative pathogen, Burkholderia pseudomallei. Further in silico analysis of each detected AraC/XylS family transcription regulatory protein and its neighboring genes allowed us to make a first-order guess on the role of some of these transcription regulators in regulating important virulence factors such as those involved in three type III secretion systems and biosynthesis of pyochelin, exopolysaccharide (EPS) and phospholipase C. This paper has demonstrated an efficient and systematic genome-wide scale prediction of the AraC/XylS family that can be applied to other protein families. PMID:18391231

  2. Gene3D: a domain-based resource for comparative genomics, functional annotation and protein network analysis.

    PubMed

    Lees, Jonathan; Yeats, Corin; Perkins, James; Sillitoe, Ian; Rentzsch, Robert; Dessailly, Benoit H; Orengo, Christine

    2012-01-01

    Gene3D http://gene3d.biochem.ucl.ac.uk is a comprehensive database of protein domain assignments for sequences from the major sequence databases. Domains are directly mapped from structures in the CATH database or predicted using a library of representative profile HMMs derived from CATH superfamilies. As previously described, Gene3D integrates many other protein family and function databases. These facilitate complex associations of molecular function, structure and evolution. Gene3D now includes a domain functional family (FunFam) level below the homologous superfamily level assignments. Additions have also been made to the interaction data. More significantly, to help with the visualization and interpretation of multi-genome scale data sets, we have developed a new, revamped website. Searching has been simplified with more sophisticated filtering of results, along with new tools based on Cytoscape Web, for visualizing protein-protein interaction networks, differences in domain composition between genomes and the taxonomic distribution of individual superfamilies.

  3. Functional genomics tools applied to plant metabolism: a survey on plant respiration, its connections and the annotation of complex gene functions

    PubMed Central

    Araújo, Wagner L.; Nunes-Nesi, Adriano; Williams, Thomas C. R.

    2012-01-01

    The application of post-genomic techniques in plant respiration studies has greatly improved our ability to assign functions to gene products. In addition it has also revealed previously unappreciated interactions between distal elements of metabolism. Such results have reinforced the need to consider plant respiratory metabolism as part of a complex network and making sense of such interactions will ultimately require the construction of predictive and mechanistic models. Transcriptomics, proteomics, metabolomics, and the quantification of metabolic flux will be of great value in creating such models both by facilitating the annotation of complex gene function, determining their structure and by furnishing the quantitative data required to test them. In this review, we highlight how these experimental approaches have contributed to our current understanding of plant respiratory metabolism and its interplay with associated process (e.g., photosynthesis, photorespiration, and nitrogen metabolism). We also discuss how data from these techniques may be integrated, with the ultimate aim of identifying mechanisms that control and regulate plant respiration and discovering novel gene functions with potential biotechnological implications. PMID:22973288

  4. Rapid High Resolution Genotyping of Francisella tularensis by Whole Genome Sequence Comparison of Annotated Genes (“MLST+”)

    PubMed Central

    Mellmann, Alexander; Höppner, Sebastian; Splettstoesser, Wolf D.; Harmsen, Dag

    2015-01-01

    The zoonotic disease tularemia is caused by the bacterium Francisella tularensis. This pathogen is considered as a category A select agent with potential to be misused in bioterrorism. Molecular typing based on DNA-sequence like canSNP-typing or MLVA has become the accepted standard for this organism. Due to the organism’s highly clonal nature, the current typing methods have reached their limit of discrimination for classifying closely related subpopulations within the subspecies F. tularensis ssp. holarctica. We introduce a new gene-by-gene approach, MLST+, based on whole genome data of 15 sequenced F. tularensis ssp. holarctica strains and apply this approach to investigate an epidemic of lethal tularemia among non-human primates in two animal facilities in Germany. Due to the high resolution of MLST+ we are able to demonstrate that three independent clones of this highly infectious pathogen were responsible for these spatially and temporally restricted outbreaks. PMID:25856198

  5. Functional Annotation of Rheumatoid Arthritis and Osteoarthritis Associated Genes by Integrative Genome-Wide Gene Expression Profiling Analysis

    PubMed Central

    Li, Zhan-Chun; Xiao, Jie; Peng, Jin-Liang; Chen, Jian-Wei; Ma, Tao; Cheng, Guang-Qi; Dong, Yu-Qi; Wang, Wei-li; Liu, Zu-De

    2014-01-01

    Background Rheumatoid arthritis (RA) and osteoarthritis (OA) are two major types of joint diseases that share multiple common symptoms. However, their pathological mechanism remains largely unknown. The aim of our study is to identify RA and OA related-genes and gain an insight into the underlying genetic basis of these diseases. Methods We collected 11 whole genome-wide expression profiling datasets from RA and OA cohorts and performed a meta-analysis to comprehensively investigate their expression signatures. This method can avoid some pitfalls of single dataset analyses. Results and Conclusion We found that several biological pathways (i.e., the immunity, inflammation and apoptosis related pathways) are commonly involved in the development of both RA and OA. Whereas several other pathways (i.e., vasopressin-related pathway, regulation of autophagy, endocytosis, calcium transport and endoplasmic reticulum stress related pathways) present significant difference between RA and OA. This study provides novel insights into the molecular mechanisms underlying this disease, thereby aiding the diagnosis and treatment of the disease. PMID:24551036

  6. Genome-Wide Annotation and Comparative Analysis of Cytochrome P450 Monooxygenases in Basidiomycete Biotrophic Plant Pathogens

    PubMed Central

    Sun, Yuxin; Letsimo, Elizabeth Mpholoseng; Parvez, Mohammad; Yu, Jae-Hyuk; Mashele, Samson Sitheni; Syed, Khajamohiddin

    2015-01-01

    Fungi are an exceptional source of diverse and novel cytochrome P450 monooxygenases (P450s), heme-thiolate proteins, with catalytic versatility. Agaricomycotina saprophytes have yielded most of the available information on basidiomycete P450s. This resulted in observing similar P450 family types in basidiomycetes with few differences in P450 families among Agaricomycotina saprophytes. The present study demonstrated the presence of unique P450 family patterns in basidiomycete biotrophic plant pathogens that could possibly have originated from the adaptation of these species to different ecological niches (host influence). Systematic analysis of P450s in basidiomycete biotrophic plant pathogens belonging to three different orders, Agaricomycotina (Armillaria mellea), Pucciniomycotina (Melampsora laricis-populina, M. lini, Mixia osmundae and Puccinia graminis) and Ustilaginomycotina (Ustilago maydis, Sporisorium reilianum and Tilletiaria anomala), revealed the presence of numerous putative P450s ranging from 267 (A. mellea) to 14 (M. osmundae). Analysis of P450 families revealed the presence of 41 new P450 families and 27 new P450 subfamilies in these biotrophic plant pathogens. Order-level comparison of P450 families between biotrophic plant pathogens revealed the presence of unique P450 family patterns in these organisms, possibly reflecting the characteristics of their order. Further comparison of P450 families with basidiomycete non-pathogens confirmed that biotrophic plant pathogens harbour the unique P450 families in their genomes. The CYP63, CYP5037, CYP5136, CYP5137 and CYP5341 P450 families were expanded in A. mellea when compared to other Agaricomycotina saprophytes and the CYP5221 and CYP5233 P450 families in P. graminis and M. laricis-populina. The present study revealed that expansion of these P450 families is due to paralogous evolution of member P450s. The presence of unique P450 families in these organisms serves as evidence of how a host

  7. Annotated Chemical Patent Corpus: A Gold Standard for Text Mining

    PubMed Central

    Akhondi, Saber A.; Klenner, Alexander G.; Tyrchan, Christian; Manchala, Anil K.; Boppana, Kiran; Lowe, Daniel; Zimmermann, Marc; Jagarlapudi, Sarma A. R. P.; Sayle, Roger; Kors, Jan A.; Muresan, Sorel

    2014-01-01

    Exploring the chemical and biological space covered by patent applications is crucial in early-stage medicinal chemistry activities. Patent analysis can provide understanding of compound prior art, novelty checking, validation of biological assays, and identification of new starting points for chemical exploration. Extracting chemical and biological entities from patents through manual extraction by expert curators can take substantial amount of time and resources. Text mining methods can help to ease this process. To validate the performance of such methods, a manually annotated patent corpus is essential. In this study we have produced a large gold standard chemical patent corpus. We developed annotation guidelines and selected 200 full patents from the World Intellectual Property Organization, United States Patent and Trademark Office, and European Patent Office. The patents were pre-annotated automatically and made available to four independent annotator groups each consisting of two to ten annotators. The annotators marked chemicals in different subclasses, diseases, targets, and modes of action. Spelling mistakes and spurious line break due to optical character recognition errors were also annotated. A subset of 47 patents was annotated by at least three annotator groups, from which harmonized annotations and inter-annotator agreement scores were derived. One group annotated the full set. The patent corpus includes 400,125 annotations for the full set and 36,537 annotations for the harmonized set. All patents and annotated entities are publicly available at www.biosemantics.org. PMID:25268232

  8. Annotated chemical patent corpus: a gold standard for text mining.

    PubMed

    Akhondi, Saber A; Klenner, Alexander G; Tyrchan, Christian; Manchala, Anil K; Boppana, Kiran; Lowe, Daniel; Zimmermann, Marc; Jagarlapudi, Sarma A R P; Sayle, Roger; Kors, Jan A; Muresan, Sorel

    2014-01-01

    Exploring the chemical and biological space covered by patent applications is crucial in early-stage medicinal chemistry activities. Patent analysis can provide understanding of compound prior art, novelty checking, validation of biological assays, and identification of new starting points for chemical exploration. Extracting chemical and biological entities from patents through manual extraction by expert curators can take substantial amount of time and resources. Text mining methods can help to ease this process. To validate the performance of such methods, a manually annotated patent corpus is essential. In this study we have produced a large gold standard chemical patent corpus. We developed annotation guidelines and selected 200 full patents from the World Intellectual Property Organization, United States Patent and Trademark Office, and European Patent Office. The patents were pre-annotated automatically and made available to four independent annotator groups each consisting of two to ten annotators. The annotators marked chemicals in different subclasses, diseases, targets, and modes of action. Spelling mistakes and spurious line break due to optical character recognition errors were also annotated. A subset of 47 patents was annotated by at least three annotator groups, from which harmonized annotations and inter-annotator agreement scores were derived. One group annotated the full set. The patent corpus includes 400,125 annotations for the full set and 36,537 annotations for the harmonized set. All patents and annotated entities are publicly available at www.biosemantics.org.

  9. Annotated chemical patent corpus: a gold standard for text mining.

    PubMed

    Akhondi, Saber A; Klenner, Alexander G; Tyrchan, Christian; Manchala, Anil K; Boppana, Kiran; Lowe, Daniel; Zimmermann, Marc; Jagarlapudi, Sarma A R P; Sayle, Roger; Kors, Jan A; Muresan, Sorel

    2014-01-01

    Exploring the chemical and biological space covered by patent applications is crucial in early-stage medicinal chemistry activities. Patent analysis can provide understanding of compound prior art, novelty checking, validation of biological assays, and identification of new starting points for chemical exploration. Extracting chemical and biological entities from patents through manual extraction by expert curators can take substantial amount of time and resources. Text mining methods can help to ease this process. To validate the performance of such methods, a manually annotated patent corpus is essential. In this study we have produced a large gold standard chemical patent corpus. We developed annotation guidelines and selected 200 full patents from the World Intellectual Property Organization, United States Patent and Trademark Office, and European Patent Office. The patents were pre-annotated automatically and made available to four independent annotator groups each consisting of two to ten annotators. The annotators marked chemicals in different subclasses, diseases, targets, and modes of action. Spelling mistakes and spurious line break due to optical character recognition errors were also annotated. A subset of 47 patents was annotated by at least three annotator groups, from which harmonized annotations and inter-annotator agreement scores were derived. One group annotated the full set. The patent corpus includes 400,125 annotations for the full set and 36,537 annotations for the harmonized set. All patents and annotated entities are publicly available at www.biosemantics.org. PMID:25268232

  10. Ranking biomedical annotations with annotator's semantic relevancy.

    PubMed

    Wu, Aihua

    2014-01-01

    Biomedical annotation is a common and affective artifact for researchers to discuss, show opinion, and share discoveries. It becomes increasing popular in many online research communities, and implies much useful information. Ranking biomedical annotations is a critical problem for data user to efficiently get information. As the annotator's knowledge about the annotated entity normally determines quality of the annotations, we evaluate the knowledge, that is, semantic relationship between them, in two ways. The first is extracting relational information from credible websites by mining association rules between an annotator and a biomedical entity. The second way is frequent pattern mining from historical annotations, which reveals common features of biomedical entities that an annotator can annotate with high quality. We propose a weighted and concept-extended RDF model to represent an annotator, a biomedical entity, and their background attributes and merge information from the two ways as the context of an annotator. Based on that, we present a method to rank the annotations by evaluating their correctness according to user's vote and the semantic relevancy between the annotator and the annotated entity. The experimental results show that the approach is applicable and efficient even when data set is large. PMID:24899918

  11. Ranking Biomedical Annotations with Annotator's Semantic Relevancy

    PubMed Central

    2014-01-01

    Biomedical annotation is a common and affective artifact for researchers to discuss, show opinion, and share discoveries. It becomes increasing popular in many online research communities, and implies much useful information. Ranking biomedical annotations is a critical problem for data user to efficiently get information. As the annotator's knowledge about the annotated entity normally determines quality of the annotations, we evaluate the knowledge, that is, semantic relationship between them, in two ways. The first is extracting relational information from credible websites by mining association rules between an annotator and a biomedical entity. The second way is frequent pattern mining from historical annotations, which reveals common features of biomedical entities that an annotator can annotate with high quality. We propose a weighted and concept-extended RDF model to represent an annotator, a biomedical entity, and their background attributes and merge information from the two ways as the context of an annotator. Based on that, we present a method to rank the annotations by evaluating their correctness according to user's vote and the semantic relevancy between the annotator and the annotated entity. The experimental results show that the approach is applicable and efficient even when data set is large. PMID:24899918

  12. Dragon Plant Biology Explorer. A text-mining tool for integrating associations between genetic and biochemical entities with genome annotation and biochemical terms lists.

    PubMed

    Bajic, Vladimir B; Veronika, Merlin; Veladandi, Pardha Sarathi; Meka, Archana; Heng, Mok-Wei; Rajaraman, Kanagasabai; Pan, Hong; Swarup, Sanjay

    2005-08-01

    We introduce a tool for text mining, Dragon Plant Biology Explorer (DPBE) that integrates information on Arabidopsis (Arabidopsis thaliana) genes with their functions, based on gene ontologies and biochemical entity vocabularies, and presents the associations as interactive networks. The associations are based on (1) user-provided PubMed abstracts; (2) a list of Arabidopsis genes compiled by The Arabidopsis Information Resource; (3) user-defined combinations of four vocabulary lists based on the ones developed by the general, plant, and Arabidopsis GO consortia; and (4) three lists developed here based on metabolic pathways, enzymes, and metabolites derived from AraCyc, BRENDA, and other metabolism databases. We demonstrate how various combinations can be applied to fields of (1) gene function and gene interaction analyses, (2) plant development, (3) biochemistry and metabolism, and (4) pharmacology of bioactive compounds. Furthermore, we show the suitability of DPBE for systems approaches by integration with "omics" platform outputs. Using a list of abiotic stress-related genes identified by microarray experiments, we show how this tool can be used to rapidly build an information base on the previously reported relationships. This tool complements the existing biological resources for systems biology by identifying potentially novel associations using text analysis between cellular entities based on genome annotation terms. Thus, it allows researchers to efficiently summarize existing information for a group of genes or pathways, so as to make better informed choices for designing validation experiments. Last, DPBE can be helpful for beginning researchers and graduate students to summarize vast information in an unfamiliar area. DPBE is freely available for academic and nonprofit users at http://research.i2r.a-star.edu.sg/DRAGON/ME2/.

  13. Facilitating functional annotation of chicken microarray data

    PubMed Central

    2009-01-01

    Background Modeling results from chicken microarray studies is challenging for researchers due to little functional annotation associated with these arrays. The Affymetrix GenChip chicken genome array, one of the biggest arrays that serve as a key research tool for the study of chicken functional genomics, is among the few arrays that link gene products to Gene Ontology (GO). However the GO annotation data presented by Affymetrix is incomplete, for example, they do not show references linked to manually annotated functions. In addition, there is no tool that facilitates microarray researchers to directly retrieve functional annotations for their datasets from the annotated arrays. This costs researchers amount of time in searching multiple GO databases for functional information. Results We have improved the breadth of functional annotations of the gene products associated with probesets on the Affymetrix chicken genome array by 45% and the quality of annotation by 14%. We have also identified the most significant diseases and disorders, different types of genes, and known drug targets represented on Affymetrix chicken genome array. To facilitate functional annotation of other arrays and microarray experimental datasets we developed an Array GO Mapper (AGOM) tool to help researchers to quickly retrieve corresponding functional information for their dataset. Conclusion Results from this study will directly facilitate annotation of other chicken arrays and microarray experimental datasets. Researchers will be able to quickly model their microarray dataset into more reliable biological functional information by using AGOM tool. The disease, disorders, gene types and drug targets revealed in the study will allow researchers to learn more about how genes function in complex biological systems and may lead to new drug discovery and development of therapies. The GO annotation data generated will be available for public use via AgBase website and will be updated on regular

  14. Mitochondrial Disease Sequence Data Resource (MSeqDR): A global grass-roots consortium to facilitate deposition, curation, annotation, and integrated analysis of genomic data for the mitochondrial disease clinical and research communities

    PubMed Central

    Falk, Marni J.; Shen, Lishuang; Gonzalez, Michael; Leipzig, Jeremy; Lott, Marie T.; Stassen, Alphons P.M.; Diroma, Maria Angela; Navarro-Gomez, Daniel; Yeske, Philip; Bai, Renkui; Boles, Richard G.; Brilhante, Virginia; Ralph, David; DaRe, Jeana T.; Shelton, Robert; Terry, Sharon; Zhang, Zhe; Copeland, William C.; van Oven, Mannis; Prokisch, Holger; Wallace, Douglas C.; Attimonelli, Marcella; Krotoski, Danuta; Zuchner, Stephan; Gai, Xiaowu

    2014-01-01

    Success rates for genomic analyses of highly heterogeneous disorders can be greatly improved if a large cohort of patient data is assembled to enhance collective capabilities for accurate sequence variant annotation, analysis, and interpretation. Indeed, molecular diagnostics requires the establishment of robust data resources to enable data sharing that informs accurate understanding of genes, variants, and phenotypes. The “Mitochondrial Disease Sequence Data Resource (MSeqDR) Consortium” is a grass-roots effort facilitated by the United Mitochondrial Disease Foundation to identify and prioritize specific genomic data analysis needs of the global mitochondrial disease clinical and research community. A central Web portal (https://mseqdr.org) facilitates the coherent compilation, organization, annotation, and analysis of sequence data from both nuclear and mitochondrial genomes of individuals and families with suspected mitochondrial disease. This Web portal provides users with a flexible and expandable suite of resources to enable variant-, gene-, and exome-level sequence analysis in a secure, Web-based, and user-friendly fashion. Users can also elect to share data with other MSeqDR Consortium members, or even the general public, either by custom annotation tracks or through use of a convenient distributed annotation system (DAS) mechanism. A range of data visualization and analysis tools are provided to facilitate user interrogation and understanding of genomic, and ultimately phenotypic, data of relevance to mitochondrial biology and disease. Currently available tools for nuclear and mitochondrial gene analyses include an MSeqDR GBrowse instance that hosts optimized mitochondrial disease and mitochondrial DNA (mtDNA) specific annotation tracks, as well as an MSeqDR locus-specific database (LSDB) that curates variant data on more than 1,300 genes that have been implicated in mitochondrial disease and/or encode mitochondria-localized proteins. MSeqDR is

  15. Mitochondrial Disease Sequence Data Resource (MSeqDR): a global grass-roots consortium to facilitate deposition, curation, annotation, and integrated analysis of genomic data for the mitochondrial disease clinical and research communities.

    PubMed

    Falk, Marni J; Shen, Lishuang; Gonzalez, Michael; Leipzig, Jeremy; Lott, Marie T; Stassen, Alphons P M; Diroma, Maria Angela; Navarro-Gomez, Daniel; Yeske, Philip; Bai, Renkui; Boles, Richard G; Brilhante, Virginia; Ralph, David; DaRe, Jeana T; Shelton, Robert; Terry, Sharon F; Zhang, Zhe; Copeland, William C; van Oven, Mannis; Prokisch, Holger; Wallace, Douglas C; Attimonelli, Marcella; Krotoski, Danuta; Zuchner, Stephan; Gai, Xiaowu

    2015-03-01

    Success rates for genomic analyses of highly heterogeneous disorders can be greatly improved if a large cohort of patient data is assembled to enhance collective capabilities for accurate sequence variant annotation, analysis, and interpretation. Indeed, molecular diagnostics requires the establishment of robust data resources to enable data sharing that informs accurate understanding of genes, variants, and phenotypes. The "Mitochondrial Disease Sequence Data Resource (MSeqDR) Consortium" is a grass-roots effort facilitated by the United Mitochondrial Disease Foundation to identify and prioritize specific genomic data analysis needs of the global mitochondrial disease clinical and research community. A central Web portal (https://mseqdr.org) facilitates the coherent compilation, organization, annotation, and analysis of sequence data from both nuclear and mitochondrial genomes of individuals and families with suspected mitochondrial disease. This Web portal provides users with a flexible and expandable suite of resources to enable variant-, gene-, and exome-level sequence analysis in a secure, Web-based, and user-friendly fashion. Users can also elect to share data with other MSeqDR Consortium members, or even the general public, either by custom annotation tracks or through the use of a convenient distributed annotation system (DAS) mechanism. A range of data visualization and analysis tools are provided to facilitate user interrogation and understanding of genomic, and ultimately phenotypic, data of relevance to mitochondrial biology and disease. Currently available tools for nuclear and mitochondrial gene analyses include an MSeqDR GBrowse instance that hosts optimized mitochondrial disease and mitochondrial DNA (mtDNA) specific annotation tracks, as well as an MSeqDR locus-specific database (LSDB) that curates variant data on more than 1300 genes that have been implicated in mitochondrial disease and/or encode mitochondria-localized proteins. MSeqDR is

  16. Maize - GO annotation methods, evaluation, and review (Maize-GAMER)

    Technology Transfer Automated Retrieval System (TEKTRAN)

    Making a genome sequence accessible and useful involves three basic steps: genome assembly, structural annotation, and functional annotation. The quality of data generated at each step influences the accuracy of inferences that can be made, with high-quality analyses produce better datasets resultin...

  17. Integrative structural annotation of de novo RNA-Seq provides an accurate reference gene set of the enormous genome of the onion (Allium cepa L.).

    PubMed

    Kim, Seungill; Kim, Myung-Shin; Kim, Yong-Min; Yeom, Seon-In; Cheong, Kyeongchae; Kim, Ki-Tae; Jeon, Jongbum; Kim, Sunggil; Kim, Do-Sun; Sohn, Seong-Han; Lee, Yong-Hwan; Choi, Doil

    2015-02-01

    The onion (Allium cepa L.) is one of the most widely cultivated and consumed vegetable crops in the world. Although a considerable amount of onion transcriptome data has been deposited into public databases, the sequences of the protein-coding genes are not accurate enough to be used, owing to non-coding sequences intermixed with the coding sequences. We generated a high-quality, annotated onion transcriptome from de novo sequence assembly and intensive structural annotation using the integrated structural gene annotation pipeline (ISGAP), which identified 54,165 protein-coding genes among 165,179 assembled transcripts totalling 203.0 Mb by eliminating the intron sequences. ISGAP performed reliable annotation, recognizing accurate gene structures based on reference proteins, and ab initio gene models of the assembled transcripts. Integrative functional annotation and gene-based SNP analysis revealed a whole biological repertoire of genes and transcriptomic variation in the onion. The method developed in this study provides a powerful tool for the construction of reference gene sets for organisms based solely on de novo transcriptome data. Furthermore, the reference genes and their variation described here for the onion represent essential tools for molecular breeding and gene cloning in Allium spp.

  18. Medical target prediction from genome sequence: combining different sequence analysis algorithms with expert knowledge and input from artificial intelligence approaches.

    PubMed

    Dandekar, T; Du, F; Schirmer, R H; Schmidt, S

    2001-12-01

    By exploiting the rapid increase in available sequence data, the definition of medically relevant protein targets has been improved by a combination of: (i) differential genome analysis (target list): and (ii) analysis of individual proteins (target analysis). Fast sequence comparisons, data mining, and genetic algorithms further promote these procedures. Mycobacterium tuberculosis proteins were chosen as applied examples.

  19. Descriptive Cataloging: A Selected, Annotated Bibliography, 1984-1985.

    ERIC Educational Resources Information Center

    Cook, C. Donald; Jones, Ellen

    1986-01-01

    This annotated bibliography of materials published during 1984-1985 on descriptive cataloging covers bibliographic control, Anglo American Cataloging Rules, 2nd edition (AACR2), specific types of materials, authority control, retrospective conversion, management issues, expert systems, and manuals. (EM)

  20. Oncotator: cancer variant annotation tool.

    PubMed

    Ramos, Alex H; Lichtenstein, Lee; Gupta, Manaswi; Lawrence, Michael S; Pugh, Trevor J; Saksena, Gordon; Meyerson, Matthew; Getz, Gad

    2015-04-01

    Oncotator is a tool for annotating genomic point mutations and short nucleotide insertions/deletions (indels) with variant- and gene-centric information relevant to cancer researchers. This information is drawn from 14 different publicly available resources that have been pooled and indexed, and we provide an extensible framework to add additional data sources. Annotations linked to variants range from basic information, such as gene names and functional classification (e.g. missense), to cancer-specific data from resources such as the Catalogue of Somatic Mutations in Cancer (COSMIC), the Cancer Gene Census, and The Cancer Genome Atlas (TCGA). For local use, Oncotator is freely available as a python module hosted on Github (https://github.com/broadinstitute/oncotator). Furthermore, Oncotator is also available as a web service and web application at http://www.broadinstitute.org/oncotator/.

  1. The Integrated Microbial Genomes (IMG) System: An Expanding Comparative Analysis Resource

    SciTech Connect

    Markowitz, Victor M.; Chen, I-Min A.; Palaniappan, Krishna; Chu, Ken; Szeto, Ernest; Grechkin, Yuri; Ratner, Anna; Anderson, Iain; Lykidis, Athanasios; Mavromatis, Konstantinos; Ivanova, Natalia N.; Kyrpides, Nikos C.

    2009-09-13

    The integrated microbial genomes (IMG) system serves as a community resource for comparative analysis of publicly available genomes in a comprehensive integrated context. IMG contains both draft and complete microbial genomes integrated with other publicly available genomes from all three domains of life, together with a large number of plasmids and viruses. IMG provides tools and viewers for analyzing and reviewing the annotations of genes and genomes in a comparative context. Since its first release in 2005, IMG's data content and analytical capabilities have been constantly expanded through regular releases. Several companion IMG systems have been set up in order to serve domain specific needs, such as expert review of genome annotations. IMG is available at .

  2. Dictionary-driven protein annotation.

    PubMed

    Rigoutsos, Isidore; Huynh, Tien; Floratos, Aris; Parida, Laxmi; Platt, Daniel

    2002-09-01

    Computational methods seeking to automatically determine the properties (functional, structural, physicochemical, etc.) of a protein directly from the sequence have long been the focus of numerous research groups. With the advent of advanced sequencing methods and systems, the number of amino acid sequences that are being deposited in the public databases has been increasing steadily. This has in turn generated a renewed demand for automated approaches that can annotate individual sequences and complete genomes quickly, exhaustively and objectively. In this paper, we present one such approach that is centered around and exploits the Bio-Dictionary, a collection of amino acid patterns that completely covers the natural sequence space and can capture functional and structural signals that have been reused during evolution, within and across protein families. Our annotation approach also makes use of a weighted, position-specific scoring scheme that is unaffected by the over-representation of well-conserved proteins and protein fragments in the databases used. For a given query sequence, the method permits one to determine, in a single pass, the following: local and global similarities between the query and any protein already present in a public database; the likeness of the query to all available archaeal/ bacterial/eukaryotic/viral sequences in the database as a function of amino acid position within the query; the character of secondary structure of the query as a function of amino acid position within the query; the cytoplasmic, transmembrane or extracellular behavior of the query; the nature and position of binding domains, active sites, post-translationally modified sites, signal peptides, etc. In terms of performance, the proposed method is exhaustive, objective and allows for the rapid annotation of individual sequences and full genomes. Annotation examples are presented and discussed in Results, including individual queries and complete genomes that were

  3. Analysis and Annotation of Nucleic Acid Sequence

    SciTech Connect

    States, David J.

    2004-07-28

    The aims of this project were to develop improved methods for computational genome annotation and to apply these methods to improve the annotation of genomic sequence data with a specific focus on human genome sequencing. The project resulted in a substantial body of published work. Notable contributions of this project were the identification of basecalling and lane tracking as error processes in genome sequencing and contributions to improved methods for these steps in genome sequencing. This technology improved the accuracy and throughput of genome sequence analysis. Probabilistic methods for physical map construction were developed. Improved methods for sequence alignment, alternative splicing analysis, promoter identification and NF kappa B response gene prediction were also developed.

  4. Protein Sequence Annotation Tool (PSAT): A centralized web-based meta-server for high-throughput sequence annotations

    DOE PAGES

    Leung, Elo; Huang, Amy; Cadag, Eithon; Montana, Aldrin; Soliman, Jan Lorenz; Zhou, Carol L. Ecale

    2016-01-20

    In this study, we introduce the Protein Sequence Annotation Tool (PSAT), a web-based, sequence annotation meta-server for performing integrated, high-throughput, genome-wide sequence analyses. Our goals in building PSAT were to (1) create an extensible platform for integration of multiple sequence-based bioinformatics tools, (2) enable functional annotations and enzyme predictions over large input protein fasta data sets, and (3) provide a web interface for convenient execution of the tools. In this paper, we demonstrate the utility of PSAT by annotating the predicted peptide gene products of Herbaspirillum sp. strain RV1423, importing the results of PSAT into EC2KEGG, and using the resultingmore » functional comparisons to identify a putative catabolic pathway, thereby distinguishing RV1423 from a well annotated Herbaspirillum species. This analysis demonstrates that high-throughput enzyme predictions, provided by PSAT processing, can be used to identify metabolic potential in an otherwise poorly annotated genome. Lastly, PSAT is a meta server that combines the results from several sequence-based annotation and function prediction codes, and is available at http://psat.llnl.gov/psat/. PSAT stands apart from other sequencebased genome annotation systems in providing a high-throughput platform for rapid de novo enzyme predictions and sequence annotations over large input protein sequence data sets in FASTA. PSAT is most appropriately applied in annotation of large protein FASTA sets that may or may not be associated with a single genome.« less

  5. Comparative genomic survey, exon-intron annotation and phylogenetic analysis of NAT-homologous sequences in archaea, protists, fungi, viruses, and invertebrates

    Technology Transfer Automated Retrieval System (TEKTRAN)

    We have previously published extensive genomic surveys [1-3], reporting NAT-homologous sequences in hundreds of sequenced bacterial, fungal and vertebrate genomes. We present here the results of our latest search of 2445 genomes, representing 1532 (70 archaeal, 1210 bacterial, 43 protist, 97 fungal,...

  6. MEETING: Chlamydomonas Annotation Jamboree - October 2003

    SciTech Connect

    Grossman, Arthur R

    2007-04-13

    Shotgun sequencing of the nuclear genome of Chlamydomonas reinhardtii (Chlamydomonas throughout) was performed at an approximate 10X coverage by JGI. Roughly half of the genome is now contained on 26 scaffolds, all of which are at least 1.6 Mb, and the coverage of the genome is ~95%. There are now over 200,000 cDNA sequence reads that we have generated as part of the Chlamydomonas genome project (Grossman, 2003; Shrager et al., 2003; Grossman et al. 2007; Merchant et al., 2007); other sequences have also been generated by the Kasuza sequence group (Asamizu et al., 1999; Asamizu et al., 2000) or individual laboratories that have focused on specific genes. Shrager et al. (2003) placed the reads into distinct contigs (an assemblage of reads with overlapping nucleotide sequences), and contigs that group together as part of the same genes have been designated ACEs (assembly of contigs generated from EST information). All of the reads have also been mapped to the Chlamydomonas nuclear genome and the cDNAs and their corresponding genomic sequences have been reassembled, and the resulting assemblage is called an ACEG (an Assembly of contiguous EST sequences supported by genomic sequence) (Jain et al., 2007). Most of the unique genes or ACEGs are also represented by gene models that have been generated by the Joint Genome Institute (JGI, Walnut Creek, CA). These gene models have been placed onto the DNA scaffolds and are presented as a track on the Chlamydomonas genome browser associated with the genome portal (http://genome.jgi-psf.org/Chlre3/Chlre3.home.html). Ultimately, the meeting grant awarded by DOE has helped enormously in the development of an annotation pipeline (a set of guidelines used in the annotation of genes) and resulted in high quality annotation of over 4,000 genes; the annotators were from both Europe and the USA. Some of the people who led the annotation initiative were Arthur Grossman, Olivier Vallon, and Sabeeha Merchant (with many individual

  7. Annotations in Refseq (GSC8 Meeting)

    SciTech Connect

    Tatusova, Tatiana

    2009-09-10

    The Genomic Standards Consortium was formed in September 2005. It is an international, open-membership working body which promotes standardization in the description of genomes and the exchange and integration of genomic data. The 2009 meeting was an activity of a five-year funding "Research Coordination Network" from the National Science Foundation and was organized held at the DOE Joint Genome Institute with organizational support provided by the JGI and by the University of California - San Diego. Tatiana Tatusova of NCBI discusses "Annotations in Refseq" at the Genomic Standards Consortium's 8th meeting at the DOE JGI in Walnut Creek, Calif. on Sept. 10, 2009.

  8. Annotations in Refseq (GSC8 Meeting)

    ScienceCinema

    Tatusova, Tatiana

    2016-07-12

    The Genomic Standards Consortium was formed in September 2005. It is an international, open-membership working body which promotes standardization in the description of genomes and the exchange and integration of genomic data. The 2009 meeting was an activity of a five-year funding "Research Coordination Network" from the National Science Foundation and was organized held at the DOE Joint Genome Institute with organizational support provided by the JGI and by the University of California - San Diego. Tatiana Tatusova of NCBI discusses "Annotations in Refseq" at the Genomic Standards Consortium's 8th meeting at the DOE JGI in Walnut Creek, Calif. on Sept. 10, 2009.

  9. Functional annotation of hypothetical proteins – A review

    PubMed Central

    Sivashankari, Selvarajan; Shanmughavel, Piramanayagam

    2006-01-01

    The complete human genome sequences in the public database provide ways to understand the blue print of life. As of June 29, 2006, 27 archaeal, 326 bacterial and 21 eukaryotes is complete genomes are available and the sequencing for 316 bacterial, 24 archaeal, 126 eukaryotic genomes are in progress. The traditional biochemical/molecular experiments can assign accurate functions for genes in these genomes. However, the process is time-consuming and costly. Despite several efforts, only 50-60 % of genes have been annotated in most completely sequenced genomes. Automated genome sequence analysis and annotation may provide ways to understand genomes. Thus, determination of protein function is one of the challenging problems of the post-genome era. This demands bioinformatics to predict functions of un-annotated protein sequences by developing efficient tools. Here, we discuss some of the recent and popular approaches developed in Bioinformatics to predict functions for hypothetical proteins. PMID:17597916

  10. Expert Biogeographers

    ERIC Educational Resources Information Center

    Bednarski, Marsha

    2006-01-01

    This article describes an alternative way of teaching about biomes by having students become expert biogeographers. In order to become experts students need to first find out what a biogeographer does. Doing an online search lets students find out for themselves what the responsibilities are of people who work in this field. A good place to visit…

  11. Draft Genome Sequence and Annotation of Phyllosphere-Persisting Salmonella enterica subsp. enterica Serovar Livingstone Strain CKY-S4, Isolated from an Urban Lake in Regina, Canada.

    PubMed

    Tambalo, Dinah D; Perry, Benjamin J; Fitzgerald, Stephen F; Cameron, Andrew D S; Yost, Christopher K

    2015-01-01

    Here, we report the first draft genome sequence of Salmonella enterica subsp. enterica serovar Livingstone. This S. Livingstone strain CKY-S4 displayed biofilm formation and cellulose production and could persist on lettuce. This genome may help the study of mechanisms by which enteric pathogens colonize food crops. PMID:26272568

  12. Draft Genome Sequence and Annotation of Phyllosphere-Persisting Salmonella enterica subsp. enterica Serovar Livingstone Strain CKY-S4, Isolated from an Urban Lake in Regina, Canada

    PubMed Central

    Tambalo, Dinah D.; Perry, Benjamin J.; Fitzgerald, Stephen F.; Cameron, Andrew D. S.

    2015-01-01

    Here, we report the first draft genome sequence of Salmonella enterica subsp. enterica serovar Livingstone. This S. Livingstone strain CKY-S4 displayed biofilm formation and cellulose production and could persist on lettuce. This genome may help the study of mechanisms by which enteric pathogens colonize food crops. PMID:26272568

  13. Representing annotation compositionality and provenance for the Semantic Web

    PubMed Central

    2013-01-01

    Background Though the annotation of digital artifacts with metadata has a long history, the bulk of that work focuses on the association of single terms or concepts to single targets. As annotation efforts expand to capture more complex information, annotations will need to be able to refer to knowledge structures formally defined in terms of more atomic knowledge structures. Existing provenance efforts in the Semantic Web domain primarily focus on tracking provenance at the level of whole triples and do not provide enough detail to track how individual triple elements of annotations were derived from triple elements of other annotations. Results We present a task- and domain-independent ontological model for capturing annotations and their linkage to their denoted knowledge representations, which can be singular concepts or more complex sets of assertions. We have implemented this model as an extension of the Information Artifact Ontology in OWL and made it freely available, and we show how it can be integrated with several prominent annotation and provenance models. We present several application areas for the model, ranging from linguistic annotation of text to the annotation of disease-associations in genome sequences. Conclusions With this model, progressively more complex annotations can be composed from other annotations, and the provenance of compositional annotations can be represented at the annotation level or at the level of individual elements of the RDF triples composing the annotations. This in turn allows for progressively richer annotations to be constructed from previous annotation efforts, the precise provenance recording of which facilitates evidence-based inference and error tracking. PMID:24268021

  14. CRISPRdigger: detecting CRISPRs with better direct repeat annotations.

    PubMed

    Ge, Ruiquan; Mai, Guoqin; Wang, Pu; Zhou, Manli; Luo, Youxi; Cai, Yunpeng; Zhou, Fengfeng

    2016-01-01

    Clustered regularly interspaced short palindromic repeats (CRISPRs) are important genetic elements in many bacterial and archaeal genomes, and play a key role in prokaryote immune systems' fight against invasive foreign elements. The CRISPR system has also been engineered to facilitate target gene editing in eukaryotic genomes. Using the common features of mis-annotated CRISPRs in prokaryotic genomes, this study proposed an accurate de novo CRISPR annotation program CRISPRdigger, which can take a partially assembled genome as its input. A comprehensive comparison with the three existing programs demonstrated that CRISPRdigger can recover more Direct Repeats (DRs) for CRISPRs and achieve a higher accuracy for a query genome. The program was implemented by Perl and all the parameters had default values, so that a user could annotate CRISPRs in a query genome by supplying only a genome sequence in the FASTA format. All the supplementary data are available at http://www.healthinformaticslab.org/supp/. PMID:27596864

  15. CRISPRdigger: detecting CRISPRs with better direct repeat annotations

    PubMed Central

    Ge, Ruiquan; Mai, Guoqin; Wang, Pu; Zhou, Manli; Luo, Youxi; Cai, Yunpeng; Zhou, Fengfeng

    2016-01-01

    Clustered regularly interspaced short palindromic repeats (CRISPRs) are important genetic elements in many bacterial and archaeal genomes, and play a key role in prokaryote immune systems’ fight against invasive foreign elements. The CRISPR system has also been engineered to facilitate target gene editing in eukaryotic genomes. Using the common features of mis-annotated CRISPRs in prokaryotic genomes, this study proposed an accurate de novo CRISPR annotation program CRISPRdigger, which can take a partially assembled genome as its input. A comprehensive comparison with the three existing programs demonstrated that CRISPRdigger can recover more Direct Repeats (DRs) for CRISPRs and achieve a higher accuracy for a query genome. The program was implemented by Perl and all the parameters had default values, so that a user could annotate CRISPRs in a query genome by supplying only a genome sequence in the FASTA format. All the supplementary data are available at http://www.healthinformaticslab.org/supp/. PMID:27596864

  16. Towards a Consensus Annotation System (GSC8 Meeting)

    ScienceCinema

    White, Owen [University of Maryland

    2016-07-12

    The Genomic Standards Consortium was formed in September 2005. It is an international, open-membership working body which promotes standardization in the description of genomes and the exchange and integration of genomic data. The 2009 meeting was an activity of a five-year funding "Research Coordination Network" from the National Science Foundation and was organized held at the DOE Joint Genome Institute with organizational support provided by the JGI and by the University of California - San Diego. "Comparing Annotations: Towards Consensus Annotation" at the Genomic Standards Consortium's 8th meeting at the DOE JGI in Walnut Creek, Calif. on Sept. 10, 2009

  17. High-Quality Genome Assembly and Annotation for Plasmodium coatneyi, Generated Using Single-Molecule Real-Time PacBio Technology.

    PubMed

    Chien, Jung-Ting; Pakala, Suman B; Geraldo, Juliana A; Lapp, Stacey A; Humphrey, Jay C; Barnwell, John W; Kissinger, Jessica C; Galinski, Mary R

    2016-01-01

    Plasmodium coatneyi is a protozoan parasite species that causes simian malaria and is an excellent model for studying disease caused by the human malaria parasite, P. falciparum Here we report the complete (nontelomeric) genome sequence of P. coatneyi Hackeri generated by the application of only Pacific Biosciences RS II (PacBio RS II) single-molecule real-time (SMRT) high-resolution sequence technology and assembly using the Hierarchical Genome Assembly Process (HGAP). This is the first Plasmodium genome sequence reported to use only PacBio technology. This approach has proven to be superior to short-read only approaches for this species. PMID:27587810

  18. High-Quality Genome Assembly and Annotation for Plasmodium coatneyi, Generated Using Single-Molecule Real-Time PacBio Technology

    PubMed Central

    Chien, Jung-Ting; Pakala, Suman B.; Geraldo, Juliana A.; Lapp, Stacey A.; Humphrey, Jay C.; Barnwell, John W.; Kissinger, Jessica C.

    2016-01-01

    Plasmodium coatneyi is a protozoan parasite species that causes simian malaria and is an excellent model for studying disease caused by the human malaria parasite, P. falciparum. Here we report the complete (nontelomeric) genome sequence of P. coatneyi Hackeri generated by the application of only Pacific Biosciences RS II (PacBio RS II) single-molecule real-time (SMRT) high-resolution sequence technology and assembly using the Hierarchical Genome Assembly Process (HGAP). This is the first Plasmodium genome sequence reported to use only PacBio technology. This approach has proven to be superior to short-read only approaches for this species. PMID:27587810

  19. k-merSNP discovery: Software for alignment-and reference-free scalable SNP discovery, phylogenetics, and annotation for hundreds of microbial genomes

    SciTech Connect

    2014-11-18

    With the flood of whole genome finished and draft microbial sequences, we need faster, more scalable bioinformatics tools for sequence comparison. An algorithm is described to find single nucleotide polymorphisms (SNPs) in whole genome data. It scales to hundreds of bacterial or viral genomes, and can be used for finished and/or draft genomes available as unassembled contigs or raw, unassembled reads. The method is fast to compute, finding SNPs and building a SNP phylogeny in minutes to hours, depending on the size and diversity of the input sequences. The SNP-based trees that result are consistent with known taxonomy and trees determined in other studies. The approach we describe can handle many gigabases of sequence in a single run. The algorithm is based on k-mer analysis.

  20. k-merSNP discovery: Software for alignment-and reference-free scalable SNP discovery, phylogenetics, and annotation for hundreds of microbial genomes

    2014-11-18

    With the flood of whole genome finished and draft microbial sequences, we need faster, more scalable bioinformatics tools for sequence comparison. An algorithm is described to find single nucleotide polymorphisms (SNPs) in whole genome data. It scales to hundreds of bacterial or viral genomes, and can be used for finished and/or draft genomes available as unassembled contigs or raw, unassembled reads. The method is fast to compute, finding SNPs and building a SNP phylogeny inmore » minutes to hours, depending on the size and diversity of the input sequences. The SNP-based trees that result are consistent with known taxonomy and trees determined in other studies. The approach we describe can handle many gigabases of sequence in a single run. The algorithm is based on k-mer analysis.« less

  1. Expert Seeker

    NASA Technical Reports Server (NTRS)

    Fernandez, Becerra

    2003-01-01

    Expert Seeker is a computer program of the knowledge-management-system (KMS) type that falls within the category of expertise-locator systems. The main goal of the KMS system implemented by Expert Seeker is to organize and distribute knowledge of who are the domain experts within and without a given institution, company, or other organization. The intent in developing this KMS was to enable the re-use of organizational knowledge and provide a methodology for querying existing information (including structured, semistructured, and unstructured information) in a way that could help identify organizational experts. More specifically, Expert Seeker was developed to make it possible, by use of an intranet, to do any or all of the following: Assist an employee in identifying who has the skills needed for specific projects and to determine whether the experts so identified are available. Assist managers in identifying employees who may need training opportunities. Assist managers in determining what expertise is lost when employees retire or otherwise leave. Facilitate the development of new ways of identifying opportunities for innovation and minimization of duplicated efforts. Assist employees in achieving competitive advantages through the application of knowledge-management concepts and related systems. Assist external organizations in requesting speakers for specific engagements or determining from whom they might be able to request help via electronic mail. Help foster an environment of collaboration for rapid development in today's environment, in which it is increasingly necessary to assemble teams of experts from government, universities, research laboratories, and industries, to quickly solve problems anytime, anywhere. Make experts more visible. Provide a central repository of information about employees, including information that, heretofore, has typically not been captured by the human-resources systems (e.g., information about past projects, patents, or

  2. Galileo Reader and Annotator

    NASA Astrophysics Data System (ADS)

    Besomi, O.

    2011-06-01

    In his readings, Galileo made frequent use of annotations. Here, I will offer a general glance at them by discussing the case of the annotations to the Libra astronomica published in 1619 by Orazio Grassi, a Jesuit mathematician of the Collegio Romano. The annotations directly reflect Galileo's reaction to Grassi's book in a heated debate between the two astronomers. Galileo and Grassi had opposite ideas about the nature of the comets, which resulted in different scientific and theological implications. The annotations represent the starting point for Galileo's reply to the Libra, namely Il Saggiatore, which was published four years later and dedicated to the new pope Urban VIII.

  3. Insights into the annotated genome sequence of Methanoculleus bourgensis MS2(T), related to dominant methanogens in biogas-producing plants.

    PubMed

    Maus, Irena; Wibberg, Daniel; Stantscheff, Robbin; Stolze, Yvonne; Blom, Jochen; Eikmeyer, Felix-Gregor; Fracowiak, Jochen; König, Helmut; Pühler, Alfred; Schlüter, Andreas

    2015-05-10

    The final step of the biogas production process, the methanogenesis, is frequently dominated by members of the genus Methanoculleus. In particular, the species Methanoculleus bourgensis was identified to play a role in different biogas reactor systems. The genome of the type strain M. bourgensis MS2(T), originally isolated from a sewage sludge digestor, was completely sequenced to analyze putative adaptive genome features conferring competitiveness within biogas reactor environments to the strain. Sequencing and assembly of the M. bourgensis MS2(T) genome yielded a chromosome with a size of 2,789,773 bp. Comparative analysis of M. bourgensis MS2(T) and Methanoculleus marisnigri JR1 revealed significant similarities. The absence of genes for a putative ammonium uptake system may indicate that M. bourgensis MS2(T) is adapted to environments rich in ammonium/ammonia. Specific genes featuring predicted functions in the context of osmolyte production were detected in the genome of M. bourgensis MS2(T). Mapping of metagenome sequences derived from a production-scale biogas plant revealed that M. bourgensis MS2(T) almost completely comprises the genetic information of dominant methanogens present in the biogas reactor analyzed. Hence, availability of the M. bourgensis MS2(T) genome sequence may be valuable regarding further research addressing the performance of Methanoculleus species in agricultural biogas plants.

  4. PREPACT 2.0: Predicting C-to-U and U-to-C RNA Editing in Organelle Genome Sequences with Multiple References and Curated RNA Editing Annotation.

    PubMed

    Lenz, Henning; Knoop, Volker

    2013-01-01

    RNA editing is vast in some genetic systems, with up to thousands of targeted C-to-U and U-to-C substitutions in mitochondria and chloroplasts of certain plants. Efficient prognoses of RNA editing in organelle genomes will help to reveal overlooked cases of editing. We present PREPACT 2.0 (http://www.prepact.de) with numerous enhancements of our previously developed Plant RNA Editing Prediction & Analysis Computer Tool. Reference organelle transcriptomes for editing prediction have been extended and reorganized to include 19 curated mitochondrial and 13 chloroplast genomes, now allowing to distinguish RNA editing sites from "pre-edited" sites. Queries may be run against multiple references and a new "commons" function identifies and highlights orthologous candidate editing sites congruently predicted by multiple references. Enhancements to the BLASTX mode in PREPACT 2.0 allow querying of complete novel organelle genomes within a few minutes, identifying protein genes and candidate RNA editing sites simultaneously without prior user analyses.

  5. Genome sequence and annotation of Trichoderma parareesei, the ancestor of the cellulase producer Trichoderma reesei

    SciTech Connect

    Yang, Dongqing; Pomraning, Kyle; Kopchinskiy, Alexey; Karimi, Aghcheh Razieh; Atanasova, Lea; Chenthamara, Komal; Baker, Scott E.; Zhang, Ruifu; Shen, Qirong; Freitag, Michael; Kubicek, Christian P.; Druzhinina, Irina S.

    2015-08-13

    The filamentous fungus Trichoderma parareesei is the asexually reproducing ancestor of Trichoderma reesei, the holomorphic industrial producer of cellulase and hemicellulase. Here, we present the genome sequence of the T. parareesei type strain CBS 125925, which contains genes for 9,318 proteins.

  6. Genome Sequence and Annotation of a Human Infection Isolate of Escherichia coli O26:H11 Involved in a Raw Milk Cheese Outbreak.

    PubMed

    Galia, Wessam; Mariani-Kurkdjian, Patricia; Loukiadis, Estelle; Blanquet-Diot, Stéphanie; Leriche, Françoise; Brugère, Hubert; Shima, Ayaka; Oswald, Eric; Cournoyer, Benoit; Thevenot-Sergentet, Delphine

    2015-01-01

    The consumption of raw milk cheese can expose populations to Shiga toxin-producing Escherichia coli (STEC). We report here the genome sequence of an E. coli O26:H11 strain isolated from humans during the first raw milk cheese outbreak described in France (2005).

  7. Annotate-it: a Swiss-knife approach to annotation, analysis and interpretation of single nucleotide variation in human disease.

    PubMed

    Sifrim, Alejandro; Van Houdt, Jeroen Kj; Tranchevent, Leon-Charles; Nowakowska, Beata; Sakai, Ryo; Pavlopoulos, Georgios A; Devriendt, Koen; Vermeesch, Joris R; Moreau, Yves; Aerts, Jan

    2012-01-01

    The increasing size and complexity of exome/genome sequencing data requires new tools for clinical geneticists to discover disease-causing variants. Bottlenecks in identifying the causative variation include poor cross-sample querying, constantly changing functional annotation and not considering existing knowledge concerning the phenotype. We describe a methodology that facilitates exploration of patient sequencing data towards identification of causal variants under different genetic hypotheses. Annotate-it facilitates handling, analysis and interpretation of high-throughput single nucleotide variant data. We demonstrate our strategy using three case studies. Annotate-it is freely available and test data are accessible to all users at http://www.annotate-it.org.

  8. The GATO gene annotation tool for research laboratories.

    PubMed

    Fujita, A; Massirer, K B; Durham, A M; Ferreira, C E; Sogayar, M C

    2005-11-01

    Large-scale genome projects have generated a rapidly increasing number of DNA sequences. Therefore, development of computational methods to rapidly analyze these sequences is essential for progress in genomic research. Here we present an automatic annotation system for preliminary analysis of DNA sequences. The gene annotation tool (GATO) is a Bioinformatics pipeline designed to facilitate routine functional annotation and easy access to annotated genes. It was designed in view of the frequent need of genomic researchers to access data pertaining to a common set of genes. In the GATO system, annotation is generated by querying some of the Web-accessible resources and the information is stored in a local database, which keeps a record of all previous annotation results. GATO may be accessed from everywhere through the internet or may be run locally if a large number of sequences are going to be annotated. It is implemented in PHP and Perl and may be run on any suitable Web server. Usually, installation and application of annotation systems require experience and are time consuming, but GATO is simple and practical, allowing anyone with basic skills in informatics to access it without any special training. GATO can be downloaded at [http://mariwork.iq.usp.br/gato/]. Minimum computer free space required is 2 MB. PMID:16258624

  9. A Novel Approach to Semantic and Coreference Annotation at LLNL

    SciTech Connect

    Firpo, M

    2005-02-04

    A case is made for the importance of high quality semantic and coreference annotation. The challenges of providing such annotation are described. Asperger's Syndrome is introduced, and the connections are drawn between the needs of text annotation and the abilities of persons with Asperger's Syndrome to meet those needs. Finally, a pilot program is recommended wherein semantic annotation is performed by people with Asperger's Syndrome. The primary points embodied in this paper are as follows: (1) Document annotation is essential to the Natural Language Processing (NLP) projects at Lawrence Livermore National Laboratory (LLNL); (2) LLNL does not currently have a system in place to meet its need for text annotation; (3) Text annotation is challenging for a variety of reasons, many related to its very rote nature; (4) Persons with Asperger's Syndrome are particularly skilled at rote verbal tasks, and behavioral experts agree that they would excel at text annotation; and (6) A pilot study is recommend in which two to three people with Asperger's Syndrome annotate documents and then the quality and throughput of their work is evaluated relative to that of their neuro-typical peers.

  10. Effective function annotation through catalytic residue conservation.

    PubMed

    George, Richard A; Spriggs, Ruth V; Bartlett, Gail J; Gutteridge, Alex; MacArthur, Malcolm W; Porter, Craig T; Al-Lazikani, Bissan; Thornton, Janet M; Swindells, Mark B

    2005-08-30

    Because of the extreme impact of genome sequencing projects, protein sequences without accompanying experimental data now dominate public databases. Homology searches, by providing an opportunity to transfer functional information between related proteins, have become the de facto way to address this. Although a single, well annotated, close relationship will often facilitate sufficient annotation, this situation is not always the case, particularly if mutations are present in important functional residues. When only distant relationships are available, the transfer of function information is more tenuous, and the likelihood of encountering several well annotated proteins with different functions is increased. The consequence for a researcher is a range of candidate functions with little way of knowing which, if any, are correct. Here, we address the problem directly by introducing a computational approach to accurately identify and segregate related proteins into those with a functional similarity and those where function differs. This approach should find a wide range of applications, including the interpretation of genomics/proteomics data and the prioritization of targets for high-throughput structure determination. The method is generic, but here we concentrate on enzymes and apply high-quality catalytic site data. In addition to providing a series of comprehensive benchmarks to show the overall performance of our approach, we illustrate its utility with specific examples that include the correct identification of haptoglobin as a nonenzymatic relative of trypsin, discrimination of acid-d-amino acid ligases from a much larger ligase pool, and the successful annotation of BioH, a structural genomics target.

  11. ORegAnno: an open-access community-driven resource for regulatory annotation.

    PubMed

    Griffith, Obi L; Montgomery, Stephen B; Bernier, Bridget; Chu, Bryan; Kasaian, Katayoon; Aerts, Stein; Mahony, Shaun; Sleumer, Monica C; Bilenky, Mikhail; Haeussler, Maximilian; Griffith, Malachi; Gallo, Steven M; Giardine, Belinda; Hooghe, Bart; Van Loo, Peter; Blanco, Enrique; Ticoll, Amy; Lithwick, Stuart; Portales-Casamar, Elodie; Donaldson, Ian J; Robertson, Gordon; Wadelius, Claes; De Bleser, Pieter; Vlieghe, Dominique; Halfon, Marc S; Wasserman, Wyeth; Hardison, Ross; Bergman, Casey M; Jones, Steven J M

    2008-01-01

    ORegAnno is an open-source, open-access database and literature curation system for community-based annotation of experimentally identified DNA regulatory regions, transcription factor binding sites and regulatory variants. The current release comprises 30 145 records curated from 922 publications and describing regulatory sequences for over 3853 genes and 465 transcription factors from 19 species. A new feature called the 'publication queue' allows users to input relevant papers from scientific literature as targets for annotation. The queue contains 4438 gene regulation papers entered by experts and another 54 351 identified by text-mining methods. Users can enter or 'check out' papers from the queue for manual curation using a series of user-friendly annotation pages. A typical record entry consists of species, sequence type, sequence, target gene, binding factor, experimental outcome and one or more lines of experimental evidence. An evidence ontology was developed to describe and categorize these experiments. Records are cross-referenced to Ensembl or Entrez gene identifiers, PubMed and dbSNP and can be visualized in the Ensembl or UCSC genome browsers. All data are freely available through search pages, XML data dumps or web services at: http://www.oreganno.org.

  12. ORegAnno: an open-access community-driven resource for regulatory annotation

    PubMed Central

    Griffith, Obi L.; Montgomery, Stephen B.; Bernier, Bridget; Chu, Bryan; Kasaian, Katayoon; Aerts, Stein; Mahony, Shaun; Sleumer, Monica C.; Bilenky, Mikhail; Haeussler, Maximilian; Griffith, Malachi; Gallo, Steven M.; Giardine, Belinda; Hooghe, Bart; Van Loo, Peter; Blanco, Enrique; Ticoll, Amy; Lithwick, Stuart; Portales-Casamar, Elodie; Donaldson, Ian J.; Robertson, Gordon; Wadelius, Claes; De Bleser, Pieter; Vlieghe, Dominique; Halfon, Marc S.; Wasserman, Wyeth; Hardison, Ross; Bergman, Casey M.; Jones, Steven J.M.

    2008-01-01

    ORegAnno is an open-source, open-access database and literature curation system for community-based annotation of experimentally identified DNA regulatory regions, transcription factor binding sites and regulatory variants. The current release comprises 30 145 records curated from 922 publications and describing regulatory sequences for over 3853 genes and 465 transcription factors from 19 species. A new feature called the ‘publication queue’ allows users to input relevant papers from scientific literature as targets for annotation. The queue contains 4438 gene regulation papers entered by experts and another 54 351 identified by text-mining methods. Users can enter or ‘check out’ papers from the queue for manual curation using a series of user-friendly annotation pages. A typical record entry consists of species, sequence type, sequence, target gene, binding factor, experimental outcome and one or more lines of experimental evidence. An evidence ontology was developed to describe and categorize these experiments. Records are cross-referenced to Ensembl or Entrez gene identifiers, PubMed and dbSNP and can be visualized in the Ensembl or UCSC genome browsers. All data are freely available through search pages, XML data dumps or web services at: http://www.oreganno.org. PMID:18006570

  13. Expert Systems: What Is an Expert System?

    ERIC Educational Resources Information Center

    Duval, Beverly K.; Main, Linda

    1994-01-01

    Describes expert systems and discusses their use in libraries. Highlights include parts of an expert system; expert system shells; an example of how to build an expert system; a bibliography of 34 sources of information on expert systems in libraries; and a list of 10 expert system shells used in libraries. (Contains five references.) (LRW)

  14. Biosynthesis of Akaeolide and Lorneic Acids and Annotation of Type I Polyketide Synthase Gene Clusters in the Genome of Streptomyces sp. NPS554

    PubMed Central

    Zhou, Tao; Komaki, Hisayuki; Ichikawa, Natsuko; Hosoyama, Akira; Sato, Seizo; Igarashi, Yasuhiro

    2015-01-01

    The incorporation pattern of biosynthetic precursors into two structurally unique polyketides, akaeolide and lorneic acid A, was elucidated by feeding experiments with 13C-labeled precursors. In addition, the draft genome sequence of the producer, Streptomyces sp. NPS554, was performed and the biosynthetic gene clusters for these polyketides were identified. The putative gene clusters contain all the polyketide synthase (PKS) domains necessary for assembly of the carbon skeletons. Combined with the 13C-labeling results, gene function prediction enabled us to propose biosynthetic pathways involving unusual carbon-carbon bond formation reactions. Genome analysis also indicated the presence of at least ten orphan type I PKS gene clusters that might be responsible for the production of new polyketides. PMID:25603349

  15. Annotated genome sequence of Mycobacterium massiliense strain M154, belonging to the recently created taxon Mycobacterium abscessus subsp. bolletii comb. nov.

    PubMed

    Choo, Siew Woh; Wong, Yan Ling; Tan, Joon Liang; Ong, Chia Sui; Wong, Guat Jah; Ng, Kee Peng; Ngeow, Yun Fong

    2012-09-01

    Mycobacterium massiliense has recently been proposed as a member of Mycobacterium abscessus subsp. bolletii comb. nov. Strain M154, a clinical isolate from the bronchoalveolar lavage fluid of a Malaysian patient presenting with lower respiratory tract infection, was subjected to shotgun DNA sequencing with the Illumina sequencing technology to obtain whole-genome sequence data for comparison with other genetically related strains within the M. abscessus species complex. PMID:22887675

  16. PREPACT 2.0: Predicting C-to-U and U-to-C RNA Editing in Organelle Genome Sequences with Multiple References and Curated RNA Editing Annotation

    PubMed Central

    Lenz, Henning; Knoop, Volker

    2013-01-01

    RNA editing is vast in some genetic systems, with up to thousands of targeted C-to-U and U-to-C substitutions in mitochondria and chloroplasts of certain plants. Efficient prognoses of RNA editing in organelle genomes will help to reveal overlooked cases of editing. We present PREPACT 2.0 (http://www.prepact.de) with numerous enhancements of our previously developed Plant RNA Editing Prediction & Analysis Computer Tool. Reference organelle transcriptomes for editing prediction have been extended and reorganized to include 19 curated mitochondrial and 13 chloroplast genomes, now allowing to distinguish RNA editing sites from “pre-edited” sites. Queries may be run against multiple references and a new “commons” function identifies and highlights orthologous candidate editing sites congruently predicted by multiple references. Enhancements to the BLASTX mode in PREPACT 2.0 allow querying of complete novel organelle genomes within a few minutes, identifying protein genes and candidate RNA editing sites simultaneously without prior user analyses. PMID:23362369

  17. The annotated complete DNA sequence of Enterococcus faecalis bacteriophage φEf11 and its comparison with all available phage and predicted prophage genomes.

    PubMed

    Stevens, Roy H; Ektefaie, Mahmoud R; Fouts, Derrick E

    2011-04-01

    φEf11 is a temperate Siphoviridae bacteriophage isolated by induction from a lysogenic Enterococcus faecalis strain. The φEf11 DNA was completely sequenced and found to be 42,822 bp in length, with a G+C mol% of 34.4%. Genome analysis revealed 65 ORFs, accounting for 92.8% of the DNA content. All except for seven of the ORFs displayed sequence similarities to previously characterized proteins. The genes were arranged in functional modules, organized similar to that of several other phages of low GC Gram-positive bacteria; however, the number and arrangement of lysis-related genes were atypical of these bacteriophages. A 159 bp noncoding region between predicted cI and cro genes is highly similar to the functionally characterized early promoter region of lactococcal temperate phage TP901-1, and possessed a predicted stem-loop structure in between predicted P(L) and P(R) promoters, suggesting a novel mechanism of repression of these two bacteriophages from the λ paradigm. Comparison with all available phage and predicted prophage genomes revealed that the φEf11 genome displays unique features, suggesting that φEf11 may be a novel member of a larger family of temperate prophages that also includes lactococcal phages. Trees based on the blast score ratio grouped this family by tail fiber similarity, suggesting that these trees are useful for identifying phages with similar tail fibers. PMID:21204936

  18. Annotation extension through protein family annotation coherence metrics

    PubMed Central

    Bastos, Hugo P.; Clarke, Luka A.; Couto, Francisco M.

    2013-01-01

    Protein functional annotation consists in associating proteins with textual descriptors elucidating their biological roles. The bulk of annotation is done via automated procedures that ultimately rely on annotation transfer. Despite a large number of existing protein annotation procedures the ever growing protein space is never completely annotated. One of the facets of annotation incompleteness derives from annotation uncertainty. Often when protein function cannot be predicted with enough specificity it is instead conservatively annotated with more generic terms. In a scenario of protein families or functionally related (or even dissimilar) sets this leads to a more difficult task of using annotations to compare the extent of functional relatedness among all family or set members. However, we postulate that identifying sub-sets of functionally coherent proteins annotated at a very specific level, can help the annotation extension of other incompletely annotated proteins within the same family or functionally related set. As an example we analyse the status of annotation of a set of CAZy families belonging to the Polysaccharide Lyase class. We show that through the use of visualization methods and semantic similarity based metrics it is possible to identify families and respective annotation terms within them that are suitable for possible annotation extension. Based on our analysis we then propose a semi-automatic methodology leading to the extension of single annotation terms within these partially annotated protein sets or families. PMID:24130572

  19. WEGO: a web tool for plotting GO annotations.

    PubMed

    Ye, Jia; Fang, Lin; Zheng, Hongkun; Zhang, Yong; Chen, Jie; Zhang, Zengjin; Wang, Jing; Li, Shengting; Li, Ruiqiang; Bolund, Lars; Wang, Jun

    2006-07-01

    Unified, structured vocabularies and classifications freely provided by the Gene Ontology (GO) Consortium are widely accepted in most of the large scale gene annotation projects. Consequently, many tools have been created for use with the GO ontologies. WEGO (Web Gene Ontology Annotation Plot) is a simple but useful tool for visualizing, comparing and plotting GO annotation results. Different from other commercial software for creating chart, WEGO is designed to deal with the directed acyclic graph structure of GO to facilitate histogram creation of GO annotation results. WEGO has been used widely in many important biological research projects, such as the rice genome project and the silkworm genome project. It has become one of the daily tools for downstream gene annotation analysis, especially when performing comparative genomics tasks. WEGO, along with the two other tools, namely External to GO Query and GO Archive Query, are freely available for all users at http://wego.genomics.org.cn. There are two available mirror sites at http://wego2.genomics.org.cn and http://wego.genomics.com.cn. Any suggestions are welcome at wego@genomics.org.cn. PMID:16845012

  20. Annotation and visualization of endogenous retroviral sequences using the Distributed Annotation System (DAS) and eBioX

    PubMed Central

    Martínez Barrio, Álvaro; Lagercrantz, Erik; Sperber, Göran O; Blomberg, Jonas; Bongcam-Rudloff, Erik

    2009-01-01

    Background The Distributed Annotation System (DAS) is a widely used network protocol for sharing biological information. The distributed aspects of the protocol enable the use of various reference and annotation servers for connecting biological sequence data to pertinent annotations in order to depict an integrated view of the data for the final user. Results An annotation server has been devised to provide information about the endogenous retroviruses detected and annotated by a specialized in silico tool called RetroTector. We describe the procedure to implement the DAS 1.5 protocol commands necessary for constructing the DAS annotation server. We use our server to exemplify those steps. Data distribution is kept separated from visualization which is carried out by eBioX, an easy to use open source program incorporating multiple bioinformatics utilities. Some well characterized endogenous retroviruses are shown in two different DAS clients. A rapid analysis of areas free from retroviral insertions could be facilitated by our annotations. Conclusion The DAS protocol has shown to be advantageous in the distribution of endogenous retrovirus data. The distributed nature of the protocol is also found to aid in combining annotation and visualization along a genome in order to enhance the understanding of ERV contribution to its evolution. Reference and annotation servers are conjointly used by eBioX to provide visualization of ERV annotations as well as other data sources. Our DAS data source can be found in the central public DAS service repository, , or at . PMID:19534743

  1. Automatic annotation of eukaryotic genes, pseudogenes and promoters

    PubMed Central

    Solovyev, Victor; Kosarev, Peter; Seledsov, Igor; Vorobyev, Denis

    2006-01-01

    Background The ENCODE gene prediction workshop (EGASP) has been organized to evaluate how well state-of-the-art automatic gene finding methods are able to reproduce the manual and experimental gene annotation of the human genome. We have used Softberry gene finding software to predict genes, pseudogenes and promoters in 44 selected ENCODE sequences representing approximately 1% (30 Mb) of the human genome. Predictions of gene finding programs were evaluated in terms of their ability to reproduce the ENCODE-HAVANA annotation. Results The Fgenesh++ gene prediction pipeline can identify 91% of coding nucleotides with a specificity of 90%. Our automatic pseudogene finder (PSF program) found 90% of the manually annotated pseudogenes and some new ones. The Fprom promoter prediction program identifies 80% of TATA promoters sequences with one false positive prediction per 2,000 base-pairs (bp) and 50% of TATA-less promoters with one false positive prediction per 650 bp. It can be used to identify transcription start sites upstream of annotated coding parts of genes found by gene prediction software. Conclusion We review our software and underlying methods for identifying these three important structural and functional genome components and discuss the accuracy of predictions, recent advances and open problems in annotating genomic sequences. We have demonstrated that our methods can be effectively used for initial automatic annotation of the eukaryotic genome. PMID:16925832

  2. Benchmark study of automatic annotation of MALDI-TOF N-glycan profiles.

    PubMed

    Brito, Alejandro E; Kletter, Doron; Singhal, Mudita; Bern, Marshall

    2015-11-01

    Human experts can annotate peaks in MALDI-TOF profiles of detached N-glycans with some degree of accuracy. Even though MALDI-TOF profiles give only intact masses without any fragmentation information, expert knowledge of the most common glycans and biosynthetic pathways in the biological system can point to a small set of most likely glycan structures at the "cartoon" level of detail. Cartoonist is a recently developed, fully automatic annotation tool for MALDI-TOF glycan profiles. Here we benchmark Cartoonist's automatic annotations against human expert annotations on human and mouse N-glycan data from the Consortium for Functional Glycomics. We find that Cartoonist and expert annotations largely agree, but the expert tends to annotate more specifically, meaning fewer suggested structures per peak, and Cartoonist more comprehensively, meaning more annotated peaks. On peaks for which both Cartoonist and the expert give unique cartoons, the two cartoons agree in over 90% of all cases. This article is part of a Special Issue entitled: Computational Proteomics.

  3. PhytoPath: an integrative resource for plant pathogen genomics

    PubMed Central

    Pedro, Helder; Maheswari, Uma; Urban, Martin; Irvine, Alistair George; Cuzick, Alayne; McDowall, Mark D.; Staines, Daniel M.; Kulesha, Eugene; Hammond-Kosack, Kim Elizabeth; Kersey, Paul Julian

    2016-01-01

    PhytoPath (www.phytopathdb.org) is a resource for genomic and phenotypic data from plant pathogen species, that integrates phenotypic data for genes from PHI-base, an expertly curated catalog of genes with experimentally verified pathogenicity, with the Ensembl tools for data visualization and analysis. The resource is focused on fungi, protists (oomycetes) and bacterial plant pathogens that have genomes that have been sequenced and annotated. Genes with associated PHI-base data can be easily identified across all plant pathogen species using a BioMart-based query tool and visualized in their genomic context on the Ensembl genome browser. The PhytoPath resource contains data for 135 genomic sequences from 87 plant pathogen species, and 1364 genes curated for their role in pathogenicity and as targets for chemical intervention. Support for community annotation of gene models is provided using the WebApollo online gene editor, and we are working with interested communities to improve reference annotation for selected species. PMID:26476449

  4. PhytoPath: an integrative resource for plant pathogen genomics.

    PubMed

    Pedro, Helder; Maheswari, Uma; Urban, Martin; Irvine, Alistair George; Cuzick, Alayne; McDowall, Mark D; Staines, Daniel M; Kulesha, Eugene; Hammond-Kosack, Kim Elizabeth; Kersey, Paul Julian

    2016-01-01

    PhytoPath (www.phytopathdb.org) is a resource for genomic and phenotypic data from plant pathogen species, that integrates phenotypic data for genes from PHI-base, an expertly curated catalog of genes with experimentally verified pathogenicity, with the Ensembl tools for data visualization and analysis. The resource is focused on fungi, protists (oomycetes) and bacterial plant pathogens that have genomes that have been sequenced and annotated. Genes with associated PHI-base data can be easily identified across all plant pathogen species using a BioMart-based query tool and visualized in their genomic context on the Ensembl genome browser. The PhytoPath resource contains data for 135 genomic sequences from 87 plant pathogen species, and 1364 genes curated for their role in pathogenicity and as targets for chemical intervention. Support for community annotation of gene models is provided using the WebApollo online gene editor, and we are working with interested communities to improve reference annotation for selected species.

  5. Expert judgment and expert systems

    SciTech Connect

    Mumpower, J.; Phillips, L.D.; Renn, O.; Uppuluri, V.R.R.

    1987-01-01

    This volume collects researchers from the fields of psychology, decision analysis, and artificial intelligence. The purposes were to assess similarities, differences, and complementarities among the three approaches to the study of expert judgment; to evaluate their relative strengths and weaknesses; and to propose profitable linkages between them. Each of the papers in the present volume is directed toward one or more of these goals.

  6. MAGPIE/EGRET Annotation of the 2.9-Mb Drosophila melanogaster Adh Region

    PubMed Central

    Gaasterland, Terry; Sczyrba, Alexander; Thomas, Elizabeth; Aytekin-Kurban, Gulriz; Gordon, Paul; Sensen, Christoph W.

    2000-01-01

    Our challenge in annotating the 2.91-Mb Adh region of the Drosophila melanogaster genome was to identify genetic and genomic features automatically, completely, and precisely within a 6-week period. To do so, we augmented the MAGPIE microbial genome annotation system to handle eukaryotic genomic sequence data. The new configuration required the integration of eukaryotic gene-finding tools and DNA repeat tools into the automatic data collection module. It also required us to define in MAGPIE new strategies to combine data about eukaryotic exon predictions with functional data to refine the exon predictions. At the heart of the resulting new eukaryotic genome annotation system is a reverse comparison of public protein and complementary DNA sequences against the input genome to identify missing exons and to refine exon boundaries. The software modules that add eukaryotic genome annotation capability to MAGPIE are available as EGRET (Eukaryotic Genome Rapid Evaluation Tool). PMID:10779489

  7. Low conservation and species-specific evolution of alternative splicing in humans and mice: comparative genomics analysis using well-annotated full-length cDNAs

    PubMed Central

    Takeda, Jun-ichi; Suzuki, Yutaka; Sakate, Ryuichi; Sato, Yoshiharu; Seki, Masahide; Irie, Takuma; Takeuchi, Nono; Ueda, Takuya; Nakao, Mitsuteru; Sugano, Sumio; Gojobori, Takashi; Imanishi, Tadashi

    2008-01-01

    Using full-length cDNA sequences, we compared alternative splicing (AS) in humans and mice. The alignment of the human and mouse genomes showed that 86% of 199 426 total exons in human AS variants were conserved in the mouse genome. Of the 20 392 total human AS variants, however, 59% consisted of all conserved exons. Comparing AS patterns between human and mouse transcripts revealed that only 431 transcripts from 189 loci were perfectly conserved AS variants. To exclude the possibility that the full-length human cDNAs used in the present study, especially those with retained introns, were cloning artefacts or prematurely spliced transcripts, we experimentally validated 34 such cases. Our results indicate that even retained-intron type transcripts are typically expressed in a highly controlled manner and interact with translating ribosomes. We found non-conserved AS exons to be predominantly outside the coding sequences (CDSs). This suggests that non-conserved exons in the CDSs of transcripts cause functional constraint. These findings should enhance our understanding of the relationship between AS and species specificity of human genes. PMID:18838389

  8. Structural and functional annotation of the porcine immunome

    PubMed Central

    2013-01-01

    Background The domestic pig is known as an excellent model for human immunology and the two species share many pathogens. Susceptibility to infectious disease is one of the major constraints on swine performance, yet the structure and function of genes comprising the pig immunome are not well-characterized. The completion of the pig genome provides the opportunity to annotate the pig immunome, and compare and contrast pig and human immune systems. Results The Immune Response Annotation Group (IRAG) used computational curation and manual annotation of the swine genome assembly 10.2 (Sscrofa10.2) to refine the currently available automated annotation of 1,369 immunity-related genes through sequence-based comparison to genes in other species. Within these genes, we annotated 3,472 transcripts. Annotation provided evidence for gene expansions in several immune response families, and identified artiodactyl-specific expansions in the cathelicidin and type 1 Interferon families. We found gene duplications for 18 genes, including 13 immune response genes and five non-immune response genes discovered in the annotation process. Manual annotation provided evidence for many new alternative splice variants and 8 gene duplications. Over 1,100 transcripts without porcine sequence evidence were detected using cross-species annotation. We used a functional approach to discover and accurately annotate porcine immune response genes. A co-expression clustering analysis of transcriptomic data from selected experimental infections or immune stimulations of blood, macrophages or lymph nodes identified a large cluster of genes that exhibited a correlated positive response upon infection across multiple pathogens or immune stimuli. Interestingly, this gene cluster (cluster 4) is enriched for known general human immune response genes, yet contains many un-annotated porcine genes. A phylogenetic analysis of the encoded proteins of cluster 4 genes showed that 15% exhibited an accelerated

  9. Chicken genomics resource: sequencing and annotation of 35,407 ESTs from single and multiple tissue cDNA libraries and CAP3 assembly of a chicken gene index.

    PubMed

    Carre, Wilfrid; Wang, Xiaofei; Porter, Tom E; Nys, Yves; Tang, Jianshan; Bernberg, Erin; Morgan, Robin; Burnside, Joan; Aggrey, Samuel E; Simon, Jean; Cogburn, Larry A

    2006-05-16

    Its accessibility, unique evolutionary position, and recently assembled genome sequence have advanced the chicken to the forefront of comparative genomics and developmental biology research as a model organism. Several chicken expressed sequence tag (EST) projects have placed the chicken in 10th place for accrued ESTs among all organisms in GenBank. We have completed the single-pass 5'-end sequencing of 37,557 chicken cDNA clones from several single and multiple tissue cDNA libraries and have entered 35,407 EST sequences into GenBank. Our chicken EST sequences and those found in public databases (on July 1, 2004) provided a total of 517,727 public chicken ESTs and mRNAs. These sequences were used in the CAP3 assembly of a chicken gene index composed of 40,850 contigs and 79,192 unassembled singlets. The CAP3 contigs show a 96.7% match to the chicken genome sequence. The University of Delaware (UD) EST collection (43,928 clones) was assembled into 19,237 nonredundant sequences (13,495 contigs and 5,742 unassembled singlets). The UD collection contains 6,223 unique sequences that are not found in other public EST collections but show a 76% match to the chicken genome sequence. Our chicken contig and singlet sequences were annotated according to the highest BlastX and/or BlastN hits. The UD CAP3 contig assemblies and singlets are searchable by nucleotide sequence or key word (http://cogburn.dbi.udel.edu), and the cDNA clones are readily available for distribution from the chick EST website and clone repository (http://www.chickest.udel.edu). The present paper describes the construction and normalization of single and multiple tissue chicken cDNA libraries, high-throughput EST sequencing from these libraries, the CAP3 assembly of a chicken gene index from all public ESTs, and the identification of several nonredundant chicken gene sets for production of custom DNA microarrays.

  10. MuPIT interactive: webserver for mapping variant positions to annotated, interactive 3D structures.

    PubMed

    Niknafs, Noushin; Kim, Dewey; Kim, Ryangguk; Diekhans, Mark; Ryan, Michael; Stenson, Peter D; Cooper, David N; Karchin, Rachel

    2013-11-01

    Mutation position imaging toolbox (MuPIT) interactive is a browser-based application for single-nucleotide variants (SNVs), which automatically maps the genomic coordinates of SNVs onto the coordinates of available three-dimensional (3D) protein structures. The application is designed for interactive browser-based visualization of the putative functional relevance of SNVs by biologists who are not necessarily experts either in bioinformatics or protein structure. Users may submit batches of several thousand SNVs and review all protein structures that cover the SNVs, including available functional annotations such as binding sites, mutagenesis experiments, and common polymorphisms. Multiple SNVs may be mapped onto each structure, enabling 3D visualization of SNV clusters and their relationship to functionally annotated positions. We illustrate the utility of MuPIT interactive in rationalizing the impact of selected polymorphisms in the PharmGKB database, somatic mutations identified in the Cancer Genome Atlas study of invasive breast carcinomas, and rare variants identified in the exome sequencing project. MuPIT interactive is freely available for non-profit use at http://mupit.icm.jhu.edu .

  11. MuPIT Interactive: Webserver for mapping variant positions to annotated, interactive 3D structures

    PubMed Central

    Niknafs, Noushin; Kim, Dewey; Kim, Ryang Guk; Diekhans, Mark; Ryan, Michael; Stenson, Peter D.; Cooper, David N.; Karchin, Rachel

    2013-01-01

    Mutation Position Imaging Toolbox (MuPIT) Interactive is a browser-based application for single nucleotide variants (SNVs), which automatically maps the genomic coordinates of SNVs onto the coordinates of available three-dimensional protein structures. The application is designed for interactive browser-based visualization of the putative functional relevance of SNVs by biologists who are not necessarily experts either in bioinformatics or protein structure. Users may submit batches of several thousand SNVs and review all protein structures that cover the SNVs, including available functional annotations such as binding sites, mutagenesis experiments, and common polymorphisms. Multiple SNVs may be mapped onto each structure, enabling 3D visualization of SNV clusters and their relationship to functionally annotated positions. We illustrate the utility of MuPIT Interactive in rationalizing the impact of selected polymorphisms in the PharmGKB database, somatic mutations identified in the Cancer Genome Atlas study of invasive breast carcinomas, and rare variants identified in the Exome Sequencing Project. MuPIT Interactive is freely available for non-profit use at http://mupit.icm.jhu.edu. PMID:23793516

  12. Genome-Wide Computational Analysis of Musa Microsatellites: Classification, Cross-Taxon Transferability, Functional Annotation, Association with Transposons & miRNAs, and Genetic Marker Potential

    PubMed Central

    Biswas, Manosh Kumar; Liu, Yuxuan; Li, Chunyu; Sheng, Ou; Mayer, Christoph; Yi, Ganjun

    2015-01-01

    The development of organized, informative, robust, user-friendly, and freely accessible molecular markers is imperative to the Musa marker assisted breeding program. Although several hundred SSR markers have already been developed, the number of informative, robust, and freely accessible Musa markers remains inadequate for some breeding applications. In view of this issue, we surveyed SSRs in four different data sets, developed large-scale non-redundant highly informative therapeutic SSR markers, and classified them according to their attributes, as well as analyzed their cross-taxon transferability and utility for the genetic study of Musa and its relatives. A high SSR frequency (177 per Mbp) was found in the Musa genome. AT-rich dinucleotide repeats are predominant, and trinucleotide repeats are the most abundant in transcribed regions. A significant number of Musa SSRs are associated with pre-miRNAs, and 83% of these SSRs are promising candidates for the development of therapeutic SSR markers. Overall, 74% of the SSR markers were polymorphic, and 94% were transferable to at least one Musa spp. Two hundred forty-three markers generated a total of 1047 alleles, with 2-8 alleles each and an average of 4.38 alleles per locus. The PIC values ranged from 0.31 to 0.89 and averaged 0.71. We report the largest set of non-redundant, polymorphic, new SSR markers to be developed in Musa. These additional markers could be a valuable resource for marker-assisted breeding, genetic diversity and genomic studies of Musa and related species. PMID:26121637

  13. Genome-Wide Computational Analysis of Musa Microsatellites: Classification, Cross-Taxon Transferability, Functional Annotation, Association with Transposons & miRNAs, and Genetic Marker Potential.

    PubMed

    Biswas, Manosh Kumar; Liu, Yuxuan; Li, Chunyu; Sheng, Ou; Mayer, Christoph; Yi, Ganjun

    2015-01-01

    The development of organized, informative, robust, user-friendly, and freely accessible molecular markers is imperative to the Musa marker assisted breeding program. Although several hundred SSR markers have already been developed, the number of informative, robust, and freely accessible Musa markers remains inadequate for some breeding applications. In view of this issue, we surveyed SSRs in four different data sets, developed large-scale non-redundant highly informative therapeutic SSR markers, and classified them according to their attributes, as well as analyzed their cross-taxon transferability and utility for the genetic study of Musa and its relatives. A high SSR frequency (177 per Mbp) was found in the Musa genome. AT-rich dinucleotide repeats are predominant, and trinucleotide repeats are the most abundant in transcribed regions. A significant number of Musa SSRs are associated with pre-miRNAs, and 83% of these SSRs are promising candidates for the development of therapeutic SSR markers. Overall, 74% of the SSR markers were polymorphic, and 94% were transferable to at least one Musa spp. Two hundred forty-three markers generated a total of 1047 alleles, with 2-8 alleles each and an average of 4.38 alleles per locus. The PIC values ranged from 0.31 to 0.89 and averaged 0.71. We report the largest set of non-redundant, polymorphic, new SSR markers to be developed in Musa. These additional markers could be a valuable resource for marker-assisted breeding, genetic diversity and genomic studies of Musa and related species.

  14. Metabolomics as a Hypothesis-Generating Functional Genomics Tool for the Annotation of Arabidopsis thaliana Genes of “Unknown Function”

    PubMed Central

    Quanbeck, Stephanie M.; Brachova, Libuse; Campbell, Alexis A.; Guan, Xin; Perera, Ann; He, Kun; Rhee, Seung Y.; Bais, Preeti; Dickerson, Julie A.; Dixon, Philip; Wohlgemuth, Gert; Fiehn, Oliver; Barkan, Lenore; Lange, Iris; Lange, B. Markus; Lee, Insuk; Cortes, Diego; Salazar, Carolina; Shuman, Joel; Shulaev, Vladimir; Huhman, David V.; Sumner, Lloyd W.; Roth, Mary R.; Welti, Ruth; Ilarslan, Hilal; Wurtele, Eve S.; Nikolau, Basil J.

    2012-01-01

    Metabolomics is the methodology that identifies and measures global pools of small molecules (of less than about 1,000 Da) of a biological sample, which are collectively called the metabolome. Metabolomics can therefore reveal the metabolic outcome of a genetic or environmental perturbation of a metabolic regulatory network, and thus provide insights into the structure and regulation of that network. Because of the chemical complexity of the metabolome and limitations associated with individual analytical platforms for determining the metabolome, it is currently difficult to capture the complete metabolome of an organism or tissue, which is in contrast to genomics and transcriptomics. This paper describes the analysis of Arabidopsis metabolomics data sets acquired by a consortium that includes five analytical laboratories, bioinformaticists, and biostatisticians, which aims to develop and validate metabolomics as a hypothesis-generating functional genomics tool. The consortium is determining the metabolomes of Arabidopsis T-DNA mutant stocks, grown in standardized controlled environment optimized to minimize environmental impacts on the metabolomes. Metabolomics data were generated with seven analytical platforms, and the combined data is being provided to the research community to formulate initial hypotheses about genes of unknown function (GUFs). A public database (www.PlantMetabolomics.org) has been developed to provide the scientific community with access to the data along with tools to allow for its interactive analysis. Exemplary datasets are discussed to validate the approach, which illustrate how initial hypotheses can be generated from the consortium-produced metabolomics data, integrated with prior knowledge to provide a testable hypothesis concerning the functionality of GUFs. PMID:22645570

  15. ACID: annotation of cassette and integron data

    PubMed Central

    Joss, Michael J; Koenig, Jeremy E; Labbate, Maurizio; Polz, Martin F; Gillings, Michael R; Stokes, Harold W; Doolittle, W Ford; Boucher, Yan

    2009-01-01

    Background Although integrons and their associated gene cassettes are present in ~10% of bacteria and can represent up to 3% of the genome in which they are found, very few have been properly identified and annotated in public databases. These genetic elements have been overlooked in comparison to other vectors that facilitate lateral gene transfer between microorganisms. Description By automating the identification of integron integrase genes and of the non-coding cassette-associated attC recombination sites, we were able to assemble a database containing all publicly available sequence information regarding these genetic elements. Specialists manually curated the database and this information was used to improve the automated detection and annotation of integrons and their encoded gene cassettes. ACID (annotation of cassette and integron data) can be searched using a range of queries and the data can be downloaded in a number of formats. Users can readily annotate their own data and integrate it into ACID using the tools provided. Conclusion ACID is a community resource providing easy access to annotations of integrons and making tools available to detect them in novel sequence data. ACID also hosts a forum to prompt integron-related discussion, which can hopefully lead to a more universal definition of this genetic element. PMID:19383137

  16. Comparative genomics - a perspective.

    PubMed

    Sivashankari, Selvarajan; Shanmughavel, Piramanayagam

    2007-03-27

    The rapidly emerging field of comparative genomics has yielded dramatic results. Comparative genome analysis has become feasible with the availability of a number of completely sequenced genomes. Comparison of complete genomes between organisms allow for global views on genome evolution and the availability of many completely sequenced genomes increases the predictive power in deciphering the hidden information in genome design, function and evolution. Thus, comparison of human genes with genes from other genomes in a genomic landscape could help assign novel functions for un-annotated genes. Here, we discuss the recently used techniques for comparative genomics and their derived inferences in genome biology.

  17. Comparative genomics - A perspective

    PubMed Central

    Sivashankari, Selvarajan; Shanmughavel, Piramanayagam

    2007-01-01

    The rapidly emerging field of comparative genomics has yielded dramatic results. Comparative genome analysis has become feasible with the availability of a number of completely sequenced genomes. Comparison of complete genomes between organisms allow for global views on genome evolution and the availability of many completely sequenced genomes increases the predictive power in deciphering the hidden information in genome design, function and evolution. Thus, comparison of human genes with genes from other genomes in a genomic landscape could help assign novel functions for un-annotated genes. Here, we discuss the recently used techniques for comparative genomics and their derived inferences in genome biology. PMID:17597925

  18. GFam: a platform for automatic annotation of gene families

    PubMed Central

    Sasidharan, Rajkumar; Nepusz, Tamás; Swarbreck, David; Huala, Eva; Paccanaro, Alberto

    2012-01-01

    We have developed GFam, a platform for automatic annotation of gene/protein families. GFam provides a framework for genome initiatives and model organism resources to build domain-based families, derive meaningful functional labels and offers a seamless approach to propagate functional annotation across periodic genome updates. GFam is a hybrid approach that uses a greedy algorithm to chain component domains from InterPro annotation provided by its 12 member resources followed by a sequence-based connected component analysis of un-annotated sequence regions to derive consensus domain architecture for each sequence and subsequently generate families based on common architectures. Our integrated approach increases sequence coverage by 7.2 percentage points and residue coverage by 14.6 percentage points higher than the coverage relative to the best single-constituent database within InterPro for the proteome of Arabidopsis. The true power of GFam lies in maximizing annotation provided by the different InterPro data sources that offer resource-specific coverage for different regions of a sequence. GFam’s capability to capture higher sequence and residue coverage can be useful for genome annotation, comparative genomics and functional studies. GFam is a general-purpose software and can be used for any collection of protein sequences. The software is open source and can be obtained from http://www.paccanarolab.org/software/gfam/. PMID:22790981

  19. Genome-wide annotation and characterization of CLAVATA/ESR (CLE) peptide hormones of soybean (Glycine max) and common bean (Phaseolus vulgaris), and their orthologues of Arabidopsis thaliana

    PubMed Central

    Hastwell, April H.; Gresshoff, Peter M.; Ferguson, Brett J.

    2015-01-01

    CLE peptides are key regulators of cell proliferation and differentiation in plant shoots, roots, vasculature, and legume nodules. They are C-terminally encoded peptides that are post-translationally cleaved and modified from their corresponding pre-propeptides to produce a final ligand that is 12–13 amino acids in length. In this study, an array of bionformatic and comparative genomic approaches was used to identify and characterize the complete family of CLE peptide-encoding genes in two of the world’s most important crop species, soybean and common bean. In total, there are 84 CLE peptide-encoding genes in soybean (considerably more than the 32 present in Arabidopsis), including three pseudogenes and two multi-CLE domain genes having six putative CLE domains each. In addition, 44 CLE peptide-encoding genes were identified in common bean. In silico characterization was used to establish all soybean homeologous pairs, and to identify corresponding gene orthologues present in common bean and Arabidopsis. The soybean CLE pre-propeptide family was further analysed and separated into seven distinct groups based on structure, with groupings strongly associated with the CLE domain sequence and function. These groups provide evolutionary insight into the CLE peptide families of soybean, common bean, and Arabidopsis, and represent a novel tool that can aid in the functional characterization of the peptides. Transcriptional evidence was also used to provide further insight into the location and function of all CLE peptide-encoding members currently available in gene atlases for the three species. Taken together, this in-depth analysis helped to identify and categorize the complete CLE peptide families of soybean and common bean, established gene orthologues within the two legume species, and Arabidopsis, and provided a platform to help compare, contrast, and identify the function of critical CLE peptide hormones in plant development. PMID:26188205

  20. UCSC Data Integrator and Variant Annotation Integrator

    PubMed Central

    Hinrichs, Angie S.; Raney, Brian J.; Speir, Matthew L.; Rhead, Brooke; Casper, Jonathan; Karolchik, Donna; Kuhn, Robert M.; Rosenbloom, Kate R.; Zweig, Ann S.; Haussler, David; Kent, W. James

    2016-01-01

    Summary: Two new tools on the UCSC Genome Browser web site provide improved ways of combining information from multiple datasets, optionally including the user's own custom track data and/or data from track hubs. The Data Integrator combines columns from multiple data tracks, showing all items from the first track along with overlapping items from the other tracks. The Variant Annotation Integrator is tailored to adding functional annotations to variant calls; it offers a more restricted set of underlying data tracks but adds predictions of each variant's consequences for any overlapping or nearby gene transcript. When available, it optionally adds additional annotations including effect prediction scores from dbNSFP for missense mutations, ENCODE regulatory summary tracks and conservation scores. Availability and implementation: The web tools are freely available at http://genome.ucsc.edu/ and the underlying database is available for download at http://hgdownload.cse.ucsc.edu/. The software (written in C and Javascript) is available from https://genome-store.ucsc.edu/ and is freely available for academic and non-profit usage; commercial users must obtain a license. Contact: angie@soe.ucsc.edu Supplementary information: Supplementary data are available at Bioinformatics online. PMID:26740527

  1. Comparative validation of the D. melanogaster modENCODE transcriptome annotation.

    PubMed

    Chen, Zhen-Xia; Sturgill, David; Qu, Jiaxin; Jiang, Huaiyang; Park, Soo; Boley, Nathan; Suzuki, Ana Maria; Fletcher, Anthony R; Plachetzki, David C; FitzGerald, Peter C; Artieri, Carlo G; Atallah, Joel; Barmina, Olga; Brown, James B; Blankenburg, Kerstin P; Clough, Emily; Dasgupta, Abhijit; Gubbala, Sai; Han, Yi; Jayaseelan, Joy C; Kalra, Divya; Kim, Yoo-Ah; Kovar, Christie L; Lee, Sandra L; Li, Mingmei; Malley, James D; Malone, John H; Mathew, Tittu; Mattiuzzo, Nicolas R; Munidasa, Mala; Muzny, Donna M; Ongeri, Fiona; Perales, Lora; Przytycka, Teresa M; Pu, Ling-Ling; Robinson, Garrett; Thornton, Rebecca L; Saada, Nehad; Scherer, Steven E; Smith, Harold E; Vinson, Charles; Warner, Crystal B; Worley, Kim C; Wu, Yuan-Qing; Zou, Xiaoyan; Cherbas, Peter; Kellis, Manolis; Eisen, Michael B; Piano, Fabio; Kionte, Karin; Fitch, David H; Sternberg, Paul W; Cutter, Asher D; Duff, Michael O; Hoskins, Roger A; Graveley, Brenton R; Gibbs, Richard A; Bickel, Peter J; Kopp, Artyom; Carninci, Piero; Celniker, Susan E; Oliver, Brian; Richards, Stephen

    2014-07-01

    Accurate gene model annotation of reference genomes is critical for making them useful. The modENCODE project has improved the D. melanogaster genome annotation by using deep and diverse high-throughput data. Since transcriptional activity that has been evolutionarily conserved is likely to have an advantageous function, we have performed large-scale interspecific comparisons to increase confidence in predicted annotations. To support comparative genomics, we filled in divergence gaps in the Drosophila phylogeny by generating draft genomes for eight new species. For comparative transcriptome analysis, we generated mRNA expression profiles on 81 samples from multiple tissues and developmental stages of 15 Drosophila species, and we performed cap analysis of gene expression in D. melanogaster and D. pseudoobscura. We also describe conservation of four distinct core promoter structures composed of combinations of elements at three positions. Overall, each type of genomic feature shows a characteristic divergence rate relative to neutral models, highlighting the value of multispecies alignment in annotating a target genome that should prove useful in the annotation of other high priority genomes, especially human and other mammalian genomes that are rich in noncoding sequences. We report that the vast majority of elements in the annotation are evolutionarily conserved, indicating that the annotation will be an important springboard for functional genetic testing by the Drosophila community.

  2. Agents and Data Mining in Bioinformatics: Joining Data Gathering and Automatic Annotation with Classification and Distributed Clustering

    NASA Astrophysics Data System (ADS)

    Bazzan, Ana L. C.

    Multiagent systems and data mining techniques are being frequently used in genome projects, especially regarding the annotation process (annotation pipeline). This paper discusses annotation-related problems where agent-based and/or distributed data mining has been successfully employed.

  3. A Factor Graph Approach to Automated GO Annotation.

    PubMed

    Spetale, Flavio E; Tapia, Elizabeth; Krsticevic, Flavia; Roda, Fernando; Bulacio, Pilar

    2016-01-01

    As volume of genomic data grows, computational methods become essential for providing a first glimpse onto gene annotations. Automated Gene Ontology (GO) annotation methods based on hierarchical ensemble classification techniques are particularly interesting when interpretability of annotation results is a main concern. In these methods, raw GO-term predictions computed by base binary classifiers are leveraged by checking the consistency of predefined GO relationships. Both formal leveraging strategies, with main focus on annotation precision, and heuristic alternatives, with main focus on scalability issues, have been described in literature. In this contribution, a factor graph approach to the hierarchical ensemble formulation of the automated GO annotation problem is presented. In this formal framework, a core factor graph is first built based on the GO structure and then enriched to take into account the noisy nature of GO-term predictions. Hence, starting from raw GO-term predictions, an iterative message passing algorithm between nodes of the factor graph is used to compute marginal probabilities of target GO-terms. Evaluations on Saccharomyces cerevisiae, Arabidopsis thaliana and Drosophila melanogaster protein sequences from the GO Molecular Function domain showed significant improvements over competing approaches, even when protein sequences were naively characterized by their physicochemical and secondary structure properties or when loose noisy annotation datasets were considered. Based on these promising results and using Arabidopsis thaliana annotation data, we extend our approach to the identification of most promising molecular function annotations for a set of proteins of unknown function in Solanum lycopersicum. PMID:26771463

  4. A Factor Graph Approach to Automated GO Annotation

    PubMed Central

    Spetale, Flavio E.; Tapia, Elizabeth; Krsticevic, Flavia; Roda, Fernando; Bulacio, Pilar

    2016-01-01

    As volume of genomic data grows, computational methods become essential for providing a first glimpse onto gene annotations. Automated Gene Ontology (GO) annotation methods based on hierarchical ensemble classification techniques are particularly interesting when interpretability of annotation results is a main concern. In these methods, raw GO-term predictions computed by base binary classifiers are leveraged by checking the consistency of predefined GO relationships. Both formal leveraging strategies, with main focus on annotation precision, and heuristic alternatives, with main focus on scalability issues, have been described in literature. In this contribution, a factor graph approach to the hierarchical ensemble formulation of the automated GO annotation problem is presented. In this formal framework, a core factor graph is first built based on the GO structure and then enriched to take into account the noisy nature of GO-term predictions. Hence, starting from raw GO-term predictions, an iterative message passing algorithm between nodes of the factor graph is used to compute marginal probabilities of target GO-terms. Evaluations on Saccharomyces cerevisiae, Arabidopsis thaliana and Drosophila melanogaster protein sequences from the GO Molecular Function domain showed significant improvements over competing approaches, even when protein sequences were naively characterized by their physicochemical and secondary structure properties or when loose noisy annotation datasets were considered. Based on these promising results and using Arabidopsis thaliana annotation data, we extend our approach to the identification of most promising molecular function annotations for a set of proteins of unknown function in Solanum lycopersicum. PMID:26771463

  5. A Factor Graph Approach to Automated GO Annotation.

    PubMed

    Spetale, Flavio E; Tapia, Elizabeth; Krsticevic, Flavia; Roda, Fernando; Bulacio, Pilar

    2016-01-01

    As volume of genomic data grows, computational methods become essential for providing a first glimpse onto gene annotations. Automated Gene Ontology (GO) annotation methods based on hierarchical ensemble classification techniques are particularly interesting when interpretability of annotation results is a main concern. In these methods, raw GO-term predictions computed by base binary classifiers are leveraged by checking the consistency of predefined GO relationships. Both formal leveraging strategies, with main focus on annotation precision, and heuristic alternatives, with main focus on scalability issues, have been described in literature. In this contribution, a factor graph approach to the hierarchical ensemble formulation of the automated GO annotation problem is presented. In this formal framework, a core factor graph is first built based on the GO structure and then enriched to take into account the noisy nature of GO-term predictions. Hence, starting from raw GO-term predictions, an iterative message passing algorithm between nodes of the factor graph is used to compute marginal probabilities of target GO-terms. Evaluations on Saccharomyces cerevisiae, Arabidopsis thaliana and Drosophila melanogaster protein sequences from the GO Molecular Function domain showed significant improvements over competing approaches, even when protein sequences were naively characterized by their physicochemical and secondary structure properties or when loose noisy annotation datasets were considered. Based on these promising results and using Arabidopsis thaliana annotation data, we extend our approach to the identification of most promising molecular function annotations for a set of proteins of unknown function in Solanum lycopersicum.

  6. Annotation of Ehux ESTs

    SciTech Connect

    Kuo, Alan; Grigoriev, Igor

    2009-06-12

    22 percent ESTs do no align with scaffolds. EST Pipeleine assembles 17126 consensi from the noaligned ESTs. Annotation Pipeline predicts 8564 ORFS on the consensi. Domain analysis of ORFs reveals missing genes. Cluster analysis reveals missing genes. Expression analysis reveals potential strain specific genes.

  7. Annotation: The Savant Syndrome

    ERIC Educational Resources Information Center

    Heaton, Pamela; Wallace, Gregory L.

    2004-01-01

    Background: Whilst interest has focused on the origin and nature of the savant syndrome for over a century, it is only within the past two decades that empirical group studies have been carried out. Methods: The following annotation briefly reviews relevant research and also attempts to address outstanding issues in this research area.…

  8. Intellectuals in China: Annotations.

    ERIC Educational Resources Information Center

    Parker, Franklin

    This annotated bibliography of 72 books, journal articles, government reports, and newspaper feature stories focuses on the changing role of intellectuals in China, primarily since the 1949 Chinese Revolution. Particular attention is given to the Hundred Flowers Movement of 1957 and the Cultural Revolution. Most of the cited works are in English,…

  9. Collaborative Movie Annotation

    NASA Astrophysics Data System (ADS)

    Zad, Damon Daylamani; Agius, Harry

    In this paper, we focus on metadata for self-created movies like those found on YouTube and Google Video, the duration of which are increasing in line with falling upload restrictions. While simple tags may have been sufficient for most purposes for traditionally very short video footage that contains a relatively small amount of semantic content, this is not the case for movies of longer duration which embody more intricate semantics. Creating metadata is a time-consuming process that takes a great deal of individual effort; however, this effort can be greatly reduced by harnessing the power of Web 2.0 communities to create, update and maintain it. Consequently, we consider the annotation of movies within Web 2.0 environments, such that users create and share that metadata collaboratively and propose an architecture for collaborative movie annotation. This architecture arises from the results of an empirical experiment where metadata creation tools, YouTube and an MPEG-7 modelling tool, were used by users to create movie metadata. The next section discusses related work in the areas of collaborative retrieval and tagging. Then, we describe the experiments that were undertaken on a sample of 50 users. Next, the results are presented which provide some insight into how users interact with existing tools and systems for annotating movies. Based on these results, the paper then develops an architecture for collaborative movie annotation.

  10. Annotated Bibliography. First Edition.

    ERIC Educational Resources Information Center

    Haring, Norris G.

    An annotated bibliography which presents approximately 300 references from 1951 to 1973 on the education of severely/profoundly handicapped persons. Citations are grouped alphabetically by author's name within the following categories: characteristics and treatment, gross motor development, sensory and motor development, physical therapy for the…

  11. Ghostwriting: An Annotated Bibliography.

    ERIC Educational Resources Information Center

    Simmons, Donald B.

    Drawn from communication journals, historical and news magazines, business and industrial magazines, political science and world affairs journals, general interest periodicals, and literary and political review magazines, the approximately 90 entries in this annotated bibliography discuss ghostwriting as practiced through the ages and reveal the…

  12. WEGO: a web tool for plotting GO annotations

    PubMed Central

    Ye, Jia; Fang, Lin; Zheng, Hongkun; Zhang, Yong; Chen, Jie; Zhang, Zengjin; Wang, Jing; Li, Shengting; Li, Ruiqiang; Bolund, Lars; Wang, Jun

    2006-01-01

    Unified, structured vocabularies and classifications freely provided by the Gene Ontology (GO) Consortium are widely accepted in most of the large scale gene annotation projects. Consequently, many tools have been created for use with the GO ontologies. WEGO (Web Gene Ontology Annotation Plot) is a simple but useful tool for visualizing, comparing and plotting GO annotation results. Different from other commercial software for creating chart, WEGO is designed to deal with the directed acyclic graph structure of GO to facilitate histogram creation of GO annotation results. WEGO has been used widely in many important biological research projects, such as the rice genome project and the silkworm genome project. It has become one of the daily tools for downstream gene annotation analysis, especially when performing comparative genomics tasks. WEGO, along with the two other tools, namely External to GO Query and GO Archive Query, are freely available for all users at . There are two available mirror sites at and . Any suggestions are welcome at wego@genomics.org.cn. PMID:16845012

  13. A Novel Quality Measure and Correction Procedure for the Annotation of Microbial Translation Initiation Sites.

    PubMed

    Overmars, Lex; Siezen, Roland J; Francke, Christof

    2015-01-01

    The identification of translation initiation sites (TISs) constitutes an important aspect of sequence-based genome analysis. An erroneous TIS annotation can impair the identification of regulatory elements and N-terminal signal peptides, and also may flaw the determination of descent, for any particular gene. We have formulated a reference-free method to score the TIS annotation quality. The method is based on a comparison of the observed and expected distribution of all TISs in a particular genome given prior gene-calling. We have assessed the TIS annotations for all available NCBI RefSeq microbial genomes and found that approximately 87% is of appropriate quality, whereas 13% needs substantial improvement. We have analyzed a number of factors that could affect TIS annotation quality such as GC-content, taxonomy, the fraction of genes with a Shine-Dalgarno sequence and the year of publication. The analysis showed that only the first factor has a clear effect. We have then formulated a straightforward Principle Component Analysis-based TIS identification strategy to self-organize and score potential TISs. The strategy is independent of reference data and a priori calculations. A representative set of 277 genomes was subjected to the analysis and we found a clear increase in TIS annotation quality for the genomes with a low quality score. The PCA-based annotation was also compared with annotation with the current tool of reference, Prodigal. The comparison for the model genome of Escherichia coli K12 showed that both methods supplement each other and that prediction agreement can be used as an indicator of a correct TIS annotation. Importantly, the data suggest that the addition of a PCA-based strategy to a Prodigal prediction can be used to 'flag' TIS annotations for re-evaluation and in addition can be used to evaluate a given annotation in case a Prodigal annotation is lacking.

  14. A Novel Quality Measure and Correction Procedure for the Annotation of Microbial Translation Initiation Sites

    PubMed Central

    Overmars, Lex; Siezen, Roland J.; Francke, Christof

    2015-01-01

    The identification of translation initiation sites (TISs) constitutes an important aspect of sequence-based genome analysis. An erroneous TIS annotation can impair the identification of regulatory elements and N-terminal signal peptides, and also may flaw the determination of descent, for any particular gene. We have formulated a reference-free method to score the TIS annotation quality. The method is based on a comparison of the observed and expected distribution of all TISs in a particular genome given prior gene-calling. We have assessed the TIS annotations for all available NCBI RefSeq microbial genomes and found that approximately 87% is of appropriate quality, whereas 13% needs substantial improvement. We have analyzed a number of factors that could affect TIS annotation quality such as GC-content, taxonomy, the fraction of genes with a Shine-Dalgarno sequence and the year of publication. The analysis showed that only the first factor has a clear effect. We have then formulated a straightforward Principle Component Analysis-based TIS identification strategy to self-organize and score potential TISs. The strategy is independent of reference data and a priori calculations. A representative set of 277 genomes was subjected to the analysis and we found a clear increase in TIS annotation quality for the genomes with a low quality score. The PCA-based annotation was also compared with annotation with the current tool of reference, Prodigal. The comparison for the model genome of Escherichia coli K12 showed that both methods supplement each other and that prediction agreement can be used as an indicator of a correct TIS annotation. Importantly, the data suggest that the addition of a PCA-based strategy to a Prodigal prediction can be used to ‘flag’ TIS annotations for re-evaluation and in addition can be used to evaluate a given annotation in case a Prodigal annotation is lacking. PMID:26204119

  15. Jannovar: a java library for exome annotation.

    PubMed

    Jäger, Marten; Wang, Kai; Bauer, Sebastian; Smedley, Damian; Krawitz, Peter; Robinson, Peter N

    2014-05-01

    Transcript-based annotation and pedigree analysis are two basic steps in the computational analysis of whole-exome sequencing experiments in genetic diagnostics and disease-gene discovery projects. Here, we present Jannovar, a stand-alone Java application as well as a Java library designed to be used in larger software frameworks for exome and genome analysis. Jannovar uses an interval tree to identify all transcripts affected by a given variant, and provides Human Genome Variation Society-compliant annotations both for variants affecting coding sequences and splice junctions as well as untranslated regions and noncoding RNA transcripts. Jannovar can also perform family-based pedigree analysis with Variant Call Format (VCF) files with data from members of a family segregating a Mendelian disorder. Using a desktop computer, Jannovar requires a few seconds to annotate a typical VCF file with exome data. Jannovar is freely available under the BSD2 license. Source code as well as the Java application and library file can be downloaded from http://compbio.charite.de (with tutorial) and https://github.com/charite/jannovar. PMID:24677618

  16. Statistical algorithms for ontology-based annotation of scientific literature

    PubMed Central

    2014-01-01

    Background Ontologies encode relationships within a domain in robust data structures that can be used to annotate data objects, including scientific papers, in ways that ease tasks such as search and meta-analysis. However, the annotation process requires significant time and effort when performed by humans. Text mining algorithms can facilitate this process, but they render an analysis mainly based upon keyword, synonym and semantic matching. They do not leverage information embedded in an ontology's structure. Methods We present a probabilistic framework that facilitates the automatic annotation of literature by indirectly modeling the restrictions among the different classes in the ontology. Our research focuses on annotating human functional neuroimaging literature within the Cognitive Paradigm Ontology (CogPO). We use an approach that combines the stochastic simplicity of naïve Bayes with the formal transparency of decision trees. Our data structure is easily modifiable to reflect changing domain knowledge. Results We compare our results across naïve Bayes, Bayesian Decision Trees, and Constrained Decision Tree classifiers that keep a human expert in the loop, in terms of the quality measure of the F1-mirco score. Conclusions Unlike traditional text mining algorithms, our framework can model the knowledge encoded by the dependencies in an ontology, albeit indirectly. We successfully exploit the fact that CogPO has explicitly stated restrictions, and implicit dependencies in the form of patterns in the expert curated annotations. PMID:25093071

  17. The Accuracy and Reliability of Crowdsource Annotations of Digital Retinal Images

    PubMed Central

    Mitry, Danny; Zutis, Kris; Dhillon, Baljean; Peto, Tunde; Hayat, Shabina; Khaw, Kay-Tee; Morgan, James E.; Moncur, Wendy; Trucco, Emanuele; Foster, Paul J.

    2016-01-01

    Purpose Crowdsourcing is based on outsourcing computationally intensive tasks to numerous individuals in the online community who have no formal training. Our aim was to develop a novel online tool designed to facilitate large-scale annotation of digital retinal images, and to assess the accuracy of crowdsource grading using this tool, comparing it to expert classification. Methods We used 100 retinal fundus photograph images with predetermined disease criteria selected by two experts from a large cohort study. The Amazon Mechanical Turk Web platform was used to drive traffic to our site so anonymous workers could perform a classification and annotation task of the fundus photographs in our dataset after a short training exercise. Three groups were assessed: masters only, nonmasters only and nonmasters with compulsory training. We calculated the sensitivity, specificity, and area under the curve (AUC) of receiver operating characteristic (ROC) plots for all classifications compared to expert grading, and used the Dice coefficient and consensus threshold to assess annotation accuracy. Results In total, we received 5389 annotations for 84 images (excluding 16 training images) in 2 weeks. A specificity and sensitivity of 71% (95% confidence interval [CI], 69%–74%) and 87% (95% CI, 86%–88%) was achieved for all classifications. The AUC in this study for all classifications combined was 0.93 (95% CI, 0.91–0.96). For image annotation, a maximal Dice coefficient (∼0.6) was achieved with a consensus threshold of 0.25. Conclusions This study supports the hypothesis that annotation of abnormalities in retinal images by ophthalmologically naive individuals is comparable to expert annotation. The highest AUC and agreement with expert annotation was achieved in the nonmasters with compulsory training group. Translational Relevance The use of crowdsourcing as a technique for retinal image analysis may be comparable to expert graders and has the potential to deliver

  18. The Accuracy and Reliability of Crowdsource Annotations of Digital Retinal Images

    PubMed Central

    Mitry, Danny; Zutis, Kris; Dhillon, Baljean; Peto, Tunde; Hayat, Shabina; Khaw, Kay-Tee; Morgan, James E.; Moncur, Wendy; Trucco, Emanuele; Foster, Paul J.

    2016-01-01

    Purpose Crowdsourcing is based on outsourcing computationally intensive tasks to numerous individuals in the online community who have no formal training. Our aim was to develop a novel online tool designed to facilitate large-scale annotation of digital retinal images, and to assess the accuracy of crowdsource grading using this tool, comparing it to expert classification. Methods We used 100 retinal fundus photograph images with predetermined disease criteria selected by two experts from a large cohort study. The Amazon Mechanical Turk Web platform was used to drive traffic to our site so anonymous workers could perform a classification and annotation task of the fundus photographs in our dataset after a short training exercise. Three groups were assessed: masters only, nonmasters only and nonmasters with compulsory training. We calculated the sensitivity, specificity, and area under the curve (AUC) of receiver operating characteristic (ROC) plots for all classifications compared to expert grading, and used the Dice coefficient and consensus threshold to assess annotation accuracy. Results In total, we received 5389 annotations for 84 images (excluding 16 training images) in 2 weeks. A specificity and sensitivity of 71% (95% confidence interval [CI], 69%–74%) and 87% (95% CI, 86%–88%) was achieved for all classifications. The AUC in this study for all classifications combined was 0.93 (95% CI, 0.91–0.96). For image annotation, a maximal Dice coefficient (∼0.6) was achieved with a consensus threshold of 0.25. Conclusions This study supports the hypothesis that annotation of abnormalities in retinal images by ophthalmologically naive individuals is comparable to expert annotation. The highest AUC and agreement with expert annotation was achieved in the nonmasters with compulsory training group. Translational Relevance The use of crowdsourcing as a technique for retinal image analysis may be comparable to expert graders and has the potential to deliver

  19. Expert Systems: An Overview.

    ERIC Educational Resources Information Center

    Adiga, Sadashiv

    1984-01-01

    Discusses: (1) the architecture of expert systems; (2) features that distinguish expert systems from conventional programs; (3) conditions necessary to select a particular application for the development of successful expert systems; (4) issues to be resolved when building expert systems; and (5) limitations. Examples of selected expert systems…

  20. Enhanced Acylcarnitine Annotation in High-Resolution Mass Spectrometry Data: Fragmentation Analysis for the Classification and Annotation of Acylcarnitines

    PubMed Central

    van der Hooft, Justin J. J.; Ridder, Lars; Barrett, Michael P.; Burgess, Karl E. V.

    2015-01-01

    Metabolite annotation and identification are primary challenges in untargeted metabolomics experiments. Rigorous workflows for reliable annotation of mass features with chemical structures or compound classes are needed to enhance the power of untargeted mass spectrometry. High-resolution mass spectrometry considerably improves the confidence in assigning elemental formulas to mass features in comparison to nominal mass spectrometry, and embedding of fragmentation methods enables more reliable metabolite annotations and facilitates metabolite classification. However, the analysis of mass fragmentation spectra can be a time-consuming step and requires expert knowledge. This study demonstrates how characteristic fragmentations, specific to compound classes, can be used to systematically analyze their presence in complex biological extracts like urine that have undergone untargeted mass spectrometry combined with data dependent or targeted fragmentation. Human urine extracts were analyzed using normal phase liquid chromatography (hydrophilic interaction chromatography) coupled to an Ion Trap-Orbitrap hybrid instrument. Subsequently, mass chromatograms and collision-induced dissociation and higher-energy collisional dissociation (HCD) fragments were annotated using the freely available MAGMa software1. Acylcarnitines play a central role in energy metabolism by transporting fatty acids into the mitochondrial matrix. By filtering on a combination of a mass fragment and neutral loss designed based on the MAGMa fragment annotations, we were able to classify and annotate 50 acylcarnitines in human urine extracts, based on high-resolution mass spectrometry HCD fragmentation spectra at different energies for all of them. Of these annotated acylcarnitines, 31 are not described in HMDB yet and for only 4 annotated acylcarnitines the fragmentation spectra could be matched to reference spectra. Therefore, we conclude that the use of mass fragmentation filters within the context

  1. Automated Eukaryotic Gene Structure Annotation Using EVidenceModeler and the Program to Assemble Spliced Alignments

    SciTech Connect

    Haas, B J; Salzberg, S L; Zhu, W; Pertea, M; Allen, J E; Orvis, J; White, O; Buell, C R; Wortman, J R

    2007-12-10

    EVidenceModeler (EVM) is presented as an automated eukaryotic gene structure annotation tool that reports eukaryotic gene structures as a weighted consensus of all available evidence. EVM, when combined with the Program to Assemble Spliced Alignments (PASA), yields a comprehensive, configurable annotation system that predicts protein-coding genes and alternatively spliced isoforms. Our experiments on both rice and human genome sequences demonstrate that EVM produces automated gene structure annotation approaching the quality of manual curation.

  2. Considerations to improve functional annotations in biological databases.

    PubMed

    Benítez-Páez, Alfonso

    2009-12-01

    Despite the great effort to design efficient systems allowing the electronic indexation of information concerning genes, proteins, structures, and interactions published daily in scientific journals, some problems are still observed in specific tasks such as functional annotation. The annotation of function is a critical issue for bioinformatic routines, such as for instance, in functional genomics and the further prediction of unknown protein function, which are highly dependent of the quality of existing annotations. Some information management systems evolve to efficiently incorporate information from large-scale projects, but often, annotation of single records from the literature is difficult and slow. In this short report, functional characterizations of a representative sample of the entire set of uncharacterized proteins from Escherichia coli K12 was compiled from Swiss-Prot, PubMed, and EcoCyc and demonstrate a functional annotation deficit in biological databases. Some issues are postulated as causes of the lack of annotation, and different solutions are evaluated and proposed to avoid them. The hope is that as a consequence of these observations, there will be new impetus to improve the speed and quality of functional annotation and ultimately provide updated, reliable information to the scientific community. PMID:20050264

  3. A survey on annotation tools for the biomedical literature.

    PubMed

    Neves, Mariana; Leser, Ulf

    2014-03-01

    New approaches to biomedical text mining crucially depend on the existence of comprehensive annotated corpora. Such corpora, commonly called gold standards, are important for learning patterns or models during the training phase, for evaluating and comparing the performance of algorithms and also for better understanding the information sought for by means of examples. Gold standards depend on human understanding and manual annotation of natural language text. This process is very time-consuming and expensive because it requires high intellectual effort from domain experts. Accordingly, the lack of gold standards is considered as one of the main bottlenecks for developing novel text mining methods. This situation led the development of tools that support humans in annotating texts. Such tools should be intuitive to use, should support a range of different input formats, should include visualization of annotated texts and should generate an easy-to-parse output format. Today, a range of tools which implement some of these functionalities are available. In this survey, we present a comprehensive survey of tools for supporting annotation of biomedical texts. Altogether, we considered almost 30 tools, 13 of which were selected for an in-depth comparison. The comparison was performed using predefined criteria and was accompanied by hands-on experiences whenever possible. Our survey shows that current tools can support many of the tasks in biomedical text annotation in a satisfying manner, but also that no tool can be considered as a true comprehensive solution.

  4. Improving the Annotation of Arabidopsis lyrata Using RNA-Seq Data

    PubMed Central

    Rawat, Vimal; Abdelsamad, Ahmed; Pietzenuk, Björn; Seymour, Danelle K.; Koenig, Daniel; Weigel, Detlef; Pecinka, Ales; Schneeberger, Korbinian

    2015-01-01

    Gene model annotations are important community resources that ensure comparability and reproducibility of analyses and are typically the first step for functional annotation of genomic regions. Without up-to-date genome annotations, genome sequences cannot be used to maximum advantage. It is therefore essential to regularly update gene annotations by integrating the latest information to guarantee that reference annotations can remain a common basis for various types of analyses. Here, we report an improvement of the Arabidopsis lyrata gene annotation using extensive RNA-seq data. This new annotation consists of 31,132 protein coding gene models in addition to 2,089 genes with high similarity to transposable elements. Overall, ~87% of the gene models are corroborated by evidence of expression and 2,235 of these models feature multiple transcripts. Our updated gene annotation corrects hundreds of incorrectly split or merged gene models in the original annotation, and as a result the identification of alternative splicing events and differential isoform usage are vastly improved. PMID:26382944

  5. Using Nonexperts for Annotating Pharmacokinetic Drug-Drug Interaction Mentions in Product Labeling: A Feasibility Study

    PubMed Central

    Ning, Yifan; Hernandez, Andres; Horn, John R; Jacobson, Rebecca; Boyce, Richard D

    2016-01-01

    Background Because vital details of potential pharmacokinetic drug-drug interactions are often described in free-text structured product labels, manual curation is a necessary but expensive step in the development of electronic drug-drug interaction information resources. The use of nonexperts to annotate potential drug-drug interaction (PDDI) mentions in drug product label annotation may be a means of lessening the burden of manual curation. Objective Our goal was to explore the practicality of using nonexpert participants to annotate drug-drug interaction descriptions from structured product labels. By presenting annotation tasks to both pharmacy experts and relatively naïve participants, we hoped to demonstrate the feasibility of using nonexpert annotators for drug-drug information annotation. We were also interested in exploring whether and to what extent natural language processing (NLP) preannotation helped improve task completion time, accuracy, and subjective satisfaction. Methods Two experts and 4 nonexperts were asked to annotate 208 structured product label sections under 4 conditions completed sequentially: (1) no NLP assistance, (2) preannotation of drug mentions, (3) preannotation of drug mentions and PDDIs, and (4) a repeat of the no-annotation condition. Results were evaluated within the 2 groups and relative to an existing gold standard. Participants were asked to provide reports on the time required to complete tasks and their perceptions of task difficulty. Results One of the experts and 3 of the nonexperts completed all tasks. Annotation results from the nonexpert group were relatively strong in every scenario and better than the performance of the NLP pipeline. The expert and 2 of the nonexperts were able to complete most tasks in less than 3 hours. Usability perceptions were generally positive (3.67 for expert, mean of 3.33 for nonexperts). Conclusions The results suggest that nonexpert annotation might be a feasible option for comprehensive

  6. FALDO: A semantic standard for describing the location of nucleotide and protein feature annotation

    DOE PAGES

    Bolleman, Jerven T.; Mungall, Christopher J.; Strozzi, Francesco; Baran, Joachim; Dumontier, Michel; Bonnal, Raoul J. P.; Buels, Robert; Hoehndorf, Robert; Fujisawa, Takatomo; Katayama, Toshiaki; et al

    2016-06-13

    In this study, nucleotide and protein sequence feature annotations are essential to understand biology on the genomic, transcriptomic, and proteomic level. Using Semantic Web technologies to query biological annotations, there was no standard that described this potentially complex location information as subject-predicate-object triples. Here, we have developed an ontology, the Feature Annotation Location Description Ontology (FALDO), to describe the positions of annotated features on linear and circular sequences. FALDO can be used to describe nucleotide features in sequence records, protein annotations, and glycan binding sites, among other features in coordinate systems of the aforementioned “omics” areas. Using the same datamore » format to represent sequence positions that are independent of file formats allows us to integrate sequence data from multiple sources and data types. The genome browser JBrowse is used to demonstrate accessing multiple SPARQL endpoints to display genomic feature annotations, as well as protein annotations from UniProt mapped to genomic locations. In conclusion, our ontology allows users to uniformly describe – and potentially merge – sequence annotations from multiple sources. Data sources using FALDO can prospectively be retrieved using federalised SPARQL queries against public SPARQL endpoints and/or local private triple stores.« less

  7. FALDO: a semantic standard for describing the location of nucleotide and protein feature annotation

    DOE PAGES

    Bolleman, Jerven T.; Mungall, Christopher J.; Strozzi, Francesco; Baran, Joachim; Dumontier, Michel; Bonnal, Raoul J. P.; Buels, Robert; Hoehndorf, Robert; Fujisawa, Takatomo; Katayama, Toshiaki; et al

    2016-06-13

    Nucleotide and protein sequence feature annotations are essential to understand biology on the genomic, transcriptomic, and proteomic level. Using Semantic Web technologies to query biological annotations, there was no standard that described this potentially complex location information as subject-predicate-object triples. In this paper, we have developed an ontology, the Feature Annotation Location Description Ontology (FALDO), to describe the positions of annotated features on linear and circular sequences. FALDO can be used to describe nucleotide features in sequence records, protein annotations, and glycan binding sites, among other features in coordinate systems of the aforementioned “omics” areas. Using the same data formatmore » to represent sequence positions that are independent of file formats allows us to integrate sequence data from multiple sources and data types. The genome browser JBrowse is used to demonstrate accessing multiple SPARQL endpoints to display genomic feature annotations, as well as protein annotations from UniProt mapped to genomic locations. Our ontology allows users to uniformly describe – and potentially merge – sequence annotations from multiple sources. Finally, data sources using FALDO can prospectively be retrieved using federalised SPARQL queries against public SPARQL endpoints and/or local private triple stores.« less

  8. Translational genomics for plant breeding with the genome sequence explosion.

    PubMed

    Kang, Yang Jae; Lee, Taeyoung; Lee, Jayern; Shim, Sangrea; Jeong, Haneul; Satyawan, Dani; Kim, Moon Young; Lee, Suk-Ha

    2016-04-01

    The use of next-generation sequencers and advanced genotyping technologies has propelled the field of plant genomics in model crops and plants and enhanced the discovery of hidden bridges between genotypes and phenotypes. The newly generated reference sequences of unstudied minor plants can be annotated by the knowledge of model plants via translational genomics approaches. Here, we reviewed the strategies of translational genomics and suggested perspectives on the current databases of genomic resources and the database structures of translated information on the new genome. As a draft picture of phenotypic annotation, translational genomics on newly sequenced plants will provide valuable assistance for breeders and researchers who are interested in genetic studies.

  9. An Ontology-Based GIS for Genomic Data Management of Rumen Microbes

    PubMed Central

    Jelokhani-Niaraki, Saber; Minuchehr, Zarrin; Nassiri, Mohammad Reza

    2015-01-01

    During recent years, there has been exponential growth in biological information. With the emergence of large datasets in biology, life scientists are encountering bottlenecks in handling the biological data. This study presents an integrated geographic information system (GIS)-ontology application for handling microbial genome data. The application uses a linear referencing technique as one of the GIS functionalities to represent genes as linear events on the genome layer, where users can define/change the attributes of genes in an event table and interactively see the gene events on a genome layer. Our application adopted ontology to portray and store genomic data in a semantic framework, which facilitates data-sharing among biology domains, applications, and experts. The application was developed in two steps. In the first step, the genome annotated data were prepared and stored in a MySQL database. The second step involved the connection of the database to both ArcGIS and Protégé as the GIS engine and ontology platform, respectively. We have designed this application specifically to manage the genome-annotated data of rumen microbial populations. Such a GIS-ontology application offers powerful capabilities for visualizing, managing, reusing, sharing, and querying genome-related data. PMID:25873847

  10. An Ontology-Based GIS for Genomic Data Management of Rumen Microbes.

    PubMed

    Jelokhani-Niaraki, Saber; Tahmoorespur, Mojtaba; Minuchehr, Zarrin; Nassiri, Mohammad Reza

    2015-03-01

    During recent years, there has been exponential growth in biological information. With the emergence of large datasets in biology, life scientists are encountering bottlenecks in handling the biological data. This study presents an integrated geographic information system (GIS)-ontology application for handling microbial genome data. The application uses a linear referencing technique as one of the GIS functionalities to represent genes as linear events on the genome layer, where users can define/change the attributes of genes in an event table and interactively see the gene events on a genome layer. Our application adopted ontology to portray and store genomic data in a semantic framework, which facilitates data-sharing among biology domains, applications, and experts. The application was developed in two steps. In the first step, the genome annotated data were prepared and stored in a MySQL database. The second step involved the connection of the database to both ArcGIS and Protégé as the GIS engine and ontology platform, respectively. We have designed this application specifically to manage the genome-annotated data of rumen microbial populations. Such a GIS-ontology application offers powerful capabilities for visualizing, managing, reusing, sharing, and querying genome-related data.

  11. Functional Annotations of Paralogs: A Blessing and a Curse

    PubMed Central

    Zallot, Rémi; Harrison, Katherine J.; Kolaczkowski, Bryan; de Crécy-Lagard, Valérie

    2016-01-01

    Gene duplication followed by mutation is a classic mechanism of neofunctionalization, producing gene families with functional diversity. In some cases, a single point mutation is sufficient to change the substrate specificity and/or the chemistry performed by an enzyme, making it difficult to accurately separate enzymes with identical functions from homologs with different functions. Because sequence similarity is often used as a basis for assigning functional annotations to genes, non-isofunctional gene families pose a great challenge for genome annotation pipelines. Here we describe how integrating evolutionary and functional information such as genome context, phylogeny, metabolic reconstruction and signature motifs may be required to correctly annotate multifunctional families. These integrative analyses can also lead to the discovery of novel gene functions, as hints from specific subgroups can guide the functional characterization of other members of the family. We demonstrate how careful manual curation processes using comparative genomics can disambiguate subgroups within large multifunctional families and discover their functions. We present the COG0720 protein family as a case study. We also discuss strategies to automate this process to improve the accuracy of genome functional annotation pipelines. PMID:27618105

  12. Functional Annotations of Paralogs: A Blessing and a Curse.

    PubMed

    Zallot, Rémi; Harrison, Katherine J; Kolaczkowski, Bryan; de Crécy-Lagard, Valérie

    2016-01-01

    Gene duplication followed by mutation is a classic mechanism of neofunctionalization, producing gene families with functional diversity. In some cases, a single point mutation is sufficient to change the substrate specificity and/or the chemistry performed by an enzyme, making it difficult to accurately separate enzymes with identical functions from homologs with different functions. Because sequence similarity is often used as a basis for assigning functional annotations to genes, non-isofunctional gene families pose a great challenge for genome annotation pipelines. Here we describe how integrating evolutionary and functional information such as genome context, phylogeny, metabolic reconstruction and signature motifs may be required to correctly annotate multifunctional families. These integrative analyses can also lead to the discovery of novel gene functions, as hints from specific subgroups can guide the functional characterization of other members of the family. We demonstrate how careful manual curation processes using comparative genomics can disambiguate subgroups within large multifunctional families and discover their functions. We present the COG0720 protein family as a case study. We also discuss strategies to automate this process to improve the accuracy of genome functional annotation pipelines. PMID:27618105

  13. The UCSC Genome Browser

    PubMed Central

    Karolchik, Donna; Hinrichs, Angie S.; Kent, W. James

    2011-01-01

    The University of California Santa Cruz (UCSC) Genome Browser is a popular Web-based tool for quickly displaying a requested portion of a genome at any scale, accompanied by a series of aligned annotation “tracks.” The annotations generated by the UCSC Genome Bioinformatics Group and external collaborators include gene predictions, mRNA and expressed sequence tag alignments, simple nucleotide polymorphisms, expression and regulatory data, phenotype and variation data, and pairwise and multiple-species comparative genomics data. All information relevant to a region is presented in one window, facilitating biological analysis and interpretation. The database tables underlying the Genome Browser tracks can be viewed, downloaded, and manipulated using another Web-based application, the UCSC Table Browser. Users can upload personal datasets in a wide variety of formats as custom annotation tracks in both browsers for research or educational purposes. PMID:21975940

  14. Structuring osteosarcoma knowledge: an osteosarcoma-gene association database based on literature mining and manual annotation.

    PubMed

    Poos, Kathrin; Smida, Jan; Nathrath, Michaela; Maugg, Doris; Baumhoer, Daniel; Neumann, Anna; Korsching, Eberhard

    2014-01-01

    Osteosarcoma (OS) is the most common primary bone cancer exhibiting high genomic instability. This genomic instability affects multiple genes and microRNAs to a varying extent depending on patient and tumor subtype. Massive research is ongoing to identify genes including their gene products and microRNAs that correlate with disease progression and might be used as biomarkers for OS. However, the genomic complexity hampers the identification of reliable biomarkers. Up to now, clinico-pathological factors are the key determinants to guide prognosis and therapeutic treatments. Each day, new studies about OS are published and complicate the acquisition of information to support biomarker discovery and therapeutic improvements. Thus, it is necessary to provide a structured and annotated view on the current OS knowledge that is quick and easily accessible to researchers of the field. Therefore, we developed a publicly available database and Web interface that serves as resource for OS-associated genes and microRNAs. Genes and microRNAs were collected using an automated dictionary-based gene recognition procedure followed by manual review and annotation by experts of the field. In total, 911 genes and 81 microRNAs related to 1331 PubMed abstracts were collected (last update: 29 October 2013). Users can evaluate genes and microRNAs according to their potential prognostic and therapeutic impact, the experimental procedures, the sample types, the biological contexts and microRNA target gene interactions. Additionally, a pathway enrichment analysis of the collected genes highlights different aspects of OS progression. OS requires pathways commonly deregulated in cancer but also features OS-specific alterations like deregulated osteoclast differentiation. To our knowledge, this is the first effort of an OS database containing manual reviewed and annotated up-to-date OS knowledge. It might be a useful resource especially for the bone tumor research community, as specific

  15. Structuring osteosarcoma knowledge: an osteosarcoma-gene association database based on literature mining and manual annotation.

    PubMed

    Poos, Kathrin; Smida, Jan; Nathrath, Michaela; Maugg, Doris; Baumhoer, Daniel; Neumann, Anna; Korsching, Eberhard

    2014-01-01

    Osteosarcoma (OS) is the most common primary bone cancer exhibiting high genomic instability. This genomic instability affects multiple genes and microRNAs to a varying extent depending on patient and tumor subtype. Massive research is ongoing to identify genes including their gene products and microRNAs that correlate with disease progression and might be used as biomarkers for OS. However, the genomic complexity hampers the identification of reliable biomarkers. Up to now, clinico-pathological factors are the key determinants to guide prognosis and therapeutic treatments. Each day, new studies about OS are published and complicate the acquisition of information to support biomarker discovery and therapeutic improvements. Thus, it is necessary to provide a structured and annotated view on the current OS knowledge that is quick and easily accessible to researchers of the field. Therefore, we developed a publicly available database and Web interface that serves as resource for OS-associated genes and microRNAs. Genes and microRNAs were collected using an automated dictionary-based gene recognition procedure followed by manual review and annotation by experts of the field. In total, 911 genes and 81 microRNAs related to 1331 PubMed abstracts were collected (last update: 29 October 2013). Users can evaluate genes and microRNAs according to their potential prognostic and therapeutic impact, the experimental procedures, the sample types, the biological contexts and microRNA target gene interactions. Additionally, a pathway enrichment analysis of the collected genes highlights different aspects of OS progression. OS requires pathways commonly deregulated in cancer but also features OS-specific alterations like deregulated osteoclast differentiation. To our knowledge, this is the first effort of an OS database containing manual reviewed and annotated up-to-date OS knowledge. It might be a useful resource especially for the bone tumor research community, as specific

  16. Speech spectrogram expert

    SciTech Connect

    Johannsen, J.; Macallister, J.; Michalek, T.; Ross, S.

    1983-01-01

    Various authors have pointed out that humans can become quite adept at deriving phonetic transcriptions from speech spectrograms (as good as 90percent accuracy at the phoneme level). The authors describe an expert system which attempts to simulate this performance. The speech spectrogram expert (spex) is actually a society made up of three experts: a 2-dimensional vision expert, an acoustic-phonetic expert, and a phonetics expert. The visual reasoning expert finds important visual features of the spectrogram. The acoustic-phonetic expert reasons about how visual features relates to phonemes, and about how phonemes change visually in different contexts. The phonetics expert reasons about allowable phoneme sequences and transformations, and deduces an english spelling for phoneme strings. The speech spectrogram expert is highly interactive, allowing users to investigate hypotheses and edit rules. 10 references.

  17. Integrative annotation of chromatin elements from ENCODE data

    PubMed Central

    Hoffman, Michael M.; Ernst, Jason; Wilder, Steven P.; Kundaje, Anshul; Harris, Robert S.; Libbrecht, Max; Giardine, Belinda; Ellenbogen, Paul M.; Bilmes, Jeffrey A.; Birney, Ewan; Hardison, Ross C.; Dunham, Ian; Kellis, Manolis; Noble, William Stafford

    2013-01-01

    The ENCODE Project has generated a wealth of experimental information mapping diverse chromatin properties in several human cell lines. Although each such data track is independently informative toward the annotation of regulatory elements, their interrelations contain much richer information for the systematic annotation of regulatory elements. To uncover these interrelations and to generate an interpretable summary of the massive datasets of the ENCODE Project, we apply unsupervised learning methodologies, converting dozens of chromatin datasets into discrete annotation maps of regulatory regions and other chromatin elements across the human genome. These methods rediscover and summarize diverse aspects of chromatin architecture, elucidate the interplay between chromatin activity and RNA transcription, and reveal that a large proportion of the genome lies in a quiescent state, even across multiple cell types. The resulting annotation of non-coding regulatory elements correlate strongly with mammalian evolutionary constraint, and provide an unbiased approach for evaluating metrics of evolutionary constraint in human. Lastly, we use the regulatory annotations to revisit previously uncharacterized disease-associated loci, resulting in focused, testable hypotheses through the lens of the chromatin landscape. PMID:23221638

  18. Complete genome sequence of an attenuated Sparfloxacin-resistant Streptococcus agalactiae strain 138spar

    Technology Transfer Automated Retrieval System (TEKTRAN)

    The complete genome of a sparfloxacin-resistant Streptococcus agalactiae vaccine strain 138spar is 1,838,126 bp in size. The genome has 1892 coding sequences and 82 RNAs. The annotation of the genome is added by the NCBI Prokaryotic Genome Annotation Pipeline. The publishing of this genome will allo...

  19. Drug Education: An Annotated Bibliography.

    ERIC Educational Resources Information Center

    Mathieson, Moira B.

    This bibliography consists of a total of 215 entries dealing with drug education, including curriculum guides, and drawn from documents in the ERIC system. There are two sections, the first containing 130 annotated citations of documents and journal articles, and the second containing 85 citations of journal articles without annotations, but with…

  20. Adult Basic Education Annotated Bibliography.

    ERIC Educational Resources Information Center

    Carter, Nancy B.

    This annotated bibliography contains sections divided according to area of study, and within each category materials are listed alphabetically by publisher. Publishers and mailing addresses are listed at the end of the bibliography. Throughout the annotations, whenever specific grade level divisions are not named, the regular Adult Basic Education…

  1. Morphosyntactic Annotation of CHILDES Transcripts

    ERIC Educational Resources Information Center

    Sagae, Kenji; Davis, Eric; Lavie, Alon; MacWhinney, Brian; Wintner, Shuly

    2010-01-01

    Corpora of child language are essential for research in child language acquisition and psycholinguistics. Linguistic annotation of the corpora provides researchers with better means for exploring the development of grammatical constructions and their usage. We describe a project whose goal is to annotate the English section of the CHILDES database…

  2. Automatic annotation of protein function based on family identification.

    PubMed

    Abascal, Federico; Valencia, Alfonso

    2003-11-15

    Although genomes are being sequenced at an impressive rate, the information generated tells us little about protein function, which is slow to characterize by traditional methods. Automatic protein function annotation based on computational methods has alleviated this imbalance. The most powerful current approach for inferring the function of new proteins is by studying the annotations of their homologues, since their common origin is assumed to be reflected in their structure and function. Unfortunately, as proteins evolve they acquire new functions, so annotation based on homology must be carried out in the context of orthologues or subfamilies. Evolution adds new complications through domain shuffling: homology (or orthology) frequently corresponds to domains rather than complete proteins. Moreover, the function of a protein may be seen as the result of combining the functions of its domains. Additionally, automatic annotation has to deal with problems related to the annotations in the databases: errors (which are likely to be propagated), inconsistencies, or different degrees of function specification. We describe a method that addresses these difficulties for the annotation of protein function. Sequence relationships are detected and measured to obtain a map of the sequence space, which is searched for differentiated groups of proteins (similar to islands on the map), which are expected to have a common function and correspond to groups of orthologues or subfamilies. This mapmaking is done by applying a clustering algorithm based on Normalized cuts in graphs. The domain problem is addressed in a simple way: pairwise local alignments are analyzed to determine the extent to which they cover the entire sequence lengths of the two proteins. This analysis determines both what homologues are preferred for functional inheritance and the level of confidence of the annotation. To alleviate the problems associated with database annotations, the information on all the

  3. Optimizing high performance computing workflow for protein functional annotation.

    PubMed

    Stanberry, Larissa; Rekepalli, Bhanu; Liu, Yuan; Giblock, Paul; Higdon, Roger; Montague, Elizabeth; Broomall, William; Kolker, Natali; Kolker, Eugene

    2014-09-10

    Functional annotation of newly sequenced genomes is one of the major challenges in modern biology. With modern sequencing technologies, the protein sequence universe is rapidly expanding. Newly sequenced bacterial genomes alone contain over 7.5 million proteins. The rate of data generation has far surpassed that of protein annotation. The volume of protein data makes manual curation infeasible, whereas a high compute cost limits the utility of existing automated approaches. In this work, we present an improved and optmized automated workflow to enable large-scale protein annotation. The workflow uses high performance computing architectures and a low complexity classification algorithm to assign proteins into existing clusters of orthologous groups of proteins. On the basis of the Position-Specific Iterative Basic Local Alignment Search Tool the algorithm ensures at least 80% specificity and sensitivity of the resulting classifications. The workflow utilizes highly scalable parallel applications for classification and sequence alignment. Using Extreme Science and Engineering Discovery Environment supercomputers, the workflow processed 1,200,000 newly sequenced bacterial proteins. With the rapid expansion of the protein sequence universe, the proposed workflow will enable scientists to annotate big genome data. PMID:25313296

  4. Optimizing high performance computing workflow for protein functional annotation.

    PubMed

    Stanberry, Larissa; Rekepalli, Bhanu; Liu, Yuan; Giblock, Paul; Higdon, Roger; Montague, Elizabeth; Broomall, William; Kolker, Natali; Kolker, Eugene

    2014-09-10

    Functional annotation of newly sequenced genomes is one of the major challenges in modern biology. With modern sequencing technologies, the protein sequence universe is rapidly expanding. Newly sequenced bacterial genomes alone contain over 7.5 million proteins. The rate of data generation has far surpassed that of protein annotation. The volume of protein data makes manual curation infeasible, whereas a high compute cost limits the utility of existing automated approaches. In this work, we present an improved and optmized automated workflow to enable large-scale protein annotation. The workflow uses high performance computing architectures and a low complexity classification algorithm to assign proteins into existing clusters of orthologous groups of proteins. On the basis of the Position-Specific Iterative Basic Local Alignment Search Tool the algorithm ensures at least 80% specificity and sensitivity of the resulting classifications. The workflow utilizes highly scalable parallel applications for classification and sequence alignment. Using Extreme Science and Engineering Discovery Environment supercomputers, the workflow processed 1,200,000 newly sequenced bacterial proteins. With the rapid expansion of the protein sequence universe, the proposed workflow will enable scientists to annotate big genome data.

  5. A guide to best practices for Gene Ontology (GO) man