Science.gov

Sample records for genome annotation expert

  1. IMG ER: A System for Microbial Genome Annotation Expert Review and Curation

    SciTech Connect

    Markowitz, Victor M.; Mavromatis, Konstantinos; Ivanova, Natalia N.; Chen, I-Min A.; Chu, Ken; Kyrpides, Nikos C.

    2009-05-25

    A rapidly increasing number of microbial genomes are sequenced by organizations worldwide and are eventually included into various public genome data resources. The quality of the annotations depends largely on the original dataset providers, with erroneous or incomplete annotations often carried over into the public resources and difficult to correct. We have developed an Expert Review (ER) version of the Integrated Microbial Genomes (IMG) system, with the goal of supporting systematic and efficient revision of microbial genome annotations. IMG ER provides tools for the review and curation of annotations of both new and publicly available microbial genomes within IMG's rich integrated genome framework. New genome datasets are included into IMG ER prior to their public release either with their native annotations or with annotations generated by IMG ER's annotation pipeline. IMG ER tools allow addressing annotation problems detected with IMG's comparative analysis tools, such as genes missed by gene prediction pipelines or genes without an associated function. Over the past year, IMG ER was used for improving the annotations of about 150 microbial genomes.

  2. Bacterial genome annotation.

    PubMed

    Beckloff, Nicholas; Starkenburg, Shawn; Freitas, Tracey; Chain, Patrick

    2012-01-01

    Annotation of prokaryotic sequences can be separated into structural and functional annotation. Structural annotation is dependent on algorithmic interrogation of experimental evidence to discover the physical characteristics of a gene. This is done in an effort to construct accurate gene models, so understanding function or evolution of genes among organisms is not impeded. Functional annotation is dependent on sequence similarity to other known genes or proteins in an effort to assess the function of the gene. Combining structural and functional annotation across genomes in a comparative manner promotes higher levels of accurate annotation as well as an advanced understanding of genome evolution. As the availability of bacterial sequences increases and annotation methods improve, the value of comparative annotation will increase.

  3. A beginner's guide to eukaryotic genome annotation.

    PubMed

    Yandell, Mark; Ence, Daniel

    2012-04-18

    The falling cost of genome sequencing is having a marked impact on the research community with respect to which genomes are sequenced and how and where they are annotated. Genome annotation projects have generally become small-scale affairs that are often carried out by an individual laboratory. Although annotating a eukaryotic genome assembly is now within the reach of non-experts, it remains a challenging task. Here we provide an overview of the genome annotation process and the available tools and describe some best-practice approaches.

  4. An Introduction to Genome Annotation.

    PubMed

    Campbell, Michael S; Yandell, Mark

    2015-12-17

    Genome projects have evolved from large international undertakings to tractable endeavors for a single lab. Accurate genome annotation is critical for successful genomic, genetic, and molecular biology experiments. These annotations can be generated using a number of approaches and available software tools. This unit describes methods for genome annotation and a number of software tools commonly used in gene annotation.

  5. Human Genome Annotation

    NASA Astrophysics Data System (ADS)

    Gerstein, Mark

    A central problem for 21st century science is annotating the human genome and making this annotation useful for the interpretation of personal genomes. My talk will focus on annotating the 99% of the genome that does not code for canonical genes, concentrating on intergenic features such as structural variants (SVs), pseudogenes (protein fossils), binding sites, and novel transcribed RNAs (ncRNAs). In particular, I will describe how we identify regulatory sites and variable blocks (SVs) based on processing next-generation sequencing experiments. I will further explain how we cluster together groups of sites to create larger annotations. Next, I will discuss a comprehensive pseudogene identification pipeline, which has enabled us to identify >10K pseudogenes in the genome and analyze their distribution with respect to age, protein family, and chromosomal location. Throughout, I will try to introduce some of the computational algorithms and approaches that are required for genome annotation. Much of this work has been carried out in the framework of the ENCODE, modENCODE, and 1000 genomes projects.

  6. Automated Microbial Genome Annotation

    SciTech Connect

    Land, Miriam

    2009-05-29

    Miriam Land of the DOE Joint Genome Institute at Oak Ridge National Laboratory gives a talk on the current state and future challenges of moving toward automated microbial genome annotation at the "Sequencing, Finishing, Analysis in the Future" meeting in Santa Fe, NM

  7. NCBI prokaryotic genome annotation pipeline.

    PubMed

    Tatusova, Tatiana; DiCuccio, Michael; Badretdin, Azat; Chetvernin, Vyacheslav; Nawrocki, Eric P; Zaslavsky, Leonid; Lomsadze, Alexandre; Pruitt, Kim D; Borodovsky, Mark; Ostell, James

    2016-08-19

    Recent technological advances have opened unprecedented opportunities for large-scale sequencing and analysis of populations of pathogenic species in disease outbreaks, as well as for large-scale diversity studies aimed at expanding our knowledge across the whole domain of prokaryotes. To meet the challenge of timely interpretation of structure, function and meaning of this vast genetic information, a comprehensive approach to automatic genome annotation is critically needed. In collaboration with Georgia Tech, NCBI has developed a new approach to genome annotation that combines alignment based methods with methods of predicting protein-coding and RNA genes and other functional elements directly from sequence. A new gene finding tool, GeneMarkS+, uses the combined evidence of protein and RNA placement by homology as an initial map of annotation to generate and modify ab initio gene predictions across the whole genome. Thus, the new NCBI's Prokaryotic Genome Annotation Pipeline (PGAP) relies more on sequence similarity when confident comparative data are available, while it relies more on statistical predictions in the absence of external evidence. The pipeline provides a framework for generation and analysis of annotation on the full breadth of prokaryotic taxonomy. For additional information on PGAP see https://www.ncbi.nlm.nih.gov/genome/annotation_prok/ and the NCBI Handbook, https://www.ncbi.nlm.nih.gov/books/NBK174280/.

  8. Automatic annotation of organellar genomes with DOGMA

    SciTech Connect

    Wyman, Stacia; Jansen, Robert K.; Boore, Jeffrey L.

    2004-06-01

    Dual Organellar GenoMe Annotator (DOGMA) automates the annotation of extra-nuclear organellar (chloroplast and animal mitochondrial) genomes. It is a web-based package that allows the use of comparative BLAST searches to identify and annotate genes in a genome. DOGMA presents a list of putative genes to the user in a graphical format for viewing and editing. Annotations are stored on our password-protected server. Complete annotations can be extracted for direct submission to GenBank. Furthermore, intergenic regions of specified length can be extracted, as well the nucleotide sequences and amino acid sequences of the genes.

  9. Hymenoptera Genome Database: integrating genome annotations in HymenopteraMine

    PubMed Central

    Elsik, Christine G.; Tayal, Aditi; Diesh, Colin M.; Unni, Deepak R.; Emery, Marianne L.; Nguyen, Hung N.; Hagen, Darren E.

    2016-01-01

    We report an update of the Hymenoptera Genome Database (HGD) (http://HymenopteraGenome.org), a model organism database for insect species of the order Hymenoptera (ants, bees and wasps). HGD maintains genomic data for 9 bee species, 10 ant species and 1 wasp, including the versions of genome and annotation data sets published by the genome sequencing consortiums and those provided by NCBI. A new data-mining warehouse, HymenopteraMine, based on the InterMine data warehousing system, integrates the genome data with data from external sources and facilitates cross-species analyses based on orthology. New genome browsers and annotation tools based on JBrowse/WebApollo provide easy genome navigation, and viewing of high throughput sequence data sets and can be used for collaborative genome annotation. All of the genomes and annotation data sets are combined into a single BLAST server that allows users to select and combine sequence data sets to search. PMID:26578564

  10. Hymenoptera Genome Database: integrating genome annotations in HymenopteraMine.

    PubMed

    Elsik, Christine G; Tayal, Aditi; Diesh, Colin M; Unni, Deepak R; Emery, Marianne L; Nguyen, Hung N; Hagen, Darren E

    2016-01-04

    We report an update of the Hymenoptera Genome Database (HGD) (http://HymenopteraGenome.org), a model organism database for insect species of the order Hymenoptera (ants, bees and wasps). HGD maintains genomic data for 9 bee species, 10 ant species and 1 wasp, including the versions of genome and annotation data sets published by the genome sequencing consortiums and those provided by NCBI. A new data-mining warehouse, HymenopteraMine, based on the InterMine data warehousing system, integrates the genome data with data from external sources and facilitates cross-species analyses based on orthology. New genome browsers and annotation tools based on JBrowse/WebApollo provide easy genome navigation, and viewing of high throughput sequence data sets and can be used for collaborative genome annotation. All of the genomes and annotation data sets are combined into a single BLAST server that allows users to select and combine sequence data sets to search.

  11. JGI Plant Genomics Gene Annotation Pipeline

    SciTech Connect

    Shu, Shengqiang; Rokhsar, Dan; Goodstein, David; Hayes, David; Mitros, Therese

    2014-07-14

    Plant genomes vary in size and are highly complex with a high amount of repeats, genome duplication and tandem duplication. Gene encodes a wealth of information useful in studying organism and it is critical to have high quality and stable gene annotation. Thanks to advancement of sequencing technology, many plant species genomes have been sequenced and transcriptomes are also sequenced. To use these vastly large amounts of sequence data to make gene annotation or re-annotation in a timely fashion, an automatic pipeline is needed. JGI plant genomics gene annotation pipeline, called integrated gene call (IGC), is our effort toward this aim with aid of a RNA-seq transcriptome assembly pipeline. It utilizes several gene predictors based on homolog peptides and transcript ORFs. See Methods for detail. Here we present genome annotation of JGI flagship green plants produced by this pipeline plus Arabidopsis and rice except for chlamy which is done by a third party. The genome annotations of these species and others are used in our gene family build pipeline and accessible via JGI Phytozome portal whose URL and front page snapshot are shown below.

  12. Visualizing Genomic Annotations with the UCSC Genome Browser.

    PubMed

    Hung, Jui-Hung; Weng, Zhiping

    2016-11-01

    Genomic data and annotations are rapidly accumulating in databases such as the UCSC Genome Browser, NCBI, and Ensembl. Given the massive scale of these genomic databases, it is important to be able to easily retrieve known data and annotations of a specified genomic locus. For example, for a newly identified cis-regulatory element bound by a transcription factor, questions that immediately come to mind include whether the element is near a transcriptional start site and, if so, the name of the corresponding gene, and whether the histones or DNA at the locus are modified. The UCSC Genome Browser organizes data and annotations (called tracks) around the reference sequences or draft assemblies of many eukaryotic genomes and presents them using a powerful web-based graphical interface. This protocol describes how to use the UCSC Genome Browser to visualize selected tracks at specified genomic regions, download the data and annotations for further analysis, and retrieve multiple sequence alignments and their conservation scores.

  13. Genome Annotation and Curation Using MAKER and MAKER-P

    PubMed Central

    Campbell, Michael S.; Holt, Carson; Moore, Barry; Yandell, Mark

    2014-01-01

    This unit describes how to use the genome annotation and curation tools MAKER and MAKER-P to annotate protein coding and non-coding RNA genes in newly assembled genomes, update/combine legacy annotations in light of new evidence, add quality metrics to annotations from other pipelines, and map existing annotations to a new assembly. MAKER and MAKER-P can rapidly annotate genomes of any size, and scale to match available computational resources. PMID:25501943

  14. The DOE-JGI Standard Operating Procedure for the Annotations of the Microbial Genomes

    SciTech Connect

    Mavromatis, Konstantinos; Ivanova, Natalia; Chen, I-Min A.; Szeto, Ernest; Markowitz, Victor; Kyrpides, Nikos C.

    2009-05-20

    The DOE-JGI Microbial Annotation Pipeline (DOE-JGI MAP) supports gene prediction and/or functional annotation of microbial genomes towards comparative analysis with the Integrated Microbial Genome (IMG) system. DOE-JGI MAP annotation is applied on nucleotide sequence datasets included in the IMG-ER (Expert Review) version of IMG via the IMG ER submission site. Users can submit the sequence datasets consisting of one or more contigs in a multi-fasta file. DOE-JGI MAP annotation includes prediction of protein coding and RNA genes, as well as repeats and assignment of product names to these genes.

  15. Challenges in Whole-Genome Annotation of Pyrosequenced Eukaryotic Genomes

    SciTech Connect

    Kuo, Alan; Grigoriev, Igor

    2009-04-17

    Pyrosequencing technologies such as 454/Roche and Solexa/Illumina vastly lower the cost of nucleotide sequencing compared to the traditional Sanger method, and thus promise to greatly expand the number of sequenced eukaryotic genomes. However, the new technologies also bring new challenges such as shorter reads and new kinds and higher rates of sequencing errors, which complicate genome assembly and gene prediction. At JGI we are deploying 454 technology for the sequencing and assembly of ever-larger eukaryotic genomes. Here we describe our first whole-genome annotation of a purely 454-sequenced fungal genome that is larger than a yeast (>30 Mbp). The pezizomycotine (filamentous ascomycote) Aspergillus carbonarius belongs to the Aspergillus section Nigri species complex, members of which are significant as platforms for bioenergy and bioindustrial technology, as members of soil microbial communities and players in the global carbon cycle, and as agricultural toxigens. Application of a modified version of the standard JGI Annotation Pipeline has so far predicted ~;;10k genes. ~;;12percent of these preliminary annotations suffer a potential frameshift error, which is somewhat higher than the ~;;9percent rate in the Sanger-sequenced and conventionally assembled and annotated genome of fellow Aspergillus section Nigri member A. niger. Also,>90percent of A. niger genes have potential homologs in the A. carbonarius preliminary annotation. Weconclude, and with further annotation and comparative analysis expect to confirm, that 454 sequencing strategies provide a promising substrate for annotation of modestly sized eukaryotic genomes. We will also present results of annotation of a number of other pyrosequenced fungal genomes of bioenergy interest.

  16. Bovine Genome Database: integrated tools for genome annotation and discovery.

    PubMed

    Childers, Christopher P; Reese, Justin T; Sundaram, Jaideep P; Vile, Donald C; Dickens, C Michael; Childs, Kevin L; Salih, Hanni; Bennett, Anna K; Hagen, Darren E; Adelson, David L; Elsik, Christine G

    2011-01-01

    The Bovine Genome Database (BGD; http://BovineGenome.org) strives to improve annotation of the bovine genome and to integrate the genome sequence with other genomics data. BGD includes GBrowse genome browsers, the Apollo Annotation Editor, a quantitative trait loci (QTL) viewer, BLAST databases and gene pages. Genome browsers, available for both scaffold and chromosome coordinate systems, display the bovine Official Gene Set (OGS), RefSeq and Ensembl gene models, non-coding RNA, repeats, pseudogenes, single-nucleotide polymorphism, markers, QTL and alignments to complementary DNAs, ESTs and protein homologs. The Bovine QTL viewer is connected to the BGD Chromosome GBrowse, allowing for the identification of candidate genes underlying QTL. The Apollo Annotation Editor connects directly to the BGD Chado database to provide researchers with remote access to gene evidence in a graphical interface that allows editing and creating new gene models. Researchers may upload their annotations to the BGD server for review and integration into the subsequent release of the OGS. Gene pages display information for individual OGS gene models, including gene structure, transcript variants, functional descriptions, gene symbols, Gene Ontology terms, annotator comments and links to National Center for Biotechnology Information and Ensembl. Each gene page is linked to a wiki page to allow input from the research community.

  17. Towards Viral Genome Annotation Standards, Report from the 2010 NCBI Annotation Workshop.

    PubMed

    Brister, James Rodney; Bao, Yiming; Kuiken, Carla; Lefkowitz, Elliot J; Le Mercier, Philippe; Leplae, Raphael; Madupu, Ramana; Scheuermann, Richard H; Schobel, Seth; Seto, Donald; Shrivastava, Susmita; Sterk, Peter; Zeng, Qiandong; Klimke, William; Tatusova, Tatiana

    2010-10-01

    Improvements in DNA sequencing technologies portend a new era in virology and could possibly lead to a giant leap in our understanding of viral evolution and ecology. Yet, as viral genome sequences begin to fill the world's biological databases, it is critically important to recognize that the scientific promise of this era is dependent on consistent and comprehensive genome annotation. With this in mind, the NCBI Genome Annotation Workshop recently hosted a study group tasked with developing sequence, function, and metadata annotation standards for viral genomes. This report describes the issues involved in viral genome annotation and reviews policy recommendations presented at the NCBI Annotation Workshop.

  18. Towards Viral Genome Annotation Standards, Report from the 2010 NCBI Annotation Workshop

    PubMed Central

    Brister, James Rodney; Bao, Yiming; Kuiken, Carla; Lefkowitz, Elliot J.; Le Mercier, Philippe; Leplae, Raphael; Madupu, Ramana; Scheuermann, Richard H.; Schobel, Seth; Seto, Donald; Shrivastava, Susmita; Sterk, Peter; Zeng, Qiandong; Klimke, William; Tatusova, Tatiana

    2010-01-01

    Improvements in DNA sequencing technologies portend a new era in virology and could possibly lead to a giant leap in our understanding of viral evolution and ecology. Yet, as viral genome sequences begin to fill the world’s biological databases, it is critically important to recognize that the scientific promise of this era is dependent on consistent and comprehensive genome annotation. With this in mind, the NCBI Genome Annotation Workshop recently hosted a study group tasked with developing sequence, function, and metadata annotation standards for viral genomes. This report describes the issues involved in viral genome annotation and reviews policy recommendations presented at the NCBI Annotation Workshop. PMID:21994619

  19. Genomic Data and Annotation from the SEED

    DOE Data Explorer

    Fonstein, Michael; Kogan, Yakov; Osterman, Andrei; Overbeek, Ross; Vonstein, Veronika The Fellowship for Interpretation of Genomes (FIG)

    The SEED Project is a cooperative effort to annotate ever-expanding genomic data so researchers can conduct effective comparative analyses of genomes. Launched in 2003 by the Fellowship for Interpretation of Genomes (FIG), the project is one of several initiatives in ongoing development of data curation systems. SEED is designed to be used by scientists from numerous centers and with varied research objectives. As such, several institutions have since joined FIG in a consortium, including the University of Chicago, DOE’s Argonne National Laboratory (ANL), the University of Illinois at Urbana-Champaign, and others. As one example, ANL has used SEED to develop the National Microbial Pathogen Data Resource. Other agencies and institutions have used the project to discover genome components and clarify gene functions such as metabolism. SEED also has enabled researchers to conduct comparative analyses of closely related genomes and has supported derivation of stoichiometric models to understand metabolic processes. The SEED Project has been extended to support metagenomic samples and concomitant analytical tools. Moreover, the number of genomes being introduced into SEED is growing very rapidly. Building a framework to support this growth while providing highly accurate annotations is centrally important to SEED. The project’s subsystem-based annotation strategy has become the technological foundation for addressing these challenges.(copied from Appendix 7 of Systems Biology Knowledgebase for a New Era in Biology, A Genomics:GTL Report from the May 2008 Workshop, DOE/SC-0113, Grequrick, S; Fredrickson, J.K.; Stevens, R., Pub March 1, 2009.)

  20. Comprehensive Annotation of the Parastagonospora nodorum Reference Genome Using Next-Generation Genomics, Transcriptomics and Proteogenomics.

    PubMed

    Syme, Robert A; Tan, Kar-Chun; Hane, James K; Dodhia, Kejal; Stoll, Thomas; Hastie, Marcus; Furuki, Eiko; Ellwood, Simon R; Williams, Angela H; Tan, Yew-Foon; Testa, Alison C; Gorman, Jeffrey J; Oliver, Richard P

    2016-01-01

    Parastagonospora nodorum, the causal agent of Septoria nodorum blotch (SNB), is an economically important pathogen of wheat (Triticum spp.), and a model for the study of necrotrophic pathology and genome evolution. The reference P. nodorum strain SN15 was the first Dothideomycete with a published genome sequence, and has been used as the basis for comparison within and between species. Here we present an updated reference genome assembly with corrections of SNP and indel errors in the underlying genome assembly from deep resequencing data as well as extensive manual annotation of gene models using transcriptomic and proteomic sources of evidence (https://github.com/robsyme/Parastagonospora_nodorum_SN15). The updated assembly and annotation includes 8,366 genes with modified protein sequence and 866 new genes. This study shows the benefits of using a wide variety of experimental methods allied to expert curation to generate a reliable set of gene models.

  1. Community annotation and bioinformatics workforce development in concert--Little Skate Genome Annotation Workshops and Jamborees.

    PubMed

    Wang, Qinghua; Arighi, Cecilia N; King, Benjamin L; Polson, Shawn W; Vincent, James; Chen, Chuming; Huang, Hongzhan; Kingham, Brewster F; Page, Shallee T; Rendino, Marc Farnum; Thomas, William Kelley; Udwary, Daniel W; Wu, Cathy H

    2012-01-01

    Recent advances in high-throughput DNA sequencing technologies have equipped biologists with a powerful new set of tools for advancing research goals. The resulting flood of sequence data has made it critically important to train the next generation of scientists to handle the inherent bioinformatic challenges. The North East Bioinformatics Collaborative (NEBC) is undertaking the genome sequencing and annotation of the little skate (Leucoraja erinacea) to promote advancement of bioinformatics infrastructure in our region, with an emphasis on practical education to create a critical mass of informatically savvy life scientists. In support of the Little Skate Genome Project, the NEBC members have developed several annotation workshops and jamborees to provide training in genome sequencing, annotation and analysis. Acting as a nexus for both curation activities and dissemination of project data, a project web portal, SkateBase (http://skatebase.org) has been developed. As a case study to illustrate effective coupling of community annotation with workforce development, we report the results of the Mitochondrial Genome Annotation Jamborees organized to annotate the first completely assembled element of the Little Skate Genome Project, as a culminating experience for participants from our three prior annotation workshops. We are applying the physical/virtual infrastructure and lessons learned from these activities to enhance and streamline the genome annotation workflow, as we look toward our continuing efforts for larger-scale functional and structural community annotation of the L. erinacea genome.

  2. Towards a Library of Standard Operating Procedures (SOPs) for (meta)genomic annotation

    SciTech Connect

    Kyrpides, Nikos; Angiuoli, Samuel V.; Cochrane, Guy; Field, Dawn; Garrity, George; Gussman, Aaron; Kodira, Chinnappa D.; Klimke, William; Kyrpides, Nikos; Madupu, Ramana; Markowitz, Victor; Tatusova, Tatiana; Thomson, Nick; White, Owen

    2008-04-01

    Genome annotations describe the features of genomes and accompany sequences in genome databases. The methodologies used to generate genome annotation are diverse and typically vary amongst groups. Descriptions of the annotation procedure are helpful in interpreting genome annotation data. Standard Operating Procedures (SOPs) for genome annotation describe the processes that generate genome annotations. Some groups are currently documenting procedures but standards are lacking for structure and content of annotation SOPs. In addition, there is no central repository to store and disseminate procedures and protocols for genome annotation. We highlight the importance of SOPs for genome annotation and endorse a central online repository of SOPs.

  3. FIGENIX: Intelligent automation of genomic annotation: expertise integration in a new software platform

    PubMed Central

    Gouret, Philippe; Vitiello, Vérane; Balandraud, Nathalie; Gilles, André; Pontarotti, Pierre; Danchin, Etienne GJ

    2005-01-01

    Background Two of the main objectives of the genomic and post-genomic era are to structurally and functionally annotate genomes which consists of detecting genes' position and structure, and inferring their function (as well as of other features of genomes). Structural and functional annotation both require the complex chaining of numerous different software, algorithms and methods under the supervision of a biologist. The automation of these pipelines is necessary to manage huge amounts of data released by sequencing projects. Several pipelines already automate some of these complex chaining but still necessitate an important contribution of biologists for supervising and controlling the results at various steps. Results Here we propose an innovative automated platform, FIGENIX, which includes an expert system capable to substitute to human expertise at several key steps. FIGENIX currently automates complex pipelines of structural and functional annotation under the supervision of the expert system (which allows for example to make key decisions, check intermediate results or refine the dataset). The quality of the results produced by FIGENIX is comparable to those obtained by expert biologists with a drastic gain in terms of time costs and avoidance of errors due to the human manipulation of data. Conclusion The core engine and expert system of the FIGENIX platform currently handle complex annotation processes of broad interest for the genomic community. They could be easily adapted to new, or more specialized pipelines, such as for example the annotation of miRNAs, the classification of complex multigenic families, annotation of regulatory elements and other genomic features of interest. PMID:16083500

  4. Gene calling and bacterial genome annotation with BG7.

    PubMed

    Tobes, Raquel; Pareja-Tobes, Pablo; Manrique, Marina; Pareja-Tobes, Eduardo; Kovach, Evdokim; Alekhin, Alexey; Pareja, Eduardo

    2015-01-01

    New massive sequencing technologies are providing many bacterial genome sequences from diverse taxa but a refined annotation of these genomes is crucial for obtaining scientific findings and new knowledge. Thus, bacterial genome annotation has emerged as a key point to investigate in bacteria. Any efficient tool designed specifically to annotate bacterial genomes sequenced with massively parallel technologies has to consider the specific features of bacterial genomes (absence of introns and scarcity of nonprotein-coding sequence) and of next-generation sequencing (NGS) technologies (presence of errors and not perfectly assembled genomes). These features make it convenient to focus on coding regions and, hence, on protein sequences that are the elements directly related with biological functions. In this chapter we describe how to annotate bacterial genomes with BG7, an open-source tool based on a protein-centered gene calling/annotation paradigm. BG7 is specifically designed for the annotation of bacterial genomes sequenced with NGS. This tool is sequence error tolerant maintaining their capabilities for the annotation of highly fragmented genomes or for annotating mixed sequences coming from several genomes (as those obtained through metagenomics samples). BG7 has been designed with scalability as a requirement, with a computing infrastructure completely based on cloud computing (Amazon Web Services).

  5. Genome-wide metabolic (re-) annotation of Kluyveromyces lactis

    PubMed Central

    2012-01-01

    Background Even before having its genome sequence published in 2004, Kluyveromyces lactis had long been considered a model organism for studies in genetics and physiology. Research on Kluyveromyces lactis is quite advanced and this yeast species is one of the few with which it is possible to perform formal genetic analysis. Nevertheless, until now, no complete metabolic functional annotation has been performed to the proteins encoded in the Kluyveromyces lactis genome. Results In this work, a new metabolic genome-wide functional re-annotation of the proteins encoded in the Kluyveromyces lactis genome was performed, resulting in the annotation of 1759 genes with metabolic functions, and the development of a methodology supported by merlin (software developed in-house). The new annotation includes novelties, such as the assignment of transporter superfamily numbers to genes identified as transporter proteins. Thus, the genes annotated with metabolic functions could be exclusively enzymatic (1410 genes), transporter proteins encoding genes (301 genes) or have both metabolic activities (48 genes). The new annotation produced by this work largely surpassed the Kluyveromyces lactis currently available annotations. A comparison with KEGG’s annotation revealed a match with 844 (~90%) of the genes annotated by KEGG, while adding 850 new gene annotations. Moreover, there are 32 genes with annotations different from KEGG. Conclusions The methodology developed throughout this work can be used to re-annotate any yeast or, with a little tweak of the reference organism, the proteins encoded in any sequenced genome. The new annotation provided by this study offers basic knowledge which might be useful for the scientific community working on this model yeast, because new functions have been identified for the so-called metabolic genes. Furthermore, it served as the basis for the reconstruction of a compartmentalized, genome-scale metabolic model of Kluyveromyces lactis, which is

  6. The Vertebrate Genome Annotation browser 10 years on.

    PubMed

    Harrow, Jennifer L; Steward, Charles A; Frankish, Adam; Gilbert, James G; Gonzalez, Jose M; Loveland, Jane E; Mudge, Jonathan; Sheppard, Dan; Thomas, Mark; Trevanion, Stephen; Wilming, Laurens G

    2014-01-01

    The Vertebrate Genome Annotation (VEGA) database (http://vega.sanger.ac.uk), initially designed as a community resource for browsing manual annotation of the human genome project, now contains five reference genomes (human, mouse, zebrafish, pig and rat). Its introduction pages have been redesigned to enable the user to easily navigate between whole genomes and smaller multi-species haplotypic regions of interest such as the major histocompatibility complex. The VEGA browser is unique in that annotation is updated via the Human And Vertebrate Analysis aNd Annotation (HAVANA) update track every 2 weeks, allowing single gene updates to be made publicly available to the research community quickly. The user can now access different haplotypic subregions more easily, such as those from the non-obese diabetic mouse, and display them in a more intuitive way using the comparative tools. We also highlight how the user can browse manually annotated updated patches from the Genome Reference Consortium (GRC).

  7. Genomic variant annotation workflow for clinical applications

    PubMed Central

    Thurnherr, Thomas; Singer, Franziska; Stekhoven, Daniel J.; Beerenwinkel, Niko

    2016-01-01

    Annotation and interpretation of DNA aberrations identified through next-generation sequencing is becoming an increasingly important task. Even more so in the context of data analysis pipelines for medical applications, where genomic aberrations are associated with phenotypic and clinical features. Here we describe a workflow to identify potential gene targets in aberrated genes or pathways and their corresponding drugs. To this end, we provide the R/Bioconductor package rDGIdb, an R wrapper to query the drug-gene interaction database (DGIdb). DGIdb accumulates drug-gene interaction data from 15 different resources and allows filtering on different levels. The rDGIdb package makes these resources and tools available to R users. Moreover, rDGIdb queries can be automated through incorporation of the rDGIdb package into NGS sequencing pipelines. PMID:27990260

  8. Towards Automated Annotation of Benthic Survey Images: Variability of Human Experts and Operational Modes of Automation

    PubMed Central

    Beijbom, Oscar; Edmunds, Peter J.; Roelfsema, Chris; Smith, Jennifer; Kline, David I.; Neal, Benjamin P.; Dunlap, Matthew J.; Moriarty, Vincent; Fan, Tung-Yung; Tan, Chih-Jui; Chan, Stephen; Treibitz, Tali; Gamst, Anthony; Mitchell, B. Greg; Kriegman, David

    2015-01-01

    Global climate change and other anthropogenic stressors have heightened the need to rapidly characterize ecological changes in marine benthic communities across large scales. Digital photography enables rapid collection of survey images to meet this need, but the subsequent image annotation is typically a time consuming, manual task. We investigated the feasibility of using automated point-annotation to expedite cover estimation of the 17 dominant benthic categories from survey-images captured at four Pacific coral reefs. Inter- and intra- annotator variability among six human experts was quantified and compared to semi- and fully- automated annotation methods, which are made available at coralnet.ucsd.edu. Our results indicate high expert agreement for identification of coral genera, but lower agreement for algal functional groups, in particular between turf algae and crustose coralline algae. This indicates the need for unequivocal definitions of algal groups, careful training of multiple annotators, and enhanced imaging technology. Semi-automated annotation, where 50% of the annotation decisions were performed automatically, yielded cover estimate errors comparable to those of the human experts. Furthermore, fully-automated annotation yielded rapid, unbiased cover estimates but with increased variance. These results show that automated annotation can increase spatial coverage and decrease time and financial outlay for image-based reef surveys. PMID:26154157

  9. Correction of the Caulobacter crescentus NA1000 genome annotation.

    PubMed

    Ely, Bert; Scott, LaTia Etheredge

    2014-01-01

    Bacterial genome annotations are accumulating rapidly in the GenBank database and the use of automated annotation technologies to create these annotations has become the norm. However, these automated methods commonly result in a small, but significant percentage of genome annotation errors. To improve accuracy and reliability, we analyzed the Caulobacter crescentus NA1000 genome utilizing computer programs Artemis and MICheck to manually examine the third codon position GC content, alignment to a third codon position GC frame plot peak, and matches in the GenBank database. We identified 11 new genes, modified the start site of 113 genes, and changed the reading frame of 38 genes that had been incorrectly annotated. Furthermore, our manual method of identifying protein-coding genes allowed us to remove 112 non-coding regions that had been designated as coding regions. The improved NA1000 genome annotation resulted in a reduction in the use of rare codons since noncoding regions with atypical codon usage were removed from the annotation and 49 new coding regions were added to the annotation. Thus, a more accurate codon usage table was generated as well. These results demonstrate that a comparison of the location of peaks third codon position GC content to the location of protein coding regions could be used to verify the annotation of any genome that has a GC content that is greater than 60%.

  10. Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies.

    PubMed

    Haas, Brian J; Delcher, Arthur L; Mount, Stephen M; Wortman, Jennifer R; Smith, Roger K; Hannick, Linda I; Maiti, Rama; Ronning, Catherine M; Rusch, Douglas B; Town, Christopher D; Salzberg, Steven L; White, Owen

    2003-10-01

    The spliced alignment of expressed sequence data to genomic sequence has proven a key tool in the comprehensive annotation of genes in eukaryotic genomes. A novel algorithm was developed to assemble clusters of overlapping transcript alignments (ESTs and full-length cDNAs) into maximal alignment assemblies, thereby comprehensively incorporating all available transcript data and capturing subtle splicing variations. Complete and partial gene structures identified by this method were used to improve The Institute for Genomic Research Arabidopsis genome annotation (TIGR release v.4.0). The alignment assemblies permitted the automated modeling of several novel genes and >1000 alternative splicing variations as well as updates (including UTR annotations) to nearly half of the approximately 27 000 annotated protein coding genes. The algorithm of the Program to Assemble Spliced Alignments (PASA) tool is described, as well as the results of automated updates to Arabidopsis gene annotations.

  11. AGMIAL: implementing an annotation strategy for prokaryote genomes as a distributed system

    PubMed Central

    Bryson, K.; Loux, V.; Bossy, R.; Nicolas, P.; Chaillou, S.; van de Guchte, M.; Penaud, S.; Maguin, E.; Hoebeke, M.; Bessières, P.; Gibrat, J-F

    2006-01-01

    We have implemented a genome annotation system for prokaryotes called AGMIAL. Our approach embodies a number of key principles. First, expert manual annotators are seen as a critical component of the overall system; user interfaces were cyclically refined to satisfy their needs. Second, the overall process should be orchestrated in terms of a global annotation strategy; this facilitates coordination between a team of annotators and automatic data analysis. Third, the annotation strategy should allow progressive and incremental annotation from a time when only a few draft contigs are available, to when a final finished assembly is produced. The overall architecture employed is modular and extensible, being based on the W3 standard Web services framework. Specialized modules interact with two independent core modules that are used to annotate, respectively, genomic and protein sequences. AGMIAL is currently being used by several INRA laboratories to analyze genomes of bacteria relevant to the food-processing industry, and is distributed under an open source license. PMID:16855290

  12. Genix: a new online automated pipeline for bacterial genome annotation.

    PubMed

    Kremer, Frederico Schmitt; Eslabão, Marcus Redü; Dellagostin, Odir Antônio; Pinto, Luciano da Silva

    2016-12-01

    Next-generation sequencing has significantly reduced the cost of genome-sequencing projects, resulting in an expressive increase in the availability of genomic data in public databases. The cheaper and easier is to sequence new genomes, the more accurate the annotation steps have to be to avoid both the loss of information and the accumulation of erroneous features that may affect the accuracy of further analysis. In the case of bacteria genomes, a range of web annotation software has been developed; however, many applications have yet to incorporate the steps required to improve their result, including the removal of false-positive/spurious and a more complete identification of non-coding features. We present Genix, a new web-based bacterial genome annotation pipeline. A comparison of the results generated by Genix for four reference genomes against those generated by other annotation tools indicated that our pipeline is able to provide results that are closer to the reference genome annotation, with a smaller amount of false-positive proteins and missing functional annotated proteins. Additionally, the metrics obtained by Genix were slightly better than those obtained by Prokka, a state-of-art standalone annotation system. Our results indicate that Genix is a useful tool that is able to provide a more refined result, and may be a user-friendly way to obtain high-quality results.

  13. Comparative omics-driven genome annotation refinement: application across Yersiniae.

    PubMed

    Schrimpe-Rutledge, Alexandra C; Jones, Marcus B; Chauhan, Sadhana; Purvine, Samuel O; Sanford, James A; Monroe, Matthew E; Brewer, Heather M; Payne, Samuel H; Ansong, Charles; Frank, Bryan C; Smith, Richard D; Peterson, Scott N; Motin, Vladimir L; Adkins, Joshua N

    2012-01-01

    Genome sequencing continues to be a rapidly evolving technology, yet most downstream aspects of genome annotation pipelines remain relatively stable or are even being abandoned. The annotation process is now performed almost exclusively in an automated fashion to balance the large number of sequences generated. One possible way of reducing errors inherent to automated computational annotations is to apply data from omics measurements (i.e. transcriptional and proteomic) to the un-annotated genome with a proteogenomic-based approach. Here, the concept of annotation refinement has been extended to include a comparative assessment of genomes across closely related species. Transcriptomic and proteomic data derived from highly similar pathogenic Yersiniae (Y. pestis CO92, Y. pestis Pestoides F, and Y. pseudotuberculosis PB1/+) was used to demonstrate a comprehensive comparative omic-based annotation methodology. Peptide and oligo measurements experimentally validated the expression of nearly 40% of each strain's predicted proteome and revealed the identification of 28 novel and 68 incorrect (i.e., observed frameshifts, extended start sites, and translated pseudogenes) protein-coding sequences within the three current genome annotations. Gene loss is presumed to play a major role in Y. pestis acquiring its niche as a virulent pathogen, thus the discovery of many translated pseudogenes, including the insertion-ablated argD, underscores a need for functional analyses to investigate hypotheses related to divergence. Refinements included the discovery of a seemingly essential ribosomal protein, several virulence-associated factors, a transcriptional regulator, and many hypothetical proteins that were missed during annotation.

  14. Solving the Problem: Genome Annotation Standards before the Data Deluge

    PubMed Central

    Klimke, William; O'Donovan, Claire; White, Owen; Brister, J. Rodney; Clark, Karen; Fedorov, Boris; Mizrachi, Ilene; Pruitt, Kim D.; Tatusova, Tatiana

    2011-01-01

    The promise of genome sequencing was that the vast undiscovered country would be mapped out by comparison of the multitude of sequences available and would aid researchers in deciphering the role of each gene in every organism. Researchers recognize that there is a need for high quality data. However, different annotation procedures, numerous databases, and a diminishing percentage of experimentally determined gene functions have resulted in a spectrum of annotation quality. NCBI in collaboration with sequencing centers, archival databases, and researchers, has developed the first international annotation standards, a fundamental step in ensuring that high quality complete prokaryotic genomes are available as gold standard references. Highlights include the development of annotation assessment tools, community acceptance of protein naming standards, comparison of annotation resources to provide consistent annotation, and improved tracking of the evidence used to generate a particular annotation. The development of a set of minimal standards, including the requirement for annotated complete prokaryotic genomes to contain a full set of ribosomal RNAs, transfer RNAs, and proteins encoding core conserved functions, is an historic milestone. The use of these standards in existing genomes and future submissions will increase the quality of databases, enabling researchers to make accurate biological discoveries. PMID:22180819

  15. Bovine Genome Database: supporting community annotation and analysis of the Bos taurus genome

    PubMed Central

    2010-01-01

    Background A goal of the Bovine Genome Database (BGD; http://BovineGenome.org) has been to support the Bovine Genome Sequencing and Analysis Consortium (BGSAC) in the annotation and analysis of the bovine genome. We were faced with several challenges, including the need to maintain consistent quality despite diversity in annotation expertise in the research community, the need to maintain consistent data formats, and the need to minimize the potential duplication of annotation effort. With new sequencing technologies allowing many more eukaryotic genomes to be sequenced, the demand for collaborative annotation is likely to increase. Here we present our approach, challenges and solutions facilitating a large distributed annotation project. Results and Discussion BGD has provided annotation tools that supported 147 members of the BGSAC in contributing 3,871 gene models over a fifteen-week period, and these annotations have been integrated into the bovine Official Gene Set. Our approach has been to provide an annotation system, which includes a BLAST site, multiple genome browsers, an annotation portal, and the Apollo Annotation Editor configured to connect directly to our Chado database. In addition to implementing and integrating components of the annotation system, we have performed computational analyses to create gene evidence tracks and a consensus gene set, which can be viewed on individual gene pages at BGD. Conclusions We have provided annotation tools that alleviate challenges associated with distributed annotation. Our system provides a consistent set of data to all annotators and eliminates the need for annotators to format data. Involving the bovine research community in genome annotation has allowed us to leverage expertise in various areas of bovine biology to provide biological insight into the genome sequence. PMID:21092105

  16. AnnotCompute: annotation-based exploration and meta-analysis of genomics experiments

    PubMed Central

    Zheng, Jie; Stoyanovich, Julia; Manduchi, Elisabetta; Liu, Junmin; Stoeckert, Christian J.

    2011-01-01

    The ever-increasing scale of biological data sets, particularly those arising in the context of high-throughput technologies, requires the development of rich data exploration tools. In this article, we present AnnotCompute, an information discovery platform for repositories of functional genomics experiments such as ArrayExpress. Our system leverages semantic annotations of functional genomics experiments with controlled vocabulary and ontology terms, such as those from the MGED Ontology, to compute conceptual dissimilarities between pairs of experiments. These dissimilarities are then used to support two types of exploratory analysis—clustering and query-by-example. We show that our proposed dissimilarity measures correspond to a user's intuition about conceptual dissimilarity, and can be used to support effective query-by-example. We also evaluate the quality of clustering based on these measures. While AnnotCompute can support a richer data exploration experience, its effectiveness is limited in some cases, due to the quality of available annotations. Nonetheless, tools such as AnnotCompute may provide an incentive for richer annotations of experiments. Code is available for download at http://www.cbil.upenn.edu/downloads/AnnotCompute. Database URL: http://www.cbil.upenn.edu/annotCompute/ PMID:22190598

  17. The 2008 update of the Aspergillus nidulans genome annotation: a community effort

    PubMed Central

    Wortman, Jennifer Russo; Gilsenan, Jane Mabey; Joardar, Vinita; Deegan, Jennifer; Clutterbuck, John; Andersen, Mikael R.; Archer, David; Bencina, Mojca; Braus, Gerhard; Coutinho, Pedro; von Döhren, Hans; Doonan, John; Driessen, Arnold J.M.; Durek, Pawel; Espeso, Eduardo; Fekete, Erzsébet; Flipphi, Michel; Estrada, Carlos Garcia; Geysens, Steven; Goldman, Gustavo; de Groot, Piet W.J.; Hansen, Kim; Harris, Steven D.; Heinekamp, Thorsten; Helmstaedt, Kerstin; Henrissat, Bernard; Hofmann, Gerald; Homan, Tim; Horio, Tetsuya; Horiuchi, Hiroyuki; James, Steve; Jones, Meriel; Karaffa, Levente; Karányi, Zsolt; Kato, Masashi; Keller, Nancy; Kelly, Diane E.; Kiel, Jan A.K.W.; Kim, Jung-Mi; van der Klei, Ida J.; Klis, Frans M.; Kovalchuk, Andriy; Kraševec, Nada; Kubicek, Christian P.; Liu, Bo; MacCabe, Andrew; Meyer, Vera; Mirabito, Pete; Miskei, Márton; Mos, Magdalena; Mullins, Jonathan; Nelson, David R.; Nielsen, Jens; Oakley, Berl R.; Osmani, Stephen A.; Pakula, Tiina; Paszewski, Andrzej; Paulsen, Ian; Pilsyk, Sebastian; Pócsi, István; Punt, Peter J.; Ram, Arthur F.J.; Ren, Qinghu; Robellet, Xavier; Robson, Geoff; Seiboth, Bernhard; Solingen, Piet van; Specht, Thomas; Sun, Jibin; Taheri-Talesh, Naimeh; Takeshita, Norio; Ussery, Dave; vanKuyk, Patricia A.; Visser, Hans; van de Vondervoort, Peter J.I.; de Vries, Ronald P.; Walton, Jonathan; Xiang, Xin; Xiong, Yi; Zeng, An Ping; Brandt, Bernd W.; Cornell, Michael J.; van den Hondel, Cees A.M.J.J.; Visser, Jacob; Oliver, Stephen G.; Turner, Geoffrey

    2010-01-01

    The identification and annotation of protein-coding genes is one of the primary goals of whole-genome sequencing projects, and the accuracy of predicting the primary protein products of gene expression is vital to the interpretation of the available data and the design of downstream functional applications. Nevertheless, the comprehensive annotation of eukaryotic genomes remains a considerable challenge. Many genomes submitted to public databases, including those of major model organisms, contain significant numbers of wrong and incomplete gene predictions. We present a community-based reannotation of the Aspergillus nidulans genome with the primary goal of increasing the number and quality of protein functional assignments through the careful review of experts in the field of fungal biology. PMID:19146970

  18. Comparative Omics-Driven Genome Annotation Refinement: Application across Yersiniae

    SciTech Connect

    Rutledge, Alexandra C.; Jones, Marcus B.; Chauhan, Sadhana; Purvine, Samuel O.; Sanford, James; Monroe, Matthew E.; Brewer, Heather M.; Payne, Samuel H.; Ansong, Charles; Frank, Bryan C.; Smith, Richard D.; Peterson, Scott; Motin, Vladimir L.; Adkins, Joshua N.

    2012-03-27

    Genome sequencing continues to be a rapidly evolving technology, yet most downstream aspects of genome annotation pipelines remain relatively stable or are even being abandoned. To date, the perceived value of manual curation for genome annotations is not offset by the real cost and time associated with the process. In order to balance the large number of sequences generated, the annotation process is now performed almost exclusively in an automated fashion for most genome sequencing projects. One possible way to reduce errors inherent to automated computational annotations is to apply data from 'omics' measurements (i.e. transcriptional and proteomic) to the un-annotated genome with a proteogenomic-based approach. This approach does require additional experimental and bioinformatics methods to include omics technologies; however, the approach is readily automatable and can benefit from rapid developments occurring in those research domains as well. The annotation process can be improved by experimental validation of transcription and translation and aid in the discovery of annotation errors. Here the concept of annotation refinement has been extended to include a comparative assessment of genomes across closely related species, as is becoming common in sequencing efforts. Transcriptomic and proteomic data derived from three highly similar pathogenic Yersiniae (Y. pestis CO92, Y. pestis pestoides F, and Y. pseudotuberculosis PB1/+) was used to demonstrate a comprehensive comparative omic-based annotation methodology. Peptide and oligo measurements experimentally validated the expression of nearly 40% of each strain's predicted proteome and revealed the identification of 28 novel and 68 previously incorrect protein-coding sequences (e.g., observed frameshifts, extended start sites, and translated pseudogenes) within the three current Yersinia genome annotations. Gene loss is presumed to play a major role in Y. pestis acquiring its niche as a virulent pathogen, thus

  19. AnnotCompute: annotation-based exploration and meta-analysis of genomics experiments.

    PubMed

    Zheng, Jie; Stoyanovich, Julia; Manduchi, Elisabetta; Liu, Junmin; Stoeckert, Christian J

    2011-01-01

    The ever-increasing scale of biological data sets, particularly those arising in the context of high-throughput technologies, requires the development of rich data exploration tools. In this article, we present AnnotCompute, an information discovery platform for repositories of functional genomics experiments such as ArrayExpress. Our system leverages semantic annotations of functional genomics experiments with controlled vocabulary and ontology terms, such as those from the MGED Ontology, to compute conceptual dissimilarities between pairs of experiments. These dissimilarities are then used to support two types of exploratory analysis-clustering and query-by-example. We show that our proposed dissimilarity measures correspond to a user's intuition about conceptual dissimilarity, and can be used to support effective query-by-example. We also evaluate the quality of clustering based on these measures. While AnnotCompute can support a richer data exploration experience, its effectiveness is limited in some cases, due to the quality of available annotations. Nonetheless, tools such as AnnotCompute may provide an incentive for richer annotations of experiments. Database URL: http://www.cbil.upenn.edu/annotCompute/

  20. RASTtk: A modular and extensible implementation of the RAST algorithm for building custom annotation pipelines and annotating batches of genomes

    PubMed Central

    Brettin, Thomas; Davis, James J.; Disz, Terry; Edwards, Robert A.; Gerdes, Svetlana; Olsen, Gary J.; Olson, Robert; Overbeek, Ross; Parrello, Bruce; Pusch, Gordon D.; Shukla, Maulik; Thomason, James A.; Stevens, Rick; Vonstein, Veronika; Wattam, Alice R.; Xia, Fangfang

    2015-01-01

    The RAST (Rapid Annotation using Subsystem Technology) annotation engine was built in 2008 to annotate bacterial and archaeal genomes. It works by offering a standard software pipeline for identifying genomic features (i.e., protein-encoding genes and RNA) and annotating their functions. Recently, in order to make RAST a more useful research tool and to keep pace with advancements in bioinformatics, it has become desirable to build a version of RAST that is both customizable and extensible. In this paper, we describe the RAST tool kit (RASTtk), a modular version of RAST that enables researchers to build custom annotation pipelines. RASTtk offers a choice of software for identifying and annotating genomic features as well as the ability to add custom features to an annotation job. RASTtk also accommodates the batch submission of genomes and the ability to customize annotation protocols for batch submissions. This is the first major software restructuring of RAST since its inception. PMID:25666585

  1. RASTtk: A modular and extensible implementation of the RAST algorithm for building custom annotation pipelines and annotating batches of genomes

    SciTech Connect

    Brettin, Thomas; Davis, James J.; Disz, Terry; Edwards, Robert A.; Gerdes, Svetlana; Olsen, Gary J.; Olson, Robert; Overbeek, Ross; Parrello, Bruce; Pusch, Gordon D.; Shukla, Maulik; Thomason, III, James A.; Stevens, Rick; Vonstein, Veronika; Wattam, Alice R.; Xia, Fangfang

    2015-02-10

    The RAST (Rapid Annotation using Subsystem Technology) annotation engine was built in 2008 to annotate bacterial and archaeal genomes. It works by offering a standard software pipeline for identifying genomic features (i.e., protein-encoding genes and RNA) and annotating their functions. Recently, in order to make RAST a more useful research tool and to keep pace with advancements in bioinformatics, it has become desirable to build a version of RAST that is both customizable and extensible. In this paper, we describe the RAST tool kit (RASTtk), a modular version of RAST that enables researchers to build custom annotation pipelines. RASTtk offers a choice of software for identifying and annotating genomic features as well as the ability to add custom features to an annotation job. RASTtk also accommodates the batch submission of genomes and the ability to customize annotation protocols for batch submissions. This is the first major software restructuring of RAST since its inception.

  2. RASTtk: a modular and extensible implementation of the RAST algorithm for building custom annotation pipelines and annotating batches of genomes.

    PubMed

    Brettin, Thomas; Davis, James J; Disz, Terry; Edwards, Robert A; Gerdes, Svetlana; Olsen, Gary J; Olson, Robert; Overbeek, Ross; Parrello, Bruce; Pusch, Gordon D; Shukla, Maulik; Thomason, James A; Stevens, Rick; Vonstein, Veronika; Wattam, Alice R; Xia, Fangfang

    2015-02-10

    The RAST (Rapid Annotation using Subsystem Technology) annotation engine was built in 2008 to annotate bacterial and archaeal genomes. It works by offering a standard software pipeline for identifying genomic features (i.e., protein-encoding genes and RNA) and annotating their functions. Recently, in order to make RAST a more useful research tool and to keep pace with advancements in bioinformatics, it has become desirable to build a version of RAST that is both customizable and extensible. In this paper, we describe the RAST tool kit (RASTtk), a modular version of RAST that enables researchers to build custom annotation pipelines. RASTtk offers a choice of software for identifying and annotating genomic features as well as the ability to add custom features to an annotation job. RASTtk also accommodates the batch submission of genomes and the ability to customize annotation protocols for batch submissions. This is the first major software restructuring of RAST since its inception.

  3. Machine learning and genome annotation: a match meant to be?

    PubMed Central

    2013-01-01

    By its very nature, genomics produces large, high-dimensional datasets that are well suited to analysis by machine learning approaches. Here, we explain some key aspects of machine learning that make it useful for genome annotation, with illustrative examples from ENCODE. PMID:23731483

  4. Annotation of the Protein Coding Regions of the Equine Genome.

    PubMed

    Hestand, Matthew S; Kalbfleisch, Theodore S; Coleman, Stephen J; Zeng, Zheng; Liu, Jinze; Orlando, Ludovic; MacLeod, James N

    2015-01-01

    Current gene annotation of the horse genome is largely derived from in silico predictions and cross-species alignments. Only a small number of genes are annotated based on equine EST and mRNA sequences. To expand the number of equine genes annotated from equine experimental evidence, we sequenced mRNA from a pool of forty-three different tissues. From these, we derived the structures of 68,594 transcripts. In addition, we identified 301,829 positions with SNPs or small indels within these transcripts relative to EquCab2. Interestingly, 780 variants extend the open reading frame of the transcript and appear to be small errors in the equine reference genome, since they are also identified as homozygous variants by genomic DNA resequencing of the reference horse. Taken together, we provide a resource of equine mRNA structures and protein coding variants that will enhance equine and cross-species transcriptional and genomic comparisons.

  5. Annotation of the Protein Coding Regions of the Equine Genome

    PubMed Central

    Hestand, Matthew S.; Kalbfleisch, Theodore S.; Coleman, Stephen J.; Zeng, Zheng; Liu, Jinze; Orlando, Ludovic; MacLeod, James N.

    2015-01-01

    Current gene annotation of the horse genome is largely derived from in silico predictions and cross-species alignments. Only a small number of genes are annotated based on equine EST and mRNA sequences. To expand the number of equine genes annotated from equine experimental evidence, we sequenced mRNA from a pool of forty-three different tissues. From these, we derived the structures of 68,594 transcripts. In addition, we identified 301,829 positions with SNPs or small indels within these transcripts relative to EquCab2. Interestingly, 780 variants extend the open reading frame of the transcript and appear to be small errors in the equine reference genome, since they are also identified as homozygous variants by genomic DNA resequencing of the reference horse. Taken together, we provide a resource of equine mRNA structures and protein coding variants that will enhance equine and cross-species transcriptional and genomic comparisons. PMID:26107351

  6. OMIGA: Optimized Maker-Based Insect Genome Annotation.

    PubMed

    Liu, Jinding; Xiao, Huamei; Huang, Shuiqing; Li, Fei

    2014-08-01

    Insects are one of the largest classes of animals on Earth and constitute more than half of all living species. The i5k initiative has begun sequencing of more than 5,000 insect genomes, which should greatly help in exploring insect resource and pest control. Insect genome annotation remains challenging because many insects have high levels of heterozygosity. To improve the quality of insect genome annotation, we developed a pipeline, named Optimized Maker-Based Insect Genome Annotation (OMIGA), to predict protein-coding genes from insect genomes. We first mapped RNA-Seq reads to genomic scaffolds to determine transcribed regions using Bowtie, and the putative transcripts were assembled using Cufflink. We then selected highly reliable transcripts with intact coding sequences to train de novo gene prediction software, including Augustus. The re-trained software was used to predict genes from insect genomes. Exonerate was used to refine gene structure and to determine near exact exon/intron boundary in the genome. Finally, we used the software Maker to integrate data from RNA-Seq, de novo gene prediction, and protein alignment to produce an official gene set. The OMIGA pipeline was used to annotate the draft genome of an important insect pest, Chilo suppressalis, yielding 12,548 genes. Different strategies were compared, which demonstrated that OMIGA had the best performance. In summary, we present a comprehensive pipeline for identifying genes in insect genomes that can be widely used to improve the annotation quality in insects. OMIGA is provided at http://ento.njau.edu.cn/omiga.html .

  7. Improving the Caenorhabditis elegans genome annotation using machine learning.

    PubMed

    Rätsch, Gunnar; Sonnenburg, Sören; Srinivasan, Jagan; Witte, Hanh; Müller, Klaus-R; Sommer, Ralf-J; Schölkopf, Bernhard

    2007-02-23

    For modern biology, precise genome annotations are of prime importance, as they allow the accurate definition of genic regions. We employ state-of-the-art machine learning methods to assay and improve the accuracy of the genome annotation of the nematode Caenorhabditis elegans. The proposed machine learning system is trained to recognize exons and introns on the unspliced mRNA, utilizing recent advances in support vector machines and label sequence learning. In 87% (coding and untranslated regions) and 95% (coding regions only) of all genes tested in several out-of-sample evaluations, our method correctly identified all exons and introns. Notably, only 37% and 50%, respectively, of the presently unconfirmed genes in the C. elegans genome annotation agree with our predictions, thus we hypothesize that a sizable fraction of those genes are not correctly annotated. A retrospective evaluation of the Wormbase WS120 annotation [] of C. elegans reveals that splice form predictions on unconfirmed genes in WS120 are inaccurate in about 18% of the considered cases, while our predictions deviate from the truth only in 10%-13%. We experimentally analyzed 20 controversial genes on which our system and the annotation disagree, confirming the superiority of our predictions. While our method correctly predicted 75% of those cases, the standard annotation was never completely correct. The accuracy of our system is further corroborated by a comparison with two other recently proposed systems that can be used for splice form prediction: SNAP and ExonHunter. We conclude that the genome annotation of C. elegans and other organisms can be greatly enhanced using modern machine learning technology.

  8. MIPS: analysis and annotation of genome information in 2007.

    PubMed

    Mewes, H W; Dietmann, S; Frishman, D; Gregory, R; Mannhaupt, G; Mayer, K F X; Münsterkötter, M; Ruepp, A; Spannagl, M; Stümpflen, V; Rattei, T

    2008-01-01

    The Munich Information Center for Protein Sequences (MIPS-GSF, Neuherberg, Germany) combines automatic processing of large amounts of sequences with manual annotation of selected model genomes. Due to the massive growth of the available data, the depth of annotation varies widely between independent databases. Also, the criteria for the transfer of information from known to orthologous sequences are diverse. To cope with the task of global in-depth genome annotation has become unfeasible. Therefore, our efforts are dedicated to three levels of annotation: (i) the curation of selected genomes, in particular from fungal and plant taxa (e.g. CYGD, MNCDB, MatDB), (ii) the comprehensive, consistent, automatic annotation employing exhaustive methods for the computation of sequence similarities and sequence-related attributes as well as the classification of individual sequences (SIMAP, PEDANT and FunCat) and (iii) the compilation of manually curated databases for protein interactions based on scrutinized information from the literature to serve as an accepted set of reliable annotated interaction data (MPACT, MPPI, CORUM). All databases and tools described as well as the detailed descriptions of our projects can be accessed through the MIPS web server (http://mips.gsf.de).

  9. Expert System for Computer-assisted Annotation of MS/MS Spectra*

    PubMed Central

    Neuhauser, Nadin; Michalski, Annette; Cox, Jürgen; Mann, Matthias

    2012-01-01

    An important step in mass spectrometry (MS)-based proteomics is the identification of peptides by their fragment spectra. Regardless of the identification score achieved, almost all tandem-MS (MS/MS) spectra contain remaining peaks that are not assigned by the search engine. These peaks may be explainable by human experts but the scale of modern proteomics experiments makes this impractical. In computer science, Expert Systems are a mature technology to implement a list of rules generated by interviews with practitioners. We here develop such an Expert System, making use of literature knowledge as well as a large body of high mass accuracy and pure fragmentation spectra. Interestingly, we find that even with high mass accuracy data, rule sets can quickly become too complex, leading to over-annotation. Therefore we establish a rigorous false discovery rate, calculated by random insertion of peaks from a large collection of other MS/MS spectra, and use it to develop an optimized knowledge base. This rule set correctly annotates almost all peaks of medium or high abundance. For high resolution HCD data, median intensity coverage of fragment peaks in MS/MS spectra increases from 58% by search engine annotation alone to 86%. The resulting annotation performance surpasses a human expert, especially on complex spectra such as those of larger phosphorylated peptides. Our system is also applicable to high resolution collision-induced dissociation data. It is available both as a part of MaxQuant and via a webserver that only requires an MS/MS spectrum and the corresponding peptides sequence, and which outputs publication quality, annotated MS/MS spectra (www.biochem.mpg.de/mann/tools/). It provides expert knowledge to beginners in the field of MS-based proteomics and helps advanced users to focus on unusual and possibly novel types of fragment ions. PMID:22888147

  10. An Improved microRNA Annotation of the Canine Genome.

    PubMed

    Penso-Dolfin, Luca; Swofford, Ross; Johnson, Jeremy; Alföldi, Jessica; Lindblad-Toh, Kerstin; Swarbreck, David; Moxon, Simon; Di Palma, Federica

    2016-01-01

    The domestic dog, Canis familiaris, is a valuable model for studying human diseases. The publication of the latest Canine genome build and annotation, CanFam3.1 provides an opportunity to enhance our understanding of gene regulation across tissues in the dog model system. In this study, we used the latest dog genome assembly and small RNA sequencing data from 9 different dog tissues to predict novel miRNAs in the dog genome, as well as to annotate conserved miRNAs from the miRBase database that were missing from the current dog annotation. We used both miRCat and miRDeep2 algorithms to computationally predict miRNA loci. The resulting, putative hairpin sequences were analysed in order to discard false positives, based on predicted secondary structures and patterns of small RNA read alignments. Results were further divided into high and low confidence miRNAs, using the same criteria. We generated tissue specific expression profiles for the resulting set of 811 loci: 720 conserved miRNAs, (207 of which had not been previously annotated in the dog genome) and 91 novel miRNA loci. Comparative analyses revealed 8 putative homologues of some novel miRNA in ferret, and one in microbat. All miRNAs were also classified into the genic and intergenic categories, based on the Ensembl RefSeq gene annotation for CanFam3.1. This additionally allowed us to identify four previously undescribed MiRtrons among our total set of miRNAs. We additionally annotated piRNAs, using proTRAC on the same input data. We thus identified 263 putative clusters, most of which (211 clusters) were found to be expressed in testis. Our results represent an important improvement of the dog genome annotation, paving the way to further research on the evolution of gene regulation, as well as on the contribution of post-transcriptional regulation to pathological conditions.

  11. Applying negative rule mining to improve genome annotation

    PubMed Central

    Artamonova, Irena I; Frishman, Goar; Frishman, Dmitrij

    2007-01-01

    Background Unsupervised annotation of proteins by software pipelines suffers from very high error rates. Spurious functional assignments are usually caused by unwarranted homology-based transfer of information from existing database entries to the new target sequences. We have previously demonstrated that data mining in large sequence annotation databanks can help identify annotation items that are strongly associated with each other, and that exceptions from strong positive association rules often point to potential annotation errors. Here we investigate the applicability of negative association rule mining to revealing erroneously assigned annotation items. Results Almost all exceptions from strong negative association rules are connected to at least one wrong attribute in the feature combination making up the rule. The fraction of annotation features flagged by this approach as suspicious is strongly enriched in errors and constitutes about 0.6% of the whole body of the similarity-transferred annotation in the PEDANT genome database. Positive rule mining does not identify two thirds of these errors. The approach based on exceptions from negative rules is much more specific than positive rule mining, but its coverage is significantly lower. Conclusion Mining of both negative and positive association rules is a potent tool for finding significant trends in protein annotation and flagging doubtful features for further inspection. PMID:17659089

  12. A manual curation strategy to improve genome annotation: application to a set of haloarchael genomes.

    PubMed

    Pfeiffer, Friedhelm; Oesterhelt, Dieter

    2015-06-02

    Genome annotation errors are a persistent problem that impede research in the biosciences. A manual curation effort is described that attempts to produce high-quality genome annotations for a set of haloarchaeal genomes (Halobacterium salinarum and Hbt. hubeiense, Haloferax volcanii and Hfx. mediterranei, Natronomonas pharaonis and Nmn. moolapensis, Haloquadratum walsbyi strains HBSQ001 and C23, Natrialba magadii, Haloarcula marismortui and Har. hispanica, and Halohasta litchfieldiae). Genomes are checked for missing genes, start codon misassignments, and disrupted genes. Assignments of a specific function are preferably based on experimentally characterized homologs (Gold Standard Proteins). To avoid overannotation, which is a major source of database errors, we restrict annotation to only general function assignments when support for a specific substrate assignment is insufficient. This strategy results in annotations that are resistant to the plethora of errors that compromise public databases. Annotation consistency is rigorously validated for ortholog pairs from the genomes surveyed. The annotation is regularly crosschecked against the UniProt database to further improve annotations and increase the level of standardization. Enhanced genome annotations are submitted to public databases (EMBL/GenBank, UniProt), to the benefit of the scientific community. The enhanced annotations are also publically available via HaloLex.

  13. Mouse genome annotation by the RefSeq project.

    PubMed

    McGarvey, Kelly M; Goldfarb, Tamara; Cox, Eric; Farrell, Catherine M; Gupta, Tripti; Joardar, Vinita S; Kodali, Vamsi K; Murphy, Michael R; O'Leary, Nuala A; Pujar, Shashikant; Rajput, Bhanu; Rangwala, Sanjida H; Riddick, Lillian D; Webb, David; Wright, Mathew W; Murphy, Terence D; Pruitt, Kim D

    2015-10-01

    Complete and accurate annotation of the mouse genome is critical to the advancement of research conducted on this important model organism. The National Center for Biotechnology Information (NCBI) develops and maintains many useful resources to assist the mouse research community. In particular, the reference sequence (RefSeq) database provides high-quality annotation of multiple mouse genome assemblies using a combinatorial approach that leverages computation, manual curation, and collaboration. Implementation of this conservative and rigorous approach, which focuses on representation of only full-length and non-redundant data, produces high-quality annotation products. RefSeq records explicitly link sequences to current knowledge in a timely manner, updating public records regularly and rapidly in response to nomenclature updates, addition of new relevant publications, collaborator discussion, and user feedback. Whole genome re-annotation is also conducted at least every 12-18 months, and often more frequently in response to assembly updates or availability of informative data. This article highlights key features and advantages of RefSeq genome annotation products and presents an overview of NCBI processes to generate these data. Further discussion of NCBI's resources highlights useful features and the best methods for accessing our data.

  14. Genome annotation in the presence of insertional RNA editing

    PubMed Central

    Beargie, Christina; Liu, Tsunglin; Corriveau, Mark; Lee, Ha Youn; Gott, Jonatha; Bundschuh, Ralf

    2008-01-01

    Motivation: Insertional RNA editing renders gene prediction very difficult compared to organisms without such RNA editing. A case in point is the mitochondrial genome of Physarum polycephalum in which only about one-third of the number of genes that are to be expected given its length are annotated. Thus, gene prediction methods that explicitly take into account insertional editing are needed for successful annotation of such genomes. Results: We annotate the mitochondrial genome of P.polycephalum using several different approaches for gene prediction in organisms with insertional RNA editing. We computationally validate our annotations by comparing the results from different methods against each other and as proof of concept experimentally validate two of the newly predicted genes. We more than double the number of annotated putative genes in this organism and find several intriguing candidate genes that are not expected in a mitochondrial genome. Availability: The C source code of the programs described here are available upon request from the corresponding author. Contact: bundschuh@mps.ohio-state.edu PMID:18819938

  15. Genome annotation of a Saccharomyces sp. lager brewer's yeast.

    PubMed

    De León-Medina, Patricia Marcela; Elizondo-González, Ramiro; Damas-Buenrostro, Luis Cástulo; Geertman, Jan-Maarten; Van den Broek, Marcel; Galán-Wong, Luis Jesús; Ortiz-López, Rocío; Pereyra-Alférez, Benito

    2016-09-01

    The genome of lager brewer's yeast is a hybrid, with Saccharomyces eubayanus and Saccharomyces cerevisiae as sub-genomes. Due to their specific use in the beer industry, relatively little information is available. The genome of brewing yeast was sequenced and annotated in this study. We obtained a genome size of 22.7 Mbp that consisted of 133 scaffolds, with 65 scaffolds larger than 10 kbp. With respect to the annotation, 9939 genes were obtained, and when they were submitted to a local alignment, we found that 53.93% of these genes corresponded to S. cerevisiae, while another 42.86% originated from S. eubayanus. Our results confirm that our strain is a hybrid of at least two different genomes.

  16. GENCODE: the reference human genome annotation for The ENCODE Project.

    PubMed

    Harrow, Jennifer; Frankish, Adam; Gonzalez, Jose M; Tapanari, Electra; Diekhans, Mark; Kokocinski, Felix; Aken, Bronwen L; Barrell, Daniel; Zadissa, Amonida; Searle, Stephen; Barnes, If; Bignell, Alexandra; Boychenko, Veronika; Hunt, Toby; Kay, Mike; Mukherjee, Gaurab; Rajan, Jeena; Despacio-Reyes, Gloria; Saunders, Gary; Steward, Charles; Harte, Rachel; Lin, Michael; Howald, Cédric; Tanzer, Andrea; Derrien, Thomas; Chrast, Jacqueline; Walters, Nathalie; Balasubramanian, Suganthi; Pei, Baikang; Tress, Michael; Rodriguez, Jose Manuel; Ezkurdia, Iakes; van Baren, Jeltje; Brent, Michael; Haussler, David; Kellis, Manolis; Valencia, Alfonso; Reymond, Alexandre; Gerstein, Mark; Guigó, Roderic; Hubbard, Tim J

    2012-09-01

    The GENCODE Consortium aims to identify all gene features in the human genome using a combination of computational analysis, manual annotation, and experimental validation. Since the first public release of this annotation data set, few new protein-coding loci have been added, yet the number of alternative splicing transcripts annotated has steadily increased. The GENCODE 7 release contains 20,687 protein-coding and 9640 long noncoding RNA loci and has 33,977 coding transcripts not represented in UCSC genes and RefSeq. It also has the most comprehensive annotation of long noncoding RNA (lncRNA) loci publicly available with the predominant transcript form consisting of two exons. We have examined the completeness of the transcript annotation and found that 35% of transcriptional start sites are supported by CAGE clusters and 62% of protein-coding genes have annotated polyA sites. Over one-third of GENCODE protein-coding genes are supported by peptide hits derived from mass spectrometry spectra submitted to Peptide Atlas. New models derived from the Illumina Body Map 2.0 RNA-seq data identify 3689 new loci not currently in GENCODE, of which 3127 consist of two exon models indicating that they are possibly unannotated long noncoding loci. GENCODE 7 is publicly available from gencodegenes.org and via the Ensembl and UCSC Genome Browsers.

  17. Intra-species sequence comparisons for annotating genomes

    SciTech Connect

    Boffelli, Dario; Weer, Claire V.; Weng, Li; Lewis, Keith D.; Shoukry, Malak I.; Pachter, Lior; Keys, David N.; Rubin, Edward M.

    2004-07-15

    Analysis of sequence variation among members of a single species offers a potential approach to identify functional DNA elements responsible for biological features unique to that species. Due to its high rate of allelic polymorphism and ease of genetic manipulability, we chose the sea squirt, Ciona intestinalis, to explore intra-species sequence comparisons for genome annotation. A large number of C. intestinalis specimens were collected from four continents and a set of genomic intervals amplified, resequenced and analyzed to determine the mutation rates at each nucleotide in the sequence. We found that regions with low mutation rates efficiently demarcated functionally constrained sequences: these include a set of noncoding elements, which we showed in C intestinalis transgenic assays to act as tissue-specific enhancers, as well as the location of coding sequences. This illustrates that comparisons of multiple members of a species can be used for genome annotation, suggesting a path for the annotation of the sequenced genomes of organisms occupying uncharacterized phylogenetic branches of the animal kingdom and raises the possibility that the resequencing of a large number of Homo sapiens individuals might be used to annotate the human genome and identify sequences defining traits unique to our species. The sequence data from this study has been submitted to GenBank under accession nos. AY667278-AY667407.

  18. proGenomes: a resource for consistent functional and taxonomic annotations of prokaryotic genomes

    PubMed Central

    Mende, Daniel R.; Letunic, Ivica; Huerta-Cepas, Jaime; Li, Simone S.; Forslund, Kristoffer; Sunagawa, Shinichi; Bork, Peer

    2017-01-01

    The availability of microbial genomes has opened many new avenues of research within microbiology. This has been driven primarily by comparative genomics approaches, which rely on accurate and consistent characterization of genomic sequences. It is nevertheless difficult to obtain consistent taxonomic and integrated functional annotations for defined prokaryotic clades. Thus, we developed proGenomes, a resource that provides user-friendly access to currently 25 038 high-quality genomes whose sequences and consistent annotations can be retrieved individually or by taxonomic clade. These genomes are assigned to 5306 consistent and accurate taxonomic species clusters based on previously established methodology. proGenomes also contains functional information for almost 80 million protein-coding genes, including a comprehensive set of general annotations and more focused annotations for carbohydrate-active enzymes and antibiotic resistance genes. Additionally, broad habitat information is provided for many genomes. All genomes and associated information can be downloaded by user-selected clade or multiple habitat-specific sets of representative genomes. We expect that the availability of high-quality genomes with comprehensive functional annotations will promote advances in clinical microbial genomics, functional evolution and other subfields of microbiology. proGenomes is available at http://progenomes.embl.de. PMID:28053165

  19. proGenomes: a resource for consistent functional and taxonomic annotations of prokaryotic genomes.

    PubMed

    Mende, Daniel R; Letunic, Ivica; Huerta-Cepas, Jaime; Li, Simone S; Forslund, Kristoffer; Sunagawa, Shinichi; Bork, Peer

    2017-01-04

    The availability of microbial genomes has opened many new avenues of research within microbiology. This has been driven primarily by comparative genomics approaches, which rely on accurate and consistent characterization of genomic sequences. It is nevertheless difficult to obtain consistent taxonomic and integrated functional annotations for defined prokaryotic clades. Thus, we developed proGenomes, a resource that provides user-friendly access to currently 25 038 high-quality genomes whose sequences and consistent annotations can be retrieved individually or by taxonomic clade. These genomes are assigned to 5306 consistent and accurate taxonomic species clusters based on previously established methodology. proGenomes also contains functional information for almost 80 million protein-coding genes, including a comprehensive set of general annotations and more focused annotations for carbohydrate-active enzymes and antibiotic resistance genes. Additionally, broad habitat information is provided for many genomes. All genomes and associated information can be downloaded by user-selected clade or multiple habitat-specific sets of representative genomes. We expect that the availability of high-quality genomes with comprehensive functional annotations will promote advances in clinical microbial genomics, functional evolution and other subfields of microbiology. proGenomes is available at http://progenomes.embl.de.

  20. nGASP – the nematode genome annotation assessment project

    PubMed Central

    Coghlan, Avril; Fiedler, Tristan J; McKay, Sheldon J; Flicek, Paul; Harris, Todd W; Blasiar, Darin; Stein, Lincoln D

    2008-01-01

    Background While the C. elegans genome is extensively annotated, relatively little information is available for other Caenorhabditis species. The nematode genome annotation assessment project (nGASP) was launched to objectively assess the accuracy of protein-coding gene prediction software in C. elegans, and to apply this knowledge to the annotation of the genomes of four additional Caenorhabditis species and other nematodes. Seventeen groups worldwide participated in nGASP, and submitted 47 prediction sets across 10 Mb of the C. elegans genome. Predictions were compared to reference gene sets consisting of confirmed or manually curated gene models from WormBase. Results The most accurate gene-finders were 'combiner' algorithms, which made use of transcript- and protein-alignments and multi-genome alignments, as well as gene predictions from other gene-finders. Gene-finders that used alignments of ESTs, mRNAs and proteins came in second. There was a tie for third place between gene-finders that used multi-genome alignments and ab initio gene-finders. The median gene level sensitivity of combiners was 78% and their specificity was 42%, which is nearly the same accuracy reported for combiners in the human genome. C. elegans genes with exons of unusual hexamer content, as well as those with unusually many exons, short exons, long introns, a weak translation start signal, weak splice sites, or poorly conserved orthologs posed the greatest difficulty for gene-finders. Conclusion This experiment establishes a baseline of gene prediction accuracy in Caenorhabditis genomes, and has guided the choice of gene-finders for the annotation of newly sequenced genomes of Caenorhabditis and other nematode species. We have created new gene sets for C. briggsae, C. remanei, C. brenneri, C. japonica, and Brugia malayi using some of the best-performing gene-finders. PMID:19099578

  1. Using comparative genome analysis to identify problems in annotated microbial genomes.

    PubMed

    Poptsova, Maria S; Gogarten, J Peter

    2010-07-01

    Genome annotation is a tedious task that is mostly done by automated methods; however, the accuracy of these approaches has been questioned since the beginning of the sequencing era. Genome annotation is a multilevel process, and errors can emerge at different stages: during sequencing, as a result of gene-calling procedures, and in the process of assigning gene functions. Missed or wrongly annotated genes differentially impact different types of analyses. Here we discuss and demonstrate how the methods of comparative genome analysis can refine annotations by locating missing orthologues. We also discuss possible reasons for errors and show that the second-generation annotation systems, which combine multiple gene-calling programs with similarity-based methods, perform much better than the first annotation tools. Since old errors may propagate to the newly sequenced genomes, we emphasize that the problem of continuously updating popular public databases is an urgent and unresolved one. Due to the progress in genome-sequencing technologies, automated annotation techniques will remain the main approach in the future. Researchers need to be aware of the existing errors in the annotation of even well-studied genomes, such as Escherichia coli, and consider additional quality control for their results.

  2. A semantic analysis of the annotations of the human genome

    PubMed Central

    Khatri, Purvesh; Done, Bogdan; Rao, Archana; Done, Arina

    2008-01-01

    The correct interpretation of any biological experiment depends in an essential way on the accuracy and consistency of the existing annotation databases. Such databases are ubiquitous and used by all life scientists in most experiments. However, it is well known that such databases are incomplete and many annotations may also be incorrect. In this paper we describe a technique that can be used to analyze the semantic content of such annotation databases. Our approach is able to extract implicit semantic relationships between genes and functions. This ability allows us to discover novel functions for known genes. This approach is able to identify missing and inaccurate annotations in existing annotation databases, and thus help improve their accuracy. We used our technique to analyze the current annotations of the human genome. From this body of annotations, we were able to predict 212 additional gene–function assignments. A subsequent literature search found that 138 of these gene–functions assignments are supported by existing peer-reviewed papers. An additional 23 assignments have been confirmed in the meantime by the addition of the respective annotations in later releases of the Gene Ontology database. Overall, the 161 confirmed assignments represent 75.95% of the proposed gene–function assignments. Only one of our predictions (0.4%) was contradicted by the existing literature. We could not find any relevant articles for 50 of our predictions (23.58%). The method is independent of the organism and can be used to analyze and improve the quality of the data of any public or private annotation database. Availability http://vortex.cs.wayne.edu/papers/semantic_analysis_bioinfo.pdf Contact sod@cs.wayne.edu PMID:15955782

  3. Genome Annotation in a Community College Cell Biology Lab

    ERIC Educational Resources Information Center

    Beagley, C. Timothy

    2013-01-01

    The Biology Department at Salt Lake Community College has used the IMG-ACT toolbox to introduce a genome mapping and annotation exercise into the laboratory portion of its Cell Biology course. This project provides students with an authentic inquiry-based learning experience while introducing them to computational biology and contemporary learning…

  4. Annotation of the Clostridium Acetobutylicum Genome

    SciTech Connect

    Daly, M. J.

    2004-06-09

    The genome sequence of the solvent producing bacterium Clostridium acetobutylicum ATCC824, has been determined by the shotgun approach. The genome consists of a 3.94 Mb chromosome and a 192 kb megaplasmid that contains the majority of genes responsible for solvent production. Comparison of C. acetobutylicum to Bacillus subtilis reveals significant local conservation of gene order, which has not been seen in comparisons of other genomes with similar, or, in some cases, closer, phylogenetic proximity. This conservation allows the prediction of many previously undetected operons in both bacteria.

  5. MITOS: improved de novo metazoan mitochondrial genome annotation.

    PubMed

    Bernt, Matthias; Donath, Alexander; Jühling, Frank; Externbrink, Fabian; Florentz, Catherine; Fritzsch, Guido; Pütz, Joern; Middendorf, Martin; Stadler, Peter F

    2013-11-01

    About 2000 completely sequenced mitochondrial genomes are available from the NCBI RefSeq data base together with manually curated annotations of their protein-coding genes, rRNAs, and tRNAs. This annotation information, which has accumulated over two decades, has been obtained with a diverse set of computational tools and annotation strategies. Despite all efforts of manual curation it is still plagued by misassignments of reading directions, erroneous gene names, and missing as well as false positive annotations in particular for the RNA genes. Taken together, this causes substantial problems for fully automatic pipelines that aim to use these data comprehensively for studies of animal phylogenetics and the molecular evolution of mitogenomes. The MITOS pipeline is designed to compute a consistent de novo annotation of the mitogenomic sequences. We show that the results of MITOS match RefSeq and MitoZoa in terms of annotation coverage and quality. At the same time we avoid biases, inconsistencies of nomenclature, and typos originating from manual curation strategies. The MITOS pipeline is accessible online at http://mitos.bioinf.uni-leipzig.de.

  6. MyPro: A seamless pipeline for automated prokaryotic genome assembly and annotation

    PubMed Central

    Liao, Yu-Chieh; Lin, Hsin-Hung; Sabharwal, Amarpreet; Haase, Elaine M.; Scannapieco, Frank A.

    2016-01-01

    MyPro is a software pipeline for high-quality prokaryotic genome assembly and annotation. It was validated on 18 oral streptococcal strains to produce submission-ready, annotated draft genomes. MyPro installed as a virtual machine and supported by updated databases will enable biologists to perform quality prokaryotic genome assembly and annotation with ease. PMID:25911337

  7. MyPro: A seamless pipeline for automated prokaryotic genome assembly and annotation.

    PubMed

    Liao, Yu-Chieh; Lin, Hsin-Hung; Sabharwal, Amarpreet; Haase, Elaine M; Scannapieco, Frank A

    2015-06-01

    MyPro is a software pipeline for high-quality prokaryotic genome assembly and annotation. It was validated on 18 oral streptococcal strains to produce submission-ready, annotated draft genomes. MyPro installed as a virtual machine and supported by updated databases will enable biologists to perform quality prokaryotic genome assembly and annotation with ease.

  8. Enhanced annotations and features for comparing thousands of Pseudomonas genomes in the Pseudomonas genome database.

    PubMed

    Winsor, Geoffrey L; Griffiths, Emma J; Lo, Raymond; Dhillon, Bhavjinder K; Shay, Julie A; Brinkman, Fiona S L

    2016-01-04

    The Pseudomonas Genome Database (http://www.pseudomonas.com) is well known for the application of community-based annotation approaches for producing a high-quality Pseudomonas aeruginosa PAO1 genome annotation, and facilitating whole-genome comparative analyses with other Pseudomonas strains. To aid analysis of potentially thousands of complete and draft genome assemblies, this database and analysis platform was upgraded to integrate curated genome annotations and isolate metadata with enhanced tools for larger scale comparative analysis and visualization. Manually curated gene annotations are supplemented with improved computational analyses that help identify putative drug targets and vaccine candidates or assist with evolutionary studies by identifying orthologs, pathogen-associated genes and genomic islands. The database schema has been updated to integrate isolate metadata that will facilitate more powerful analysis of genomes across datasets in the future. We continue to place an emphasis on providing high-quality updates to gene annotations through regular review of the scientific literature and using community-based approaches including a major new Pseudomonas community initiative for the assignment of high-quality gene ontology terms to genes. As we further expand from thousands of genomes, we plan to provide enhancements that will aid data visualization and analysis arising from whole-genome comparative studies including more pan-genome and population-based approaches.

  9. An Improved microRNA Annotation of the Canine Genome

    PubMed Central

    Swofford, Ross; Johnson, Jeremy; Alföldi, Jessica; Lindblad-Toh, Kerstin; Swarbreck, David; Moxon, Simon; Di Palma, Federica

    2016-01-01

    The domestic dog, Canis familiaris, is a valuable model for studying human diseases. The publication of the latest Canine genome build and annotation, CanFam3.1 provides an opportunity to enhance our understanding of gene regulation across tissues in the dog model system. In this study, we used the latest dog genome assembly and small RNA sequencing data from 9 different dog tissues to predict novel miRNAs in the dog genome, as well as to annotate conserved miRNAs from the miRBase database that were missing from the current dog annotation. We used both miRCat and miRDeep2 algorithms to computationally predict miRNA loci. The resulting, putative hairpin sequences were analysed in order to discard false positives, based on predicted secondary structures and patterns of small RNA read alignments. Results were further divided into high and low confidence miRNAs, using the same criteria. We generated tissue specific expression profiles for the resulting set of 811 loci: 720 conserved miRNAs, (207 of which had not been previously annotated in the dog genome) and 91 novel miRNA loci. Comparative analyses revealed 8 putative homologues of some novel miRNA in ferret, and one in microbat. All miRNAs were also classified into the genic and intergenic categories, based on the Ensembl RefSeq gene annotation for CanFam3.1. This additionally allowed us to identify four previously undescribed MiRtrons among our total set of miRNAs. We additionally annotated piRNAs, using proTRAC on the same input data. We thus identified 263 putative clusters, most of which (211 clusters) were found to be expressed in testis. Our results represent an important improvement of the dog genome annotation, paving the way to further research on the evolution of gene regulation, as well as on the contribution of post-transcriptional regulation to pathological conditions. PMID:27119849

  10. GI-POP: a combinational annotation and genomic island prediction pipeline for ongoing microbial genome projects.

    PubMed

    Lee, Chi-Ching; Chen, Yi-Ping Phoebe; Yao, Tzu-Jung; Ma, Cheng-Yu; Lo, Wei-Cheng; Lyu, Ping-Chiang; Tang, Chuan Yi

    2013-04-10

    Sequencing of microbial genomes is important because of microbial-carrying antibiotic and pathogenetic activities. However, even with the help of new assembling software, finishing a whole genome is a time-consuming task. In most bacteria, pathogenetic or antibiotic genes are carried in genomic islands. Therefore, a quick genomic island (GI) prediction method is useful for ongoing sequencing genomes. In this work, we built a Web server called GI-POP (http://gipop.life.nthu.edu.tw) which integrates a sequence assembling tool, a functional annotation pipeline, and a high-performance GI predicting module, in a support vector machine (SVM)-based method called genomic island genomic profile scanning (GI-GPS). The draft genomes of the ongoing genome projects in contigs or scaffolds can be submitted to our Web server, and it provides the functional annotation and highly probable GI-predicting results. GI-POP is a comprehensive annotation Web server designed for ongoing genome project analysis. Researchers can perform annotation and obtain pre-analytic information include possible GIs, coding/non-coding sequences and functional analysis from their draft genomes. This pre-analytic system can provide useful information for finishing a genome sequencing project.

  11. MIPS: analysis and annotation of proteins from whole genomes.

    PubMed

    Mewes, H W; Amid, C; Arnold, R; Frishman, D; Güldener, U; Mannhaupt, G; Münsterkötter, M; Pagel, P; Strack, N; Stümpflen, V; Warfsmann, J; Ruepp, A

    2004-01-01

    The Munich Information Center for Protein Sequences (MIPS-GSF), Neuherberg, Germany, provides protein sequence-related information based on whole-genome analysis. The main focus of the work is directed toward the systematic organization of sequence-related attributes as gathered by a variety of algorithms, primary information from experimental data together with information compiled from the scientific literature. MIPS maintains automatically generated and manually annotated genome-specific databases, develops systematic classification schemes for the functional annotation of protein sequences and provides tools for the comprehensive analysis of protein sequences. This report updates the information on the yeast genome (CYGD), the Neurospora crassa genome (MNCDB), the database of complete cDNAs (German Human Genome Project, NGFN), the database of mammalian protein-protein interactions (MPPI), the database of FASTA homologies (SIMAP), and the interface for the fast retrieval of protein-associated information (QUIPOS). The Arabidopsis thaliana database, the rice database, the plant EST databases (MATDB, MOsDB, SPUTNIK), as well as the databases for the comprehensive set of genomes (PEDANT genomes) are described elsewhere in the 2003 and 2004 NAR database issues, respectively. All databases described, and the detailed descriptions of our projects can be accessed through the MIPS web server (http://mips.gsf.de).

  12. GBshape: a genome browser database for DNA shape annotations.

    PubMed

    Chiu, Tsu-Pei; Yang, Lin; Zhou, Tianyin; Main, Bradley J; Parker, Stephen C J; Nuzhdin, Sergey V; Tullius, Thomas D; Rohs, Remo

    2015-01-01

    Many regulatory mechanisms require a high degree of specificity in protein-DNA binding. Nucleotide sequence does not provide an answer to the question of why a protein binds only to a small subset of the many putative binding sites in the genome that share the same core motif. Whereas higher-order effects, such as chromatin accessibility, cooperativity and cofactors, have been described, DNA shape recently gained attention as another feature that fine-tunes the DNA binding specificities of some transcription factor families. Our Genome Browser for DNA shape annotations (GBshape; freely available at http://rohslab.cmb.usc.edu/GBshape/) provides minor groove width, propeller twist, roll, helix twist and hydroxyl radical cleavage predictions for the entire genomes of 94 organisms. Additional genomes can easily be added using the GBshape framework. GBshape can be used to visualize DNA shape annotations qualitatively in a genome browser track format, and to download quantitative values of DNA shape features as a function of genomic position at nucleotide resolution. As biological applications, we illustrate the periodicity of DNA shape features that are present in nucleosome-occupied sequences from human, fly and worm, and we demonstrate structural similarities between transcription start sites in the genomes of four Drosophila species.

  13. nGASP - the nematode genome annotation assessment project

    SciTech Connect

    Coghlan, A; Fiedler, T J; McKay, S J; Flicek, P; Harris, T W; Blasiar, D; Allen, J; Stein, L D

    2008-12-19

    While the C. elegans genome is extensively annotated, relatively little information is available for other Caenorhabditis species. The nematode genome annotation assessment project (nGASP) was launched to objectively assess the accuracy of protein-coding gene prediction software in C. elegans, and to apply this knowledge to the annotation of the genomes of four additional Caenorhabditis species and other nematodes. Seventeen groups worldwide participated in nGASP, and submitted 47 prediction sets for 10 Mb of the C. elegans genome. Predictions were compared to reference gene sets consisting of confirmed or manually curated gene models from WormBase. The most accurate gene-finders were 'combiner' algorithms, which made use of transcript- and protein-alignments and multi-genome alignments, as well as gene predictions from other gene-finders. Gene-finders that used alignments of ESTs, mRNAs and proteins came in second place. There was a tie for third place between gene-finders that used multi-genome alignments and ab initio gene-finders. The median gene level sensitivity of combiners was 78% and their specificity was 42%, which is nearly the same accuracy as reported for combiners in the human genome. C. elegans genes with exons of unusual hexamer content, as well as those with many exons, short exons, long introns, a weak translation start signal, weak splice sites, or poorly conserved orthologs were the most challenging for gene-finders. While the C. elegans genome is extensively annotated, relatively little information is available for other Caenorhabditis species. The nematode genome annotation assessment project (nGASP) was launched to objectively assess the accuracy of protein-coding gene prediction software in C. elegans, and to apply this knowledge to the annotation of the genomes of four additional Caenorhabditis species and other nematodes. Seventeen groups worldwide participated in nGASP, and submitted 47 prediction sets for 10 Mb of the C. elegans genome

  14. GEMINI: integrative exploration of genetic variation and genome annotations.

    PubMed

    Paila, Umadevi; Chapman, Brad A; Kirchner, Rory; Quinlan, Aaron R

    2013-01-01

    Modern DNA sequencing technologies enable geneticists to rapidly identify genetic variation among many human genomes. However, isolating the minority of variants underlying disease remains an important, yet formidable challenge for medical genetics. We have developed GEMINI (GEnome MINIng), a flexible software package for exploring all forms of human genetic variation. Unlike existing tools, GEMINI integrates genetic variation with a diverse and adaptable set of genome annotations (e.g., dbSNP, ENCODE, UCSC, ClinVar, KEGG) into a unified database to facilitate interpretation and data exploration. Whereas other methods provide an inflexible set of variant filters or prioritization methods, GEMINI allows researchers to compose complex queries based on sample genotypes, inheritance patterns, and both pre-installed and custom genome annotations. GEMINI also provides methods for ad hoc queries and data exploration, a simple programming interface for custom analyses that leverage the underlying database, and both command line and graphical tools for common analyses. We demonstrate GEMINI's utility for exploring variation in personal genomes and family based genetic studies, and illustrate its ability to scale to studies involving thousands of human samples. GEMINI is designed for reproducibility and flexibility and our goal is to provide researchers with a standard framework for medical genomics.

  15. Assembly, Annotation, and Analysis of Multiple Mycorrhizal Fungal Genomes

    SciTech Connect

    Initiative Consortium, Mycorrhizal Genomics; Kuo, Alan; Grigoriev, Igor; Kohler, Annegret; Martin, Francis

    2013-03-08

    Mycorrhizal fungi play critical roles in host plant health, soil community structure and chemistry, and carbon and nutrient cycling, all areas of intense interest to the US Dept. of Energy (DOE) Joint Genome Institute (JGI). To this end we are building on our earlier sequencing of the Laccaria bicolor genome by partnering with INRA-Nancy and the mycorrhizal research community in the MGI to sequence and analyze dozens of mycorrhizal genomes of all Basidiomycota and Ascomycota orders and multiple ecological types (ericoid, orchid, and ectomycorrhizal). JGI has developed and deployed high-throughput sequencing techniques, and Assembly, RNASeq, and Annotation Pipelines. In 2012 alone we sequenced, assembled, and annotated 12 draft or improved genomes of mycorrhizae, and predicted ~;;232831 genes and ~;;15011 multigene families, All of this data is publicly available on JGI MycoCosm (http://jgi.doe.gov/fungi/), which provides access to both the genome data and tools with which to analyze the data. Preliminary comparisons of the current total of 14 public mycorrhizal genomes suggest that 1) short secreted proteins potentially involved in symbiosis are more enriched in some orders than in others amongst the mycorrhizal Agaricomycetes, 2) there are wide ranges of numbers of genes involved in certain functional categories, such as signal transduction and post-translational modification, and 3) novel gene families are specific to some ecological types.

  16. GEMINI: Integrative Exploration of Genetic Variation and Genome Annotations

    PubMed Central

    Paila, Umadevi; Chapman, Brad A.; Kirchner, Rory; Quinlan, Aaron R.

    2013-01-01

    Modern DNA sequencing technologies enable geneticists to rapidly identify genetic variation among many human genomes. However, isolating the minority of variants underlying disease remains an important, yet formidable challenge for medical genetics. We have developed GEMINI (GEnome MINIng), a flexible software package for exploring all forms of human genetic variation. Unlike existing tools, GEMINI integrates genetic variation with a diverse and adaptable set of genome annotations (e.g., dbSNP, ENCODE, UCSC, ClinVar, KEGG) into a unified database to facilitate interpretation and data exploration. Whereas other methods provide an inflexible set of variant filters or prioritization methods, GEMINI allows researchers to compose complex queries based on sample genotypes, inheritance patterns, and both pre-installed and custom genome annotations. GEMINI also provides methods for ad hoc queries and data exploration, a simple programming interface for custom analyses that leverage the underlying database, and both command line and graphical tools for common analyses. We demonstrate GEMINI's utility for exploring variation in personal genomes and family based genetic studies, and illustrate its ability to scale to studies involving thousands of human samples. GEMINI is designed for reproducibility and flexibility and our goal is to provide researchers with a standard framework for medical genomics. PMID:23874191

  17. Improved white spruce (Picea glauca) genome assemblies and annotation of large gene families of conifer terpenoid and phenolic defense metabolism.

    PubMed

    Warren, René L; Keeling, Christopher I; Yuen, Macaire Man Saint; Raymond, Anthony; Taylor, Greg A; Vandervalk, Benjamin P; Mohamadi, Hamid; Paulino, Daniel; Chiu, Readman; Jackman, Shaun D; Robertson, Gordon; Yang, Chen; Boyle, Brian; Hoffmann, Margarete; Weigel, Detlef; Nelson, David R; Ritland, Carol; Isabel, Nathalie; Jaquish, Barry; Yanchuk, Alvin; Bousquet, Jean; Jones, Steven J M; MacKay, John; Birol, Inanc; Bohlmann, Joerg

    2015-07-01

    White spruce (Picea glauca), a gymnosperm tree, has been established as one of the models for conifer genomics. We describe the draft genome assemblies of two white spruce genotypes, PG29 and WS77111, innovative tools for the assembly of very large genomes, and the conifer genomics resources developed in this process. The two white spruce genotypes originate from distant geographic regions of western (PG29) and eastern (WS77111) North America, and represent elite trees in two Canadian tree-breeding programs. We present an update (V3 and V4) for a previously reported PG29 V2 draft genome assembly and introduce a second white spruce genome assembly for genotype WS77111. Assemblies of the PG29 and WS77111 genomes confirm the reconstructed white spruce genome size in the 20 Gbp range, and show broad synteny. Using the PG29 V3 assembly and additional white spruce genomics and transcriptomics resources, we performed MAKER-P annotation and meticulous expert annotation of very large gene families of conifer defense metabolism, the terpene synthases and cytochrome P450s. We also comprehensively annotated the white spruce mevalonate, methylerythritol phosphate and phenylpropanoid pathways. These analyses highlighted the large extent of gene and pseudogene duplications in a conifer genome, in particular for genes of secondary (i.e. specialized) metabolism, and the potential for gain and loss of function for defense and adaptation.

  18. Scaling up genome annotation using MAKER and work queue.

    PubMed

    Thrasher, Andrew; Musgrave, Zachary; Kachmarck, Brian; Thain, Douglas; Emrich, Scott

    2014-01-01

    Next generation sequencing technologies have enabled sequencing many genomes. Because of the overall increasing demand and the inherent parallelism available in many required analyses, these bioinformatics applications should ideally run on clusters, clouds and/or grids. We present a modified annotation framework that achieves a speed-up of 45x using 50 workers using a Caenorhabditis japonica test case. We also evaluate these modifications within the Amazon EC2 cloud framework. The underlying genome annotation (MAKER) is parallelised as an MPI application. Our framework enables it to now run without MPI while utilising a wide variety of distributed computing resources. This parallel framework also allows easy explicit data transfer, which helps overcome a major limitation of bioinformatics tools that often rely on shared file systems. Combined, our proposed framework can be used, even during early stages of development, to easily run sequence analysis tools on clusters, grids and clouds.

  19. CyanoBase and RhizoBase: databases of manually curated annotations for cyanobacterial and rhizobial genomes.

    PubMed

    Fujisawa, Takatomo; Okamoto, Shinobu; Katayama, Toshiaki; Nakao, Mitsuteru; Yoshimura, Hidehisa; Kajiya-Kanegae, Hiromi; Yamamoto, Sumiko; Yano, Chiyoko; Yanaka, Yuka; Maita, Hiroko; Kaneko, Takakazu; Tabata, Satoshi; Nakamura, Yasukazu

    2014-01-01

    To understand newly sequenced genomes of closely related species, comprehensively curated reference genome databases are becoming increasingly important. We have extended CyanoBase (http://genome.microbedb.jp/cyanobase), a genome database for cyanobacteria, and newly developed RhizoBase (http://genome.microbedb.jp/rhizobase), a genome database for rhizobia, nitrogen-fixing bacteria associated with leguminous plants. Both databases focus on the representation and reusability of reference genome annotations, which are continuously updated by manual curation. Domain experts have extracted names, products and functions of each gene reported in the literature. To ensure effectiveness of this procedure, we developed the TogoAnnotation system offering a web-based user interface and a uniform storage of annotations for the curators of the CyanoBase and RhizoBase databases. The number of references investigated for CyanoBase increased from 2260 in our previous report to 5285, and for RhizoBase, we perused 1216 references. The results of these intensive annotations are displayed on the GeneView pages of each database. Advanced users can also retrieve this information through the representational state transfer-based web application programming interface in an automated manner.

  20. Apollo: a sequence annotation editor.

    PubMed

    Lewis, S E; Searle, S M J; Harris, N; Gibson, M; Lyer, V; Richter, J; Wiel, C; Bayraktaroglu, L; Birney, E; Crosby, M A; Kaminker, J S; Matthews, B B; Prochnik, S E; Smithy, C D; Tupy, J L; Rubin, G M; Misra, S; Mungall, C J; Clamp, M E

    2002-01-01

    The well-established inaccuracy of purely computational methods for annotating genome sequences necessitates an interactive tool to allow biological experts to refine these approximations by viewing and independently evaluating the data supporting each annotation. Apollo was developed to meet this need, enabling curators to inspect genome annotations closely and edit them. FlyBase biologists successfully used Apollo to annotate the Drosophila melanogaster genome and it is increasingly being used as a starting point for the development of customized annotation editing tools for other genome projects.

  1. GANESH: Software for Customized Annotation of Genome Regions

    PubMed Central

    Huntley, Derek; Hummerich, Holger; Smedley, Damian; Kittivoravitkul, Sasivimol; McCarthy, Mark; Little, Peter; Sergot, Marek

    2003-01-01

    GANESH is a software package designed to support the genetic analysis of regions of human and other genomes. It provides a set of components that may be assembled to construct a self-updating database of DNA sequence, mapping data, and annotations of possible genome features. Once one or more remote sources of data for the target region have been identified, all sequences for that region are downloaded, assimilated, and subjected to a (configurable) set of standard database-searching and genome-analysis packages. The results are stored in compressed form in a relational database, and are updated automatically on a regular schedule so that they are always immediately available in their most up-to-date versions. A Java front-end, executed as a stand alone application or web applet, provides a graphical interface for navigating the database and for viewing the annotations. There are facilities for importing and exporting data in the format of the Distributed Annotation System (DAS), enabling a GANESH database to be used as a component of a DAS configuration. The system has been used to construct databases for about a dozen regions of human chromosomes and for three regions of mouse chromosomes. PMID:12952886

  2. MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects

    PubMed Central

    2011-01-01

    Background Second-generation sequencing technologies are precipitating major shifts with regards to what kinds of genomes are being sequenced and how they are annotated. While the first generation of genome projects focused on well-studied model organisms, many of today's projects involve exotic organisms whose genomes are largely terra incognita. This complicates their annotation, because unlike first-generation projects, there are no pre-existing 'gold-standard' gene-models with which to train gene-finders. Improvements in genome assembly and the wide availability of mRNA-seq data are also creating opportunities to update and re-annotate previously published genome annotations. Today's genome projects are thus in need of new genome annotation tools that can meet the challenges and opportunities presented by second-generation sequencing technologies. Results We present MAKER2, a genome annotation and data management tool designed for second-generation genome projects. MAKER2 is a multi-threaded, parallelized application that can process second-generation datasets of virtually any size. We show that MAKER2 can produce accurate annotations for novel genomes where training-data are limited, of low quality or even non-existent. MAKER2 also provides an easy means to use mRNA-seq data to improve annotation quality; and it can use these data to update legacy annotations, significantly improving their quality. We also show that MAKER2 can evaluate the quality of genome annotations, and identify and prioritize problematic annotations for manual review. Conclusions MAKER2 is the first annotation engine specifically designed for second-generation genome projects. MAKER2 scales to datasets of any size, requires little in the way of training data, and can use mRNA-seq data to improve annotation quality. It can also update and manage legacy genome annotation datasets. PMID:22192575

  3. Design and implementation of a database for Brucella melitensis genome annotation.

    PubMed

    De Hertogh, Benoît; Lahlimi, Leïla; Lambert, Christophe; Letesson, Jean-Jacques; Depiereux, Eric

    2008-03-18

    The genome sequences of three Brucella biovars and of some species close to Brucella sp. have become available, leading to new relationship analysis. Moreover, the automatic genome annotation of the pathogenic bacteria Brucella melitensis has been manually corrected by a consortium of experts, leading to 899 modifications of start sites predictions among the 3198 open reading frames (ORFs) examined. This new annotation, coupled with the results of automatic annotation tools of the complete genome sequences of the B. melitensis genome (including BLASTs to 9 genomes close to Brucella), provides numerous data sets related to predicted functions, biochemical properties and phylogenic comparisons. To made these results available, alphaPAGe, a functional auto-updatable database of the corrected sequence genome of B. melitensis, has been built, using the entity-relationship (ER) approach and a multi-purpose database structure. A friendly graphical user interface has been designed, and users can carry out different kinds of information by three levels of queries: (1) the basic search use the classical keywords or sequence identifiers; (2) the original advanced search engine allows to combine (by using logical operators) numerous criteria: (a) keywords (textual comparison) related to the pCDS's function, family domains and cellular localization; (b) physico-chemical characteristics (numerical comparison) such as isoelectric point or molecular weight and structural criteria such as the nucleic length or the number of transmembrane helix (TMH); (c) similarity scores with Escherichia coli and 10 species phylogenetically close to B. melitensis; (3) complex queries can be performed by using a SQL field, which allows all queries respecting the database's structure. The database is publicly available through a Web server at the following url: http://www.fundp.ac.be/urbm/bioinfo/aPAGe.

  4. Organellar Genomes of White Spruce (Picea glauca): Assembly and Annotation.

    PubMed

    Jackman, Shaun D; Warren, René L; Gibb, Ewan A; Vandervalk, Benjamin P; Mohamadi, Hamid; Chu, Justin; Raymond, Anthony; Pleasance, Stephen; Coope, Robin; Wildung, Mark R; Ritland, Carol E; Bousquet, Jean; Jones, Steven J M; Bohlmann, Joerg; Birol, Inanç

    2015-12-08

    The genome sequences of the plastid and mitochondrion of white spruce (Picea glauca) were assembled from whole-genome shotgun sequencing data using ABySS. The sequencing data contained reads from both the nuclear and organellar genomes, and reads of the organellar genomes were abundant in the data as each cell harbors hundreds of mitochondria and plastids. Hence, assembly of the 123-kb plastid and 5.9-Mb mitochondrial genomes were accomplished by analyzing data sets primarily representing low coverage of the nuclear genome. The assembled organellar genomes were annotated for their coding genes, ribosomal RNA, and transfer RNA. Transcript abundances of the mitochondrial genes were quantified in three developmental tissues and five mature tissues using data from RNA-seq experiments. C-to-U RNA editing was observed in the majority of mitochondrial genes, and in four genes, editing events were noted to modify ACG codons to create cryptic AUG start codons. The informatics methodology presented in this study should prove useful to assemble organellar genomes of other plant species using whole-genome shotgun sequencing data.

  5. Organellar Genomes of White Spruce (Picea glauca): Assembly and Annotation

    PubMed Central

    Jackman, Shaun D.; Warren, René L.; Gibb, Ewan A.; Vandervalk, Benjamin P.; Mohamadi, Hamid; Chu, Justin; Raymond, Anthony; Pleasance, Stephen; Coope, Robin; Wildung, Mark R.; Ritland, Carol E.; Bousquet, Jean; Jones, Steven J. M.; Bohlmann, Joerg; Birol, Inanç

    2016-01-01

    The genome sequences of the plastid and mitochondrion of white spruce (Picea glauca) were assembled from whole-genome shotgun sequencing data using ABySS. The sequencing data contained reads from both the nuclear and organellar genomes, and reads of the organellar genomes were abundant in the data as each cell harbors hundreds of mitochondria and plastids. Hence, assembly of the 123-kb plastid and 5.9-Mb mitochondrial genomes were accomplished by analyzing data sets primarily representing low coverage of the nuclear genome. The assembled organellar genomes were annotated for their coding genes, ribosomal RNA, and transfer RNA. Transcript abundances of the mitochondrial genes were quantified in three developmental tissues and five mature tissues using data from RNA-seq experiments. C-to-U RNA editing was observed in the majority of mitochondrial genes, and in four genes, editing events were noted to modify ACG codons to create cryptic AUG start codons. The informatics methodology presented in this study should prove useful to assemble organellar genomes of other plant species using whole-genome shotgun sequencing data. PMID:26645680

  6. RASTtk: A modular and extensible implementation of the RAST algorithm for building custom annotation pipelines and annotating batches of genomes

    DOE PAGES

    Brettin, Thomas; Davis, James J.; Disz, Terry; ...

    2015-02-10

    The RAST (Rapid Annotation using Subsystem Technology) annotation engine was built in 2008 to annotate bacterial and archaeal genomes. It works by offering a standard software pipeline for identifying genomic features (i.e., protein-encoding genes and RNA) and annotating their functions. Recently, in order to make RAST a more useful research tool and to keep pace with advancements in bioinformatics, it has become desirable to build a version of RAST that is both customizable and extensible. In this paper, we describe the RAST tool kit (RASTtk), a modular version of RAST that enables researchers to build custom annotation pipelines. RASTtk offersmore » a choice of software for identifying and annotating genomic features as well as the ability to add custom features to an annotation job. RASTtk also accommodates the batch submission of genomes and the ability to customize annotation protocols for batch submissions. This is the first major software restructuring of RAST since its inception.« less

  7. A Post-assembly genome-improvement toolkit (PAGIT) to obtain annotated genomes from contigs

    PubMed Central

    Swain, Martin T.; Tsai, Isheng J.; Assefa, Samual A.; Newbold, Chris; Berriman, Matthew; Otto, Thomas D.

    2013-01-01

    Genome projects now produce draft assemblies within weeks thanks to advanced high-throughput sequencing technologies. For milestone projects like E. coli or H. sapiens, teams of scientists were employed to manually curate and finish these genomes to a high standard. Nowadays, this is not feasible for most projects and the quality of genomes is generally of a much lower standard. This protocol describes software (PAGIT, post-assembly genome-improvement toolkit) to improve the quality of draft genomes. It offers flexible functionality to close gaps in scaffolds, correct base errors in the consensus sequence, and to exploit reference genomes (if available) for improving scaffolding and generating annotations. The protocol is most accessible for bacterial and small Eukaryotic genomes (up to 300 Mb), such as pathogenic bacteria, malaria and parasitic worms. Applying PAGIT to an E. coli assembly takes approximately 24 hours: it doubles the average contig size and annotates over 4300 gene models. PMID:22678431

  8. The power of EST sequence data: Relation to Acyrthosiphon pisum genome annotation and functional genomics initiatives

    Technology Transfer Automated Retrieval System (TEKTRAN)

    Genes important to aphid biology, survival and reproduction were successfully identified by use of a genomics approach. We created and described the Sequencing, compilation, and annotation of the approxiamtely 525Mb nuclear genome of the pea aphid, Acyrthosiphon pisum, which represents an important ...

  9. The Genomic CDS Sandbox: An Assessment Among Domain Experts

    PubMed Central

    Aziz, Ayesha; Kawamoto, Kensaku; Eilbeck, Karen; Williams, Marc S.; Freimuth, Robert R.; Hoffman, Mark A.; Rasmussen, Luke V.; Overby, Casey L.; Shirts, Brian H.; Hoffman, James M.; Welch, Brandon M.

    2016-01-01

    Genomics is a promising tool that is becoming more widely available to improve the care and treatment of individuals. While there is much assertion, genomics will most certainly require the use of clinical decision support (CDS) to be fully realized in the routine clinical setting. The National Human Genome Research Institute (NHGRI) of the National Institutes of Health recently convened an in-person, multi-day meeting on this topic. It was widely recognized that there is a need to promote the innovation and development of resources for genomic CDS such as a CDS sandbox. The purpose of this study was to evaluate a proposed approach for such a genomic CDS sandbox among domain experts and potential users. Survey results indicate a significant interest and desire for a genomic CDS sandbox environment among domain experts. These results will be used to guide the development of a genomic CDS sandbox. PMID:26778834

  10. xGDBvm: A Web GUI-Driven Workflow for Annotating Eukaryotic Genomes in the Cloud[OPEN

    PubMed Central

    Merchant, Nirav

    2016-01-01

    Genome-wide annotation of gene structure requires the integration of numerous computational steps. Currently, annotation is arguably best accomplished through collaboration of bioinformatics and domain experts, with broad community involvement. However, such a collaborative approach is not scalable at today’s pace of sequence generation. To address this problem, we developed the xGDBvm software, which uses an intuitive graphical user interface to access a number of common genome analysis and gene structure tools, preconfigured in a self-contained virtual machine image. Once their virtual machine instance is deployed through iPlant’s Atmosphere cloud services, users access the xGDBvm workflow via a unified Web interface to manage inputs, set program parameters, configure links to high-performance computing (HPC) resources, view and manage output, apply analysis and editing tools, or access contextual help. The xGDBvm workflow will mask the genome, compute spliced alignments from transcript and/or protein inputs (locally or on a remote HPC cluster), predict gene structures and gene structure quality, and display output in a public or private genome browser complete with accessory tools. Problematic gene predictions are flagged and can be reannotated using the integrated yrGATE annotation tool. xGDBvm can also be configured to append or replace existing data or load precomputed data. Multiple genomes can be annotated and displayed, and outputs can be archived for sharing or backup. xGDBvm can be adapted to a variety of use cases including de novo genome annotation, reannotation, comparison of different annotations, and training or teaching. PMID:27020957

  11. MarinegenomicsDB: an integrated genome viewer for community-based annotation of genomes.

    PubMed

    Koyanagi, Ryo; Takeuchi, Takeshi; Hisata, Kanako; Gyoja, Fuki; Shoguchi, Eiichi; Satoh, Nori; Kawashima, Takeshi

    2013-10-01

    We constructed a web-based genome annotation platform, MarinegenomicsDB, to integrate genome data from various marine organisms including the pearl oyster Pinctada fucata and the coral Acropora digitifera. This newly developed viewer application provides open access to published data and a user-friendly environment for community-based manual gene annotation. Development on a flexible framework enables easy expansion of the website on demand. To date, more than 2000 genes have been annotated using this system. In the future, the website will be expanded to host a wider variety of data, more species, and different types of genome-wide analyses. The website is available at the following URL: http://marinegenomics.oist.jp.

  12. Crowdsourcing image annotation for nucleus detection and segmentation in computational pathology: evaluating experts, automated methods, and the crowd.

    PubMed

    Irshad, H; Montaser-Kouhsari, L; Waltz, G; Bucur, O; Nowak, J A; Dong, F; Knoblauch, N W; Beck, A H

    2015-01-01

    The development of tools in computational pathology to assist physicians and biomedical scientists in the diagnosis of disease requires access to high-quality annotated images for algorithm learning and evaluation. Generating high-quality expert-derived annotations is time-consuming and expensive. We explore the use of crowdsourcing for rapidly obtaining annotations for two core tasks in com- putational pathology: nucleus detection and nucleus segmentation. We designed and implemented crowdsourcing experiments using the CrowdFlower platform, which provides access to a large set of labor channel partners that accesses and manages millions of contributors worldwide. We obtained annotations from four types of annotators and compared concordance across these groups. We obtained: crowdsourced annotations for nucleus detection and segmentation on a total of 810 images; annotations using automated methods on 810 images; annotations from research fellows for detection and segmentation on 477 and 455 images, respectively; and expert pathologist-derived annotations for detection and segmentation on 80 and 63 images, respectively. For the crowdsourced annotations, we evaluated performance across a range of contributor skill levels (1, 2, or 3). The crowdsourced annotations (4,860 images in total) were completed in only a fraction of the time and cost required for obtaining annotations using traditional methods. For the nucleus detection task, the research fellow-derived annotations showed the strongest concordance with the expert pathologist- derived annotations (F-M =93.68%), followed by the crowd-sourced contributor levels 1,2, and 3 and the automated method, which showed relatively similar performance (F-M = 87.84%, 88.49%, 87.26%, and 86.99%, respectively). For the nucleus segmentation task, the crowdsourced contributor level 3-derived annotations, research fellow-derived annotations, and automated method showed the strongest concordance with the expert pathologist

  13. CROWDSOURCING IMAGE ANNOTATION FOR NUCLEUS DETECTION AND SEGMENTATION IN COMPUTATIONAL PATHOLOGY: EVALUATING EXPERTS, AUTOMATED METHODS, AND THE CROWD

    PubMed Central

    Irshad, H.; Montaser-Kouhsari, L.; Waltz, G.; Bucur, O.; Nowak, J.A.; Dong, F.; Knoblauch, N.W.; Beck, A. H.

    2014-01-01

    The development of tools in computational pathology to assist physicians and biomedical scientists in the diagnosis of disease requires access to high-quality annotated images for algorithm learning and evaluation. Generating high-quality expert-derived annotations is time-consuming and expensive. We explore the use of crowdsourcing for rapidly obtaining annotations for two core tasks in computational pathology: nucleus detection and nucleus segmentation. We designed and implemented crowdsourcing experiments using the CrowdFlower platform, which provides access to a large set of labor channel partners that accesses and manages millions of contributors worldwide. We obtained annotations from four types of annotators and compared concordance across these groups. We obtained: crowdsourced annotations for nucleus detection and segmentation on a total of 810 images; annotations using automated methods on 810 images; annotations from research fellows for detection and segmentation on 477 and 455 images, respectively; and expert pathologist-derived annotations for detection and segmentation on 80 and 63 images, respectively. For the crowdsourced annotations, we evaluated performance across a range of contributor skill levels (1, 2, or 3). The crowdsourced annotations (4,860 images in total) were completed in only a fraction of the time and cost required for obtaining annotations using traditional methods. For the nucleus detection task, the research fellow-derived annotations showed the strongest concordance with the expert pathologist-derived annotations (F−M =93.68%), followed by the crowd-sourced contributor levels 1,2, and 3 and the automated method, which showed relatively similar performance (F−M = 87.84%, 88.49%, 87.26%, and 86.99%, respectively). For the nucleus segmentation task, the crowdsourced contributor level 3-derived annotations, research fellow-derived annotations, and automated method showed the strongest concordance with the expert pathologist

  14. Genome annotation in a community college cell biology lab.

    PubMed

    Beagley, C Timothy

    2013-01-01

    The Biology Department at Salt Lake Community College has used the IMG-ACT toolbox to introduce a genome mapping and annotation exercise into the laboratory portion of its Cell Biology course. This project provides students with an authentic inquiry-based learning experience while introducing them to computational biology and contemporary learning skills. Additionally, the project strengthens student understanding of the scientific method and contributes to student learning gains in curricular objectives centered around basic molecular biology, specifically, the Central Dogma. Importantly, inclusion of this project in the laboratory course provides students with a positive learning environment and allows for the use of cooperative learning strategies to increase overall student success.

  15. Annotation-based genome-wide SNP discovery in the large and complex Aegilops tauschii genome using next-generation sequencing without a reference genome sequence

    Technology Transfer Automated Retrieval System (TEKTRAN)

    An annotation-based, genome-wide SNP discovery pipeline is reported using NGS data for large and complex genomes without a reference genome sequence. Roche 454 shotgun reads with low genome coverage of one genotype are annotated in order to distinguish single-copy sequences and repeat junctions fr...

  16. Coordinated international action to accelerate genome-to-phenome with FAANG, The Functional Annotation of Animal Genomes project

    Technology Transfer Automated Retrieval System (TEKTRAN)

    We describe the organization of a nascent international effort - the "Functional Annotation of ANimal Genomes" project - whose aim is to produce comprehensive maps of functional elements in the genomes of domesticated animal species....

  17. Assessing the impact of comparative genomic sequence data on the functional annotation of the Drosophila genome

    PubMed Central

    Bergman, Casey M; Pfeiffer, Barret D; Rincón-Limas, Diego E; Hoskins, Roger A; Gnirke, Andreas; Mungall, Chris J; Wang, Adrienne M; Kronmiller, Brent; Pacleb, Joanne; Park, Soo; Stapleton, Mark; Wan, Kenneth; George, Reed A; de Jong, Pieter J; Botas, Juan; Rubin, Gerald M; Celniker, Susan E

    2002-01-01

    Background It is widely accepted that comparative sequence data can aid the functional annotation of genome sequences; however, the most informative species and features of genome evolution for comparison remain to be determined. Results We analyzed conservation in eight genomic regions (apterous, even-skipped, fushi tarazu, twist, and Rhodopsins 1, 2, 3 and 4) from four Drosophila species (D. erecta, D. pseudoobscura, D. willistoni, and D. littoralis) covering more than 500 kb of the D. melanogaster genome. All D. melanogaster genes (and 78-82% of coding exons) identified in divergent species such as D. pseudoobscura show evidence of functional constraint. Addition of a third species can reveal functional constraint in otherwise non-significant pairwise exon comparisons. Microsynteny is largely conserved, with rearrangement breakpoints, novel transposable element insertions, and gene transpositions occurring in similar numbers. Rates of amino-acid substitution are higher in uncharacterized genes relative to genes that have previously been studied. Conserved non-coding sequences (CNCSs) tend to be spatially clustered with conserved spacing between CNCSs, and clusters of CNCSs can be used to predict enhancer sequences. Conclusions Our results provide the basis for choosing species whose genome sequences would be most useful in aiding the functional annotation of coding and cis-regulatory sequences in Drosophila. Furthermore, this work shows how decoding the spatial organization of conserved sequences, such as the clustering of CNCSs, can complement efforts to annotate eukaryotic genomes on the basis of sequence conservation alone. PMID:12537575

  18. MitoFish and MitoAnnotator: a mitochondrial genome database of fish with an accurate and automatic annotation pipeline.

    PubMed

    Iwasaki, Wataru; Fukunaga, Tsukasa; Isagozawa, Ryota; Yamada, Koichiro; Maeda, Yasunobu; Satoh, Takashi P; Sado, Tetsuya; Mabuchi, Kohji; Takeshima, Hirohiko; Miya, Masaki; Nishida, Mutsumi

    2013-11-01

    Mitofish is a database of fish mitochondrial genomes (mitogenomes) that includes powerful and precise de novo annotations for mitogenome sequences. Fish occupy an important position in the evolution of vertebrates and the ecology of the hydrosphere, and mitogenomic sequence data have served as a rich source of information for resolving fish phylogenies and identifying new fish species. The importance of a mitogenomic database continues to grow at a rapid pace as massive amounts of mitogenomic data are generated with the advent of new sequencing technologies. A severe bottleneck seems likely to occur with regard to mitogenome annotation because of the overwhelming pace of data accumulation and the intrinsic difficulties in annotating sequences with degenerating transfer RNA structures, divergent start/stop codons of the coding elements, and the overlapping of adjacent elements. To ease this data backlog, we developed an annotation pipeline named MitoAnnotator. MitoAnnotator automatically annotates a fish mitogenome with a high degree of accuracy in approximately 5 min; thus, it is readily applicable to data sets of dozens of sequences. MitoFish also contains re-annotations of previously sequenced fish mitogenomes, enabling researchers to refer to them when they find annotations that are likely to be erroneous or while conducting comparative mitogenomic analyses. For users who need more information on the taxonomy, habitats, phenotypes, or life cycles of fish, MitoFish provides links to related databases. MitoFish and MitoAnnotator are freely available at http://mitofish.aori.u-tokyo.ac.jp/ (last accessed August 28, 2013); all of the data can be batch downloaded, and the annotation pipeline can be used via a web interface.

  19. MitoFish and MitoAnnotator: A Mitochondrial Genome Database of Fish with an Accurate and Automatic Annotation Pipeline

    PubMed Central

    Iwasaki, Wataru; Fukunaga, Tsukasa; Isagozawa, Ryota; Yamada, Koichiro; Maeda, Yasunobu; Satoh, Takashi P.; Sado, Tetsuya; Mabuchi, Kohji; Takeshima, Hirohiko; Miya, Masaki; Nishida, Mutsumi

    2013-01-01

    Mitofish is a database of fish mitochondrial genomes (mitogenomes) that includes powerful and precise de novo annotations for mitogenome sequences. Fish occupy an important position in the evolution of vertebrates and the ecology of the hydrosphere, and mitogenomic sequence data have served as a rich source of information for resolving fish phylogenies and identifying new fish species. The importance of a mitogenomic database continues to grow at a rapid pace as massive amounts of mitogenomic data are generated with the advent of new sequencing technologies. A severe bottleneck seems likely to occur with regard to mitogenome annotation because of the overwhelming pace of data accumulation and the intrinsic difficulties in annotating sequences with degenerating transfer RNA structures, divergent start/stop codons of the coding elements, and the overlapping of adjacent elements. To ease this data backlog, we developed an annotation pipeline named MitoAnnotator. MitoAnnotator automatically annotates a fish mitogenome with a high degree of accuracy in approximately 5 min; thus, it is readily applicable to data sets of dozens of sequences. MitoFish also contains re-annotations of previously sequenced fish mitogenomes, enabling researchers to refer to them when they find annotations that are likely to be erroneous or while conducting comparative mitogenomic analyses. For users who need more information on the taxonomy, habitats, phenotypes, or life cycles of fish, MitoFish provides links to related databases. MitoFish and MitoAnnotator are freely available at http://mitofish.aori.u-tokyo.ac.jp/ (last accessed August 28, 2013); all of the data can be batch downloaded, and the annotation pipeline can be used via a web interface. PMID:23955518

  20. The standard operating procedure of the DOE-JGI Microbial Genome Annotation Pipeline (MGAP v.4)

    DOE PAGES

    Huntemann, Marcel; Ivanova, Natalia N.; Mavromatis, Konstantinos; ...

    2015-10-26

    The DOE-JGI Microbial Genome Annotation Pipeline performs structural and functional annotation of microbial genomes that are further included into the Integrated Microbial Genome comparative analysis system. MGAP is applied to assembled nucleotide sequence datasets that are provided via the IMG submission site. Dataset submission for annotation first requires project and associated metadata description in GOLD. The MGAP sequence data processing consists of feature prediction including identification of protein-coding genes, non-coding RNAs and regulatory RNA features, as well as CRISPR elements. In conclusion, structural annotation is followed by assignment of protein product names and functions.

  1. The standard operating procedure of the DOE-JGI Microbial Genome Annotation Pipeline (MGAP v.4).

    PubMed

    Huntemann, Marcel; Ivanova, Natalia N; Mavromatis, Konstantinos; Tripp, H James; Paez-Espino, David; Palaniappan, Krishnaveni; Szeto, Ernest; Pillay, Manoj; Chen, I-Min A; Pati, Amrita; Nielsen, Torben; Markowitz, Victor M; Kyrpides, Nikos C

    2015-01-01

    The DOE-JGI Microbial Genome Annotation Pipeline performs structural and functional annotation of microbial genomes that are further included into the Integrated Microbial Genome comparative analysis system. MGAP is applied to assembled nucleotide sequence datasets that are provided via the IMG submission site. Dataset submission for annotation first requires project and associated metadata description in GOLD. The MGAP sequence data processing consists of feature prediction including identification of protein-coding genes, non-coding RNAs and regulatory RNA features, as well as CRISPR elements. Structural annotation is followed by assignment of protein product names and functions.

  2. Australian sea-floor survey data, with images and expert annotations.

    PubMed

    Bewley, Michael; Friedman, Ariell; Ferrari, Renata; Hill, Nicole; Hovey, Renae; Barrett, Neville; Pizarro, Oscar; Figueira, Will; Meyer, Lisa; Babcock, Russ; Bellchambers, Lynda; Byrne, Maria; Williams, Stefan B

    2015-01-01

    This Australian benthic data set (BENTHOZ-2015) consists of an expert-annotated set of georeferenced benthic images and associated sensor data, captured by an autonomous underwater vehicle (AUV) around Australia. This type of data is of interest to marine scientists studying benthic habitats and organisms. AUVs collect georeferenced images over an area with consistent illumination and altitude, and make it possible to generate broad scale, photo-realistic 3D maps. Marine scientists then typically spend several minutes on each of thousands of images, labeling substratum type and biota at a subset of points. Labels from four Australian research groups were combined using the CATAMI classification scheme, a hierarchical classification scheme based on taxonomy and morphology for scoring marine imagery. This data set consists of 407,968 expert labeled points from around the Australian coast, with associated images, geolocation and other sensor data. The robotic surveys that collected this data form part of Australia's Integrated Marine Observing System (IMOS) ongoing benthic monitoring program. There is reuse potential in marine science, robotics, and computer vision research.

  3. Australian sea-floor survey data, with images and expert annotations

    PubMed Central

    Bewley, Michael; Friedman, Ariell; Ferrari, Renata; Hill, Nicole; Hovey, Renae; Barrett, Neville; Pizarro, Oscar; Figueira, Will; Meyer, Lisa; Babcock, Russ; Bellchambers, Lynda; Byrne, Maria; Williams, Stefan B.

    2015-01-01

    This Australian benthic data set (BENTHOZ-2015) consists of an expert-annotated set of georeferenced benthic images and associated sensor data, captured by an autonomous underwater vehicle (AUV) around Australia. This type of data is of interest to marine scientists studying benthic habitats and organisms. AUVs collect georeferenced images over an area with consistent illumination and altitude, and make it possible to generate broad scale, photo-realistic 3D maps. Marine scientists then typically spend several minutes on each of thousands of images, labeling substratum type and biota at a subset of points. Labels from four Australian research groups were combined using the CATAMI classification scheme, a hierarchical classification scheme based on taxonomy and morphology for scoring marine imagery. This data set consists of 407,968 expert labeled points from around the Australian coast, with associated images, geolocation and other sensor data. The robotic surveys that collected this data form part of Australia's Integrated Marine Observing System (IMOS) ongoing benthic monitoring program. There is reuse potential in marine science, robotics, and computer vision research. PMID:26528396

  4. The discrepancies in the results of bioinformatics tools for genomic structural annotation

    NASA Astrophysics Data System (ADS)

    Pawełkowicz, Magdalena; Nowak, Robert; Osipowski, Paweł; Rymuszka, Jacek; Świerkula, Katarzyna; Wojcieszek, Michał; Przybecki, Zbigniew

    2014-11-01

    A major focus of sequencing project is to identify genes in genomes. However it is necessary to define the variety of genes and the criteria for identifying them. In this work we present discrepancies and dependencies from the application of different bioinformatic programs for structural annotation performed on the cucumber data set from Polish Consortium of Cucumber Genome Sequencing. We use Fgenesh, GenScan and GeneMark to automated structural annotation, the results have been compared to reference annotation.

  5. AGeS: A Software System for Microbial Genome Sequence Annotation

    DTIC Science & Technology

    2011-03-01

    source genome browser GBrowse [4] and a link to download annotated genomes in GenBank [14] format; and N free availability of the source code. Methods... ORPHEUS [43], WUBLAST [44], and FASTA [45] for predicting protein-coding regions and some InterPro databases for function annotation. In addition, Y

  6. GO-FAANG meeting: A gathering on functional annotation of animal genomes

    Technology Transfer Automated Retrieval System (TEKTRAN)

    The FAANG (Functional Annotation of Animal Genomes) Consortium recently held a Gathering On FAANG (GO-FAANG) Workshop in Washington, DC on October 7-8, 2015. This consortium is a grass-roots organization formed to advance the annotation of newly assembled genomes of non-model organisms (www.faang.or...

  7. GAMOLA2, a Comprehensive Software Package for the Annotation and Curation of Draft and Complete Microbial Genomes.

    PubMed

    Altermann, Eric; Lu, Jingli; McCulloch, Alan

    2017-01-01

    Expert curated annotation remains one of the critical steps in achieving a reliable biological relevant annotation. Here we announce the release of GAMOLA2, a user friendly and comprehensive software package to process, annotate and curate draft and complete bacterial, archaeal, and viral genomes. GAMOLA2 represents a wrapping tool to combine gene model determination, functional Blast, COG, Pfam, and TIGRfam analyses with structural predictions including detection of tRNAs, rRNA genes, non-coding RNAs, signal protein cleavage sites, transmembrane helices, CRISPR repeats and vector sequence contaminations. GAMOLA2 has already been validated in a wide range of bacterial and archaeal genomes, and its modular concept allows easy addition of further functionality in future releases. A modified and adapted version of the Artemis Genome Viewer (Sanger Institute) has been developed to leverage the additional features and underlying information provided by the GAMOLA2 analysis, and is part of the software distribution. In addition to genome annotations, GAMOLA2 features, among others, supplemental modules that assist in the creation of custom Blast databases, annotation transfers between genome versions, and the preparation of Genbank files for submission via the NCBI Sequin tool. GAMOLA2 is intended to be run under a Linux environment, whereas the subsequent visualization and manual curation in Artemis is mobile and platform independent. The development of GAMOLA2 is ongoing and community driven. New functionality can easily be added upon user requests, ensuring that GAMOLA2 provides information relevant to microbiologists. The software is available free of charge for academic use.

  8. GAMOLA2, a Comprehensive Software Package for the Annotation and Curation of Draft and Complete Microbial Genomes

    PubMed Central

    Altermann, Eric; Lu, Jingli; McCulloch, Alan

    2017-01-01

    Expert curated annotation remains one of the critical steps in achieving a reliable biological relevant annotation. Here we announce the release of GAMOLA2, a user friendly and comprehensive software package to process, annotate and curate draft and complete bacterial, archaeal, and viral genomes. GAMOLA2 represents a wrapping tool to combine gene model determination, functional Blast, COG, Pfam, and TIGRfam analyses with structural predictions including detection of tRNAs, rRNA genes, non-coding RNAs, signal protein cleavage sites, transmembrane helices, CRISPR repeats and vector sequence contaminations. GAMOLA2 has already been validated in a wide range of bacterial and archaeal genomes, and its modular concept allows easy addition of further functionality in future releases. A modified and adapted version of the Artemis Genome Viewer (Sanger Institute) has been developed to leverage the additional features and underlying information provided by the GAMOLA2 analysis, and is part of the software distribution. In addition to genome annotations, GAMOLA2 features, among others, supplemental modules that assist in the creation of custom Blast databases, annotation transfers between genome versions, and the preparation of Genbank files for submission via the NCBI Sequin tool. GAMOLA2 is intended to be run under a Linux environment, whereas the subsequent visualization and manual curation in Artemis is mobile and platform independent. The development of GAMOLA2 is ongoing and community driven. New functionality can easily be added upon user requests, ensuring that GAMOLA2 provides information relevant to microbiologists. The software is available free of charge for academic use. PMID:28386247

  9. Sequencing and annotation of the Ophiostoma ulmi genome

    PubMed Central

    2013-01-01

    Background The ascomycete fungus Ophiostoma ulmi was responsible for the initial pandemic of the massively destructive Dutch elm disease in Europe and North America in early 1910. Dutch elm disease has ravaged the elm tree population globally and is a major threat to the remaining elm population. O. ulmi is also associated with valuable biomaterials applications. It was recently discovered that proteins from O. ulmi can be used for efficient transformation of amylose in the production of bioplastics. Results We have sequenced the 31.5 Mb genome of O.ulmi using Illumina next generation sequencing. Applying both de novo and comparative genome annotation methods, we predict a total of 8639 gene models. The quality of the predicted genes was validated using a variety of data sources consisting of EST data, mRNA-seq data and orthologs from related fungal species. Sequence-based computational methods were used to identify candidate virulence-related genes. Metabolic pathways were reconstructed and highlight specific enzymes that may play a role in virulence. Conclusions This genome sequence will be a useful resource for further research aimed at understanding the molecular mechanisms of pathogenicity by O. ulmi. It will also facilitate the identification of enzymes necessary for industrial biotransformation applications. PMID:23496816

  10. Apollo: a community resource for genome annotation editing

    PubMed Central

    Ed, Lee; Nomi, Harris; Mark, Gibson; Raymond, Chetty; Suzanna, Lewis

    2009-01-01

    Summary: Apollo is a genome annotation-editing tool with an easy to use graphical interface. It is a component of the GMOD project, with ongoing development driven by the community. Recent additions to the software include support for the generic feature format version 3 (GFF3), continuous transcriptome data, a full Chado database interface, integration with remote services for on-the-fly BLAST and Primer BLAST analyses, graphical interfaces for configuring user preferences and full undo of all edit operations. Apollo's user community continues to grow, including its use as an educational tool for college and high-school students. Availability: Apollo is a Java application distributed under a free and open source license. Installers for Windows, Linux, Unix, Solaris and Mac OS X are available at http://apollo.berkeleybop.org, and the source code is available from the SourceForge CVS repository at http://gmod.cvs.sourceforge.net/gmod/apollo. Contact: elee@berkeleybop.org PMID:19439563

  11. VESPA: Software to Facilitate Genomic Annotation of Prokaryotic Organisms Through Integration of Proteomic and Transcriptomic Data

    SciTech Connect

    Peterson, Elena S.; McCue, Lee Ann; Rutledge, Alexandra C.; Jensen, Jeffrey L.; Walker, Julia; Kobold, Mark A.; Webb, Samantha R.; Payne, Samuel H.; Ansong, Charles; Adkins, Joshua N.; Cannon, William R.; Webb-Robertson, Bobbie-Jo M.

    2012-04-25

    Visual Exploration and Statistics to Promote Annotation (VESPA) is an interactive visual analysis software tool that facilitates the discovery of structural mis-annotations in prokaryotic genomes. VESPA integrates high-throughput peptide-centric proteomics data and oligo-centric or RNA-Seq transcriptomics data into a genomic context. The data may be interrogated via visual analysis across multiple levels of genomic resolution, linked searches, exports and interaction with BLAST to rapidly identify location of interest within the genome and evaluate potential mis-annotations.

  12. MEGANTE: a web-based system for integrated plant genome annotation.

    PubMed

    Numa, Hisataka; Itoh, Takeshi

    2014-01-01

    The recent advancement of high-throughput genome sequencing technologies has resulted in a considerable increase in demands for large-scale genome annotation. While annotation is a crucial step for downstream data analyses and experimental studies, this process requires substantial expertise and knowledge of bioinformatics. Here we present MEGANTE, a web-based annotation system that makes plant genome annotation easy for researchers unfamiliar with bioinformatics. Without any complicated configuration, users can perform genomic sequence annotations simply by uploading a sequence and selecting the species to query. MEGANTE automatically runs several analysis programs and integrates the results to select the appropriate consensus exon-intron structures and to predict open reading frames (ORFs) at each locus. Functional annotation, including a similarity search against known proteins and a functional domain search, are also performed for the predicted ORFs. The resultant annotation information is visualized with a widely used genome browser, GBrowse. For ease of analysis, the results can be downloaded in Microsoft Excel format. All of the query sequences and annotation results are stored on the server side so that users can access their own data from virtually anywhere on the web. The current release of MEGANTE targets 24 plant species from the Brassicaceae, Fabaceae, Musaceae, Poaceae, Salicaceae, Solanaceae, Rosaceae and Vitaceae families, and it allows users to submit a sequence up to 10 Mb in length and to save up to 100 sequences with the annotation information on the server. The MEGANTE web service is available at https://megante.dna.affrc.go.jp/.

  13. Genome-enabled prediction of quantitative traits in chickens using genomic annotation

    PubMed Central

    2014-01-01

    Background Genome-wide association studies have been deemed successful for identifying statistically associated genetic variants of large effects on complex traits. Past studies have found enrichment of trait-associated SNPs in functionally annotated regions, while depletion was reported for intergenic regions (IGR). However, no systematic examination of connections between genomic regions and predictive ability of complex phenotypes has been carried out. Results In this study, we partitioned SNPs based on their annotation to characterize genomic regions that deliver low and high predictive power for three broiler traits in chickens using a whole-genome approach. Additive genomic relationship kernels were constructed for each of the genic regions considered, and a kernel-based Bayesian ridge regression was employed as prediction machine. We found that the predictive performance for ultrasound area of breast meat from using genic regions marked by SNPs was consistently better than that from SNPs in IGR, while IGR tagged by SNPs were better than the genic regions for body weight and hen house egg production. We also noted that predictive ability delivered by the whole battery of markers was close to the best prediction achieved by one of the genomic regions. Conclusions Whole-genome regression methods use all available quality filtered SNPs into a model, contrary to accommodating only validated SNPs from exonic or coding regions. Our results suggest that, while differences among genomic regions in terms of predictive ability were observed, the whole-genome approach remains as a promising tool if interest is on prediction of complex traits. PMID:24502227

  14. snpGeneSets: An R Package for Genome-Wide Study Annotation

    PubMed Central

    Mei, Hao; Li, Lianna; Jiang, Fan; Simino, Jeannette; Griswold, Michael; Mosley, Thomas; Liu, Shijian

    2016-01-01

    Genome-wide studies (GWS) of SNP associations and differential gene expressions have generated abundant results; next-generation sequencing technology has further boosted the number of variants and genes identified. Effective interpretation requires massive annotation and downstream analysis of these genome-wide results, a computationally challenging task. We developed the snpGeneSets package to simplify annotation and analysis of GWS results. Our package integrates local copies of knowledge bases for SNPs, genes, and gene sets, and implements wrapper functions in the R language to enable transparent access to low-level databases for efficient annotation of large genomic data. The package contains functions that execute three types of annotations: (1) genomic mapping annotation for SNPs and genes and functional annotation for gene sets; (2) bidirectional mapping between SNPs and genes, and genes and gene sets; and (3) calculation of gene effect measures from SNP associations and performance of gene set enrichment analyses to identify functional pathways. We applied snpGeneSets to type 2 diabetes (T2D) results from the NHGRI genome-wide association study (GWAS) catalog, a Finnish GWAS, and a genome-wide expression study (GWES). These studies demonstrate the usefulness of snpGeneSets for annotating and performing enrichment analysis of GWS results. The package is open-source, free, and can be downloaded at: https://www.umc.edu/biostats_software/. PMID:27807048

  15. snpGeneSets: An R Package for Genome-Wide Study Annotation.

    PubMed

    Mei, Hao; Li, Lianna; Jiang, Fan; Simino, Jeannette; Griswold, Michael; Mosley, Thomas; Liu, Shijian

    2016-12-07

    Genome-wide studies (GWS) of SNP associations and differential gene expressions have generated abundant results; next-generation sequencing technology has further boosted the number of variants and genes identified. Effective interpretation requires massive annotation and downstream analysis of these genome-wide results, a computationally challenging task. We developed the snpGeneSets package to simplify annotation and analysis of GWS results. Our package integrates local copies of knowledge bases for SNPs, genes, and gene sets, and implements wrapper functions in the R language to enable transparent access to low-level databases for efficient annotation of large genomic data. The package contains functions that execute three types of annotations: (1) genomic mapping annotation for SNPs and genes and functional annotation for gene sets; (2) bidirectional mapping between SNPs and genes, and genes and gene sets; and (3) calculation of gene effect measures from SNP associations and performance of gene set enrichment analyses to identify functional pathways. We applied snpGeneSets to type 2 diabetes (T2D) results from the NHGRI genome-wide association study (GWAS) catalog, a Finnish GWAS, and a genome-wide expression study (GWES). These studies demonstrate the usefulness of snpGeneSets for annotating and performing enrichment analysis of GWS results. The package is open-source, free, and can be downloaded at: https://www.umc.edu/biostats_software/.

  16. Comparative genomics in cyprinids: common carp ESTs help the annotation of the zebrafish genome

    PubMed Central

    Christoffels, Alan; Bartfai, Richard; Srinivasan, Hamsa; Komen, Hans; Orban, Laszlo

    2006-01-01

    Background Automatic annotation of sequenced eukaryotic genomes integrates a combination of methodologies such as ab-initio methods and alignment of homologous genes and/or proteins. For example, annotation of the zebrafish genome within Ensembl relies heavily on available cDNA and protein sequences from two distantly related fish species and other vertebrates that have diverged several hundred million years ago. The scarcity of genomic information from other cyprinids provides the impetus to leverage EST collections to understand gene structures in this diverse teleost group. Results We have generated 6,050 ESTs from the differentiating testis of common carp (Cyprinus carpio) and clustered them with 9,303 non-gonadal ESTs from CarpBase as well as 1,317 ESTs and 652 common carp mRNAs from GenBank. Over 28% of the resulting 8,663 unique transcripts are exclusively testis-derived ESTs. Moreover, 974 of these transcripts did not match any sequence in the zebrafish or fathead minnow EST collection. A total of 1,843 unique common carp sequences could be stringently mapped to the zebrafish genome (version 5), of which 1,752 matched coding sequences of zebrafish genes with or without potential splice variants. We show that 91 common carp transcripts map to intergenic and intronic regions on the zebrafish genome assembly and regions annotated with non-teleost sequences. Interestingly, an additional 42 common carp transcripts indicate the potential presence of new splicing variants not found in zebrafish databases so far. The fact that common carp transcripts help the identification or confirmation of these coding regions in zebrafish exemplifies the usefulness of sequences from closely related species for the annotation of model genomes. We also demonstrate that 5' UTR sequences of common carp and zebrafish orthologs share a significant level of similarity based on preservation of motif arrangements for as many as 10 ab-initio motifs. Conclusion Our data show that there is

  17. AGeS: A Software System for Microbial Genome Sequence Annotation

    DTIC Science & Technology

    2011-03-07

    GBrowse [4] and a link to download annotated genomes in GenBank [14] format; and N free availability of the source code. Methods The AGeS system was...Repraze [39], HMMER [40], RFAM [41], and InterPro [42]. In the original annotation of Y. pestis CO92 [25], the Sanger Institute used ORPHEUS [43], WUBLAST

  18. Toward a standard in structural genome annotation for prokaryotes

    DOE PAGES

    Tripp, H. James; Sutton, Granger; White, Owen; ...

    2015-07-25

    In an effort to identify the best practice for finding genes in prokaryotic genomes and propose it as a standard for automated annotation pipelines, we collected 1,004,576 peptides from various publicly available resources, and these were used as a basis to evaluate various gene-calling methods. The peptides came from 45 bacterial replicons with an average GC content from 31 % to 74 %, biased toward higher GC content genomes. Automated, manual, and semi-manual methods were used to tally errors in three widely used gene calling methods, as evidenced by peptides mapped outside the boundaries of called genes. We found thatmore » the consensus set of identical genes predicted by the three methods constitutes only about 70 % of the genes predicted by each individual method (with start and stop required to coincide). Peptide data was useful for evaluating some of the differences between gene callers, but not reliable enough to make the results conclusive, due to limitations inherent in any proteogenomic study. A single, unambiguous, unanimous best practice did not emerge from this analysis, since the available proteomics data were not adequate to provide an objective measurement of differences in the accuracy between these methods. However, as a result of this study, software, reference data, and procedures have been better matched among participants, representing a step toward a much-needed standard. In the absence of sufficient amount of experimental data to achieve a universal standard, our recommendation is that any of these methods can be used by the community, as long as a single method is employed across all datasets to be compared.« less

  19. Toward a standard in structural genome annotation for prokaryotes

    SciTech Connect

    Tripp, H. James; Sutton, Granger; White, Owen; Wortman, Jennifer; Pati, Amrita; Mikhailova, Natalia; Ovchinnikova, Galina; Payne, Samuel H.; Kyrpides, Nikos C.; Ivanova, Natalia

    2015-07-25

    In an effort to identify the best practice for finding genes in prokaryotic genomes and propose it as a standard for automated annotation pipelines, we collected 1,004,576 peptides from various publicly available resources, and these were used as a basis to evaluate various gene-calling methods. The peptides came from 45 bacterial replicons with an average GC content from 31 % to 74 %, biased toward higher GC content genomes. Automated, manual, and semi-manual methods were used to tally errors in three widely used gene calling methods, as evidenced by peptides mapped outside the boundaries of called genes. We found that the consensus set of identical genes predicted by the three methods constitutes only about 70 % of the genes predicted by each individual method (with start and stop required to coincide). Peptide data was useful for evaluating some of the differences between gene callers, but not reliable enough to make the results conclusive, due to limitations inherent in any proteogenomic study. A single, unambiguous, unanimous best practice did not emerge from this analysis, since the available proteomics data were not adequate to provide an objective measurement of differences in the accuracy between these methods. However, as a result of this study, software, reference data, and procedures have been better matched among participants, representing a step toward a much-needed standard. In the absence of sufficient amount of experimental data to achieve a universal standard, our recommendation is that any of these methods can be used by the community, as long as a single method is employed across all datasets to be compared.

  20. Improvement of barley genome annotations by deciphering the Haruna Nijo genome.

    PubMed

    Sato, Kazuhiro; Tanaka, Tsuyoshi; Shigenobu, Shuji; Motoi, Yuka; Wu, Jianzhong; Itoh, Takeshi

    2016-02-01

    Full-length (FL) cDNA sequences provide the most reliable evidence for the presence of genes in genomes. In this report, detailed gene structures of barley, whole genome shotgun (WGS) and additional transcript data of the cultivar Haruna Nijo were quality controlled and compared with the published Morex genome information. Haruna Nijo scaffolds have longer total sequence length with much higher N50 and fewer sequences than those in Morex WGS contigs. The longer Haruna Nijo scaffolds provided efficient FLcDNA mapping, resulting in high coverage and detection of the transcription start sites. In combination with FLcDNAs and RNA-Seq data from four different tissue samples of Haruna Nijo, we identified 51,249 gene models on 30,606 loci. Overall sequence similarity between Haruna Nijo and Morex genome was 95.99%, while that of exon regions was higher (99.71%). These sequence and annotation data of Haruna Nijo are combined with Morex genome data and released from a genome browser. The genome sequence of Haruna Nijo may provide detailed gene structures in addition to the current Morex barley genome information.

  1. Proteogenomics: the needs and roles to be filled by proteomics in genome annotation

    SciTech Connect

    Ansong, Charles; Purvine, Samuel O.; Adkins, Joshua N.; Lipton, Mary S.; Smith, Richard D.

    2008-01-01

    While genome sequencing efforts reveal the basic building blocks of life, a genome sequence alone is insufficient for elucidating biological function. Genome annotation – the process of identifying genes and assigning function to each gene in a genome sequence – provides the means to elucidate biological function from sequence. Current state-of-the-art high throughput genome annotation uses a combination of comparative (sequence similarity data) and non-comparative (ab initio gene prediction algorithms) methods to identify open reading frames in genome sequences. Because approaches used to validate the presence of these open reading frames are typically based on the information derived from the annotated genomes, they cannot independently and unequivocally determine whether a predicted open reading frame is translated into a protein. With the ability to directly measure peptides arising from expressed proteins, high throughput liquid chromatography-tandem mass spectrometry-based proteomics, approaches can be used to verify coding regions of a genomic sequence. Here, we highlight several ways in which high throughput tandem mass spectrometry-based proteomics can improve the quality of genome annotations and suggest that it could be efficiently applied during the initial gene calling process so that the improvements are propagated through the subsequent functional annotation process.

  2. Predicting Shine–Dalgarno Sequence Locations Exposes Genome Annotation Errors

    PubMed Central

    Starmer, J; Stomp, A; Vouk, M; Bitzer, D

    2006-01-01

    In prokaryotes, Shine–Dalgarno (SD) sequences, nucleotides upstream from start codons on messenger RNAs (mRNAs) that are complementary to ribosomal RNA (rRNA), facilitate the initiation of protein synthesis. The location of SD sequences relative to start codons and the stability of the hybridization between the mRNA and the rRNA correlate with the rate of synthesis. Thus, accurate characterization of SD sequences enhances our understanding of how an organism's transcriptome relates to its cellular proteome. We implemented the Individual Nearest Neighbor Hydrogen Bond model for oligo–oligo hybridization and created a new metric, relative spacing (RS), to identify both the location and the hybridization potential of SD sequences by simulating the binding between mRNAs and single-stranded 16S rRNA 3′ tails. In 18 prokaryote genomes, we identified 2,420 genes out of 58,550 where the strongest binding in the translation initiation region included the start codon, deviating from the expected location for the SD sequence of five to ten bases upstream. We designated these as RS+1 genes. Additional analysis uncovered an unusual bias of the start codon in that the majority of the RS+1 genes used GUG, not AUG. Furthermore, of the 624 RS+1 genes whose SD sequence was associated with a free energy release of less than −8.4 kcal/mol (strong RS+1 genes), 384 were within 12 nucleotides upstream of in-frame initiation codons. The most likely explanation for the unexpected location of the SD sequence for these 384 genes is mis-annotation of the start codon. In this way, the new RS metric provides an improved method for gene sequence annotation. The remaining strong RS+1 genes appear to have their SD sequences in an unexpected location that includes the start codon. Thus, our RS metric provides a new way to explore the role of rRNA–mRNA nucleotide hybridization in translation initiation. PMID:16710451

  3. Genome-Wide Enzyme Annotation with Precision Control: Catalytic Families (CatFam) Databases

    DTIC Science & Technology

    2008-01-01

    classification, decision trees , association rules, neural networks,18 and support vector machines,19,20 to classify protein catalytic func- tions using various...genomes To evaluate the performance of CatFam for whole ge- nome annotation, we select two Yersinia genomes [Y. pes- tis mediaevails (ypm) and Y...pestis and F. tularensis Organism Annotation Number of enzyme-catalyzed reactions Number of predicted pathways Number of pathways with holes Yersinia

  4. Coding exon-structure aware realigner (CESAR) utilizes genome alignments for accurate comparative gene annotation

    PubMed Central

    Sharma, Virag; Elghafari, Anas; Hiller, Michael

    2016-01-01

    Identifying coding genes is an essential step in genome annotation. Here, we utilize existing whole genome alignments to detect conserved coding exons and then map gene annotations from one genome to many aligned genomes. We show that genome alignments contain thousands of spurious frameshifts and splice site mutations in exons that are truly conserved. To overcome these limitations, we have developed CESAR (Coding Exon-Structure Aware Realigner) that realigns coding exons, while considering reading frame and splice sites of each exon. CESAR effectively avoids spurious frameshifts in conserved genes and detects 91% of shifted splice sites. This results in the identification of thousands of additional conserved exons and 99% of the exons that lack inactivating mutations match real exons. Finally, to demonstrate the potential of using CESAR for comparative gene annotation, we applied it to 188 788 exons of 19 865 human genes to annotate human genes in 99 other vertebrates. These comparative gene annotations are available as a resource (http://bds.mpi-cbg.de/hillerlab/CESAR/). CESAR (https://github.com/hillerlab/CESAR/) can readily be applied to other alignments to accurately annotate coding genes in many other vertebrate and invertebrate genomes. PMID:27016733

  5. Annotation methods to develop and evaluate an expert system based on natural language processing in electronic medical records.

    PubMed

    Gicquel, Quentin; Tvardik, Nastassia; Bouvry, Côme; Kergourlay, Ivan; Bittar, André; Segond, Frédérique; Darmoni, Stefan; Metzger, Marie-Hélène

    2015-01-01

    The objective of the SYNODOS collaborative project was to develop a generic IT solution, combining a medical terminology server, a semantic analyser and a knowledge base. The goal of the project was to generate meaningful epidemiological data for various medical domains from the textual content of French medical records. In the context of this project, we built a care pathway oriented conceptual model and corresponding annotation method to develop and evaluate an expert system's knowledge base. The annotation method is based on a semi-automatic process, using a software application (MedIndex). This application exchanges with a cross-lingual multi-termino-ontology portal. The annotator selects the most appropriate medical code proposed for the medical concept in question by the multi-termino-ontology portal and temporally labels the medical concept according to the course of the medical event. This choice of conceptual model and annotation method aims to create a generic database of facts for the secondary use of electronic health records data.

  6. Annotating the Function of the Human Genome with Gene Ontology and Disease Ontology

    PubMed Central

    Hu, Yang; Zhou, Wenyang; Ren, Jun; Dong, Lixiang

    2016-01-01

    Increasing evidences indicated that function annotation of human genome in molecular level and phenotype level is very important for systematic analysis of genes. In this study, we presented a framework named Gene2Function to annotate Gene Reference into Functions (GeneRIFs), in which each functional description of GeneRIFs could be annotated by a text mining tool Open Biomedical Annotator (OBA), and each Entrez gene could be mapped to Human Genome Organisation Gene Nomenclature Committee (HGNC) gene symbol. After annotating all the records about human genes of GeneRIFs, 288,869 associations between 13,148 mRNAs and 7,182 terms, 9,496 associations between 948 microRNAs and 533 terms, and 901 associations between 139 long noncoding RNAs (lncRNAs) and 297 terms were obtained as a comprehensive annotation resource of human genome. High consistency of term frequency of individual gene (Pearson correlation = 0.6401, p = 2.2e − 16) and gene frequency of individual term (Pearson correlation = 0.1298, p = 3.686e − 14) in GeneRIFs and GOA shows our annotation resource is very reliable. PMID:27635398

  7. Annotating the Function of the Human Genome with Gene Ontology and Disease Ontology.

    PubMed

    Hu, Yang; Zhou, Wenyang; Ren, Jun; Dong, Lixiang; Wang, Yadong; Jin, Shuilin; Cheng, Liang

    2016-01-01

    Increasing evidences indicated that function annotation of human genome in molecular level and phenotype level is very important for systematic analysis of genes. In this study, we presented a framework named Gene2Function to annotate Gene Reference into Functions (GeneRIFs), in which each functional description of GeneRIFs could be annotated by a text mining tool Open Biomedical Annotator (OBA), and each Entrez gene could be mapped to Human Genome Organisation Gene Nomenclature Committee (HGNC) gene symbol. After annotating all the records about human genes of GeneRIFs, 288,869 associations between 13,148 mRNAs and 7,182 terms, 9,496 associations between 948 microRNAs and 533 terms, and 901 associations between 139 long noncoding RNAs (lncRNAs) and 297 terms were obtained as a comprehensive annotation resource of human genome. High consistency of term frequency of individual gene (Pearson correlation = 0.6401, p = 2.2e - 16) and gene frequency of individual term (Pearson correlation = 0.1298, p = 3.686e - 14) in GeneRIFs and GOA shows our annotation resource is very reliable.

  8. A statistical framework to predict functional non-coding regions in the human genome through integrated analysis of annotation data.

    PubMed

    Lu, Qiongshi; Hu, Yiming; Sun, Jiehuan; Cheng, Yuwei; Cheung, Kei-Hoi; Zhao, Hongyu

    2015-05-27

    Identifying functional regions in the human genome is a major goal in human genetics. Great efforts have been made to functionally annotate the human genome either through computational predictions, such as genomic conservation, or high-throughput experiments, such as the ENCODE project. These efforts have resulted in a rich collection of functional annotation data of diverse types that need to be jointly analyzed for integrated interpretation and annotation. Here we present GenoCanyon, a whole-genome annotation method that performs unsupervised statistical learning using 22 computational and experimental annotations thereby inferring the functional potential of each position in the human genome. With GenoCanyon, we are able to predict many of the known functional regions. The ability of predicting functional regions as well as its generalizable statistical framework makes GenoCanyon a unique and powerful tool for whole-genome annotation. The GenoCanyon web server is available at http://genocanyon.med.yale.edu.

  9. Accurate Transposable Element Annotation Is Vital When Analyzing New Genome Assemblies

    PubMed Central

    Platt, Roy N.; Blanco-Berdugo, Laura; Ray, David A.

    2016-01-01

    Transposable elements (TEs) are mobile genetic elements with the ability to replicate themselves throughout the host genome. In some taxa TEs reach copy numbers in hundreds of thousands and can occupy more than half of the genome. The increasing number of reference genomes from nonmodel species has begun to outpace efforts to identify and annotate TE content and methods that are used vary significantly between projects. Here, we demonstrate variation that arises in TE annotations when less than optimal methods are used. We found that across a variety of taxa, the ability to accurately identify TEs based solely on homology decreased as the phylogenetic distance between the queried genome and a reference increased. Next we annotated repeats using homology alone, as is often the case in new genome analyses, and a combination of homology and de novo methods as well as an additional manual curation step. Reannotation using these methods identified a substantial number of new TE subfamilies in previously characterized genomes, recognized a higher proportion of the genome as repetitive, and decreased the average genetic distance within TE families, implying recent TE accumulation. Finally, these finding—increased recognition of younger TEs—were confirmed via an analysis of the postman butterfly (Heliconius melpomene). These observations imply that complete TE annotation relies on a combination of homology and de novo–based repeat identification, manual curation, and classification and that relying on simple, homology-based methods is insufficient to accurately describe the TE landscape of a newly sequenced genome. PMID:26802115

  10. Accurate Transposable Element Annotation Is Vital When Analyzing New Genome Assemblies.

    PubMed

    Platt, Roy N; Blanco-Berdugo, Laura; Ray, David A

    2016-01-21

    Transposable elements (TEs) are mobile genetic elements with the ability to replicate themselves throughout the host genome. In some taxa TEs reach copy numbers in hundreds of thousands and can occupy more than half of the genome. The increasing number of reference genomes from nonmodel species has begun to outpace efforts to identify and annotate TE content and methods that are used vary significantly between projects. Here, we demonstrate variation that arises in TE annotations when less than optimal methods are used. We found that across a variety of taxa, the ability to accurately identify TEs based solely on homology decreased as the phylogenetic distance between the queried genome and a reference increased. Next we annotated repeats using homology alone, as is often the case in new genome analyses, and a combination of homology and de novo methods as well as an additional manual curation step. Reannotation using these methods identified a substantial number of new TE subfamilies in previously characterized genomes, recognized a higher proportion of the genome as repetitive, and decreased the average genetic distance within TE families, implying recent TE accumulation. Finally, these finding-increased recognition of younger TEs-were confirmed via an analysis of the postman butterfly (Heliconius melpomene). These observations imply that complete TE annotation relies on a combination of homology and de novo-based repeat identification, manual curation, and classification and that relying on simple, homology-based methods is insufficient to accurately describe the TE landscape of a newly sequenced genome.

  11. Using Microbial Genome Annotation as a Foundation for Collaborative Student Research

    ERIC Educational Resources Information Center

    Reed, Kelynne E.; Richardson, John M.

    2013-01-01

    We used the Integrated Microbial Genomes Annotation Collaboration Toolkit as a framework to incorporate microbial genomics research into a microbiology and biochemistry course in a way that promoted student learning of bioinformatics and research skills and emphasized teamwork and collaboration as evidenced through multiple assessment mechanisms.…

  12. Discovery and Characterization of Chromatin States for Systematic Annotation of the Human Genome

    NASA Astrophysics Data System (ADS)

    Ernst, Jason; Kellis, Manolis

    A plethora of epigenetic modifications have been described in the human genome and shown to play diverse roles in gene regulation, cellular differentiation and the onset of disease. Although individual modifications have been linked to the activity levels of various genetic functional elements, their combinatorial patterns are still unresolved and their potential for systematic de novo genome annotation remains untapped. Here, we use a multivariate Hidden Markov Model to reveal chromatin states in human T cells, based on recurrent and spatially coherent combinations of chromatin marks.We define 51 distinct chromatin states, including promoter-associated, transcription-associated, active intergenic, largescale repressed and repeat-associated states. Each chromatin state shows specific enrichments in functional annotations, sequence motifs and specific experimentally observed characteristics, suggesting distinct biological roles. This approach provides a complementary functional annotation of the human genome that reveals the genome-wide locations of diverse classes of epigenetic function.

  13. The DAWGPAWS pipeline for the annotation of genes and transposable elements in plant genomes

    PubMed Central

    Estill, James C; Bennetzen, Jeffrey L

    2009-01-01

    Background High quality annotation of the genes and transposable elements in complex genomes requires a human-curated integration of multiple sources of computational evidence. These evidences include results from a diversity of ab initio prediction programs as well as homology-based searches. Most of these programs operate on a single contiguous sequence at a time, and the results are generated in a diverse array of readable formats that must be translated to a standardized file format. These translated results must then be concatenated into a single source, and then presented in an integrated form for human curation. Results We have designed, implemented, and assessed a Perl-based workflow named DAWGPAWS for the generation of computational results for human curation of the genes and transposable elements in plant genomes. The use of DAWGPAWS was found to accelerate annotation of 80–200 kb wheat DNA inserts in bacterial artificial chromosome (BAC) vectors by approximately twenty-fold and to also significantly improve the quality of the annotation in terms of completeness and accuracy. Conclusion The DAWGPAWS genome annotation pipeline fills an important need in the annotation of plant genomes by generating computational evidences in a high throughput manner, translating these results to a common file format, and facilitating the human curation of these computational results. We have verified the value of DAWGPAWS by using this pipeline to annotate the genes and transposable elements in 220 BAC insertions from the hexaploid wheat genome (Triticum aestivum L.). DAWGPAWS can be applied to annotation efforts in other plant genomes with minor modifications of program-specific configuration files, and the modular design of the workflow facilitates integration into existing pipelines. PMID:19545381

  14. A spectral approach integrating functional genomic annotations for coding and noncoding variants.

    PubMed

    Ionita-Laza, Iuliana; McCallum, Kenneth; Xu, Bin; Buxbaum, Joseph D

    2016-02-01

    Over the past few years, substantial effort has been put into the functional annotation of variation in human genome sequences. Such annotations can have a critical role in identifying putatively causal variants for a disease or trait among the abundant natural variation that occurs at a locus of interest. The main challenges in using these various annotations include their large numbers and their diversity. Here we develop an unsupervised approach to integrate these different annotations into one measure of functional importance (Eigen) that, unlike most existing methods, is not based on any labeled training data. We show that the resulting meta-score has better discriminatory ability using disease-associated and putatively benign variants from published studies (in both coding and noncoding regions) than the recently proposed CADD score. Across varied scenarios, the Eigen score performs generally better than any single individual annotation, representing a powerful single functional score that can be incorporated in fine-mapping studies.

  15. Creating reference gene annotation for the mouse C57BL6/J genome assembly.

    PubMed

    Mudge, Jonathan M; Harrow, Jennifer

    2015-10-01

    Annotation on the reference genome of the C57BL6/J mouse has been an ongoing project ever since the draft genome was first published. Initially, the principle focus was on the identification of all protein-coding genes, although today the importance of describing long non-coding RNAs, small RNAs, and pseudogenes is recognized. Here, we describe the progress of the GENCODE mouse annotation project, which combines manual annotation from the HAVANA group with Ensembl computational annotation, alongside experimental and in silico validation pipelines from other members of the consortium. We discuss the more recent incorporation of next-generation sequencing datasets into this workflow, including the usage of mass-spectrometry data to potentially identify novel protein-coding genes. Finally, we will outline how the C57BL6/J genebuild can be used to gain insights into the variant sites that distinguish different mouse strains and species.

  16. RefSeq microbial genomes database: new representation and annotation strategy

    PubMed Central

    Tatusova, Tatiana; Ciufo, Stacy; Fedorov, Boris; O’Neill, Kathleen; Tolstoy, Igor

    2014-01-01

    The source of the microbial genomic sequences in the RefSeq collection is the set of primary sequence records submitted to the International Nucleotide Sequence Database public archives. These can be accessed through the Entrez search and retrieval system at http://www.ncbi.nlm.nih.gov/genome. Next-generation sequencing has enabled researchers to perform genomic sequencing at rates that were unimaginable in the past. Microbial genomes can now be sequenced in a matter of hours, which has led to a significant increase in the number of assembled genomes deposited in the public archives. This huge increase in DNA sequence data presents new challenges for the annotation, analysis and visualization bioinformatics tools. New strategies have been developed for the annotation and representation of reference genomes and sequence variations derived from population studies and clinical outbreaks. PMID:24316578

  17. RefSeq microbial genomes database: new representation and annotation strategy.

    PubMed

    Tatusova, Tatiana; Ciufo, Stacy; Fedorov, Boris; O'Neill, Kathleen; Tolstoy, Igor

    2014-01-01

    The source of the microbial genomic sequences in the RefSeq collection is the set of primary sequence records submitted to the International Nucleotide Sequence Database public archives. These can be accessed through the Entrez search and retrieval system at http://www.ncbi.nlm.nih.gov/genome. Next-generation sequencing has enabled researchers to perform genomic sequencing at rates that were unimaginable in the past. Microbial genomes can now be sequenced in a matter of hours, which has led to a significant increase in the number of assembled genomes deposited in the public archives. This huge increase in DNA sequence data presents new challenges for the annotation, analysis and visualization bioinformatics tools. New strategies have been developed for the annotation and representation of reference genomes and sequence variations derived from population studies and clinical outbreaks.

  18. Companion: a web server for annotation and analysis of parasite genomes

    PubMed Central

    Steinbiss, Sascha; Silva-Franco, Fatima; Brunk, Brian; Foth, Bernardo; Hertz-Fowler, Christiane; Berriman, Matthew; Otto, Thomas D.

    2016-01-01

    Currently available sequencing technologies enable quick and economical sequencing of many new eukaryotic parasite (apicomplexan or kinetoplastid) species or strains. Compared to SNP calling approaches, de novo assembly of these genomes enables researchers to additionally determine insertion, deletion and recombination events as well as to detect complex sequence diversity, such as that seen in variable multigene families. However, there currently are no automated eukaryotic annotation pipelines offering the required range of results to facilitate such analyses. A suitable pipeline needs to perform evidence-supported gene finding as well as functional annotation and pseudogene detection up to the generation of output ready to be submitted to a public database. Moreover, no current tool includes quick yet informative comparative analyses and a first pass visualization of both annotation and analysis results. To overcome those needs we have developed the Companion web server (http://companion.sanger.ac.uk) providing parasite genome annotation as a service using a reference-based approach. We demonstrate the use and performance of Companion by annotating two Leishmania and Plasmodium genomes as typical parasite cases and evaluate the results compared to manually annotated references. PMID:27105845

  19. DFAST and DAGA: web-based integrated genome annotation tools and resources

    PubMed Central

    TANIZAWA, Yasuhiro; FUJISAWA, Takatomo; KAMINUMA, Eli; NAKAMURA, Yasukazu; ARITA, Masanori

    2016-01-01

    Quality assurance and correct taxonomic affiliation of data submitted to public sequence databases have been an everlasting problem. The DDBJ Fast Annotation and Submission Tool (DFAST) is a newly developed genome annotation pipeline with quality and taxonomy assessment tools. To enable annotation of ready-to-submit quality, we also constructed curated reference protein databases tailored for lactic acid bacteria. DFAST was developed so that all the procedures required for DDBJ submission could be done seamlessly online. The online workspace would be especially useful for users not familiar with bioinformatics skills. In addition, we have developed a genome repository, DFAST Archive of Genome Annotation (DAGA), which currently includes 1,421 genomes covering 179 species and 18 subspecies of two genera, Lactobacillus and Pediococcus, obtained from both DDBJ/ENA/GenBank and Sequence Read Archive (SRA). All the genomes deposited in DAGA were annotated consistently and assessed using DFAST. To assess the taxonomic position based on genomic sequence information, we used the average nucleotide identity (ANI), which showed high discriminative power to determine whether two given genomes belong to the same species. We corrected mislabeled or misidentified genomes in the public database and deposited the curated information in DAGA. The repository will improve the accessibility and reusability of genome resources for lactic acid bacteria. By exploiting the data deposited in DAGA, we found intraspecific subgroups in Lactobacillus gasseri and Lactobacillus jensenii, whose variation between subgroups is larger than the well-accepted ANI threshold of 95% to differentiate species. DFAST and DAGA are freely accessible at https://dfast.nig.ac.jp. PMID:27867804

  20. DFAST and DAGA: web-based integrated genome annotation tools and resources.

    PubMed

    Tanizawa, Yasuhiro; Fujisawa, Takatomo; Kaminuma, Eli; Nakamura, Yasukazu; Arita, Masanori

    2016-01-01

    Quality assurance and correct taxonomic affiliation of data submitted to public sequence databases have been an everlasting problem. The DDBJ Fast Annotation and Submission Tool (DFAST) is a newly developed genome annotation pipeline with quality and taxonomy assessment tools. To enable annotation of ready-to-submit quality, we also constructed curated reference protein databases tailored for lactic acid bacteria. DFAST was developed so that all the procedures required for DDBJ submission could be done seamlessly online. The online workspace would be especially useful for users not familiar with bioinformatics skills. In addition, we have developed a genome repository, DFAST Archive of Genome Annotation (DAGA), which currently includes 1,421 genomes covering 179 species and 18 subspecies of two genera, Lactobacillus and Pediococcus, obtained from both DDBJ/ENA/GenBank and Sequence Read Archive (SRA). All the genomes deposited in DAGA were annotated consistently and assessed using DFAST. To assess the taxonomic position based on genomic sequence information, we used the average nucleotide identity (ANI), which showed high discriminative power to determine whether two given genomes belong to the same species. We corrected mislabeled or misidentified genomes in the public database and deposited the curated information in DAGA. The repository will improve the accessibility and reusability of genome resources for lactic acid bacteria. By exploiting the data deposited in DAGA, we found intraspecific subgroups in Lactobacillus gasseri and Lactobacillus jensenii, whose variation between subgroups is larger than the well-accepted ANI threshold of 95% to differentiate species. DFAST and DAGA are freely accessible at https://dfast.nig.ac.jp.

  1. GFFview: A Web Server for Parsing and Visualizing Annotation Information of Eukaryotic Genome.

    PubMed

    Deng, Feilong; Chen, Shi-Yi; Wu, Zhou-Lin; Hu, Yongsong; Jia, Xianbo; Lai, Song-Jia

    2017-03-29

    Owing to wide application of RNA sequencing (RNA-seq) technology, more and more eukaryotic genomes have been extensively annotated, such as the gene structure, alternative splicing, and noncoding loci. Annotation information of genome is prevalently stored as plain text in General Feature Format (GFF), which could be hundreds or thousands Mb in size. Therefore, it is a challenge for manipulating GFF file for biologists who have no bioinformatic skill. In this study, we provide a web server (GFFview) for parsing the annotation information of eukaryotic genome and then generating statistical description of six indices for visualization. GFFview is very useful for investigating quality and difference of the de novo assembled transcriptome in RNA-seq studies.

  2. Enhancing Structural Annotation of Yeast Genomes with RNA-Seq Data.

    PubMed

    Devillers, Hugo; Morin, Nicolas; Neuvéglise, Cécile

    2016-01-01

    The number of fully sequenced genomes of yeasts is dramatically increasing but both structural and functional annotation quality are usually neglected, as most frequently based on automatic annotation transfer tools from reference genomes. RNA sequencing technologies offer the possibility to better characterize yeast transcriptomes and to correct or improve the prediction of mRNA, ncRNA, or miscellaneous RNA. We describe a computational approach to enhance structural annotation of yeast genomes based on RNA-Seq data exploitation. The proposed pipeline is primarily based on read mapping with TopHat2. Mapping outputs are then used for various applications such as: (1) validation of exon-exon junctions of predicted transcripts, (2) definition of new transcribed features, (3) prediction of 3' UTR, and (4) identification of extra features absent from the genome assembly. We strongly encourage curators to proceed to a manual validation and editing of the reference genome. Releasing genomes with high-quality annotation is an important issue, as they will be considered as references for further predictions.

  3. Genome-wide functional annotation of Phomopsis longicolla isolate MSPL 10-6.

    PubMed

    Darwish, Omar; Li, Shuxian; Matthews, Benjamin; Alkharouf, Nadim

    2016-06-01

    Phomopsis seed decay of soybean is caused primarily by the seed-borne fungal pathogen Phomopsis longicolla (syn. Diaporthe longicolla). This disease severely decreases soybean seed quality, reduces seedling vigor and stand establishment, and suppresses yield. It is one of the most economically important soybean diseases. In this study we annotated the entire genome of P. longicolla isolate MSPL 10-6, which was isolated from field-grown soybean seed in Mississippi, USA. This study represents the first reported genome-wide functional annotation of a seed borne fungal pathogen in the Diaporthe-Phomopsis complex. The P. longicolla genome annotation will enable research into the genetic basis of fungal infection of soybean seed and provide information for the study of soybean-fungal interactions. The genome annotation will also be a valuable resource for the research and agricultural communities. It will aid in the development of new control strategies for this pathogen. The annotations can be found from: http://bioinformatics.towson.edu/phomopsis_longicolla/download.html. NCBI accession number is: AYRD00000000.

  4. Genome assembly and annotation of a Drosophila simulans strain from Madagascar.

    PubMed

    Palmieri, Nicola; Nolte, Viola; Chen, Jun; Schlötterer, Christian

    2015-03-01

    Drosophila simulans is a close relative of the genetic model D. melanogaster. Its worldwide distribution in combination with the absence of segregating chromosomal inversions makes this species an increasingly attractive model to study the molecular signatures of adaptation in natural and experimental populations. In an effort to improve the genomic resources for D. simulans, we assembled and annotated the genome of a strain originating from Madagascar (M252), the ancestral range of D. simulans. The comparison of the M252 genome to other available D. simulans assemblies confirmed its high quality, but also highlighted genomic regions that are difficult to assemble with NGS data. The annotation of M252 provides a clear improvement with alternative splicing for 52% of the multiple-exon genes, UTRs for 70% of the genes, 225 novel genes and 781 pseudogenes being reported. We anticipate that the M252 genome will be a valuable resource for many research questions.

  5. Annotation of the zebrafish genome through an integrated transcriptomic and proteomic analysis.

    PubMed

    Kelkar, Dhanashree S; Provost, Elayne; Chaerkady, Raghothama; Muthusamy, Babylakshmi; Manda, Srikanth S; Subbannayya, Tejaswini; Selvan, Lakshmi Dhevi N; Wang, Chieh-Huei; Datta, Keshava K; Woo, Sunghee; Dwivedi, Sutopa B; Renuse, Santosh; Getnet, Derese; Huang, Tai-Chung; Kim, Min-Sik; Pinto, Sneha M; Mitchell, Christopher J; Madugundu, Anil K; Kumar, Praveen; Sharma, Jyoti; Advani, Jayshree; Dey, Gourav; Balakrishnan, Lavanya; Syed, Nazia; Nanjappa, Vishalakshi; Subbannayya, Yashwanth; Goel, Renu; Prasad, T S Keshava; Bafna, Vineet; Sirdeshmukh, Ravi; Gowda, Harsha; Wang, Charles; Leach, Steven D; Pandey, Akhilesh

    2014-11-01

    Accurate annotation of protein-coding genes is one of the primary tasks upon the completion of whole genome sequencing of any organism. In this study, we used an integrated transcriptomic and proteomic strategy to validate and improve the existing zebrafish genome annotation. We undertook high-resolution mass-spectrometry-based proteomic profiling of 10 adult organs, whole adult fish body, and two developmental stages of zebrafish (SAT line), in addition to transcriptomic profiling of six organs. More than 7,000 proteins were identified from proteomic analyses, and ∼ 69,000 high-confidence transcripts were assembled from the RNA sequencing data. Approximately 15% of the transcripts mapped to intergenic regions, the majority of which are likely long non-coding RNAs. These high-quality transcriptomic and proteomic data were used to manually reannotate the zebrafish genome. We report the identification of 157 novel protein-coding genes. In addition, our data led to modification of existing gene structures including novel exons, changes in exon coordinates, changes in frame of translation, translation in annotated UTRs, and joining of genes. Finally, we discovered four instances of genome assembly errors that were supported by both proteomic and transcriptomic data. Our study shows how an integrative analysis of the transcriptome and the proteome can extend our understanding of even well-annotated genomes.

  6. MAKER-P: a tool kit for the rapid creation, management, and quality control of plant genome annotations.

    PubMed

    Campbell, Michael S; Law, MeiYee; Holt, Carson; Stein, Joshua C; Moghe, Gaurav D; Hufnagel, David E; Lei, Jikai; Achawanantakun, Rujira; Jiao, Dian; Lawrence, Carolyn J; Ware, Doreen; Shiu, Shin-Han; Childs, Kevin L; Sun, Yanni; Jiang, Ning; Yandell, Mark

    2014-02-01

    We have optimized and extended the widely used annotation engine MAKER in order to better support plant genome annotation efforts. New features include better parallelization for large repeat-rich plant genomes, noncoding RNA annotation capabilities, and support for pseudogene identification. We have benchmarked the resulting software tool kit, MAKER-P, using the Arabidopsis (Arabidopsis thaliana) and maize (Zea mays) genomes. Here, we demonstrate the ability of the MAKER-P tool kit to automatically update, extend, and revise the Arabidopsis annotations in light of newly available data and to annotate pseudogenes and noncoding RNAs absent from The Arabidopsis Informatics Resource 10 build. Our results demonstrate that MAKER-P can be used to manage and improve the annotations of even Arabidopsis, perhaps the best-annotated plant genome. We have also installed and benchmarked MAKER-P on the Texas Advanced Computing Center. We show that this public resource can de novo annotate the entire Arabidopsis and maize genomes in less than 3 h and produce annotations of comparable quality to those of the current The Arabidopsis Information Resource 10 and maize V2 annotation builds.

  7. High-throughput comparison, functional annotation, and metabolic modeling of plant genomes using the PlantSEED resource

    Technology Transfer Automated Retrieval System (TEKTRAN)

    The increasing number of sequenced plant genomes is placing new demands on the methods applied to analyze, annotate, and model these genomes. Today's annotation pipelines result in inconsistent gene assignments that complicate comparative analyses and prevent efficient construction of metabolic mode...

  8. Whole genome de novo sequencing and genome annotation of the world popular cultivated edible mushroom, Lentinula edodes.

    PubMed

    Shim, Donghwan; Park, Sin-Gi; Kim, Kangmin; Bae, Wonsil; Lee, Gir Won; Ha, Byeong-Suk; Ro, Hyeon-Su; Kim, Myungkil; Ryoo, Rhim; Rhee, Sung-Keun; Nou, Ill-Sup; Koo, Chang-Duck; Hong, Chang Pyo; Ryu, Hojin

    2016-04-10

    Lentinula edodes, the popular shiitake mushroom, is one of the most important cultivated edible mushrooms. It is used as a food and for medicinal purposes. Here, we present the 46.1 Mb draft genome of L. edodes, comprising 13,028 predicted gene models. The genome assembly consists of 31 scaffolds. Gene annotation provides key information about various signaling pathways and secondary metabolites. This genomic information should help establish the molecular genetic markers for MAS/MAB and increase our understanding of the genome structure and function.

  9. Annotation inconsistencies beyond sequence similarity-based function prediction - phylogeny and genome structure.

    PubMed

    Promponas, Vasilis J; Iliopoulos, Ioannis; Ouzounis, Christos A

    2015-01-01

    The function annotation process in computational biology has increasingly shifted from the traditional characterization of individual biochemical roles of protein molecules to the system-wide detection of entire metabolic pathways and genomic structures. The so-called genome-aware methods broaden misannotation inconsistencies in genome sequences beyond protein function assignments, encompassing phylogenetic anomalies and artifactual genomic regions. We outline three categories of error propagation in databases by providing striking examples - at various levels of appreciation by the community from traditional to emerging, thus raising awareness for future solutions.

  10. ArrayPlex: distributed, interactive and programmatic access to genome sequence, annotation, ontology, and analytical toolsets

    PubMed Central

    Killion, Patrick J; Iyer, Vishwanath R

    2008-01-01

    ArrayPlex is a software package that centrally provides a large number of flexible toolsets useful for functional genomics, including microarray data storage, quality assessments, data visualization, gene annotation retrieval, statistical tests, genomic sequence retrieval and motif analysis. It uses a client-server architecture based on open source components, provides graphical, command-line, and programmatic access to all needed resources, and is extensible by virtue of a documented application programming interface. ArrayPlex is available at . PMID:19014503

  11. Updated Reference Genome Sequence and Annotation of Mycobacterium bovis AF2122/97

    PubMed Central

    Farrell, Damien; Stuber, Tod P.; Schubert, Olga T.; Aebersold, Ruedi; Robbe-Austerman, Suelee

    2017-01-01

    ABSTRACT We report here an update to the reference genome sequence of the bovine tuberculosis bacillus Mycobacterium bovis AF2122/97, generated using an integrative multiomics approach. The update includes 42 new coding sequences (CDSs), 14 modified annotations, 26 single-nucleotide polymorphism (SNP) corrections, and disclosure that the RD900 locus, previously described as absent from the genome, is in fact present. PMID:28385856

  12. The Development of PIPA: An Integrated and Automated Pipeline for Genome-Wide Protein Function Annotation

    DTIC Science & Technology

    2008-01-25

    bacterial genome annotation. Nucleic Acids Res 2005/06/28 edition. 2005, 33(Web Server issue):W455-9. 14. Meyer F, Goesmann A, McHardy AC, Bartels D...GO.evi dence.shtml] 37. Jain AK Murthy MN, Flynn PJ: Data Clustering: A Review. ACM Computing Surveys 1999, 31(3):264-323. 38. Chenna R, Sugawara H

  13. Toward an Upgraded Honey Bee (Apis mellifera L.) Genome Annotation Using Proteogenomics.

    PubMed

    McAfee, Alison; Harpur, Brock A; Michaud, Sarah; Beavis, Ronald C; Kent, Clement F; Zayed, Amro; Foster, Leonard J

    2016-02-05

    The honey bee is a key pollinator in agricultural operations as well as a model organism for studying the genetics and evolution of social behavior. The Apis mellifera genome has been sequenced and annotated twice over, enabling proteomics and functional genomics methods for probing relevant aspects of their biology. One troubling trend that emerged from proteomic analyses is that honey bee peptide samples consistently result in lower peptide identification rates compared with other organisms. This suggests that the genome annotation can be improved, or atypical biological processes are interfering with the mass spectrometry workflow. First, we tested whether high levels of polymorphisms could explain some of the missed identifications by searching spectra against the reference proteome (OGSv3.2) versus a customized proteome of a single honey bee, but our results indicate that this contribution was minor. Likewise, error-tolerant peptide searches lead us to eliminate unexpected post-translational modifications as a major factor in missed identifications. We then used a proteogenomic approach with ~1500 raw files to search for missing genes and new exons, to revive discarded annotations and to identify over 2000 new coding regions. These results will contribute to a more comprehensive genome annotation and facilitate continued research on this important insect.

  14. Genome Sequence and Annotation of Colletotrichum higginsianum, a Causal Agent of Crucifer Anthracnose Disease

    PubMed Central

    Zampounis, Antonios; Pigné, Sandrine; Dallery, Jean-Félix; Wittenberg, Alexander H. J.; Zhou, Shiguo; Schwartz, David C.

    2016-01-01

    Colletotrichum higginsianum is an ascomycete fungus causing anthracnose disease on numerous cultivated plants in the family Brassicaceae, as well as the model plant Arabidopsis thaliana. We report an assembly of the nuclear genome and gene annotation of this pathogen, which was obtained using a combination of PacBio long-read sequencing and optical mapping. PMID:27540062

  15. Draft Genome Sequence and Gene Annotation of Stemphylium lycopersici Strain CIDEFI-216.

    PubMed

    Franco, Mario E E; López, Silvina; Medina, Rocio; Saparrat, Mario C N; Balatti, Pedro

    2015-09-24

    Stemphylium lycopersici is a plant-pathogenic fungus that is widely distributed throughout the world. In tomatoes, it is one of the etiological agents of gray leaf spot disease. Here, we report the first draft genome sequence of S. lycopersici, including its gene structure and functional annotation.

  16. Draft Genome Sequence and Gene Annotation of Stemphylium lycopersici Strain CIDEFI-216

    PubMed Central

    Franco, Mario E. E.; López, Silvina; Medina, Rocio; Saparrat, Mario C. N.

    2015-01-01

    Stemphylium lycopersici is a plant-pathogenic fungus that is widely distributed throughout the world. In tomatoes, it is one of the etiological agents of gray leaf spot disease. Here, we report the first draft genome sequence of S. lycopersici, including its gene structure and functional annotation. PMID:26404600

  17. Gene3D: Multi-domain annotations for protein sequence and comparative genome analysis.

    PubMed

    Lees, Jonathan G; Lee, David; Studer, Romain A; Dawson, Natalie L; Sillitoe, Ian; Das, Sayoni; Yeats, Corin; Dessailly, Benoit H; Rentzsch, Robert; Orengo, Christine A

    2014-01-01

    Gene3D (http://gene3d.biochem.ucl.ac.uk) is a database of protein domain structure annotations for protein sequences. Domains are predicted using a library of profile HMMs from 2738 CATH superfamilies. Gene3D assigns domain annotations to Ensembl and UniProt sequence sets including >6000 cellular genomes and >20 million unique protein sequences. This represents an increase of 45% in the number of protein sequences since our last publication. Thanks to improvements in the underlying data and pipeline, we see large increases in the domain coverage of sequences. We have expanded this coverage by integrating Pfam and SUPERFAMILY domain annotations, and we now resolve domain overlaps to provide highly comprehensive composite multi-domain architectures. To make these data more accessible for comparative genome analyses, we have developed novel search algorithms for searching genomes to identify related multi-domain architectures. In addition to providing domain family annotations, we have now developed a pipeline for 3D homology modelling of domains in Gene3D. This has been applied to the human genome and will be rolled out to other major organisms over the next year.

  18. Sequence and annotation of the apicoplast genome of the human pathogen Babesia microti.

    PubMed

    Garg, Aprajita; Stein, Anna; Zhao, William; Dwivedi, Ankit; Frutos, Roger; Cornillot, Emmanuel; Ben Mamoun, Choukri

    2014-01-01

    The apicomplexan intraerythrocytic parasite Babesia microti is an emerging human pathogen and the primary cause of human babesiosis, a malaria-like illness endemic in the United States. The pathogen is transmitted to humans by the tick vector, Ixodes scapularis, and by transfusion of blood from asymptomatic B. microti-infected donors. Whereas the nuclear and mitochondrial genomes of this parasite have been sequenced, assembled and annotated, its apicoplast genome remained incomplete, mainly due to its low representation and high A+T content. Here we report the complete sequence and annotation of the apicoplast genome of the B. microti R1 isolate. The genome consists of a 28.7 kb circular molecule encoding primarily functions important for maintenance of the apicoplast DNA, transcription, translation and maturation of organellar proteins. Genome analysis and annotation revealed a unique gene structure and organization of the B. microti apicoplast genome and suggest that all metabolic and non-housekeeping functions in this organelle are nuclear-encoded. B. microti apicoplast functions are significantly different from those of the host, suggesting that they might be useful as targets for development of potent and safe therapies for the treatment of human babesiosis.

  19. Comparative mapping and genomic annotation of the bovine oncosuppressor gene WWOX.

    PubMed

    Manera, S; Bonfiglio, S; Malusà, A; Denis, C; Boussaha, M; Russo, V; Roperto, F; Perucatti, A; Di Meo, G P; Eggen, A; Ferretti, L

    2009-01-01

    WWOX (WW domain-containing oxidoreductase) is the gene mapping at FRA16D HSA16q23.1, the second most active common fragile site in the human genome. In this study we characterized at a detailed molecular level WWOX in the bovine genome. First, we sequenced cDNA from various tissues and obtained evidence in support of a 9-exon structure for the gene, similar to the human gene. Then, we recovered BACs using exon tags and annotated the gene to a >1-Mb genomic region of BTA18 using the Btau 4.0 genome assembly as a reference, thus resolving an issue related to exon 9, which is not included in the genomic annotation of the gene in the Entrez database. Finally, BACs spanning WWOX were used as FISH probes to obtain comparative mapping of the gene in Bos taurus, Bubalus bubalis, Ovis aries and Capra hircus to BTA18q12.1, BBU18q13, OAR14q12.1 and CHI18q12.1, respectively. Our data show that the chromosomal location of WWOX is conserved between man and 4 major domesticated species. Moreover, the annotation of the bovine gene also suggests a highly conserved genomic arrangement, including number and size of introns.

  20. Resources for Biological Annotation of the Drosophila Genome

    SciTech Connect

    Gerald M. Rubin

    2005-08-08

    This project supported seed money for the development of cDNA and genetic resources to support studies of the Drosophila melanogaster genome. Key publications supported by this work that provide additional detail: (1) ''The Drosophila gene collection: identification of putative full-length cDNAs for 70% of D. melanogaster genes''; and (2) ''The Berkeley Drosophila Genome Project gene disruption project: Single P-element insertions mutating 25% of vital Drosophila genes''.

  1. Genome sequencing, annotation of Citrobacter freundii strain GTC 09479.

    PubMed

    Kimura, Kazuyuki; Kumar, Shailesh; Takeo, Masahiro; Mayilraj, Shanmugam

    2014-12-01

    We report the 4.9-Mb genome sequence of Citrobacter freundii strain GTC 09479, isolated from urine sample collected during the year 1983 at Gifu University Graduate School of Medicine, Japan. This draft genome consist of 4,899,578 bp with 51.62% G + C, 4,574 predicted CDSs, 72 tRNAs and 10 rRNAs.

  2. Rice Annotation Project Database (RAP-DB): an integrative and interactive database for rice genomics.

    PubMed

    Sakai, Hiroaki; Lee, Sung Shin; Tanaka, Tsuyoshi; Numa, Hisataka; Kim, Jungsok; Kawahara, Yoshihiro; Wakimoto, Hironobu; Yang, Ching-chia; Iwamoto, Masao; Abe, Takashi; Yamada, Yuko; Muto, Akira; Inokuchi, Hachiro; Ikemura, Toshimichi; Matsumoto, Takashi; Sasaki, Takuji; Itoh, Takeshi

    2013-02-01

    The Rice Annotation Project Database (RAP-DB, http://rapdb.dna.affrc.go.jp/) has been providing a comprehensive set of gene annotations for the genome sequence of rice, Oryza sativa (japonica group) cv. Nipponbare. Since the first release in 2005, RAP-DB has been updated several times along with the genome assembly updates. Here, we present our newest RAP-DB based on the latest genome assembly, Os-Nipponbare-Reference-IRGSP-1.0 (IRGSP-1.0), which was released in 2011. We detected 37,869 loci by mapping transcript and protein sequences of 150 monocot species. To provide plant researchers with highly reliable and up to date rice gene annotations, we have been incorporating literature-based manually curated data, and 1,626 loci currently incorporate literature-based annotation data, including commonly used gene names or gene symbols. Transcriptional activities are shown at the nucleotide level by mapping RNA-Seq reads derived from 27 samples. We also mapped the Illumina reads of a Japanese leading japonica cultivar, Koshihikari, and a Chinese indica cultivar, Guangluai-4, to the genome and show alignments together with the single nucleotide polymorphisms (SNPs) and gene functional annotations through a newly developed browser, Short-Read Assembly Browser (S-RAB). We have developed two satellite databases, Plant Gene Family Database (PGFD) and Integrative Database of Cereal Gene Phylogeny (IDCGP), which display gene family and homologous gene relationships among diverse plant species. RAP-DB and the satellite databases offer simple and user-friendly web interfaces, enabling plant and genome researchers to access the data easily and facilitating a broad range of plant research topics.

  3. Mercator: a fast and simple web server for genome scale functional annotation of plant sequence data.

    PubMed

    Lohse, Marc; Nagel, Axel; Herter, Thomas; May, Patrick; Schroda, Michael; Zrenner, Rita; Tohge, Takayuki; Fernie, Alisdair R; Stitt, Mark; Usadel, Björn

    2014-05-01

    Next-generation technologies generate an overwhelming amount of gene sequence data. Efficient annotation tools are required to make these data amenable to functional genomics analyses. The Mercator pipeline automatically assigns functional terms to protein or nucleotide sequences. It uses the MapMan 'BIN' ontology, which is tailored for functional annotation of plant 'omics' data. The classification procedure performs parallel sequence searches against reference databases, compiles the results and computes the most likely MapMan BINs for each query. In the current version, the pipeline relies on manually curated reference classifications originating from the three reference organisms (Arabidopsis, Chlamydomonas, rice), various other plant species that have a reviewed SwissProt annotation, and more than 2000 protein domain and family profiles at InterPro, CDD and KOG. Functional annotations predicted by Mercator achieve accuracies above 90% when benchmarked against manual annotation. In addition to mapping files for direct use in the visualization software MapMan, Mercator provides graphical overview charts, detailed annotation information in a convenient web browser interface and a MapMan-to-GO translation table to export results as GO terms. Mercator is available free of charge via http://mapman.gabipd.org/web/guest/app/Mercator.

  4. Draft genome assembly and annotation of Glycyrrhiza uralensis, a medicinal legume.

    PubMed

    Mochida, Keiichi; Sakurai, Tetsuya; Seki, Hikaru; Yoshida, Takuhiro; Takahagi, Kotaro; Sawai, Satoru; Uchiyama, Hiroshi; Muranaka, Toshiya; Saito, Kazuki

    2017-01-01

    Chinese liquorice/licorice (Glycyrrhiza uralensis) is a leguminous plant species whose roots and rhizomes have been widely used as a herbal medicine and natural sweetener. Whole-genome sequencing is essential for gene discovery studies and molecular breeding in liquorice. Here, we report a draft assembly of the approximately 379-Mb whole-genome sequence of strain 308-19 of G. uralensis; this assembly contains 34 445 predicted protein-coding genes. Comparative analyses suggested well-conserved genomic components and collinearity of gene loci (synteny) between the genome of liquorice and those of other legumes such as Medicago and chickpea. We observed that three genes involved in isoflavonoid biosynthesis, namely, 2-hydroxyisoflavanone synthase (CYP93C), 2,7,4'-trihydroxyisoflavanone 4'-O-methyltransferase/isoflavone 4'-O-methyltransferase (HI4OMT) and isoflavone-7-O-methyltransferase (7-IOMT) formed a cluster on the scaffold of the liquorice genome and showed conserved microsynteny with Medicago and chickpea. Based on the liquorice genome annotation, we predicted genes in the P450 and UDP-dependent glycosyltransferase (UGT) superfamilies, some of which are involved in triterpenoid saponin biosynthesis, and characterised their gene expression with the reference genome sequence. The genome sequencing and its annotations provide an essential resource for liquorice improvement through molecular breeding and the discovery of useful genes for engineering bioactive components through synthetic biology approaches.

  5. High-throughput comparison, functional annotation, and metabolic modeling of plant genomes using the PlantSEED resource.

    PubMed

    Seaver, Samuel M D; Gerdes, Svetlana; Frelin, Océane; Lerma-Ortiz, Claudia; Bradbury, Louis M T; Zallot, Rémi; Hasnain, Ghulam; Niehaus, Thomas D; El Yacoubi, Basma; Pasternak, Shiran; Olson, Robert; Pusch, Gordon; Overbeek, Ross; Stevens, Rick; de Crécy-Lagard, Valérie; Ware, Doreen; Hanson, Andrew D; Henry, Christopher S

    2014-07-01

    The increasing number of sequenced plant genomes is placing new demands on the methods applied to analyze, annotate, and model these genomes. Today's annotation pipelines result in inconsistent gene assignments that complicate comparative analyses and prevent efficient construction of metabolic models. To overcome these problems, we have developed the PlantSEED, an integrated, metabolism-centric database to support subsystems-based annotation and metabolic model reconstruction for plant genomes. PlantSEED combines SEED subsystems technology, first developed for microbial genomes, with refined protein families and biochemical data to assign fully consistent functional annotations to orthologous genes, particularly those encoding primary metabolic pathways. Seamless integration with its parent, the prokaryotic SEED database, makes PlantSEED a unique environment for cross-kingdom comparative analysis of plant and bacterial genomes. The consistent annotations imposed by PlantSEED permit rapid reconstruction and modeling of primary metabolism for all plant genomes in the database. This feature opens the unique possibility of model-based assessment of the completeness and accuracy of gene annotation and thus allows computational identification of genes and pathways that are restricted to certain genomes or need better curation. We demonstrate the PlantSEED system by producing consistent annotations for 10 reference genomes. We also produce a functioning metabolic model for each genome, gapfilling to identify missing annotations and proposing gene candidates for missing annotations. Models are built around an extended biomass composition representing the most comprehensive published to date. To our knowledge, our models are the first to be published for seven of the genomes analyzed.

  6. Citrus sinensis annotation project (CAP): a comprehensive database for sweet orange genome.

    PubMed

    Wang, Jia; Chen, Dijun; Lei, Yang; Chang, Ji-Wei; Hao, Bao-Hai; Xing, Feng; Li, Sen; Xu, Qiang; Deng, Xiu-Xin; Chen, Ling-Ling

    2014-01-01

    Citrus is one of the most important and widely grown fruit crop with global production ranking firstly among all the fruit crops in the world. Sweet orange accounts for more than half of the Citrus production both in fresh fruit and processed juice. We have sequenced the draft genome of a double-haploid sweet orange (C. sinensis cv. Valencia), and constructed the Citrus sinensis annotation project (CAP) to store and visualize the sequenced genomic and transcriptome data. CAP provides GBrowse-based organization of sweet orange genomic data, which integrates ab initio gene prediction, EST, RNA-seq and RNA-paired end tag (RNA-PET) evidence-based gene annotation. Furthermore, we provide a user-friendly web interface to show the predicted protein-protein interactions (PPIs) and metabolic pathways in sweet orange. CAP provides comprehensive information beneficial to the researchers of sweet orange and other woody plants, which is freely available at http://citrus.hzau.edu.cn/.

  7. BambooGDB: a bamboo genome database with functional annotation and an analysis platform

    PubMed Central

    Zhao, Hansheng; Peng, Zhenhua; Fei, Benhua; Li, Lubin; Hu, Tao; Gao, Zhimin; Jiang, Zehui

    2014-01-01

    Bamboo, as one of the most important non-timber forest products and fastest-growing plants in the world, represents the only major lineage of grasses that is native to forests. Recent success on the first high-quality draft genome sequence of moso bamboo (Phyllostachys edulis) provides new insights on bamboo genetics and evolution. To further extend our understanding on bamboo genome and facilitate future studies on the basis of previous achievements, here we have developed BambooGDB, a bamboo genome database with functional annotation and analysis platform. The de novo sequencing data, together with the full-length complementary DNA and RNA-seq data of moso bamboo composed the main contents of this database. Based on these sequence data, a comprehensively functional annotation for bamboo genome was made. Besides, an analytical platform composed of comparative genomic analysis, protein–protein interactions network, pathway analysis and visualization of genomic data was also constructed. As discovery tools to understand and identify biological mechanisms of bamboo, the platform can be used as a systematic framework for helping and designing experiments for further validation. Moreover, diverse and powerful search tools and a convenient browser were incorporated to facilitate the navigation of these data. As far as we know, this is the first genome database for bamboo. Through integrating high-throughput sequencing data, a full functional annotation and several analysis modules, BambooGDB aims to provide worldwide researchers with a central genomic resource and an extensible analysis platform for bamboo genome. BambooGDB is freely available at http://www.bamboogdb.org/. Database URL: http://www.bamboogdb.org PMID:24602877

  8. BambooGDB: a bamboo genome database with functional annotation and an analysis platform.

    PubMed

    Zhao, Hansheng; Peng, Zhenhua; Fei, Benhua; Li, Lubin; Hu, Tao; Gao, Zhimin; Jiang, Zehui

    2014-01-01

    Bamboo, as one of the most important non-timber forest products and fastest-growing plants in the world, represents the only major lineage of grasses that is native to forests. Recent success on the first high-quality draft genome sequence of moso bamboo (Phyllostachys edulis) provides new insights on bamboo genetics and evolution. To further extend our understanding on bamboo genome and facilitate future studies on the basis of previous achievements, here we have developed BambooGDB, a bamboo genome database with functional annotation and analysis platform. The de novo sequencing data, together with the full-length complementary DNA and RNA-seq data of moso bamboo composed the main contents of this database. Based on these sequence data, a comprehensively functional annotation for bamboo genome was made. Besides, an analytical platform composed of comparative genomic analysis, protein-protein interactions network, pathway analysis and visualization of genomic data was also constructed. As discovery tools to understand and identify biological mechanisms of bamboo, the platform can be used as a systematic framework for helping and designing experiments for further validation. Moreover, diverse and powerful search tools and a convenient browser were incorporated to facilitate the navigation of these data. As far as we know, this is the first genome database for bamboo. Through integrating high-throughput sequencing data, a full functional annotation and several analysis modules, BambooGDB aims to provide worldwide researchers with a central genomic resource and an extensible analysis platform for bamboo genome. BambooGDB is freely available at http://www.bamboogdb.org/. Database URL: http://www.bamboogdb.org.

  9. Disentangling the Effects of Colocalizing Genomic Annotations to Functionally Prioritize Non-coding Variants within Complex-Trait Loci.

    PubMed

    Trynka, Gosia; Westra, Harm-Jan; Slowikowski, Kamil; Hu, Xinli; Xu, Han; Stranger, Barbara E; Klein, Robert J; Han, Buhm; Raychaudhuri, Soumya

    2015-07-02

    Identifying genomic annotations that differentiate causal from trait-associated variants is essential to fine mapping disease loci. Although many studies have identified non-coding functional annotations that overlap disease-associated variants, these annotations often colocalize, complicating the ability to use these annotations for fine mapping causal variation. We developed a statistical approach (Genomic Annotation Shifter [GoShifter]) to assess whether enriched annotations are able to prioritize causal variation. GoShifter defines the null distribution of an annotation overlapping an allele by locally shifting annotations; this approach is less sensitive to biases arising from local genomic structure than commonly used enrichment methods that depend on SNP matching. Local shifting also allows GoShifter to identify independent causal effects from colocalizing annotations. Using GoShifter, we confirmed that variants in expression quantitative trail loci drive gene-expression changes though DNase-I hypersensitive sites (DHSs) near transcription start sites and independently through 3' UTR regulation. We also showed that (1) 15%-36% of trait-associated loci map to DHSs independently of other annotations; (2) loci associated with breast cancer and rheumatoid arthritis harbor potentially causal variants near the summits of histone marks rather than full peak bodies; (3) variants associated with height are highly enriched in embryonic stem cell DHSs; and (4) we can effectively prioritize causal variation at specific loci.

  10. An Updated genome annotation for the model marine bacterium Ruegeria pomeroyi DSS-3

    PubMed Central

    2014-01-01

    When the genome of Ruegeria pomeroyi DSS-3 was published in 2004, it represented the first sequence from a heterotrophic marine bacterium. Over the last ten years, the strain has become a valuable model for understanding the cycling of sulfur and carbon in the ocean. To ensure that this genome remains useful, we have updated 69 genes to incorporate functional annotations based on new experimental data, and improved the identification of 120 protein-coding regions based on proteomic and transcriptomic data. We review the progress made in understanding the biology of R. pomeroyi DSS-3 and list the changes made to the genome. PMID:25780504

  11. Evolutionary Interrogation of Human Biology in Well-Annotated Genomic Framework of Rhesus Macaque

    PubMed Central

    Zhang, Shi-Jian; Liu, Chu-Jun; Yu, Peng; Zhong, Xiaoming; Chen, Jia-Yu; Yang, Xinzhuang; Peng, Jiguang; Yan, Shouyu; Wang, Chenqu; Zhu, Xiaotong; Xiong, Jingwei; Zhang, Yong E.; Tan, Bertrand Chin-Ming; Li, Chuan-Yun

    2014-01-01

    With genome sequence and composition highly analogous to human, rhesus macaque represents a unique reference for evolutionary studies of human biology. Here, we developed a comprehensive genomic framework of rhesus macaque, the RhesusBase2, for evolutionary interrogation of human genes and the associated regulations. A total of 1,667 next-generation sequencing (NGS) data sets were processed, integrated, and evaluated, generating 51.2 million new functional annotation records. With extensive NGS annotations, RhesusBase2 refined the fine-scale structures in 30% of the macaque Ensembl transcripts, reporting an accurate, up-to-date set of macaque gene models. On the basis of these annotations and accurate macaque gene models, we further developed an NGS-oriented Molecular Evolution Gateway to access and visualize macaque annotations in reference to human orthologous genes and associated regulations (www.rhesusbase.org/molEvo). We highlighted the application of this well-annotated genomic framework in generating hypothetical link of human-biased regulations to human-specific traits, by using mechanistic characterization of the DIEXF gene as an example that provides novel clues to the understanding of digestive system reduction in human evolution. On a global scale, we also identified a catalog of 9,295 human-biased regulatory events, which may represent novel elements that have a substantial impact on shaping human transcriptome and possibly underpin recent human phenotypic evolution. Taken together, we provide an NGS data-driven, information-rich framework that will broadly benefit genomics research in general and serves as an important resource for in-depth evolutionary studies of human biology. PMID:24577841

  12. Evolutionary interrogation of human biology in well-annotated genomic framework of rhesus macaque.

    PubMed

    Zhang, Shi-Jian; Liu, Chu-Jun; Yu, Peng; Zhong, Xiaoming; Chen, Jia-Yu; Yang, Xinzhuang; Peng, Jiguang; Yan, Shouyu; Wang, Chenqu; Zhu, Xiaotong; Xiong, Jingwei; Zhang, Yong E; Tan, Bertrand Chin-Ming; Li, Chuan-Yun

    2014-05-01

    With genome sequence and composition highly analogous to human, rhesus macaque represents a unique reference for evolutionary studies of human biology. Here, we developed a comprehensive genomic framework of rhesus macaque, the RhesusBase2, for evolutionary interrogation of human genes and the associated regulations. A total of 1,667 next-generation sequencing (NGS) data sets were processed, integrated, and evaluated, generating 51.2 million new functional annotation records. With extensive NGS annotations, RhesusBase2 refined the fine-scale structures in 30% of the macaque Ensembl transcripts, reporting an accurate, up-to-date set of macaque gene models. On the basis of these annotations and accurate macaque gene models, we further developed an NGS-oriented Molecular Evolution Gateway to access and visualize macaque annotations in reference to human orthologous genes and associated regulations (www.rhesusbase.org/molEvo). We highlighted the application of this well-annotated genomic framework in generating hypothetical link of human-biased regulations to human-specific traits, by using mechanistic characterization of the DIEXF gene as an example that provides novel clues to the understanding of digestive system reduction in human evolution. On a global scale, we also identified a catalog of 9,295 human-biased regulatory events, which may represent novel elements that have a substantial impact on shaping human transcriptome and possibly underpin recent human phenotypic evolution. Taken together, we provide an NGS data-driven, information-rich framework that will broadly benefit genomics research in general and serves as an important resource for in-depth evolutionary studies of human biology.

  13. Curation of the genome annotation of Pichia pastoris (Komagataella phaffii) CBS7435 from gene level to protein function.

    PubMed

    Valli, Minoska; Tatto, Nadine E; Peymann, Armin; Gruber, Clemens; Landes, Nils; Ekker, Heinz; Thallinger, Gerhard G; Mattanovich, Diethard; Gasser, Brigitte; Graf, Alexandra B

    2016-09-01

    As manually curated and non-automated BLAST analysis of the published Pichia pastoris genome sequences revealed many differences between the gene annotations of the strains GS115 and CBS7435, RNA-Seq analysis, supported by proteomics, was performed to improve the genome annotation. Detailed analysis of sequence alignment and protein domain predictions were made to extend the functional genome annotation to all P. pastoris sequences. This allowed the identification of 492 new ORFs, 4916 hypothetical UTRs and the correction of 341 incorrect ORF predictions, which were mainly due to the presence of upstream ATG or erroneous intron predictions. Moreover, 175 previously erroneously annotated ORFs need to be removed from the annotation. In total, we have annotated 5325 ORFs. Regarding the functionality of those genes, we improved all gene and protein descriptions. Thereby, the percentage of ORFs with functional annotation was increased from 48% to 73%. Furthermore, we defined functional groups, covering 25 biological cellular processes of interest, by grouping all genes that are part of the defined process. All data are presented in the newly launched genome browser and database available at www.pichiagenome.org In summary, we present a wide spectrum of curation of the P. pastoris genome annotation from gene level to protein function.

  14. Assigning biological functions to rice genes by genome annotation, expression analysis and mutagenesis.

    PubMed

    Jiang, Shu-Ye; Ramachandran, Srinivasan

    2010-12-01

    Rice is the first cereal genome to be completely sequenced. Since the completion of its genome sequencing, considerable progress has been made in multiple areas including the whole genome annotation, gene expression profiling, mutant collection, etc. Here, we summarize the current status of rice genome annotation and review the methodology of assigning biological functions to hundreds of thousands of rice genes as well as discuss the major limitations and the future perspective in rice functional genomics. Available data analysis shows that the rice genome encodes around 32,000 protein-coding genes. Expression analysis revealed at least 31,000 genes with expression evidence from full-length cDNA/EST collection or other transcript profiling. In addition, we have summarized various strategies to generate mutant population including natural, physical, chemical, T-DNA, transposon/retrotransposon or gene silencing based mutagenesis. Currently, more than 1 million of mutants have been generated and 27,551 of them have their flanking sequence tags. To assign biological functions to hundreds of thousands of rice genes, global co-operations are required, various genetic resources should be more easily accessible and diverse data from transcriptomics, proteomics, epigenetics, comparative genomics and bioinformatics should be integrated to better understand the functions of these genes and their regulatory mechanisms.

  15. The Drosophila melanogaster PeptideAtlas facilitates the use of peptide data for improved fly proteomics and genome annotation

    PubMed Central

    Loevenich, Sandra N; Brunner, Erich; King, Nichole L; Deutsch, Eric W; Stein, Stephen E; Aebersold, Ruedi; Hafen, Ernst

    2009-01-01

    Background Crucial foundations of any quantitative systems biology experiment are correct genome and proteome annotations. Protein databases compiled from high quality empirical protein identifications that are in turn based on correct gene models increase the correctness, sensitivity, and quantitative accuracy of systems biology genome-scale experiments. Results In this manuscript, we present the Drosophila melanogaster PeptideAtlas, a fly proteomics and genomics resource of unsurpassed depth. Based on peptide mass spectrometry data collected in our laboratory the portal allows querying fly protein data observed with respect to gene model confirmation and splice site verification as well as for the identification of proteotypic peptides suited for targeted proteomics studies. Additionally, the database provides consensus mass spectra for observed peptides along with qualitative and quantitative information about the number of observations of a particular peptide and the sample(s) in which it was observed. Conclusion PeptideAtlas is an open access database for the Drosophila community that has several features and applications that support (1) reduction of the complexity inherently associated with performing targeted proteomic studies, (2) designing and accelerating shotgun proteomics experiments, (3) confirming or questioning gene models, and (4) adjusting gene models such that they are in line with observed Drosophila peptides. While the database consists of proteomic data it is not required that the user is a proteomics expert. PMID:19210778

  16. Whole genome profiling physical map and ancestral annotation of tobacco Hicks Broadleaf.

    PubMed

    Sierro, Nicolas; van Oeveren, Jan; van Eijk, Michiel J T; Martin, Florian; Stormo, Keith E; Peitsch, Manuel C; Ivanov, Nikolai V

    2013-09-01

    Genomics-based breeding of economically important crops such as banana, coffee, cotton, potato, tobacco and wheat is often hampered by genome size, polyploidy and high repeat content. We adapted sequence-based whole-genome profiling (WGP™) technology to obtain insight into the polyploidy of the model plant Nicotiana tabacum (tobacco). N. tabacum is assumed to originate from a hybridization event between ancestors of Nicotiana sylvestris and Nicotiana tomentosiformis approximately 200,000 years ago. This resulted in tobacco having a haploid genome size of 4500 million base pairs, approximately four times larger than the related tomato (Solanum lycopersicum) and potato (Solanum tuberosum) genomes. In this study, a physical map containing 9750 contigs of bacterial artificial chromosomes (BACs) was constructed. The mean contig size was 462 kbp, and the calculated genome coverage equaled the estimated tobacco genome size. We used a method for determination of the ancestral origin of the genome by annotation of WGP sequence tags. This assignment agreed with the ancestral annotation available from the tobacco genetic map, and may be used to investigate the evolution of homoeologous genome segments after polyploidization. The map generated is an essential scaffold for the tobacco genome. We propose the combination of WGP physical mapping technology and tag profiling of ancestral lines as a generally applicable method to elucidate the ancestral origin of genome segments of polyploid species. The physical mapping of genes and their origins will enable application of biotechnology to polyploid plants aimed at accelerating and increasing the precision of breeding for abiotic and biotic stress resistance.

  17. Polymorphism identification and improved genome annotation of Brassica rapa through Deep RNA sequencing.

    PubMed

    Devisetty, Upendra Kumar; Covington, Michael F; Tat, An V; Lekkala, Saradadevi; Maloof, Julin N

    2014-08-12

    The mapping and functional analysis of quantitative traits in Brassica rapa can be greatly improved with the availability of physically positioned, gene-based genetic markers and accurate genome annotation. In this study, deep transcriptome RNA sequencing (RNA-Seq) of Brassica rapa was undertaken with two objectives: SNP detection and improved transcriptome annotation. We performed SNP detection on two varieties that are parents of a mapping population to aid in development of a marker system for this population and subsequent development of high-resolution genetic map. An improved Brassica rapa transcriptome was constructed to detect novel transcripts and to improve the current genome annotation. This is useful for accurate mRNA abundance and detection of expression QTL (eQTLs) in mapping populations. Deep RNA-Seq of two Brassica rapa genotypes-R500 (var. trilocularis, Yellow Sarson) and IMB211 (a rapid cycling variety)-using eight different tissues (root, internode, leaf, petiole, apical meristem, floral meristem, silique, and seedling) grown across three different environments (growth chamber, greenhouse and field) and under two different treatments (simulated sun and simulated shade) generated 2.3 billion high-quality Illumina reads. A total of 330,995 SNPs were identified in transcribed regions between the two genotypes with an average frequency of one SNP in every 200 bases. The deep RNA-Seq reassembled Brassica rapa transcriptome identified 44,239 protein-coding genes. Compared with current gene models of B. rapa, we detected 3537 novel transcripts, 23,754 gene models had structural modifications, and 3655 annotated proteins changed. Gaps in the current genome assembly of B. rapa are highlighted by our identification of 780 unmapped transcripts. All the SNPs, annotations, and predicted transcripts can be viewed at http://phytonetworks.ucdavis.edu/.

  18. Functional annotation from the genome sequence of the giant panda.

    PubMed

    Huo, Tong; Zhang, Yinjie; Lin, Jianping

    2012-08-01

    The giant panda is one of the most critically endangered species due to the fragmentation and loss of its habitat. Studying the functions of proteins in this animal, especially specific trait-related proteins, is therefore necessary to protect the species. In this work, the functions of these proteins were investigated using the genome sequence of the giant panda. Data on 21,001 proteins and their functions were stored in the Giant Panda Protein Database, in which the proteins were divided into two groups: 20,179 proteins whose functions can be predicted by GeneScan formed the known-function group, whereas 822 proteins whose functions cannot be predicted by GeneScan comprised the unknown-function group. For the known-function group, we further classified the proteins by molecular function, biological process, cellular component, and tissue specificity. For the unknown-function group, we developed a strategy in which the proteins were filtered by cross-Blast to identify panda-specific proteins under the assumption that proteins related to the panda-specific traits in the unknown-function group exist. After this filtering procedure, we identified 32 proteins (2 of which are membrane proteins) specific to the giant panda genome as compared against the dog and horse genomes. Based on their amino acid sequences, these 32 proteins were further analyzed by functional classification using SVM-Prot, motif prediction using MyHits, and interacting protein prediction using the Database of Interacting Proteins. Nineteen proteins were predicted to be zinc-binding proteins, thus affecting the activities of nucleic acids. The 32 panda-specific proteins will be further investigated by structural and functional analysis.

  19. X:Map: annotation and visualization of genome structure for Affymetrix exon array analysis

    PubMed Central

    Yates, Tim; Okoniewski, Michał J.; Miller, Crispin J.

    2008-01-01

    Affymetrix exon arrays aim to target every known and predicted exon in the human, mouse or rat genomes, and have reporters that extend beyond protein coding regions to other areas of the transcribed genome. This combination of increased coverage and precision is important because a substantial proportion of protein coding genes are predicted to be alternatively spliced, and because many non-coding genes are known also to be of biological significance. In order to fully exploit these arrays, it is necessary to associate each reporter on the array with the features of the genome it is targeting, and to relate these to gene and genome structure. X:Map is a genome annotation database that provides this information. Data can be browsed using a novel Google-maps based interface, and analysed and further visualized through an associated BioConductor package. The database can be found at http://xmap.picr.man.ac.uk. PMID:17932061

  20. Identification of novel biomass-degrading enzymes from genomic dark matter: Populating genomic sequence space with functional annotation.

    PubMed

    Piao, Hailan; Froula, Jeff; Du, Changbin; Kim, Tae-Wan; Hawley, Erik R; Bauer, Stefan; Wang, Zhong; Ivanova, Nathalia; Clark, Douglas S; Klenk, Hans-Peter; Hess, Matthias

    2014-08-01

    Although recent nucleotide sequencing technologies have significantly enhanced our understanding of microbial genomes, the function of ∼35% of genes identified in a genome currently remains unknown. To improve the understanding of microbial genomes and consequently of microbial processes it will be crucial to assign a function to this "genomic dark matter." Due to the urgent need for additional carbohydrate-active enzymes for improved production of transportation fuels from lignocellulosic biomass, we screened the genomes of more than 5,500 microorganisms for hypothetical proteins that are located in the proximity of already known cellulases. We identified, synthesized and expressed a total of 17 putative cellulase genes with insufficient sequence similarity to currently known cellulases to be identified as such using traditional sequence annotation techniques that rely on significant sequence similarity. The recombinant proteins of the newly identified putative cellulases were subjected to enzymatic activity assays to verify their hydrolytic activity towards cellulose and lignocellulosic biomass. Eleven (65%) of the tested enzymes had significant activity towards at least one of the substrates. This high success rate highlights that a gene context-based approach can be used to assign function to genes that are otherwise categorized as "genomic dark matter" and to identify biomass-degrading enzymes that have little sequence similarity to already known cellulases. The ability to assign function to genes that have no related sequence representatives with functional annotation will be important to enhance our understanding of microbial processes and to identify microbial proteins for a wide range of applications.

  1. Genome assembly has a major impact on gene content: a comparison of annotation in two Bos taurus assemblies.

    PubMed

    Florea, Liliana; Souvorov, Alexander; Kalbfleisch, Theodore S; Salzberg, Steven L

    2011-01-01

    Gene and SNP annotation are among the first and most important steps in analyzing a genome. As the number of sequenced genomes continues to grow, a key question is: how does the quality of the assembled sequence affect the annotations? We compared the gene and SNP annotations for two different Bos taurus genome assemblies built from the same data but with significant improvements in the later assembly. The same annotation software was used for annotating both sequences. While some annotation differences are expected even between high-quality assemblies such as these, we found that a staggering 40% of the genes (>9,500) varied significantly between assemblies, due in part to the availability of new gene evidence but primarily to genome mis-assembly events and local sequence variations. For instance, although the later assembly is generally superior, 660 protein coding genes in the earlier assembly are entirely missing from the later genome's annotation, and approximately 3,600 (15%) of the genes have complex structural differences between the two assemblies. In addition, 12-20% of the predicted proteins in both assemblies have relatively large sequence differences when compared to their RefSeq models, and 6-15% of bovine dbSNP records are unrecoverable in the two assemblies. Our findings highlight the consequences of genome assembly quality on gene and SNP annotation and argue for continued improvements in any draft genome sequence. We also found that tracking a gene between different assemblies of the same genome is surprisingly difficult, due to the numerous changes, both small and large, that occur in some genes. As a side benefit, our analyses helped us identify many specific loci for improvement in the Bos taurus genome assembly.

  2. Finding Important Terms for Patients in Their Electronic Health Records: A Learning-to-Rank Approach Using Expert Annotations

    PubMed Central

    Zheng, Jiaping; Yu, Hong

    2016-01-01

    Background Many health organizations allow patients to access their own electronic health record (EHR) notes through online patient portals as a way to enhance patient-centered care. However, EHR notes are typically long and contain abundant medical jargon that can be difficult for patients to understand. In addition, many medical terms in patients’ notes are not directly related to their health care needs. One way to help patients better comprehend their own notes is to reduce information overload and help them focus on medical terms that matter most to them. Interventions can then be developed by giving them targeted education to improve their EHR comprehension and the quality of care. Objective We aimed to develop a supervised natural language processing (NLP) system called Finding impOrtant medical Concepts most Useful to patientS (FOCUS) that automatically identifies and ranks medical terms in EHR notes based on their importance to the patients. Methods First, we built an expert-annotated corpus. For each EHR note, 2 physicians independently identified medical terms important to the patient. Using the physicians’ agreement as the gold standard, we developed and evaluated FOCUS. FOCUS first identifies candidate terms from each EHR note using MetaMap and then ranks the terms using a support vector machine-based learn-to-rank algorithm. We explored rich learning features, including distributed word representation, Unified Medical Language System semantic type, topic features, and features derived from consumer health vocabulary. We compared FOCUS with 2 strong baseline NLP systems. Results Physicians annotated 90 EHR notes and identified a mean of 9 (SD 5) important terms per note. The Cohen’s kappa annotation agreement was .51. The 10-fold cross-validation results show that FOCUS achieved an area under the receiver operating characteristic curve (AUC-ROC) of 0.940 for ranking candidate terms from EHR notes to identify important terms. When including term

  3. Proteogenomics produces comprehensive and highly accurate protein-coding gene annotation in a complete genome assembly of Malassezia sympodialis.

    PubMed

    Zhu, Yafeng; Engström, Pär G; Tellgren-Roth, Christian; Baudo, Charles D; Kennell, John C; Sun, Sheng; Billmyre, R Blake; Schröder, Markus S; Andersson, Anna; Holm, Tina; Sigurgeirsson, Benjamin; Wu, Guangxi; Sankaranarayanan, Sundar Ram; Siddharthan, Rahul; Sanyal, Kaustuv; Lundeberg, Joakim; Nystedt, Björn; Boekhout, Teun; Dawson, Thomas L; Heitman, Joseph; Scheynius, Annika; Lehtiö, Janne

    2017-01-18

    Complete and accurate genome assembly and annotation is a crucial foundation for comparative and functional genomics. Despite this, few complete eukaryotic genomes are available, and genome annotation remains a major challenge. Here, we present a complete genome assembly of the skin commensal yeast Malassezia sympodialis and demonstrate how proteogenomics can substantially improve gene annotation. Through long-read DNA sequencing, we obtained a gap-free genome assembly for M. sympodialis (ATCC 42132), comprising eight nuclear and one mitochondrial chromosome. We also sequenced and assembled four M. sympodialis clinical isolates, and showed their value for understanding Malassezia reproduction by confirming four alternative allele combinations at the two mating-type loci. Importantly, we demonstrated how proteomics data could be readily integrated with transcriptomics data in standard annotation tools. This increased the number of annotated protein-coding genes by 14% (from 3612 to 4113), compared to using transcriptomics evidence alone. Manual curation further increased the number of protein-coding genes by 9% (to 4493). All of these genes have RNA-seq evidence and 87% were confirmed by proteomics. The M. sympodialis genome assembly and annotation presented here is at a quality yet achieved only for a few eukaryotic organisms, and constitutes an important reference for future host-microbe interaction studies.

  4. Functional Analysis of Frequent Lung Cancer Amplicons Under Current Genome Annotation

    DTIC Science & Technology

    2015-12-01

    Exemplary groups of such genes include many non-protein-coding genes like long non-coding RNAs and microRNAs (miRNAs). In study, we examined the PVT1...lung tumorigenesis. The implication is that future studies of the 8q24 amplicon should concentrate on characterizing the PVT1 RNA transcript. 15...SUBJECT TERMS Noncoding RNA , microRNA, lung cancer, Genome annotation 16. SECURITY CLASSIFICATION OF: U 17. LIMITATION OF ABSTRACT 18. NUMBER OF

  5. Functional phylogenomics analysis of bacteria and archaea using consistent genome annotation with UniFam

    SciTech Connect

    Chai, Juanjuan; Kora, Guruprasad; Ahn, Tae-Hyuk; Hyatt, Doug; Pan, Chongle

    2014-01-01

    To supply some background, phylogenetic studies have provided detailed knowledge on the evolutionary mechanisms of genes and species in Bacteria and Archaea. However, the evolution of cellular functions, represented by metabolic pathways and biological processes, has not been systematically characterized. Many clades in the prokaryotic tree of life have now been covered by sequenced genomes in GenBank. This enables a large-scale functional phylogenomics study of many computationally inferred cellular functions across all sequenced prokaryotes. Our results show a total of 14,727 GenBank prokaryotic genomes were re-annotated using a new protein family database, UniFam, to obtain consistent functional annotations for accurate comparison. The functional profile of a genome was represented by the biological process Gene Ontology (GO) terms in its annotation. The GO term enrichment analysis differentiated the functional profiles between selected archaeal taxa. 706 prokaryotic metabolic pathways were inferred from these genomes using Pathway Tools and MetaCyc. The consistency between the distribution of metabolic pathways in the genomes and the phylogenetic tree of the genomes was measured using parsimony scores and retention indices. The ancestral functional profiles at the internal nodes of the phylogenetic tree were reconstructed to track the gains and losses of metabolic pathways in evolutionary history. In conclusion, our functional phylogenomics analysis shows divergent functional profiles of taxa and clades. Such function-phylogeny correlation stems from a set of clade-specific cellular functions with low parsimony scores. On the other hand, many cellular functions are sparsely dispersed across many clades with high parsimony scores. These different types of cellular functions have distinct evolutionary patterns reconstructed from the prokaryotic tree.

  6. Functional phylogenomics analysis of bacteria and archaea using consistent genome annotation with UniFam

    DOE PAGES

    Chai, Juanjuan; Kora, Guruprasad; Ahn, Tae-Hyuk; ...

    2014-01-01

    To supply some background, phylogenetic studies have provided detailed knowledge on the evolutionary mechanisms of genes and species in Bacteria and Archaea. However, the evolution of cellular functions, represented by metabolic pathways and biological processes, has not been systematically characterized. Many clades in the prokaryotic tree of life have now been covered by sequenced genomes in GenBank. This enables a large-scale functional phylogenomics study of many computationally inferred cellular functions across all sequenced prokaryotes. Our results show a total of 14,727 GenBank prokaryotic genomes were re-annotated using a new protein family database, UniFam, to obtain consistent functional annotations for accuratemore » comparison. The functional profile of a genome was represented by the biological process Gene Ontology (GO) terms in its annotation. The GO term enrichment analysis differentiated the functional profiles between selected archaeal taxa. 706 prokaryotic metabolic pathways were inferred from these genomes using Pathway Tools and MetaCyc. The consistency between the distribution of metabolic pathways in the genomes and the phylogenetic tree of the genomes was measured using parsimony scores and retention indices. The ancestral functional profiles at the internal nodes of the phylogenetic tree were reconstructed to track the gains and losses of metabolic pathways in evolutionary history. In conclusion, our functional phylogenomics analysis shows divergent functional profiles of taxa and clades. Such function-phylogeny correlation stems from a set of clade-specific cellular functions with low parsimony scores. On the other hand, many cellular functions are sparsely dispersed across many clades with high parsimony scores. These different types of cellular functions have distinct evolutionary patterns reconstructed from the prokaryotic tree.« less

  7. NCBI Reference Sequences (RefSeq): current status, new features and genome annotation policy.

    PubMed

    Pruitt, Kim D; Tatusova, Tatiana; Brown, Garth R; Maglott, Donna R

    2012-01-01

    The National Center for Biotechnology Information (NCBI) Reference Sequence (RefSeq) database is a collection of genomic, transcript and protein sequence records. These records are selected and curated from public sequence archives and represent a significant reduction in redundancy compared to the volume of data archived by the International Nucleotide Sequence Database Collaboration. The database includes over 16,00 organisms, 2.4 × 0(6) genomic records, 13 × 10(6) proteins and 2 × 10(6) RNA records spanning prokaryotes, eukaryotes and viruses (RefSeq release 49, September 2011). The RefSeq database is maintained by a combined approach of automated analyses, collaboration and manual curation to generate an up-to-date representation of the sequence, its features, names and cross-links to related sources of information. We report here on recent growth, the status of curating the human RefSeq data set, more extensive feature annotation and current policy for eukaryotic genome annotation via the NCBI annotation pipeline. More information about the resource is available online (see http://www.ncbi.nlm.nih.gov/RefSeq/).

  8. Automatic annotation of experimentally derived, evolutionarily conserved post-translational modifications onto multiple genomes.

    PubMed

    Sridhara, Viswanadham; Marchler-Bauer, Aron; Bryant, Stephen H; Geer, Lewis Y

    2011-01-01

    New generation sequencing technologies have resulted in significant increases in the number of complete genomes. Functional characterization of these genomes, such as by high-throughput proteomics, is an important but challenging task due to the difficulty of scaling up existing experimental techniques. By use of comparative genomics techniques, experimental results can be transferred from one genome to another, while at the same time minimizing errors by requiring discovery in multiple genomes. In this study, protein phosphorylation, an essential component of many cellular processes, is studied using data from large-scale proteomics analyses of the phosphoproteome. Phosphorylation sites from Homo sapiens, Mus musculus and Drosophila melanogaster phosphopeptide data sets were mapped onto conserved domains in NCBI's manually curated portion of Conserved Domain Database (CDD). In this subset, 25 phosphorylation sites are found to be evolutionarily conserved between the three species studied. Transfer of phosphorylation annotation of these conserved sites onto sequences sharing the same conserved domains yield 3253 phosphosite annotations for proteins from coelomata, the taxonomic division that spans H. sapiens, M. musculus and D. melanogaster. The method scales automatically, so as the amount of experimental phosphoproteomics data increases, more conserved phosphorylation sites may be revealed.

  9. A bioinformatics approach to reanalyze the genome annotation of kinetoplastid protozoan parasite Leishmania donovani.

    PubMed

    Pawar, Harsh; Kulkarni, Aditi; Dixit, Tanwi; Chaphekar, Deepa; Patole, Milind S

    2014-12-01

    Leishmania donovani is a kinetoplastid protozoan parasite which causes the fatal disease visceral leishmaniasis in humans. Genome sequencing of L. donovani revealed information about the arrangement of genes and genome architecture. After curation of the genome sequence, many genes in L. donovani were assigned as truncated or "partial" genes by the genome sequencing group. In the present study, we have carried out an extensive analysis and attempted to improve the gene models of these partial genes. Our analysis resulted in the identification of 308 partial genes in L. donovani, which were further categorized as C-terminal extensions, joining of genes, tandemly repeated paralogs and wrong chromosomal assignments. We have analyzed each of these genes from these categories and have improved the annotation of existing gene models in L. donovani. Some of these corrections have been confirmed by mass spectrometry derived peptide data from our previous comparative proteogenomics study in L. donovani.

  10. Genome-scale phylogenetic function annotation of large and diverse protein families.

    PubMed

    Engelhardt, Barbara E; Jordan, Michael I; Srouji, John R; Brenner, Steven E

    2011-11-01

    The Statistical Inference of Function Through Evolutionary Relationships (SIFTER) framework uses a statistical graphical model that applies phylogenetic principles to automate precise protein function prediction. Here we present a revised approach (SIFTER version 2.0) that enables annotations on a genomic scale. SIFTER 2.0 produces equivalently precise predictions compared to the earlier version on a carefully studied family and on a collection of 100 protein families. We have added an approximation method to SIFTER 2.0 and show a 500-fold improvement in speed with minimal impact on prediction results in the functionally diverse sulfotransferase protein family. On the Nudix protein family, previously inaccessible to the SIFTER framework because of the 66 possible molecular functions, SIFTER achieved 47.4% accuracy on experimental data (where BLAST achieved 34.0%). Finally, we used SIFTER to annotate all of the Schizosaccharomyces pombe proteins with experimental functional characterizations, based on annotations from proteins in 46 fungal genomes. SIFTER precisely predicted molecular function for 45.5% of the characterized proteins in this genome, as compared with four current function prediction methods that precisely predicted function for 62.6%, 30.6%, 6.0%, and 5.7% of these proteins. We use both precision-recall curves and ROC analyses to compare these genome-scale predictions across the different methods and to assess performance on different types of applications. SIFTER 2.0 is capable of predicting protein molecular function for large and functionally diverse protein families using an approximate statistical model, enabling phylogenetics-based protein function prediction for genome-wide analyses. The code for SIFTER and protein family data are available at http://sifter.berkeley.edu.

  11. Manual Gene Ontology annotation workflow at the Mouse Genome Informatics Database.

    PubMed

    Drabkin, Harold J; Blake, Judith A

    2012-01-01

    The Mouse Genome Database, the Gene Expression Database and the Mouse Tumor Biology database are integrated components of the Mouse Genome Informatics (MGI) resource (http://www.informatics.jax.org). The MGI system presents both a consensus view and an experimental view of the knowledge concerning the genetics and genomics of the laboratory mouse. From genotype to phenotype, this information resource integrates information about genes, sequences, maps, expression analyses, alleles, strains and mutant phenotypes. Comparative mammalian data are also presented particularly in regards to the use of the mouse as a model for the investigation of molecular and genetic components of human diseases. These data are collected from literature curation as well as downloads of large datasets (SwissProt, LocusLink, etc.). MGI is one of the founding members of the Gene Ontology (GO) and uses the GO for functional annotation of genes. Here, we discuss the workflow associated with manual GO annotation at MGI, from literature collection to display of the annotations. Peer-reviewed literature is collected mostly from a set of journals available electronically. Selected articles are entered into a master bibliography and indexed to one of eight areas of interest such as 'GO' or 'homology' or 'phenotype'. Each article is then either indexed to a gene already contained in the database or funneled through a separate nomenclature database to add genes. The master bibliography and associated indexing provide information for various curator-reports such as 'papers selected for GO that refer to genes with NO GO annotation'. Once indexed, curators who have expertise in appropriate disciplines enter pertinent information. MGI makes use of several controlled vocabularies that ensure uniform data encoding, enable robust analysis and support the construction of complex queries. These vocabularies range from pick-lists to structured vocabularies such as the GO. All data associations are supported

  12. A statistical framework for improving genomic annotations of prokaryotic essential genes.

    PubMed

    Deng, Jingyuan; Su, Shengchang; Lin, Xiaodong; Hassett, Daniel J; Lu, Long Jason

    2013-01-01

    Large-scale systematic analysis of gene essentiality is an important step closer toward unraveling the complex relationship between genotypes and phenotypes. Such analysis cannot be accomplished without unbiased and accurate annotations of essential genes. In current genomic databases, most of the essential gene annotations are derived from whole-genome transposon mutagenesis (TM), the most frequently used experimental approach for determining essential genes in microorganisms under defined conditions. However, there are substantial systematic biases associated with TM experiments. In this study, we developed a novel Poisson model-based statistical framework to simulate the TM insertion process and subsequently correct the experimental biases. We first quantitatively assessed the effects of major factors that potentially influence the accuracy of TM and subsequently incorporated relevant factors into the framework. Through iteratively optimizing parameters, we inferred the actual insertion events occurred and described each gene's essentiality on probability measure. Evaluated by the definite mapping of essential gene profile in Escherichia coli, our model significantly improved the accuracy of original TM datasets, resulting in more accurate annotations of essential genes. Our method also showed encouraging results in improving subsaturation level TM datasets. To test our model's broad applicability to other bacteria, we applied it to Pseudomonas aeruginosa PAO1 and Francisella tularensis novicida TM datasets. We validated our predictions by literature as well as allelic exchange experiments in PAO1. Our model was correct on six of the seven tested genes. Remarkably, among all three cases that our predictions contradicted the TM assignments, experimental validations supported our predictions. In summary, our method will be a promising tool in improving genomic annotations of essential genes and enabling large-scale explorations of gene essentiality. Our

  13. Sequencing and genome annotation of honey bee microsporidia parasite, Nosema apis and comparative genome analysis with its sympatric congener, N. ceranae

    Technology Transfer Automated Retrieval System (TEKTRAN)

    Here we present a draft genome sequence and annotation of the honey bee microsporidian parasite, Nosema apis. We applied the whole-genome shotgun (WGS) sequencing approach to sequence and assemble the genome of N. apis to 22-fold sequence coverage. We predicted 2927 protein-coding genes in the N. ...

  14. A field guide to whole-genome sequencing, assembly and annotation

    PubMed Central

    Ekblom, Robert; Wolf, Jochen B W

    2014-01-01

    Genome sequencing projects were long confined to biomedical model organisms and required the concerted effort of large consortia. Rapid progress in high-throughput sequencing technology and the simultaneous development of bioinformatic tools have democratized the field. It is now within reach for individual research groups in the eco-evolutionary and conservation community to generate de novo draft genome sequences for any organism of choice. Because of the cost and considerable effort involved in such an endeavour, the important first step is to thoroughly consider whether a genome sequence is necessary for addressing the biological question at hand. Once this decision is taken, a genome project requires careful planning with respect to the organism involved and the intended quality of the genome draft. Here, we briefly review the state of the art within this field and provide a step-by-step introduction to the workflow involved in genome sequencing, assembly and annotation with particular reference to large and complex genomes. This tutorial is targeted at scientists with a background in conservation genetics, but more generally, provides useful practical guidance for researchers engaging in whole-genome sequencing projects. PMID:25553065

  15. Bacillus pumilus SAFR-032 Genome Revisited: Sequence Update and Re-Annotation.

    PubMed

    Stepanov, Victor G; Tirumalai, Madhan R; Montazari, Saied; Checinska, Aleksandra; Venkateswaran, Kasthuri; Fox, George E

    2016-01-01

    Bacillus pumilus strain SAFR-032 is a non-pathogenic spore-forming bacterium exhibiting an anomalously high persistence in bactericidal environments. In its dormant state, it is capable of withstanding doses of ultraviolet (UV) radiation or hydrogen peroxide, which are lethal for the vast majority of microorganisms. This unusual resistance profile has made SAFR-032 a reference strain for studies of bacterial spore resistance. The complete genome sequence of B. pumilus SAFR-032 was published in 2007 early in the genomics era. Since then, the SAFR-032 strain has frequently been used as a source of genetic/genomic information that was regarded as representative of the entire B. pumilus species group. Recently, our ongoing studies of conservation of gene distribution patterns in the complete genomes of various B. pumilus strains revealed indications of misassembly in the B. pumilus SAFR-032 genome. Synteny-driven local genome resequencing confirmed that the original SAFR-032 sequence contained assembly errors associated with long sequence repeats. The genome sequence was corrected according to the new findings. In addition, a significantly improved annotation is now available. Gene orders were compared and portions of the genome arrangement were found to be similar in a wide spectrum of Bacillus strains.

  16. Bacillus pumilus SAFR-032 Genome Revisited: Sequence Update and Re-Annotation

    PubMed Central

    Stepanov, Victor G.; Tirumalai, Madhan R.; Montazari, Saied; Checinska, Aleksandra; Venkateswaran, Kasthuri

    2016-01-01

    Bacillus pumilus strain SAFR-032 is a non-pathogenic spore-forming bacterium exhibiting an anomalously high persistence in bactericidal environments. In its dormant state, it is capable of withstanding doses of ultraviolet (UV) radiation or hydrogen peroxide, which are lethal for the vast majority of microorganisms. This unusual resistance profile has made SAFR-032 a reference strain for studies of bacterial spore resistance. The complete genome sequence of B. pumilus SAFR-032 was published in 2007 early in the genomics era. Since then, the SAFR-032 strain has frequently been used as a source of genetic/genomic information that was regarded as representative of the entire B. pumilus species group. Recently, our ongoing studies of conservation of gene distribution patterns in the complete genomes of various B. pumilus strains revealed indications of misassembly in the B. pumilus SAFR-032 genome. Synteny-driven local genome resequencing confirmed that the original SAFR-032 sequence contained assembly errors associated with long sequence repeats. The genome sequence was corrected according to the new findings. In addition, a significantly improved annotation is now available. Gene orders were compared and portions of the genome arrangement were found to be similar in a wide spectrum of Bacillus strains. PMID:27351589

  17. Annotation of the Asian Citrus Psyllid Genome Reveals a Reduced Innate Immune System

    PubMed Central

    Arp, Alex P.; Hunter, Wayne B.; Pelz-Stelinski, Kirsten S.

    2016-01-01

    Citrus production worldwide is currently facing significant losses due to citrus greening disease, also known as Huanglongbing. The citrus greening bacteria, Candidatus Liberibacter asiaticus (CLas), is a persistent propagative pathogen transmitted by the Asian citrus psyllid, Diaphorina citri Kuwayama (Hemiptera: Liviidae). Hemipterans characterized to date lack a number of insect immune genes, including those associated with the Imd pathway targeting Gram-negative bacteria. The D. citri draft genome was used to characterize the immune defense genes present in D. citri. Predicted mRNAs identified by screening the published D. citri annotated draft genome were manually searched using a custom database of immune genes from previously annotated insect genomes. Toll and JAK/STAT pathways, general defense genes Dual oxidase, Nitric oxide synthase, prophenoloxidase, and cellular immune defense genes were present in D. citri. In contrast, D. citri lacked genes for the Imd pathway, most antimicrobial peptides, 1,3-β-glucan recognition proteins (GNBPs), and complete peptidoglycan recognition proteins. These data suggest that D. citri has a reduced immune capability similar to that observed in A. pisum, P. humanus, and R. prolixus. The absence of immune system genes from the D. citri genome may facilitate CLas infections, and is possibly compensated for by their relationship with their microbial endosymbionts. PMID:27965582

  18. Annotation of the Asian Citrus Psyllid Genome Reveals a Reduced Innate Immune System.

    PubMed

    Arp, Alex P; Hunter, Wayne B; Pelz-Stelinski, Kirsten S

    2016-01-01

    Citrus production worldwide is currently facing significant losses due to citrus greening disease, also known as Huanglongbing. The citrus greening bacteria, Candidatus Liberibacter asiaticus (CLas), is a persistent propagative pathogen transmitted by the Asian citrus psyllid, Diaphorina citri Kuwayama (Hemiptera: Liviidae). Hemipterans characterized to date lack a number of insect immune genes, including those associated with the Imd pathway targeting Gram-negative bacteria. The D. citri draft genome was used to characterize the immune defense genes present in D. citri. Predicted mRNAs identified by screening the published D. citri annotated draft genome were manually searched using a custom database of immune genes from previously annotated insect genomes. Toll and JAK/STAT pathways, general defense genes Dual oxidase, Nitric oxide synthase, prophenoloxidase, and cellular immune defense genes were present in D. citri. In contrast, D. citri lacked genes for the Imd pathway, most antimicrobial peptides, 1,3-β-glucan recognition proteins (GNBPs), and complete peptidoglycan recognition proteins. These data suggest that D. citri has a reduced immune capability similar to that observed in A. pisum, P. humanus, and R. prolixus. The absence of immune system genes from the D. citri genome may facilitate CLas infections, and is possibly compensated for by their relationship with their microbial endosymbionts.

  19. Non-coding RNA annotation of the genome of Trichoplax adhaerens.

    PubMed

    Hertel, Jana; de Jong, Danielle; Marz, Manja; Rose, Dominic; Tafer, Hakim; Tanzer, Andrea; Schierwater, Bernd; Stadler, Peter F

    2009-04-01

    A detailed annotation of non-protein coding RNAs is typically missing in initial releases of newly sequenced genomes. Here we report on a comprehensive ncRNA annotation of the genome of Trichoplax adhaerens, the presumably most basal metazoan whose genome has been published to-date. Since blast identified only a small fraction of the best-conserved ncRNAs--in particular rRNAs, tRNAs and some snRNAs--we developed a semi-global dynamic programming tool, GotohScan, to increase the sensitivity of the homology search. It successfully identified the full complement of major and minor spliceosomal snRNAs, the genes for RNase P and MRP RNAs, the SRP RNA, as well as several small nucleolar RNAs. We did not find any microRNA candidates homologous to known eumetazoan sequences. Interestingly, most ncRNAs, including the pol-III transcripts, appear as single-copy genes or with very small copy numbers in the Trichoplax genome.

  20. MIPS: analysis and annotation of proteins from whole genomes in 2005.

    PubMed

    Mewes, H W; Frishman, D; Mayer, K F X; Münsterkötter, M; Noubibou, O; Pagel, P; Rattei, T; Oesterheld, M; Ruepp, A; Stümpflen, V

    2006-01-01

    The Munich Information Center for Protein Sequences (MIPS at the GSF), Neuherberg, Germany, provides resources related to genome information. Manually curated databases for several reference organisms are maintained. Several of these databases are described elsewhere in this and other recent NAR database issues. In a complementary effort, a comprehensive set of >400 genomes automatically annotated with the PEDANT system are maintained. The main goal of our current work on creating and maintaining genome databases is to extend gene centered information to information on interactions within a generic comprehensive framework. We have concentrated our efforts along three lines (i) the development of suitable comprehensive data structures and database technology, communication and query tools to include a wide range of different types of information enabling the representation of complex information such as functional modules or networks Genome Research Environment System, (ii) the development of databases covering computable information such as the basic evolutionary relations among all genes, namely SIMAP, the sequence similarity matrix and the CABiNet network analysis framework and (iii) the compilation and manual annotation of information related to interactions such as protein-protein interactions or other types of relations (e.g. MPCDB, MPPI, CYGD). All databases described and the detailed descriptions of our projects can be accessed through the MIPS WWW server (http://mips.gsf.de).

  1. The Fast Changing Landscape of Sequencing Technologies and Their Impact on Microbial Genome Assemblies and Annotation

    SciTech Connect

    Mavromatis, K; Land, Miriam L; Brettin, Thomas S; Quest, Daniel J; Copeland, A; Clum, Alicia; Goodwin, Lynne A.; Woyke, Tanja; Lapidus, Alla L.; Klenk, Hans-Peter; Cottingham, Robert W; Kyrpides, Nikos C

    2012-01-01

    Background: The emergence of next generation sequencing (NGS) has provided the means for rapid and high throughput sequencing and data generation at low cost, while concomitantly creating a new set of challenges. The number of available assembled microbial genomes continues to grow rapidly and their quality reflects the quality of the sequencing technology used, but also of the analysis software employed for assembly and annotation. Methodology/Principal Findings: In this work, we have explored the quality of the microbial draft genomes across various sequencing technologies. We have compared the draft and finished assemblies of 133 microbial genomes sequenced at the Department of Energy-Joint Genome Institute and finished at the Los Alamos National Laboratory using a variety of combinations of sequencing technologies, reflecting the transition of the institute from Sanger-based sequencing platforms to NGS platforms. The quality of the public assemblies and of the associated gene annotations was evaluated using various metrics. Results obtained with the different sequencing technologies, as well as their effects on downstream processes, were analyzed. Our results demonstrate that the Illumina HiSeq 2000 sequencing system, the primary sequencing technology currently used for de novo genome sequencing and assembly at JGI, has various advantages in terms of total sequence throughput and cost, but it also introduces challenges for the downstream analyses. In all cases assembly results although on average are of high quality, need to be viewed critically and consider sources of errors in them prior to analysis. Conclusion: These data follow the evolution of microbial sequencing and downstream processing at the JGI from draft genome sequences with large gaps corresponding to missing genes of significant biological role to assemblies with multiple small gaps (Illumina) and finally to assemblies that generate almost complete genomes (Illumina+PacBio).

  2. SigmoID: a user-friendly tool for improving bacterial genome annotation through analysis of transcription control signals

    PubMed Central

    Damienikan, Aliaksandr U.

    2016-01-01

    The majority of bacterial genome annotations are currently automated and based on a ‘gene by gene’ approach. Regulatory signals and operon structures are rarely taken into account which often results in incomplete and even incorrect gene function assignments. Here we present SigmoID, a cross-platform (OS X, Linux and Windows) open-source application aiming at simplifying the identification of transcription regulatory sites (promoters, transcription factor binding sites and terminators) in bacterial genomes and providing assistance in correcting annotations in accordance with regulatory information. SigmoID combines a user-friendly graphical interface to well known command line tools with a genome browser for visualising regulatory elements in genomic context. Integrated access to online databases with regulatory information (RegPrecise and RegulonDB) and web-based search engines speeds up genome analysis and simplifies correction of genome annotation. We demonstrate some features of SigmoID by constructing a series of regulatory protein binding site profiles for two groups of bacteria: Soft Rot Enterobacteriaceae (Pectobacterium and Dickeya spp.) and Pseudomonas spp. Furthermore, we inferred over 900 transcription factor binding sites and alternative sigma factor promoters in the annotated genome of Pectobacterium atrosepticum. These regulatory signals control putative transcription units covering about 40% of the P. atrosepticum chromosome. Reviewing the annotation in cases where it didn’t fit with regulatory information allowed us to correct product and gene names for over 300 loci. PMID:27257541

  3. SigmoID: a user-friendly tool for improving bacterial genome annotation through analysis of transcription control signals.

    PubMed

    Nikolaichik, Yevgeny; Damienikan, Aliaksandr U

    2016-01-01

    The majority of bacterial genome annotations are currently automated and based on a 'gene by gene' approach. Regulatory signals and operon structures are rarely taken into account which often results in incomplete and even incorrect gene function assignments. Here we present SigmoID, a cross-platform (OS X, Linux and Windows) open-source application aiming at simplifying the identification of transcription regulatory sites (promoters, transcription factor binding sites and terminators) in bacterial genomes and providing assistance in correcting annotations in accordance with regulatory information. SigmoID combines a user-friendly graphical interface to well known command line tools with a genome browser for visualising regulatory elements in genomic context. Integrated access to online databases with regulatory information (RegPrecise and RegulonDB) and web-based search engines speeds up genome analysis and simplifies correction of genome annotation. We demonstrate some features of SigmoID by constructing a series of regulatory protein binding site profiles for two groups of bacteria: Soft Rot Enterobacteriaceae (Pectobacterium and Dickeya spp.) and Pseudomonas spp. Furthermore, we inferred over 900 transcription factor binding sites and alternative sigma factor promoters in the annotated genome of Pectobacterium atrosepticum. These regulatory signals control putative transcription units covering about 40% of the P. atrosepticum chromosome. Reviewing the annotation in cases where it didn't fit with regulatory information allowed us to correct product and gene names for over 300 loci.

  4. Sequencing and annotated analysis of full genome of Holstein breed bull.

    PubMed

    Kõks, Sulev; Reimann, Ene; Lilleoja, Rutt; Lättekivi, Freddy; Salumets, Andres; Reemann, Paula; Jaakma, Ülle

    2014-08-01

    In the present study, we describe the deep sequencing and structural analysis of the Holstein breed bull genome. Our aim was to receive a high-quality Holstein bull genome reference sequence and to describe different types of variations in its genome compared to Hereford breed as a reference. We generated four mate-paired libraries and one fragment library from 30 μg of genomic DNA. Colour space fasta were mapped and paired to the reference cow (Bos taurus) genome assembly from Oct. 2011 (Baylor 4.6.1/bosTau7). Initial sequencing resulted in the 4,864,054,296 of 50-bp reads. Average mapping efficiency was 71.7 % and altogether 3,494,534,136 reads and 157,928,163,086 bp were successfully mapped, resulting in 60 × coverage. This is the highest coverage for bovine genome published so far. Tertiary analysis found 6,362,988 SNPs in the bull's genome, 4,045,889 heterozygous and 2,317,099 homozygous variants. Annotation revealed that 4,330,337 of all discovered SNPs were annotated in the dbSNP database (build 137) and therefore 2,032,651 SNPs were novel. Large indel variations accounted for the 245,947,845 bp of the variation in entire genome and their number was 312,879. We also found that small indels (number was 633,310) accounted for the total variation of 2,542,552 nucleotides in the genome. Only 106,768 small indels were listed in the dbSNP. Finally, we identified 2,758 inversions in the genome of the bull covering in total 23,099,054 bp of genome's variation. The largest inversion was 87,440 bp in size. In conclusion, the present study discovered different types of novel variants in bull's genome after high-coverage sequencing. Better knowledge of the functions of these variations is needed.

  5. Discovery and annotation of small proteins using genomics, proteomics and computational approaches

    SciTech Connect

    Yang, Xiaohan; Tschaplinski, Timothy J.; Hurst, Gregory B.; Jawdy, Sara; Abraham, Paul E.; Lankford, Patricia K.; Adams, Rachel M.; Shah, Manesh B.; Hettich, Robert L.; Lindquist, Erika; Kalluri, Udaya C.; Gunter, Lee E.; Pennacchio, Christa; Tuskan, Gerald A.

    2011-03-02

    Small proteins (10 200 amino acids aa in length) encoded by short open reading frames (sORF) play important regulatory roles in various biological processes, including tumor progression, stress response, flowering, and hormone signaling. However, ab initio discovery of small proteins has been relatively overlooked. Recent advances in deep transcriptome sequencing make it possible to efficiently identify sORFs at the genome level. In this study, we obtained 2.6 million expressed sequence tag (EST) reads from Populus deltoides leaf transcriptome and reconstructed full-length transcripts from the EST sequences. We identified an initial set of 12,852 sORFs encoding proteins of 10 200 aa in length. Three computational approaches were then used to enrich for bona fide protein-coding sORFs from the initial sORF set: (1) codingpotential prediction, (2) evolutionary conservation between P. deltoides and other plant species, and (3) gene family clustering within P. deltoides. As a result, a high-confidence sORF candidate set containing 1469 genes was obtained. Analysis of the protein domains, non-protein-coding RNA motifs, sequence length distribution, and protein mass spectrometry data supported this high-confidence sORF set. In the high-confidence sORF candidate set, known protein domains were identified in 1282 genes (higher-confidence sORF candidate set), out of which 611 genes, designated as highest-confidence candidate sORF set, were supported by proteomics data. Of the 611 highest-confidence candidate sORF genes, 56 were new to the current Populus genome annotation. This study not only demonstrates that there are potential sORF candidates to be annotated in sequenced genomes, but also presents an efficient strategy for discovery of sORFs in species with no genome annotation yet available.

  6. Discovery and annotation of small proteins using genomics, proteomics, and computational approaches.

    PubMed

    Yang, Xiaohan; Tschaplinski, Timothy J; Hurst, Gregory B; Jawdy, Sara; Abraham, Paul E; Lankford, Patricia K; Adams, Rachel M; Shah, Manesh B; Hettich, Robert L; Lindquist, Erika; Kalluri, Udaya C; Gunter, Lee E; Pennacchio, Christa; Tuskan, Gerald A

    2011-04-01

    Small proteins (10-200 amino acids [aa] in length) encoded by short open reading frames (sORF) play important regulatory roles in various biological processes, including tumor progression, stress response, flowering, and hormone signaling. However, ab initio discovery of small proteins has been relatively overlooked. Recent advances in deep transcriptome sequencing make it possible to efficiently identify sORFs at the genome level. In this study, we obtained ~2.6 million expressed sequence tag (EST) reads from Populus deltoides leaf transcriptome and reconstructed full-length transcripts from the EST sequences. We identified an initial set of 12,852 sORFs encoding proteins of 10-200 aa in length. Three computational approaches were then used to enrich for bona fide protein-coding sORFs from the initial sORF set: (1) coding-potential prediction, (2) evolutionary conservation between P. deltoides and other plant species, and (3) gene family clustering within P. deltoides. As a result, a high-confidence sORF candidate set containing 1469 genes was obtained. Analysis of the protein domains, non-protein-coding RNA motifs, sequence length distribution, and protein mass spectrometry data supported this high-confidence sORF set. In the high-confidence sORF candidate set, known protein domains were identified in 1282 genes (higher-confidence sORF candidate set), out of which 611 genes, designated as highest-confidence candidate sORF set, were supported by proteomics data. Of the 611 highest-confidence candidate sORF genes, 56 were new to the current Populus genome annotation. This study not only demonstrates that there are potential sORF candidates to be annotated in sequenced genomes, but also presents an efficient strategy for discovery of sORFs in species with no genome annotation yet available.

  7. Genome, Functional Gene Annotation, and Nuclear Transformation of the Heterokont Oleaginous Alga Nannochloropsis oceanica CCMP1779

    PubMed Central

    Tsai, Chia-Hong; Bullard, Blair; Cornish, Adam J.; Harvey, Christopher; Reca, Ida-Barbara; Thornburg, Chelsea; Achawanantakun, Rujira; Buehl, Christopher J.; Campbell, Michael S.; Cavalier, David; Childs, Kevin L.; Clark, Teresa J.; Deshpande, Rahul; Erickson, Erika; Armenia Ferguson, Ann; Handee, Witawas; Kong, Que; Li, Xiaobo; Liu, Bensheng; Lundback, Steven; Peng, Cheng; Roston, Rebecca L.; Sanjaya; Simpson, Jeffrey P.; TerBush, Allan; Warakanont, Jaruswan; Zäuner, Simone; Farre, Eva M.; Hegg, Eric L.; Jiang, Ning; Kuo, Min-Hao; Lu, Yan; Niyogi, Krishna K.; Ohlrogge, John; Osteryoung, Katherine W.; Shachar-Hill, Yair; Sears, Barbara B.; Sun, Yanni; Takahashi, Hideki; Yandell, Mark; Shiu, Shin-Han; Benning, Christoph

    2012-01-01

    Unicellular marine algae have promise for providing sustainable and scalable biofuel feedstocks, although no single species has emerged as a preferred organism. Moreover, adequate molecular and genetic resources prerequisite for the rational engineering of marine algal feedstocks are lacking for most candidate species. Heterokonts of the genus Nannochloropsis naturally have high cellular oil content and are already in use for industrial production of high-value lipid products. First success in applying reverse genetics by targeted gene replacement makes Nannochloropsis oceanica an attractive model to investigate the cell and molecular biology and biochemistry of this fascinating organism group. Here we present the assembly of the 28.7 Mb genome of N. oceanica CCMP1779. RNA sequencing data from nitrogen-replete and nitrogen-depleted growth conditions support a total of 11,973 genes, of which in addition to automatic annotation some were manually inspected to predict the biochemical repertoire for this organism. Among others, more than 100 genes putatively related to lipid metabolism, 114 predicted transcription factors, and 109 transcriptional regulators were annotated. Comparison of the N. oceanica CCMP1779 gene repertoire with the recently published N. gaditana genome identified 2,649 genes likely specific to N. oceanica CCMP1779. Many of these N. oceanica–specific genes have putative orthologs in other species or are supported by transcriptional evidence. However, because similarity-based annotations are limited, functions of most of these species-specific genes remain unknown. Aside from the genome sequence and its analysis, protocols for the transformation of N. oceanica CCMP1779 are provided. The availability of genomic and transcriptomic data for Nannochloropsis oceanica CCMP1779, along with efficient transformation protocols, provides a blueprint for future detailed gene functional analysis and genetic engineering of Nannochloropsis species by a growing

  8. Genome, functional gene annotation, and nuclear transformation of the heterokont oleaginous alga Nannochloropsis oceanica CCMP1779.

    PubMed

    Vieler, Astrid; Wu, Guangxi; Tsai, Chia-Hong; Bullard, Blair; Cornish, Adam J; Harvey, Christopher; Reca, Ida-Barbara; Thornburg, Chelsea; Achawanantakun, Rujira; Buehl, Christopher J; Campbell, Michael S; Cavalier, David; Childs, Kevin L; Clark, Teresa J; Deshpande, Rahul; Erickson, Erika; Armenia Ferguson, Ann; Handee, Witawas; Kong, Que; Li, Xiaobo; Liu, Bensheng; Lundback, Steven; Peng, Cheng; Roston, Rebecca L; Sanjaya; Simpson, Jeffrey P; Terbush, Allan; Warakanont, Jaruswan; Zäuner, Simone; Farre, Eva M; Hegg, Eric L; Jiang, Ning; Kuo, Min-Hao; Lu, Yan; Niyogi, Krishna K; Ohlrogge, John; Osteryoung, Katherine W; Shachar-Hill, Yair; Sears, Barbara B; Sun, Yanni; Takahashi, Hideki; Yandell, Mark; Shiu, Shin-Han; Benning, Christoph

    2012-01-01

    Unicellular marine algae have promise for providing sustainable and scalable biofuel feedstocks, although no single species has emerged as a preferred organism. Moreover, adequate molecular and genetic resources prerequisite for the rational engineering of marine algal feedstocks are lacking for most candidate species. Heterokonts of the genus Nannochloropsis naturally have high cellular oil content and are already in use for industrial production of high-value lipid products. First success in applying reverse genetics by targeted gene replacement makes Nannochloropsis oceanica an attractive model to investigate the cell and molecular biology and biochemistry of this fascinating organism group. Here we present the assembly of the 28.7 Mb genome of N. oceanica CCMP1779. RNA sequencing data from nitrogen-replete and nitrogen-depleted growth conditions support a total of 11,973 genes, of which in addition to automatic annotation some were manually inspected to predict the biochemical repertoire for this organism. Among others, more than 100 genes putatively related to lipid metabolism, 114 predicted transcription factors, and 109 transcriptional regulators were annotated. Comparison of the N. oceanica CCMP1779 gene repertoire with the recently published N. gaditana genome identified 2,649 genes likely specific to N. oceanica CCMP1779. Many of these N. oceanica-specific genes have putative orthologs in other species or are supported by transcriptional evidence. However, because similarity-based annotations are limited, functions of most of these species-specific genes remain unknown. Aside from the genome sequence and its analysis, protocols for the transformation of N. oceanica CCMP1779 are provided. The availability of genomic and transcriptomic data for Nannochloropsis oceanica CCMP1779, along with efficient transformation protocols, provides a blueprint for future detailed gene functional analysis and genetic engineering of Nannochloropsis species by a growing

  9. ARG-ANNOT, a New Bioinformatic Tool To Discover Antibiotic Resistance Genes in Bacterial Genomes

    PubMed Central

    Gupta, Sushim Kumar; Padmanabhan, Babu Roshan; Diene, Seydina M.; Lopez-Rojas, Rafael; Kempf, Marie; Landraud, Luce

    2014-01-01

    ARG-ANNOT (Antibiotic Resistance Gene-ANNOTation) is a new bioinformatic tool that was created to detect existing and putative new antibiotic resistance (AR) genes in bacterial genomes. ARG-ANNOT uses a local BLAST program in Bio-Edit software that allows the user to analyze sequences without a Web interface. All AR genetic determinants were collected from published works and online resources; nucleotide and protein sequences were retrieved from the NCBI GenBank database. After building a database that includes 1,689 antibiotic resistance genes, the software was tested in a blind manner using 100 random sequences selected from the database to verify that the sensitivity and specificity were at 100% even when partial sequences were queried. Notably, BLAST analysis results obtained using the rmtF gene sequence (a new aminoglycoside-modifying enzyme gene sequence that is not included in the database) as a query revealed that the tool was able to link this sequence to short sequences (17 to 40 bp) found in other genes of the rmt family with significant E values. Finally, the analysis of 178 Acinetobacter baumannii and 20 Staphylococcus aureus genomes allowed the detection of a significantly higher number of AR genes than the Resfinder gene analyzer and 11 point mutations in target genes known to be associated with AR. The average time for the analysis of a genome was 3.35 ± 0.13 min. We have created a concise database for BLAST using a Bio-Edit interface that can detect AR genetic determinants in bacterial genomes and can rapidly and easily discover putative new AR genetic determinants. PMID:24145532

  10. FlyBase: introduction of the Drosophila melanogaster Release 6 reference genome assembly and large-scale migration of genome annotations

    PubMed Central

    dos Santos, Gilberto; Schroeder, Andrew J.; Goodman, Joshua L.; Strelets, Victor B.; Crosby, Madeline A.; Thurmond, Jim; Emmert, David B.; Gelbart, William M.

    2015-01-01

    Release 6, the latest reference genome assembly of the fruit fly Drosophila melanogaster, was released by the Berkeley Drosophila Genome Project in 2014; it replaces their previous Release 5 genome assembly, which had been the reference genome assembly for over 7 years. With the enormous amount of information now attached to the D. melanogaster genome in public repositories and individual laboratories, the replacement of the previous assembly by the new one is a major event requiring careful migration of annotations and genome-anchored data to the new, improved assembly. In this report, we describe the attributes of the new Release 6 reference genome assembly, the migration of FlyBase genome annotations to this new assembly, how genome features on this new assembly can be viewed in FlyBase (http://flybase.org) and how users can convert coordinates for their own data to the corresponding Release 6 coordinates. PMID:25398896

  11. FlyBase: introduction of the Drosophila melanogaster Release 6 reference genome assembly and large-scale migration of genome annotations.

    PubMed

    dos Santos, Gilberto; Schroeder, Andrew J; Goodman, Joshua L; Strelets, Victor B; Crosby, Madeline A; Thurmond, Jim; Emmert, David B; Gelbart, William M

    2015-01-01

    Release 6, the latest reference genome assembly of the fruit fly Drosophila melanogaster, was released by the Berkeley Drosophila Genome Project in 2014; it replaces their previous Release 5 genome assembly, which had been the reference genome assembly for over 7 years. With the enormous amount of information now attached to the D. melanogaster genome in public repositories and individual laboratories, the replacement of the previous assembly by the new one is a major event requiring careful migration of annotations and genome-anchored data to the new, improved assembly. In this report, we describe the attributes of the new Release 6 reference genome assembly, the migration of FlyBase genome annotations to this new assembly, how genome features on this new assembly can be viewed in FlyBase (http://flybase.org) and how users can convert coordinates for their own data to the corresponding Release 6 coordinates.

  12. Algal Functional Annotation Tool from the DOE-UCLA Institute for Genomics and Proteomics

    DOE Data Explorer

    Lopez, David

    The Algal Functional Annotation Tool is a bioinformatics resource to visualize pathway maps, identify enriched biological terms, or convert gene identifiers to elucidate biological function in silico. These types of analysis have been catered to support lists of gene identifiers, such as those coming from transcriptome gene expression analysis. By analyzing the functional annotation of an interesting set of genes, common biological motifs may be elucidated and a first-pass analysis can point further research in the right direction. Currently, the following databases have been parsed, processed, and added to the tool: 1( Kyoto Encyclopedia of Genes and Genomes (KEGG) Pathways Database, 2) MetaCyc Encyclopedia of Metabolic Pathways, 3) Panther Pathways Database, 4) Reactome Pathways Database, 5) Gene Ontology, 6) MapMan Ontology, 7) KOG (Eukaryotic Clusters of Orthologous Groups), 5)Pfam, 6) InterPro.

  13. Functional annotations for the Saccharomyces cerevisiae genome: the knowns and the known unknowns

    PubMed Central

    Christie, Karen R.; Hong, Eurie L.; Cherry, J. Michael

    2011-01-01

    The quest to characterize each of the genes of the yeast Saccharomyces cerevisiae has propelled the development and application of novel high-throughput (HTP) experimental techniques. To handle the enormous amount of information generated by these techniques, new bioinformatics tools and resources are needed. Gene Ontology (GO) annotations curated by the Saccharomyces Genome Database (SGD) have facilitated the development of algorithms that analyze HTP data and help predict functions for poorly characterized genes in S. cerevisiae and other organisms. Here, we describe how published results are incorporated into GO annotations at SGD and why researchers can benefit from using these resources wisely to analyze their HTP data and predict gene functions. PMID:19577472

  14. FEATnotator: A tool for integrated annotation of sequence features and variation, facilitating interpretation in genomics experiments.

    PubMed

    Podicheti, Ram; Mockaitis, Keithanne

    2015-06-01

    As approaches are sought for more efficient and democratized uses of non-model and expanded model genomics references, ease of integration of genomic feature datasets is especially desirable in multidisciplinary research communities. Valuable conclusions are often missed or slowed when researchers refer experimental results to a single reference sequence that lacks integrated pan-genomic and multi-experiment data in accessible formats. Association of genomic positional information, such as results from an expansive variety of next-generation sequencing experiments, with annotated reference features such as genes or predicted protein binding sites, provides the context essential for conclusions and ongoing research. When the experimental system includes polymorphic genomic inputs, rapid calculation of gene structural and protein translational effects of sequence variation from the reference can be invaluable. Here we present FEATnotator, a lightweight, fast and easy to use open source software program that integrates and reports overlap and proximity in genomic information from any user-defined datasets including those from next generation sequencing applications. We illustrate use of the tool by summarizing whole genome sequence variation of a widely used natural isolate of Arabidopsis thaliana in the context of gene models of the reference accession. Previous discovery of a protein coding deletion influencing root development is replicated rapidly. Appropriate even in investigations of a single gene or genic regions such as QTL, comprehensive reports provided by FEATnotator better prepare researchers for interpretation of their experimental results. The tool is available for download at http://featnotator.sourceforge.net.

  15. Complete annotated genome sequence of Mycobacterium tuberculosis (Zopf) Lehmann and Neumann (ATCC35812) (Kurono).

    PubMed

    Miyoshi-Akiyama, Tohru; Satou, Kazuhito; Kato, Masako; Shiroma, Akino; Matsumura, Kazunori; Tamotsu, Hinako; Iwai, Hiroki; Teruya, Kuniko; Funatogawa, Keiji; Hirano, Takashi; Kirikae, Teruo

    2015-01-01

    We report the completely annotated genome sequence of Mycobacterium tuberculosis (Zopf) Lehmann and Neumann (ATCC35812) (Kurono), which is a used for virulence and/or immunization studies. The complete genome sequence of M. tuberculosis Kurono was determined with a length of 4,415,078 bp and a G+C content of 65.60%. The chromosome was shown to contain a total of 4,340 protein-coding genes, 53 tRNA genes, one transfer messenger RNA for all amino acids, and 1 rrn operon. Lineage analysis based on large sequence polymorphisms indicated that M. tuberculosis Kurono belongs to the Euro-American lineage (lineage 4). Phylogenetic analysis using whole genome sequences of M. tuberculosis Kurono in addition to 22 M. tuberculosis complex strains indicated that H37Rv is the closest relative of Kurono based on the results of phylogenetic analysis. These findings provide a basis for research using M. tuberculosis Kurono, especially in animal models.

  16. TheViral MetaGenome Annotation Pipeline(VMGAP):an automated tool for the functional annotation of viral Metagenomic shotgun sequencing data

    PubMed Central

    Lorenzi, Hernan A.; Hoover, Jeff; Inman, Jason; Safford, Todd; Murphy, Sean; Kagan, Leonid; Williamson, Shannon J.

    2011-01-01

    In the past few years, the field of metagenomics has been growing at an accelerated pace, particularly in response to advancements in new sequencing technologies. The large volume of sequence data from novel organisms generated by metagenomic projects has triggered the development of specialized databases and tools focused on particular groups of organisms or data types. Here we describe a pipeline for the functional annotation of viral metagenomic sequence data. The Viral MetaGenome Annotation Pipeline (VMGAP) pipeline takes advantage of a number of specialized databases, such as collections of mobile genetic elements and environmental metagenomes to improve the classification and functional prediction of viral gene products. The pipeline assigns a functional term to each predicted protein sequence following a suite of comprehensive analyses whose results are ranked according to a priority rules hierarchy. Additional annotation is provided in the form of enzyme commission (EC) numbers, GO/MeGO terms and Hidden Markov Models together with supporting evidence. PMID:21886867

  17. GAPP: A Proteogenomic Software for Genome Annotation and Global Profiling of Post-translational Modifications in Prokaryotes.

    PubMed

    Zhang, Jia; Yang, Ming-Kun; Zeng, Honghui; Ge, Feng

    2016-11-01

    Although the number of sequenced prokaryotic genomes is growing rapidly, experimentally verified annotation of prokaryotic genome remains patchy and challenging. To facilitate genome annotation efforts for prokaryotes, we developed an open source software called GAPP for genome annotation and global profiling of post-translational modifications (PTMs) in prokaryotes. With a single command, it provides a standard workflow to validate and refine predicted genetic models and discover diverse PTM events. We demonstrated the utility of GAPP using proteomic data from Helicobacter pylori, one of the major human pathogens that is responsible for many gastric diseases. Our results confirmed 84.9% of the existing predicted H. pylori proteins, identified 20 novel protein coding genes, and corrected four existing gene models with regard to translation initiation sites. In particular, GAPP revealed a large repertoire of PTMs using the same proteomic data and provided a rich resource that can be used to examine the functions of reversible modifications in this human pathogen. This software is a powerful tool for genome annotation and global discovery of PTMs and is applicable to any sequenced prokaryotic organism; we expect that it will become an integral part of ongoing genome annotation efforts for prokaryotes. GAPP is freely available at https://sourceforge.net/projects/gappproteogenomic/.

  18. Citrus sinensis Annotation Project (CAP): A Comprehensive Database for Sweet Orange Genome

    PubMed Central

    Chang, Ji-Wei; Hao, Bao-Hai; Xing, Feng; Li, Sen; Xu, Qiang; Deng, Xiu-Xin; Chen, Ling-Ling

    2014-01-01

    Citrus is one of the most important and widely grown fruit crop with global production ranking firstly among all the fruit crops in the world. Sweet orange accounts for more than half of the Citrus production both in fresh fruit and processed juice. We have sequenced the draft genome of a double-haploid sweet orange (C. sinensis cv. Valencia), and constructed the Citrus sinensis annotation project (CAP) to store and visualize the sequenced genomic and transcriptome data. CAP provides GBrowse-based organization of sweet orange genomic data, which integrates ab initio gene prediction, EST, RNA-seq and RNA-paired end tag (RNA-PET) evidence-based gene annotation. Furthermore, we provide a user-friendly web interface to show the predicted protein-protein interactions (PPIs) and metabolic pathways in sweet orange. CAP provides comprehensive information beneficial to the researchers of sweet orange and other woody plants, which is freely available at http://citrus.hzau.edu.cn/. PMID:24489955

  19. GO-FAANG meeting: a Gathering On Functional Annotation of Animal Genomes.

    PubMed

    Tuggle, Christopher K; Giuffra, Elisabetta; White, Stephen N; Clarke, Laura; Zhou, Huaijun; Ross, Pablo J; Acloque, Hervé; Reecy, James M; Archibald, Alan; Bellone, Rebecca R; Boichard, Michèle; Chamberlain, Amanda; Cheng, Hans; Crooijmans, Richard P M A; Delany, Mary E; Finno, Carrie J; Groenen, Martien A M; Hayes, Ben; Lunney, Joan K; Petersen, Jessica L; Plastow, Graham S; Schmidt, Carl J; Song, Jiuzhou; Watson, Mick

    2016-10-01

    The Functional Annotation of Animal Genomes (FAANG) Consortium recently held a Gathering On FAANG (GO-FAANG) Workshop in Washington, DC on October 7-8, 2015. This consortium is a grass-roots organization formed to advance the annotation of newly assembled genomes of domesticated and non-model organisms (www.faang.org). The workshop gathered together from around the world a group of 100+ genome scientists, administrators, representatives of funding agencies and commodity groups to discuss the latest advancements of the consortium, new perspectives, next steps and implementation plans. The workshop was streamed live and recorded, and all talks, along with speaker slide presentations, are available at www.faang.org. In this report, we describe the major activities and outcomes of this meeting. We also provide updates on ongoing efforts to implement discussions and decisions taken at GO-FAANG to guide future FAANG activities. In summary, reference datasets are being established under pilot projects; plans for tissue sets, morphological classification and methods of sample collection for different tissues were organized; and core assays and data and meta-data analysis standards were established.

  20. MAKER-P: a tool-kit for the creation, management, and quality control of plant genome annotations

    Technology Transfer Automated Retrieval System (TEKTRAN)

    We have optimized and extended the widely used annotation-engine MAKER for use on plant genomes. We have benchmarked the resulting software, MAKER-P, using the A. thaliana genome and the TAIR10 gene models. Here we demonstrate the ability of the MAKER-P toolkit to generate de novo repeat databases, ...

  1. The SOFG Anatomy Entry List (SAEL): An Annotation Tool for Functional Genomics Data

    PubMed Central

    Parkinson, Helen; Aitken, Stuart; Baldock, Richard A.; Bard, Jonathan B. L.; Burger, Albert; Hayamizu, Terry F.; Rector, Alan; Ringwald, Martin; Rogers, Jeremy; Rosse, Cornelius; Stoeckert, Christian J.

    2004-01-01

    A great deal of data in functional genomics studies needs to be annotated with low-resolution anatomical terms. For example, gene expression assays based on manually dissected samples (microarray, SAGE, etc.) need high-level anatomical terms to describe sample origin. First-pass annotation in high-throughput assays (e.g. large-scale in situ gene expression screens or phenotype screens) and bibliographic applications, such as selection of keywords, would also benefit from a minimum set of standard anatomical terms. Although only simple terms are required, the researcher faces serious practical problems of inconsistency and confusion, given the different aims and the range of complexity of existing anatomy ontologies. A Standards and Ontologies for Functional Genomics (SOFG) group therefore initiated discussions between several of the major anatomical ontologies for higher vertebrates. As we report here, one result of these discussions is a simple, accessible, controlled vocabulary of gross anatomical terms, the SOFG Anatomy Entry List (SAEL). The SAEL is available from http://www.sofg.org and is intended as a resource for biologists, curators, bioinformaticians and developers of software supporting functional genomics. It can be used directly for annotation in the contexts described above. Importantly, each term is linked to the corresponding term in each of the major anatomy ontologies. Where the simple list does not provide enough detail or sophistication, therefore, the researcher can use the SAEL to choose the appropriate ontology and move directly to the relevant term as an entry point. The SAEL links will also be used to support computational access to the respective ontologies. PMID:18629134

  2. The Saccharomyces Genome Database: Gene Product Annotation of Function, Process, and Component.

    PubMed

    Cherry, J Michael

    2015-12-02

    An ontology is a highly structured form of controlled vocabulary. Each entry in the ontology is commonly called a term. These terms are used when talking about an annotation. However, each term has a definition that, like the definition of a word found within a dictionary, provides the complete usage and detailed explanation of the term. It is critical to consult a term's definition because the distinction between terms can be subtle. The use of ontologies in biology started as a way of unifying communication between scientific communities and to provide a standard dictionary for different topics, including molecular functions, biological processes, mutant phenotypes, chemical properties and structures. The creation of ontology terms and their definitions often requires debate to reach agreement but the result has been a unified descriptive language used to communicate knowledge. In addition to terms and definitions, ontologies require a relationship used to define the type of connection between terms. In an ontology, a term can have more than one parent term, the term above it in an ontology, as well as more than one child, the term below it in the ontology. Many ontologies are used to construct annotations in the Saccharomyces Genome Database (SGD), as in all modern biological databases; however, Gene Ontology (GO), a descriptive system used to categorize gene function, is the most extensively used ontology in SGD annotations. Examples included in this protocol illustrate the structure and features of this ontology.

  3. Integrated Transcriptomic-Proteomic Analysis Using a Proteogenomic Workflow Refines Rat Genome Annotation.

    PubMed

    Kumar, Dhirendra; Yadav, Amit Kumar; Jia, Xinying; Mulvenna, Jason; Dash, Debasis

    2016-01-01

    Proteogenomic re-annotation and mRNA splicing information can lead to the discovery of various protein forms for eukaryotic model organisms like rat. However, detection of novel proteoforms using mass spectrometry proteomics data remains a formidable challenge. We developed EuGenoSuite, an open source multiple algorithmic proteomic search tool and utilized it in our in-house integrated transcriptomic-proteomic pipeline to facilitate automated proteogenomic analysis. Using four proteogenomic pipelines (integrated transcriptomic-proteomic, Peppy, Enosi, and ProteoAnnotator) on publicly available RNA-sequence and MS proteomics data, we discovered 363 novel peptides in rat brain microglia representing novel proteoforms for 249 gene loci in the rat genome. These novel peptides aided in the discovery of novel exons, translation of annotated untranslated regions, pseudogenes, and splice variants for various loci; many of which have known disease associations, including neurological disorders like schizophrenia, amyotrophic lateral sclerosis, etc. Novel isoforms were also discovered for genes implicated in cardiovascular diseases and breast cancer for which rats are considered model organisms. Our integrative multi-omics data analysis not only enables the discovery of new proteoforms but also generates an improved reference for human disease studies in the rat model.

  4. Annotation of plant gene function via combined genomics, metabolomics and informatics.

    PubMed

    Tohge, Takayuki; Fernie, Alisdair R

    2012-06-17

    Given the ever expanding number of model plant species for which complete genome sequences are available and the abundance of bio-resources such as knockout mutants, wild accessions and advanced breeding populations, there is a rising burden for gene functional annotation. In this protocol, annotation of plant gene function using combined co-expression gene analysis, metabolomics and informatics is provided (Figure 1). This approach is based on the theory of using target genes of known function to allow the identification of non-annotated genes likely to be involved in a certain metabolic process, with the identification of target compounds via metabolomics. Strategies are put forward for applying this information on populations generated by both forward and reverse genetics approaches in spite of none of these are effortless. By corollary this approach can also be used as an approach to characterise unknown peaks representing new or specific secondary metabolites in the limited tissues, plant species or stress treatment, which is currently the important trial to understanding plant metabolism.

  5. The development of PIPA: an integrated and automated pipeline for genome-wide protein function annotation

    PubMed Central

    Yu, Chenggang; Zavaljevski, Nela; Desai, Valmik; Johnson, Seth; Stevens, Fred J; Reifman, Jaques

    2008-01-01

    low recall (33.0%). Our consensus algorithm for GO annotation is based on the computation and propagation of likelihood scores associated with GO terms. The test results suggest that, for a given recall, the application of the consensus algorithm yields higher precision than when consensus is not used. Conclusion The algorithms implemented in PIPA provide automated genome-wide protein function annotation based on reconciled predictions from multiple resources. PMID:18221520

  6. Towards understanding the first genome sequence of a crenarchaeon by genome annotation using clusters of orthologous groups of proteins (COGs)

    PubMed Central

    Natale, Darren A; Shankavaram, Uma T; Galperin, Michael Y; Wolf, Yuri I; Aravind, L; Koonin, Eugene V

    2000-01-01

    Background: Standard archival sequence databases have not been designed as tools for genome annotation and are far from being optimal for this purpose. We used the database of Clusters of Orthologous Groups of proteins (COGs) to reannotate the genomes of two archaea, Aeropyrum pernix, the first member of the Crenarchaea to be sequenced, and Pyrococcus abyssi. Results: A. pernix and P. abyssi proteins were assigned to COGs using the COGNITOR program; the results were verified on a case-by-case basis and augmented by additional database searches using the PSI-BLAST and TBLASTN programs. Functions were predicted for over 300 proteins from A. pernix, which could not be assigned a function using conventional methods with a conservative sequence similarity threshold, an approximately 50% increase compared to the original annotation. A. pernix shares most of the conserved core of proteins that were previously identified in the Euryarchaeota. Cluster analysis or distance matrix tree construction based on the co-occurrence of genomes in COGs showed that A. pernix forms a distinct group within the archaea, although grouping with the two species of Pyrococci, indicative of similar repertoires of conserved genes, was observed. No indication of a specific relationship between Crenarchaeota and eukaryotes was obtained in these analyses. Several proteins that are conserved in Euryarchaeota and most bacteria are unexpectedly missing in A. pernix, including the entire set of de novo purine biosynthesis enzymes, the GTPase FtsZ (a key component of the bacterial and euryarchaeal cell-division machinery), and the tRNA-specific pseudouridine synthase, previously considered universal. A. pernix is represented in 48 COGs that do not contain any euryarchaeal members. Many of these proteins are TCA cycle and electron transport chain enzymes, reflecting the aerobic lifestyle of A. pernix. Conclusions: Special-purpose databases organized on the basis of phylogenetic analysis and carefully

  7. The de novo genome assembly and annotation of a female domestic dromedary of North African origin.

    PubMed

    Fitak, Robert R; Mohandesan, Elmira; Corander, Jukka; Burger, Pamela A

    2016-01-01

    The single-humped dromedary (Camelus dromedarius) is the most numerous and widespread of domestic camel species and is a significant source of meat, milk, wool, transportation and sport for millions of people. Dromedaries are particularly well adapted to hot, desert conditions and harbour a variety of biological and physiological characteristics with evolutionary, economic and medical importance. To understand the genetic basis of these traits, an extensive resource of genomic variation is required. In this study, we assembled at 65× coverage, a 2.06 Gb draft genome of a female dromedary whose ancestry can be traced to an isolated population from the Canary Islands. We annotated 21,167 protein-coding genes and estimated ~33.7% of the genome to be repetitive. A comparison with the recently published draft genome of an Arabian dromedary resulted in 1.91 Gb of aligned sequence with a divergence of 0.095%. An evaluation of our genome with the reference revealed that our assembly contains more error-free bases (91.2%) and fewer scaffolding errors. We identified ~1.4 million single-nucleotide polymorphisms with a mean density of 0.71 × 10(-3) per base. An analysis of demographic history indicated that changes in effective population size corresponded with recent glacial epochs. Our de novo assembly provides a useful resource of genomic variation for future studies of the camel's adaptations to arid environments and economically important traits. Furthermore, these results suggest that draft genome assemblies constructed with only two differently sized sequencing libraries can be comparable to those sequenced using additional library sizes, highlighting that additional resources might be better placed in technologies alternative to short-read sequencing to physically anchor scaffolds to genome maps.

  8. Insights from the Genome Annotation of Elizabethkingia anophelis from the Malaria Vector Anopheles gambiae

    PubMed Central

    Pei, Dong; Rayl, Melanie; Yu, Wanqin; Steritz, Matthew; Faye, Ingrid; Xu, Jiannong

    2014-01-01

    Elizabethkingia anophelis is a dominant bacterial species in the gut ecosystem of the malaria vector mosquito Anopheles gambiae. We recently sequenced the genomes of two strains of E. anophelis, R26T and Ag1, isolated from different strains of A. gambiae. The two bacterial strains are identical with a few exceptions. Phylogenetically, Elizabethkingia is closer to Chryseobacterium and Riemerella than to Flavobacterium. In line with other Bacteroidetes known to utilize various polymers in their ecological niches, the E. anophelis genome contains numerous TonB dependent transporters with various substrate specificities. In addition, several genes belonging to the polysaccharide utilization system and the glycoside hydrolase family were identified that could potentially be of benefit for the mosquito carbohydrate metabolism. In agreement with previous reports of broad antibiotic resistance in E. anophelis, a large number of genes encoding efflux pumps and β-lactamases are present in the genome. The component genes of resistance-nodulation-division type efflux pumps were found to be syntenic and conserved in different taxa of Bacteroidetes. The bacterium also displays hemolytic activity and encodes several hemolysins that may participate in the digestion of erythrocytes in the mosquito gut. At the same time, the OxyR regulon and antioxidant genes could provide defense against the oxidative stress that is associated with blood digestion. The genome annotation and comparative genomic analysis revealed functional characteristics associated with the symbiotic relationship with the mosquito host. PMID:24842809

  9. Insights from the genome annotation of Elizabethkingia anophelis from the malaria vector Anopheles gambiae.

    PubMed

    Kukutla, Phanidhar; Lindberg, Bo G; Pei, Dong; Rayl, Melanie; Yu, Wanqin; Steritz, Matthew; Faye, Ingrid; Xu, Jiannong

    2014-01-01

    Elizabethkingia anophelis is a dominant bacterial species in the gut ecosystem of the malaria vector mosquito Anopheles gambiae. We recently sequenced the genomes of two strains of E. anophelis, R26T and Ag1, isolated from different strains of A. gambiae. The two bacterial strains are identical with a few exceptions. Phylogenetically, Elizabethkingia is closer to Chryseobacterium and Riemerella than to Flavobacterium. In line with other Bacteroidetes known to utilize various polymers in their ecological niches, the E. anophelis genome contains numerous TonB dependent transporters with various substrate specificities. In addition, several genes belonging to the polysaccharide utilization system and the glycoside hydrolase family were identified that could potentially be of benefit for the mosquito carbohydrate metabolism. In agreement with previous reports of broad antibiotic resistance in E. anophelis, a large number of genes encoding efflux pumps and β-lactamases are present in the genome. The component genes of resistance-nodulation-division type efflux pumps were found to be syntenic and conserved in different taxa of Bacteroidetes. The bacterium also displays hemolytic activity and encodes several hemolysins that may participate in the digestion of erythrocytes in the mosquito gut. At the same time, the OxyR regulon and antioxidant genes could provide defense against the oxidative stress that is associated with blood digestion. The genome annotation and comparative genomic analysis revealed functional characteristics associated with the symbiotic relationship with the mosquito host.

  10. The assembly and annotation of the complete Rufous-bellied thrush mitochondrial genome.

    PubMed

    Gomes de Sá, Pablo; Veras, Adonney; Fontana, Carla Suertegaray; Aleixo, Alexandre; Burlamaqui, Tibério; Mello, Claudio Vianna; de Vasconcelos, Ana Tereza Ribeiro; Prosdocimi, Francisco; Ramos, Rommel; Schneider, Maria; Silva, Artur

    2017-03-01

    Among known bird species, oscines are one of the few groups that produce complex vocalizations due to vocal learning. One of the most conspicuous oscine passerines in southeastern South America is the Rufous-bellied Thrush, Turdus rufiventris. The complete mitochondrial genome of this species was sequenced with the Illumina HiSeq platform (Illumina Inc., San Diego, CA), assembled using MITObim software and annotated by MITOS web server and Artemis software. This mitogenome contained 16 669 bases, organized as 13 protein-coding genes, 22 transfer RNAs, two ribosomal RNAs, and a control region (d-loop). The sequencing of the Rufous-bellied Thrush mitochondrial genome is of particular interest for better understanding of population genetics and phylogeography of the Turdidae family.

  11. Integrative annotation of variants from 1092 humans: application to cancer genomics.

    PubMed

    Khurana, Ekta; Fu, Yao; Colonna, Vincenza; Mu, Xinmeng Jasmine; Kang, Hyun Min; Lappalainen, Tuuli; Sboner, Andrea; Lochovsky, Lucas; Chen, Jieming; Harmanci, Arif; Das, Jishnu; Abyzov, Alexej; Balasubramanian, Suganthi; Beal, Kathryn; Chakravarty, Dimple; Challis, Daniel; Chen, Yuan; Clarke, Declan; Clarke, Laura; Cunningham, Fiona; Evani, Uday S; Flicek, Paul; Fragoza, Robert; Garrison, Erik; Gibbs, Richard; Gümüs, Zeynep H; Herrero, Javier; Kitabayashi, Naoki; Kong, Yong; Lage, Kasper; Liluashvili, Vaja; Lipkin, Steven M; MacArthur, Daniel G; Marth, Gabor; Muzny, Donna; Pers, Tune H; Ritchie, Graham R S; Rosenfeld, Jeffrey A; Sisu, Cristina; Wei, Xiaomu; Wilson, Michael; Xue, Yali; Yu, Fuli; Dermitzakis, Emmanouil T; Yu, Haiyuan; Rubin, Mark A; Tyler-Smith, Chris; Gerstein, Mark

    2013-10-04

    Interpreting variants, especially noncoding ones, in the increasing number of personal genomes is challenging. We used patterns of polymorphisms in functionally annotated regions in 1092 humans to identify deleterious variants; then we experimentally validated candidates. We analyzed both coding and noncoding regions, with the former corroborating the latter. We found regions particularly sensitive to mutations ("ultrasensitive") and variants that are disruptive because of mechanistic effects on transcription-factor binding (that is, "motif-breakers"). We also found variants in regions with higher network centrality tend to be deleterious. Insertions and deletions followed a similar pattern to single-nucleotide variants, with some notable exceptions (e.g., certain deletions and enhancers). On the basis of these patterns, we developed a computational tool (FunSeq), whose application to ~90 cancer genomes reveals nearly a hundred candidate noncoding drivers.

  12. Improved structural annotation of protein-coding genes in the Meloidogyne hapla genome using RNA-Seq.

    PubMed

    Guo, Yuelong; Bird, David McK; Nielsen, Dahlia M

    2014-01-01

    As high-throughput cDNA sequencing (RNA-Seq) is increasingly applied to hypothesis-driven biological studies, the prediction of protein coding genes based on these data are usurping strictly in silico approaches. Compared with computationally derived gene predictions, structural annotation is more accurate when based on biological evidence, particularly RNA-Seq data. Here, we refine the current genome annotation for the Meloidogyne hapla genome utilizing RNA-Seq data. Published structural annotation defines 14 420 protein-coding genes in the M. hapla genome. Of these, 25% (3751) were found to exhibit some incongruence with RNA-Seq data. Manual annotation enabled these discrepancies to be resolved. Our analysis revealed 544 new gene models that were missing from the prior annotation. Additionally, 1457 transcribed regions were newly identified on the ends of as-yet-unjoined contigs. We also searched for trans-spliced leaders, and based on RNA-Seq data, identified genes that appear to be trans-spliced. Four 22-bp trans-spliced leaders were identified using our pipeline, including the known trans-spliced leader, which is the M. hapla ortholog of SL1. In silico predictions of trans-splicing were validated by comparison with earlier results derived from an independent cDNA library constructed to capture trans-spliced transcripts. The new annotation, which we term HapPep5, is publically available at www.hapla.org.

  13. New local potential useful for genome annotation and 3D modeling

    SciTech Connect

    Chandonia, John-Marc; Cohen, Fred E.

    2003-07-17

    A new potential energy function representing the conformational preferences of sequentially local regions of a protein backbone is presented. This potential is derived from secondary structure probabilities such as those produced by neural network-based prediction methods. The potential is applied to the problem of remote homolog identification, in combination with a distance dependent inter-residue potential and position-based scoring matrices. This fold recognition jury is implemented in a Java application called JThread. These methods are benchmarked on several test sets, including one released entirely after development and parameterization of JThread. In benchmark tests to identify known folds structurally similar (but not identical) to the native structure of a sequence, JThread performs significantly better than PSI-BLAST, with 10 percent more structures correctly identified as the most likely structural match in a fold library, and 20 percent more structures correctly narrowed down to a set of five possible candidates. JThread also significantly improves the average sequence alignment accuracy, from 53 percent to 62 percent of residues correctly aligned. Reliable fold assignments and alignments are identified, making the method useful for genome annotation. JThread is applied to predicted open reading frames (ORFs) from the genomes of Mycoplasma genitalium and Drosophila melanogaster, identifying 20 new structural annotations in the former and 801 in the latter.

  14. Improving genome annotation of enterotoxigenic Escherichia coli TW10598 by a label-free quantitative MS/MS approach.

    PubMed

    Pettersen, Veronika Kuchařová; Steinsland, Hans; Wiker, Harald G

    2015-11-01

    The most commonly used genome annotation processes are to a great extent based on computational methods. However, those can only predict genes that have been described earlier or that have sequence signatures indicative of a gene function. Here, we report a synonymous proteogenomic approach for experimentally improving microbial genome annotation based on label-free quantitative MS/MS. The approach is exemplified by analysis of cell extracts from in vitro cultured enterotoxigenic Escherichia coli (ETEC) strain TW10598, as part of an effort to create a new reference ETEC genome sequence. The proteomic analysis yielded identification of 2060 proteins, out of which 312 proteins were originally described as hypothetical. For 84% of the identified proteins we have provided description of their relative quantitative levels, among others, for 20 abundantly expressed ETEC virulence factors. Proteogenomic mapping supported the existence of four protein-coding genes that had not been annotated, and led to correction of translation start positions of another nine. The addition of the proteomic analysis into TW10598 genome re-annotation project improved quality of the annotation, and provided experimental evidence for a significant portion of ETEC expressed proteome. Data are available via ProteomeXchange with identifier PXD002473 (http://proteomecentral.proteomexchange.org/dataset/PXD002473).

  15. DNA-binding protein prediction using plant specific support vector machines: validation and application of a new genome annotation tool.

    PubMed

    Motion, Graham B; Howden, Andrew J M; Huitema, Edgar; Jones, Susan

    2015-12-15

    There are currently 151 plants with draft genomes available but levels of functional annotation for putative protein products are low. Therefore, accurate computational predictions are essential to annotate genomes in the first instance, and to provide focus for the more costly and time consuming functional assays that follow. DNA-binding proteins are an important class of proteins that require annotation, but current computational methods are not applicable for genome wide predictions in plant species. Here, we explore the use of species and lineage specific models for the prediction of DNA-binding proteins in plants. We show that a species specific support vector machine model based on Arabidopsis sequence data is more accurate (accuracy 81%) than a generic model (74%), and based on this we develop a plant specific model for predicting DNA-binding proteins. We apply this model to the tomato proteome and demonstrate its ability to perform accurate high-throughput prediction of DNA-binding proteins. In doing so, we have annotated 36 currently uncharacterised proteins by assigning a putative DNA-binding function. Our model is publically available and we propose it be used in combination with existing tools to help increase annotation levels of DNA-binding proteins encoded in plant genomes.

  16. Decoding the oak genome: public release of sequence data, assembly, annotation and publication strategies.

    PubMed

    Plomion, Christophe; Aury, Jean-Marc; Amselem, Joëlle; Alaeitabar, Tina; Barbe, Valérie; Belser, Caroline; Bergès, Hélène; Bodénès, Catherine; Boudet, Nathalie; Boury, Christophe; Canaguier, Aurélie; Couloux, Arnaud; Da Silva, Corinne; Duplessis, Sébastien; Ehrenmann, François; Estrada-Mairey, Barbara; Fouteau, Stéphanie; Francillonne, Nicolas; Gaspin, Christine; Guichard, Cécile; Klopp, Christophe; Labadie, Karine; Lalanne, Céline; Le Clainche, Isabelle; Leplé, Jean-Charles; Le Provost, Grégoire; Leroy, Thibault; Lesur, Isabelle; Martin, Francis; Mercier, Jonathan; Michotey, Célia; Murat, Florent; Salin, Franck; Steinbach, Delphine; Faivre-Rampant, Patricia; Wincker, Patrick; Salse, Jérôme; Quesneville, Hadi; Kremer, Antoine

    2016-01-01

    The 1.5 Gbp/2C genome of pedunculate oak (Quercus robur) has been sequenced. A strategy was established for dealing with the challenges imposed by the sequencing of such a large, complex and highly heterozygous genome by a whole-genome shotgun (WGS) approach, without the use of costly and time-consuming methods, such as fosmid or BAC clone-based hierarchical sequencing methods. The sequencing strategy combined short and long reads. Over 49 million reads provided by Roche 454 GS-FLX technology were assembled into contigs and combined with shorter Illumina sequence reads from paired-end and mate-pair libraries of different insert sizes, to build scaffolds. Errors were corrected and gaps filled with Illumina paired-end reads and contaminants detected, resulting in a total of 17,910 scaffolds (>2 kb) corresponding to 1.34 Gb. Fifty per cent of the assembly was accounted for by 1468 scaffolds (N50 of 260 kb). Initial comparison with the phylogenetically related Prunus persica gene model indicated that genes for 84.6% of the proteins present in peach (mean protein coverage of 90.5%) were present in our assembly. The second and third steps in this project are genome annotation and the assignment of scaffolds to the oak genetic linkage map. In accordance with the Bermuda and Fort Lauderdale agreements and the more recent Toronto Statement, the oak genome data have been released into public sequence repositories in advance of publication. In this presubmission paper, the oak genome consortium describes its principal lines of work and future directions for analyses of the nature, function and evolution of the oak genome.

  17. Functional modelling of an equine bronchoalveolar lavage fluid proteome provides experimental confirmation and functional annotation of equine genome sequences.

    PubMed

    Bright, L A; Mujahid, N; Nanduri, B; McCarthy, F M; Costa, L R R; Burgess, S C; Swiderski, C E

    2011-08-01

    The equine genome sequence enables the use of high-throughput genomic technologies in equine research, but accurate identification of expressed gene products and interpreting their biological relevance require additional structural and functional genome annotation. Here, we employ the equine genome sequence to identify predicted and known proteins using proteomics and model these proteins into biological pathways, identifying 582 proteins in normal cell-free equine bronchoalveolar lavage fluid (BALF). We improved structural and functional annotation by directly confirming the in vivo expression of 558 (96%) proteins, which were computationally predicted previously, and adding Gene Ontology (GO) annotations for 174 proteins, 108 of which lacked functional annotation. Bronchoalveolar lavage is commonly used to investigate equine respiratory disease, leading us to model the associated proteome and its biological functions. Modelling of protein functions using Ingenuity Pathway Analysis identified carbohydrate metabolism, cell-to-cell signalling, cellular function, inflammatory response, organ morphology, lipid metabolism and cellular movement as key biological processes in normal equine BALF. Comparative modelling of protein functions in normal cell-free bronchoalveolar lavage proteomes from horse, human, and mouse, performed by grouping GO terms sharing common ancestor terms, confirms conservation of functions across species. Ninety-one of 92 human GO categories and 105 of 109 mouse GO categories were conserved in the horse. Our approach confirms the utility of the equine genome sequence to characterize protein networks without antibodies or mRNA quantification, highlights the need for continued structural and functional annotation of the equine genome and provides a framework for equine researchers to aid in the annotation effort.

  18. Integrative Tissue-Specific Functional Annotations in the Human Genome Provide Novel Insights on Many Complex Traits and Improve Signal Prioritization in Genome Wide Association Studies.

    PubMed

    Lu, Qiongshi; Powles, Ryan Lee; Wang, Qian; He, Beixin Julie; Zhao, Hongyu

    2016-04-01

    Extensive efforts have been made to understand genomic function through both experimental and computational approaches, yet proper annotation still remains challenging, especially in non-coding regions. In this manuscript, we introduce GenoSkyline, an unsupervised learning framework to predict tissue-specific functional regions through integrating high-throughput epigenetic annotations. GenoSkyline successfully identified a variety of non-coding regulatory machinery including enhancers, regulatory miRNA, and hypomethylated transposable elements in extensive case studies. Integrative analysis of GenoSkyline annotations and results from genome-wide association studies (GWAS) led to novel biological insights on the etiologies of a number of human complex traits. We also explored using tissue-specific functional annotations to prioritize GWAS signals and predict relevant tissue types for each risk locus. Brain and blood-specific annotations led to better prioritization performance for schizophrenia than standard GWAS p-values and non-tissue-specific annotations. As for coronary artery disease, heart-specific functional regions was highly enriched of GWAS signals, but previously identified risk loci were found to be most functional in other tissues, suggesting a substantial proportion of still undetected heart-related loci. In summary, GenoSkyline annotations can guide genetic studies at multiple resolutions and provide valuable insights in understanding complex diseases. GenoSkyline is available at http://genocanyon.med.yale.edu/GenoSkyline.

  19. Integrative Tissue-Specific Functional Annotations in the Human Genome Provide Novel Insights on Many Complex Traits and Improve Signal Prioritization in Genome Wide Association Studies

    PubMed Central

    Wang, Qian; He, Beixin Julie; Zhao, Hongyu

    2016-01-01

    Extensive efforts have been made to understand genomic function through both experimental and computational approaches, yet proper annotation still remains challenging, especially in non-coding regions. In this manuscript, we introduce GenoSkyline, an unsupervised learning framework to predict tissue-specific functional regions through integrating high-throughput epigenetic annotations. GenoSkyline successfully identified a variety of non-coding regulatory machinery including enhancers, regulatory miRNA, and hypomethylated transposable elements in extensive case studies. Integrative analysis of GenoSkyline annotations and results from genome-wide association studies (GWAS) led to novel biological insights on the etiologies of a number of human complex traits. We also explored using tissue-specific functional annotations to prioritize GWAS signals and predict relevant tissue types for each risk locus. Brain and blood-specific annotations led to better prioritization performance for schizophrenia than standard GWAS p-values and non-tissue-specific annotations. As for coronary artery disease, heart-specific functional regions was highly enriched of GWAS signals, but previously identified risk loci were found to be most functional in other tissues, suggesting a substantial proportion of still undetected heart-related loci. In summary, GenoSkyline annotations can guide genetic studies at multiple resolutions and provide valuable insights in understanding complex diseases. GenoSkyline is available at http://genocanyon.med.yale.edu/GenoSkyline. PMID:27058395

  20. Plant Gene and Alternatively Spliced Variant Annotator. A plant genome annotation pipeline for rice gene and alternatively spliced variant identification with cross-species expressed sequence tag conservation from seven plant species.

    PubMed

    Chen, Feng-Chi; Wang, Sheng-Shun; Chaw, Shu-Miaw; Huang, Yao-Ting; Chuang, Trees-Juen

    2007-03-01

    The completion of the rice (Oryza sativa) genome draft has brought unprecedented opportunities for genomic studies of the world's most important food crop. Previous rice gene annotations have relied mainly on ab initio methods, which usually yield a high rate of false-positive predictions and give only limited information regarding alternative splicing in rice genes. Comparative approaches based on expressed sequence tags (ESTs) can compensate for the drawbacks of ab initio methods because they can simultaneously identify experimental data-supported genes and alternatively spliced transcripts. Furthermore, cross-species EST information can be used to not only offset the insufficiency of same-species ESTs but also derive evolutionary implications. In this study, we used ESTs from seven plant species, rice, wheat (Triticum aestivum), maize (Zea mays), barley (Hordeum vulgare), sorghum (Sorghum bicolor), soybean (Glycine max), and Arabidopsis (Arabidopsis thaliana), to annotate the rice genome. We developed a plant genome annotation pipeline, Plant Gene and Alternatively Spliced Variant Annotator (PGAA). Using this approach, we identified 852 genes (931 isoforms) not annotated in other widely used databases (i.e. the Institute for Genomic Research, National Center for Biotechnology Information, and Rice Annotation Project) and found 87% of them supported by both rice and nonrice EST evidence. PGAA also identified more than 44,000 alternatively spliced events, of which approximately 20% are not observed in the other three annotations. These novel annotations represent rich opportunities for rice genome research, because the functions of most of our annotated genes are currently unknown. Also, in the PGAA annotation, the isoforms with non-rice-EST-supported exons are significantly enriched in transporter activity but significantly underrepresented in transcription regulator activity. We have also identified potential lineage-specific and conserved isoforms, which are

  1. Discovery and annotation of small proteins using genomics, proteomics, and computational approaches

    SciTech Connect

    Yang, Xiaohan; Tschaplinski, Timothy J; Hurst, Gregory {Greg} B; Jawdy, Sara; Abraham, Paul E; Lankford, Patricia K; Adams, Rachel M; Shah, Manesh B; Hettich, Robert {Bob} L; Kalluri, Udaya C; Gunter, Lee E; Pennacchio, Christa; Tuskan, Gerald A

    2011-01-01

    Small proteins (10 200 amino acids (AA) in length) encoded by short open reading frames (sORF) play important regulatory roles in various biological processes, including tumor progression, stress response, flowering and hormone signaling. However, ab initio discovery of small proteins has been relatively overlooked. Recent advances in deep transcriptome sequencing make it possible to efficiently identify sORFs at the genome level. In this study, we obtained ~2.6 million expressed sequence tag (EST) reads from Populus deltoides leaf transcriptome and reconstructed full-length transcripts from the EST sequences. We identified an initial set of 12,852 sORFs encoding proteins of 10 200 AA in length. Three computational approaches were then used to enrich for bona fide protein-coding sORFs from the initial sORF set: 1) coding-potential prediction, 2) evolutionary conservation between P. deltoides and other plant species, and 3) gene family clustering within P. deltoides. As a result, a high-confidence sORF candidate set containing 1,469 genes was obtained. Analysis of the protein domains, non-protein-coding RNA motifs, sequence length distribution, and protein mass spectrometry data supported this high-confidence sORF set. In the high-confidence sORF candidate set, known protein domains were identified in 1,282 genes (higher-confidence sORF candidate set), out of which 611 genes, designated as highest-confidence candidate sORF set, were also supported by proteomics data. This study not only demonstrates that there are potential sORF candidates to be annotated in sequenced genomes, but also presents an efficient strategy for discovery of sORFs in species with no genome annotation yet available.

  2. Prokaryotic Virus Orthologous Groups (pVOGs): a resource for comparative genomics and protein family annotation

    PubMed Central

    Grazziotin, Ana Laura; Koonin, Eugene V.; Kristensen, David M.

    2017-01-01

    Viruses are the most abundant and diverse biological entities on earth, and while most of this diversity remains completely unexplored, advances in genome sequencing have provided unprecedented glimpses into the virosphere. The Prokaryotic Virus Orthologous Groups (pVOGs, formerly called Phage Orthologous Groups, POGs) resource has aided in this task over the past decade by using automated methods to keep pace with the rapid increase in genomic data. The uses of pVOGs include functional annotation of viral proteins, identification of genes and viruses in uncharacterized DNA samples, phylogenetic analysis, large-scale comparative genomics projects, and more. The pVOGs database represents a comprehensive set of orthologous gene families shared across multiple complete genomes of viruses that infect bacterial or archaeal hosts (viruses of eukaryotes will be added at a future date). The pVOGs are constructed within the Clusters of Orthologous Groups (COGs) framework that is widely used for orthology identification in prokaryotes. Since the previous release of the POGs, the size has tripled to nearly 3000 genomes and 300 000 proteins, and the number of conserved orthologous groups doubled to 9518. User-friendly webpages are available, including multiple sequence alignments and HMM profiles for each VOG. These changes provide major improvements to the pVOGs database, at a time of rapid advances in virus genomics. The pVOGs database is hosted jointly at the University of Iowa at http://dmk-brain.ecn.uiowa.edu/pVOGs and the NCBI at ftp://ftp.ncbi.nlm.nih.gov/pub/kristensen/pVOGs/home.html. PMID:27789703

  3. Semantic Assembly and Annotation of Draft RNAseq Transcripts without a Reference Genome

    PubMed Central

    Ptitsyn, Andrey; Temanni, Ramzi; Bouchard, Christelle; Anderson, Peter A. V.

    2015-01-01

    Transcriptomes are one of the first sources of high-throughput genomic data that have benefitted from the introduction of Next-Gen Sequencing. As sequencing technology becomes more accessible, transcriptome sequencing is applicable to multiple organisms for which genome sequences are unavailable. Currently all methods for de novo assembly are based on the concept of matching the nucleotide context overlapping between short fragments-reads. However, even short reads may still contain biologically relevant information which can be used as hints in guiding the assembly process. We propose a computational workflow for the reconstruction and functional annotation of expressed gene transcripts that does not require a reference genome sequence and can be tolerant to low coverage, high error rates and other issues that often lead to poor results of de novo assembly in studies of non-model organisms. We start with either raw sequences or the output of a context-based de novo transcriptome assembly. Instead of mapping reads to a reference genome or creating a completely unsupervised clustering of reads, we assemble the unknown transcriptome using nearest homologs from a public database as seeds. We consider even distant relations, indirectly linking protein-coding fragments to entire gene families in multiple distantly related genomes. The intended application of the proposed method is an additional step of semantic (based on relations between protein-coding fragments) scaffolding following traditional (i.e. based on sequence overlap) de novo assembly. The method we developed was effective in analysis of the jellyfish Cyanea capillata transcriptome and may be applicable in other studies of gene expression in species lacking a high quality reference genome sequence. Our algorithms are implemented in C and designed for parallel computation using a high-performance computer. The software is available free of charge via an open source license. PMID:26393794

  4. Semantic Assembly and Annotation of Draft RNAseq Transcripts without a Reference Genome.

    PubMed

    Ptitsyn, Andrey; Temanni, Ramzi; Bouchard, Christelle; Anderson, Peter A V

    2015-01-01

    Transcriptomes are one of the first sources of high-throughput genomic data that have benefitted from the introduction of Next-Gen Sequencing. As sequencing technology becomes more accessible, transcriptome sequencing is applicable to multiple organisms for which genome sequences are unavailable. Currently all methods for de novo assembly are based on the concept of matching the nucleotide context overlapping between short fragments-reads. However, even short reads may still contain biologically relevant information which can be used as hints in guiding the assembly process. We propose a computational workflow for the reconstruction and functional annotation of expressed gene transcripts that does not require a reference genome sequence and can be tolerant to low coverage, high error rates and other issues that often lead to poor results of de novo assembly in studies of non-model organisms. We start with either raw sequences or the output of a context-based de novo transcriptome assembly. Instead of mapping reads to a reference genome or creating a completely unsupervised clustering of reads, we assemble the unknown transcriptome using nearest homologs from a public database as seeds. We consider even distant relations, indirectly linking protein-coding fragments to entire gene families in multiple distantly related genomes. The intended application of the proposed method is an additional step of semantic (based on relations between protein-coding fragments) scaffolding following traditional (i.e. based on sequence overlap) de novo assembly. The method we developed was effective in analysis of the jellyfish Cyanea capillata transcriptome and may be applicable in other studies of gene expression in species lacking a high quality reference genome sequence. Our algorithms are implemented in C and designed for parallel computation using a high-performance computer. The software is available free of charge via an open source license.

  5. Genome Wide Re-Annotation of Caldicellulosiruptor saccharolyticus with New Insights into Genes Involved in Biomass Degradation and Hydrogen Production

    PubMed Central

    Chowdhary, Nupoor; Selvaraj, Ashok; KrishnaKumaar, Lakshmi; Kumar, Gopal Ramesh

    2015-01-01

    Caldicellulosiruptor saccharolyticus has proven itself to be an excellent candidate for biological hydrogen (H2) production, but still it has major drawbacks like sensitivity to high osmotic pressure and low volumetric H2 productivity, which should be considered before it can be used industrially. A whole genome re-annotation work has been carried out as an attempt to update the incomplete genome information that causes gap in the knowledge especially in the area of metabolic engineering, to improve the H2 producing capabilities of C. saccharolyticus. Whole genome re-annotation was performed through manual means for 2,682 Coding Sequences (CDSs). Bioinformatics tools based on sequence similarity, motif search, phylogenetic analysis and fold recognition were employed for re-annotation. Our methodology could successfully add functions for 409 hypothetical proteins (HPs), 46 proteins previously annotated as putative and assigned more accurate functions for the known protein sequences. Homology based gene annotation has been used as a standard method for assigning function to novel proteins, but over the past few years many non-homology based methods such as genomic context approaches for protein function prediction have been developed. Using non-homology based functional prediction methods, we were able to assign cellular processes or physical complexes for 249 hypothetical sequences. Our re-annotation pipeline highlights the addition of 231 new CDSs generated from MicroScope Platform, to the original genome with functional prediction for 49 of them. The re-annotation of HPs and new CDSs is stored in the relational database that is available on the MicroScope web-based platform. In parallel, a comparative genome analyses were performed among the members of genus Caldicellulosiruptor to understand the function and evolutionary processes. Further, with results from integrated re-annotation studies (homology and genomic context approach), we strongly suggest that Csac

  6. Genome Wide Re-Annotation of Caldicellulosiruptor saccharolyticus with New Insights into Genes Involved in Biomass Degradation and Hydrogen Production.

    PubMed

    Chowdhary, Nupoor; Selvaraj, Ashok; KrishnaKumaar, Lakshmi; Kumar, Gopal Ramesh

    2015-01-01

    Caldicellulosiruptor saccharolyticus has proven itself to be an excellent candidate for biological hydrogen (H2) production, but still it has major drawbacks like sensitivity to high osmotic pressure and low volumetric H2 productivity, which should be considered before it can be used industrially. A whole genome re-annotation work has been carried out as an attempt to update the incomplete genome information that causes gap in the knowledge especially in the area of metabolic engineering, to improve the H2 producing capabilities of C. saccharolyticus. Whole genome re-annotation was performed through manual means for 2,682 Coding Sequences (CDSs). Bioinformatics tools based on sequence similarity, motif search, phylogenetic analysis and fold recognition were employed for re-annotation. Our methodology could successfully add functions for 409 hypothetical proteins (HPs), 46 proteins previously annotated as putative and assigned more accurate functions for the known protein sequences. Homology based gene annotation has been used as a standard method for assigning function to novel proteins, but over the past few years many non-homology based methods such as genomic context approaches for protein function prediction have been developed. Using non-homology based functional prediction methods, we were able to assign cellular processes or physical complexes for 249 hypothetical sequences. Our re-annotation pipeline highlights the addition of 231 new CDSs generated from MicroScope Platform, to the original genome with functional prediction for 49 of them. The re-annotation of HPs and new CDSs is stored in the relational database that is available on the MicroScope web-based platform. In parallel, a comparative genome analyses were performed among the members of genus Caldicellulosiruptor to understand the function and evolutionary processes. Further, with results from integrated re-annotation studies (homology and genomic context approach), we strongly suggest that Csac

  7. MAKER-P: A Tool Kit for the Rapid Creation, Management, and Quality Control of Plant Genome Annotations1[W][OPEN

    PubMed Central

    Campbell, Michael S.; Law, MeiYee; Holt, Carson; Stein, Joshua C.; Moghe, Gaurav D.; Hufnagel, David E.; Lei, Jikai; Achawanantakun, Rujira; Jiao, Dian; Lawrence, Carolyn J.; Ware, Doreen; Shiu, Shin-Han; Childs, Kevin L.; Sun, Yanni; Jiang, Ning; Yandell, Mark

    2014-01-01

    We have optimized and extended the widely used annotation engine MAKER in order to better support plant genome annotation efforts. New features include better parallelization for large repeat-rich plant genomes, noncoding RNA annotation capabilities, and support for pseudogene identification. We have benchmarked the resulting software tool kit, MAKER-P, using the Arabidopsis (Arabidopsis thaliana) and maize (Zea mays) genomes. Here, we demonstrate the ability of the MAKER-P tool kit to automatically update, extend, and revise the Arabidopsis annotations in light of newly available data and to annotate pseudogenes and noncoding RNAs absent from The Arabidopsis Informatics Resource 10 build. Our results demonstrate that MAKER-P can be used to manage and improve the annotations of even Arabidopsis, perhaps the best-annotated plant genome. We have also installed and benchmarked MAKER-P on the Texas Advanced Computing Center. We show that this public resource can de novo annotate the entire Arabidopsis and maize genomes in less than 3 h and produce annotations of comparable quality to those of the current The Arabidopsis Information Resource 10 and maize V2 annotation builds. PMID:24306534

  8. The Aspergillus Genome Database: multispecies curation and incorporation of RNA-Seq data to improve structural gene annotations

    PubMed Central

    Cerqueira, Gustavo C.; Arnaud, Martha B.; Inglis, Diane O.; Skrzypek, Marek S.; Binkley, Gail; Simison, Matt; Miyasato, Stuart R.; Binkley, Jonathan; Orvis, Joshua; Shah, Prachi; Wymore, Farrell; Sherlock, Gavin; Wortman, Jennifer R.

    2014-01-01

    The Aspergillus Genome Database (AspGD; http://www.aspgd.org) is a freely available web-based resource that was designed for Aspergillus researchers and is also a valuable source of information for the entire fungal research community. In addition to being a repository and central point of access to genome, transcriptome and polymorphism data, AspGD hosts a comprehensive comparative genomics toolbox that facilitates the exploration of precomputed orthologs among the 20 currently available Aspergillus genomes. AspGD curators perform gene product annotation based on review of the literature for four key Aspergillus species: Aspergillus nidulans, Aspergillus oryzae, Aspergillus fumigatus and Aspergillus niger. We have iteratively improved the structural annotation of Aspergillus genomes through the analysis of publicly available transcription data, mostly expressed sequenced tags, as described in a previous NAR Database article (Arnaud et al. 2012). In this update, we report substantive structural annotation improvements for A. nidulans, A. oryzae and A. fumigatus genomes based on recently available RNA-Seq data. Over 26 000 loci were updated across these species; although those primarily comprise the addition and extension of untranslated regions (UTRs), the new analysis also enabled over 1000 modifications affecting the coding sequence of genes in each target genome. PMID:24194595

  9. Systematic genome-wide annotation of spliceosomal proteins reveals differential gene family expansion

    PubMed Central

    Barbosa-Morais, Nuno L.; Carmo-Fonseca, Maria; Aparício, Samuel

    2006-01-01

    Although more than 200 human spliceosomal and splicing-associated proteins are known, the evolution of the splicing machinery has not been studied extensively. The recent near-complete sequencing and annotation of distant vertebrate and chordate genomes provides the opportunity for an exhaustive comparative analysis of splicing factors across eukaryotes. We describe here our semiautomated computational pipeline to identify and annotate splicing factors in representative species of eukaryotes. We focused on protein families whose role in splicing is confirmed by experimental evidence. We visually inspected 1894 proteins and manually curated 224 of them. Our analysis shows a general conservation of the core spliceosomal proteins across the eukaryotic lineage, contrasting with selective expansions of protein families known to play a role in the regulation of splicing, most notably of SR proteins in metazoans and of heterogeneous nuclear ribonucleoproteins (hnRNP) in vertebrates. We also observed vertebrate-specific expansion of the CLK and SRPK kinases (which phosphorylate SR proteins), and the CUG-BP/CELF family of splicing regulators. Furthermore, we report several intronless genes amongst splicing proteins in mammals, suggesting that retrotransposition contributed to the complexity of the mammalian splicing apparatus. PMID:16344558

  10. Rapid Virulence Annotation (RVA): identification of virulence factors using a bacterial genome library and multiple invertebrate hosts.

    PubMed

    Waterfield, Nicholas R; Sanchez-Contreras, Maria; Eleftherianos, Ioannis; Dowling, Andrea; Yang, Guowei; Wilkinson, Paul; Parkhill, Julian; Thomson, Nicholas; Reynolds, Stuart E; Bode, Helge B; Dorus, Steven; Ffrench-Constant, Richard H

    2008-10-14

    Current sequence databases now contain numerous whole genome sequences of pathogenic bacteria. However, many of the predicted genes lack any functional annotation. We describe an assumption-free approach, Rapid Virulence Annotation (RVA), for the high-throughput parallel screening of genomic libraries against four different taxa: insects, nematodes, amoeba, and mammalian macrophages. These hosts represent different aspects of both the vertebrate and invertebrate immune system. Here, we apply RVA to the emerging human pathogen Photorhabdus asymbiotica using "gain of toxicity" assays of recombinant Escherichia coli clones. We describe a wealth of potential virulence loci and attribute biological function to several putative genomic islands, which may then be further characterized using conventional molecular techniques. The application of RVA to other pathogen genomes promises to ascribe biological function to otherwise uncharacterized virulence genes.

  11. MADAP, a flexible clustering tool for the interpretation of one-dimensional genome annotation data.

    PubMed

    Schmid, Christoph D; Sengstag, Thierry; Bucher, Philipp; Delorenzi, Mauro

    2007-07-01

    A recurring task in the analysis of mass genome annotation data from high-throughput technologies is the identification of peaks or clusters in a noisy signal profile. Examples of such applications are the definition of promoters on the basis of transcription start site profiles, the mapping of transcription factor binding sites based on ChIP-chip data and the identification of quantitative trait loci (QTL) from whole genome SNP profiles. Input to such an analysis is a set of genome coordinates associated with counts or intensities. The output consists of a discrete number of peaks with respective volumes, extensions and center positions. We have developed for this purpose a flexible one-dimensional clustering tool, called MADAP, which we make available as a web server and as standalone program. A set of parameters enables the user to customize the procedure to a specific problem. The web server, which returns results in textual and graphical form, is useful for small to medium-scale applications, as well as for evaluation and parameter tuning in view of large-scale applications, requiring a local installation. The program written in C++ can be freely downloaded from ftp://ftp.epd.unil.ch/pub/software/unix/madap. The MADAP web server can be accessed at http://www.isrec.isb-sib.ch/madap/.

  12. Comparative analysis of grapevine whole-genome gene predictions, functional annotation, categorization and integration of the predicted gene sequences

    PubMed Central

    2012-01-01

    Background The first draft assembly and gene prediction of the grapevine genome (8X base coverage) was made available to the scientific community in 2007, and functional annotation was developed on this gene prediction. Since then additional Sanger sequences were added to the 8X sequences pool and a new version of the genomic sequence with superior base coverage (12X) was produced. Results In order to more efficiently annotate the function of the genes predicted in the new assembly, it is important to build on as much of the previous work as possible, by transferring 8X annotation of the genome to the 12X version. The 8X and 12X assemblies and gene predictions of the grapevine genome were compared to answer the question, “Can we uniquely map 8X predicted genes to 12X predicted genes?” The results show that while the assemblies and gene structure predictions are too different to make a complete mapping between them, most genes (18,725) showed a one-to-one relationship between 8X predicted genes and the last version of 12X predicted genes. In addition, reshuffled genomic sequence structures appeared. These highlight regions of the genome where the gene predictions need to be taken with caution. Based on the new grapevine gene functional annotation and in-depth functional categorization, twenty eight new molecular networks have been created for VitisNet while the existing networks were updated. Conclusions The outcomes of this study provide a functional annotation of the 12X genes, an update of VitisNet, the system of the grapevine molecular networks, and a new functional categorization of genes. Data are available at the VitisNet website (http://www.sdstate.edu/ps/research/vitis/pathways.cfm). PMID:22554261

  13. AnnoTALE: bioinformatics tools for identification, annotation, and nomenclature of TALEs from Xanthomonas genomic sequences

    PubMed Central

    Grau, Jan; Reschke, Maik; Erkes, Annett; Streubel, Jana; Morgan, Richard D.; Wilson, Geoffrey G.; Koebnik, Ralf; Boch, Jens

    2016-01-01

    Transcription activator-like effectors (TALEs) are virulence factors, produced by the bacterial plant-pathogen Xanthomonas, that function as gene activators inside plant cells. Although the contribution of individual TALEs to infectivity has been shown, the specific roles of most TALEs, and the overall TALE diversity in Xanthomonas spp. is not known. TALEs possess a highly repetitive DNA-binding domain, which is notoriously difficult to sequence. Here, we describe an improved method for characterizing TALE genes by the use of PacBio sequencing. We present ‘AnnoTALE’, a suite of applications for the analysis and annotation of TALE genes from Xanthomonas genomes, and for grouping similar TALEs into classes. Based on these classes, we propose a unified nomenclature for Xanthomonas TALEs that reveals similarities pointing to related functionalities. This new classification enables us to compare related TALEs and to identify base substitutions responsible for the evolution of TALE specificities. PMID:26876161

  14. Quantitative frame analysis and the annotation of GC-rich (and other) prokaryotic genomes. An application to Anaeromyxobacter dehalogenans

    PubMed Central

    Oden, Steve; Brocchieri, Luciano

    2015-01-01

    Motivation: Graphical representations of contrasts in GC usage among codon frame positions (frame analysis) provide evidence of genes missing from the annotations of prokaryotic genomes of high GC content but the qualitative approach of visual frame analysis prevents its applicability on a genomic scale. Results: We developed two quantitative methods for the identification and statistical characterization in sequence regions of three-base periodicity (hits) associated with open reading frame structures. The methods were implemented in the N-Profile Analysis Computational Tool (NPACT), which highlights in graphical representations inconsistencies between newly identified ORFs and pre-existing annotations of coding-regions. We applied the NPACT procedures to two recently annotated strains of the deltaproteobacterium Anaeromyxobacter dehalogenans, identifying in both genomes numerous conserved ORFs not included in the published annotation of coding regions. Availability and implementation: NPACT is available as a web-based service and for download at http://genome.ufl.edu/npact. Contact: lucianob@ufl.edu Supplementary information: Supplementary data are available at Bioinformatics online. PMID:26048600

  15. Corrected Genome Annotations Reveal Gene Loss and Antibiotic Resistance as Drivers in the Fitness Evolution of Salmonella enterica Serovar Typhimurium.

    PubMed

    Paul, Sandip; Sokurenko, Evgeni V; Chattopadhyay, Sujay

    2016-12-01

    Horizontal acquisition of novel chromosomal genes is considered to be a key process in the evolution of bacterial pathogens. However, the identification of gene presence or absence could be hindered by the inconsistencies in bacterial genome annotations. Here, we performed a cross-annotation of omnipresent core and mosaic accessory genes in the chromosome of Salmonella enterica serovar Typhimurium across a total of 20 fully assembled genomes deposited into GenBank. Cross-annotation resulted in a 32% increase in the number of core genes and a 3-fold decrease in the number of genes identified as mosaic genes (i.e., genes present in some strains only) by the original annotation. Of the remaining noncore genes, the vast majority were prophage genes, and 255 of the nonphage genes were actually of core origin but lost in some strains upon the emergence of the S Typhimurium serovar, suggesting that the chromosomal portion of the S Typhimurium genome acquired a very limited number of novel genes other than prophages. Only horizontally acquired nonphage genes related to bacterial fitness or virulence were found in four recently sequenced isolates, all located on three different genomic islands that harbor multidrug resistance determinants. Thus, the extensive use of antimicrobials could be the main selection force behind the new fitness gene acquisition and the emergence of novel Salmonella pathotypes.

  16. A kingdom-specific protein domain HMM library for improved annotation of fungal genomes

    PubMed Central

    Alam, Intikhab; Hubbard, Simon J; Oliver, Stephen G; Rattray, Magnus

    2007-01-01

    Background Pfam is a general-purpose database of protein domain alignments and profile Hidden Markov Models (HMMs), which is very popular for the annotation of sequence data produced by genome sequencing projects. Pfam provides models that are often very general in terms of the taxa that they cover and it has previously been suggested that such general models may lack some of the specificity or selectivity that would be provided by kingdom-specific models. Results Here we present a general approach to create domain libraries of HMMs for sub-taxa of a kingdom. Taking fungal species as an example, we construct a domain library of HMMs (called Fungal Pfam or FPfam) using sequences from 30 genomes, consisting of 24 species from the ascomycetes group and two basidiomycetes, Ustilago maydis, a fungal pathogen of maize, and the white rot fungus Phanerochaete chrysosporium. In addition, we include the Microsporidion Encephalitozoon cuniculi, an obligate intracellular parasite, and two non-fungal species, the oomycetes Phytophthora sojae and Phytophthora ramorum, both plant pathogens. We evaluate the performance in terms of coverage against the original 30 genomes used in training FPfam and against five more recently sequenced fungal genomes that can be considered as an independent test set. We show that kingdom-specific models such as FPfam can find instances of both novel and well characterized domains, increases overall coverage and detects more domains per sequence with typically higher bitscores than Pfam for the same domain families. An evaluation of the effect of changing E-values on the coverage shows that the performance of FPfam is consistent over the range of E-values applied. Conclusion Kingdom-specific models are shown to provide improved coverage. However, as the models become more specific, some sequences found by Pfam may be missed by the models in FPfam and some of the families represented in the test set are not present in FPfam. Therefore, we recommend

  17. Blobology: exploring raw genome data for contaminants, symbionts and parasites using taxon-annotated GC-coverage plots

    PubMed Central

    Kumar, Sujai; Jones, Martin; Koutsovoulos, Georgios; Clarke, Michael; Blaxter, Mark

    2013-01-01

    Generating the raw data for a de novo genome assembly project for a target eukaryotic species is relatively easy. This democratization of access to large-scale data has allowed many research teams to plan to assemble the genomes of non-model organisms. These new genome targets are very different from the traditional, inbred, laboratory-reared model organisms. They are often small, and cannot be isolated free of their environment – whether ingested food, the surrounding host organism of parasites, or commensal and symbiotic organisms attached to or within the individuals sampled. Preparation of pure DNA originating from a single species can be technically impossible, but assembly of mixed-organism DNA can be difficult, as most genome assemblers perform poorly when faced with multiple genomes in different stoichiometries. This class of problem is common in metagenomic datasets that deliberately try to capture all the genomes present in an environment, but replicon assembly is not often the goal of such programs. Here we present an approach to extracting, from mixed DNA sequence data, subsets that correspond to single species’ genomes and thus improving genome assembly. We use both numerical (proportion of GC bases and read coverage) and biological (best-matching sequence in annotated databases) indicators to aid partitioning of draft assembly contigs, and the reads that contribute to those contigs, into distinct bins that can then be subjected to rigorous, optimized assembly, through the use of taxon-annotated GC-coverage plots (TAGC plots). We also present Blobsplorer, a tool that aids exploration and selection of subsets from TAGC-annotated data. Partitioning the data in this way can rescue poorly assembled genomes, and reveal unexpected symbionts and commensals in eukaryotic genome projects. The TAGC plot pipeline script is available from https://github.com/blaxterlab/blobology, and the Blobsplorer tool from https://github.com/mojones/Blobsplorer. PMID

  18. Exploring inconsistencies in genome-wide protein function annotations: a machine learning approach

    PubMed Central

    Andorf, Carson; Dobbs, Drena; Honavar, Vasant

    2007-01-01

    Background Incorrectly annotated sequence data are becoming more commonplace as databases increasingly rely on automated techniques for annotation. Hence, there is an urgent need for computational methods for checking consistency of such annotations against independent sources of evidence and detecting potential annotation errors. We show how a machine learning approach designed to automatically predict a protein's Gene Ontology (GO) functional class can be employed to identify potential gene annotation errors. Results In a set of 211 previously annotated mouse protein kinases, we found that 201 of the GO annotations returned by AmiGO appear to be inconsistent with the UniProt functions assigned to their human counterparts. In contrast, 97% of the predicted annotations generated using a machine learning approach were consistent with the UniProt annotations of the human counterparts, as well as with available annotations for these mouse protein kinases in the Mouse Kinome database. Conclusion We conjecture that most of our predicted annotations are, therefore, correct and suggest that the machine learning approach developed here could be routinely used to detect potential errors in GO annotations generated by high-throughput gene annotation projects. Editors Note : Authors from the original publication (Okazaki et al.: Nature 2002, 420:563–73) have provided their response to Andorf et al, directly following the correspondence. PMID:17683567

  19. The new modern era of yeast genomics: community sequencing and the resulting annotation of multiple Saccharomyces cerevisiae strains at the Saccharomyces Genome Database.

    PubMed

    Engel, Stacia R; Cherry, J Michael

    2013-01-01

    The first completed eukaryotic genome sequence was that of the yeast Saccharomyces cerevisiae, and the Saccharomyces Genome Database (SGD; http://www.yeastgenome.org/) is the original model organism database. SGD remains the authoritative community resource for the S. cerevisiae reference genome sequence and its annotation, and continues to provide comprehensive biological information correlated with S. cerevisiae genes and their products. A diverse set of yeast strains have been sequenced to explore commercial and laboratory applications, and a brief history of those strains is provided. The publication of these new genomes has motivated the creation of new tools, and SGD will annotate and provide comparative analyses of these sequences, correlating changes with variations in strain phenotypes and protein function. We are entering a new era at SGD, as we incorporate these new sequences and make them accessible to the scientific community, all in an effort to continue in our mission of educating researchers and facilitating discovery.

  20. Discovery of germline-related genes in Cephalochordate amphioxus: A genome wide survey using genome annotation and transcriptome data.

    PubMed

    Yue, Jia-Xing; Li, Kun-Lung; Yu, Jr-Kai

    2015-12-01

    The generation of germline cells is a critical process in the reproduction of multicellular organisms. Studies in animal models have identified a common repertoire of genes that play essential roles in primordial germ cell (PGC) formation. However, comparative studies also indicate that the timing and regulation of this core genetic program vary considerably in different animals, raising the intriguing questions regarding the evolution of PGC developmental mechanisms in metazoans. Cephalochordates (commonly called amphioxus or lancelets) represent one of the invertebrate chordate groups and can provide important information about the evolution of developmental mechanisms in the chordate lineage. In this study, we used genome and transcriptome data to identify germline-related genes in two distantly related cephalochordate species, Branchiostoma floridae and Asymmetron lucayanum. Branchiostoma and Asymmetron diverged more than 120 MYA, and the most conspicuous difference between them is their gonadal morphology. We used important germline developmental genes in several model animals to search the amphioxus genome and transcriptome dataset for conserved homologs. We also annotated the assembled transcriptome data using Gene Ontology (GO) terms to facilitate the discovery of putative genes associated with germ cell development and reproductive functions in amphioxus. We further confirmed the expression of 14 genes in developing oocytes or mature eggs using whole mount in situ hybridization, suggesting their potential functions in amphioxus germ cell development. The results of this global survey provide a useful resource for testing potential functions of candidate germline-related genes in cephalochordates and for investigating differences in gonad developmental mechanisms between Branchiostoma and Asymmetron species.

  1. Target selection and annotation for the structural genomics of the amidohydrolase and enolase superfamilies.

    PubMed

    Pieper, Ursula; Chiang, Ranyee; Seffernick, Jennifer J; Brown, Shoshana D; Glasner, Margaret E; Kelly, Libusha; Eswar, Narayanan; Sauder, J Michael; Bonanno, Jeffrey B; Swaminathan, Subramanyam; Burley, Stephen K; Zheng, Xiaojing; Chance, Mark R; Almo, Steven C; Gerlt, John A; Raushel, Frank M; Jacobson, Matthew P; Babbitt, Patricia C; Sali, Andrej

    2009-04-01

    To study the substrate specificity of enzymes, we use the amidohydrolase and enolase superfamilies as model systems; members of these superfamilies share a common TIM barrel fold and catalyze a wide range of chemical reactions. Here, we describe a collaboration between the Enzyme Specificity Consortium (ENSPEC) and the New York SGX Research Center for Structural Genomics (NYSGXRC) that aims to maximize the structural coverage of the amidohydrolase and enolase superfamilies. Using sequence- and structure-based protein comparisons, we first selected 535 target proteins from a variety of genomes for high-throughput structure determination by X-ray crystallography; 63 of these targets were not previously annotated as superfamily members. To date, 20 unique amidohydrolase and 41 unique enolase structures have been determined, increasing the fraction of sequences in the two superfamilies that can be modeled based on at least 30% sequence identity from 45% to 73%. We present case studies of proteins related to uronate isomerase (an amidohydrolase superfamily member) and mandelate racemase (an enolase superfamily member), to illustrate how this structure-focused approach can be used to generate hypotheses about sequence-structure-function relationships.

  2. Integration and Querying of Genomic and Proteomic Semantic Annotations for Biomedical Knowledge Extraction.

    PubMed

    Masseroli, Marco; Canakoglu, Arif; Ceri, Stefano

    2016-01-01

    Understanding complex biological phenomena involves answering complex biomedical questions on multiple biomolecular information simultaneously, which are expressed through multiple genomic and proteomic semantic annotations scattered in many distributed and heterogeneous data sources; such heterogeneity and dispersion hamper the biologists' ability of asking global queries and performing global evaluations. To overcome this problem, we developed a software architecture to create and maintain a Genomic and Proteomic Knowledge Base (GPKB), which integrates several of the most relevant sources of such dispersed information (including Entrez Gene, UniProt, IntAct, Expasy Enzyme, GO, GOA, BioCyc, KEGG, Reactome, and OMIM). Our solution is general, as it uses a flexible, modular, and multilevel global data schema based on abstraction and generalization of integrated data features, and a set of automatic procedures for easing data integration and maintenance, also when the integrated data sources evolve in data content, structure, and number. These procedures also assure consistency, quality, and provenance tracking of all integrated data, and perform the semantic closure of the hierarchical relationships of the integrated biomedical ontologies. At http://www.bioinformatics.deib.polimi.it/GPKB/, a Web interface allows graphical easy composition of queries, although complex, on the knowledge base, supporting also semantic query expansion and comprehensive explorative search of the integrated data to better sustain biomedical knowledge extraction.

  3. Structural cuticular proteins from arthropods: annotation, nomenclature, and sequence characteristics in the genomics era

    PubMed Central

    Willis, Judith H.

    2010-01-01

    The availability of whole genome sequences of several arthropods has provided new insights into structural cuticular proteins (CPs), in particular the distribution of different families, the recognition that these proteins may comprise almost 2% of the protein coding genes of some species, and the identification of features that should aid in the annotation of new genomes and EST libraries as they become available. Twelve CP families are described: CPR (named after the Rebers and Riddiford Consensus); CPF (named because it has a highly conserved region consisting of about forty-four amino acids); CPFL (like the CPFs in a conserved C-terminal region); the TWDL family, named after a picturesque phenotype of one mutant member; four families in addition to TWDL with a preponderance of low complexity sequence that are not member of the families listed above. These were named after particular diagnostic features as CPLCA, CPLCG, CPLCW, CPLCP. There are also CPG, a lepidopteran family with an abundance of glycines, the apidermin family, named after three proteins in Apis mellifera, and CPAP1 and CPAP3, named because they have features analogous to peritrophins, namely one or three chitin-binding domains. Also described are common motifs and features. Four unusual CPs are discussed in detail. Data that facilitated the analysis of sequence variation of single CP genes in natural populations are analyzed. PMID:20171281

  4. SearchDOGS Bacteria, Software That Provides Automated Identification of Potentially Missed Genes in Annotated Bacterial Genomes

    PubMed Central

    ÓhÉigeartaigh, Seán S.; Armisén, David; Byrne, Kevin P.

    2014-01-01

    We report the development of SearchDOGS Bacteria, software to automatically detect missing genes in annotated bacterial genomes by combining BLAST searches with comparative genomics. Having successfully applied the approach to yeast genomes, we redeveloped SearchDOGS to function as a standalone, downloadable package, requiring only a set of GenBank annotation files as input. The software automatically generates a homology structure using reciprocal BLAST and a synteny-based method; this is followed by a scan of the entire genome of each species for unannotated genes. Results are provided in a HTML interface, providing coordinates, BLAST results, syntenic location, omega values (Ka/Ks, where Ks is the number of synonymous substitutions per synonymous site and Ka is the number of nonsynonymous substitutions per nonsynonymous site) for protein conservation estimates, and other information for each candidate gene. Using SearchDOGS Bacteria, we identified 155 gene candidates in the Shigella boydii sb227 genome, including 56 candidates of length < 60 codons. SearchDOGS Bacteria has two major advantages over currently available annotation software. First, it outperforms current methods in terms of sensitivity and is highly effective at identifying small or highly diverged genes. Second, as a freely downloadable package, it can be used with unpublished or confidential data. PMID:24659774

  5. Use of Modern Chemical Protein Synthesis and Advanced Fluorescent Assay Techniques to Experimentally Validate the Functional Annotation of Microbial Genomes

    SciTech Connect

    Kent, Stephen

    2012-07-20

    The objective of this research program was to prototype methods for the chemical synthesis of predicted protein molecules in annotated microbial genomes. High throughput chemical methods were to be used to make large numbers of predicted proteins and protein domains, based on microbial genome sequences. Microscale chemical synthesis methods for the parallel preparation of peptide-thioester building blocks were developed; these peptide segments are used for the parallel chemical synthesis of proteins and protein domains. Ultimately, it is envisaged that these synthetic molecules would be ‘printed’ in spatially addressable arrays. The unique ability of total synthesis to precision label protein molecules with dyes and with chemical or biochemical ‘tags’ can be used to facilitate novel assay technologies adapted from state-of-the art single molecule fluorescence detection techniques. In the future, in conjunction with modern laboratory automation this integrated set of techniques will enable high throughput experimental validation of the functional annotation of microbial genomes.

  6. Putative drug and vaccine target protein identification using comparative genomic analysis of KEGG annotated metabolic pathways of Mycoplasma hyopneumoniae.

    PubMed

    Damte, Dereje; Suh, Joo-Won; Lee, Seung-Jin; Yohannes, Sileshi Belew; Hossain, Md Akil; Park, Seung-Chun

    2013-07-01

    In the present study, a computational comparative and subtractive genomic/proteomic analysis aimed at the identification of putative therapeutic target and vaccine candidate proteins from Kyoto Encyclopedia of Genes and Genomes (KEGG) annotated metabolic pathways of Mycoplasma hyopneumoniae was performed for drug design and vaccine production pipelines against M.hyopneumoniae. The employed comparative genomic and metabolic pathway analysis with a predefined computational systemic workflow extracted a total of 41 annotated metabolic pathways from KEGG among which five were unique to M. hyopneumoniae. A total of 234 proteins were identified to be involved in these metabolic pathways. Although 125 non homologous and predicted essential proteins were found from the total that could serve as potential drug targets and vaccine candidates, additional prioritizing parameters characterize 21 proteins as vaccine candidate while druggability of each of the identified proteins evaluated by the DrugBank database prioritized 42 proteins suitable for drug targets.

  7. Next-Generation Annotation of Prokaryotic Genomes with EuGene-P: Application to Sinorhizobium meliloti 2011

    PubMed Central

    Sallet, Erika; Roux, Brice; Sauviac, Laurent; Jardinaud, Marie-Franc¸oise; Carrère, Sébastien; Faraut, Thomas; de Carvalho-Niebel, Fernanda; Gouzy, Jérôme; Gamas, Pascal; Capela, Delphine; Bruand, Claude; Schiex, Thomas

    2013-01-01

    The availability of next-generation sequences of transcripts from prokaryotic organisms offers the opportunity to design a new generation of automated genome annotation tools not yet available for prokaryotes. In this work, we designed EuGene-P, the first integrative prokaryotic gene finder tool which combines a variety of high-throughput data, including oriented RNA-Seq data, directly into the prediction process. This enables the automated prediction of coding sequences (CDSs), untranslated regions, transcription start sites (TSSs) and non-coding RNA (ncRNA, sense and antisense) genes. EuGene-P was used to comprehensively and accurately annotate the genome of the nitrogen-fixing bacterium Sinorhizobium meliloti strain 2011, leading to the prediction of 6308 CDSs as well as 1876 ncRNAs. Among them, 1280 appeared as antisense to a CDS, which supports recent findings that antisense transcription activity is widespread in bacteria. Moreover, 4077 TSSs upstream of protein-coding or non-coding genes were precisely mapped providing valuable data for the study of promoter regions. By looking for RpoE2-binding sites upstream of annotated TSSs, we were able to extend the S. meliloti RpoE2 regulon by ∼3-fold. Altogether, these observations demonstrate the power of EuGene-P to produce a reliable and high-resolution automatic annotation of prokaryotic genomes. PMID:23599422

  8. Genome3D: a UK collaborative project to annotate genomic sequences with predicted 3D structures based on SCOP and CATH domains

    PubMed Central

    Lewis, Tony E.; Sillitoe, Ian; Andreeva, Antonina; Blundell, Tom L.; Buchan, Daniel W.A.; Chothia, Cyrus; Cuff, Alison; Dana, Jose M.; Filippis, Ioannis; Gough, Julian; Hunter, Sarah; Jones, David T.; Kelley, Lawrence A.; Kleywegt, Gerard J.; Minneci, Federico; Mitchell, Alex; Murzin, Alexey G.; Ochoa-Montaño, Bernardo; Rackham, Owen J. L.; Smith, James; Sternberg, Michael J. E.; Velankar, Sameer; Yeats, Corin; Orengo, Christine

    2013-01-01

    Genome3D, available at http://www.genome3d.eu, is a new collaborative project that integrates UK-based structural resources to provide a unique perspective on sequence–structure–function relationships. Leading structure prediction resources (DomSerf, FUGUE, Gene3D, pDomTHREADER, Phyre and SUPERFAMILY) provide annotations for UniProt sequences to indicate the locations of structural domains (structural annotations) and their 3D structures (structural models). Structural annotations and 3D model predictions are currently available for three model genomes (Homo sapiens, E. coli and baker’s yeast), and the project will extend to other genomes in the near future. As these resources exploit different strategies for predicting structures, the main aim of Genome3D is to enable comparisons between all the resources so that biologists can see where predictions agree and are therefore more trusted. Furthermore, as these methods differ in whether they build their predictions using CATH or SCOP, Genome3D also contains the first official mapping between these two databases. This has identified pairs of similar superfamilies from the two resources at various degrees of consensus (532 bronze pairs, 527 silver pairs and 370 gold pairs). PMID:23203986

  9. Resequencing and annotation of the Nostoc punctiforme ATTC 29133 genome: facilitating biofuel and high-value chemical production.

    PubMed

    Moraes, Luis E; Blow, Matthew J; Hawley, Erik R; Piao, Hailan; Kuo, Rita; Chiniquy, Jennifer; Shapiro, Nicole; Woyke, Tanja; Fadel, James G; Hess, Matthias

    2017-12-01

    Cyanobacteria have the potential to produce bulk and fine chemicals and members belonging to Nostoc sp. have received particular attention due to their relatively fast growth rate and the relative ease with which they can be harvested. Nostoc punctiforme is an aerobic, motile, Gram-negative, filamentous cyanobacterium that has been studied intensively to enhance our understanding of microbial carbon and nitrogen fixation. The genome of the type strain N. punctiforme ATCC 29133 was sequenced in 2001 and the scientific community has used these genome data extensively since then. Advances in bioinformatics tools for sequence annotation and the importance of this organism prompted us to resequence and reanalyze its genome and to make both, the initial and improved annotation, available to the scientific community. The new draft genome has a total size of 9.1 Mbp and consists of 65 contiguous pieces of DNA with a GC content of 41.38% and 7664 protein-coding genes. Furthermore, the resequenced genome is slightly (5152 bp) larger and contains 987 more genes with functional prediction when compared to the previously published version. We deposited the annotation of both genomes in the Department of Energy's IMG database to facilitate easy genome exploration by the scientific community without the need of in-depth bioinformatics skills. We expect that an facilitated access and ability to search the N. punctiforme ATCC 29133 for genes of interest will significantly facilitate metabolic engineering and genome prospecting efforts and ultimately the synthesis of biofuels and natural products from this keystone organism and closely related cyanobacteria.

  10. Estimating gene gain and loss rates in the presence of error in genome assembly and annotation using CAFE 3.

    PubMed

    Han, Mira V; Thomas, Gregg W C; Lugo-Martinez, Jose; Hahn, Matthew W

    2013-08-01

    Current sequencing methods produce large amounts of data, but genome assemblies constructed from these data are often fragmented and incomplete. Incomplete and error-filled assemblies result in many annotation errors, especially in the number of genes present in a genome. This means that methods attempting to estimate rates of gene duplication and loss often will be misled by such errors and that rates of gene family evolution will be consistently overestimated. Here, we present a method that takes these errors into account, allowing one to accurately infer rates of gene gain and loss among genomes even with low assembly and annotation quality. The method is implemented in the newest version of the software package CAFE, along with several other novel features. We demonstrate the accuracy of the method with extensive simulations and reanalyze several previously published data sets. Our results show that errors in genome annotation do lead to higher inferred rates of gene gain and loss but that CAFE 3 sufficiently accounts for these errors to provide accurate estimates of important evolutionary parameters.

  11. Gene prediction and annotation in Penstemon (Plantaginaceae): A workflow for marker development from extremely low-coverage genome sequencing1

    PubMed Central

    Blischak, Paul D.; Wenzel, Aaron J.; Wolfe, Andrea D.

    2014-01-01

    • Premise of the study: Penstemon (Plantaginaceae) is a large and diverse genus endemic to North America. However, determining the phylogenetic relationships among its 280 species has been difficult due to its recent evolutionary radiation. The development of a large, multilocus data set can help to resolve this challenge. • Methods: Using both previously sequenced genomic libraries and our own low-coverage whole-genome shotgun sequencing libraries, we used the MAKER2 Annotation Pipeline to identify gene regions for the development of sequencing loci from six extremely low-coverage Penstemon genomes (∼0.005×–0.007×). We also compared this approach to BLAST searches, and conducted analyses to characterize sequence divergence across the species sequenced. • Results: Annotations and gene predictions were successfully added to more than 10,000 contigs for potential use in downstream primer design. Primers were then designed for chloroplast, mitochondrial, and nuclear loci from these annotated sequences. MAKER2 identified longer gene regions in all six Penstemon genomes when compared with BLASTN and BLASTX searches. The average level of sequence divergence among the six species was 7.14%. • Discussion: Combining bioinformatics tools into a workflow that produces annotations can be useful for creating potential phylogenetic markers from thousands of sequences even when genome coverage is extremely low and reference data are only available from distant relatives. Furthermore, the output from MAKER2 contains information about important gene features, such as exon boundaries, and can be easily integrated with visualization tools to facilitate the process of marker development. PMID:25506519

  12. First Draft Assembly and Annotation of the Genome of a California Endemic Oak Quercus lobata Née (Fagaceae)

    PubMed Central

    Sork, Victoria L.; Fitz-Gibbon, Sorel T.; Puiu, Daniela; Crepeau, Marc; Gugger, Paul F.; Sherman, Rachel; Stevens, Kristian; Langley, Charles H.; Pellegrini, Matteo; Salzberg, Steven L.

    2016-01-01

    Oak represents a valuable natural resource across Northern Hemisphere ecosystems, attracting a large research community studying its genetics, ecology, conservation, and management. Here we introduce a draft genome assembly of valley oak (Quercus lobata) using Illumina sequencing of adult leaf tissue of a tree found in an accessible, well-studied, natural southern California population. Our assembly includes a nuclear genome and a complete chloroplast genome, along with annotation of encoded genes. The assembly contains 94,394 scaffolds, totaling 1.17 Gb with 18,512 scaffolds of length 2 kb or longer, with a total length of 1.15 Gb, and a N50 scaffold size of 278,077 kb. The k-mer histograms indicate an diploid genome size of ∼720–730 Mb, which is smaller than the total length due to high heterozygosity, estimated at 1.25%. A comparison with a recently published European oak (Q. robur) nuclear sequence indicates 93% similarity. The Q. lobata chloroplast genome has 99% identity with another North American oak, Q. rubra. Preliminary annotation yielded an estimate of 61,773 predicted protein-coding genes, of which 71% had similarity to known protein domains. We searched 956 Benchmarking Universal Single-Copy Orthologs, and found 863 complete orthologs, of which 450 were present in > 1 copy. We also examined an earlier version (v0.5) where duplicate haplotypes were removed to discover variants. These additional sources indicate that the predicted gene count in Version 1.0 is overestimated by 37–52%. Nonetheless, this first draft valley oak genome assembly represents a high-quality, well-annotated genome that provides a tool for forest restoration and management practices. PMID:27621377

  13. Transcript Mapping and Genome Annotation of Ascidian mtDNA Using EST Data

    PubMed Central

    Gissi, Carmela; Pesole, Graziano

    2003-01-01

    Mitochondrial transcripts of two ascidian species were reconstructed through sequence assembly of publicly available ESTs resembling mitochondrial DNA sequences (mt-ESTs). This strategy allowed us to analyze processing and mapping of the mitochondrial transcripts and to investigate the gene organization of a previously uncharacterized mitochondrial genome (mtDNA). This new strategy would greatly facilitate the sequencing and annotation of mtDNAs. In Ciona intestinalis, the assembled mt-ESTs covered 22 mitochondrial genes (∼12,000 bp) and provided the partial sequence of the mtDNA and the prediction of its gene organization. Such sequences were confirmed by amplification and sequencing of the entire Ciona mtDNA. For Halocynthia roretzi, for which the mtDNA sequence was already available, the inferred mt transcripts allowed better definition of gene boundaries (16S rRNA, ND1, ATP6, and tRNA-Ser genes) and the identification of a new gene (an additional Phe-tRNA). In both species, polycistronic and immature transcripts, creation of stop codons by polyadenylation, tRNA signal processing, and rRNA transcript termination signals were identified, thus suggesting that the main features of mitochondrial transcripts are conserved in Chordata. PMID:12915488

  14. Evaluation of relational and NoSQL database architectures to manage genomic annotations.

    PubMed

    Schulz, Wade L; Nelson, Brent G; Felker, Donn K; Durant, Thomas J S; Torres, Richard

    2016-12-01

    While the adoption of next generation sequencing has rapidly expanded, the informatics infrastructure used to manage the data generated by this technology has not kept pace. Historically, relational databases have provided much of the framework for data storage and retrieval. Newer technologies based on NoSQL architectures may provide significant advantages in storage and query efficiency, thereby reducing the cost of data management. But their relative advantage when applied to biomedical data sets, such as genetic data, has not been characterized. To this end, we compared the storage, indexing, and query efficiency of a common relational database (MySQL), a document-oriented NoSQL database (MongoDB), and a relational database with NoSQL support (PostgreSQL). When used to store genomic annotations from the dbSNP database, we found the NoSQL architectures to outperform traditional, relational models for speed of data storage, indexing, and query retrieval in nearly every operation. These findings strongly support the use of novel database technologies to improve the efficiency of data management within the biological sciences.

  15. Narrowing and genomic annotation of the commonly deleted region of the 5q- syndrome

    SciTech Connect

    Boultwood, Jacqueline; Fidler, Carrie; Strickson, Amanda J.; Watkins, Fiona; Gama, Susana; Kearney, Lyndal; Tosi, Sabrina; Kasprzyk, Arek; Cheng, Jan-Fang; Jaju, Rina J.; Wainscoat, James S.

    2002-01-15

    The 5q syndrome is the most distinct of the myelodysplastic syndromes, and the molecular basis for this disorder remains unknown. We describe the narrowing of the common deleted region (CDR) of the 5q syndrome to the approximately 1.5-megabases interval at 5q32 flanked by D5S413 and the GLRA1 gene. The Ensemblgene prediction program has been used for the complete genomic annotation of the CDR. The CDR is gene rich and contains 24 known genes and 16 novel (predicted) genes. Of 40 genes in the CDR, 33 are expressed in CD34 cells and, therefore, represent candidate genes since they are expressed within the hematopoietic stem/progenitor cell compartment. A number of the genes assigned to the CDR represent good candidates for the 5q syndrome, including MEGF1, G3BP, and several of the novel gene predictions. These data now afford a comprehensive mutational/expression analysis of all candidate genes assigned to the CDR.

  16. Comparison of Functional Gene Annotation of Toxascaris leonina and Toxocara canis using CLC Genomics Workbench

    PubMed Central

    Kim, Ki Uk; Park, Sang Kyun; Kang, Shin Ae; Park, Mi Kyung; Cho, Min Kyoung; Jung, Ho-jin; Kim, Kyung-Yun

    2013-01-01

    The ascarids, Toxocara canis and Toxascaris leonina, are probably the most common gastrointestinal helminths encountered in dogs. In order to understand biological differences of 2 ascarids, we analyzed gene expression profiles of female adults of T. canis and T. leonina using CLC Genomics Workbench, and the results were compared with those of free-living nematode Caenorhabditis elegans. A total of 2,880 and 7,949 ESTs were collected from T. leonina and T. canis, respectively. The length of ESTs ranged from 106 to 4,637 bp with an average insert size of 820 bp. Overall, our results showed that most functional gene annotations of 2 ascarids were quite similar to each other in 3 major categories, i.e., cellular component, biological process, and molecular function. Although some different transcript expression categories were found, the distance was short and it was not enough to explain their different lifestyles. However, we found distinguished transcript differences between ascarid parasites and free-living nematodes. Understanding evolutionary genetic changes might be helpful for studies of the lifestyle and evolution of parasites. PMID:24327777

  17. Comparison of functional gene annotation of Toxascaris leonina and Toxocara canis using CLC genomics workbench.

    PubMed

    Kim, Ki Uk; Park, Sang Kyun; Kang, Shin Ae; Park, Mi Kyung; Cho, Min Kyoung; Jung, Ho-Jin; Kim, Kyung-Yun; Yu, Hak Sun

    2013-10-01

    The ascarids, Toxocara canis and Toxascaris leonina, are probably the most common gastrointestinal helminths encountered in dogs. In order to understand biological differences of 2 ascarids, we analyzed gene expression profiles of female adults of T. canis and T. leonina using CLC Genomics Workbench, and the results were compared with those of free-living nematode Caenorhabditis elegans. A total of 2,880 and 7,949 ESTs were collected from T. leonina and T. canis, respectively. The length of ESTs ranged from 106 to 4,637 bp with an average insert size of 820 bp. Overall, our results showed that most functional gene annotations of 2 ascarids were quite similar to each other in 3 major categories, i.e., cellular component, biological process, and molecular function. Although some different transcript expression categories were found, the distance was short and it was not enough to explain their different lifestyles. However, we found distinguished transcript differences between ascarid parasites and free-living nematodes. Understanding evolutionary genetic changes might be helpful for studies of the lifestyle and evolution of parasites.

  18. Likelihood-based gene annotations for gap filling and quality assessment in genome-scale metabolic models

    SciTech Connect

    Benedict, Matthew N.; Mundy, Michael B.; Henry, Christopher S.; Chia, Nicholas; Price, Nathan D.; Maranas, Costas D.

    2014-10-16

    Genome-scale metabolic models provide a powerful means to harness information from genomes to deepen biological insights. With exponentially increasing sequencing capacity, there is an enormous need for automated reconstruction techniques that can provide more accurate models in a short time frame. Current methods for automated metabolic network reconstruction rely on gene and reaction annotations to build draft metabolic networks and algorithms to fill gaps in these networks. However, automated reconstruction is hampered by database inconsistencies, incorrect annotations, and gap filling largely without considering genomic information. Here we develop an approach for applying genomic information to predict alternative functions for genes and estimate their likelihoods from sequence homology. We show that computed likelihood values were significantly higher for annotations found in manually curated metabolic networks than those that were not. We then apply these alternative functional predictions to estimate reaction likelihoods, which are used in a new gap filling approach called likelihood-based gap filling to predict more genomically consistent solutions. To validate the likelihood-based gap filling approach, we applied it to models where essential pathways were removed, finding that likelihood-based gap filling identified more biologically relevant solutions than parsimony-based gap filling approaches. We also demonstrate that models gap filled using likelihood-based gap filling provide greater coverage and genomic consistency with metabolic gene functions compared to parsimony-based approaches. Interestingly, despite these findings, we found that likelihoods did not significantly affect consistency of gap filled models with Biolog and knockout lethality data. This indicates that the phenotype data alone cannot necessarily be used to discriminate between alternative solutions for gap filling and therefore, that the use of other information is necessary to

  19. First Draft Assembly and Annotation of the Genome of a California Endemic Oak Quercus lobata Née (Fagaceae).

    PubMed

    Sork, Victoria L; Fitz-Gibbon, Sorel T; Puiu, Daniela; Crepeau, Marc; Gugger, Paul F; Sherman, Rachel; Stevens, Kristian; Langley, Charles H; Pellegrini, Matteo; Salzberg, Steven L

    2016-09-12

    Oaks represent a valuable natural resource across Northern Hemisphere ecosystems attracting a large research community studying its genetics, ecology, conservation, and management. Here we introduce a draft genome assembly of valley oak (Quercus lobate) using Illumina sequencing of adult leaf tissue of an tree found in an accessible, well-studied, natural southern California population. Our assembly includes a nuclear and complete chloroplast genome, along with annotation of encoded genes. The assembly contains 94,394 scaffolds totaling 1.17 Gb with 18,512 scaffolds of length 2 kb or longer, with a total length of 1.15 Gb and a N50 scaffold size of 278,077 kb. The k-mer histograms indicate an approximate diploid genome size of 720-730 Mb, which is smaller than the total length due to high heterozygosity, estimated at 1.25%. A comparison with a recently published European oak (Q. robur) nuclear sequence indicates 93% similarity. The Q. lobate chloroplast genome has 99% identity with another North American oak, Q. rubra Preliminary annotation yielded an estimate of 61,773 predicted protein-coding genes, of which 71% had similarity to known protein domains. We searched 956 Benchmarking Universal Single-Copy Orthologs and found 863 complete orthologs of which 450 were present in more than one copy. We also examined an earlier version (v0.5) where duplicate haplotypes were removed to discover variants. These additional sources indicate that the predicted gene count in Version1.0 is over-estimated by 37-52%. Nonetheless, this first draft valley oak genome assembly represents a high-quality, well-annotated genome that provides a tool for forest restoration and management practices.

  20. Likelihood-based gene annotations for gap filling and quality assessment in genome-scale metabolic models

    DOE PAGES

    Benedict, Matthew N.; Mundy, Michael B.; Henry, Christopher S.; ...

    2014-10-16

    Genome-scale metabolic models provide a powerful means to harness information from genomes to deepen biological insights. With exponentially increasing sequencing capacity, there is an enormous need for automated reconstruction techniques that can provide more accurate models in a short time frame. Current methods for automated metabolic network reconstruction rely on gene and reaction annotations to build draft metabolic networks and algorithms to fill gaps in these networks. However, automated reconstruction is hampered by database inconsistencies, incorrect annotations, and gap filling largely without considering genomic information. Here we develop an approach for applying genomic information to predict alternative functions for genesmore » and estimate their likelihoods from sequence homology. We show that computed likelihood values were significantly higher for annotations found in manually curated metabolic networks than those that were not. We then apply these alternative functional predictions to estimate reaction likelihoods, which are used in a new gap filling approach called likelihood-based gap filling to predict more genomically consistent solutions. To validate the likelihood-based gap filling approach, we applied it to models where essential pathways were removed, finding that likelihood-based gap filling identified more biologically relevant solutions than parsimony-based gap filling approaches. We also demonstrate that models gap filled using likelihood-based gap filling provide greater coverage and genomic consistency with metabolic gene functions compared to parsimony-based approaches. Interestingly, despite these findings, we found that likelihoods did not significantly affect consistency of gap filled models with Biolog and knockout lethality data. This indicates that the phenotype data alone cannot necessarily be used to discriminate between alternative solutions for gap filling and therefore, that the use of other information is necessary

  1. Likelihood-based gene annotations for gap filling and quality assessment in genome-scale metabolic models.

    PubMed

    Benedict, Matthew N; Mundy, Michael B; Henry, Christopher S; Chia, Nicholas; Price, Nathan D

    2014-10-01

    Genome-scale metabolic models provide a powerful means to harness information from genomes to deepen biological insights. With exponentially increasing sequencing capacity, there is an enormous need for automated reconstruction techniques that can provide more accurate models in a short time frame. Current methods for automated metabolic network reconstruction rely on gene and reaction annotations to build draft metabolic networks and algorithms to fill gaps in these networks. However, automated reconstruction is hampered by database inconsistencies, incorrect annotations, and gap filling largely without considering genomic information. Here we develop an approach for applying genomic information to predict alternative functions for genes and estimate their likelihoods from sequence homology. We show that computed likelihood values were significantly higher for annotations found in manually curated metabolic networks than those that were not. We then apply these alternative functional predictions to estimate reaction likelihoods, which are used in a new gap filling approach called likelihood-based gap filling to predict more genomically consistent solutions. To validate the likelihood-based gap filling approach, we applied it to models where essential pathways were removed, finding that likelihood-based gap filling identified more biologically relevant solutions than parsimony-based gap filling approaches. We also demonstrate that models gap filled using likelihood-based gap filling provide greater coverage and genomic consistency with metabolic gene functions compared to parsimony-based approaches. Interestingly, despite these findings, we found that likelihoods did not significantly affect consistency of gap filled models with Biolog and knockout lethality data. This indicates that the phenotype data alone cannot necessarily be used to discriminate between alternative solutions for gap filling and therefore, that the use of other information is necessary to

  2. Generation, functional annotation and comparative analysis of black spruce (Picea mariana) ESTs: an important conifer genomic resource

    PubMed Central

    2013-01-01

    Background EST (expressed sequence tag) sequences and their annotation provide a highly valuable resource for gene discovery, genome sequence annotation, and other genomics studies that can be applied in genetics, breeding and conservation programs for non-model organisms. Conifers are long-lived plants that are ecologically and economically important globally, and have a large genome size. Black spruce (Picea mariana), is a transcontinental species of the North American boreal and temperate forests. However, there are limited transcriptomic and genomic resources for this species. The primary objective of our study was to develop a black spruce transcriptomic resource to facilitate on-going functional genomics projects related to growth and adaptation to climate change. Results We conducted bidirectional sequencing of cDNA clones from a standard cDNA library constructed from black spruce needle tissues. We obtained 4,594 high quality (2,455 5' end and 2,139 3' end) sequence reads, with an average read-length of 532 bp. Clustering and assembly of ESTs resulted in 2,731 unique sequences, consisting of 2,234 singletons and 497 contigs. Approximately two-thirds (63%) of unique sequences were functionally annotated. Genes involved in 36 molecular functions and 90 biological processes were discovered, including 24 putative transcription factors and 232 genes involved in photosynthesis. Most abundantly expressed transcripts were associated with photosynthesis, growth factors, stress and disease response, and transcription factors. A total of 216 full-length genes were identified. About 18% (493) of the transcripts were novel, representing an important addition to the Genbank EST database (dbEST). Fifty-seven di-, tri-, tetra- and penta-nucleotide simple sequence repeats were identified. Conclusions We have developed the first high quality EST resource for black spruce and identified 493 novel transcripts, which may be species-specific related to life history and

  3. Computational prediction of over-annotated protein-coding genes in the genome of Agrobacterium tumefaciens strain C58

    NASA Astrophysics Data System (ADS)

    Yu, Jia-Feng; Sui, Tian-Xiang; Wang, Hong-Mei; Wang, Chun-Ling; Jing, Li; Wang, Ji-Hua

    2015-12-01

    Agrobacterium tumefaciens strain C58 is a type of pathogen that can cause tumors in some dicotyledonous plants. Ever since the genome of A. tumefaciens strain C58 was sequenced, the quality of annotation of its protein-coding genes has been queried continually, because the annotation varies greatly among different databases. In this paper, the questionable hypothetical genes were re-predicted by integrating the TN curve and Z curve methods. As a result, 30 genes originally annotated as “hypothetical” were discriminated as being non-coding sequences. By testing the re-prediction program 10 times on data sets composed of the function-known genes, the mean accuracy of 99.99% and mean Matthews correlation coefficient value of 0.9999 were obtained. Further sequence analysis and COG analysis showed that the re-annotation results were very reliable. This work can provide an efficient tool and data resources for future studies of A. tumefaciens strain C58. Project supported by the National Natural Science Foundation of China (Grant Nos. 61302186 and 61271378) and the Funding from the State Key Laboratory of Bioelectronics of Southeast University.

  4. Non-Gaussian Distributions Affect Identification of Expression Patterns, Functional Annotation, and Prospective Classification in Human Cancer Genomes

    PubMed Central

    Marko, Nicholas F.; Weil, Robert J.

    2012-01-01

    Introduction Gene expression data is often assumed to be normally-distributed, but this assumption has not been tested rigorously. We investigate the distribution of expression data in human cancer genomes and study the implications of deviations from the normal distribution for translational molecular oncology research. Methods We conducted a central moments analysis of five cancer genomes and performed empiric distribution fitting to examine the true distribution of expression data both on the complete-experiment and on the individual-gene levels. We used a variety of parametric and nonparametric methods to test the effects of deviations from normality on gene calling, functional annotation, and prospective molecular classification using a sixth cancer genome. Results Central moments analyses reveal statistically-significant deviations from normality in all of the analyzed cancer genomes. We observe as much as 37% variability in gene calling, 39% variability in functional annotation, and 30% variability in prospective, molecular tumor subclassification associated with this effect. Conclusions Cancer gene expression profiles are not normally-distributed, either on the complete-experiment or on the individual-gene level. Instead, they exhibit complex, heavy-tailed distributions characterized by statistically-significant skewness and kurtosis. The non-Gaussian distribution of this data affects identification of differentially-expressed genes, functional annotation, and prospective molecular classification. These effects may be reduced in some circumstances, although not completely eliminated, by using nonparametric analytics. This analysis highlights two unreliable assumptions of translational cancer gene expression analysis: that “small” departures from normality in the expression data distributions are analytically-insignificant and that “robust” gene-calling algorithms can fully compensate for these effects. PMID:23118863

  5. The new modern era of yeast genomics: community sequencing and the resulting annotation of multiple Saccharomyces cerevisiae strains at the Saccharomyces Genome Database

    PubMed Central

    Engel, Stacia R.; Cherry, J. Michael

    2013-01-01

    The first completed eukaryotic genome sequence was that of the yeast Saccharomyces cerevisiae, and the Saccharomyces Genome Database (SGD; http://www.yeastgenome.org/) is the original model organism database. SGD remains the authoritative community resource for the S. cerevisiae reference genome sequence and its annotation, and continues to provide comprehensive biological information correlated with S. cerevisiae genes and their products. A diverse set of yeast strains have been sequenced to explore commercial and laboratory applications, and a brief history of those strains is provided. The publication of these new genomes has motivated the creation of new tools, and SGD will annotate and provide comparative analyses of these sequences, correlating changes with variations in strain phenotypes and protein function. We are entering a new era at SGD, as we incorporate these new sequences and make them accessible to the scientific community, all in an effort to continue in our mission of educating researchers and facilitating discovery. Database URL: http://www.yeastgenome.org/ PMID:23487186

  6. Improved genome annotation through untargeted detection of pathway-specific metabolites

    PubMed Central

    2011-01-01

    Background Mass spectrometry-based metabolomics analyses have the potential to complement sequence-based methods of genome annotation, but only if raw mass spectral data can be linked to specific metabolic pathways. In untargeted metabolomics, the measured mass of a detected compound is used to define the location of the compound in chemical space, but uncertainties in mass measurements lead to "degeneracies" in chemical space since multiple chemical formulae correspond to the same measured mass. We compare two methods to eliminate these degeneracies. One method relies on natural isotopic abundances, and the other relies on the use of stable-isotope labeling (SIL) to directly determine C and N atom counts. Both depend on combinatorial explorations of the "chemical space" comprised of all possible chemical formulae comprised of biologically relevant chemical elements. Results Of 1532 metabolic pathways curated in the MetaCyc database, 412 contain a metabolite having a chemical formula unique to that metabolic pathway. Thus, chemical formulae alone can suffice to infer the presence of some metabolic pathways. Of 248,928 unique chemical formulae selected from the PubChem database, more than 95% had at least one degeneracy on the basis of accurate mass information alone. Consideration of natural isotopic abundance reduced degeneracy to 64%, but mainly for formulae less than 500 Da in molecular weight, and only if the error in the relative isotopic peak intensity was less than 10%. Knowledge of exact C and N atom counts as determined by SIL enabled reduced degeneracy, allowing for determination of unique chemical formula for 55% of the PubChem formulae. Conclusions To facilitate the assignment of chemical formulae to unknown mass-spectral features, profiling can be performed on cultures uniformly labeled with stable isotopes of nitrogen (15N) or carbon (13C). This makes it possible to accurately count the number of carbon and nitrogen atoms in each molecule, providing a

  7. On genome annotation of Brucellaphage Gadvasu (BpG): discovery of ORFans for integrated systems biology approaches.

    PubMed

    Chachra, Deepti; Kaur, Pushpinder; Siddavatam, Prasad; Suravajhala, Prashanth; Saxena, Hari Mohan

    2015-12-01

    Brucellaphage Gadvasu (BpG) is a lytic phage infecting Brucella spp. Brucellaphages contain dsDNA as genetic material and are short-tailed particles with host-specificity. Here, we report the challenges on annotation in the complete genome sequence of BpG when compared with that of a recent broad host-range brucellaphage Pr, an original reference genome. The extracted DNA was subjected to genome sequencing with Illumina technology and assembled using SSAKE/Velvet. A significant number of genes were found to be similar between the phages with sequence analysis revealing conserved open reading frames that correspond to 33 gene ontology classifiers, transcriptional terminators and a few putative transcriptional promoters. The analyses revealed that the genome constitutes 1269 contigs and 275 genes encoding 260 proteins. The sequence comparison from the reference data indicated that the genome shares an approximately 70 % nucleotide similarity and differs mainly in the region encoding proteins. We bring this commentary providing an overview of how this exemplar genome can allow us to understand these known unknown regions in brucellaphages.

  8. Ontology-based annotations and semantic relations in large-scale (epi)genomics data.

    PubMed

    Galeota, Eugenia; Pelizzola, Mattia

    2016-05-03

    Public repositories of large-scale biological data currently contain hundreds of thousands of experiments, including high-throughput sequencing and microarray data. The potential of using these resources to assemble data sets combining samples previously not associated is vastly unexplored. This requires the ability to associate samples with clear annotations and to relate experiments matched with different annotation terms. In this study, we illustrate the semantic annotation of Gene Expression Omnibus samples metadata using concepts from biomedical ontologies, focusing on the association of thousands of chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) samples with a given target, tissue and disease state. Next, we demonstrate the feasibility of quantitatively measuring the semantic similarity between different samples, with the aim of combining experiments associated with the same or similar semantic annotations, thus allowing the generation of large data sets without the need of additional experiments. We compared tools based on Unified Medical Language System with tools that use topic-specific ontologies, showing that the second approach outperforms the first both in the annotation process and in the computation of semantic similarity measures. Finally, we demonstrated the potential of this approach by identifying semantically homogeneous groups of ChIP-seq samples targeting the Myc transcription factor, and expanding this data set with semantically coherent epigenetic samples. The semantic information of these data sets proved to be coherent with the ChIP-seq signal and with the current knowledge about this transcription factor.

  9. The proteome of Toxoplasma gondii: integration with the genome provides novel insights into gene expression and annotation

    PubMed Central

    Xia, Dong; Sanderson, Sanya J; Jones, Andrew R; Prieto, Judith H; Yates, John R; Bromley, Elizabeth; Tomley, Fiona M; Lal, Kalpana; Sinden, Robert E; Brunk, Brian P; Roos, David S; Wastling, Jonathan M

    2008-01-01

    Background Although the genomes of many of the most important human and animal pathogens have now been sequenced, our understanding of the actual proteins expressed by these genomes and how well they predict protein sequence and expression is still deficient. We have used three complementary approaches (two-dimensional electrophoresis, gel-liquid chromatography linked tandem mass spectrometry and MudPIT) to analyze the proteome of Toxoplasma gondii, a parasite of medical and veterinary significance, and have developed a public repository for these data within ToxoDB, making for the first time proteomics data an integral part of this key genome resource. Results The draft genome for Toxoplasma predicts around 8,000 genes with varying degrees of confidence. Our data demonstrate how proteomics can inform these predictions and help discover new genes. We have identified nearly one-third (2,252) of all the predicted proteins, with 2,477 intron-spanning peptides providing supporting evidence for correct splice site annotation. Functional predictions for each protein and key pathways were determined from the proteome. Importantly, we show evidence for many proteins that match alternative gene models, or previously unpredicted genes. For example, approximately 15% of peptides matched more convincingly to alternative gene models. We also compared our data with existing transcriptional data in which we highlight apparent discrepancies between gene transcription and protein expression. Conclusion Our data demonstrate the importance of protein data in expression profiling experiments and highlight the necessity of integrating proteomic with genomic data so that iterative refinements of both annotation and expression models are possible. PMID:18644147

  10. [Whole-genome amplification by MDA to improve sensitivity of forensic expert examination of chromosomal DNA].

    PubMed

    Ivanov, P L; Fomichev, A A

    2008-01-01

    This pilot project has the objective to evaluate the possibility of application of the multiple displacement amplification (MDA) technique to whole-genome amplification with a view to improving sensitivity of molecular-genetic test-systems. Preparations of total cellular DNA were amplified by MDA and analysed to assess conserved specificity of chromosomal DNA and its enhanced template activity in the standard polymerase chain reaction (PCR) for typing allele variants of polymorphous DNA loci. DNA samples before and after MDA showed virtually identical genotypic combinations of alleles. Allele fragments were stably detected at a level of DNA 4-5 times lower than in the standard test. The results of the study indicate that the MDA technique provides a promising tool to improve reliability of forensic- expert examination of chromosomal DNA and imply the necessity to further develop forensic-medical aspects of this method.

  11. Annotation of two large contiguous regions from the Haemonchus contortus genome using RNA-seq and comparative analysis with Caenorhabditis elegans.

    PubMed

    Laing, Roz; Hunt, Martin; Protasio, Anna V; Saunders, Gary; Mungall, Karen; Laing, Steven; Jackson, Frank; Quail, Michael; Beech, Robin; Berriman, Matthew; Gilleard, John S

    2011-01-01

    The genomes of numerous parasitic nematodes are currently being sequenced, but their complexity and size, together with high levels of intra-specific sequence variation and a lack of reference genomes, makes their assembly and annotation a challenging task. Haemonchus contortus is an economically significant parasite of livestock that is widely used for basic research as well as for vaccine development and drug discovery. It is one of many medically and economically important parasites within the strongylid nematode group. This group of parasites has the closest phylogenetic relationship with the model organism Caenorhabditis elegans, making comparative analysis a potentially powerful tool for genome annotation and functional studies. To investigate this hypothesis, we sequenced two contiguous fragments from the H. contortus genome and undertook detailed annotation and comparative analysis with C. elegans. The adult H. contortus transcriptome was sequenced using an Illumina platform and RNA-seq was used to annotate a 409 kb overlapping BAC tiling path relating to the X chromosome and a 181 kb BAC insert relating to chromosome I. In total, 40 genes and 12 putative transposable elements were identified. 97.5% of the annotated genes had detectable homologues in C. elegans of which 60% had putative orthologues, significantly higher than previous analyses based on EST analysis. Gene density appears to be less in H. contortus than in C. elegans, with annotated H. contortus genes being an average of two-to-three times larger than their putative C. elegans orthologues due to a greater intron number and size. Synteny appears high but gene order is generally poorly conserved, although areas of conserved microsynteny are apparent. C. elegans operons appear to be partially conserved in H. contortus. Our findings suggest that a combination of RNA-seq and comparative analysis with C. elegans is a powerful approach for the annotation and analysis of strongylid nematode genomes.

  12. SNPMeta: SNP annotation and SNP metadata collection without a reference genome

    Technology Transfer Automated Retrieval System (TEKTRAN)

    The increase in availability of resequencing data is greatly accelerating SNP discovery and has facilitated the development of SNP genotyping assays. This, in turn, is increasing interest in annotation of individual SNPs. Currently, these data are only available through curation, or comparison to a ...

  13. KSHV 2.0: a comprehensive annotation of the Kaposi's sarcoma-associated herpesvirus genome using next-generation sequencing reveals novel genomic and functional features.

    PubMed

    Arias, Carolina; Weisburd, Ben; Stern-Ginossar, Noam; Mercier, Alexandre; Madrid, Alexis S; Bellare, Priya; Holdorf, Meghan; Weissman, Jonathan S; Ganem, Don

    2014-01-01

    Productive herpesvirus infection requires a profound, time-controlled remodeling of the viral transcriptome and proteome. To gain insights into the genomic architecture and gene expression control in Kaposi's sarcoma-associated herpesvirus (KSHV), we performed a systematic genome-wide survey of viral transcriptional and translational activity throughout the lytic cycle. Using mRNA-sequencing and ribosome profiling, we found that transcripts encoding lytic genes are promptly bound by ribosomes upon lytic reactivation, suggesting their regulation is mainly transcriptional. Our approach also uncovered new genomic features such as ribosome occupancy of viral non-coding RNAs, numerous upstream and small open reading frames (ORFs), and unusual strategies to expand the virus coding repertoire that include alternative splicing, dynamic viral mRNA editing, and the use of alternative translation initiation codons. Furthermore, we provide a refined and expanded annotation of transcription start sites, polyadenylation sites, splice junctions, and initiation/termination codons of known and new viral features in the KSHV genomic space which we have termed KSHV 2.0. Our results represent a comprehensive genome-scale image of gene regulation during lytic KSHV infection that substantially expands our understanding of the genomic architecture and coding capacity of the virus.

  14. Phylogenetic-based propagation of functional annotations within the Gene Ontology consortium.

    PubMed

    Gaudet, Pascale; Livstone, Michael S; Lewis, Suzanna E; Thomas, Paul D

    2011-09-01

    The goal of the Gene Ontology (GO) project is to provide a uniform way to describe the functions of gene products from organisms across all kingdoms of life and thereby enable analysis of genomic data. Protein annotations are either based on experiments or predicted from protein sequences. Since most sequences have not been experimentally characterized, most available annotations need to be based on predictions. To make as accurate inferences as possible, the GO Consortium's Reference Genome Project is using an explicit evolutionary framework to infer annotations of proteins from a broad set of genomes from experimental annotations in a semi-automated manner. Most components in the pipeline, such as selection of sequences, building multiple sequence alignments and phylogenetic trees, retrieving experimental annotations and depositing inferred annotations, are fully automated. However, the most crucial step in our pipeline relies on software-assisted curation by an expert biologist. This curation tool, Phylogenetic Annotation and INference Tool (PAINT) helps curators to infer annotations among members of a protein family. PAINT allows curators to make precise assertions as to when functions were gained and lost during evolution and record the evidence (e.g. experimentally supported GO annotations and phylogenetic information including orthology) for those assertions. In this article, we describe how we use PAINT to infer protein function in a phylogenetic context with emphasis on its strengths, limitations and guidelines. We also discuss specific examples showing how PAINT annotations compare with those generated by other highly used homology-based methods.

  15. Automated Update, Revision, and Quality Control of the Maize Genome Annotations Using MAKER-P Improves the B73 RefGen_v3 Gene Models and Identifies New Genes1[OPEN

    PubMed Central

    Law, MeiYee; Childs, Kevin L.; Campbell, Michael S.; Stein, Joshua C.; Olson, Andrew J.; Holt, Carson; Panchy, Nicholas; Lei, Jikai; Jiao, Dian; Andorf, Carson M.; Lawrence, Carolyn J.; Ware, Doreen; Shiu, Shin-Han; Sun, Yanni; Jiang, Ning; Yandell, Mark

    2015-01-01

    The large size and relative complexity of many plant genomes make creation, quality control, and dissemination of high-quality gene structure annotations challenging. In response, we have developed MAKER-P, a fast and easy-to-use genome annotation engine for plants. Here, we report the use of MAKER-P to update and revise the maize (Zea mays) B73 RefGen_v3 annotation build (5b+) in less than 3 h using the iPlant Cyberinfrastructure. MAKER-P identified and annotated 4,466 additional, well-supported protein-coding genes not present in the 5b+ annotation build, added additional untranslated regions to 1,393 5b+ gene models, identified 2,647 5b+ gene models that lack any supporting evidence (despite the use of large and diverse evidence data sets), identified 104,215 pseudogene fragments, and created an additional 2,522 noncoding gene annotations. We also describe a method for de novo training of MAKER-P for the annotation of newly sequenced grass genomes. Collectively, these results lead to the 6a maize genome annotation and demonstrate the utility of MAKER-P for rapid annotation, management, and quality control of grasses and other difficult-to-annotate plant genomes. PMID:25384563

  16. Automated update, revision, and quality control of the maize genome annotations using MAKER-P improves the B73 RefGen_v3 gene models and identifies new genes

    Technology Transfer Automated Retrieval System (TEKTRAN)

    The large size and relative complexity of many plant genomes make creation, quality control, and dissemination of high-quality gene structure annotations challenging. In response, we have developed MAKER-P, a fast and easy-to-use genome annotation engine for plants. Here, we report the use of MAKER-...

  17. The Disease Portals, disease-gene annotation and the RGD disease ontology at the Rat Genome Database.

    PubMed

    Hayman, G Thomas; Laulederkind, Stanley J F; Smith, Jennifer R; Wang, Shur-Jen; Petri, Victoria; Nigam, Rajni; Tutaj, Marek; De Pons, Jeff; Dwinell, Melinda R; Shimoyama, Mary

    2016-01-01

    The Rat Genome Database (RGD;http://rgd.mcw.edu/) provides critical datasets and software tools to a diverse community of rat and non-rat researchers worldwide. To meet the needs of the many users whose research is disease oriented, RGD has created a series of Disease Portals and has prioritized its curation efforts on the datasets important to understanding the mechanisms of various diseases. Gene-disease relationships for three species, rat, human and mouse, are annotated to capture biomarkers, genetic associations, molecular mechanisms and therapeutic targets. To generate gene-disease annotations more effectively and in greater detail, RGD initially adopted the MEDIC disease vocabulary from the Comparative Toxicogenomics Database and adapted it for use by expanding this framework with the addition of over 1000 terms to create the RGD Disease Ontology (RDO). The RDO provides the foundation for, at present, 10 comprehensive disease area-related dataset and analysis platforms at RGD, the Disease Portals. Two major disease areas are the focus of data acquisition and curation efforts each year, leading to the release of the related Disease Portals. Collaborative efforts to realize a more robust disease ontology are underway. Database URL:http://rgd.mcw.edu.

  18. The Disease Portals, disease–gene annotation and the RGD disease ontology at the Rat Genome Database

    PubMed Central

    Hayman, G. Thomas; Laulederkind, Stanley J. F.; Smith, Jennifer R.; Wang, Shur-Jen; Petri, Victoria; Nigam, Rajni; Tutaj, Marek; De Pons, Jeff; Dwinell, Melinda R.; Shimoyama, Mary

    2016-01-01

    The Rat Genome Database (RGD; http://rgd.mcw.edu/) provides critical datasets and software tools to a diverse community of rat and non-rat researchers worldwide. To meet the needs of the many users whose research is disease oriented, RGD has created a series of Disease Portals and has prioritized its curation efforts on the datasets important to understanding the mechanisms of various diseases. Gene-disease relationships for three species, rat, human and mouse, are annotated to capture biomarkers, genetic associations, molecular mechanisms and therapeutic targets. To generate gene–disease annotations more effectively and in greater detail, RGD initially adopted the MEDIC disease vocabulary from the Comparative Toxicogenomics Database and adapted it for use by expanding this framework with the addition of over 1000 terms to create the RGD Disease Ontology (RDO). The RDO provides the foundation for, at present, 10 comprehensive disease area-related dataset and analysis platforms at RGD, the Disease Portals. Two major disease areas are the focus of data acquisition and curation efforts each year, leading to the release of the related Disease Portals. Collaborative efforts to realize a more robust disease ontology are underway. Database URL: http://rgd.mcw.edu PMID:27009807

  19. Rapid Annotation of Anonymous Sequences from Genome Projects Using Semantic Similarities and a Weighting Scheme in Gene Ontology

    PubMed Central

    Fontana, Paolo; Cestaro, Alessandro; Velasco, Riccardo; Formentin, Elide; Toppo, Stefano

    2009-01-01

    Background Large-scale sequencing projects have now become routine lab practice and this has led to the development of a new generation of tools involving function prediction methods, bringing the latter back to the fore. The advent of Gene Ontology, with its structured vocabulary and paradigm, has provided computational biologists with an appropriate means for this task. Methodology We present here a novel method called ARGOT (Annotation Retrieval of Gene Ontology Terms) that is able to process quickly thousands of sequences for functional inference. The tool exploits for the first time an integrated approach which combines clustering of GO terms, based on their semantic similarities, with a weighting scheme which assesses retrieved hits sharing a certain number of biological features with the sequence to be annotated. These hits may be obtained by different methods and in this work we have based ARGOT processing on BLAST results. Conclusions The extensive benchmark involved 10,000 protein sequences, the complete S. cerevisiae genome and a small subset of proteins for purposes of comparison with other available tools. The algorithm was proven to outperform existing methods and to be suitable for function prediction of single proteins due to its high degree of sensitivity, specificity and coverage. PMID:19247487

  20. An Innovative Plant Genomics and Gene Annotation Program for High School, Community College, and University Faculty

    ERIC Educational Resources Information Center

    Hacisalihoglu, Gokhan; Hilgert, Uwe; Nash, E. Bruce; Micklos, David A.

    2008-01-01

    Today's biology educators face the challenge of training their students in modern molecular biology techniques including genomics and bioinformatics. The Dolan DNA Learning Center (DNALC) of Cold Spring Harbor Laboratory has developed and disseminated a bench- and computer-based plant genomics curriculum for biology faculty. In 2007, a five-day…

  1. Carbohydrate catabolic flexibility in the mammalian intestinal commensal Lactobacillus ruminis revealed by fermentation studies aligned to genome annotations

    PubMed Central

    2011-01-01

    Background Lactobacillus ruminis is a poorly characterized member of the Lactobacillus salivarius clade that is part of the intestinal microbiota of pigs, humans and other mammals. Its variable abundance in human and animals may be linked to historical changes over time and geographical differences in dietary intake of complex carbohydrates. Results In this study, we investigated the ability of nine L. ruminis strains of human and bovine origin to utilize fifty carbohydrates including simple sugars, oligosaccharides, and prebiotic polysaccharides. The growth patterns were compared with metabolic pathways predicted by annotation of a high quality draft genome sequence of ATCC 25644 (human isolate) and the complete genome of ATCC 27782 (bovine isolate). All of the strains tested utilized prebiotics including fructooligosaccharides (FOS), soybean-oligosaccharides (SOS) and 1,3:1,4-β-D-gluco-oligosaccharides to varying degrees. Six strains isolated from humans utilized FOS-enriched inulin, as well as FOS. In contrast, three strains isolated from cows grew poorly in FOS-supplemented medium. In general, carbohydrate utilisation patterns were strain-dependent and also varied depending on the degree of polymerisation or complexity of structure. Six putative operons were identified in the genome of the human isolate ATCC 25644 for the transport and utilisation of the prebiotics FOS, galacto-oligosaccharides (GOS), SOS, and 1,3:1,4-β-D-Gluco-oligosaccharides. One of these comprised a novel FOS utilisation operon with predicted capacity to degrade chicory-derived FOS. However, only three of these operons were identified in the ATCC 27782 genome that might account for the utilisation of only SOS and 1,3:1,4-β-D-Gluco-oligosaccharides. Conclusions This study has provided definitive genome-based evidence to support the fermentation patterns of nine strains of Lactobacillus ruminis, and has linked it to gene distribution patterns in strains from different sources. Furthermore

  2. Reconciliation of Sequence Data and Updated Annotation of the Genome of Agrobacterium tumefaciens C58, and Distribution of a Linear Chromosome in the Genus Agrobacterium

    PubMed Central

    Slater, Steven; Setubal, João C.; Houmiel, Kathryn; Sun, Jian; Kaul, Rajinder; Goldman, Barry S.; Farrand, Stephen K.; Almeida, Nalvo; Burr, Thomas; Nester, Eugene; Rhoads, David M.; Kadoi, Ryosuke; Ostheimer, Trucian; Pride, Nicole; Sabo, Allison; Henry, Erin; Telepak, Erin; Cromes, Lindsey; Harkleroad, Alana; Oliphant, Louis; Pratt-Szegila, Phil; Welch, Roy; Wood, Derek

    2013-01-01

    Two groups independently sequenced the Agrobacterium tumefaciens C58 genome in 2001. We report here consolidation of these sequences, updated annotation, and additional analysis of the evolutionary history of the linear chromosome, which is apparently limited to the biovar I group of Agrobacterium. PMID:23241979

  3. Functional Annotation Analytics of Bacillus Genomes Reveals Stress Responsive Acetate Utilization and Sulfate Uptake in the Biotechnologically Relevant Bacillus megaterium.

    PubMed

    Williams, Baraka S; Isokpehi, Raphael D; Mbah, Andreas N; Hollman, Antoinesha L; Bernard, Christina O; Simmons, Shaneka S; Ayensu, Wellington K; Garner, Bianca L

    2012-01-01

    Bacillus species form an heterogeneous group of Gram-positive bacteria that include members that are disease-causing, biotechnologically-relevant, and can serve as biological research tools. A common feature of Bacillus species is their ability to survive in harsh environmental conditions by formation of resistant endospores. Genes encoding the universal stress protein (USP) domain confer cellular and organismal survival during unfavorable conditions such as nutrient depletion. As of February 2012, the genome sequences and a variety of functional annotations for at least 123 Bacillus isolates including 45 Bacillus cereus isolates were available in public domain bioinformatics resources. Additionally, the genome sequencing status of 10 of the B. cereus isolates were annotated as finished with each genome encoded 3 USP genes. The conservation of gene neighborhood of the 140 aa universal stress protein in the B. cereus genomes led to the identification of a predicted plasmid-encoded transcriptional unit that includes a USP gene and a sulfate uptake gene in the soil-inhabiting Bacillus megaterium. Gene neighborhood analysis combined with visual analytics of chemical ligand binding sites data provided knowledge-building biological insights on possible cellular functions of B. megaterium universal stress proteins. These functions include sulfate and potassium uptake, acid extrusion, cellular energy-level sensing, survival in high oxygen conditions and acetate utilization. Of particular interest was a two-gene transcriptional unit that consisted of genes for a universal stress protein and a sirtuin Sir2 (deacetylase enzyme for NAD+-dependent acetate utilization). The predicted transcriptional units for stress responsive inorganic sulfate uptake and acetate utilization could explain biological mechanisms for survival of soil-inhabiting Bacillus species in sulfate and acetate limiting conditions. Considering the key role of sirtuins in mammalian physiology additional

  4. Functional Annotation Analytics of Bacillus Genomes Reveals Stress Responsive Acetate Utilization and Sulfate Uptake in the Biotechnologically Relevant Bacillus megaterium

    PubMed Central

    Williams, Baraka S.; Isokpehi, Raphael D.; Mbah, Andreas N.; Hollman, Antoinesha L.; Bernard, Christina O.; Simmons, Shaneka S.; Ayensu, Wellington K.; Garner, Bianca L.

    2012-01-01

    Bacillus species form an heterogeneous group of Gram-positive bacteria that include members that are disease-causing, biotechnologically-relevant, and can serve as biological research tools. A common feature of Bacillus species is their ability to survive in harsh environmental conditions by formation of resistant endospores. Genes encoding the universal stress protein (USP) domain confer cellular and organismal survival during unfavorable conditions such as nutrient depletion. As of February 2012, the genome sequences and a variety of functional annotations for at least 123 Bacillus isolates including 45 Bacillus cereus isolates were available in public domain bioinformatics resources. Additionally, the genome sequencing status of 10 of the B. cereus isolates were annotated as finished with each genome encoded 3 USP genes. The conservation of gene neighborhood of the 140 aa universal stress protein in the B. cereus genomes led to the identification of a predicted plasmid-encoded transcriptional unit that includes a USP gene and a sulfate uptake gene in the soil-inhabiting Bacillus megaterium. Gene neighborhood analysis combined with visual analytics of chemical ligand binding sites data provided knowledge-building biological insights on possible cellular functions of B. megaterium universal stress proteins. These functions include sulfate and potassium uptake, acid extrusion, cellular energy-level sensing, survival in high oxygen conditions and acetate utilization. Of particular interest was a two-gene transcriptional unit that consisted of genes for a universal stress protein and a sirtuin Sir2 (deacetylase enzyme for NAD+-dependent acetate utilization). The predicted transcriptional units for stress responsive inorganic sulfate uptake and acetate utilization could explain biological mechanisms for survival of soil-inhabiting Bacillus species in sulfate and acetate limiting conditions. Considering the key role of sirtuins in mammalian physiology additional

  5. From sequencing to annotating: extending the metaphor of the book of life from genetics to genomics.

    PubMed

    Hellsten, Iina

    2005-12-01

    The article discusses how the metaphor of the Book of Life was extended over time to cover the life cycle of the Human Genome Project from genetics to genomics. In particular, the focus is on the role of extendable metaphors in the debate on the Human Genome Project in three European newspapers, popular scientific journals and scientific and scholarly articles from 1990 to 2002. In these different domains of use, various parts of the metaphor were highlighted. The metaphor of Book of Life was mainly used to justify the continuation of the gene research from gene sequencing to comparative genomics. Readily extendable metaphors, such as the Book of Life, function as useful communicative tools both over time and across domains of use.

  6. Re-annotation, improved large-scale assembly and establishment of a catalogue of noncoding loci for the genome of the model brown alga Ectocarpus.

    PubMed

    Cormier, Alexandre; Avia, Komlan; Sterck, Lieven; Derrien, Thomas; Wucher, Valentin; Andres, Gwendoline; Monsoor, Misharl; Godfroy, Olivier; Lipinska, Agnieszka; Perrineau, Marie-Mathilde; Van De Peer, Yves; Hitte, Christophe; Corre, Erwan; Coelho, Susana M; Cock, J Mark

    2017-04-01

    The genome of the filamentous brown alga Ectocarpus was the first to be completely sequenced from within the brown algal group and has served as a key reference genome both for this lineage and for the stramenopiles. We present a complete structural and functional reannotation of the Ectocarpus genome. The large-scale assembly of the Ectocarpus genome was significantly improved and genome-wide gene re-annotation using extensive RNA-seq data improved the structure of 11 108 existing protein-coding genes and added 2030 new loci. A genome-wide analysis of splicing isoforms identified an average of 1.6 transcripts per locus. A large number of previously undescribed noncoding genes were identified and annotated, including 717 loci that produce long noncoding RNAs. Conservation of lncRNAs between Ectocarpus and another brown alga, the kelp Saccharina japonica, suggests that at least a proportion of these loci serve a function. Finally, a large collection of single nucleotide polymorphism-based markers was developed for genetic analyses. These resources are available through an updated and improved genome database. This study significantly improves the utility of the Ectocarpus genome as a high-quality reference for the study of many important aspects of brown algal biology and as a reference for genomic analyses across the stramenopiles.

  7. KEGG orthology-based annotation of the predicted proteome of Acropora digitifera: ZoophyteBase - an open access and searchable database of a coral genome

    PubMed Central

    2013-01-01

    Background Contemporary coral reef research has firmly established that a genomic approach is urgently needed to better understand the effects of anthropogenic environmental stress and global climate change on coral holobiont interactions. Here we present KEGG orthology-based annotation of the complete genome sequence of the scleractinian coral Acropora digitifera and provide the first comprehensive view of the genome of a reef-building coral by applying advanced bioinformatics. Description Sequences from the KEGG database of protein function were used to construct hidden Markov models. These models were used to search the predicted proteome of A. digitifera to establish complete genomic annotation. The annotated dataset is published in ZoophyteBase, an open access format with different options for searching the data. A particularly useful feature is the ability to use a Google-like search engine that links query words to protein attributes. We present features of the annotation that underpin the molecular structure of key processes of coral physiology that include (1) regulatory proteins of symbiosis, (2) planula and early developmental proteins, (3) neural messengers, receptors and sensory proteins, (4) calcification and Ca2+-signalling proteins, (5) plant-derived proteins, (6) proteins of nitrogen metabolism, (7) DNA repair proteins, (8) stress response proteins, (9) antioxidant and redox-protective proteins, (10) proteins of cellular apoptosis, (11) microbial symbioses and pathogenicity proteins, (12) proteins of viral pathogenicity, (13) toxins and venom, (14) proteins of the chemical defensome and (15) coral epigenetics. Conclusions We advocate that providing annotation in an open-access searchable database available to the public domain will give an unprecedented foundation to interrogate the fundamental molecular structure and interactions of coral symbiosis and allow critical questions to be addressed at the genomic level based on combined aspects of

  8. RATT: Rapid Annotation Transfer Tool

    PubMed Central

    Otto, Thomas D.; Dillon, Gary P.; Degrave, Wim S.; Berriman, Matthew

    2011-01-01

    Second-generation sequencing technologies have made large-scale sequencing projects commonplace. However, making use of these datasets often requires gene function to be ascribed genome wide. Although tool development has kept pace with the changes in sequence production, for tasks such as mapping, de novo assembly or visualization, genome annotation remains a challenge. We have developed a method to rapidly provide accurate annotation for new genomes using previously annotated genomes as a reference. The method, implemented in a tool called RATT (Rapid Annotation Transfer Tool), transfers annotations from a high-quality reference to a new genome on the basis of conserved synteny. We demonstrate that a Mycobacterium tuberculosis genome or a single 2.5 Mb chromosome from a malaria parasite can be annotated in less than five minutes with only modest computational resources. RATT is available at http://ratt.sourceforge.net. PMID:21306991

  9. Xenopus tropicalis Genome Re-Scaffolding and Re-Annotation Reach the Resolution Required for In Vivo ChIA-PET Analysis.

    PubMed

    Buisine, Nicolas; Ruan, Xiaoan; Bilesimo, Patrice; Grimaldi, Alexis; Alfama, Gladys; Ariyaratne, Pramila; Mulawadi, Fabianus; Chen, Jieqi; Sung, Wing-Kin; Liu, Edison T; Demeneix, Barbara A; Ruan, Yijun; Sachs, Laurent M

    2015-01-01

    Genome-wide functional analyses require high-resolution genome assembly and annotation. We applied ChIA-PET to analyze gene regulatory networks, including 3D chromosome interactions, underlying thyroid hormone (TH) signaling in the frog Xenopus tropicalis. As the available versions of Xenopus tropicalis assembly and annotation lacked the resolution required for ChIA-PET we improve the genome assembly version 4.1 and annotations using data derived from the paired end tag (PET) sequencing technologies and approaches (e.g., DNA-PET [gPET], RNA-PET etc.). The large insert (~10 Kb, ~17 Kb) paired end DNA-PET with high throughput NGS sequencing not only significantly improved genome assembly quality, but also strongly reduced genome "fragmentation", reducing total scaffold numbers by ~60%. Next, RNA-PET technology, designed and developed for the detection of full-length transcripts and fusion mRNA in whole transcriptome studies (ENCODE consortia), was applied to capture the 5' and 3' ends of transcripts. These amendments in assembly and annotation were essential prerequisites for the ChIA-PET analysis of TH transcription regulation. Their application revealed complex regulatory configurations of target genes and the structures of the regulatory networks underlying physiological responses. Our work allowed us to improve the quality of Xenopus tropicalis genomic resources, reaching the standard required for ChIA-PET analysis of transcriptional networks. We consider that the workflow proposed offers useful conceptual and methodological guidance and can readily be applied to other non-conventional models that have low-resolution genome data.

  10. Xenopus tropicalis Genome Re-Scaffolding and Re-Annotation Reach the Resolution Required for In Vivo ChIA-PET Analysis

    PubMed Central

    Buisine, Nicolas; Ruan, Xiaoan; Bilesimo, Patrice; Grimaldi, Alexis; Alfama, Gladys; Ariyaratne, Pramila; Mulawadi, Fabianus; Chen, Jieqi; Sung, Wing-Kin; Liu, Edison T.; Demeneix, Barbara A.; Ruan, Yijun; Sachs, Laurent M.

    2015-01-01

    Genome-wide functional analyses require high-resolution genome assembly and annotation. We applied ChIA-PET to analyze gene regulatory networks, including 3D chromosome interactions, underlying thyroid hormone (TH) signaling in the frog Xenopus tropicalis. As the available versions of Xenopus tropicalis assembly and annotation lacked the resolution required for ChIA-PET we improve the genome assembly version 4.1 and annotations using data derived from the paired end tag (PET) sequencing technologies and approaches (e.g., DNA-PET [gPET], RNA-PET etc.). The large insert (~10Kb, ~17Kb) paired end DNA-PET with high throughput NGS sequencing not only significantly improved genome assembly quality, but also strongly reduced genome “fragmentation”, reducing total scaffold numbers by ~60%. Next, RNA-PET technology, designed and developed for the detection of full-length transcripts and fusion mRNA in whole transcriptome studies (ENCODE consortia), was applied to capture the 5' and 3' ends of transcripts. These amendments in assembly and annotation were essential prerequisites for the ChIA-PET analysis of TH transcription regulation. Their application revealed complex regulatory configurations of target genes and the structures of the regulatory networks underlying physiological responses. Our work allowed us to improve the quality of Xenopus tropicalis genomic resources, reaching the standard required for ChIA-PET analysis of transcriptional networks. We consider that the workflow proposed offers useful conceptual and methodological guidance and can readily be applied to other non-conventional models that have low-resolution genome data. PMID:26348928

  11. Efficient assembly and annotation of the transcriptome of catfish by RNA-seq analysis of a doubled haploid homozygote

    Technology Transfer Automated Retrieval System (TEKTRAN)

    Upon the completion of whole genome sequencing, thorough genome annotation that associates genome sequences with biological meanings is essential. Genome annotation depends on the availability of transcript information as well as orthology information. In teleost fish, genome annotation is seriousl...

  12. Next-Generation High-Throughput Functional Annotation of Microbial Genomes

    PubMed Central

    Baric, Ralph S.; Damania, Blossom; Miller, Samuel I.; Rubin, Eric J.

    2016-01-01

    ABSTRACT Host infection by microbial pathogens cues global changes in microbial and host cell biology that facilitate microbial replication and disease. The complete maps of thousands of bacterial and viral genomes have recently been defined; however, the rate at which physiological or biochemical functions have been assigned to genes has greatly lagged. The National Institute of Allergy and Infectious Diseases (NIAID) addressed this gap by creating functional genomics centers dedicated to developing high-throughput approaches to assign gene function. These centers require broad-based and collaborative research programs to generate and integrate diverse data to achieve a comprehensive understanding of microbial pathogenesis. High-throughput functional genomics can lead to new therapeutics and better understanding of the next generation of emerging pathogens by rapidly defining new general mechanisms by which organisms cause disease and replicate in host tissues and by facilitating the rate at which functional data reach the scientific community. PMID:27703071

  13. Next Generation Models for Storage and Representation of Microbial Biological Annotation

    SciTech Connect

    Quest, Daniel J; Land, Miriam L; Brettin, Thomas S; Cottingham, Robert W

    2010-01-01

    Background Traditional genome annotation systems were developed in a very different computing era, one where the World Wide Web was just emerging. Consequently, these systems are built as centralized black boxes focused on generating high quality annotation submissions to GenBank/EMBL supported by expert manual curation. The exponential growth of sequence data drives a growing need for increasingly higher quality and automatically generated annotation. Typical annotation pipelines utilize traditional database technologies, clustered computing resources, Perl, C, and UNIX file systems to process raw sequence data, identify genes, and predict and categorize gene function. These technologies tightly couple the annotation software system to hardware and third party software (e.g. relational database systems and schemas). This makes annotation systems hard to reproduce, inflexible to modification over time, difficult to assess, difficult to partition across multiple geographic sites, and difficult to understand for those who are not domain experts. These systems are not readily open to scrutiny and therefore not scientifically tractable. The advent of Semantic Web standards such as Resource Description Framework (RDF) and OWL Web Ontology Language (OWL) enables us to construct systems that address these challenges in a new comprehensive way. Results Here, we develop a framework for linking traditional data to OWL-based ontologies in genome annotation. We show how data standards can decouple hardware and third party software tools from annotation pipelines, thereby making annotation pipelines easier to reproduce and assess. An illustrative example shows how TURTLE (Terse RDF Triple Language) can be used as a human readable, but also semantically-aware, equivalent to GenBank/EMBL files. Conclusions The power of this approach lies in its ability to assemble annotation data from multiple databases across multiple locations into a representation that is understandable to

  14. Genome annotation provides insight into carbon monoxide and hydrogen metabolism in Rubrivivax gelatinosus

    DOE PAGES

    Wawrousek, Karen; Noble, Scott; Korlach, Jonas; ...

    2014-12-05

    In this article, we report here the sequencing and analysis of the genome of the purple non-sulfur photosynthetic bacterium Rubrivivax gelatinosus CBS. This microbe is a model for studies of its carboxydotrophic life style under anaerobic condition, based on its ability to utilize carbon monoxide (CO) as the sole carbon substrate and water as the electron acceptor, yielding CO2 and H2 as the end products. The CO-oxidation reaction is known to be catalyzed by two enzyme complexes, the CO dehydrogenase and hydrogenase. As expected, analysis of the genome of Rx. gelatinosus CBS reveals the presence of genes encoding both enzymemore » complexes. The CO-oxidation reaction is CO-inducible, which is consistent with the presence of two putative CO-sensing transcription factors in its genome. Genome analysis also reveals the presence of two additional hydrogenases, an uptake hydrogenase that liberates the electrons in H2 in support of cell growth, and a regulatory hydrogenase that senses H2 and relays the signal to a two-component system that ultimately controls synthesis of the uptake hydrogenase. The genome also contains two sets of hydrogenase maturation genes which are known to assemble the catalytic metallocluster of the hydrogenase NiFe active site. Finally and collectively, the genome sequence and analysis information reveals the blueprint of an intricate network of signal transduction pathways and its underlying regulation that enables Rx. gelatinosus CBS to thrive on CO or H2 in support of cell growth.« less

  15. Genomic resources for gene discovery, functional genome annotation, and evolutionary studies of maize and its close relatives.

    PubMed

    Wang, Chao; Shi, Xue; Liu, Lin; Li, Haiyan; Ammiraju, Jetty S S; Kudrna, David A; Xiong, Wentao; Wang, Hao; Dai, Zhaozhao; Zheng, Yonglian; Lai, Jinsheng; Jin, Weiwei; Messing, Joachim; Bennetzen, Jeffrey L; Wing, Rod A; Luo, Meizhong

    2013-11-01

    Maize is one of the most important food crops and a key model for genetics and developmental biology. A genetically anchored and high-quality draft genome sequence of maize inbred B73 has been obtained to serve as a reference sequence. To facilitate evolutionary studies in maize and its close relatives, much like the Oryza Map Alignment Project (OMAP) (www.OMAP.org) bacterial artificial chromosome (BAC) resource did for the rice community, we constructed BAC libraries for maize inbred lines Zheng58, Chang7-2, and Mo17 and maize wild relatives Zea mays ssp. parviglumis and Tripsacum dactyloides. Furthermore, to extend functional genomic studies to maize and sorghum, we also constructed binary BAC (BIBAC) libraries for the maize inbred B73 and the sorghum landrace Nengsi-1. The BAC/BIBAC vectors facilitate transfer of large intact DNA inserts from BAC clones to the BIBAC vector and functional complementation of large DNA fragments. These seven Zea Map Alignment Project (ZMAP) BAC/BIBAC libraries have average insert sizes ranging from 92 to 148 kb, organellar DNA from 0.17 to 2.3%, empty vector rates between 0.35 and 5.56%, and genome equivalents of 4.7- to 8.4-fold. The usefulness of the Parviglumis and Tripsacum BAC libraries was demonstrated by mapping clones to the reference genome. Novel genes and alleles present in these ZMAP libraries can now be used for functional complementation studies and positional or homology-based cloning of genes for translational genomics.

  16. The Complete Genome Sequence and Updated Annotation of Desulfovibrio alaskensis G20

    SciTech Connect

    Hauser, Loren John; Wall, Judy D.; Brown, Steven D; Land, Miriam L; Bruce, David; Detter, J. Chris; Frank, Larimer; Goodwin, Lynne A.; Han, Cliff; Lapidus, Alla L.; Nolan, Matt; Palumbo, Anthony Vito; Pitluck, Samual; Keller, Kimberly L; Rapp-Giles, Barbara J; Price, Morgan N.; Lin, Monica; Tapia, Roxanne; Copeland, A; Cheng, Jan-Fang

    2011-01-01

    Desulfovibrio alaskensis G20 (formerly desulfuricans G20) is a Gram-negative mesophilic sulfate-reducing bacterium (SRB), known to corrode ferrous metals and to reduce toxic radionuclides and metals such as uranium and chromium to sparingly soluble and less toxic forms. We present the 3.7 Mb genome sequence to provide insights into its physiology.

  17. Complete Genome Sequence and Updated Annotation of Desulfovibrio alaskensis G20

    DOE PAGES

    Hauser, Loren J.; Land, Miriam L.; Brown, Steven D.; ...

    2011-06-17

    Desulfovibrio alaskensis G20 (formerly desulfuricans G20) is a Gram-negative mesophilic sulfate-reducing bacterium (SRB), known to corrode ferrous metals and to reduce toxic radionuclides and metals such as uranium and chromium to sparingly soluble and less toxic forms. We present the 3.7 Mb genome sequence to provide insights into its physiology.

  18. Functional annotation of rare gene aberration drivers of pancreatic cancer | Office of Cancer Genomics

    Cancer.gov

    As we enter the era of precision medicine, characterization of cancer genomes will directly influence therapeutic decisions in the clinic. Here we describe a platform enabling functionalization of rare gene mutations through their high-throughput construction, molecular barcoding and delivery to cancer models for in vivo tumour driver screens. We apply these technologies to identify oncogenic drivers of pancreatic ductal adenocarcinoma (PDAC).

  19. Genome sequence and annotation of Trichoderma parareesei, the ancestor of the cellulase producer Trichoderma reesei

    DOE PAGES

    Yang, Dongqing; Pomraning, Kyle; Kopchinskiy, Alexey; ...

    2015-08-13

    The filamentous fungus Trichoderma parareesei is the asexually reproducing ancestor of Trichoderma reesei, the holomorphic industrial producer of cellulase and hemicellulase. Here, we present the genome sequence of the T. parareesei type strain CBS 125925, which contains genes for 9,318 proteins.

  20. Integration of multiethnic fine-mapping and genomic annotation to prioritize candidate functional SNPs at prostate cancer susceptibility regions.

    PubMed

    Han, Ying; Hazelett, Dennis J; Wiklund, Fredrik; Schumacher, Fredrick R; Stram, Daniel O; Berndt, Sonja I; Wang, Zhaoming; Rand, Kristin A; Hoover, Robert N; Machiela, Mitchell J; Yeager, Merideth; Burdette, Laurie; Chung, Charles C; Hutchinson, Amy; Yu, Kai; Xu, Jianfeng; Travis, Ruth C; Key, Timothy J; Siddiq, Afshan; Canzian, Federico; Takahashi, Atsushi; Kubo, Michiaki; Stanford, Janet L; Kolb, Suzanne; Gapstur, Susan M; Diver, W Ryan; Stevens, Victoria L; Strom, Sara S; Pettaway, Curtis A; Al Olama, Ali Amin; Kote-Jarai, Zsofia; Eeles, Rosalind A; Yeboah, Edward D; Tettey, Yao; Biritwum, Richard B; Adjei, Andrew A; Tay, Evelyn; Truelove, Ann; Niwa, Shelley; Chokkalingam, Anand P; Isaacs, William B; Chen, Constance; Lindstrom, Sara; Le Marchand, Loic; Giovannucci, Edward L; Pomerantz, Mark; Long, Henry; Li, Fugen; Ma, Jing; Stampfer, Meir; John, Esther M; Ingles, Sue A; Kittles, Rick A; Murphy, Adam B; Blot, William J; Signorello, Lisa B; Zheng, Wei; Albanes, Demetrius; Virtamo, Jarmo; Weinstein, Stephanie; Nemesure, Barbara; Carpten, John; Leske, M Cristina; Wu, Suh-Yuh; Hennis, Anselm J M; Rybicki, Benjamin A; Neslund-Dudas, Christine; Hsing, Ann W; Chu, Lisa; Goodman, Phyllis J; Klein, Eric A; Zheng, S Lilly; Witte, John S; Casey, Graham; Riboli, Elio; Li, Qiyuan; Freedman, Matthew L; Hunter, David J; Gronberg, Henrik; Cook, Michael B; Nakagawa, Hidewaki; Kraft, Peter; Chanock, Stephen J; Easton, Douglas F; Henderson, Brian E; Coetzee, Gerhard A; Conti, David V; Haiman, Christopher A

    2015-10-01

    Interpretation of biological mechanisms underlying genetic risk associations for prostate cancer is complicated by the relatively large number of risk variants (n = 100) and the thousands of surrogate SNPs in linkage disequilibrium. Here, we combined three distinct approaches: multiethnic fine-mapping, putative functional annotation (based upon epigenetic data and genome-encoded features), and expression quantitative trait loci (eQTL) analyses, in an attempt to reduce this complexity. We examined 67 risk regions using genotyping and imputation-based fine-mapping in populations of European (cases/controls: 8600/6946), African (cases/controls: 5327/5136), Japanese (cases/controls: 2563/4391) and Latino (cases/controls: 1034/1046) ancestry. Markers at 55 regions passed a region-specific significance threshold (P-value cutoff range: 3.9 × 10(-4)-5.6 × 10(-3)) and in 30 regions we identified markers that were more significantly associated with risk than the previously reported variants in the multiethnic sample. Novel secondary signals (P < 5.0 × 10(-6)) were also detected in two regions (rs13062436/3q21 and rs17181170/3p12). Among 666 variants in the 55 regions with P-values within one order of magnitude of the most-associated marker, 193 variants (29%) in 48 regions overlapped with epigenetic or other putative functional marks. In 11 of the 55 regions, cis-eQTLs were detected with nearby genes. For 12 of the 55 regions (22%), the most significant region-specific, prostate-cancer associated variant represented the strongest candidate functional variant based on our annotations; the number of regions increased to 20 (36%) and 27 (49%) when examining the 2 and 3 most significantly associated variants in each region, respectively. These results have prioritized subsets of candidate variants for downstream functional evaluation.

  1. Integration of multiethnic fine-mapping and genomic annotation to prioritize candidate functional SNPs at prostate cancer susceptibility regions

    PubMed Central

    Han, Ying; Hazelett, Dennis J.; Wiklund, Fredrik; Schumacher, Fredrick R.; Stram, Daniel O.; Berndt, Sonja I.; Wang, Zhaoming; Rand, Kristin A.; Hoover, Robert N.; Machiela, Mitchell J.; Yeager, Merideth; Burdette, Laurie; Chung, Charles C.; Hutchinson, Amy; Yu, Kai; Xu, Jianfeng; Travis, Ruth C.; Key, Timothy J.; Siddiq, Afshan; Canzian, Federico; Takahashi, Atsushi; Kubo, Michiaki; Stanford, Janet L.; Kolb, Suzanne; Gapstur, Susan M.; Diver, W. Ryan; Stevens, Victoria L.; Strom, Sara S.; Pettaway, Curtis A.; Al Olama, Ali Amin; Kote-Jarai, Zsofia; Eeles, Rosalind A.; Yeboah, Edward D.; Tettey, Yao; Biritwum, Richard B.; Adjei, Andrew A.; Tay, Evelyn; Truelove, Ann; Niwa, Shelley; Chokkalingam, Anand P.; Isaacs, William B.; Chen, Constance; Lindstrom, Sara; Le Marchand, Loic; Giovannucci, Edward L.; Pomerantz, Mark; Long, Henry; Li, Fugen; Ma, Jing; Stampfer, Meir; John, Esther M.; Ingles, Sue A.; Kittles, Rick A.; Murphy, Adam B.; Blot, William J.; Signorello, Lisa B.; Zheng, Wei; Albanes, Demetrius; Virtamo, Jarmo; Weinstein, Stephanie; Nemesure, Barbara; Carpten, John; Leske, M. Cristina; Wu, Suh-Yuh; Hennis, Anselm J. M.; Rybicki, Benjamin A.; Neslund-Dudas, Christine; Hsing, Ann W.; Chu, Lisa; Goodman, Phyllis J.; Klein, Eric A.; Zheng, S. Lilly; Witte, John S.; Casey, Graham; Riboli, Elio; Li, Qiyuan; Freedman, Matthew L.; Hunter, David J.; Gronberg, Henrik; Cook, Michael B.; Nakagawa, Hidewaki; Kraft, Peter; Chanock, Stephen J.; Easton, Douglas F.; Henderson, Brian E.; Coetzee, Gerhard A.; Conti, David V.; Haiman, Christopher A.

    2015-01-01

    Interpretation of biological mechanisms underlying genetic risk associations for prostate cancer is complicated by the relatively large number of risk variants (n = 100) and the thousands of surrogate SNPs in linkage disequilibrium. Here, we combined three distinct approaches: multiethnic fine-mapping, putative functional annotation (based upon epigenetic data and genome-encoded features), and expression quantitative trait loci (eQTL) analyses, in an attempt to reduce this complexity. We examined 67 risk regions using genotyping and imputation-based fine-mapping in populations of European (cases/controls: 8600/6946), African (cases/controls: 5327/5136), Japanese (cases/controls: 2563/4391) and Latino (cases/controls: 1034/1046) ancestry. Markers at 55 regions passed a region-specific significance threshold (P-value cutoff range: 3.9 × 10−4–5.6 × 10−3) and in 30 regions we identified markers that were more significantly associated with risk than the previously reported variants in the multiethnic sample. Novel secondary signals (P < 5.0 × 10−6) were also detected in two regions (rs13062436/3q21 and rs17181170/3p12). Among 666 variants in the 55 regions with P-values within one order of magnitude of the most-associated marker, 193 variants (29%) in 48 regions overlapped with epigenetic or other putative functional marks. In 11 of the 55 regions, cis-eQTLs were detected with nearby genes. For 12 of the 55 regions (22%), the most significant region-specific, prostate-cancer associated variant represented the strongest candidate functional variant based on our annotations; the number of regions increased to 20 (36%) and 27 (49%) when examining the 2 and 3 most significantly associated variants in each region, respectively. These results have prioritized subsets of candidate variants for downstream functional evaluation. PMID:26162851

  2. "Controlled, cross-species dataset for exploring biases in genome annotation and modification profiles".

    PubMed

    McAfee, Alison; Michaud, Sarah; Foster, Leonard J

    2015-12-01

    Since the sequencing of the honey bee genome, proteomics by mass spectrometry has become increasingly popular for biological analyses of this insect; but we have observed that the number of honey bee protein identifications is consistently low compared to other organisms [1]. In this dataset, we use nanoelectrospray ionization-coupled liquid chromatography-tandem mass spectrometry (nLC-MS/MS) to systematically investigate the root cause of low honey bee proteome coverage. To this end, we present here data from three key experiments: a controlled, cross-species analyses of samples from Apis mellifera, Drosophila melanogaster, Caenorhabditis elegans, Saccharomyces cerevisiae, Mus musculus and Homo sapiens; a proteomic analysis of an individual honey bee whose genome was also sequenced; and a cross-tissue honey bee proteome comparison. The cross-species dataset was interrogated to determine relative proteome coverages between species, and the other two datasets were used to search for polymorphic sequences and to compare protein cleavage profiles, respectively.

  3. Functional Genomic Annotation of Genetic Risk Loci Highlights Inflammation and Epithelial Biology Networks in CKD

    PubMed Central

    Ledo, Nora; Ko, Yi-An; Park, Ae-Seo Deok; Kang, Hyun-Mi; Han, Sang-Youb; Choi, Peter

    2015-01-01

    Genome-wide association studies (GWASs) have identified multiple loci associated with the risk of CKD. Almost all risk variants are localized to the noncoding region of the genome; therefore, the role of these variants in CKD development is largely unknown. We hypothesized that polymorphisms alter transcription factor binding, thereby influencing the expression of nearby genes. Here, we examined the regulation of transcripts in the vicinity of CKD-associated polymorphisms in control and diseased human kidney samples and used systems biology approaches to identify potentially causal genes for prioritization. We interrogated the expression and regulation of 226 transcripts in the vicinity of 44 single nucleotide polymorphisms using RNA sequencing and gene expression arrays from 95 microdissected control and diseased tubule samples and 51 glomerular samples. Gene expression analysis from 41 tubule samples served for external validation. 92 transcripts in the tubule compartment and 34 transcripts in glomeruli showed statistically significant correlation with eGFR. Many novel genes, including ACSM2A/2B, FAM47E, and PLXDC1, were identified. We observed that the expression of multiple genes in the vicinity of any single CKD risk allele correlated with renal function, potentially indicating that genetic variants influence multiple transcripts. Network analysis of GFR-correlating transcripts highlighted two major clusters; a positive correlation with epithelial and vascular functions and an inverse correlation with inflammatory gene cluster. In summary, our functional genomics analysis highlighted novel genes and critical pathways associated with kidney function for future analysis. PMID:25231882

  4. An Innovative Plant Genomics and Gene Annotation Program for High School, Community College, and University Faculty

    PubMed Central

    Hilgert, Uwe; Nash, E. Bruce; Micklos, David A.

    2008-01-01

    Today's biology educators face the challenge of training their students in modern molecular biology techniques including genomics and bioinformatics. The Dolan DNA Learning Center (DNALC) of Cold Spring Harbor Laboratory has developed and disseminated a bench- and computer-based plant genomics curriculum for biology faculty. In 2007, a five-day “Plant Genomics and Gene Annotation” workshop was held at Florida A&M University in Tallahassee, FL, to enhance participants' knowledge and understanding of plant molecular genetics and assist them in developing and honing their laboratory and computer skills. Florida A&M University is a historically black university with over 95% African-American student enrollment. Sixteen participants, including high school (56%) and community college faculty (25%), attended the workshop. Participants carried out in vitro and in silico experiments with maize, Arabidopsis, soybean, and food products to determine the genotype of the samples. Benefits of the workshop included increased awareness of plant biology research for high school and college level students. Participants completed pre- and postworkshop evaluations for the measurement of effectiveness. Participants demonstrated an overall improvement in their postworkshop evaluation scores. This article provides a detailed description of workshop activities, as well as assessment and long-term support for broad classroom implementation. PMID:18765753

  5. GRAIL and GenQuest Sequence Annotation Tools

    SciTech Connect

    Xu, Ying; Shah, Manesh B.; Einstein, J. Ralph; Parang, Morey; Snoddy, Jay; Petrov, Sergey; Olman, Victor; Zhang, Ge; Mural, Richard J.; Uberbacher, Edward C.

    1997-12-31

    Our goal is to develop and implement an integrated intelligent system which can recognize biologically significant features in DNA sequence and provide insight into the organization and function of regions of genomic DNA. GRAIL is a modular expert system which facilitates the recognition of gene features and provides an environment for the construction of sequence annotation. The last several years have seen a rapid evolution of the technology for analyzing genomic DNA sequences. The current GRAIL systems (including the e-mail, XGRAIL, JAVA-GRAIL and genQuest systems) are perhaps the most widely used, comprehensive, and user friendly systems available for computational characterization of genomic DNA sequence.

  6. Functional Annotation, Genome Organization and Phylogeny of the Grapevine (Vitis vinifera) Terpene Synthase Gene Family Based on Genome Assembly, FLcDNA Cloning, and Enzyme Assays

    PubMed Central

    2010-01-01

    Background Terpenoids are among the most important constituents of grape flavour and wine bouquet, and serve as useful metabolite markers in viticulture and enology. Based on the initial 8-fold sequencing of a nearly homozygous Pinot noir inbred line, 89 putative terpenoid synthase genes (VvTPS) were predicted by in silico analysis of the grapevine (Vitis vinifera) genome assembly [1]. The finding of this very large VvTPS family, combined with the importance of terpenoid metabolism for the organoleptic properties of grapevine berries and finished wines, prompted a detailed examination of this gene family at the genomic level as well as an investigation into VvTPS biochemical functions. Results We present findings from the analysis of the up-dated 12-fold sequencing and assembly of the grapevine genome that place the number of predicted VvTPS genes at 69 putatively functional VvTPS, 20 partial VvTPS, and 63 VvTPS probable pseudogenes. Gene discovery and annotation included information about gene architecture and chromosomal location. A dense cluster of 45 VvTPS is localized on chromosome 18. Extensive FLcDNA cloning, gene synthesis, and protein expression enabled functional characterization of 39 VvTPS; this is the largest number of functionally characterized TPS for any species reported to date. Of these enzymes, 23 have unique functions and/or phylogenetic locations within the plant TPS gene family. Phylogenetic analyses of the TPS gene family showed that while most VvTPS form species-specific gene clusters, there are several examples of gene orthology with TPS of other plant species, representing perhaps more ancient VvTPS, which have maintained functions independent of speciation. Conclusions The highly expanded VvTPS gene family underpins the prominence of terpenoid metabolism in grapevine. We provide a detailed experimental functional annotation of 39 members of this important gene family in grapevine and comprehensive information about gene structure and

  7. Genome, Functional Gene Annotation, and Nuclear Transformation of the Heterokont Oleaginous Alga Nannochloropsis oceanica CCMP1779

    DTIC Science & Technology

    2012-11-15

    Assembly size 28.7 Mbp G+ C content 53.8% Protein coding genes 11,973 Average gene size 1,547 bp Average exons per gene 2.7 Average introns per gene 1.7...the heterokonts. A search of CCMP1779 genome identified 42 genes from 15 CAZy families that are predicted to encode glycosyltransferases (GT) and 44...Family N. oceanica C . reinhardtii A. thaliana H. sapiens C . elegans S. cerevisiae ABI3VP1 1 59 AP2-EREBP* 6 11 160 ARR-B 1 15 bHLH 6 4 160 154 50 7 BSD 1

  8. BABELOMICS: a systems biology perspective in the functional annotation of genome-scale experiments

    PubMed Central

    Al-Shahrour, Fátima; Minguez, Pablo; Tárraga, Joaquín; Montaner, David; Alloza, Eva; Vaquerizas, Juan M.; Conde, Lucía; Blaschke, Christian; Vera, Javier; Dopazo, Joaquín

    2006-01-01

    We present a new version of Babelomics, a complete suite of web tools for functional analysis of genome-scale experiments, with new and improved tools. New functionally relevant terms have been included such as CisRed motifs or bioentities obtained by text-mining procedures. An improved indexing has considerably speeded up several of the modules. An improved version of the FatiScan method for studying the coordinate behaviour of groups of functionally related genes is presented, along with a similar tool, the Gene Set Enrichment Analysis. Babelomics is now more oriented to test systems biology inspired hypotheses. Babelomics can be found at . PMID:16845052

  9. Comparisons of Shewanella strains based on genome annotations, modeling and experiments

    SciTech Connect

    Ong, Wai Kit; Vu, Trang; Lovendahl, Klaus N.; Llull, Jenna; Serres, Margaret; Romine, Margaret F.; Reed, Jennifer L.

    2014-01-01

    Shewanella is a genus of facultatively anaerobic, Gram-negative bacteria that have highly adaptable metabolism which allows them to thrive in diverse environments. This quality makes them attractive target bacteria for research in bioremediation and microbial fuel cell applications. Constraint-based modeling is a useful tool for helping researchers gain insights into the metabolic capabilities of these bacteria. However, Shewanella oneidensis MR-1 is the only strain with a genome-scale metabolic model constructed out of the 22 sequenced Shewanella strains.

  10. Enhanced genome annotation using structural profiles in the program 3D-PSSM.

    PubMed

    Kelley, L A; MacCallum, R M; Sternberg, M J

    2000-06-02

    A method (three-dimensional position-specific scoring matrix, 3D-PSSM) to recognise remote protein sequence homologues is described. The method combines the power of multiple sequence profiles with knowledge of protein structure to provide enhanced recognition and thus functional assignment of newly sequenced genomes. The method uses structural alignments of homologous proteins of similar three-dimensional structure in the structural classification of proteins (SCOP) database to obtain a structural equivalence of residues. These equivalences are used to extend multiply aligned sequences obtained by standard sequence searches. The resulting large superfamily-based multiple alignment is converted into a PSSM. Combined with secondary structure matching and solvation potentials, 3D-PSSM can recognise structural and functional relationships beyond state-of-the-art sequence methods. In a cross-validated benchmark on 136 homologous relationships unambiguously undetectable by position-specific iterated basic local alignment search tool (PSI-Blast), 3D-PSSM can confidently assign 18 %. The method was applied to the remaining unassigned regions of the Mycoplasma genitalium genome and an additional 13 regions were assigned with 95 % confidence. 3D-PSSM is available to the community as a web server: http://www.bmm.icnet.uk/servers/3dpssm

  11. GELBANK : A database of annotated two-dimensional gel electrophoresis patterns of biological systems with completed genomes.

    SciTech Connect

    Babnigg, G.; Giometti, C. S.; Biosciences Division

    2004-01-01

    GELBANK is a publicly available database of two-dimensional gel electrophoresis (2DE) gel patterns of proteomes from organisms with known genome information (available at and ftp://bioinformatics.anl.gov/gelbank/). Currently it includes 131 completed, mostly microbial proteomes available from the National Center for Biotechnology Information. A web interface allows the upload of 2D gel patterns and their annotation for registered users. The images are organized by species, tissue type, separation method, sample type and staining method. The database can be queried based on protein or 2DE-pattern attributes. A web interface allows registered users to assign molecular weight and pH gradient profiles to their own 2D gel patterns as well as to link protein identifications to a given spot on the pattern. The website presents all of the submitted 2D gel patterns where the end-user can dynamically display the images or parts of images along with molecular weight, pH profile information and linked protein identification. A collection of images can be selected for the creation of animations from which the user can select sub-regions of interest and unlimited 2D gel patterns for visualization. The website currently presents 233 identifications for 81 gel patterns for Homo sapiens, Methanococcus jannaschii, Pyro coccus furiosus, Shewanella oneidensis, Escherichia coli and Deinococcus radiodurans.

  12. Comprehensive transcriptome and improved genome annotation of Bacillus licheniformis WX-02.

    PubMed

    Guo, Jing; Cheng, Gang; Gou, Xiang-Yong; Xing, Feng; Li, Sen; Han, Yi-Chao; Wang, Long; Song, Jia-Ming; Shu, Cheng-Cheng; Chen, Shou-Wen; Chen, Ling-Ling

    2015-08-19

    The updated genome of Bacillus licheniformis WX-02 comprises a circular chromosome of 4286821 base-pairs containing 4512 protein-coding genes. We applied strand-specific RNA-sequencing to explore the transcriptome profiles of B. licheniformis WX-02 under normal and high-salt conditions (NaCl 6%). We identified 2381 co-expressed gene pairs constituting 871 operon structures. In addition, 1169 antisense transcripts and 90 small RNAs were detected. Systematic comparison of differentially expressed genes under different conditions revealed that genes involved in multiple functions were significantly repressed in long-term high salt adaptation process. Genes related to promotion of glutamic acid synthesis were activated by 6% NaCl, potentially explaining the high yield of γ-PGA under salt condition. This study will be useful for the optimization of crucial metabolic activities in this bacterium.

  13. Improved systematic tRNA gene annotation allows new insights into the evolution of mitochondrial tRNA structures and into the mechanisms of mitochondrial genome rearrangements

    PubMed Central

    Jühling, Frank; Pütz, Joern; Bernt, Matthias; Donath, Alexander; Middendorf, Martin; Florentz, Catherine; Stadler, Peter F.

    2012-01-01

    Transfer RNAs (tRNAs) are present in all types of cells as well as in organelles. tRNAs of animal mitochondria show a low level of primary sequence conservation and exhibit ‘bizarre’ secondary structures, lacking complete domains of the common cloverleaf. Such sequences are hard to detect and hence frequently missed in computational analyses and mitochondrial genome annotation. Here, we introduce an automatic annotation procedure for mitochondrial tRNA genes in Metazoa based on sequence and structural information in manually curated covariance models. The method, applied to re-annotate 1876 available metazoan mitochondrial RefSeq genomes, allows to distinguish between remaining functional genes and degrading ‘pseudogenes’, even at early stages of divergence. The subsequent analysis of a comprehensive set of mitochondrial tRNA genes gives new insights into the evolution of structures of mitochondrial tRNA sequences as well as into the mechanisms of genome rearrangements. We find frequent losses of tRNA genes concentrated in basal Metazoa, frequent independent losses of individual parts of tRNA genes, particularly in Arthropoda, and wide-spread conserved overlaps of tRNAs in opposite reading direction. Direct evidence for several recent Tandem Duplication-Random Loss events is gained, demonstrating that this mechanism has an impact on the appearance of new mitochondrial gene orders. PMID:22139921

  14. Sma3s: A Three-Step Modular Annotator for Large Sequence Datasets

    PubMed Central

    Muñoz-Mérida, Antonio; Viguera, Enrique; Claros, M. Gonzalo; Trelles, Oswaldo; Pérez-Pulido, Antonio J.

    2014-01-01

    Automatic sequence annotation is an essential component of modern ‘omics’ studies, which aim to extract information from large collections of sequence data. Most existing tools use sequence homology to establish evolutionary relationships and assign putative functions to sequences. However, it can be difficult to define a similarity threshold that achieves sufficient coverage without sacrificing annotation quality. Defining the correct configuration is critical and can be challenging for non-specialist users. Thus, the development of robust automatic annotation techniques that generate high-quality annotations without needing expert knowledge would be very valuable for the research community. We present Sma3s, a tool for automatically annotating very large collections of biological sequences from any kind of gene library or genome. Sma3s is composed of three modules that progressively annotate query sequences using either: (i) very similar homologues, (ii) orthologous sequences or (iii) terms enriched in groups of homologous sequences. We trained the system using several random sets of known sequences, demonstrating average sensitivity and specificity values of ∼85%. In conclusion, Sma3s is a versatile tool for high-throughput annotation of a wide variety of sequence datasets that outperforms the accuracy of other well-established annotation algorithms, and it can enrich existing database annotations and uncover previously hidden features. Importantly, Sma3s has already been used in the functional annotation of two published transcriptomes. PMID:24501397

  15. Sma3s: a three-step modular annotator for large sequence datasets.

    PubMed

    Muñoz-Mérida, Antonio; Viguera, Enrique; Claros, M Gonzalo; Trelles, Oswaldo; Pérez-Pulido, Antonio J

    2014-08-01

    Automatic sequence annotation is an essential component of modern 'omics' studies, which aim to extract information from large collections of sequence data. Most existing tools use sequence homology to establish evolutionary relationships and assign putative functions to sequences. However, it can be difficult to define a similarity threshold that achieves sufficient coverage without sacrificing annotation quality. Defining the correct configuration is critical and can be challenging for non-specialist users. Thus, the development of robust automatic annotation techniques that generate high-quality annotations without needing expert knowledge would be very valuable for the research community. We present Sma3s, a tool for automatically annotating very large collections of biological sequences from any kind of gene library or genome. Sma3s is composed of three modules that progressively annotate query sequences using either: (i) very similar homologues, (ii) orthologous sequences or (iii) terms enriched in groups of homologous sequences. We trained the system using several random sets of known sequences, demonstrating average sensitivity and specificity values of ~85%. In conclusion, Sma3s is a versatile tool for high-throughput annotation of a wide variety of sequence datasets that outperforms the accuracy of other well-established annotation algorithms, and it can enrich existing database annotations and uncover previously hidden features. Importantly, Sma3s has already been used in the functional annotation of two published transcriptomes.

  16. Functional annotation of HOT regions in the human genome: implications for human disease and cancer.

    PubMed

    Li, Hao; Chen, Hebing; Liu, Feng; Ren, Chao; Wang, Shengqi; Bo, Xiaochen; Shu, Wenjie

    2015-06-26

    Advances in genome-wide association studies (GWAS) and large-scale sequencing studies have resulted in an impressive and growing list of disease- and trait-associated genetic variants. Most studies have emphasised the discovery of genetic variation in coding sequences, however, the noncoding regulatory effects responsible for human disease and cancer biology have been substantially understudied. To better characterise the cis-regulatory effects of noncoding variation, we performed a comprehensive analysis of the genetic variants in HOT (high-occupancy target) regions, which are considered to be one of the most intriguing findings of recent large-scale sequencing studies. We observed that GWAS variants that map to HOT regions undergo a substantial net decrease and illustrate development-specific localisation during haematopoiesis. Additionally, genetic risk variants are disproportionally enriched in HOT regions compared with LOT (low-occupancy target) regions in both disease-relevant and cancer cells. Importantly, this enrichment is biased toward disease- or cancer-specific cell types. Furthermore, we observed that cancer cells generally acquire cancer-specific HOT regions at oncogenes through diverse mechanisms of cancer pathogenesis. Collectively, our findings demonstrate the key roles of HOT regions in human disease and cancer and represent a critical step toward further understanding disease biology, diagnosis, and therapy.

  17. Genetic fine-mapping and genomic annotation defines causal mechanisms at type 2 diabetes susceptibility loci

    PubMed Central

    Mahajan, Anubha; Locke, Adam; Rayner, N William; Robertson, Neil; Scott, Robert A; Prokopenko, Inga; Scott, Laura J; Green, Todd; Sparso, Thomas; Thuillier, Dorothee; Yengo, Loic; Grallert, Harald; Wahl, Simone; Frånberg, Mattias; Strawbridge, Rona J; Kestler, Hans; Chheda, Himanshu; Eisele, Lewin; Gustafsson, Stefan; Steinthorsdottir, Valgerdur; Thorleifsson, Gudmar; Qi, Lu; Karssen, Lennart C; van Leeuwen, Elisabeth M; Willems, Sara M; Li, Man; Chen, Han; Fuchsberger, Christian; Kwan, Phoenix; Ma, Clement; Linderman, Michael; Lu, Yingchang; Thomsen, Soren K; Rundle, Jana K; Beer, Nicola L; van de Bunt, Martijn; Chalisey, Anil; Kang, Hyun Min; Voight, Benjamin F; Abecasis, Goncalo R; Almgren, Peter; Baldassarre, Damiano; Balkau, Beverley; Benediktsson, Rafn; Blüher, Matthias; Boeing, Heiner; Bonnycastle, Lori L; Borringer, Erwin P; Burtt, Noël P; Carey, Jason; Charpentier, Guillaume; Chines, Peter S; Cornelis, Marilyn C; Couper, David J; Crenshaw, Andrew T; van Dam, Rob M; Doney, Alex SF; Dorkhan, Mozhgan; Edkins, Sarah; Eriksson, Johan G; Esko, Tonu; Eury, Elodie; Fadista, João; Flannick, Jason; Fontanillas, Pierre; Fox, Caroline; Franks, Paul W; Gertow, Karl; Gieger, Christian; Gigante, Bruna; Gottesman, Omri; Grant, George B; Grarup, Niels; Groves, Christopher J; Hassinen, Maija; Have, Christian T; Herder, Christian; Holmen, Oddgeir L; Hreidarsson, Astradur B; Humphries, Steve E; Hunter, David J; Jackson, Anne U; Jonsson, Anna; Jørgensen, Marit E; Jørgensen, Torben; Kerrison, Nicola D; Kinnunen, Leena; Klopp, Norman; Kong, Augustine; Kovacs, Peter; Kraft, Peter; Kravic, Jasmina; Langford, Cordelia; Leander, Karin; Liang, Liming; Lichtner, Peter; Lindgren, Cecilia M; Lindholm, Eero; Linneberg, Allan; Liu, Ching-Ti; Lobbens, Stéphane; Luan, Jian’an; Lyssenko, Valeriya; Männistö, Satu; McLeod, Olga; Meyer, Julia; Mihailov, Evelin; Mirza, Ghazala; Mühleisen, Thomas W; Müller-Nurasyid, Martina; Navarro, Carmen; Nöthen, Markus M; Oskolkov, Nikolay N; Owen, Katharine R; Palli, Domenico; Pechlivanis, Sonali; Perry, John RB; Platou, Carl GP; Roden, Michael; Ruderfer, Douglas; Rybin, Denis; van der Schouw, Yvonne T; Sennblad, Bengt; Sigurðsson, Gunnar; Stančáková, Alena; Steinbach, Gerald; Storm, Petter; Strauch, Konstantin; Stringham, Heather M; Sun, Qi; Thorand, Barbara; Tikkanen, Emmi; Tonjes, Anke; Trakalo, Joseph; Tremoli, Elena; Tuomi, Tiinamaija; Wennauer, Roman; Wood, Andrew R; Zeggini, Eleftheria; Dunham, Ian; Birney, Ewan; Pasquali, Lorenzo; Ferrer, Jorge; Loos, Ruth JF; Dupuis, Josée; Florez, Jose C; Boerwinkle, Eric; Pankow, James S; van Duijn, Cornelia; Sijbrands, Eric; Meigs, James B; Hu, Frank B; Thorsteinsdottir, Unnur; Stefansson, Kari; Lakka, Timo A; Rauramaa, Rainer; Stumvoll, Michael; Pedersen, Nancy L; Lind, Lars; Keinanen-Kiukaanniemi, Sirkka M; Korpi-Hyövälti, Eeva; Saaristo, Timo E; Saltevo, Juha; Kuusisto, Johanna; Laakso, Markku; Metspalu, Andres; Erbel, Raimund; Jöckel, Karl-Heinz; Moebus, Susanne; Ripatti, Samuli; Salomaa, Veikko; Ingelsson, Erik; Boehm, Bernhard O; Bergman, Richard N; Collins, Francis S; Mohlke, Karen L; Koistinen, Heikki; Tuomilehto, Jaakko; Hveem, Kristian; Njølstad, Inger; Deloukas, Panagiotis; Donnelly, Peter J; Frayling, Timothy M; Hattersley, Andrew T; de Faire, Ulf; Hamsten, Anders; Illig, Thomas; Peters, Annette; Cauchi, Stephane; Sladek, Rob; Froguel, Philippe; Hansen, Torben; Pedersen, Oluf; Morris, Andrew D; Palmer, Collin NA; Kathiresan, Sekar; Melander, Olle; Nilsson, Peter M; Groop, Leif C; Barroso, Inês; Langenberg, Claudia; Wareham, Nicholas J; O’Callaghan, Christopher A; Gloyn, Anna L; Altshuler, David; Boehnke, Michael; Teslovich, Tanya M; McCarthy, Mark I; Morris, Andrew P

    2015-01-01

    We performed fine-mapping of 39 established type 2 diabetes (T2D) loci in 27,206 cases and 57,574 controls of European ancestry. We identified 49 distinct association signals at these loci, including five mapping in/near KCNQ1. “Credible sets” of variants most likely to drive each distinct signal mapped predominantly to non-coding sequence, implying that T2D association is mediated through gene regulation. Credible set variants were enriched for overlap with FOXA2 chromatin immunoprecipitation binding sites in human islet and liver cells, including at MTNR1B, where fine-mapping implicated rs10830963 as driving T2D association. We confirmed that this T2D-risk allele increases FOXA2-bound enhancer activity in islet- and liver-derived cells. We observed allele-specific differences in NEUROD1 binding in islet-derived cells, consistent with evidence that the T2D-risk allele increases islet MTNR1B expression. Our study demonstrates how integration of genetic and genomic information can define molecular mechanisms through which variants underlying association signals exert their effects on disease. PMID:26551672

  18. Genetic fine mapping and genomic annotation defines causal mechanisms at type 2 diabetes susceptibility loci.

    PubMed

    Gaulton, Kyle J; Ferreira, Teresa; Lee, Yeji; Raimondo, Anne; Mägi, Reedik; Reschen, Michael E; Mahajan, Anubha; Locke, Adam; Rayner, N William; Robertson, Neil; Scott, Robert A; Prokopenko, Inga; Scott, Laura J; Green, Todd; Sparso, Thomas; Thuillier, Dorothee; Yengo, Loic; Grallert, Harald; Wahl, Simone; Frånberg, Mattias; Strawbridge, Rona J; Kestler, Hans; Chheda, Himanshu; Eisele, Lewin; Gustafsson, Stefan; Steinthorsdottir, Valgerdur; Thorleifsson, Gudmar; Qi, Lu; Karssen, Lennart C; van Leeuwen, Elisabeth M; Willems, Sara M; Li, Man; Chen, Han; Fuchsberger, Christian; Kwan, Phoenix; Ma, Clement; Linderman, Michael; Lu, Yingchang; Thomsen, Soren K; Rundle, Jana K; Beer, Nicola L; van de Bunt, Martijn; Chalisey, Anil; Kang, Hyun Min; Voight, Benjamin F; Abecasis, Gonçalo R; Almgren, Peter; Baldassarre, Damiano; Balkau, Beverley; Benediktsson, Rafn; Blüher, Matthias; Boeing, Heiner; Bonnycastle, Lori L; Bottinger, Erwin P; Burtt, Noël P; Carey, Jason; Charpentier, Guillaume; Chines, Peter S; Cornelis, Marilyn C; Couper, David J; Crenshaw, Andrew T; van Dam, Rob M; Doney, Alex S F; Dorkhan, Mozhgan; Edkins, Sarah; Eriksson, Johan G; Esko, Tonu; Eury, Elodie; Fadista, João; Flannick, Jason; Fontanillas, Pierre; Fox, Caroline; Franks, Paul W; Gertow, Karl; Gieger, Christian; Gigante, Bruna; Gottesman, Omri; Grant, George B; Grarup, Niels; Groves, Christopher J; Hassinen, Maija; Have, Christian T; Herder, Christian; Holmen, Oddgeir L; Hreidarsson, Astradur B; Humphries, Steve E; Hunter, David J; Jackson, Anne U; Jonsson, Anna; Jørgensen, Marit E; Jørgensen, Torben; Kao, Wen-Hong L; Kerrison, Nicola D; Kinnunen, Leena; Klopp, Norman; Kong, Augustine; Kovacs, Peter; Kraft, Peter; Kravic, Jasmina; Langford, Cordelia; Leander, Karin; Liang, Liming; Lichtner, Peter; Lindgren, Cecilia M; Lindholm, Eero; Linneberg, Allan; Liu, Ching-Ti; Lobbens, Stéphane; Luan, Jian'an; Lyssenko, Valeriya; Männistö, Satu; McLeod, Olga; Meyer, Julia; Mihailov, Evelin; Mirza, Ghazala; Mühleisen, Thomas W; Müller-Nurasyid, Martina; Navarro, Carmen; Nöthen, Markus M; Oskolkov, Nikolay N; Owen, Katharine R; Palli, Domenico; Pechlivanis, Sonali; Peltonen, Leena; Perry, John R B; Platou, Carl G P; Roden, Michael; Ruderfer, Douglas; Rybin, Denis; van der Schouw, Yvonne T; Sennblad, Bengt; Sigurðsson, Gunnar; Stančáková, Alena; Steinbach, Gerald; Storm, Petter; Strauch, Konstantin; Stringham, Heather M; Sun, Qi; Thorand, Barbara; Tikkanen, Emmi; Tonjes, Anke; Trakalo, Joseph; Tremoli, Elena; Tuomi, Tiinamaija; Wennauer, Roman; Wiltshire, Steven; Wood, Andrew R; Zeggini, Eleftheria; Dunham, Ian; Birney, Ewan; Pasquali, Lorenzo; Ferrer, Jorge; Loos, Ruth J F; Dupuis, Josée; Florez, Jose C; Boerwinkle, Eric; Pankow, James S; van Duijn, Cornelia; Sijbrands, Eric; Meigs, James B; Hu, Frank B; Thorsteinsdottir, Unnur; Stefansson, Kari; Lakka, Timo A; Rauramaa, Rainer; Stumvoll, Michael; Pedersen, Nancy L; Lind, Lars; Keinanen-Kiukaanniemi, Sirkka M; Korpi-Hyövälti, Eeva; Saaristo, Timo E; Saltevo, Juha; Kuusisto, Johanna; Laakso, Markku; Metspalu, Andres; Erbel, Raimund; Jöcke, Karl-Heinz; Moebus, Susanne; Ripatti, Samuli; Salomaa, Veikko; Ingelsson, Erik; Boehm, Bernhard O; Bergman, Richard N; Collins, Francis S; Mohlke, Karen L; Koistinen, Heikki; Tuomilehto, Jaakko; Hveem, Kristian; Njølstad, Inger; Deloukas, Panagiotis; Donnelly, Peter J; Frayling, Timothy M; Hattersley, Andrew T; de Faire, Ulf; Hamsten, Anders; Illig, Thomas; Peters, Annette; Cauchi, Stephane; Sladek, Rob; Froguel, Philippe; Hansen, Torben; Pedersen, Oluf; Morris, Andrew D; Palmer, Collin N A; Kathiresan, Sekar; Melander, Olle; Nilsson, Peter M; Groop, Leif C; Barroso, Inês; Langenberg, Claudia; Wareham, Nicholas J; O'Callaghan, Christopher A; Gloyn, Anna L; Altshuler, David; Boehnke, Michael; Teslovich, Tanya M; McCarthy, Mark I; Morris, Andrew P

    2015-12-01

    We performed fine mapping of 39 established type 2 diabetes (T2D) loci in 27,206 cases and 57,574 controls of European ancestry. We identified 49 distinct association signals at these loci, including five mapping in or near KCNQ1. 'Credible sets' of the variants most likely to drive each distinct signal mapped predominantly to noncoding sequence, implying that association with T2D is mediated through gene regulation. Credible set variants were enriched for overlap with FOXA2 chromatin immunoprecipitation binding sites in human islet and liver cells, including at MTNR1B, where fine mapping implicated rs10830963 as driving T2D association. We confirmed that the T2D risk allele for this SNP increases FOXA2-bound enhancer activity in islet- and liver-derived cells. We observed allele-specific differences in NEUROD1 binding in islet-derived cells, consistent with evidence that the T2D risk allele increases islet MTNR1B expression. Our study demonstrates how integration of genetic and genomic information can define molecular mechanisms through which variants underlying association signals exert their effects on disease.

  19. Algal functional annotation tool

    SciTech Connect

    2012-07-12

    Abstract BACKGROUND: Progress in genome sequencing is proceeding at an exponential pace, and several new algal genomes are becoming available every year. One of the challenges facing the community is the association of protein sequences encoded in the genomes with biological function. While most genome assembly projects generate annotations for predicted protein sequences, they are usually limited and integrate functional terms from a limited number of databases. Another challenge is the use of annotations to interpret large lists of 'interesting' genes generated by genome-scale datasets. Previously, these gene lists had to be analyzed across several independent biological databases, often on a gene-by-gene basis. In contrast, several annotation databases, such as DAVID, integrate data from multiple functional databases and reveal underlying biological themes of large gene lists. While several such databases have been constructed for animals, none is currently available for the study of algae. Due to renewed interest in algae as potential sources of biofuels and the emergence of multiple algal genome sequences, a significant need has arisen for such a database to process the growing compendiums of algal genomic data. DESCRIPTION: The Algal Functional Annotation Tool is a web-based comprehensive analysis suite integrating annotation data from several pathway, ontology, and protein family databases. The current version provides annotation for the model alga Chlamydomonas reinhardtii, and in the future will include additional genomes. The site allows users to interpret large gene lists by identifying associated functional terms, and their enrichment. Additionally, expression data for several experimental conditions were compiled and analyzed to provide an expression-based enrichment search. A tool to search for functionally-related genes based on gene expression across these conditions is also provided. Other features include dynamic visualization of genes on KEGG

  20. An integrated pipeline for next generation sequencing and annotation of the complete mitochondrial genome of the giant intestinal fluke, Fasciolopsis buski (Lankester, 1857) Looss, 1899

    PubMed Central

    Biswal, Devendra Kumar; Ghatani, Sudeep; Shylla, Jollin A.; Sahu, Ranjana; Mullapudi, Nandita

    2013-01-01

    Helminths include both parasitic nematodes (roundworms) and platyhelminths (trematode and cestode flatworms) that are abundant, and are of clinical importance. The genetic characterization of parasitic flatworms using advanced molecular tools is central to the diagnosis and control of infections. Although the nuclear genome houses suitable genetic markers (e.g., in ribosomal (r) DNA) for species identification and molecular characterization, the mitochondrial (mt) genome consistently provides a rich source of novel markers for informative systematics and epidemiological studies. In the last decade, there have been some important advances in mtDNA genomics of helminths, especially lung flukes, liver flukes and intestinal flukes. Fasciolopsis buski, often called the giant intestinal fluke, is one of the largest digenean trematodes infecting humans and found primarily in Asia, in particular the Indian subcontinent. Next-generation sequencing (NGS) technologies now provide opportunities for high throughput sequencing, assembly and annotation within a short span of time. Herein, we describe a high-throughput sequencing and bioinformatics pipeline for mt genomics for F. buski that emphasizes the utility of short read NGS platforms such as Ion Torrent and Illumina in successfully sequencing and assembling the mt genome using innovative approaches for PCR primer design as well as assembly. We took advantage of our NGS whole genome sequence data (unpublished so far) for F. buski and its comparison with available data for the Fasciola hepatica mtDNA as the reference genome for design of precise and specific primers for amplification of mt genome sequences from F. buski. A long-range PCR was carried out to create an NGS library enriched in mt DNA sequences. Two different NGS platforms were employed for complete sequencing, assembly and annotation of the F. buski mt genome. The complete mt genome sequences of the intestinal fluke comprise 14,118 bp and is thus the shortest

  1. Re-Annotator: Annotation Pipeline for Microarray Probe Sequences.

    PubMed

    Arloth, Janine; Bader, Daniel M; Röh, Simone; Altmann, Andre

    2015-01-01

    Microarray technologies are established approaches for high throughput gene expression, methylation and genotyping analysis. An accurate mapping of the array probes is essential to generate reliable biological findings. However, manufacturers of the microarray platforms typically provide incomplete and outdated annotation tables, which often rely on older genome and transcriptome versions that differ substantially from up-to-date sequence databases. Here, we present the Re-Annotator, a re-annotation pipeline for microarray probe sequences. It is primarily designed for gene expression microarrays but can also be adapted to other types of microarrays. The Re-Annotator uses a custom-built mRNA reference database to identify the positions of gene expression array probe sequences. We applied Re-Annotator to the Illumina Human-HT12 v4 microarray platform and found that about one quarter (25%) of the probes differed from the manufacturer's annotation. In further computational experiments on experimental gene expression data, we compared Re-Annotator to another probe re-annotation tool, ReMOAT, and found that Re-Annotator provided an improved re-annotation of microarray probes. A thorough re-annotation of probe information is crucial to any microarray analysis. The Re-Annotator pipeline is freely available at http://sourceforge.net/projects/reannotator along with re-annotated files for Illumina microarrays HumanHT-12 v3/v4 and MouseRef-8 v2.

  2. Improving the genome annotation of the acarbose producer Actinoplanes sp. SE50/110 by sequencing enriched 5'-ends of primary transcripts.

    PubMed

    Schwientek, Patrick; Neshat, Armin; Kalinowski, Jörn; Klein, Andreas; Rückert, Christian; Schneiker-Bekel, Susanne; Wendler, Sergej; Stoye, Jens; Pühler, Alfred

    2014-11-20

    Actinoplanes sp. SE50/110 is the producer of the alpha-glucosidase inhibitor acarbose, which is an economically relevant and potent drug in the treatment of type-2 diabetes mellitus. In this study, we present the detection of transcription start sites on this genome by sequencing enriched 5'-ends of primary transcripts. Altogether, 1427 putative transcription start sites were initially identified. With help of the annotated genome sequence, 661 transcription start sites were found to belong to the leader region of protein-coding genes with the surprising result that roughly 20% of these genes rank among the class of leaderless transcripts. Next, conserved promoter motifs were identified for protein-coding genes with and without leader sequences. The mapped transcription start sites were finally used to improve the annotation of the Actinoplanes sp. SE50/110 genome sequence. Concerning protein-coding genes, 41 translation start sites were corrected and 9 novel protein-coding genes could be identified. In addition to this, 122 previously undetermined non-coding RNA (ncRNA) genes of Actinoplanes sp. SE50/110 were defined. Focusing on antisense transcription start sites located within coding genes or their leader sequences, it was discovered that 96 of those ncRNA genes belong to the class of antisense RNA (asRNA) genes. The remaining 26 ncRNA genes were found outside of known protein-coding genes. Four chosen examples of prominent ncRNA genes, namely the transfer messenger RNA gene ssrA, the ribonuclease P class A RNA gene rnpB, the cobalamin riboswitch RNA gene cobRS, and the selenocysteine-specific tRNA gene selC, are presented in more detail. This study demonstrates that sequencing of enriched 5'-ends of primary transcripts and the identification of transcription start sites are valuable tools for advanced genome annotation of Actinoplanes sp. SE50/110 and most probably also for other bacteria.

  3. Integrating genome annotation and QTL position to identify candidate genes for productivity, architecture and water-use efficiency in Populus spp

    PubMed Central

    2012-01-01

    Background Hybrid poplars species are candidates for biomass production but breeding efforts are needed to combine productivity and water use efficiency in improved cultivars. The understanding of the genetic architecture of growth in poplar by a Quantitative Trait Loci (QTL) approach can help us to elucidate the molecular basis of such integrative traits but identifying candidate genes underlying these QTLs remains difficult. Nevertheless, the increase of genomic information together with the accessibility to a reference genome sequence (Populus trichocarpa Nisqually-1) allow to bridge QTL information on genetic maps and physical location of candidate genes on the genome. The objective of the study is to identify QTLs controlling productivity, architecture and leaf traits in a P. deltoides x P. trichocarpa F1 progeny and to identify candidate genes underlying QTLs based on the anchoring of genetic maps on the genome and the gene ontology information linked to genome annotation. The strategy to explore genome annotation was to use Gene Ontology enrichment tools to test if some functional categories are statistically over-represented in QTL regions. Results Four leaf traits and 7 growth traits were measured on 330 F1 P. deltoides x P. trichocarpa progeny. A total of 77 QTLs controlling 11 traits were identified explaining from 1.8 to 17.2% of the variation of traits. For 58 QTLs, confidence intervals could be projected on the genome. An extended functional annotation was built based on data retrieved from the plant genome database Phytozome and from an inference of function using homology between Populus and the model plant Arabidopsis. Genes located within QTL confidence intervals were retrieved and enrichments in gene ontology (GO) terms were determined using different methods. Significant enrichments were found for all traits. Particularly relevant biological processes GO terms were identified for QTLs controlling number of sylleptic branches: intervals were

  4. The Ensembl gene annotation system

    PubMed Central

    Aken, Bronwen L.; Ayling, Sarah; Barrell, Daniel; Clarke, Laura; Curwen, Valery; Fairley, Susan; Fernandez Banet, Julio; Billis, Konstantinos; García Girón, Carlos; Hourlier, Thibaut; Howe, Kevin; Kähäri, Andreas; Kokocinski, Felix; Martin, Fergal J.; Murphy, Daniel N.; Nag, Rishi; Ruffier, Magali; Schuster, Michael; Tang, Y. Amy; Vogel, Jan-Hinnerk; White, Simon; Zadissa, Amonida; Flicek, Paul

    2016-01-01

    The Ensembl gene annotation system has been used to annotate over 70 different vertebrate species across a wide range of genome projects. Furthermore, it generates the automatic alignment-based annotation for the human and mouse GENCODE gene sets. The system is based on the alignment of biological sequences, including cDNAs, proteins and RNA-seq reads, to the target genome in order to construct candidate transcript models. Careful assessment and filtering of these candidate transcripts ultimately leads to the final gene set, which is made available on the Ensembl website. Here, we describe the annotation process in detail. Database URL: http://www.ensembl.org/index.html PMID:27337980

  5. Genomic research, publics and experts in Latin America: Nation, race and body

    PubMed Central

    Wade, Peter; López-Beltrán, Carlos; Restrepo, Eduardo; Santos, Ricardo Ventura

    2015-01-01

    The articles in this issue highlight contributions that studies of Latin America can make to wider debates about the effects of genomic science on public ideas about race and nation. We argue that current ideas about the power of genomics to transfigure and transform existing ways of thinking about human diversity are often overstated. If a range of social contexts are examined, the effects are uneven. Our data show that genomic knowledge can unsettle and reinforce ideas of nation and race; it can be both banal and highly politicized. In this introduction, we outline concepts of genetic knowledge in society; theories of genetics, nation and race; approaches to public understandings of science; and the Latin American contexts of transnational ideas of nation and race. PMID:27479996

  6. Genomic research, publics and experts in Latin America: Nation, race and body.

    PubMed

    Wade, Peter; López-Beltrán, Carlos; Restrepo, Eduardo; Santos, Ricardo Ventura

    2015-12-01

    The articles in this issue highlight contributions that studies of Latin America can make to wider debates about the effects of genomic science on public ideas about race and nation. We argue that current ideas about the power of genomics to transfigure and transform existing ways of thinking about human diversity are often overstated. If a range of social contexts are examined, the effects are uneven. Our data show that genomic knowledge can unsettle and reinforce ideas of nation and race; it can be both banal and highly politicized. In this introduction, we outline concepts of genetic knowledge in society; theories of genetics, nation and race; approaches to public understandings of science; and the Latin American contexts of transnational ideas of nation and race.

  7. RNA-Seq of Guar (Cyamopsis tetragonoloba, L. Taub.) Leaves: De novo Transcriptome Assembly, Functional Annotation and Development of Genomic Resources

    PubMed Central

    Tanwar, Umesh K.; Pruthi, Vikas; Randhawa, Gursharn S.

    2017-01-01

    Genetic improvement in industrially important guar (Cyamopsis tetragonoloba, L. Taub.) crop has been hindered due to the lack of sufficient genomic or transcriptomic resources. In this study, RNA-Seq technology was employed to characterize the transcriptome of leaf tissues from two guar varieties, namely, M-83 and RGC-1066. Approximately 30 million high-quality pair-end reads of each variety generated by Illumina HiSeq platform were used for de novo assembly by Trinity program. A total of 62,146 non-redundant unigenes with an average length of 679 bp were obtained. The quality assessment of assembled unigenes revealed 87.50% of complete and 97.18% partial core eukaryotic genes (CEGs). Sequence similarity analyses and annotation of the unigenes against non-redundant protein (Nr) and Gene Ontology (GO) databases identified 175,882 GO annotations. A total of 11,308 guar unigenes were annotated with various enzyme codes (EC) and categorized in six categories with 55 subclasses. The annotation of biochemical pathways resulted in a total of 11,971 unigenes assigned with 145 KEGG maps and 1759 enzyme codes. The species distribution analysis of the unigenes showed highest similarity with Glycine max genes. A total of 5773 potential simple sequence repeats (SSRs) and 3594 high-quality single nucleotide polymorphisms (SNPs) were identified. Out of 20 randomly selected SSRs for wet laboratory validation, 13 showed consistent PCR amplification in both guar varieties. In silico studies identified 145 polymorphic SSR markers in two varieties. To the best of our knowledge, this is the first report on transcriptome analysis and SNPs identification in guar till date. PMID:28210265

  8. Commonality of functional annotation: a method for prioritization of candidate genes from genome-wide linkage studies†

    PubMed Central

    Shriner, Daniel; Baye, Tesfaye M.; Padilla, Miguel A.; Zhang, Shiju; Vaughan, Laura K.; Loraine, Ann E.

    2008-01-01

    Linkage studies of complex traits frequently yield multiple linkage regions covering hundreds of genes. Testing each candidate gene from every region is prohibitively expensive and computational methods that simplify this process would benefit genetic research. We present a new method based on commonality of functional annotation (CFA) that aids dissection of complex traits for which multiple causal genes act in a single pathway or process. CFA works by testing individual Gene Ontology (GO) terms for enrichment among candidate gene pools, performs multiple hypothesis testing adjustment using an estimate of independent tests based on correlation of GO terms, and then scores and ranks genes annotated with significantly-enriched terms based on the number of quantitative trait loci regions in which genes bearing those annotations appear. We evaluate CFA using simulated linkage data and show that CFA has good power despite being conservative. We apply CFA to published linkage studies investigating age-of-onset of Alzheimer's disease and body mass index and obtain previously known and new candidate genes. CFA provides a new tool for studies in which causal genes are expected to participate in a common pathway or process and can easily be extended to utilize annotation schemes in addition to the GO. PMID:18263617

  9. Quality of computationally inferred gene ontology annotations.

    PubMed

    Skunca, Nives; Altenhoff, Adrian; Dessimoz, Christophe

    2012-05-01

    Gene Ontology (GO) has established itself as the undisputed standard for protein function annotation. Most annotations are inferred electronically, i.e. without individual curator supervision, but they are widely considered unreliable. At the same time, we crucially depend on those automated annotations, as most newly sequenced genomes are non-model organisms. Here, we introduce a methodology to systematically and quantitatively evaluate electronic annotations. By exploiting changes in successive releases of the UniProt Gene Ontology Annotation database, we assessed the quality of electronic annotations in terms of specificity, reliability, and coverage. Overall, we not only found that electronic annotations have significantly improved in recent years, but also that their reliability now rivals that of annotations inferred by curators when they use evidence other than experiments from primary literature. This work provides the means to identify the subset of electronic annotations that can be relied upon-an important outcome given that >98% of all annotations are inferred without direct curation.

  10. Custom Array Comparative Genomic Hybridization: the Importance of DNA Quality, an Expert Eye, and Variant Validation

    PubMed Central

    Lantieri, Francesca; Malacarne, Michela; Gimelli, Stefania; Santamaria, Giuseppe; Coviello, Domenico; Ceccherini, Isabella

    2017-01-01

    The presence of false positive and false negative results in the Array Comparative Genomic Hybridization (aCGH) design is poorly addressed in literature reports. We took advantage of a custom aCGH recently carried out to analyze its design performance, the use of several Agilent aberrations detection algorithms, and the presence of false results. Our study provides a confirmation that the high density design does not generate more noise than standard designs and, might reach a good resolution. We noticed a not negligible presence of false negative and false positive results in the imbalances call performed by the Agilent software. The Aberration Detection Method 2 (ADM-2) algorithm with a threshold of 6 performed quite well, and the array design proved to be reliable, provided that some additional filters are applied, such as considering only intervals with average absolute log2ratio above 0.3. We also propose an additional filter that takes into account the proportion of probes with log2ratio exceeding suggestive values for gain or loss. In addition, the quality of samples was confirmed to be a crucial parameter. Finally, this work raises the importance of evaluating the samples profiles by eye and the necessity of validating the imbalances detected. PMID:28287439

  11. Functional Annotation of the Ophiostoma novo-ulmi Genome: Insights into the Phytopathogenicity of the Fungal Agent of Dutch Elm Disease

    PubMed Central

    Comeau, André M.; Dufour, Josée; Bouvet, Guillaume F.; Jacobi, Volker; Nigg, Martha; Henrissat, Bernard; Laroche, Jérôme; Levesque, Roger C.; Bernier, Louis

    2015-01-01

    The ascomycete fungus Ophiostoma novo-ulmi is responsible for the pandemic of Dutch elm disease that has been ravaging Europe and North America for 50 years. We proceeded to annotate the genome of the O. novo-ulmi strain H327 that was sequenced in 2012. The 31.784-Mb nuclear genome (50.1% GC) is organized into 8 chromosomes containing a total of 8,640 protein-coding genes that we validated with RNA sequencing analysis. Approximately 53% of these genes have their closest match to Grosmannia clavigera kw1407, followed by 36% in other close Sordariomycetes, 5% in other Pezizomycotina, and surprisingly few (5%) orphans. A relatively small portion (∼3.4%) of the genome is occupied by repeat sequences; however, the mechanism of repeat-induced point mutation appears active in this genome. Approximately 76% of the proteins could be assigned functions using Gene Ontology analysis; we identified 311 carbohydrate-active enzymes, 48 cytochrome P450s, and 1,731 proteins potentially involved in pathogen–host interaction, along with 7 clusters of fungal secondary metabolites. Complementary mating-type locus sequencing, mating tests, and culturing in the presence of elm terpenes were conducted. Our analysis identified a specific genetic arsenal impacting the sexual and vegetative growth, phytopathogenicity, and signaling/plant–defense–degradation relationship between O. novo-ulmi and its elm host and insect vectors. PMID:25539722

  12. IMG 4 version of the integrated microbial genomes comparative analysis system

    SciTech Connect

    Markowitz, Victor M.; Chen, I-Min A.; Palaniappan, Krishna; Chu, Ken; Szeto, Ernest; Pillay, Manoj; Ratner, Anna; Huang, Jinghua; Woyke, Tanja; Huntemann, Marcel; Anderson, Iain; Billis, Konstantinos; Varghese, Neha; Mavromatis, Konstantinos; Pati, Amrita; Ivanova, Natalia N.; Kyrpides, Nikos C.

    2013-10-27

    The Integrated Microbial Genomes (IMG) data warehouse integrates genomes from all three domains of life, as well as plasmids, viruses and genome fragments. IMG provides tools for analyzing and reviewing the structural and functional annotations of genomes in a comparative context. IMG’s data content and analytical capabilities have increased continuously since its first version released in 2005. Since the last report published in the 2012 NAR Database Issue, IMG’s annotation and data integration pipelines have evolved while new tools have been added for recording and analyzing single cell genomes, RNA Seq and biosynthetic cluster data. Finally, different IMG datamarts provide support for the analysis of publicly available genomes (IMG/W: http://img.jgi.doe.gov/w), expert review of genome annotations (IMG/ER: http://img.jgi.doe.gov/er) and teaching and training in the area of microbial genome analysis (IMG/EDU: http://img.jgi.doe.gov/edu).

  13. IMG 4 version of the integrated microbial genomes comparative analysis system

    PubMed Central

    Markowitz, Victor M.; Chen, I-Min A.; Palaniappan, Krishna; Chu, Ken; Szeto, Ernest; Pillay, Manoj; Ratner, Anna; Huang, Jinghua; Woyke, Tanja; Huntemann, Marcel; Anderson, Iain; Billis, Konstantinos; Varghese, Neha; Mavromatis, Konstantinos; Pati, Amrita; Ivanova, Natalia N.; Kyrpides, Nikos C.

    2014-01-01

    The Integrated Microbial Genomes (IMG) data warehouse integrates genomes from all three domains of life, as well as plasmids, viruses and genome fragments. IMG provides tools for analyzing and reviewing the structural and functional annotations of genomes in a comparative context. IMG’s data content and analytical capabilities have increased continuously since its first version released in 2005. Since the last report published in the 2012 NAR Database Issue, IMG’s annotation and data integration pipelines have evolved while new tools have been added for recording and analyzing single cell genomes, RNA Seq and biosynthetic cluster data. Different IMG datamarts provide support for the analysis of publicly available genomes (IMG/W: http://img.jgi.doe.gov/w), expert review of genome annotations (IMG/ER: http://img.jgi.doe.gov/er) and teaching and training in the area of microbial genome analysis (IMG/EDU: http://img.jgi.doe.gov/edu). PMID:24165883

  14. IMG 4 version of the integrated microbial genomes comparative analysis system.

    PubMed

    Markowitz, Victor M; Chen, I-Min A; Palaniappan, Krishna; Chu, Ken; Szeto, Ernest; Pillay, Manoj; Ratner, Anna; Huang, Jinghua; Woyke, Tanja; Huntemann, Marcel; Anderson, Iain; Billis, Konstantinos; Varghese, Neha; Mavromatis, Konstantinos; Pati, Amrita; Ivanova, Natalia N; Kyrpides, Nikos C

    2014-01-01

    The Integrated Microbial Genomes (IMG) data warehouse integrates genomes from all three domains of life, as well as plasmids, viruses and genome fragments. IMG provides tools for analyzing and reviewing the structural and functional annotations of genomes in a comparative context. IMG's data content and analytical capabilities have increased continuously since its first version released in 2005. Since the last report published in the 2012 NAR Database Issue, IMG's annotation and data integration pipelines have evolved while new tools have been added for recording and analyzing single cell genomes, RNA Seq and biosynthetic cluster data. Different IMG datamarts provide support for the analysis of publicly available genomes (IMG/W: http://img.jgi.doe.gov/w), expert review of genome annotations (IMG/ER: http://img.jgi.doe.gov/er) and teaching and training in the area of microbial genome analysis (IMG/EDU: http://img.jgi.doe.gov/edu).

  15. Graph-based sequence annotation using a data integration approach.

    PubMed

    Pesch, Robert; Lysenko, Artem; Hindle, Matthew; Hassani-Pak, Keywan; Thiele, Ralf; Rawlings, Christopher; Köhler, Jacob; Taubert, Jan

    2008-08-25

    The automated annotation of data from high throughput sequencing and genomics experiments is a significant challenge for bioinformatics. Most current approaches rely on sequential pipelines of gene finding and gene function prediction methods that annotate a gene with information from different reference data sources. Each function prediction method contributes evidence supporting a functional assignment. Such approaches generally ignore the links between the information in the reference datasets. These links, however, are valuable for assessing the plausibility of a function assignment and can be used to evaluate the confidence in a prediction. We are working towards a novel annotation system that uses the network of information supporting the function assignment to enrich the annotation process for use by expert curators and predicting the function of previously unannotated genes. In this paper we describe our success in the first stages of this development. We present the data integration steps that are needed to create the core database of integrated reference databases (UniProt, PFAM, PDB, GO and the pathway database Ara-Cyc) which has been established in the ONDEX data integration system. We also present a comparison between different methods for integration of GO terms as part of the function assignment pipeline and discuss the consequences of this analysis for improving the accuracy of gene function annotation. The methods and algorithms presented in this publication are an integral part of the ONDEX system which is freely available from http://ondex.sf.net/.

  16. Annotation of a hybrid partial genome of the coffee rust (Hemileia vastatrix) contributes to the gene repertoire catalog of the Pucciniales

    PubMed Central

    Cristancho, Marco A.; Botero-Rozo, David Octavio; Giraldo, William; Tabima, Javier; Riaño-Pachón, Diego Mauricio; Escobar, Carolina; Rozo, Yomara; Rivera, Luis F.; Durán, Andrés; Restrepo, Silvia; Eilam, Tamar; Anikster, Yehoshua; Gaitán, Alvaro L.

    2014-01-01

    Coffee leaf rust caused by the fungus Hemileia vastatrix is the most damaging disease to coffee worldwide. The pathogen has recently appeared in multiple outbreaks in coffee producing countries resulting in significant yield losses and increases in costs related to its control. New races/isolates are constantly emerging as evidenced by the presence of the fungus in plants that were previously resistant. Genomic studies are opening new avenues for the study of the evolution of pathogens, the detailed description of plant-pathogen interactions and the development of molecular techniques for the identification of individual isolates. For this purpose we sequenced 8 different H. vastatrix isolates using NGS technologies and gathered partial genome assemblies due to the large repetitive content in the coffee rust hybrid genome; 74.4% of the assembled contigs harbor repetitive sequences. A hybrid assembly of 333 Mb was built based on the 8 isolates; this assembly was used for subsequent analyses. Analysis of the conserved gene space showed that the hybrid H. vastatrix genome, though highly fragmented, had a satisfactory level of completion with 91.94% of core protein-coding orthologous genes present. RNA-Seq from urediniospores was used to guide the de novo annotation of the H. vastatrix gene complement. In total, 14,445 genes organized in 3921 families were uncovered; a considerable proportion of the predicted proteins (73.8%) were homologous to other Pucciniales species genomes. Several gene families related to the fungal lifestyle were identified, particularly 483 predicted secreted proteins that represent candidate effector genes and will provide interesting hints to decipher virulence in the coffee rust fungus. The genome sequence of Hva will serve as a template to understand the molecular mechanisms used by this fungus to attack the coffee plant, to study the diversity of this species and for the development of molecular markers to distinguish races/isolates. PMID

  17. Genome annotation of a 1.5 Mb region of human chromosome 6q23 encompassing a quantitative trait locus for fetal hemoglobin expression in adults

    PubMed Central

    Close, James; Game, Laurence; Clark, Barnaby; Bergounioux, Jean; Gerovassili, Ageliki; Thein, Swee Lay

    2004-01-01

    Background Heterocellular hereditary persistence of fetal hemoglobin (HPFH) is a common multifactorial trait characterized by a modest increase of fetal hemoglobin levels in adults. We previously localized a Quantitative Trait Locus for HPFH in an extensive Asian-Indian kindred to chromosome 6q23. As part of the strategy of positional cloning and a means towards identification of the specific genetic alteration in this family, a thorough annotation of the candidate interval based on a strategy of in silico / wet biology approach with comparative genomics was conducted. Results The ~1.5 Mb candidate region was shown to contain five protein-coding genes. We discovered a very large uncharacterized gene containing WD40 and SH3 domains (AHI1), and extended the annotation of four previously characterized genes (MYB, ALDH8A1, HBS1L and PDE7B). We also identified several genes that do not appear to be protein coding, and generated 17 kb of novel transcript sequence data from re-sequencing 97 EST clones. Conclusion Detailed and thorough annotation of this 1.5 Mb interval in 6q confirms a high level of aberrant transcripts in testicular tissue. The candidate interval was shown to exhibit an extraordinary level of alternate splicing – 19 transcripts were identified for the 5 protein coding genes, but it appears that a significant portion (14/19) of these alternate transcripts did not have an open reading frame, hence their functional role is questionable. These transcripts may result from aberrant rather than regulated splicing. PMID:15169551

  18. All SNPs are not created equal: genome-wide association studies reveal a consistent pattern of enrichment among functionally annotated SNPs.

    PubMed

    Schork, Andrew J; Thompson, Wesley K; Pham, Phillip; Torkamani, Ali; Roddey, J Cooper; Sullivan, Patrick F; Kelsoe, John R; O'Donovan, Michael C; Furberg, Helena; Schork, Nicholas J; Andreassen, Ole A; Dale, Anders M

    2013-04-01

    Recent results indicate that genome-wide association studies (GWAS) have the potential to explain much of the heritability of common complex phenotypes, but methods are lacking to reliably identify the remaining associated single nucleotide polymorphisms (SNPs). We applied stratified False Discovery Rate (sFDR) methods to leverage genic enrichment in GWAS summary statistics data to uncover new loci likely to replicate in independent samples. Specifically, we use linkage disequilibrium-weighted annotations for each SNP in combination with nominal p-values to estimate the True Discovery Rate (TDR = 1-FDR) for strata determined by different genic categories. We show a consistent pattern of enrichment of polygenic effects in specific annotation categories across diverse phenotypes, with the greatest enrichment for SNPs tagging regulatory and coding genic elements, little enrichment in introns, and negative enrichment for intergenic SNPs. Stratified enrichment directly leads to increased TDR for a given p-value, mirrored by increased replication rates in independent samples. We show this in independent Crohn's disease GWAS, where we find a hundredfold variation in replication rate across genic categories. Applying a well-established sFDR methodology we demonstrate the utility of stratification for improving power of GWAS in complex phenotypes, with increased rejection rates from 20% in height to 300% in schizophrenia with traditional FDR and sFDR both fixed at 0.05. Our analyses demonstrate an inherent stratification among GWAS SNPs with important conceptual implications that can be leveraged by statistical methods to improve the discovery of loci.

  19. OrthoVenn: a web server for genome wide comparison and annotation of orthologous clusters across multiple species

    Technology Transfer Automated Retrieval System (TEKTRAN)

    Genome wide analysis of orthologous clusters is an important component of comparative genomics studies. Identifying the overlap among orthologous clusters can enable us to elucidate the function and evolution of proteins across multiple species. Here, we report a web platform named OrthoVenn that i...

  20. Analysis of the multi-copied genes and the impact of the redundant protein coding sequences on gene annotation in prokaryotic genomes.

    PubMed

    Yu, Jia-Feng; Chen, Qing-Li; Ren, Jing; Yang, Yan-Ling; Wang, Ji-Hua; Sun, Xiao

    2015-07-07

    The important roles of duplicated genes in evolutional process have been recognized in bacteria, archaebacteria and eukaryotes, while there is very little study on the multi-copied protein coding genes that share sequence identity of 100%. In this paper, the multi-copied protein coding genes in a number of prokaryotic genomes are comprehensively analyzed firstly. The results show that 0-15.93% of the protein coding genes in each genome are multi-copied genes and 0-16.49% of the protein coding genes in each genome are highly similar with the sequence identity ≥ 80%. Function and COG (Clusters of Orthologous Groups of proteins) analysis shows that 64.64% of multi-copied genes concentrate on the function of transposase and 86.28% of the COG assigned multi-copied genes concentrate on the COG code of 'L'. Furthermore, the impact of redundant protein coding sequences on the gene prediction results is studied. The results show that the problem of protein coding sequence redundancies cannot be ignored and the consistency of the gene annotation results before and after excluding the redundant sequences is negatively related with the sequences redundancy degree of the protein coding sequences in the training set.

  1. Large-scale functional annotation and expanded implementations of the P{wHy} hybrid transposon in the Drosophila melanogaster genome.

    PubMed

    Myrick, Kyl V; Huet, François; Mohr, Stephanie E; Alvarez-García, Inés; Lu, Jeffrey T; Smith, Mark A; Crosby, Madeline A; Gelbart, William M

    2009-07-01

    Whole genome sequencing of the model organisms has created increased demand for efficient tools to facilitate the genome annotation efforts. Accordingly, we report the further implementations and analyses stemming from our publicly available P{wHy} library for Drosophila melanogaster. A two-step regime-large scale transposon mutagenesis followed by hobo-induced nested deletions-allows mutation saturation and provides significant enhancements to existing genomic coverage. We previously showed that, for a given starting insert, deletion saturation is readily obtained over a 60-kb interval; here, we perform a breakdown analysis of efficiency to identify rate-limiting steps in the process. Transrecombination, the hobo-induced recombination between two P{wHy} half molecules, was shown to further expand the P{wHy} mutational range, pointing to a potent, iterative process of transrecombination-reconstitution-transrecombination for alternating between very large and very fine-grained deletions in a self-contained manner. A number of strains also showed partial or complete repression of P{wHy} markers, depending on chromosome location, whereby asymmetric marker silencing allowed continuous phenotypic detection, indicating that P{wHy}-based saturational mutagenesis should be useful for the study of heterochromatin/positional effects.

  2. Whole-Genome Sequencing and Annotation of Bacillus safensis RIT372 and Pseudomonas oryzihabitans RIT370 from Capsicum annuum (Bird's Eye Chili) and Capsicum chinense (Yellow Lantern Chili), Respectively.

    PubMed

    Gan, Huan You; Gan, Han Ming; Savka, Michael A; Triassi, Alexander J; Wheatley, Matthew S; Naqvi, Kubra F; Foxhall, Taylor E; Anauo, Michael J; Baldwin, Mariah L; Burkhardt, Russell N; O'Bryon, Isabelle G; Dailey, Lucas K; Busairi, Nurfatini Idayu; Keith, Robert C; Khair, Megat Hazmah Megat Mazhar; Rasul, Muhammad Zamir Mohd; Rosdi, Nur Aiman Mohd; Mountzouros, James R; Rhoads, Aleigha C; Selochan, Melissa A; Tautanov, Timur B; Polter, Steven J; Marks, Kayla D; Caraballo, Alexander A; Hudson, André O

    2015-04-16

    Here, we report the genome sequences of Bacillus safensis RIT372 and Pseudomonas oryzihabitans RIT370 from Capsicum spp. Annotation revealed gene clusters for the synthesis of bacilysin, lichensin, and bacillibactin and sporulation killing factor (skfA) in Bacillus safensis RIT372 and turnerbactin and carotenoid in Pseudomonas oryzihabitans RIT370.

  3. Guitar: An R/Bioconductor Package for Gene Annotation Guided Transcriptomic Analysis of RNA-Related Genomic Features.

    PubMed

    Cui, Xiaodong; Wei, Zhen; Zhang, Lin; Liu, Hui; Sun, Lei; Zhang, Shao-Wu; Huang, Yufei; Meng, Jia

    2016-01-01

    Biological features, such as genes and transcription factor binding sites, are often denoted with genome-based coordinates as the genomic features. While genome-based representation is usually very effective in correlating various biological features, it can be tedious to examine the relationship between RNA-related genomic features and the landmarks of RNA transcripts with existing tools due to the difficulty in the conversion between genome-based coordinates and RNA-based coordinates. We developed here an open source Guitar R/Bioconductor package for sketching the transcriptomic view of RNA-related biological features represented by genome based coordinates. Internally, Guitar package extracts the standardized RNA coordinates with respect to the landmarks of RNA transcripts, with which hundreds of millions of RNA-related genomic features can then be efficiently analyzed within minutes. We demonstrated the usage of Guitar package in analyzing posttranscriptional RNA modifications (5-methylcytosine and N6-methyladenosine) derived from high-throughput sequencing approaches (MeRIP-Seq and RNA BS-Seq) and show that RNA 5-methylcytosine (m(5)C) is enriched in 5'UTR. The newly developed Guitar R/Bioconductor package achieves stable performance on the data tested and revealed novel biological insights. It will effectively facilitate the analysis of RNA methylation data and other RNA-related biological features in the future.

  4. TU-CD-BRB-07: Identification of Associations Between Radiologist-Annotated Imaging Features and Genomic Alterations in Breast Invasive Carcinoma, a TCGA Phenotype Research Group Study

    SciTech Connect

    Rao, A; Net, J; Brandt, K; Huang, E; Freymann, J; Kirby, J; Burnside, E; Morris, E; Sutton, E; Bonaccio, E; Giger, M; Jaffe, C; Ganott, M; Zuley, M; Le-Petross, H; Dogan, B; Whitman, G

    2015-06-15

    Purpose: To determine associations between radiologist-annotated MRI features and genomic measurements in breast invasive carcinoma (BRCA) from the Cancer Genome Atlas (TCGA). Methods: 98 TCGA patients with BRCA were assessed by a panel of radiologists (TCGA Breast Phenotype Research Group) based on a variety of mass and non-mass features according to the Breast Imaging Reporting and Data System (BI-RADS). Batch corrected gene expression data was obtained from the TCGA Data Portal. The Kruskal-Wallis test was used to assess correlations between categorical image features and tumor-derived genomic features (such as gene pathway activity, copy number and mutation characteristics). Image-derived features were also correlated with estrogen receptor (ER), progesterone receptor (PR), and human epidermal growth factor receptor 2 (HER2/neu) status. Multiple hypothesis correction was done using Benjamini-Hochberg FDR. Associations at an FDR of 0.1 were selected for interpretation. Results: ER status was associated with rim enhancement and peritumoral edema. PR status was associated with internal enhancement. Several components of the PI3K/Akt pathway were associated with rim enhancement as well as heterogeneity. In addition, several components of cell cycle regulation and cell division were associated with imaging characteristics.TP53 and GATA3 mutations were associated with lesion size. MRI features associated with TP53 mutation status were rim enhancement and peritumoral edema. Rim enhancement was associated with activity of RB1, PIK3R1, MAP3K1, AKT1,PI3K, and PIK3CA. Margin status was associated with HIF1A/ARNT, Ras/ GTP/PI3K, KRAS, and GADD45A. Axillary lymphadenopathy was associated with RB1 and BCL2L1. Peritumoral edema was associated with Aurora A/GADD45A, BCL2L1, CCNE1, and FOXA1. Heterogeneous internal nonmass enhancement was associated with EGFR, PI3K, AKT1, HF/MET, and EGFR/Erbb4/neuregulin 1. Diffuse nonmass enhancement was associated with HGF/MET/MUC20/SHIP

  5. Genome annotation provides insight into carbon monoxide and hydrogen metabolism in Rubrivivax gelatinosus

    SciTech Connect

    Wawrousek, Karen; Noble, Scott; Korlach, Jonas; Eckert, Carrie; Yu, Jianping; Maness, Pin -Ching

    2014-12-05

    In this article, we report here the sequencing and analysis of the genome of the purple non-sulfur photosynthetic bacterium Rubrivivax gelatinosus CBS. This microbe is a model for studies of its carboxydotrophic life style under anaerobic condition, based on its ability to utilize carbon monoxide (CO) as the sole carbon substrate and water as the electron acceptor, yielding CO2 and H2 as the end products. The CO-oxidation reaction is known to be catalyzed by two enzyme complexes, the CO dehydrogenase and hydrogenase. As expected, analysis of the genome of Rx. gelatinosus CBS reveals the presence of genes encoding both enzyme complexes. The CO-oxidation reaction is CO-inducible, which is consistent with the presence of two putative CO-sensing transcription factors in its genome. Genome analysis also reveals the presence of two additional hydrogenases, an uptake hydrogenase that liberates the electrons in H2 in support of cell growth, and a regulatory hydrogenase that senses H2 and relays the signal to a two-component system that ultimately controls synthesis of the uptake hydrogenase. The genome also contains two sets of hydrogenase maturation genes which are known to assemble the catalytic metallocluster of the hydrogenase NiFe active site. Finally and collectively, the genome sequence and analysis information reveals the blueprint of an intricate network of signal transduction pathways and its underlying regulation that enables Rx. gelatinosus CBS to thrive on CO or H2 in support of cell growth.

  6. An atlas of bovine gene expression reveals novel distinctive tissue characteristics and evidence for improving genome annotation

    Technology Transfer Automated Retrieval System (TEKTRAN)

    Background A comprehensive transcriptome survey, or gene atlas, provides information essential for a complete understanding of the genomic biology of an organism. We present an atlas of RNA abundance for 92 adult, juvenile and fetal cattle tissues and three cattle cell lines. Results The Bovine Gene...

  7. Whole-Genome Sequence and Annotation of Salmonella enterica subsp. enterica Serovar Enteritidis Phage Type 8 Strain EN1660

    PubMed Central

    Perry, Benjamin J.; Fitzgerald, Stephen F.; Kröger, Carsten

    2017-01-01

    ABSTRACT The genome of Salmonella enterica subspecies enterica serovar Enteritidis phage type 8 strain EN1660, isolated from an outbreak in Thunder Bay, Canada, was sequenced to 46-fold coverage using an Illumina MiSeq with 300-bp paired-end sequencing chemistry to produce 28 contigs with an N50 value of 490,721 bp. PMID:28126943

  8. Interspecific Comparison and annotation of two complete mitochondrial genome sequences from the plant pathogenic fungus Mycosphaerella graminicola

    SciTech Connect

    Millenbaugh, Bonnie A; Pangilinan, Jasmyn L.; Torriani, Stefano F.F.; Goodwin, Stephen B.; Kema, Gert H.J.; McDonald, Bruce A.

    2007-12-07

    The mitochondrial genomes of two isolates of the wheat pathogen Mycosphaerella graminicola were sequenced completely and compared to identify polymorphic regions. This organism is of interest because it is phylogenetically distant from other fungi with sequenced mitochondrial genomes and it has shown discordant patterns of nuclear and mitochondrial diversity. The mitochondrial genome of M. graminicola is a circular molecule of approximately 43,960 bp containing the typical genes coding for 14 proteins related to oxidative phosphorylation, one RNA polymerase, two rRNA genes and a set of 27 tRNAs. The mitochondrial DNA of M. graminicola lacks the gene encoding the putative ribosomal protein (rps5-like), commonly found in fungal mitochondrial genomes. Most of the tRNA genes were clustered with a gene order conserved with many other ascomycetes. A sample of thirty-five additional strains representing the known global mt diversity was partially sequenced to measure overall mitochondrial variability within the species. Little variation was found, confirming previous RFLP-based findings of low mitochondrial diversity. The mitochondrial sequence of M. graminicola is the first reported from the family Mycosphaerellaceae or the order Capnodiales. The sequence also provides a tool to better understand the development of fungicide resistance and the conflicting pattern of high nuclear and low mitochondrial diversity in global populations of this fungus.

  9. Complete Annotated Genome Sequences of Three Campylobacter jejuni Strains Isolated from Naturally Colonized Farm-Raised Chickens

    PubMed Central

    Taveirne, Michael E.; Dunham, Drew T.; Perault, Andrew; Beauchamp, Jessica M.; Huynh, Steven; Parker, Craig T.

    2017-01-01

    ABSTRACT Campylobacter jejuni is a leading cause of bacterially derived foodborne illness. Human illness is commonly associated with the handling and consumption of contaminated poultry products. Three C. jejuni strains were isolated from cecal contents of three different naturally colonized farm-raised chickens. The complete genomes of these three isolates are presented here. PMID:28126931

  10. All SNPs Are Not Created Equal: Genome-Wide Association Studies Reveal a Consistent Pattern of Enrichment among Functionally Annotated SNPs

    PubMed Central

    Schork, Andrew J.; Thompson, Wesley K.; Pham, Phillip; Torkamani, Ali; Roddey, J. Cooper; Sullivan, Patrick F.; Kelsoe, John R.; O'Donovan, Michael C.; Furberg, Helena; Schork, Nicholas J.; Andreassen, Ole A.; Dale, Anders M.

    2013-01-01

    Recent results indicate that genome-wide association studies (GWAS) have the potential to explain much of the heritability of common complex phenotypes, but methods are lacking to reliably identify the remaining associated single nucleotide polymorphisms (SNPs). We applied stratified False Discovery Rate (sFDR) methods to leverage genic enrichment in GWAS summary statistics data to uncover new loci likely to replicate in independent samples. Specifically, we use linkage disequilibrium-weighted annotations for each SNP in combination with nominal p-values to estimate the True Discovery Rate (TDR = 1−FDR) for strata determined by different genic categories. We show a consistent pattern of enrichment of polygenic effects in specific annotation categories across diverse phenotypes, with the greatest enrichment for SNPs tagging regulatory and coding genic elements, little enrichment in introns, and negative enrichment for intergenic SNPs. Stratified enrichment directly leads to increased TDR for a given p-value, mirrored by increased replication rates in independent samples. We show this in independent Crohn's disease GWAS, where we find a hundredfold variation in replication rate across genic categories. Applying a well-established sFDR methodology we demonstrate the utility of stratification for improving power of GWAS in complex phenotypes, with increased rejection rates from 20% in height to 300% in schizophrenia with traditional FDR and sFDR both fixed at 0.05. Our analyses demonstrate an inherent stratification among GWAS SNPs with important conceptual implications that can be leveraged by statistical methods to improve the discovery of loci. PMID:23637621

  11. The evolution of the natural killer complex; a comparison between mammals using new high-quality genome assemblies and targeted annotation.

    PubMed

    Schwartz, John C; Gibson, Mark S; Heimeier, Dorothea; Koren, Sergey; Phillippy, Adam M; Bickhart, Derek M; Smith, Timothy P L; Medrano, Juan F; Hammond, John A

    2017-04-01

    Natural killer (NK) cells are a diverse population of lymphocytes with a range of biological roles including essential immune functions. NK cell diversity is in part created by the differential expression of cell surface receptors which modulate activation and function, including multiple subfamilies of C-type lectin receptors encoded within the NK complex (NKC). Little is known about the gene content of the NKC beyond rodent and primate lineages, other than it appears to be extremely variable between mammalian groups. We compared the NKC structure between mammalian species using new high-quality draft genome assemblies for cattle and goat; re-annotated sheep, pig, and horse genome assemblies; and the published human, rat, and mouse lemur NKC. The major NKC genes are largely in the equivalent positions in all eight species, with significant independent expansions and deletions between species, allowing us to propose a model for NKC evolution during mammalian radiation. The ruminant species, cattle and goats, have independently evolved a second KLRC locus flanked by KLRA and KLRJ, and a novel KLRH-like gene has acquired an activating tail. This novel gene has duplicated several times within cattle, while other activating receptor genes have been selectively disrupted. Targeted genome enrichment in cattle identified varying levels of allelic polymorphism between the NKC genes concentrated in the predicted extracellular ligand-binding domains. This novel recombination and allelic polymorphism is consistent with NKC evolution under balancing selection, suggesting that this diversity influences individual immune responses and may impact on differential outcomes of pathogen infection and vaccination.

  12. Gene Ontology annotations and resources.

    PubMed

    Blake, J A; Dolan, M; Drabkin, H; Hill, D P; Li, Ni; Sitnikov, D; Bridges, S; Burgess, S; Buza, T; McCarthy, F; Peddinti, D; Pillai, L; Carbon, S; Dietze, H; Ireland, A; Lewis, S E; Mungall, C J; Gaudet, P; Chrisholm, R L; Fey, P; Kibbe, W A; Basu, S; Siegele, D A; McIntosh, B K; Renfro, D P; Zweifel, A E; Hu, J C; Brown, N H; Tweedie, S; Alam-Faruque, Y; Apweiler, R; Auchinchloss, A; Axelsen, K; Bely, B; Blatter, M -C; Bonilla, C; Bouguerleret, L; Boutet, E; Breuza, L; Bridge, A; Chan, W M; Chavali, G; Coudert, E; Dimmer, E; Estreicher, A; Famiglietti, L; Feuermann, M; Gos, A; Gruaz-Gumowski, N; Hieta, R; Hinz, C; Hulo, C; Huntley, R; James, J; Jungo, F; Keller, G; Laiho, K; Legge, D; Lemercier, P; Lieberherr, D; Magrane, M; Martin, M J; Masson, P; Mutowo-Muellenet, P; O'Donovan, C; Pedruzzi, I; Pichler, K; Poggioli, D; Porras Millán, P; Poux, S; Rivoire, C; Roechert, B; Sawford, T; Schneider, M; Stutz, A; Sundaram, S; Tognolli, M; Xenarios, I; Foulgar, R; Lomax, J; Roncaglia, P; Khodiyar, V K; Lovering, R C; Talmud, P J; Chibucos, M; Giglio, M Gwinn; Chang, H -Y; Hunter, S; McAnulla, C; Mitchell, A; Sangrador, A; Stephan, R; Harris, M A; Oliver, S G; Rutherford, K; Wood, V; Bahler, J; Lock, A; Kersey, P J; McDowall, D M; Staines, D M; Dwinell, M; Shimoyama, M; Laulederkind, S; Hayman, T; Wang, S -J; Petri, V; Lowry, T; D'Eustachio, P; Matthews, L; Balakrishnan, R; Binkley, G; Cherry, J M; Costanzo, M C; Dwight, S S; Engel, S R; Fisk, D G; Hitz, B C; Hong, E L; Karra, K; Miyasato, S R; Nash, R S; Park, J; Skrzypek, M S; Weng, S; Wong, E D; Berardini, T Z; Huala, E; Mi, H; Thomas, P D; Chan, J; Kishore, R; Sternberg, P; Van Auken, K; Howe, D; Westerfield, M

    2013-01-01

    The Gene Ontology (GO) Consortium (GOC, http://www.geneontology.org) is a community-based bioinformatics resource that classifies gene product function through the use of structured, controlled vocabularies. Over the past year, the GOC has implemented several processes to increase the quantity, quality and specificity of GO annotations. First, the number of manual, literature-based annotations has grown at an increasing rate. Second, as a result of a new 'phylogenetic annotation' process, manually reviewed, homology-based annotations are becoming available for a broad range of species. Third, the quality of GO annotations has been improved through a streamlined process for, and automated quality checks of, GO annotations deposited by different annotation groups. Fourth, the consistency and correctness of the ontology itself has increased by using automated reasoning tools. Finally, the GO has been expanded not only to cover new areas of biology through focused interaction with experts, but also to capture greater specificity in all areas of the ontology using tools for adding new combinatorial terms. The GOC works closely with other ontology developers to support integrated use of terminologies. The GOC supports its user community through the use of e-mail lists, social media and web-based resources.

  13. Genome-wide annotation, expression profiling, and protein interaction studies of the core cell-cycle genes in Phalaenopsis aphrodite.

    PubMed

    Lin, Hsiang-Yin; Chen, Jhun-Chen; Wei, Miao-Ju; Lien, Yi-Chen; Li, Huang-Hsien; Ko, Swee-Suak; Liu, Zin-Huang; Fang, Su-Chiung

    2014-01-01

    Orchidaceae is one of the most abundant and diverse families in the plant kingdom and its unique developmental patterns have drawn the attention of many evolutionary biologists. Particular areas of interest have included the co-evolution of pollinators and distinct floral structures, and symbiotic relationships with mycorrhizal flora. However, comprehensive studies to decipher the molecular basis of growth and development in orchids remain scarce. Cell proliferation governed by cell-cycle regulation is fundamental to growth and development of the plant body. We took advantage of recently released transcriptome information to systematically isolate and annotate the core cell-cycle regulators in the moth orchid Phalaenopsis aphrodite. Our data verified that Phalaenopsis cyclin-dependent kinase A (CDKA) is an evolutionarily conserved CDK. Expression profiling studies suggested that core cell-cycle genes functioning during the G1/S, S, and G2/M stages were preferentially enriched in the meristematic tissues that have high proliferation activity. In addition, subcellular localization and pairwise interaction analyses of various combinations of CDKs and cyclins, and of E2 promoter-binding factors and dimerization partners confirmed interactions of the functional units. Furthermore, our data showed that expression of the core cell-cycle genes was coordinately regulated during pollination-induced reproductive development. The data obtained establish a fundamental framework for study of the cell-cycle machinery in Phalaenopsis orchids.

  14. Genomewide annotation and comparative genomics of cytochrome P450 monooxygenases (P450s) in the polypore species Bjerkandera adusta, Ganoderma sp. and Phlebia brevispora.

    PubMed

    Syed, Khajamohiddin; Nelson, David R; Riley, Robert; Yadav, Jagjit S

    2013-01-01

    Genomewide annotation of cytochrome P450 monooxygenases (P450s) in three white-rot species of the fungal order Polyporales, namely Bjerkandera adusta, Ganoderma sp. and Phlebia brevispora, revealed a large contingent of P450 genes (P450ome) in their genomes. A total of 199 P450 genes in B. adusta and 209 P450 genes each in Ganoderma sp. and P. brevispora were identified. These P450omes were classified into families and subfamilies as follows: B. adusta (39 families, 86 subfamilies), Ganoderma sp. (41 families, 105 subfamilies) and P. brevispora (42 families, 111 subfamilies). Of note, the B. adusta genome lacked the CYP505 family (P450foxy), a group of P450-CPR fusion proteins. The three polypore species revealed differential enrichment of individual P450 families in their genomes. The largest CYP families in the three genomes were CYP5144 (67 P450s), CYP5359 (46 P450s) and CYP5344 (43 P450s) in B. adusta, Ganoderma sp. and P. brevispora, respectively. Our analyses showed that tandem gene duplications led to expansions in certain P450 families. An estimated 33% (72 P450s), 28% (55 P450s) and 23% (49 P450s) of P450ome genes were duplicated in P. brevispora, B. adusta and Ganoderma sp., respectively. Family-wise comparative analysis revealed that 22 CYP families are common across the three Polypore species. Comparative P450ome analysis with Ganoderma lucidum revealed the presence of 143 orthologs and 56 paralogs in Ganoderma sp. Multiple P450s were found near the characteristic biosynthetic genes for secondary metabolites, namely polyketide synthase (PKS), non-ribosomal peptide synthetase (NRPS), terpene cyclase and terpene synthase in the three genomes, suggesting a likely role of these P450s in secondary metabolism in these Polyporales. Overall, the three species had a richer P450 diversity both in terms of the P450 genes and P450 subfamilies as compared to the model white-rot and brown-rot polypore species Phanerochaete chrysosporium and Postia placenta.

  15. Metabolic potential of the organic-solvent tolerant Pseudomonas putida DOT-T1E deduced from its annotated genome

    PubMed Central

    Udaondo, Zulema; Molina, Lazaro; Daniels, Craig; Gómez, Manuel J; Molina-Henares, María A; Matilla, Miguel A; Roca, Amalia; Fernández, Matilde; Duque, Estrella; Segura, Ana; Ramos, Juan Luis

    2013-01-01

    Summary Pseudomonas putida DOT-T1E is an organic solvent tolerant strain capable of degrading aromatic hydrocarbons. Here we report the DOT-T1E genomic sequence (6 394 153 bp) and its metabolic atlas based on the classification of enzyme activities. The genome encodes for at least 1751 enzymatic reactions that account for the known pattern of C, N, P and S utilization by this strain. Based on the potential of this strain to thrive in the presence of organic solvents and the subclasses of enzymes encoded in the genome, its metabolic map can be drawn and a number of potential biotransformation reactions can be deduced. This information may prove useful for adapting desired reactions to create value-added products. This bioengineering potential may be realized via direct transformation of substrates, or may require genetic engineering to block an existing pathway, or to re-organize operons and genes, as well as possibly requiring the recruitment of enzymes from other sources to achieve the desired transformation. Funding Information Work in our laboratory was supported by Fondo Social Europeo and Fondos FEDER from the European Union, through several projects (BIO2010-17227, Consolider-Ingenio CSD2007-00005, Excelencia 2007 CVI-3010, Excelencia 2011 CVI-7391 and EXPLORA BIO2011-12776-E). PMID:23815283

  16. Annotated Videography.

    ERIC Educational Resources Information Center

    United States Holocaust Memorial Museum, Washington, DC.

    This annotated list of 43 videotapes recommended for classroom use addresses various themes for teaching about the Holocaust, including: (1) overviews of the Holocaust; (2) life before the Holocaust; (3) propaganda; (4) racism, anti-Semitism; (5) "enemies of the state"; (6) ghettos; (7) camps; (8) genocide; (9) rescue; (10) resistance;…

  17. antiSMASH: rapid identification, annotation and analysis of secondary metabolite biosynthesis gene clusters in bacterial and fungal genome sequences.

    PubMed

    Medema, Marnix H; Blin, Kai; Cimermancic, Peter; de Jager, Victor; Zakrzewski, Piotr; Fischbach, Michael A; Weber, Tilmann; Takano, Eriko; Breitling, Rainer

    2011-07-01

    Bacterial and fungal secondary metabolism is a rich source of novel bioactive compounds with potential pharmaceutical applications as antibiotics, anti-tumor drugs or cholesterol-lowering drugs. To find new drug candidates, microbiologists are increasingly relying on sequencing genomes of a wide variety of microbes. However, rapidly and reliably pinpointing all the potential gene clusters for secondary metabolites in dozens of newly sequenced genomes has been extremely challenging, due to their biochemical heterogeneity, the presence of unknown enzymes and the dispersed nature of the necessary specialized bioinformatics tools and resources. Here, we present antiSMASH (antibiotics & Secondary Metabolite Analysis Shell), the first comprehensive pipeline capable of identifying biosynthetic loci covering the whole range of known secondary metabolite compound classes (polyketides, non-ribosomal peptides, terpenes, aminoglycosides, aminocoumarins, indolocarbazoles, lantibiotics, bacteriocins, nucleosides, beta-lactams, butyrolactones, siderophores, melanins and others). It aligns the identified regions at the gene cluster level to their nearest relatives from a database containing all other known gene clusters, and integrates or cross-links all previously available secondary-metabolite specific gene analysis methods in one interactive view. antiSMASH is available at http://antismash.secondarymetabolites.org.

  18. Annotation of differentially expressed genes in the somatic embryogenesis of musa and their location in the banana genome.

    PubMed

    Maldonado-Borges, Josefina Ines; Ku-Cauich, José Roberto; Escobedo-Graciamedrano, Rosa Maria

    2013-01-01

    Analysis of cDNA-AFLP was used to study the genes expressed in zygotic and somatic embryogenesis of Musa acuminata Colla ssp. malaccensis, and a comparison was made between their differential transcribed fragments (TDFs) and the sequenced genome of the double haploid- (DH-) Pahang of the malaccensis subspecies that is available in the network. A total of 253 transcript-derived fragments (TDFs) were detected with apparent size of 100-4000 bp using 5 pairs of AFLP primers, of which 21 were differentially expressed during the different stages of banana embryogenesis; 15 of the sequences have matched DH-Pahang chromosomes, with 7 of them being homologous to gene sequences encoding either known or putative protein domains of higher plants. Four TDF sequences were located in all Musa chromosomes, while the rest were located in one or two chromosomes. Their putative individual function is briefly reviewed based on published information, and the potential roles of these genes in embryo development are discussed. Thus the availability of the genome of Musa and the information of TDFs sequences presented here opens new possibilities for an in-depth study of the molecular and biochemical research of zygotic and somatic embryogenesis of Musa.

  19. Annotation of Differentially Expressed Genes in the Somatic Embryogenesis of Musa and Their Location in the Banana Genome

    PubMed Central

    Maldonado-Borges, Josefina Ines; Ku-Cauich, José Roberto; Escobedo-GraciaMedrano, Rosa Maria

    2013-01-01

    Analysis of cDNA-AFLP was used to study the genes expressed in zygotic and somatic embryogenesis of Musa acuminata Colla ssp. malaccensis, and a comparison was made between their differential transcribed fragments (TDFs) and the sequenced genome of the double haploid- (DH-) Pahang of the malaccensis subspecies that is available in the network. A total of 253 transcript-derived fragments (TDFs) were detected with apparent size of 100–4000 bp using 5 pairs of AFLP primers, of which 21 were differentially expressed during the different stages of banana embryogenesis; 15 of the sequences have matched DH-Pahang chromosomes, with 7 of them being homologous to gene sequences encoding either known or putative protein domains of higher plants. Four TDF sequences were located in all Musa chromosomes, while the rest were located in one or two chromosomes. Their putative individual function is briefly reviewed based on published information, and the potential roles of these genes in embryo development are discussed. Thus the availability of the genome of Musa and the information of TDFs sequences presented here opens new possibilities for an in-depth study of the molecular and biochemical research of zygotic and somatic embryogenesis of Musa. PMID:24027442

  20. Functional genomics tools applied to plant metabolism: a survey on plant respiration, its connections and the annotation of complex gene functions

    PubMed Central

    Araújo, Wagner L.; Nunes-Nesi, Adriano; Williams, Thomas C. R.

    2012-01-01

    The application of post-genomic techniques in plant respiration studies has greatly improved our ability to assign functions to gene products. In addition it has also revealed previously unappreciated interactions between distal elements of metabolism. Such results have reinforced the need to consider plant respiratory metabolism as part of a complex network and making sense of such interactions will ultimately require the construction of predictive and mechanistic models. Transcriptomics, proteomics, metabolomics, and the quantification of metabolic flux will be of great value in creating such models both by facilitating the annotation of complex gene function, determining their structure and by furnishing the quantitative data required to test them. In this review, we highlight how these experimental approaches have contributed to our current understanding of plant respiratory metabolism and its interplay with associated process (e.g., photosynthesis, photorespiration, and nitrogen metabolism). We also discuss how data from these techniques may be integrated, with the ultimate aim of identifying mechanisms that control and regulate plant respiration and discovering novel gene functions with potential biotechnological implications. PMID:22973288

  1. Gene Ontology Annotations and Resources

    PubMed Central

    2013-01-01

    The Gene Ontology (GO) Consortium (GOC, http://www.geneontology.org) is a community-based bioinformatics resource that classifies gene product function through the use of structured, controlled vocabularies. Over the past year, the GOC has implemented several processes to increase the quantity, quality and specificity of GO annotations. First, the number of manual, literature-based annotations has grown at an increasing rate. Second, as a result of a new ‘phylogenetic annotation’ process, manually reviewed, homology-based annotations are becoming available for a broad range of species. Third, the quality of GO annotations has been improved through a streamlined process for, and automated quality checks of, GO annotations deposited by different annotation groups. Fourth, the consistency and correctness of the ontology itself has increased by using automated reasoning tools. Finally, the GO has been expanded not only to cover new areas of biology through focused interaction with experts, but also to capture greater specificity in all areas of the ontology using tools for adding new combinatorial terms. The GOC works closely with other ontology developers to support integrated use of terminologies. The GOC supports its user community through the use of e-mail lists, social media and web-based resources. PMID:23161678

  2. annot8r: GO, EC and KEGG annotation of EST datasets

    PubMed Central

    Schmid, Ralf; Blaxter, Mark L

    2008-01-01

    Background The expressed sequence tag (EST) methodology is an attractive option for the generation of sequence data for species for which no completely sequenced genome is available. The annotation and comparative analysis of such datasets poses a formidable challenge for research groups that do not have the bioinformatics infrastructure of major genome sequencing centres. Therefore, there is a need for user-friendly tools to facilitate the annotation of non-model species EST datasets with well-defined ontologies that enable meaningful cross-species comparisons. To address this, we have developed annot8r, a platform for the rapid annotation of EST datasets with GO-terms, EC-numbers and KEGG-pathways. Results annot8r automatically downloads all files relevant for the annotation process and generates a reference database that stores UniProt entries, their associated Gene Ontology (GO), Enzyme Commission (EC) and Kyoto Encyclopaedia of Genes and Genomes (KEGG) annotation and additional relevant data. For each of GO, EC and KEGG, annot8r extracts a specific sequence subset from the UniProt dataset based on the information stored in the reference database. These three subsets are then formatted for BLAST searches. The user provides the protein or nucleotide sequences to be annotated and annot8r runs BLAST searches against these three subsets. The BLAST results are parsed and the corresponding annotations retrieved from the reference database. The annotations are saved both as flat files and also in a relational postgreSQL results database to facilitate more advanced searches within the results. annot8r is integrated with the PartiGene suite of EST analysis tools. Conclusion annot8r is a tool that assigns GO, EC and KEGG annotations for data sets resulting from EST sequencing projects both rapidly and efficiently. The benefits of an underlying relational database, flexibility and the ease of use of the program make it ideally suited for non-model species EST

  3. MicroScope--an integrated microbial resource for the curation and comparative analysis of genomic and metabolic data.

    PubMed

    Vallenet, David; Belda, Eugeni; Calteau, Alexandra; Cruveiller, Stéphane; Engelen, Stefan; Lajus, Aurélie; Le Fèvre, François; Longin, Cyrille; Mornico, Damien; Roche, David; Rouy, Zoé; Salvignol, Gregory; Scarpelli, Claude; Thil Smith, Adam Alexander; Weiman, Marion; Médigue, Claudine

    2013-01-01

    MicroScope is an integrated platform dedicated to both the methodical updating of microbial genome annotation and to comparative analysis. The resource provides data from completed and ongoing genome projects (automatic and expert annotations), together with data sources from post-genomic experiments (i.e. transcriptomics, mutant collections) allowing users to perfect and improve the understanding of gene functions. MicroScope (http://www.genoscope.cns.fr/agc/microscope) combines tools and graphical interfaces to analyse genomes and to perform the manual curation of gene annotations in a comparative context. Since its first publication in January 2006, the system (previously named MaGe for Magnifying Genomes) has been continuously extended both in terms of data content and analysis tools. The last update of MicroScope was published in 2009 in the Database journal. Today, the resource contains data for >1600 microbial genomes, of which ∼300 are manually curated and maintained by biologists (1200 personal accounts today). Expert annotations are continuously gathered in the MicroScope database (∼50 000 a year), contributing to the improvement of the quality of microbial genomes annotations. Improved data browsing and searching tools have been added, original tools useful in the context of expert annotation have been developed and integrated and the website has been significantly redesigned to be more user-friendly. Furthermore, in the context of the European project Microme (Framework Program 7 Collaborative Project), MicroScope is becoming a resource providing for the curation and analysis of both genomic and metabolic data. An increasing number of projects are related to the study of environmental bacterial (meta)genomes that are able to metabolize a large variety of chemical compounds that may be of high industrial interest.

  4. Genome-Wide Annotation and Comparative Analysis of Cytochrome P450 Monooxygenases in Basidiomycete Biotrophic Plant Pathogens.

    PubMed

    Qhanya, Lehlohonolo Benedict; Matowane, Godfrey; Chen, Wanping; Sun, Yuxin; Letsimo, Elizabeth Mpholoseng; Parvez, Mohammad; Yu, Jae-Hyuk; Mashele, Samson Sitheni; Syed, Khajamohiddin

    2015-01-01

    Fungi are an exceptional source of diverse and novel cytochrome P450 monooxygenases (P450s), heme-thiolate proteins, with catalytic versatility. Agaricomycotina saprophytes have yielded most of the available information on basidiomycete P450s. This resulted in observing similar P450 family types in basidiomycetes with few differences in P450 families among Agaricomycotina saprophytes. The present study demonstrated the presence of unique P450 family patterns in basidiomycete biotrophic plant pathogens that could possibly have originated from the adaptation of these species to different ecological niches (host influence). Systematic analysis of P450s in basidiomycete biotrophic plant pathogens belonging to three different orders, Agaricomycotina (Armillaria mellea), Pucciniomycotina (Melampsora laricis-populina, M. lini, Mixia osmundae and Puccinia graminis) and Ustilaginomycotina (Ustilago maydis, Sporisorium reilianum and Tilletiaria anomala), revealed the presence of numerous putative P450s ranging from 267 (A. mellea) to 14 (M. osmundae). Analysis of P450 families revealed the presence of 41 new P450 families and 27 new P450 subfamilies in these biotrophic plant pathogens. Order-level comparison of P450 families between biotrophic plant pathogens revealed the presence of unique P450 family patterns in these organisms, possibly reflecting the characteristics of their order. Further comparison of P450 families with basidiomycete non-pathogens confirmed that biotrophic plant pathogens harbour the unique P450 families in their genomes. The CYP63, CYP5037, CYP5136, CYP5137 and CYP5341 P450 families were expanded in A. mellea when compared to other Agaricomycotina saprophytes and the CYP5221 and CYP5233 P450 families in P. graminis and M. laricis-populina. The present study revealed that expansion of these P450 families is due to paralogous evolution of member P450s. The presence of unique P450 families in these organisms serves as evidence of how a host

  5. Genome-Wide Annotation and Comparative Analysis of Cytochrome P450 Monooxygenases in Basidiomycete Biotrophic Plant Pathogens

    PubMed Central

    Sun, Yuxin; Letsimo, Elizabeth Mpholoseng; Parvez, Mohammad; Yu, Jae-Hyuk; Mashele, Samson Sitheni; Syed, Khajamohiddin

    2015-01-01

    Fungi are an exceptional source of diverse and novel cytochrome P450 monooxygenases (P450s), heme-thiolate proteins, with catalytic versatility. Agaricomycotina saprophytes have yielded most of the available information on basidiomycete P450s. This resulted in observing similar P450 family types in basidiomycetes with few differences in P450 families among Agaricomycotina saprophytes. The present study demonstrated the presence of unique P450 family patterns in basidiomycete biotrophic plant pathogens that could possibly have originated from the adaptation of these species to different ecological niches (host influence). Systematic analysis of P450s in basidiomycete biotrophic plant pathogens belonging to three different orders, Agaricomycotina (Armillaria mellea), Pucciniomycotina (Melampsora laricis-populina, M. lini, Mixia osmundae and Puccinia graminis) and Ustilaginomycotina (Ustilago maydis, Sporisorium reilianum and Tilletiaria anomala), revealed the presence of numerous putative P450s ranging from 267 (A. mellea) to 14 (M. osmundae). Analysis of P450 families revealed the presence of 41 new P450 families and 27 new P450 subfamilies in these biotrophic plant pathogens. Order-level comparison of P450 families between biotrophic plant pathogens revealed the presence of unique P450 family patterns in these organisms, possibly reflecting the characteristics of their order. Further comparison of P450 families with basidiomycete non-pathogens confirmed that biotrophic plant pathogens harbour the unique P450 families in their genomes. The CYP63, CYP5037, CYP5136, CYP5137 and CYP5341 P450 families were expanded in A. mellea when compared to other Agaricomycotina saprophytes and the CYP5221 and CYP5233 P450 families in P. graminis and M. laricis-populina. The present study revealed that expansion of these P450 families is due to paralogous evolution of member P450s. The presence of unique P450 families in these organisms serves as evidence of how a host

  6. Annotated chemical patent corpus: a gold standard for text mining.

    PubMed

    Akhondi, Saber A; Klenner, Alexander G; Tyrchan, Christian; Manchala, Anil K; Boppana, Kiran; Lowe, Daniel; Zimmermann, Marc; Jagarlapudi, Sarma A R P; Sayle, Roger; Kors, Jan A; Muresan, Sorel

    2014-01-01

    Exploring the chemical and biological space covered by patent applications is crucial in early-stage medicinal chemistry activities. Patent analysis can provide understanding of compound prior art, novelty checking, validation of biological assays, and identification of new starting points for chemical exploration. Extracting chemical and biological entities from patents through manual extraction by expert curators can take substantial amount of time and resources. Text mining methods can help to ease this process. To validate the performance of such methods, a manually annotated patent corpus is essential. In this study we have produced a large gold standard chemical patent corpus. We developed annotation guidelines and selected 200 full patents from the World Intellectual Property Organization, United States Patent and Trademark Office, and European Patent Office. The patents were pre-annotated automatically and made available to four independent annotator groups each consisting of two to ten annotators. The annotators marked chemicals in different subclasses, diseases, targets, and modes of action. Spelling mistakes and spurious line break due to optical character recognition errors were also annotated. A subset of 47 patents was annotated by at least three annotator groups, from which harmonized annotations and inter-annotator agreement scores were derived. One group annotated the full set. The patent corpus includes 400,125 annotations for the full set and 36,537 annotations for the harmonized set. All patents and annotated entities are publicly available at www.biosemantics.org.

  7. Annotated Chemical Patent Corpus: A Gold Standard for Text Mining

    PubMed Central

    Akhondi, Saber A.; Klenner, Alexander G.; Tyrchan, Christian; Manchala, Anil K.; Boppana, Kiran; Lowe, Daniel; Zimmermann, Marc; Jagarlapudi, Sarma A. R. P.; Sayle, Roger; Kors, Jan A.; Muresan, Sorel

    2014-01-01

    Exploring the chemical and biological space covered by patent applications is crucial in early-stage medicinal chemistry activities. Patent analysis can provide understanding of compound prior art, novelty checking, validation of biological assays, and identification of new starting points for chemical exploration. Extracting chemical and biological entities from patents through manual extraction by expert curators can take substantial amount of time and resources. Text mining methods can help to ease this process. To validate the performance of such methods, a manually annotated patent corpus is essential. In this study we have produced a large gold standard chemical patent corpus. We developed annotation guidelines and selected 200 full patents from the World Intellectual Property Organization, United States Patent and Trademark Office, and European Patent Office. The patents were pre-annotated automatically and made available to four independent annotator groups each consisting of two to ten annotators. The annotators marked chemicals in different subclasses, diseases, targets, and modes of action. Spelling mistakes and spurious line break due to optical character recognition errors were also annotated. A subset of 47 patents was annotated by at least three annotator groups, from which harmonized annotations and inter-annotator agreement scores were derived. One group annotated the full set. The patent corpus includes 400,125 annotations for the full set and 36,537 annotations for the harmonized set. All patents and annotated entities are publicly available at www.biosemantics.org. PMID:25268232

  8. Dragon Plant Biology Explorer. A Text-Mining Tool for Integrating Associations between Genetic and Biochemical Entities with Genome Annotation and Biochemical Terms Lists[w

    PubMed Central

    Bajic, Vladimir B.; Veronika, Merlin; Veladandi, Pardha Sarathi; Meka, Archana; Heng, Mok-Wei; Rajaraman, Kanagasabai; Pan, Hong; Swarup, Sanjay

    2005-01-01

    We introduce a tool for text mining, Dragon Plant Biology Explorer (DPBE) that integrates information on Arabidopsis (Arabidopsis thaliana) genes with their functions, based on gene ontologies and biochemical entity vocabularies, and presents the associations as interactive networks. The associations are based on (1) user-provided PubMed abstracts; (2) a list of Arabidopsis genes compiled by The Arabidopsis Information Resource; (3) user-defined combinations of four vocabulary lists based on the ones developed by the general, plant, and Arabidopsis GO consortia; and (4) three lists developed here based on metabolic pathways, enzymes, and metabolites derived from AraCyc, BRENDA, and other metabolism databases. We demonstrate how various combinations can be applied to fields of (1) gene function and gene interaction analyses, (2) plant development, (3) biochemistry and metabolism, and (4) pharmacology of bioactive compounds. Furthermore, we show the suitability of DPBE for systems approaches by integration with “omics” platform outputs. Using a list of abiotic stress-related genes identified by microarray experiments, we show how this tool can be used to rapidly build an information base on the previously reported relationships. This tool complements the existing biological resources for systems biology by identifying potentially novel associations using text analysis between cellular entities based on genome annotation terms. Thus, it allows researchers to efficiently summarize existing information for a group of genes or pathways, so as to make better informed choices for designing validation experiments. Last, DPBE can be helpful for beginning researchers and graduate students to summarize vast information in an unfamiliar area. DPBE is freely available for academic and nonprofit users at http://research.i2r.a-star.edu.sg/DRAGON/ME2/. PMID:16172098

  9. MicroScope in 2017: an expanding and evolving integrated resource for community expertise of microbial genomes

    PubMed Central

    Vallenet, David; Calteau, Alexandra; Cruveiller, Stéphane; Gachet, Mathieu; Lajus, Aurélie; Josso, Adrien; Mercier, Jonathan; Renaux, Alexandre; Rollin, Johan; Rouy, Zoe; Roche, David; Scarpelli, Claude; Médigue, Claudine

    2017-01-01

    The annotation of genomes from NGS platforms needs to be automated and fully integrated. However, maintaining consistency and accuracy in genome annotation is a challenging problem because millions of protein database entries are not assigned reliable functions. This shortcoming limits the knowledge that can be extracted from genomes and metabolic models. Launched in 2005, the MicroScope platform (http://www.genoscope.cns.fr/agc/microscope) is an integrative resource that supports systematic and efficient revision of microbial genome annotation, data management and comparative analysis. Effective comparative analysis requires a consistent and complete view of biological data, and therefore, support for reviewing the quality of functional annotation is critical. MicroScope allows users to analyze microbial (meta)genomes together with post-genomic experiment results if any (i.e. transcriptomics, re-sequencing of evolved strains, mutant collections, phenotype data). It combines tools and graphical interfaces to analyze genomes and to perform the expert curation of gene functions in a comparative context. Starting with a short overview of the MicroScope system, this paper focuses on some major improvements of the Web interface, mainly for the submission of genomic data and on original tools and pipelines that have been developed and integrated in the platform: computation of pan-genomes and prediction of biosynthetic gene clusters. Today the resource contains data for more than 6000 microbial genomes, and among the 2700 personal accounts (65% of which are now from foreign countries), 14% of the users are performing expert annotations, on at least a weekly basis, contributing to improve the quality of microbial genome annotations. PMID:27899624

  10. MicroScope in 2017: an expanding and evolving integrated resource for community expertise of microbial genomes.

    PubMed

    Vallenet, David; Calteau, Alexandra; Cruveiller, Stéphane; Gachet, Mathieu; Lajus, Aurélie; Josso, Adrien; Mercier, Jonathan; Renaux, Alexandre; Rollin, Johan; Rouy, Zoe; Roche, David; Scarpelli, Claude; Médigue, Claudine

    2017-01-04

    The annotation of genomes from NGS platforms needs to be automated and fully integrated. However, maintaining consistency and accuracy in genome annotation is a challenging problem because millions of protein database entries are not assigned reliable functions. This shortcoming limits the knowledge that can be extracted from genomes and metabolic models. Launched in 2005, the MicroScope platform (http://www.genoscope.cns.fr/agc/microscope) is an integrative resource that supports systematic and efficient revision of microbial genome annotation, data management and comparative analysis. Effective comparative analysis requires a consistent and complete view of biological data, and therefore, support for reviewing the quality of functional annotation is critical. MicroScope allows users to analyze microbial (meta)genomes together with post-genomic experiment results if any (i.e. transcriptomics, re-sequencing of evolved strains, mutant collections, phenotype data). It combines tools and graphical interfaces to analyze genomes and to perform the expert curation of gene functions in a comparative context. Starting with a short overview of the MicroScope system, this paper focuses on some major improvements of the Web interface, mainly for the submission of genomic data and on original tools and pipelines that have been developed and integrated in the platform: computation of pan-genomes and prediction of biosynthetic gene clusters. Today the resource contains data for more than 6000 microbial genomes, and among the 2700 personal accounts (65% of which are now from foreign countries), 14% of the users are performing expert annotations, on at least a weekly basis, contributing to improve the quality of microbial genome annotations.

  11. Ranking Biomedical Annotations with Annotator's Semantic Relevancy

    PubMed Central

    2014-01-01

    Biomedical annotation is a common and affective artifact for researchers to discuss, show opinion, and share discoveries. It becomes increasing popular in many online research communities, and implies much useful information. Ranking biomedical annotations is a critical problem for data user to efficiently get information. As the annotator's knowledge about the annotated entity normally determines quality of the annotations, we evaluate the knowledge, that is, semantic relationship between them, in two ways. The first is extracting relational information from credible websites by mining association rules between an annotator and a biomedical entity. The second way is frequent pattern mining from historical annotations, which reveals common features of biomedical entities that an annotator can annotate with high quality. We propose a weighted and concept-extended RDF model to represent an annotator, a biomedical entity, and their background attributes and merge information from the two ways as the context of an annotator. Based on that, we present a method to rank the annotations by evaluating their correctness according to user's vote and the semantic relevancy between the annotator and the annotated entity. The experimental results show that the approach is applicable and efficient even when data set is large. PMID:24899918

  12. Ranking biomedical annotations with annotator's semantic relevancy.

    PubMed

    Wu, Aihua

    2014-01-01

    Biomedical annotation is a common and affective artifact for researchers to discuss, show opinion, and share discoveries. It becomes increasing popular in many online research communities, and implies much useful information. Ranking biomedical annotations is a critical problem for data user to efficiently get information. As the annotator's knowledge about the annotated entity normally determines quality of the annotations, we evaluate the knowledge, that is, semantic relationship between them, in two ways. The first is extracting relational information from credible websites by mining association rules between an annotator and a biomedical entity. The second way is frequent pattern mining from historical annotations, which reveals common features of biomedical entities that an annotator can annotate with high quality. We propose a weighted and concept-extended RDF model to represent an annotator, a biomedical entity, and their background attributes and merge information from the two ways as the context of an annotator. Based on that, we present a method to rank the annotations by evaluating their correctness according to user's vote and the semantic relevancy between the annotator and the annotated entity. The experimental results show that the approach is applicable and efficient even when data set is large.

  13. Annotation of Fusarium graminearum (PH-1) Version 5.0

    PubMed Central

    Hammond-Kosack, Kim E.

    2017-01-01

    ABSTRACT Fusarium graminearum floral infections are a major risk to the global supply of safe cereal grains. We report updates to the PH-1 reference genome and significant improvements to the annotation. Changes include introduction of legacy annotation identifiers, new gene models, secretome and effectorP predictions, and inclusion of extensive untranslated region (UTR) annotations. PMID:28082505

  14. Phylogenetic molecular function annotation

    NASA Astrophysics Data System (ADS)

    Engelhardt, Barbara E.; Jordan, Michael I.; Repo, Susanna T.; Brenner, Steven E.

    2009-07-01

    It is now easier to discover thousands of protein sequences in a new microbial genome than it is to biochemically characterize the specific activity of a single protein of unknown function. The molecular functions of protein sequences have typically been predicted using homology-based computational methods, which rely on the principle that homologous proteins share a similar function. However, some protein families include groups of proteins with different molecular functions. A phylogenetic approach for predicting molecular function (sometimes called "phylogenomics") is an effective means to predict protein molecular function. These methods incorporate functional evidence from all members of a family that have functional characterizations using the evolutionary history of the protein family to make robust predictions for the uncharacterized proteins. However, they are often difficult to apply on a genome-wide scale because of the time-consuming step of reconstructing the phylogenies of each protein to be annotated. Our automated approach for function annotation using phylogeny, the SIFTER (Statistical Inference of Function Through Evolutionary Relationships) methodology, uses a statistical graphical model to compute the probabilities of molecular functions for unannotated proteins. Our benchmark tests showed that SIFTER provides accurate functional predictions on various protein families, outperforming other available methods.

  15. Phylogenetic molecular function annotation

    PubMed Central

    Engelhardt, Barbara E; Jordan, Michael I; Repo, Susanna T; Brenner, Steven E

    2010-01-01

    It is now easier to discover thousands of protein sequences in a new microbial genome than it is to biochemically characterize the specific activity of a single protein of unknown function. The molecular functions of protein sequences have typically been predicted using homology-based computational methods, which rely on the principle that homologous proteins share a similar function. However, some protein families include groups of proteins with different molecular functions. A phylogenetic approach for predicting molecular function (sometimes called “phylogenomics”) is an effective means to predict protein molecular function. These methods incorporate functional evidence from all members of a family that have functional characterizations using the evolutionary history of the protein family to make robust predictions for the uncharacterized proteins. However, they are often difficult to apply on a genome-wide scale because of the time-consuming step of reconstructing the phylogenies of each protein to be annotated. Our automated approach for function annotation using phylogeny, the SIFTER (Statistical Inference of Function Through Evolutionary Relationships) methodology, uses a statistical graphical model to compute the probabilities of molecular functions for unannotated proteins. Our benchmark tests showed that SIFTER provides accurate functional predictions on various protein families, outperforming other available methods. PMID:20664722

  16. Quality of Computationally Inferred Gene Ontology Annotations

    PubMed Central

    Škunca, Nives; Altenhoff, Adrian; Dessimoz, Christophe

    2012-01-01

    Gene Ontology (GO) has established itself as the undisputed standard for protein function annotation. Most annotations are inferred electronically, i.e. without individual curator supervision, but they are widely considered unreliable. At the same time, we crucially depend on those automated annotations, as most newly sequenced genomes are non-model organisms. Here, we introduce a methodology to systematically and quantitatively evaluate electronic annotations. By exploiting changes in successive releases of the UniProt Gene Ontology Annotation database, we assessed the quality of electronic annotations in terms of specificity, reliability, and coverage. Overall, we not only found that electronic annotations have significantly improved in recent years, but also that their reliability now rivals that of annotations inferred by curators when they use evidence other than experiments from primary literature. This work provides the means to identify the subset of electronic annotations that can be relied upon—an important outcome given that >98% of all annotations are inferred without direct curation. PMID:22693439

  17. Mitochondrial Disease Sequence Data Resource (MSeqDR): a global grass-roots consortium to facilitate deposition, curation, annotation, and integrated analysis of genomic data for the mitochondrial disease clinical and research communities.

    PubMed

    Falk, Marni J; Shen, Lishuang; Gonzalez, Michael; Leipzig, Jeremy; Lott, Marie T; Stassen, Alphons P M; Diroma, Maria Angela; Navarro-Gomez, Daniel; Yeske, Philip; Bai, Renkui; Boles, Richard G; Brilhante, Virginia; Ralph, David; DaRe, Jeana T; Shelton, Robert; Terry, Sharon F; Zhang, Zhe; Copeland, William C; van Oven, Mannis; Prokisch, Holger; Wallace, Douglas C; Attimonelli, Marcella; Krotoski, Danuta; Zuchner, Stephan; Gai, Xiaowu

    2015-03-01

    Success rates for genomic analyses of highly heterogeneous disorders can be greatly improved if a large cohort of patient data is assembled to enhance collective capabilities for accurate sequence variant annotation, analysis, and interpretation. Indeed, molecular diagnostics requires the establishment of robust data resources to enable data sharing that informs accurate understanding of genes, variants, and phenotypes. The "Mitochondrial Disease Sequence Data Resource (MSeqDR) Consortium" is a grass-roots effort facilitated by the United Mitochondrial Disease Foundation to identify and prioritize specific genomic data analysis needs of the global mitochondrial disease clinical and research community. A central Web portal (https://mseqdr.org) facilitates the coherent compilation, organization, annotation, and analysis of sequence data from both nuclear and mitochondrial genomes of individuals and families with suspected mitochondrial disease. This Web portal provides users with a flexible and expandable suite of resources to enable variant-, gene-, and exome-level sequence analysis in a secure, Web-based, and user-friendly fashion. Users can also elect to share data with other MSeqDR Consortium members, or even the general public, either by custom annotation tracks or through the use of a convenient distributed annotation system (DAS) mechanism. A range of data visualization and analysis tools are provided to facilitate user interrogation and understanding of genomic, and ultimately phenotypic, data of relevance to mitochondrial biology and disease. Currently available tools for nuclear and mitochondrial gene analyses include an MSeqDR GBrowse instance that hosts optimized mitochondrial disease and mitochondrial DNA (mtDNA) specific annotation tracks, as well as an MSeqDR locus-specific database (LSDB) that curates variant data on more than 1300 genes that have been implicated in mitochondrial disease and/or encode mitochondria-localized proteins. MSeqDR is

  18. SpeedSeq: Ultra-fast personal genome analysis and interpretation

    PubMed Central

    Chiang, Colby; Layer, Ryan M.; Faust, Gregory G.; Lindberg, Michael R.; Rose, David B.; Garrison, Erik P.; Marth, Gabor T.; Quinlan, Aaron R.; Hall, Ira M.

    2015-01-01

    SpeedSeq is an open-source genome analysis platform that accomplishes alignment, variant detection and functional annotation of a 50× human genome in 13 hours on a low-cost server, alleviating a bioinformatics bottleneck that typically demands weeks of computation with extensive hands-on expert involvement. SpeedSeq offers competitive or superior performance to current methods for detecting germline and somatic single nucleotide variants, indels, and structural variants, and includes novel functionality for streamlined interpretation. PMID:26258291

  19. Integrative structural annotation of de novo RNA-Seq provides an accurate reference gene set of the enormous genome of the onion (Allium cepa L.).

    PubMed

    Kim, Seungill; Kim, Myung-Shin; Kim, Yong-Min; Yeom, Seon-In; Cheong, Kyeongchae; Kim, Ki-Tae; Jeon, Jongbum; Kim, Sunggil; Kim, Do-Sun; Sohn, Seong-Han; Lee, Yong-Hwan; Choi, Doil

    2015-02-01

    The onion (Allium cepa L.) is one of the most widely cultivated and consumed vegetable crops in the world. Although a considerable amount of onion transcriptome data has been deposited into public databases, the sequences of the protein-coding genes are not accurate enough to be used, owing to non-coding sequences intermixed with the coding sequences. We generated a high-quality, annotated onion transcriptome from de novo sequence assembly and intensive structural annotation using the integrated structural gene annotation pipeline (ISGAP), which identified 54,165 protein-coding genes among 165,179 assembled transcripts totalling 203.0 Mb by eliminating the intron sequences. ISGAP performed reliable annotation, recognizing accurate gene structures based on reference proteins, and ab initio gene models of the assembled transcripts. Integrative functional annotation and gene-based SNP analysis revealed a whole biological repertoire of genes and transcriptomic variation in the onion. The method developed in this study provides a powerful tool for the construction of reference gene sets for organisms based solely on de novo transcriptome data. Furthermore, the reference genes and their variation described here for the onion represent essential tools for molecular breeding and gene cloning in Allium spp.

  20. Maize - GO annotation methods, evaluation, and review (Maize-GAMER)

    Technology Transfer Automated Retrieval System (TEKTRAN)

    Making a genome sequence accessible and useful involves three basic steps: genome assembly, structural annotation, and functional annotation. The quality of data generated at each step influences the accuracy of inferences that can be made, with high-quality analyses produce better datasets resultin...

  1. The Integrated Microbial Genomes (IMG) System: An Expanding Comparative Analysis Resource

    SciTech Connect

    Markowitz, Victor M.; Chen, I-Min A.; Palaniappan, Krishna; Chu, Ken; Szeto, Ernest; Grechkin, Yuri; Ratner, Anna; Anderson, Iain; Lykidis, Athanasios; Mavromatis, Konstantinos; Ivanova, Natalia N.; Kyrpides, Nikos C.

    2009-09-13

    The integrated microbial genomes (IMG) system serves as a community resource for comparative analysis of publicly available genomes in a comprehensive integrated context. IMG contains both draft and complete microbial genomes integrated with other publicly available genomes from all three domains of life, together with a large number of plasmids and viruses. IMG provides tools and viewers for analyzing and reviewing the annotations of genes and genomes in a comparative context. Since its first release in 2005, IMG's data content and analytical capabilities have been constantly expanded through regular releases. Several companion IMG systems have been set up in order to serve domain specific needs, such as expert review of genome annotations. IMG is available at .

  2. The platypus genome unraveled.

    PubMed

    O'Brien, Stephen J

    2008-06-13

    The genome of the platypus has been sequenced, assembled, and annotated by an international genomics team. Like the animal itself the platypus genome contains an amalgam of mammal, reptile, and bird-like features.

  3. Visualizing GO Annotations.

    PubMed

    Supek, Fran; Škunca, Nives

    2017-01-01

    Contemporary techniques in biology produce readouts for large numbers of genes simultaneously, the typical example being differential gene expression measurements. Moreover, those genes are often richly annotated using GO terms that describe gene function and that can be used to summarize the results of the genome-scale experiments. However, making sense of such GO enrichment analyses may be challenging. For instance, overrepresented GO functions in a set of differentially expressed genes are typically output as a flat list, a format not adequate to capture the complexities of the hierarchical structure of the GO annotation labels.In this chapter, we survey various methods to visualize large, difficult-to-interpret lists of GO terms. We catalog their availability-Web-based or standalone, the main principles they employ in summarizing large lists of GO terms, and the visualization styles they support. These brief commentaries on each software are intended as a helpful inventory, rather than comprehensive descriptions of the underlying algorithms. Instead, we show examples of their use and suggest that the choice of an appropriate visualization tool may be crucial to the utility of GO in biological discovery.

  4. Oncotator: cancer variant annotation tool.

    PubMed

    Ramos, Alex H; Lichtenstein, Lee; Gupta, Manaswi; Lawrence, Michael S; Pugh, Trevor J; Saksena, Gordon; Meyerson, Matthew; Getz, Gad

    2015-04-01

    Oncotator is a tool for annotating genomic point mutations and short nucleotide insertions/deletions (indels) with variant- and gene-centric information relevant to cancer researchers. This information is drawn from 14 different publicly available resources that have been pooled and indexed, and we provide an extensible framework to add additional data sources. Annotations linked to variants range from basic information, such as gene names and functional classification (e.g. missense), to cancer-specific data from resources such as the Catalogue of Somatic Mutations in Cancer (COSMIC), the Cancer Gene Census, and The Cancer Genome Atlas (TCGA). For local use, Oncotator is freely available as a python module hosted on Github (https://github.com/broadinstitute/oncotator). Furthermore, Oncotator is also available as a web service and web application at http://www.broadinstitute.org/oncotator/.

  5. Expert Biogeographers

    ERIC Educational Resources Information Center

    Bednarski, Marsha

    2006-01-01

    This article describes an alternative way of teaching about biomes by having students become expert biogeographers. In order to become experts students need to first find out what a biogeographer does. Doing an online search lets students find out for themselves what the responsibilities are of people who work in this field. A good place to visit…

  6. yrGATE: a web-based gene-structure annotation tool for the identification and dissemination of eukaryotic genes.

    PubMed

    Wilkerson, Matthew D; Schlueter, Shannon D; Brendel, Volker

    2006-01-01

    Your Gene structure Annotation Tool for Eukaryotes (yrGATE) provides an Annotation Tool and Community Utilities for worldwide web-based community genome and gene annotation. Annotators can evaluate gene structure evidence derived from multiple sources to create gene structure annotations. Administrators regulate the acceptance of annotations into published gene sets. yrGATE is designed to facilitate rapid and accurate annotation of emerging genomes as well as to confirm, refine, or correct currently published annotations. yrGATE is highly portable and supports different standard input and output formats. The yrGATE software and usage cases are available at http://www.plantgdb.org/prj/yrGATE.

  7. Protein Sequence Annotation Tool (PSAT): A centralized web-based meta-server for high-throughput sequence annotations

    SciTech Connect

    Leung, Elo; Huang, Amy; Cadag, Eithon; Montana, Aldrin; Soliman, Jan Lorenz; Zhou, Carol L. Ecale

    2016-01-20

    In this study, we introduce the Protein Sequence Annotation Tool (PSAT), a web-based, sequence annotation meta-server for performing integrated, high-throughput, genome-wide sequence analyses. Our goals in building PSAT were to (1) create an extensible platform for integration of multiple sequence-based bioinformatics tools, (2) enable functional annotations and enzyme predictions over large input protein fasta data sets, and (3) provide a web interface for convenient execution of the tools. In this paper, we demonstrate the utility of PSAT by annotating the predicted peptide gene products of Herbaspirillum sp. strain RV1423, importing the results of PSAT into EC2KEGG, and using the resulting functional comparisons to identify a putative catabolic pathway, thereby distinguishing RV1423 from a well annotated Herbaspirillum species. This analysis demonstrates that high-throughput enzyme predictions, provided by PSAT processing, can be used to identify metabolic potential in an otherwise poorly annotated genome. Lastly, PSAT is a meta server that combines the results from several sequence-based annotation and function prediction codes, and is available at http://psat.llnl.gov/psat/. PSAT stands apart from other sequencebased genome annotation systems in providing a high-throughput platform for rapid de novo enzyme predictions and sequence annotations over large input protein sequence data sets in FASTA. PSAT is most appropriately applied in annotation of large protein FASTA sets that may or may not be associated with a single genome.

  8. Protein Sequence Annotation Tool (PSAT): A centralized web-based meta-server for high-throughput sequence annotations

    DOE PAGES

    Leung, Elo; Huang, Amy; Cadag, Eithon; ...

    2016-01-20

    In this study, we introduce the Protein Sequence Annotation Tool (PSAT), a web-based, sequence annotation meta-server for performing integrated, high-throughput, genome-wide sequence analyses. Our goals in building PSAT were to (1) create an extensible platform for integration of multiple sequence-based bioinformatics tools, (2) enable functional annotations and enzyme predictions over large input protein fasta data sets, and (3) provide a web interface for convenient execution of the tools. In this paper, we demonstrate the utility of PSAT by annotating the predicted peptide gene products of Herbaspirillum sp. strain RV1423, importing the results of PSAT into EC2KEGG, and using the resultingmore » functional comparisons to identify a putative catabolic pathway, thereby distinguishing RV1423 from a well annotated Herbaspirillum species. This analysis demonstrates that high-throughput enzyme predictions, provided by PSAT processing, can be used to identify metabolic potential in an otherwise poorly annotated genome. Lastly, PSAT is a meta server that combines the results from several sequence-based annotation and function prediction codes, and is available at http://psat.llnl.gov/psat/. PSAT stands apart from other sequencebased genome annotation systems in providing a high-throughput platform for rapid de novo enzyme predictions and sequence annotations over large input protein sequence data sets in FASTA. PSAT is most appropriately applied in annotation of large protein FASTA sets that may or may not be associated with a single genome.« less

  9. Analysis and Annotation of Nucleic Acid Sequence

    SciTech Connect

    States, David J.

    2004-07-28

    The aims of this project were to develop improved methods for computational genome annotation and to apply these methods to improve the annotation of genomic sequence data with a specific focus on human genome sequencing. The project resulted in a substantial body of published work. Notable contributions of this project were the identification of basecalling and lane tracking as error processes in genome sequencing and contributions to improved methods for these steps in genome sequencing. This technology improved the accuracy and throughput of genome sequence analysis. Probabilistic methods for physical map construction were developed. Improved methods for sequence alignment, alternative splicing analysis, promoter identification and NF kappa B response gene prediction were also developed.

  10. Dictionary-driven protein annotation.

    PubMed

    Rigoutsos, Isidore; Huynh, Tien; Floratos, Aris; Parida, Laxmi; Platt, Daniel

    2002-09-01

    Computational methods seeking to automatically determine the properties (functional, structural, physicochemical, etc.) of a protein directly from the sequence have long been the focus of numerous research groups. With the advent of advanced sequencing methods and systems, the number of amino acid sequences that are being deposited in the public databases has been increasing steadily. This has in turn generated a renewed demand for automated approaches that can annotate individual sequences and complete genomes quickly, exhaustively and objectively. In this paper, we present one such approach that is centered around and exploits the Bio-Dictionary, a collection of amino acid patterns that completely covers the natural sequence space and can capture functional and structural signals that have been reused during evolution, within and across protein families. Our annotation approach also makes use of a weighted, position-specific scoring scheme that is unaffected by the over-representation of well-conserved proteins and protein fragments in the databases used. For a given query sequence, the method permits one to determine, in a single pass, the following: local and global similarities between the query and any protein already present in a public database; the likeness of the query to all available archaeal/ bacterial/eukaryotic/viral sequences in the database as a function of amino acid position within the query; the character of secondary structure of the query as a function of amino acid position within the query; the cytoplasmic, transmembrane or extracellular behavior of the query; the nature and position of binding domains, active sites, post-translationally modified sites, signal peptides, etc. In terms of performance, the proposed method is exhaustive, objective and allows for the rapid annotation of individual sequences and full genomes. Annotation examples are presented and discussed in Results, including individual queries and complete genomes that were

  11. Comparative genomic survey, exon-intron annotation and phylogenetic analysis of NAT-homologous sequences in archaea, protists, fungi, viruses, and invertebrates

    Technology Transfer Automated Retrieval System (TEKTRAN)

    We have previously published extensive genomic surveys [1-3], reporting NAT-homologous sequences in hundreds of sequenced bacterial, fungal and vertebrate genomes. We present here the results of our latest search of 2445 genomes, representing 1532 (70 archaeal, 1210 bacterial, 43 protist, 97 fungal,...

  12. Mining GO annotations for improving annotation consistency.

    PubMed

    Faria, Daniel; Schlicker, Andreas; Pesquita, Catia; Bastos, Hugo; Ferreira, António E N; Albrecht, Mario; Falcão, André O

    2012-01-01

    Despite the structure and objectivity provided by the Gene Ontology (GO), the annotation of proteins is a complex task that is subject to errors and inconsistencies. Electronically inferred annotations in particular are widely considered unreliable. However, given that manual curation of all GO annotations is unfeasible, it is imperative to improve the quality of electronically inferred annotations. In this work, we analyze the full GO molecular function annotation of UniProtKB proteins, and discuss some of the issues that affect their quality, focusing particularly on the lack of annotation consistency. Based on our analysis, we estimate that 64% of the UniProtKB proteins are incompletely annotated, and that inconsistent annotations affect 83% of the protein functions and at least 23% of the proteins. Additionally, we present and evaluate a data mining algorithm, based on the association rule learning methodology, for identifying implicit relationships between molecular function terms. The goal of this algorithm is to assist GO curators in updating GO and correcting and preventing inconsistent annotations. Our algorithm predicted 501 relationships with an estimated precision of 94%, whereas the basic association rule learning methodology predicted 12,352 relationships with a precision below 9%.

  13. Expert Seeker

    NASA Technical Reports Server (NTRS)

    Fernandez, Becerra

    2003-01-01

    Expert Seeker is a computer program of the knowledge-management-system (KMS) type that falls within the category of expertise-locator systems. The main goal of the KMS system implemented by Expert Seeker is to organize and distribute knowledge of who are the domain experts within and without a given institution, company, or other organization. The intent in developing this KMS was to enable the re-use of organizational knowledge and provide a methodology for querying existing information (including structured, semistructured, and unstructured information) in a way that could help identify organizational experts. More specifically, Expert Seeker was developed to make it possible, by use of an intranet, to do any or all of the following: Assist an employee in identifying who has the skills needed for specific projects and to determine whether the experts so identified are available. Assist managers in identifying employees who may need training opportunities. Assist managers in determining what expertise is lost when employees retire or otherwise leave. Facilitate the development of new ways of identifying opportunities for innovation and minimization of duplicated efforts. Assist employees in achieving competitive advantages through the application of knowledge-management concepts and related systems. Assist external organizations in requesting speakers for specific engagements or determining from whom they might be able to request help via electronic mail. Help foster an environment of collaboration for rapid development in today's environment, in which it is increasingly necessary to assemble teams of experts from government, universities, research laboratories, and industries, to quickly solve problems anytime, anywhere. Make experts more visible. Provide a central repository of information about employees, including information that, heretofore, has typically not been captured by the human-resources systems (e.g., information about past projects, patents, or

  14. MEETING: Chlamydomonas Annotation Jamboree - October 2003

    SciTech Connect

    Grossman, Arthur R

    2007-04-13

    Shotgun sequencing of the nuclear genome of Chlamydomonas reinhardtii (Chlamydomonas throughout) was performed at an approximate 10X coverage by JGI. Roughly half of the genome is now contained on 26 scaffolds, all of which are at least 1.6 Mb, and the coverage of the genome is ~95%. There are now over 200,000 cDNA sequence reads that we have generated as part of the Chlamydomonas genome project (Grossman, 2003; Shrager et al., 2003; Grossman et al. 2007; Merchant et al., 2007); other sequences have also been generated by the Kasuza sequence group (Asamizu et al., 1999; Asamizu et al., 2000) or individual laboratories that have focused on specific genes. Shrager et al. (2003) placed the reads into distinct contigs (an assemblage of reads with overlapping nucleotide sequences), and contigs that group together as part of the same genes have been designated ACEs (assembly of contigs generated from EST information). All of the reads have also been mapped to the Chlamydomonas nuclear genome and the cDNAs and their corresponding genomic sequences have been reassembled, and the resulting assemblage is called an ACEG (an Assembly of contiguous EST sequences supported by genomic sequence) (Jain et al., 2007). Most of the unique genes or ACEGs are also represented by gene models that have been generated by the Joint Genome Institute (JGI, Walnut Creek, CA). These gene models have been placed onto the DNA scaffolds and are presented as a track on the Chlamydomonas genome browser associated with the genome portal (http://genome.jgi-psf.org/Chlre3/Chlre3.home.html). Ultimately, the meeting grant awarded by DOE has helped enormously in the development of an annotation pipeline (a set of guidelines used in the annotation of genes) and resulted in high quality annotation of over 4,000 genes; the annotators were from both Europe and the USA. Some of the people who led the annotation initiative were Arthur Grossman, Olivier Vallon, and Sabeeha Merchant (with many individual

  15. Annotations in Refseq (GSC8 Meeting)

    ScienceCinema

    Tatusova, Tatiana

    2016-07-12

    The Genomic Standards Consortium was formed in September 2005. It is an international, open-membership working body which promotes standardization in the description of genomes and the exchange and integration of genomic data. The 2009 meeting was an activity of a five-year funding "Research Coordination Network" from the National Science Foundation and was organized held at the DOE Joint Genome Institute with organizational support provided by the JGI and by the University of California - San Diego. Tatiana Tatusova of NCBI discusses "Annotations in Refseq" at the Genomic Standards Consortium's 8th meeting at the DOE JGI in Walnut Creek, Calif. on Sept. 10, 2009.

  16. Annotations in Refseq (GSC8 Meeting)

    SciTech Connect

    Tatusova, Tatiana

    2009-09-10

    The Genomic Standards Consortium was formed in September 2005. It is an international, open-membership working body which promotes standardization in the description of genomes and the exchange and integration of genomic data. The 2009 meeting was an activity of a five-year funding "Research Coordination Network" from the National Science Foundation and was organized held at the DOE Joint Genome Institute with organizational support provided by the JGI and by the University of California - San Diego. Tatiana Tatusova of NCBI discusses "Annotations in Refseq" at the Genomic Standards Consortium's 8th meeting at the DOE JGI in Walnut Creek, Calif. on Sept. 10, 2009.

  17. Discovering gene annotations in biomedical text databases

    PubMed Central

    Cakmak, Ali; Ozsoyoglu, Gultekin

    2008-01-01

    Background Genes and gene products are frequently annotated with Gene Ontology concepts based on the evidence provided in genomics articles. Manually locating and curating information about a genomic entity from the biomedical literature requires vast amounts of human effort. Hence, there is clearly a need forautomated computational tools to annotate the genes and gene products with Gene Ontology concepts by computationally capturing the related knowledge embedded in textual data. Results In this article, we present an automated genomic entity annotation system, GEANN, which extracts information about the characteristics of genes and gene products in article abstracts from PubMed, and translates the discoveredknowledge into Gene Ontology (GO) concepts, a widely-used standardized vocabulary of genomic traits. GEANN utilizes textual "extraction patterns", and a semantic matching framework to locate phrases matching to a pattern and produce Gene Ontology annotations for genes and gene products. In our experiments, GEANN has reached to the precision level of 78% at therecall level of 61%. On a select set of Gene Ontology concepts, GEANN either outperforms or is comparable to two other automated annotation studies. Use of WordNet for semantic pattern matching improves the precision and recall by 24% and 15%, respectively, and the improvement due to semantic pattern matching becomes more apparent as the Gene Ontology terms become more general. Conclusion GEANN is useful for two distinct purposes: (i) automating the annotation of genomic entities with Gene Ontology concepts, and (ii) providing existing annotations with additional "evidence articles" from the literature. The use of textual extraction patterns that are constructed based on the existing annotations achieve high precision. The semantic pattern matching framework provides a more flexible pattern matching scheme with respect to "exactmatching" with the advantage of locating approximate pattern occurrences with

  18. Annotation Graphs: A Graph-Based Visualization for Meta-Analysis of Data Based on User-Authored Annotations.

    PubMed

    Zhao, Jian; Glueck, Michael; Breslav, Simon; Chevalier, Fanny; Khan, Azam

    2017-01-01

    User-authored annotations of data can support analysts in the activity of hypothesis generation and sensemaking, where it is not only critical to document key observations, but also to communicate insights between analysts. We present annotation graphs, a dynamic graph visualization that enables meta-analysis of data based on user-authored annotations. The annotation graph topology encodes annotation semantics, which describe the content of and relations between data selections, comments, and tags. We present a mixed-initiative approach to graph layout that integrates an analyst's manual manipulations with an automatic method based on similarity inferred from the annotation semantics. Various visual graph layout styles reveal different perspectives on the annotation semantics. Annotation graphs are implemented within C8, a system that supports authoring annotations during exploratory analysis of a dataset. We apply principles of Exploratory Sequential Data Analysis (ESDA) in designing C8, and further link these to an existing task typology in the visualization literature. We develop and evaluate the system through an iterative user-centered design process with three experts, situated in the domain of analyzing HCI experiment data. The results suggest that annotation graphs are effective as a method of visually extending user-authored annotations to data meta-analysis for discovery and organization of ideas.

  19. Genome-wide annotation of mutations in a phenotyped mutant library provides an efficient platform for discovery of casual gene mutations

    Technology Transfer Automated Retrieval System (TEKTRAN)

    Ethyl methanesulfonate (EMS) efficiently generates high-density mutations in genomes. Conventionally, these mutations are identified by techniques that can detect single-nucleotide mismatches in heteroduplexes of individual PCR amplicons. We applied whole-genome sequencing to 256-phenotyped mutant l...

  20. Draft Genome Sequence and Annotation of Phyllosphere-Persisting Salmonella enterica subsp. enterica Serovar Livingstone Strain CKY-S4, Isolated from an Urban Lake in Regina, Canada.

    PubMed

    Tambalo, Dinah D; Perry, Benjamin J; Fitzgerald, Stephen F; Cameron, Andrew D S; Yost, Christopher K

    2015-08-13

    Here, we report the first draft genome sequence of Salmonella enterica subsp. enterica serovar Livingstone. This S. Livingstone strain CKY-S4 displayed biofilm formation and cellulose production and could persist on lettuce. This genome may help the study of mechanisms by which enteric pathogens colonize food crops.

  1. Isolation, Identification, Whole-Genome Sequencing, and Annotation of Four Bacillus Species, B. anthracis RIT375, B. circulans RIT379, B. altitudinis RIT380, and B. megaterium RIT381, from Internal Stem Tissue of the Insulin Plant Costus igneus

    PubMed Central

    Polter, Steven J.; Caraballo, Alexander A.; Lee, Yin P.; Eng, Wilhelm W. H.; Gan, Han M.; Wheatley, Matthew S.; Savka, Michael A.; Thomas, Bolaji N.

    2015-01-01

    Here, we report the isolation, identification, whole-genome sequencing, and annotation of four Bacillus species from internal stem tissue of the insulin plant Costus igneus, grown in Puerto Rico. The plant is of medicinal importance, as extracts from its leaves have been shown to lower blood sugar levels of hyperglycemic rats. PMID:26227604

  2. CRISPRdigger: detecting CRISPRs with better direct repeat annotations

    PubMed Central

    Ge, Ruiquan; Mai, Guoqin; Wang, Pu; Zhou, Manli; Luo, Youxi; Cai, Yunpeng; Zhou, Fengfeng

    2016-01-01

    Clustered regularly interspaced short palindromic repeats (CRISPRs) are important genetic elements in many bacterial and archaeal genomes, and play a key role in prokaryote immune systems’ fight against invasive foreign elements. The CRISPR system has also been engineered to facilitate target gene editing in eukaryotic genomes. Using the common features of mis-annotated CRISPRs in prokaryotic genomes, this study proposed an accurate de novo CRISPR annotation program CRISPRdigger, which can take a partially assembled genome as its input. A comprehensive comparison with the three existing programs demonstrated that CRISPRdigger can recover more Direct Repeats (DRs) for CRISPRs and achieve a higher accuracy for a query genome. The program was implemented by Perl and all the parameters had default values, so that a user could annotate CRISPRs in a query genome by supplying only a genome sequence in the FASTA format. All the supplementary data are available at http://www.healthinformaticslab.org/supp/. PMID:27596864

  3. Genotator: A Workbench for Sequence Annotation

    SciTech Connect

    Harris, N.L.

    1997-05-01

    Sequencing centers such as the Human Genome Center at LBNL are producing an ever-increasing flood of genetic data. Annotation can greatly enhance the biological value of these sequences. Useful annotations include possible gene locations, homologies to known genes, and gene signals such as promoters and splice sites. Genotator is a workbench for automated sequence annotation and annotation browsing. The back end runs a series of sequence analysis tools on a DNA sequence, handling the various input and output formats required by the tools. Genotator currently runs five different gene finding programs, three homology searches, and searches for promoters, splice sites, and ORFs. The results of the analyses run by Genotator can be viewed with the interactive graphical browser. The browser displays color-coded sequence annotations on a canvas that can be scrolled and zoomed, allowing the annotated sequence to be explored at multiple levels of detail. The user can view the actual DNA sequence in a separate window; when a region is selected in the map display, it is automatically highlighted in the sequence display, and vice-versa. By displaying the output of all of the sequence analyses, Genotator provides an intuitive way to identify the significant regions (for example, probable exons) in a sequence. Users can interactively add personal annotations to label regions of interest. Additional capabilities of Genotator include primer design and pattern searching.

  4. High-Quality Genome Assembly and Annotation for Plasmodium coatneyi, Generated Using Single-Molecule Real-Time PacBio Technology.

    PubMed

    Chien, Jung-Ting; Pakala, Suman B; Geraldo, Juliana A; Lapp, Stacey A; Humphrey, Jay C; Barnwell, John W; Kissinger, Jessica C; Galinski, Mary R

    2016-09-01

    Plasmodium coatneyi is a protozoan parasite species that causes simian malaria and is an excellent model for studying disease caused by the human malaria parasite, P. falciparum Here we report the complete (nontelomeric) genome sequence of P. coatneyi Hackeri generated by the application of only Pacific Biosciences RS II (PacBio RS II) single-molecule real-time (SMRT) high-resolution sequence technology and assembly using the Hierarchical Genome Assembly Process (HGAP). This is the first Plasmodium genome sequence reported to use only PacBio technology. This approach has proven to be superior to short-read only approaches for this species.

  5. Gene and alternative splicing annotation with AIR

    PubMed Central

    Florea, Liliana; Di Francesco, Valentina; Miller, Jason; Turner, Russell; Yao, Alison; Harris, Michael; Walenz, Brian; Mobarry, Clark; Merkulov, Gennady V.; Charlab, Rosane; Dew, Ian; Deng, Zuoming; Istrail, Sorin; Li, Peter; Sutton, Granger

    2005-01-01

    Designing effective and accurate tools for identifying the functional and structural elements in a genome remains at the frontier of genome annotation owing to incompleteness and inaccuracy of the data, limitations in the computational models, and shifting paradigms in genomics, such as alternative splicing. We present a methodology for the automated annotation of genes and their alternatively spliced mRNA transcripts based on existing cDNA and protein sequence evidence from the same species or projected from a related species using syntenic mapping information. At the core of the method is the splice graph, a compact representation of a gene, its exons, introns, and alternatively spliced isoforms. The putative transcripts are enumerated from the graph and assigned confidence scores based on the strength of sequence evidence, and a subset of the high-scoring candidates are selected and promoted into the annotation. The method is highly selective, eliminating the unlikely candidates while retaining 98% of the high-quality mRNA evidence in well-formed transcripts, and produces annotation that is measurably more accurate than some evidence-based gene sets. The process is fast, accurate, and fully automated, and combines the traditionally distinct gene annotation and alternative splicing detection processes in a comprehensive and systematic way, thus considerably aiding in the ensuing manual curation efforts. PMID:15632090

  6. Towards a Consensus Annotation System (GSC8 Meeting)

    ScienceCinema

    White, Owen [University of Maryland

    2016-07-12

    The Genomic Standards Consortium was formed in September 2005. It is an international, open-membership working body which promotes standardization in the description of genomes and the exchange and integration of genomic data. The 2009 meeting was an activity of a five-year funding "Research Coordination Network" from the National Science Foundation and was organized held at the DOE Joint Genome Institute with organizational support provided by the JGI and by the University of California - San Diego. "Comparing Annotations: Towards Consensus Annotation" at the Genomic Standards Consortium's 8th meeting at the DOE JGI in Walnut Creek, Calif. on Sept. 10, 2009

  7. In vitro and in silico annotation of conserved and nonconserved microRNAs in the genome of the marsupial Monodelphis domestica.

    PubMed

    Devor, Eric J; Samollow, Paul B

    2008-01-01

    The gray short-tailed opossum, Monodelphis domestica, is the world's most widely utilized marsupial model for biomedical research. Recent completion of the initial M. domestica genome assembly offers the first opportunity to examine genome-wide phenomena in a marsupial. Using in silico methods, we have mapped 124 conserved microRNAs (miRNAs) to 94 loci in the M. domestica genome. In addition, using RNA pooled from 5 tissues, we cloned 85 miRNAs. Seventy-two of these are conserved miRNAs that we had mapped in silico. The additional 13 are nonconserved candidate miRNAs in 11 loci. Nine of these 13 are also found in the wallaby (Macropus eugenii) genome. Two of the candidate miRNA clones, located on the X chromosome, are part of a cluster containing a total of 24 potential miRNAs spanning more than 100 kb.

  8. k-merSNP discovery: Software for alignment-and reference-free scalable SNP discovery, phylogenetics, and annotation for hundreds of microbial genomes

    SciTech Connect

    2014-11-18

    With the flood of whole genome finished and draft microbial sequences, we need faster, more scalable bioinformatics tools for sequence comparison. An algorithm is described to find single nucleotide polymorphisms (SNPs) in whole genome data. It scales to hundreds of bacterial or viral genomes, and can be used for finished and/or draft genomes available as unassembled contigs or raw, unassembled reads. The method is fast to compute, finding SNPs and building a SNP phylogeny in minutes to hours, depending on the size and diversity of the input sequences. The SNP-based trees that result are consistent with known taxonomy and trees determined in other studies. The approach we describe can handle many gigabases of sequence in a single run. The algorithm is based on k-mer analysis.

  9. Genome-wide Annotation and Comparative Analysis of Long Terminal Repeat Retrotransposons between Pear Species of P. bretschneideri and P. Communis

    PubMed Central

    Yin, Hao; Du, Jianchang; Wu, Jun; Wei, Shuwei; Xu, Yingxiu; Tao, Shutian; Wu, Juyou; Zhang, Shaoling

    2015-01-01

    Recent sequencing of the Oriental pear (P. bretschneideri Rehd.) genome and the availability of the draft genome sequence of Occidental pear (P. communis L.), has provided a good opportunity to characterize the abundance, distribution, timing, and evolution of long terminal repeat retrotransposons (LTR-RTs) in these two important fruit plants. Here, a total of 7247 LTR-RTs, which can be classified into 148 families, have been identified in the assembled Oriental pear genome. Unlike in other plant genomes, approximately 90% of these elements were found to be randomly distributed along the pear chromosomes. Further analysis revealed that the amplification timeframe of elements varies dramatically in different families, super-families and lineages, and the Copia-like elements have highest activity in the recent 0.5 million years (Mys). The data also showed that two genomes evolved with similar evolutionary rates after their split from the common ancestor ~0.77–1.66 million years ago (Mya). Overall, the data provided here will be a valuable resource for further investigating the impact of transposable elements on gene structure, expression, and epigenetic modification in the pear genomes. PMID:26631625

  10. Expert judgment and expert systems

    SciTech Connect

    Mumpower, J.; Phillips, L.D.; Renn, O.; Uppuluri, V.R.R.

    1987-01-01

    This volume collects researchers from the fields of psychology, decision analysis, and artificial intelligence. The purposes were to assess similarities, differences, and complementarities among the three approaches to the study of expert judgment; to evaluate their relative strengths and weaknesses; and to propose profitable linkages between them. Each of the papers in the present volume is directed toward one or more of these goals.

  11. Insights into the annotated genome sequence of Methanoculleus bourgensis MS2(T), related to dominant methanogens in biogas-producing plants.

    PubMed

    Maus, Irena; Wibberg, Daniel; Stantscheff, Robbin; Stolze, Yvonne; Blom, Jochen; Eikmeyer, Felix-Gregor; Fracowiak, Jochen; König, Helmut; Pühler, Alfred; Schlüter, Andreas

    2015-05-10

    The final step of the biogas production process, the methanogenesis, is frequently dominated by members of the genus Methanoculleus. In particular, the species Methanoculleus bourgensis was identified to play a role in different biogas reactor systems. The genome of the type strain M. bourgensis MS2(T), originally isolated from a sewage sludge digestor, was completely sequenced to analyze putative adaptive genome features conferring competitiveness within biogas reactor environments to the strain. Sequencing and assembly of the M. bourgensis MS2(T) genome yielded a chromosome with a size of 2,789,773 bp. Comparative analysis of M. bourgensis MS2(T) and Methanoculleus marisnigri JR1 revealed significant similarities. The absence of genes for a putative ammonium uptake system may indicate that M. bourgensis MS2(T) is adapted to environments rich in ammonium/ammonia. Specific genes featuring predicted functions in the context of osmolyte production were detected in the genome of M. bourgensis MS2(T). Mapping of metagenome sequences derived from a production-scale biogas plant revealed that M. bourgensis MS2(T) almost completely comprises the genetic information of dominant methanogens present in the biogas reactor analyzed. Hence, availability of the M. bourgensis MS2(T) genome sequence may be valuable regarding further research addressing the performance of Methanoculleus species in agricultural biogas plants.

  12. Computing human image annotation.

    PubMed

    Channin, David S; Mongkolwat, Pattanasak; Kleper, Vladimir; Rubin, Daniel L

    2009-01-01

    An image annotation is the explanatory or descriptive information about the pixel data of an image that is generated by a human (or machine) observer. An image markup is the graphical symbols placed over the image to depict an annotation. In the majority of current, clinical and research imaging practice, markup is captured in proprietary formats and annotations are referenced only in free text radiology reports. This makes these annotations difficult to query, retrieve and compute upon, hampering their integration into other data mining and analysis efforts. This paper describes the National Cancer Institute's Cancer Biomedical Informatics Grid's (caBIG) Annotation and Image Markup (AIM) project, focusing on how to use AIM to query for annotations. The AIM project delivers an information model for image annotation and markup. The model uses controlled terminologies for important concepts. All of the classes and attributes of the model have been harmonized with the other models and common data elements in use at the National Cancer Institute. The project also delivers XML schemata necessary to instantiate AIMs in XML as well as a software application for translating AIM XML into DICOM S/R and HL7 CDA. Large collections of AIM annotations can be built and then queried as Grid or Web services. Using the tools of the AIM project, image annotations and their markup can be captured and stored in human and machine readable formats. This enables the inclusion of human image observation and inference as part of larger data mining and analysis activities.

  13. ATGC database and ATGC-COGs: an updated resource for micro- and macro-evolutionary studies of prokaryotic genomes and protein family annotation

    PubMed Central

    Kristensen, David M.; Wolf, Yuri I.; Koonin, Eugene V.

    2017-01-01

    The Alignable Tight Genomic Clusters (ATGCs) database is a collection of closely related bacterial and archaeal genomes that provides several tools to aid research into evolutionary processes in the microbial world. Each ATGC is a taxonomy-independent cluster of 2 or more completely sequenced genomes that meet the objective criteria of a high degree of local gene order (synteny) and a small number of synonymous substitutions in the protein-coding genes. As such, each ATGC is suited for analysis of microevolutionary variations within a cohesive group of organisms (e.g. species), whereas the entire collection of ATGCs is useful for macroevolutionary studies. The ATGC database includes many forms of pre-computed data, in particular ATGC-COGs (Clusters of Orthologous Genes), multiple sequence alignments, a set of ‘index’ orthologs representing the most well-conserved members of each ATGC-COG, the phylogenetic tree of the organisms within each ATGC, etc. Although the ATGC database contains several million proteins from thousands of genomes organized into hundreds of clusters (roughly a 4-fold increase since the last version of the ATGC database), it is now built with completely automated methods and will be regularly updated following new releases of the NCBI RefSeq database. The ATGC database is hosted jointly at the University of Iowa at dmk-brain.ecn.uiowa.edu/ATGC/ and the NCBI at ftp.ncbi.nlm.nih.gov/pub/kristensen/ATGC/atgc_home.html. PMID:28053163

  14. Genome sequence and annotation of Trichoderma parareesei, the ancestor of the cellulase producer Trichoderma reesei

    SciTech Connect

    Yang, Dongqing; Pomraning, Kyle; Kopchinskiy, Alexey; Karimi, Aghcheh Razieh; Atanasova, Lea; Chenthamara, Komal; Baker, Scott E.; Zhang, Ruifu; Shen, Qirong; Freitag, Michael; Kubicek, Christian P.; Druzhinina, Irina S.

    2015-08-13

    The filamentous fungus Trichoderma parareesei is the asexually reproducing ancestor of Trichoderma reesei, the holomorphic industrial producer of cellulase and hemicellulase. Here, we present the genome sequence of the T. parareesei type strain CBS 125925, which contains genes for 9,318 proteins.

  15. The Genomic Landscape of TP53 and p53 Annotated High Grade Ovarian Serous Carcinomas from a Defined Founder Population Associated with Patient Outcome

    PubMed Central

    Wojnarowicz, Paulina M.; Oros, Kathleen Klein; Quinn, Michael C. J.; Arcand, Suzanna L.; Gambaro, Karen; Madore, Jason; Birch, Ashley H.; de Ladurantaye, Manon; Rahimi, Kurosh; Provencher, Diane M.; Mes-Masson, Anne-Marie; Greenwood, Celia M. T.; Tonin, Patricia N.

    2012-01-01

    High-grade ovarian serous carcinomas (HGSC) are characterized by TP53 mutations and non-random patterns of chromosomal anomalies, where the nature of the TP53 mutation may correlate with clinical outcome. However, the frequency of common somatic genomic events occurring in HGSCs from demographically defined populations has not been explored. Whole genome SNP array, and TP53 mutation, gene and protein expression analyses were assessed in 87 confirmed HGSC samples with clinical correlates from French Canadians, a population exhibiting strong founder effects, and results were compared with independent reports describing similar analyses from unselected populations. TP53 mutations were identified in 91% of HGSCs. Anomalies observed in more than 50% of TP53 mutation-positive HGSCs involved gains of 3q, 8q and 20q, and losses of 4q, 5q, 6q, 8p, 13q, 16q, 17p, 17q, 22q and Xp. Nearly 400 regions of non-overlapping amplification or deletion were identified, where 178 amplifications and 98 deletions involved known genes. The subgroup expressing mutant p53 protein exhibited significantly prolonged overall and disease-free survival as compared with the p53 protein null subgroup. Interestingly, a comparative analysis of genomic landscapes revealed a significant enrichment of gains involving 1q, 8q, and 12p intervals in the subgroup expressing mutant p53 protein as compared with the p53 protein null subgroup. Although the findings show that the frequency of TP53 mutations and the genomic landscapes observed in French Canadian samples were similar to those reported for samples from unselected populations, there were differences in the magnitude of global gains/losses of specific chromosomal arms and in the spectrum of amplifications and deletions involving focal regions in individual samples. The findings from our comparative genomic analyses also support the notion that there may be biological differences between HGSCs that could be related to the nature of the TP53 mutation

  16. The GATO gene annotation tool for research laboratories.

    PubMed

    Fujita, A; Massirer, K B; Durham, A M; Ferreira, C E; Sogayar, M C

    2005-11-01

    Large-scale genome projects have generated a rapidly increasing number of DNA sequences. Therefore, development of computational methods to rapidly analyze these sequences is essential for progress in genomic research. Here we present an automatic annotation system for preliminary analysis of DNA sequences. The gene annotation tool (GATO) is a Bioinformatics pipeline designed to facilitate routine functional annotation and easy access to annotated genes. It was designed in view of the frequent need of genomic researchers to access data pertaining to a common set of genes. In the GATO system, annotation is generated by querying some of the Web-accessible resources and the information is stored in a local database, which keeps a record of all previous annotation results. GATO may be accessed from everywhere through the internet or may be run locally if a large number of sequences are going to be annotated. It is implemented in PHP and Perl and may be run on any suitable Web server. Usually, installation and application of annotation systems require experience and are time consuming, but GATO is simple and practical, allowing anyone with basic skills in informatics to access it without any special training. GATO can be downloaded at [http://mariwork.iq.usp.br/gato/]. Minimum computer free space required is 2 MB.

  17. PhytoPath: an integrative resource for plant pathogen genomics

    PubMed Central

    Pedro, Helder; Maheswari, Uma; Urban, Martin; Irvine, Alistair George; Cuzick, Alayne; McDowall, Mark D.; Staines, Daniel M.; Kulesha, Eugene; Hammond-Kosack, Kim Elizabeth; Kersey, Paul Julian

    2016-01-01

    PhytoPath (www.phytopathdb.org) is a resource for genomic and phenotypic data from plant pathogen species, that integrates phenotypic data for genes from PHI-base, an expertly curated catalog of genes with experimentally verified pathogenicity, with the Ensembl tools for data visualization and analysis. The resource is focused on fungi, protists (oomycetes) and bacterial plant pathogens that have genomes that have been sequenced and annotated. Genes with associated PHI-base data can be easily identified across all plant pathogen species using a BioMart-based query tool and visualized in their genomic context on the Ensembl genome browser. The PhytoPath resource contains data for 135 genomic sequences from 87 plant pathogen species, and 1364 genes curated for their role in pathogenicity and as targets for chemical intervention. Support for community annotation of gene models is provided using the WebApollo online gene editor, and we are working with interested communities to improve reference annotation for selected species. PMID:26476449

  18. A Novel Approach to Semantic and Coreference Annotation at LLNL

    SciTech Connect

    Firpo, M

    2005-02-04

    A case is made for the importance of high quality semantic and coreference annotation. The challenges of providing such annotation are described. Asperger's Syndrome is introduced, and the connections are drawn between the needs of text annotation and the abilities of persons with Asperger's Syndrome to meet those needs. Finally, a pilot program is recommended wherein semantic annotation is performed by people with Asperger's Syndrome. The primary points embodied in this paper are as follows: (1) Document annotation is essential to the Natural Language Processing (NLP) projects at Lawrence Livermore National Laboratory (LLNL); (2) LLNL does not currently have a system in place to meet its need for text annotation; (3) Text annotation is challenging for a variety of reasons, many related to its very rote nature; (4) Persons with Asperger's Syndrome are particularly skilled at rote verbal tasks, and behavioral experts agree that they would excel at text annotation; and (6) A pilot study is recommend in which two to three people with Asperger's Syndrome annotate documents and then the quality and throughput of their work is evaluated relative to that of their neuro-typical peers.

  19. Biosynthesis of Akaeolide and Lorneic Acids and Annotation of Type I Polyketide Synthase Gene Clusters in the Genome of Streptomyces sp. NPS554

    PubMed Central

    Zhou, Tao; Komaki, Hisayuki; Ichikawa, Natsuko; Hosoyama, Akira; Sato, Seizo; Igarashi, Yasuhiro

    2015-01-01

    The incorporation pattern of biosynthetic precursors into two structurally unique polyketides, akaeolide and lorneic acid A, was elucidated by feeding experiments with 13C-labeled precursors. In addition, the draft genome sequence of the producer, Streptomyces sp. NPS554, was performed and the biosynthetic gene clusters for these polyketides were identified. The putative gene clusters contain all the polyketide synthase (PKS) domains necessary for assembly of the carbon skeletons. Combined with the 13C-labeling results, gene function prediction enabled us to propose biosynthetic pathways involving unusual carbon-carbon bond formation reactions. Genome analysis also indicated the presence of at least ten orphan type I PKS gene clusters that might be responsible for the production of new polyketides. PMID:25603349

  20. Annotated Humanities Programs.

    ERIC Educational Resources Information Center

    Adler, Richard R.; Applebee, Arthur

    The humanities programs offered in 1968 by 227 United States secondary schools are listed alphabetically by state, including almost 100 new programs not annotated in the 1967 listing (see TE 000 224). Each annotation presents a brief description of the approach to study used in the particular humanities course (e.g., American Studies, Culture…

  1. The annotated complete DNA sequence of Enterococcus faecalis bacteriophage φEf11 and its comparison with all available phage and predicted prophage genomes.

    PubMed

    Stevens, Roy H; Ektefaie, Mahmoud R; Fouts, Derrick E

    2011-04-01

    φEf11 is a temperate Siphoviridae bacteriophage isolated by induction from a lysogenic Enterococcus faecalis strain. The φEf11 DNA was completely sequenced and found to be 42,822 bp in length, with a G+C mol% of 34.4%. Genome analysis revealed 65 ORFs, accounting for 92.8% of the DNA content. All except for seven of the ORFs displayed sequence similarities to previously characterized proteins. The genes were arranged in functional modules, organized similar to that of several other phages of low GC Gram-positive bacteria; however, the number and arrangement of lysis-related genes were atypical of these bacteriophages. A 159 bp noncoding region between predicted cI and cro genes is highly similar to the functionally characterized early promoter region of lactococcal temperate phage TP901-1, and possessed a predicted stem-loop structure in between predicted P(L) and P(R) promoters, suggesting a novel mechanism of repression of these two bacteriophages from the λ paradigm. Comparison with all available phage and predicted prophage genomes revealed that the φEf11 genome displays unique features, suggesting that φEf11 may be a novel member of a larger family of temperate prophages that also includes lactococcal phages. Trees based on the blast score ratio grouped this family by tail fiber similarity, suggesting that these trees are useful for identifying phages with similar tail fibers.

  2. PREPACT 2.0: Predicting C-to-U and U-to-C RNA Editing in Organelle Genome Sequences with Multiple References and Curated RNA Editing Annotation

    PubMed Central

    Lenz, Henning; Knoop, Volker

    2013-01-01

    RNA editing is vast in some genetic systems, with up to thousands of targeted C-to-U and U-to-C substitutions in mitochondria and chloroplasts of certain plants. Efficient prognoses of RNA editing in organelle genomes will help to reveal overlooked cases of editing. We present PREPACT 2.0 (http://www.prepact.de) with numerous enhancements of our previously developed Plant RNA Editing Prediction & Analysis Computer Tool. Reference organelle transcriptomes for editing prediction have been extended and reorganized to include 19 curated mitochondrial and 13 chloroplast genomes, now allowing to distinguish RNA editing sites from “pre-edited” sites. Queries may be run against multiple references and a new “commons” function identifies and highlights orthologous candidate editing sites congruently predicted by multiple references. Enhancements to the BLASTX mode in PREPACT 2.0 allow querying of complete novel organelle genomes within a few minutes, identifying protein genes and candidate RNA editing sites simultaneously without prior user analyses. PMID:23362369

  3. VariantAnnotation: a Bioconductor package for exploration and annotation of genetic variants

    PubMed Central

    Obenchain, Valerie; Lawrence, Michael; Carey, Vincent; Gogarten, Stephanie; Shannon, Paul; Morgan, Martin

    2014-01-01

    Summary: VariantAnnotation is an R / Bioconductor package for the exploration and annotation of genetic variants. Capabilities exist for reading, writing and filtering variant call format (VCF) files. VariantAnnotation allows ready access to additional R / Bioconductor facilities for advanced statistical analysis, data transformation, visualization and integration with diverse genomic resources. Availability and implementation: This package is implemented in R and available for download at the Bioconductor Web site (http://bioconductor.org/packages/2.13/bioc/html/VariantAnnotation.html). The package contains extensive help pages for individual functions and a ‘vignette’ outlining typical work flows; it is made available under the open source ‘Artistic-2.0’ license. Version 1.9.38 was used in this article. Contact: vobencha@fhcrc.org PMID:24681907

  4. Annotation and visualization of endogenous retroviral sequences using the Distributed Annotation System (DAS) and eBioX

    PubMed Central

    Martínez Barrio, Álvaro; Lagercrantz, Erik; Sperber, Göran O; Blomberg, Jonas; Bongcam-Rudloff, Erik

    2009-01-01

    Background The Distributed Annotation System (DAS) is a widely used network protocol for sharing biological information. The distributed aspects of the protocol enable the use of various reference and annotation servers for connecting biological sequence data to pertinent annotations in order to depict an integrated view of the data for the final user. Results An annotation server has been devised to provide information about the endogenous retroviruses detected and annotated by a specialized in silico tool called RetroTector. We describe the procedure to implement the DAS 1.5 protocol commands necessary for constructing the DAS annotation server. We use our server to exemplify those steps. Data distribution is kept separated from visualization which is carried out by eBioX, an easy to use open source program incorporating multiple bioinformatics utilities. Some well characterized endogenous retroviruses are shown in two different DAS clients. A rapid analysis of areas free from retroviral insertions could be facilitated by our annotations. Conclusion The DAS protocol has shown to be advantageous in the distribution of endogenous retrovirus data. The distributed nature of the protocol is also found to aid in combining annotation and visualization along a genome in order to enhance the understanding of ERV contribution to its evolution. Reference and annotation servers are conjointly used by eBioX to provide visualization of ERV annotations as well as other data sources. Our DAS data source can be found in the central public DAS service repository, , or at . PMID:19534743

  5. Annotation extension through protein family annotation coherence metrics

    PubMed Central

    Bastos, Hugo P.; Clarke, Luka A.; Couto, Francisco M.

    2013-01-01

    Protein functional annotation consists in associating proteins with textual descriptors elucidating their biological roles. The bulk of annotation is done via automated procedures that ultimately rely on annotation transfer. Despite a large number of existing protein annotation procedures the ever growing protein space is never completely annotated. One of the facets of annotation incompleteness derives from annotation uncertainty. Often when protein function cannot be predicted with enough specificity it is instead conservatively annotated with more generic terms. In a scenario of protein families or functionally related (or even dissimilar) sets this leads to a more difficult task of using annotations to compare the extent of functional relatedness among all family or set members. However, we postulate that identifying sub-sets of functionally coherent proteins annotated at a very specific level, can help the annotation extension of other incompletely annotated proteins within the same family or functionally related set. As an example we analyse the status of annotation of a set of CAZy families belonging to the Polysaccharide Lyase class. We show that through the use of visualization methods and semantic similarity based metrics it is possible to identify families and respective annotation terms within them that are suitable for possible annotation extension. Based on our analysis we then propose a semi-automatic methodology leading to the extension of single annotation terms within these partially annotated protein sets or families. PMID:24130572

  6. Structural and functional annotation of the porcine immunome

    PubMed Central

    2013-01-01

    Background The domestic pig is known as an excellent model for human immunology and the two species share many pathogens. Susceptibility to infectious disease is one of the major constraints on swine performance, yet the structure and function of genes comprising the pig immunome are not well-characterized. The completion of the pig genome provides the opportunity to annotate the pig immunome, and compare and contrast pig and human immune systems. Results The Immune Response Annotation Group (IRAG) used computational curation and manual annotation of the swine genome assembly 10.2 (Sscrofa10.2) to refine the currently available automated annotation of 1,369 immunity-related genes through sequence-based comparison to genes in other species. Within these genes, we annotated 3,472 transcripts. Annotation provided evidence for gene expansions in several immune response families, and identified artiodactyl-specific expansions in the cathelicidin and type 1 Interferon families. We found gene duplications for 18 genes, including 13 immune response genes and five non-immune response genes discovered in the annotation process. Manual annotation provided evidence for many new alternative splice variants and 8 gene duplications. Over 1,100 transcripts without porcine sequence evidence were detected using cross-species annotation. We used a functional approach to discover and accurately annotate porcine immune response genes. A co-expression clustering analysis of transcriptomic data from selected experimental infections or immune stimulations of blood, macrophages or lymph nodes identified a large cluster of genes that exhibited a correlated positive response upon infection across multiple pathogens or immune stimuli. Interestingly, this gene cluster (cluster 4) is enriched for known general human immune response genes, yet contains many un-annotated porcine genes. A phylogenetic analysis of the encoded proteins of cluster 4 genes showed that 15% exhibited an accelerated

  7. Ontology for Genome Comparison and Genomic Rearrangements

    PubMed Central

    Flanagan, Keith; Stevens, Robert; Pocock, Matthew; Lee, Pete

    2004-01-01

    We present an ontology for describing genomes, genome comparisons, their evolution and biological function. This ontology will support the development of novel genome comparison algorithms and aid the community in discussing genomic evolution. It provides a framework for communication about comparative genomics, and a basis upon which further automated analysis can be built. The nomenclature defined by the ontology will foster clearer communication between biologists, and also standardize terms used by data publishers in the results of analysis programs. The overriding aim of this ontology is the facilitation of consistent annotation of genomes through computational methods, rather than human annotators. To this end, the ontology includes definitions that support computer analysis and automated transfer of annotations between genomes, rather than relying upon human mediation. PMID:18629137

  8. Functional annotation of mammalian genomic DNA sequence by chemical mutagenesis: a fine-structure genetic mutation map of a 1- to 2-cM segment of mouse chromosome 7 corresponding to human chromosome 11p14-p15.

    PubMed

    Rinchik, Eugene M; Carpenter, Donald A; Johnson, Dabney K

    2002-01-22

    Eleven independent, recessive, N-ethyl-N-nitrosourea-induced mutations that map to a approximately 1- to 2-cM region of mouse chromosome (Chr) 7 homologous to human Chr 11p14-p15 were recovered from a screen of 1,218 gametes. These mutations were initially identified in a hemizygous state opposite a large p-locus deletion and subsequently were mapped to finer genomic intervals by crosses to a panel of smaller p deletions. The 11 mutations also were classified into seven complementation groups by pairwise crosses. Four complementation groups were defined by seven prenatally lethal mutations, including a group (l7R3) comprised of two alleles of obvious differing severity. Two allelic mutations (at the psrt locus) result in a severe seizure and runting syndrome, but one mutation (at the fit2 locus) results in a more benign runting phenotype. This experiment has added seven loci, defined by phenotypes of presumed point mutations, to the genetic map of a small (1-2 cM) region of mouse Chr 7 and will facilitate the task of functional annotation of DNA sequence and transcription maps both in the mouse and the corresponding human 11p14-p15 homology region.

  9. ParsEval: parallel comparison and analysis of gene structure annotations

    PubMed Central

    2012-01-01

    Background Accurate gene structure annotation is a fundamental but somewhat elusive goal of genome projects, as witnessed by the fact that (model) genomes typically undergo several cycles of re-annotation. In many cases, it is not only different versions of annotations that need to be compared but also different sources of annotation of the same genome, derived from distinct gene prediction workflows. Such comparisons are of interest to annotation providers, prediction software developers, and end-users, who all need to assess what is common and what is different among distinct annotation sources. We developed ParsEval, a software application for pairwise comparison of sets of gene structure annotations. ParsEval calculates several statistics that highlight the similarities and differences between the two sets of annotations provided. These statistics are presented in an aggregate summary report, with additional details provided as individual reports specific to non-overlapping, gene-model-centric genomic loci. Genome browser styled graphics embedded in these reports help visualize the genomic context of the annotations. Output from ParsEval is both easily read and parsed, enabling systematic identification of problematic gene models for subsequent focused analysis. Results ParsEval is capable of analyzing annotations for large eukaryotic genomes on typical desktop or laptop hardware. In comparison to existing methods, ParsEval exhibits a considerable performance improvement, both in terms of runtime and memory consumption. Reports from ParsEval can provide relevant biological insights into the gene structure annotations being compared. Conclusions Implemented in C, ParsEval provides the quickest and most feature-rich solution for genome annotation comparison to date. The source code is freely available (under an ISC license) at http://parseval.sourceforge.net/. PMID:22852583

  10. Metabolomics as a Hypothesis-Generating Functional Genomics Tool for the Annotation of Arabidopsis thaliana Genes of “Unknown Function”

    PubMed Central

    Quanbeck, Stephanie M.; Brachova, Libuse; Campbell, Alexis A.; Guan, Xin; Perera, Ann; He, Kun; Rhee, Seung Y.; Bais, Preeti; Dickerson, Julie A.; Dixon, Philip; Wohlgemuth, Gert; Fiehn, Oliver; Barkan, Lenore; Lange, Iris; Lange, B. Markus; Lee, Insuk; Cortes, Diego; Salazar, Carolina; Shuman, Joel; Shulaev, Vladimir; Huhman, David V.; Sumner, Lloyd W.; Roth, Mary R.; Welti, Ruth; Ilarslan, Hilal; Wurtele, Eve S.; Nikolau, Basil J.

    2012-01-01

    Metabolomics is the methodology that identifies and measures global pools of small molecules (of less than about 1,000 Da) of a biological sample, which are collectively called the metabolome. Metabolomics can therefore reveal the metabolic outcome of a genetic or environmental perturbation of a metabolic regulatory network, and thus provide insights into the structure and regulation of that network. Because of the chemical complexity of the metabolome and limitations associated with individual analytical platforms for determining the metabolome, it is currently difficult to capture the complete metabolome of an organism or tissue, which is in contrast to genomics and transcriptomics. This paper describes the analysis of Arabidopsis metabolomics data sets acquired by a consortium that includes five analytical laboratories, bioinformaticists, and biostatisticians, which aims to develop and validate metabolomics as a hypothesis-generating functional genomics tool. The consortium is determining the metabolomes of Arabidopsis T-DNA mutant stocks, grown in standardized controlled environment optimized to minimize environmental impacts on the metabolomes. Metabolomics data were generated with seven analytical platforms, and the combined data is being provided to the research community to formulate initial hypotheses about genes of unknown function (GUFs). A public database (www.PlantMetabolomics.org) has been developed to provide the scientific community with access to the data along with tools to allow for its interactive analysis. Exemplary datasets are discussed to validate the approach, which illustrate how initial hypotheses can be generated from the consortium-produced metabolomics data, integrated with prior knowledge to provide a testable hypothesis concerning the functionality of GUFs. PMID:22645570

  11. Genome-Wide Computational Analysis of Musa Microsatellites: Classification, Cross-Taxon Transferability, Functional Annotation, Association with Transposons & miRNAs, and Genetic Marker Potential.

    PubMed

    Biswas, Manosh Kumar; Liu, Yuxuan; Li, Chunyu; Sheng, Ou; Mayer, Christoph; Yi, Ganjun

    2015-01-01

    The development of organized, informative, robust, user-friendly, and freely accessible molecular markers is imperative to the Musa marker assisted breeding program. Although several hundred SSR markers have already been developed, the number of informative, robust, and freely accessible Musa markers remains inadequate for some breeding applications. In view of this issue, we surveyed SSRs in four different data sets, developed large-scale non-redundant highly informative therapeutic SSR markers, and classified them according to their attributes, as well as analyzed their cross-taxon transferability and utility for the genetic study of Musa and its relatives. A high SSR frequency (177 per Mbp) was found in the Musa genome. AT-rich dinucleotide repeats are predominant, and trinucleotide repeats are the most abundant in transcribed regions. A significant number of Musa SSRs are associated with pre-miRNAs, and 83% of these SSRs are promising candidates for the development of therapeutic SSR markers. Overall, 74% of the SSR markers were polymorphic, and 94% were transferable to at least one Musa spp. Two hundred forty-three markers generated a total of 1047 alleles, with 2-8 alleles each and an average of 4.38 alleles per locus. The PIC values ranged from 0.31 to 0.89 and averaged 0.71. We report the largest set of non-redundant, polymorphic, new SSR markers to be developed in Musa. These additional markers could be a valuable resource for marker-assisted breeding, genetic diversity and genomic studies of Musa and related species.

  12. Genome-Wide Computational Analysis of Musa Microsatellites: Classification, Cross-Taxon Transferability, Functional Annotation, Association with Transposons & miRNAs, and Genetic Marker Potential

    PubMed Central

    Biswas, Manosh Kumar; Liu, Yuxuan; Li, Chunyu; Sheng, Ou; Mayer, Christoph; Yi, Ganjun

    2015-01-01

    The development of organized, informative, robust, user-friendly, and freely accessible molecular markers is imperative to the Musa marker assisted breeding program. Although several hundred SSR markers have already been developed, the number of informative, robust, and freely accessible Musa markers remains inadequate for some breeding applications. In view of this issue, we surveyed SSRs in four different data sets, developed large-scale non-redundant highly informative therapeutic SSR markers, and classified them according to their attributes, as well as analyzed their cross-taxon transferability and utility for the genetic study of Musa and its relatives. A high SSR frequency (177 per Mbp) was found in the Musa genome. AT-rich dinucleotide repeats are predominant, and trinucleotide repeats are the most abundant in transcribed regions. A significant number of Musa SSRs are associated with pre-miRNAs, and 83% of these SSRs are promising candidates for the development of therapeutic SSR markers. Overall, 74% of the SSR markers were polymorphic, and 94% were transferable to at least one Musa spp. Two hundred forty-three markers generated a total of 1047 alleles, with 2-8 alleles each and an average of 4.38 alleles per locus. The PIC values ranged from 0.31 to 0.89 and averaged 0.71. We report the largest set of non-redundant, polymorphic, new SSR markers to be developed in Musa. These additional markers could be a valuable resource for marker-assisted breeding, genetic diversity and genomic studies of Musa and related species. PMID:26121637

  13. An annotated energy bibliography

    NASA Technical Reports Server (NTRS)

    Blow, S. J.

    1979-01-01

    Comprehensive annotated compilation of books, journals, periodicals, and reports on energy and energy related topics, contains approximately 10,0000 tehcnical and nontechnical references from bibliographic and other sources dated January 1975 through May 1977.

  14. IMG/M: integrated genome and metagenome comparative data analysis system.

    PubMed

    Chen, I-Min A; Markowitz, Victor M; Chu, Ken; Palaniappan, Krishna; Szeto, Ernest; Pillay, Manoj; Ratner, Anna; Huang, Jinghua; Andersen, Evan; Huntemann, Marcel; Varghese, Neha; Hadjithomas, Michalis; Tennessen, Kristin; Nielsen, Torben; Ivanova, Natalia N; Kyrpides, Nikos C

    2017-01-04

    The Integrated Microbial Genomes with Microbiome Samples (IMG/M: https://img.jgi.doe.gov/m/) system contains annotated DNA and RNA sequence data of (i) archaeal, bacterial, eukaryotic and viral genomes from cultured organisms, (ii) single cell genomes (SCG) and genomes from metagenomes (GFM) from uncultured archaea, bacteria and viruses and (iii) metagenomes from environmental, host associated and engineered microbiome samples. Sequence data are generated by DOE's Joint Genome Institute (JGI), submitted by individual scientists, or collected from public sequence data archives. Structural and functional annotation is carried out by JGI's genome and metagenome annotation pipelines. A variety of analytical and visualization tools provide support for examining and comparing IMG/M's datasets. IMG/M allows open access interactive analysis of publicly available datasets, while manual curation, submission and access to private datasets and computationally intensive workspace-based analysis require login/password access to its expert review (ER) companion system (IMG/M ER: https://img.jgi.doe.gov/mer/). Since the last report published in the 2014 NAR Database Issue, IMG/M's dataset content has tripled in terms of number of datasets and overall protein coding genes, while its analysis tools have been extended to cope with the rapid growth in the number and size of datasets handled by the system.

  15. IMG/M: integrated genome and metagenome comparative data analysis system

    PubMed Central

    Chen, I-Min A.; Markowitz, Victor M.; Chu, Ken; Palaniappan, Krishna; Szeto, Ernest; Pillay, Manoj; Ratner, Anna; Huang, Jinghua; Andersen, Evan; Huntemann, Marcel; Varghese, Neha; Hadjithomas, Michalis; Tennessen, Kristin; Nielsen, Torben; Ivanova, Natalia N.; Kyrpides, Nikos C.

    2017-01-01

    The Integrated Microbial Genomes with Microbiome Samples (IMG/M: https://img.jgi.doe.gov/m/) system contains annotated DNA and RNA sequence data of (i) archaeal, bacterial, eukaryotic and viral genomes from cultured organisms, (ii) single cell genomes (SCG) and genomes from metagenomes (GFM) from uncultured archaea, bacteria and viruses and (iii) metagenomes from environmental, host associated and engineered microbiome samples. Sequence data are generated by DOE's Joint Genome Institute (JGI), submitted by individual scientists, or collected from public sequence data archives. Structural and functional annotation is carried out by JGI's genome and metagenome annotation pipelines. A variety of analytical and visualization tools provide support for examining and comparing IMG/M's datasets. IMG/M allows open access interactive analysis of publicly available datasets, while manual curation, submission and access to private datasets and computationally intensive workspace-based analysis require login/password access to its expert review (ER) companion system (IMG/M ER: https://img.jgi.doe.gov/mer/). Since the last report published in the 2014 NAR Database Issue, IMG/M's dataset content has tripled in terms of number of datasets and overall protein coding genes, while its analysis tools have been extended to cope with the rapid growth in the number and size of datasets handled by the system. PMID:27738135

  16. Annotation of Tutorial Dialogue Goals for Natural Language Generation

    ERIC Educational Resources Information Center

    Kim, Jung Hee; Freedman, Reva; Glass, Michael; Evens, Martha W.

    2006-01-01

    We annotated transcripts of human tutoring dialogue for the purpose of constructing a dialogue-based intelligent tutoring system, CIRCSIM-Tutor. The tutors were professors of physiology who were also expert tutors. The students were 1st year medical students who communicated with the tutors using typed communication from separate rooms. The tutors…

  17. Charter Schools & Community Relations: An Annotated Video Transcript.

    ERIC Educational Resources Information Center

    Blaha, Karen Lytle, Ed.; Vincent, Susan, Ed.; Schwendiman, Jed, Ed.

    This annotated video transcript follows the discussion of six experts in community relations and charter schools. The panel was convened as part of an initiative to identify the specific needs of charter-school founders and to develop a leadership-training program designed to address those needs. Each member of the panel shared his or her insights…

  18. Genome-wide annotation and characterization of CLAVATA/ESR (CLE) peptide hormones of soybean (Glycine max) and common bean (Phaseolus vulgaris), and their orthologues of Arabidopsis thaliana

    PubMed Central

    Hastwell, April H.; Gresshoff, Peter M.; Ferguson, Brett J.

    2015-01-01

    CLE peptides are key regulators of cell proliferation and differentiation in plant shoots, roots, vasculature, and legume nodules. They are C-terminally encoded peptides that are post-translationally cleaved and modified from their corresponding pre-propeptides to produce a final ligand that is 12–13 amino acids in length. In this study, an array of bionformatic and comparative genomic approaches was used to identify and characterize the complete family of CLE peptide-encoding genes in two of the world’s most important crop species, soybean and common bean. In total, there are 84 CLE peptide-encoding genes in soybean (considerably more than the 32 present in Arabidopsis), including three pseudogenes and two multi-CLE domain genes having six putative CLE domains each. In addition, 44 CLE peptide-encoding genes were identified in common bean. In silico characterization was used to establish all soybean homeologous pairs, and to identify corresponding gene orthologues present in common bean and Arabidopsis. The soybean CLE pre-propeptide family was further analysed and separated into seven distinct groups based on structure, with groupings strongly associated with the CLE domain sequence and function. These groups provide evolutionary insight into the CLE peptide families of soybean, common bean, and Arabidopsis, and represent a novel tool that can aid in the functional characterization of the peptides. Transcriptional evidence was also used to provide further insight into the location and function of all CLE peptide-encoding members currently available in gene atlases for the three species. Taken together, this in-depth analysis helped to identify and categorize the complete CLE peptide families of soybean and common bean, established gene orthologues within the two legume species, and Arabidopsis, and provided a pla