Sample records for previously annotated genes

  1. Extracting Cross-Ontology Weighted Association Rules from Gene Ontology Annotations.

    PubMed

    Agapito, Giuseppe; Milano, Marianna; Guzzi, Pietro Hiram; Cannataro, Mario

    2016-01-01

    Gene Ontology (GO) is a structured repository of concepts (GO Terms) that are associated to one or more gene products through a process referred to as annotation. The analysis of annotated data is an important opportunity for bioinformatics. There are different approaches of analysis, among those, the use of association rules (AR) which provides useful knowledge, discovering biologically relevant associations between terms of GO, not previously known. In a previous work, we introduced GO-WAR (Gene Ontology-based Weighted Association Rules), a methodology for extracting weighted association rules from ontology-based annotated datasets. We here adapt the GO-WAR algorithm to mine cross-ontology association rules, i.e., rules that involve GO terms present in the three sub-ontologies of GO. We conduct a deep performance evaluation of GO-WAR by mining publicly available GO annotated datasets, showing how GO-WAR outperforms current state of the art approaches.

  2. Large-scale inference of gene function through phylogenetic annotation of Gene Ontology terms: case study of the apoptosis and autophagy cellular processes.

    PubMed

    Feuermann, Marc; Gaudet, Pascale; Mi, Huaiyu; Lewis, Suzanna E; Thomas, Paul D

    2016-01-01

    We previously reported a paradigm for large-scale phylogenomic analysis of gene families that takes advantage of the large corpus of experimentally supported Gene Ontology (GO) annotations. This 'GO Phylogenetic Annotation' approach integrates GO annotations from evolutionarily related genes across ∼100 different organisms in the context of a gene family tree, in which curators build an explicit model of the evolution of gene functions. GO Phylogenetic Annotation models the gain and loss of functions in a gene family tree, which is used to infer the functions of uncharacterized (or incompletely characterized) gene products, even for human proteins that are relatively well studied. Here, we report our results from applying this paradigm to two well-characterized cellular processes, apoptosis and autophagy. This revealed several important observations with respect to GO annotations and how they can be used for function inference. Notably, we applied only a small fraction of the experimentally supported GO annotations to infer function in other family members. The majority of other annotations describe indirect effects, phenotypes or results from high throughput experiments. In addition, we show here how feedback from phylogenetic annotation leads to significant improvements in the PANTHER trees, the GO annotations and GO itself. Thus GO phylogenetic annotation both increases the quantity and improves the accuracy of the GO annotations provided to the research community. We expect these phylogenetically based annotations to be of broad use in gene enrichment analysis as well as other applications of GO annotations.Database URL: http://amigo.geneontology.org/amigo. © The Author(s) 2016. Published by Oxford University Press.

  3. dbWFA: a web-based database for functional annotation of Triticum aestivum transcripts

    PubMed Central

    Vincent, Jonathan; Dai, Zhanwu; Ravel, Catherine; Choulet, Frédéric; Mouzeyar, Said; Bouzidi, M. Fouad; Agier, Marie; Martre, Pierre

    2013-01-01

    The functional annotation of genes based on sequence homology with genes from model species genomes is time-consuming because it is necessary to mine several unrelated databases. The aim of the present work was to develop a functional annotation database for common wheat Triticum aestivum (L.). The database, named dbWFA, is based on the reference NCBI UniGene set, an expressed gene catalogue built by expressed sequence tag clustering, and on full-length coding sequences retrieved from the TriFLDB database. Information from good-quality heterogeneous sources, including annotations for model plant species Arabidopsis thaliana (L.) Heynh. and Oryza sativa L., was gathered and linked to T. aestivum sequences through BLAST-based homology searches. Even though the complexity of the transcriptome cannot yet be fully appreciated, we developed a tool to easily and promptly obtain information from multiple functional annotation systems (Gene Ontology, MapMan bin codes, MIPS Functional Categories, PlantCyc pathway reactions and TAIR gene families). The use of dbWFA is illustrated here with several query examples. We were able to assign a putative function to 45% of the UniGenes and 81% of the full-length coding sequences from TriFLDB. Moreover, comparison of the annotation of the whole T. aestivum UniGene set along with curated annotations of the two model species assessed the accuracy of the annotation provided by dbWFA. To further illustrate the use of dbWFA, genes specifically expressed during the early cell division or late storage polymer accumulation phases of T. aestivum grain development were identified using a clustering analysis and then annotated using dbWFA. The annotation of these two sets of genes was consistent with previous analyses of T. aestivum grain transcriptomes and proteomes. Database URL: urgi.versailles.inra.fr/dbWFA/ PMID:23660284

  4. Transcriptome assembly, gene annotation and tissue gene expression atlas of the rainbow trout

    USDA-ARS?s Scientific Manuscript database

    Efforts to obtain a comprehensive genome sequence for rainbow trout are ongoing and will be complimented by transcriptome information that will enhance genome assembly and annotation. Previously, we reported a transcriptome reference sequence using a 19X coverage of Sanger and 454-pyrosequencing dat...

  5. A draft annotation and overview of the human genome

    PubMed Central

    Wright, Fred A; Lemon, William J; Zhao, Wei D; Sears, Russell; Zhuo, Degen; Wang, Jian-Ping; Yang, Hee-Yung; Baer, Troy; Stredney, Don; Spitzner, Joe; Stutz, Al; Krahe, Ralf; Yuan, Bo

    2001-01-01

    Background The recent draft assembly of the human genome provides a unified basis for describing genomic structure and function. The draft is sufficiently accurate to provide useful annotation, enabling direct observations of previously inferred biological phenomena. Results We report here a functionally annotated human gene index placed directly on the genome. The index is based on the integration of public transcript, protein, and mapping information, supplemented with computational prediction. We describe numerous global features of the genome and examine the relationship of various genetic maps with the assembly. In addition, initial sequence analysis reveals highly ordered chromosomal landscapes associated with paralogous gene clusters and distinct functional compartments. Finally, these annotation data were synthesized to produce observations of gene density and number that accord well with historical estimates. Such a global approach had previously been described only for chromosomes 21 and 22, which together account for 2.2% of the genome. Conclusions We estimate that the genome contains 65,000-75,000 transcriptional units, with exon sequences comprising 4%. The creation of a comprehensive gene index requires the synthesis of all available computational and experimental evidence. PMID:11516338

  6. Transcriptome Assembly, Gene Annotation and Tissue Gene Expression Atlas of the Rainbow Trout

    PubMed Central

    Salem, Mohamed; Paneru, Bam; Al-Tobasei, Rafet; Abdouni, Fatima; Thorgaard, Gary H.; Rexroad, Caird E.; Yao, Jianbo

    2015-01-01

    Efforts to obtain a comprehensive genome sequence for rainbow trout are ongoing and will be complemented by transcriptome information that will enhance genome assembly and annotation. Previously, transcriptome reference sequences were reported using data from different sources. Although the previous work added a great wealth of sequences, a complete and well-annotated transcriptome is still needed. In addition, gene expression in different tissues was not completely addressed in the previous studies. In this study, non-normalized cDNA libraries were sequenced from 13 different tissues of a single doubled haploid rainbow trout from the same source used for the rainbow trout genome sequence. A total of ~1.167 billion paired-end reads were de novo assembled using the Trinity RNA-Seq assembler yielding 474,524 contigs > 500 base-pairs. Of them, 287,593 had homologies to the NCBI non-redundant protein database. The longest contig of each cluster was selected as a reference, yielding 44,990 representative contigs. A total of 4,146 contigs (9.2%), including 710 full-length sequences, did not match any mRNA sequences in the current rainbow trout genome reference. Mapping reads to the reference genome identified an additional 11,843 transcripts not annotated in the genome. A digital gene expression atlas revealed 7,678 housekeeping and 4,021 tissue-specific genes. Expression of about 16,000–32,000 genes (35–71% of the identified genes) accounted for basic and specialized functions of each tissue. White muscle and stomach had the least complex transcriptomes, with high percentages of their total mRNA contributed by a small number of genes. Brain, testis and intestine, in contrast, had complex transcriptomes, with a large numbers of genes involved in their expression patterns. This study provides comprehensive de novo transcriptome information that is suitable for functional and comparative genomics studies in rainbow trout, including annotation of the genome. PMID:25793877

  7. Genome-wide Annotation, Identification, and Global Transcriptomic Analysis of Regulatory or Small RNA Gene Expression in Staphylococcus aureus.

    PubMed

    Carroll, Ronan K; Weiss, Andy; Broach, William H; Wiemels, Richard E; Mogen, Austin B; Rice, Kelly C; Shaw, Lindsey N

    2016-02-09

    In Staphylococcus aureus, hundreds of small regulatory or small RNAs (sRNAs) have been identified, yet this class of molecule remains poorly understood and severely understudied. sRNA genes are typically absent from genome annotation files, and as a consequence, their existence is often overlooked, particularly in global transcriptomic studies. To facilitate improved detection and analysis of sRNAs in S. aureus, we generated updated GenBank files for three commonly used S. aureus strains (MRSA252, NCTC 8325, and USA300), in which we added annotations for >260 previously identified sRNAs. These files, the first to include genome-wide annotation of sRNAs in S. aureus, were then used as a foundation to identify novel sRNAs in the community-associated methicillin-resistant strain USA300. This analysis led to the discovery of 39 previously unidentified sRNAs. Investigating the genomic loci of the newly identified sRNAs revealed a surprising degree of inconsistency in genome annotation in S. aureus, which may be hindering the analysis and functional exploration of these elements. Finally, using our newly created annotation files as a reference, we perform a global analysis of sRNA gene expression in S. aureus and demonstrate that the newly identified tsr25 is the most highly upregulated sRNA in human serum. This study provides an invaluable resource to the S. aureus research community in the form of our newly generated annotation files, while at the same time presenting the first examination of differential sRNA expression in pathophysiologically relevant conditions. Despite a large number of studies identifying regulatory or small RNA (sRNA) genes in Staphylococcus aureus, their annotation is notably lacking in available genome files. In addition to this, there has been a considerable lack of cross-referencing in the wealth of studies identifying these elements, often leading to the same sRNA being identified multiple times and bearing multiple names. In this work, we have consolidated and curated known sRNA genes from the literature and mapped them to their position on the S. aureus genome, creating new genome annotation files. These files can now be used by the scientific community at large in experiments to search for previously undiscovered sRNA genes and to monitor sRNA gene expression by transcriptome sequencing (RNA-seq). We demonstrate this application, identifying 39 new sRNAs and studying their expression during S. aureus growth in human serum. Copyright © 2016 Carroll et al.

  8. Similarity-based gene detection: using COGs to find evolutionarily-conserved ORFs.

    PubMed

    Powell, Bradford C; Hutchison, Clyde A

    2006-01-19

    Experimental verification of gene products has not kept pace with the rapid growth of microbial sequence information. However, existing annotations of gene locations contain sufficient information to screen for probable errors. Furthermore, comparisons among genomes become more informative as more genomes are examined. We studied all open reading frames (ORFs) of at least 30 codons from the genomes of 27 sequenced bacterial strains. We grouped the potential peptide sequences encoded from the ORFs by forming Clusters of Orthologous Groups (COGs). We used this grouping in order to find homologous relationships that would not be distinguishable from noise when using simple BLAST searches. Although COG analysis was initially developed to group annotated genes, we applied it to the task of grouping anonymous DNA sequences that may encode proteins. "Mixed COGs" of ORFs (clusters in which some sequences correspond to annotated genes and some do not) are attractive targets when seeking errors of gene prediction. Examination of mixed COGs reveals some situations in which genes appear to have been missed in current annotations and a smaller number of regions that appear to have been annotated as gene loci erroneously. This technique can also be used to detect potential pseudogenes or sequencing errors. Our method uses an adjustable parameter for degree of conservation among the studied genomes (stringency). We detail results for one level of stringency at which we found 83 potential genes which had not previously been identified, 60 potential pseudogenes, and 7 sequences with existing gene annotations that are probably incorrect. Systematic study of sequence conservation offers a way to improve existing annotations by identifying potentially homologous regions where the annotation of the presence or absence of a gene is inconsistent among genomes.

  9. Similarity-based gene detection: using COGs to find evolutionarily-conserved ORFs

    PubMed Central

    Powell, Bradford C; Hutchison, Clyde A

    2006-01-01

    Background Experimental verification of gene products has not kept pace with the rapid growth of microbial sequence information. However, existing annotations of gene locations contain sufficient information to screen for probable errors. Furthermore, comparisons among genomes become more informative as more genomes are examined. We studied all open reading frames (ORFs) of at least 30 codons from the genomes of 27 sequenced bacterial strains. We grouped the potential peptide sequences encoded from the ORFs by forming Clusters of Orthologous Groups (COGs). We used this grouping in order to find homologous relationships that would not be distinguishable from noise when using simple BLAST searches. Although COG analysis was initially developed to group annotated genes, we applied it to the task of grouping anonymous DNA sequences that may encode proteins. Results "Mixed COGs" of ORFs (clusters in which some sequences correspond to annotated genes and some do not) are attractive targets when seeking errors of gene predicion. Examination of mixed COGs reveals some situations in which genes appear to have been missed in current annotations and a smaller number of regions that appear to have been annotated as gene loci erroneously. This technique can also be used to detect potential pseudogenes or sequencing errors. Our method uses an adjustable parameter for degree of conservation among the studied genomes (stringency). We detail results for one level of stringency at which we found 83 potential genes which had not previously been identified, 60 potential pseudogenes, and 7 sequences with existing gene annotations that are probably incorrect. Conclusion Systematic study of sequence conservation offers a way to improve existing annotations by identifying potentially homologous regions where the annotation of the presence or absence of a gene is inconsistent among genomes. PMID:16423288

  10. A Transcriptome Map of Actinobacillus pleuropneumoniae at Single-Nucleotide Resolution Using Deep RNA-Seq

    PubMed Central

    Su, Zhipeng; Zhu, Jiawen; Xu, Zhuofei; Xiao, Ran; Zhou, Rui; Li, Lu; Chen, Huanchun

    2016-01-01

    Actinobacillus pleuropneumoniae is the pathogen of porcine contagious pleuropneumoniae, a highly contagious respiratory disease of swine. Although the genome of A. pleuropneumoniae was sequenced several years ago, limited information is available on the genome-wide transcriptional analysis to accurately annotate the gene structures and regulatory elements. High-throughput RNA sequencing (RNA-seq) has been applied to study the transcriptional landscape of bacteria, which can efficiently and accurately identify gene expression regions and unknown transcriptional units, especially small non-coding RNAs (sRNAs), UTRs and regulatory regions. The aim of this study is to comprehensively analyze the transcriptome of A. pleuropneumoniae by RNA-seq in order to improve the existing genome annotation and promote our understanding of A. pleuropneumoniae gene structures and RNA-based regulation. In this study, we utilized RNA-seq to construct a single nucleotide resolution transcriptome map of A. pleuropneumoniae. More than 3.8 million high-quality reads (average length ~90 bp) from a cDNA library were generated and aligned to the reference genome. We identified 32 open reading frames encoding novel proteins that were mis-annotated in the previous genome annotations. The start sites for 35 genes based on the current genome annotation were corrected. Furthermore, 51 sRNAs in the A. pleuropneumoniae genome were discovered, of which 40 sRNAs were never reported in previous studies. The transcriptome map also enabled visualization of 5'- and 3'-UTR regions, in which contained 11 sRNAs. In addition, 351 operons covering 1230 genes throughout the whole genome were identified. The RNA-Seq based transcriptome map validated annotated genes and corrected annotations of open reading frames in the genome, and led to the identification of many functional elements (e.g. regions encoding novel proteins, non-coding sRNAs and operon structures). The transcriptional units described in this study provide a foundation for future studies concerning the gene functions and the transcriptional regulatory architectures of this pathogen. PMID:27018591

  11. sigReannot: an oligo-set re-annotation pipeline based on similarities with the Ensembl transcripts and Unigene clusters.

    PubMed

    Casel, Pierrot; Moreews, François; Lagarrigue, Sandrine; Klopp, Christophe

    2009-07-16

    Microarray is a powerful technology enabling to monitor tens of thousands of genes in a single experiment. Most microarrays are now using oligo-sets. The design of the oligo-nucleotides is time consuming and error prone. Genome wide microarray oligo-sets are designed using as large a set of transcripts as possible in order to monitor as many genes as possible. Depending on the genome sequencing state and on the assembly state the knowledge of the existing transcripts can be very different. This knowledge evolves with the different genome builds and gene builds. Once the design is done the microarrays are often used for several years. The biologists working in EADGENE expressed the need of up-to-dated annotation files for the oligo-sets they share including information about the orthologous genes of model species, the Gene Ontology, the corresponding pathways and the chromosomal location. The results of SigReannot on a chicken micro-array used in the EADGENE project compared to the initial annotations show that 23% of the oligo-nucleotide gene annotations were not confirmed, 2% were modified and 1% were added. The interest of this up-to-date annotation procedure is demonstrated through the analysis of real data previously published. SigReannot uses the oligo-nucleotide design procedure criteria to validate the probe-gene link and the Ensembl transcripts as reference for annotation. It therefore produces a high quality annotation based on reference gene sets.

  12. Gene discovery in the hamster: a comparative genomics approach for gene annotation by sequencing of hamster testis cDNAs

    PubMed Central

    Oduru, Sreedhar; Campbell, Janee L; Karri, SriTulasi; Hendry, William J; Khan, Shafiq A; Williams, Simon C

    2003-01-01

    Background Complete genome annotation will likely be achieved through a combination of computer-based analysis of available genome sequences combined with direct experimental characterization of expressed regions of individual genomes. We have utilized a comparative genomics approach involving the sequencing of randomly selected hamster testis cDNAs to begin to identify genes not previously annotated on the human, mouse, rat and Fugu (pufferfish) genomes. Results 735 distinct sequences were analyzed for their relatedness to known sequences in public databases. Eight of these sequences were derived from previously unidentified genes and expression of these genes in testis was confirmed by Northern blotting. The genomic locations of each sequence were mapped in human, mouse, rat and pufferfish, where applicable, and the structure of their cognate genes was derived using computer-based predictions, genomic comparisons and analysis of uncharacterized cDNA sequences from human and macaque. Conclusion The use of a comparative genomics approach resulted in the identification of eight cDNAs that correspond to previously uncharacterized genes in the human genome. The proteins encoded by these genes included a new member of the kinesin superfamily, a SET/MYND-domain protein, and six proteins for which no specific function could be predicted. Each gene was expressed primarily in testis, suggesting that they may play roles in the development and/or function of testicular cells. PMID:12783626

  13. Comparative analysis of grapevine whole-genome gene predictions, functional annotation, categorization and integration of the predicted gene sequences

    PubMed Central

    2012-01-01

    Background The first draft assembly and gene prediction of the grapevine genome (8X base coverage) was made available to the scientific community in 2007, and functional annotation was developed on this gene prediction. Since then additional Sanger sequences were added to the 8X sequences pool and a new version of the genomic sequence with superior base coverage (12X) was produced. Results In order to more efficiently annotate the function of the genes predicted in the new assembly, it is important to build on as much of the previous work as possible, by transferring 8X annotation of the genome to the 12X version. The 8X and 12X assemblies and gene predictions of the grapevine genome were compared to answer the question, “Can we uniquely map 8X predicted genes to 12X predicted genes?” The results show that while the assemblies and gene structure predictions are too different to make a complete mapping between them, most genes (18,725) showed a one-to-one relationship between 8X predicted genes and the last version of 12X predicted genes. In addition, reshuffled genomic sequence structures appeared. These highlight regions of the genome where the gene predictions need to be taken with caution. Based on the new grapevine gene functional annotation and in-depth functional categorization, twenty eight new molecular networks have been created for VitisNet while the existing networks were updated. Conclusions The outcomes of this study provide a functional annotation of the 12X genes, an update of VitisNet, the system of the grapevine molecular networks, and a new functional categorization of genes. Data are available at the VitisNet website (http://www.sdstate.edu/ps/research/vitis/pathways.cfm). PMID:22554261

  14. Saccharomyces cerevisiae: gene annotation and genome variability, state of the art through comparative genomics.

    PubMed

    Louis, Ed

    2011-01-01

    In the early days of the yeast genome sequencing project, gene annotation was in its infancy and suffered the problem of many false positive annotations as well as missed genes. The lack of other sequences for comparison also prevented the annotation of conserved, functional sequences that were not coding. We are now in an era of comparative genomics where many closely related as well as more distantly related genomes are available for direct sequence and synteny comparisons allowing for more probable predictions of genes and other functional sequences due to conservation. We also have a plethora of functional genomics data which helps inform gene annotation for previously uncharacterised open reading frames (ORFs)/genes. For Saccharomyces cerevisiae this has resulted in a continuous updating of the gene and functional sequence annotations in the reference genome helping it retain its position as the best characterized eukaryotic organism's genome. A single reference genome for a species does not accurately describe the species and this is quite clear in the case of S. cerevisiae where the reference strain is not ideal for brewing or baking due to missing genes. Recent surveys of numerous isolates, from a variety of sources, using a variety of technologies have revealed a great deal of variation amongst isolates with genome sequence surveys providing information on novel genes, undetectable by other means. We now have a better understanding of the extant variation in S. cerevisiae as a species as well as some idea of how much we are missing from this understanding. As with gene annotation, comparative genomics enhances the discovery and description of genome variation and is providing us with the tools for understanding genome evolution, adaptation and selection, and underlying genetics of complex traits.

  15. Estimating gene gain and loss rates in the presence of error in genome assembly and annotation using CAFE 3.

    PubMed

    Han, Mira V; Thomas, Gregg W C; Lugo-Martinez, Jose; Hahn, Matthew W

    2013-08-01

    Current sequencing methods produce large amounts of data, but genome assemblies constructed from these data are often fragmented and incomplete. Incomplete and error-filled assemblies result in many annotation errors, especially in the number of genes present in a genome. This means that methods attempting to estimate rates of gene duplication and loss often will be misled by such errors and that rates of gene family evolution will be consistently overestimated. Here, we present a method that takes these errors into account, allowing one to accurately infer rates of gene gain and loss among genomes even with low assembly and annotation quality. The method is implemented in the newest version of the software package CAFE, along with several other novel features. We demonstrate the accuracy of the method with extensive simulations and reanalyze several previously published data sets. Our results show that errors in genome annotation do lead to higher inferred rates of gene gain and loss but that CAFE 3 sufficiently accounts for these errors to provide accurate estimates of important evolutionary parameters.

  16. OrthoMCL: Identification of Ortholog Groups for Eukaryotic Genomes

    PubMed Central

    Li, Li; Stoeckert, Christian J.; Roos, David S.

    2003-01-01

    The identification of orthologous groups is useful for genome annotation, studies on gene/protein evolution, comparative genomics, and the identification of taxonomically restricted sequences. Methods successfully exploited for prokaryotic genome analysis have proved difficult to apply to eukaryotes, however, as larger genomes may contain multiple paralogous genes, and sequence information is often incomplete. OrthoMCL provides a scalable method for constructing orthologous groups across multiple eukaryotic taxa, using a Markov Cluster algorithm to group (putative) orthologs and paralogs. This method performs similarly to the INPARANOID algorithm when applied to two genomes, but can be extended to cluster orthologs from multiple species. OrthoMCL clusters are coherent with groups identified by EGO, but improved recognition of “recent” paralogs permits overlapping EGO groups representing the same gene to be merged. Comparison with previously assigned EC annotations suggests a high degree of reliability, implying utility for automated eukaryotic genome annotation. OrthoMCL has been applied to the proteome data set from seven publicly available genomes (human, fly, worm, yeast, Arabidopsis, the malaria parasite Plasmodium falciparum, and Escherichia coli). A Web interface allows queries based on individual genes or user-defined phylogenetic patterns (http://www.cbil.upenn.edu/gene-family). Analysis of clusters incorporating P. falciparum genes identifies numerous enzymes that were incompletely annotated in first-pass annotation of the parasite genome. PMID:12952885

  17. Translatomics combined with transcriptomics and proteomics reveals novel functional, recently evolved orphan genes in Escherichia coli O157:H7 (EHEC).

    PubMed

    Neuhaus, Klaus; Landstorfer, Richard; Fellner, Lea; Simon, Svenja; Schafferhans, Andrea; Goldberg, Tatyana; Marx, Harald; Ozoline, Olga N; Rost, Burkhard; Kuster, Bernhard; Keim, Daniel A; Scherer, Siegfried

    2016-02-24

    Genomes of E. coli, including that of the human pathogen Escherichia coli O157:H7 (EHEC) EDL933, still harbor undetected protein-coding genes which, apparently, have escaped annotation due to their small size and non-essential function. To find such genes, global gene expression of EHEC EDL933 was examined, using strand-specific RNAseq (transcriptome), ribosomal footprinting (translatome) and mass spectrometry (proteome). Using the above methods, 72 short, non-annotated protein-coding genes were detected. All of these showed signals in the ribosomal footprinting assay indicating mRNA translation. Seven were verified by mass spectrometry. Fifty-seven genes are annotated in other enterobacteriaceae, mainly as hypothetical genes; the remaining 15 genes constitute novel discoveries. In addition, protein structure and function were predicted computationally and compared between EHEC-encoded proteins and 100-times randomly shuffled proteins. Based on this comparison, 61 of the 72 novel proteins exhibit predicted structural and functional features similar to those of annotated proteins. Many of the novel genes show differential transcription when grown under eleven diverse growth conditions suggesting environmental regulation. Three genes were found to confer a phenotype in previous studies, e.g., decreased cattle colonization. These findings demonstrate that ribosomal footprinting can be used to detect novel protein coding genes, contributing to the growing body of evidence that hypothetical genes are not annotation artifacts and opening an additional way to study their functionality. All 72 genes are taxonomically restricted and, therefore, appear to have evolved relatively recently de novo.

  18. Enriching regulatory networks by bootstrap learning using optimised GO-based gene similarity and gene links mined from PubMed abstracts

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Taylor, Ronald C.; Sanfilippo, Antonio P.; McDermott, Jason E.

    2011-02-18

    Transcriptional regulatory networks are being determined using “reverse engineering” methods that infer connections based on correlations in gene state. Corroboration of such networks through independent means such as evidence from the biomedical literature is desirable. Here, we explore a novel approach, a bootstrapping version of our previous Cross-Ontological Analytic method (XOA) that can be used for semi-automated annotation and verification of inferred regulatory connections, as well as for discovery of additional functional relationships between the genes. First, we use our annotation and network expansion method on a biological network learned entirely from the literature. We show how new relevant linksmore » between genes can be iteratively derived using a gene similarity measure based on the Gene Ontology that is optimized on the input network at each iteration. Second, we apply our method to annotation, verification, and expansion of a set of regulatory connections found by the Context Likelihood of Relatedness algorithm.« less

  19. Genome-wide Annotation, Identification, and Global Transcriptomic Analysis of Regulatory or Small RNA Gene Expression in Staphylococcus aureus

    PubMed Central

    Weiss, Andy; Broach, William H.; Wiemels, Richard E.; Mogen, Austin B.; Rice, Kelly C.

    2016-01-01

    ABSTRACT In Staphylococcus aureus, hundreds of small regulatory or small RNAs (sRNAs) have been identified, yet this class of molecule remains poorly understood and severely understudied. sRNA genes are typically absent from genome annotation files, and as a consequence, their existence is often overlooked, particularly in global transcriptomic studies. To facilitate improved detection and analysis of sRNAs in S. aureus, we generated updated GenBank files for three commonly used S. aureus strains (MRSA252, NCTC 8325, and USA300), in which we added annotations for >260 previously identified sRNAs. These files, the first to include genome-wide annotation of sRNAs in S. aureus, were then used as a foundation to identify novel sRNAs in the community-associated methicillin-resistant strain USA300. This analysis led to the discovery of 39 previously unidentified sRNAs. Investigating the genomic loci of the newly identified sRNAs revealed a surprising degree of inconsistency in genome annotation in S. aureus, which may be hindering the analysis and functional exploration of these elements. Finally, using our newly created annotation files as a reference, we perform a global analysis of sRNA gene expression in S. aureus and demonstrate that the newly identified tsr25 is the most highly upregulated sRNA in human serum. This study provides an invaluable resource to the S. aureus research community in the form of our newly generated annotation files, while at the same time presenting the first examination of differential sRNA expression in pathophysiologically relevant conditions. PMID:26861020

  20. Curated genome annotation of Oryza sativa ssp. japonica and comparative genome analysis with Arabidopsis thaliana

    PubMed Central

    Itoh, Takeshi; Tanaka, Tsuyoshi; Barrero, Roberto A.; Yamasaki, Chisato; Fujii, Yasuyuki; Hilton, Phillip B.; Antonio, Baltazar A.; Aono, Hideo; Apweiler, Rolf; Bruskiewich, Richard; Bureau, Thomas; Burr, Frances; Costa de Oliveira, Antonio; Fuks, Galina; Habara, Takuya; Haberer, Georg; Han, Bin; Harada, Erimi; Hiraki, Aiko T.; Hirochika, Hirohiko; Hoen, Douglas; Hokari, Hiroki; Hosokawa, Satomi; Hsing, Yue; Ikawa, Hiroshi; Ikeo, Kazuho; Imanishi, Tadashi; Ito, Yukiyo; Jaiswal, Pankaj; Kanno, Masako; Kawahara, Yoshihiro; Kawamura, Toshiyuki; Kawashima, Hiroaki; Khurana, Jitendra P.; Kikuchi, Shoshi; Komatsu, Setsuko; Koyanagi, Kanako O.; Kubooka, Hiromi; Lieberherr, Damien; Lin, Yao-Cheng; Lonsdale, David; Matsumoto, Takashi; Matsuya, Akihiro; McCombie, W. Richard; Messing, Joachim; Miyao, Akio; Mulder, Nicola; Nagamura, Yoshiaki; Nam, Jongmin; Namiki, Nobukazu; Numa, Hisataka; Nurimoto, Shin; O’Donovan, Claire; Ohyanagi, Hajime; Okido, Toshihisa; OOta, Satoshi; Osato, Naoki; Palmer, Lance E.; Quetier, Francis; Raghuvanshi, Saurabh; Saichi, Naomi; Sakai, Hiroaki; Sakai, Yasumichi; Sakata, Katsumi; Sakurai, Tetsuya; Sato, Fumihiko; Sato, Yoshiharu; Schoof, Heiko; Seki, Motoaki; Shibata, Michie; Shimizu, Yuji; Shinozaki, Kazuo; Shinso, Yuji; Singh, Nagendra K.; Smith-White, Brian; Takeda, Jun-ichi; Tanino, Motohiko; Tatusova, Tatiana; Thongjuea, Supat; Todokoro, Fusano; Tsugane, Mika; Tyagi, Akhilesh K.; Vanavichit, Apichart; Wang, Aihui; Wing, Rod A.; Yamaguchi, Kaori; Yamamoto, Mayu; Yamamoto, Naoyuki; Yu, Yeisoo; Zhang, Hao; Zhao, Qiang; Higo, Kenichi; Burr, Benjamin; Gojobori, Takashi; Sasaki, Takuji

    2007-01-01

    We present here the annotation of the complete genome of rice Oryza sativa L. ssp. japonica cultivar Nipponbare. All functional annotations for proteins and non-protein-coding RNA (npRNA) candidates were manually curated. Functions were identified or inferred in 19,969 (70%) of the proteins, and 131 possible npRNAs (including 58 antisense transcripts) were found. Almost 5000 annotated protein-coding genes were found to be disrupted in insertional mutant lines, which will accelerate future experimental validation of the annotations. The rice loci were determined by using cDNA sequences obtained from rice and other representative cereals. Our conservative estimate based on these loci and an extrapolation suggested that the gene number of rice is ∼32,000, which is smaller than previous estimates. We conducted comparative analyses between rice and Arabidopsis thaliana and found that both genomes possessed several lineage-specific genes, which might account for the observed differences between these species, while they had similar sets of predicted functional domains among the protein sequences. A system to control translational efficiency seems to be conserved across large evolutionary distances. Moreover, the evolutionary process of protein-coding genes was examined. Our results suggest that natural selection may have played a role for duplicated genes in both species, so that duplication was suppressed or favored in a manner that depended on the function of a gene. PMID:17210932

  1. Improved methods and resources for paramecium genomics: transcription units, gene annotation and gene expression.

    PubMed

    Arnaiz, Olivier; Van Dijk, Erwin; Bétermier, Mireille; Lhuillier-Akakpo, Maoussi; de Vanssay, Augustin; Duharcourt, Sandra; Sallet, Erika; Gouzy, Jérôme; Sperling, Linda

    2017-06-26

    The 15 sibling species of the Paramecium aurelia cryptic species complex emerged after a whole genome duplication that occurred tens of millions of years ago. Given extensive knowledge of the genetics and epigenetics of Paramecium acquired over the last century, this species complex offers a uniquely powerful system to investigate the consequences of whole genome duplication in a unicellular eukaryote as well as the genetic and epigenetic mechanisms that drive speciation. High quality Paramecium gene models are important for research using this system. The major aim of the work reported here was to build an improved gene annotation pipeline for the Paramecium lineage. We generated oriented RNA-Seq transcriptome data across the sexual process of autogamy for the model species Paramecium tetraurelia. We determined, for the first time in a ciliate, candidate P. tetraurelia transcription start sites using an adapted Cap-Seq protocol. We developed TrUC, multi-threaded Perl software that in conjunction with TopHat mapping of RNA-Seq data to a reference genome, predicts transcription units for the annotation pipeline. We used EuGene software to combine annotation evidence. The high quality gene structural annotations obtained for P. tetraurelia were used as evidence to improve published annotations for 3 other Paramecium species. The RNA-Seq data were also used for differential gene expression analysis, providing a gene expression atlas that is more sensitive than the previously established microarray resource. We have developed a gene annotation pipeline tailored for the compact genomes and tiny introns of Paramecium species. A novel component of this pipeline, TrUC, predicts transcription units using Cap-Seq and oriented RNA-Seq data. TrUC could prove useful beyond Paramecium, especially in the case of high gene density. Accurate predictions of 3' and 5' UTR will be particularly valuable for studies of gene expression (e.g. nucleosome positioning, identification of cis regulatory motifs). The P. tetraurelia improved transcriptome resource, gene annotations for P. tetraurelia, P. biaurelia, P. sexaurelia and P. caudatum, and Paramecium-trained EuGene configuration are available through ParameciumDB ( http://paramecium.i2bc.paris-saclay.fr ). TrUC software is freely distributed under a GNU GPL v3 licence ( https://github.com/oarnaiz/TrUC ).

  2. Assembling the Setaria italica L. Beauv. genome into nine chromosomes and insights into regions affecting growth and drought tolerance

    PubMed Central

    Tsai, Kevin J.; Lu, Mei-Yeh Jade; Yang, Kai-Jung; Li, Mengyun; Teng, Yuchuan; Chen, Shihmay; Ku, Maurice S. B.; Li, Wen-Hsiung

    2016-01-01

    The diploid C4 plant foxtail millet (Setaria italica L. Beauv.) is an important crop in many parts of Africa and Asia for the vast consumption of its grain and ability to grow in harsh environments, but remains understudied in terms of complete genomic architecture. To date, there have been only two genome assembly and annotation efforts with neither assembly reaching over 86% of the estimated genome size. We have combined de novo assembly with custom reference-guided improvements on a popular cultivar of foxtail millet and have achieved a genome assembly of 477 Mbp in length, which represents over 97% of the estimated 490 Mbp. The assembly anchors over 98% of the predicted genes to the nine assembled nuclear chromosomes and contains more functional annotation gene models than previous assemblies. Our annotation has identified a large number of unique gene ontology terms related to metabolic activities, a region of chromosome 9 with several growth factor proteins, and regions syntenic with pearl millet or maize genomic regions that have been previously shown to affect growth. The new assembly and annotation for this important species can be used for detailed investigation and future innovations in growth for millet and other grains. PMID:27734962

  3. Assembling the Setaria italica L. Beauv. genome into nine chromosomes and insights into regions affecting growth and drought tolerance.

    PubMed

    Tsai, Kevin J; Lu, Mei-Yeh Jade; Yang, Kai-Jung; Li, Mengyun; Teng, Yuchuan; Chen, Shihmay; Ku, Maurice S B; Li, Wen-Hsiung

    2016-10-13

    The diploid C 4 plant foxtail millet (Setaria italica L. Beauv.) is an important crop in many parts of Africa and Asia for the vast consumption of its grain and ability to grow in harsh environments, but remains understudied in terms of complete genomic architecture. To date, there have been only two genome assembly and annotation efforts with neither assembly reaching over 86% of the estimated genome size. We have combined de novo assembly with custom reference-guided improvements on a popular cultivar of foxtail millet and have achieved a genome assembly of 477 Mbp in length, which represents over 97% of the estimated 490 Mbp. The assembly anchors over 98% of the predicted genes to the nine assembled nuclear chromosomes and contains more functional annotation gene models than previous assemblies. Our annotation has identified a large number of unique gene ontology terms related to metabolic activities, a region of chromosome 9 with several growth factor proteins, and regions syntenic with pearl millet or maize genomic regions that have been previously shown to affect growth. The new assembly and annotation for this important species can be used for detailed investigation and future innovations in growth for millet and other grains.

  4. Genome-wide annotation of the soybean WRKY family and functional characterization of genes involved in response to Phakopsora pachyrhizi infection.

    PubMed

    Bencke-Malato, Marta; Cabreira, Caroline; Wiebke-Strohm, Beatriz; Bücker-Neto, Lauro; Mancini, Estefania; Osorio, Marina B; Homrich, Milena S; Turchetto-Zolet, Andreia Carina; De Carvalho, Mayra C C G; Stolf, Renata; Weber, Ricardo L M; Westergaard, Gastón; Castagnaro, Atílio P; Abdelnoor, Ricardo V; Marcelino-Guimarães, Francismar C; Margis-Pinheiro, Márcia; Bodanese-Zanettini, Maria Helena

    2014-09-10

    Many previous studies have shown that soybean WRKY transcription factors are involved in the plant response to biotic and abiotic stresses. Phakopsora pachyrhizi is the causal agent of Asian Soybean Rust, one of the most important soybean diseases. There are evidences that WRKYs are involved in the resistance of some soybean genotypes against that fungus. The number of WRKY genes already annotated in soybean genome was underrepresented. In the present study, a genome-wide annotation of the soybean WRKY family was carried out and members involved in the response to P. pachyrhizi were identified. As a result of a soybean genomic databases search, 182 WRKY-encoding genes were annotated and 33 putative pseudogenes identified. Genes involved in the response to P. pachyrhizi infection were identified using superSAGE, RNA-Seq of microdissected lesions and microarray experiments. Seventy-five genes were differentially expressed during fungal infection. The expression of eight WRKY genes was validated by RT-qPCR. The expression of these genes in a resistant genotype was earlier and/or stronger compared with a susceptible genotype in response to P. pachyrhizi infection. Soybean somatic embryos were transformed in order to overexpress or silence WRKY genes. Embryos overexpressing a WRKY gene were obtained, but they were unable to convert into plants. When infected with P. pachyrhizi, the leaves of the silenced transgenic line showed a higher number of lesions than the wild-type plants. The present study reports a genome-wide annotation of soybean WRKY family. The participation of some members in response to P. pachyrhizi infection was demonstrated. The results contribute to the elucidation of gene function and suggest the manipulation of WRKYs as a strategy to increase fungal resistance in soybean plants.

  5. Deep developmental transcriptome sequencing uncovers numerous new genes and enhances gene annotation in the sponge Amphimedon queenslandica.

    PubMed

    Fernandez-Valverde, Selene L; Calcino, Andrew D; Degnan, Bernard M

    2015-05-15

    The demosponge Amphimedon queenslandica is amongst the few early-branching metazoans with an assembled and annotated draft genome, making it an important species in the study of the origin and early evolution of animals. Current gene models in this species are largely based on in silico predictions and low coverage expressed sequence tag (EST) evidence. Amphimedon queenslandica protein-coding gene models are improved using deep RNA-Seq data from four developmental stages and CEL-Seq data from 82 developmental samples. Over 86% of previously predicted genes are retained in the new gene models, although 24% have additional exons; there is also a marked increase in the total number of annotated 3' and 5' untranslated regions (UTRs). Importantly, these new developmental transcriptome data reveal numerous previously unannotated protein-coding genes in the Amphimedon genome, increasing the total gene number by 25%, from 30,060 to 40,122. In general, Amphimedon genes have introns that are markedly smaller than those in other animals and most of the alternatively spliced genes in Amphimedon undergo intron-retention; exon-skipping is the least common mode of alternative splicing. Finally, in addition to canonical polyadenylation signal sequences, Amphimedon genes are enriched in a number of unique AT-rich motifs in their 3' UTRs. The inclusion of developmental transcriptome data has substantially improved the structure and composition of protein-coding gene models in Amphimedon queenslandica, providing a more accurate and comprehensive set of genes for functional and comparative studies. These improvements reveal the Amphimedon genome is comprised of a remarkably high number of tightly packed genes. These genes have small introns and there is pervasive intron retention amongst alternatively spliced transcripts. These aspects of the sponge genome are more similar unicellular opisthokont genomes than to other animal genomes.

  6. Evidence-Based Annotation of Gene Function in Shewanella oneidensis MR-1 Using Genome-Wide Fitness Profiling across 121 Conditions

    PubMed Central

    Deutschbauer, Adam; Price, Morgan N.; Wetmore, Kelly M.; Shao, Wenjun; Baumohl, Jason K.; Xu, Zhuchen; Nguyen, Michelle; Tamse, Raquel; Davis, Ronald W.; Arkin, Adam P.

    2011-01-01

    Most genes in bacteria are experimentally uncharacterized and cannot be annotated with a specific function. Given the great diversity of bacteria and the ease of genome sequencing, high-throughput approaches to identify gene function experimentally are needed. Here, we use pools of tagged transposon mutants in the metal-reducing bacterium Shewanella oneidensis MR-1 to probe the mutant fitness of 3,355 genes in 121 diverse conditions including different growth substrates, alternative electron acceptors, stresses, and motility. We find that 2,350 genes have a pattern of fitness that is significantly different from random and 1,230 of these genes (37% of our total assayed genes) have enough signal to show strong biological correlations. We find that genes in all functional categories have phenotypes, including hundreds of hypotheticals, and that potentially redundant genes (over 50% amino acid identity to another gene in the genome) are also likely to have distinct phenotypes. Using fitness patterns, we were able to propose specific molecular functions for 40 genes or operons that lacked specific annotations or had incomplete annotations. In one example, we demonstrate that the previously hypothetical gene SO_3749 encodes a functional acetylornithine deacetylase, thus filling a missing step in S. oneidensis metabolism. Additionally, we demonstrate that the orphan histidine kinase SO_2742 and orphan response regulator SO_2648 form a signal transduction pathway that activates expression of acetyl-CoA synthase and is required for S. oneidensis to grow on acetate as a carbon source. Lastly, we demonstrate that gene expression and mutant fitness are poorly correlated and that mutant fitness generates more confident predictions of gene function than does gene expression. The approach described here can be applied generally to create large-scale gene-phenotype maps for evidence-based annotation of gene function in prokaryotes. PMID:22125499

  7. Analysis of mammalian gene function through broad-based phenotypic screens across a consortium of mouse clinics.

    PubMed

    de Angelis, Martin Hrabě; Nicholson, George; Selloum, Mohammed; White, Jacqui; Morgan, Hugh; Ramirez-Solis, Ramiro; Sorg, Tania; Wells, Sara; Fuchs, Helmut; Fray, Martin; Adams, David J; Adams, Niels C; Adler, Thure; Aguilar-Pimentel, Antonio; Ali-Hadji, Dalila; Amann, Gregory; André, Philippe; Atkins, Sarah; Auburtin, Aurelie; Ayadi, Abdel; Becker, Julien; Becker, Lore; Bedu, Elodie; Bekeredjian, Raffi; Birling, Marie-Christine; Blake, Andrew; Bottomley, Joanna; Bowl, Mike; Brault, Véronique; Busch, Dirk H; Bussell, James N; Calzada-Wack, Julia; Cater, Heather; Champy, Marie-France; Charles, Philippe; Chevalier, Claire; Chiani, Francesco; Codner, Gemma F; Combe, Roy; Cox, Roger; Dalloneau, Emilie; Dierich, André; Di Fenza, Armida; Doe, Brendan; Duchon, Arnaud; Eickelberg, Oliver; Esapa, Chris T; El Fertak, Lahcen; Feigel, Tanja; Emelyanova, Irina; Estabel, Jeanne; Favor, Jack; Flenniken, Ann; Gambadoro, Alessia; Garrett, Lilian; Gates, Hilary; Gerdin, Anna-Karin; Gkoutos, George; Greenaway, Simon; Glasl, Lisa; Goetz, Patrice; Da Cruz, Isabelle Goncalves; Götz, Alexander; Graw, Jochen; Guimond, Alain; Hans, Wolfgang; Hicks, Geoff; Hölter, Sabine M; Höfler, Heinz; Hancock, John M; Hoehndorf, Robert; Hough, Tertius; Houghton, Richard; Hurt, Anja; Ivandic, Boris; Jacobs, Hughes; Jacquot, Sylvie; Jones, Nora; Karp, Natasha A; Katus, Hugo A; Kitchen, Sharon; Klein-Rodewald, Tanja; Klingenspor, Martin; Klopstock, Thomas; Lalanne, Valerie; Leblanc, Sophie; Lengger, Christoph; le Marchand, Elise; Ludwig, Tonia; Lux, Aline; McKerlie, Colin; Maier, Holger; Mandel, Jean-Louis; Marschall, Susan; Mark, Manuel; Melvin, David G; Meziane, Hamid; Micklich, Kateryna; Mittelhauser, Christophe; Monassier, Laurent; Moulaert, David; Muller, Stéphanie; Naton, Beatrix; Neff, Frauke; Nolan, Patrick M; Nutter, Lauryl Mj; Ollert, Markus; Pavlovic, Guillaume; Pellegata, Natalia S; Peter, Emilie; Petit-Demoulière, Benoit; Pickard, Amanda; Podrini, Christine; Potter, Paul; Pouilly, Laurent; Puk, Oliver; Richardson, David; Rousseau, Stephane; Quintanilla-Fend, Leticia; Quwailid, Mohamed M; Racz, Ildiko; Rathkolb, Birgit; Riet, Fabrice; Rossant, Janet; Roux, Michel; Rozman, Jan; Ryder, Ed; Salisbury, Jennifer; Santos, Luis; Schäble, Karl-Heinz; Schiller, Evelyn; Schrewe, Anja; Schulz, Holger; Steinkamp, Ralf; Simon, Michelle; Stewart, Michelle; Stöger, Claudia; Stöger, Tobias; Sun, Minxuan; Sunter, David; Teboul, Lydia; Tilly, Isabelle; Tocchini-Valentini, Glauco P; Tost, Monica; Treise, Irina; Vasseur, Laurent; Velot, Emilie; Vogt-Weisenhorn, Daniela; Wagner, Christelle; Walling, Alison; Weber, Bruno; Wendling, Olivia; Westerberg, Henrik; Willershäuser, Monja; Wolf, Eckhard; Wolter, Anne; Wood, Joe; Wurst, Wolfgang; Yildirim, Ali Önder; Zeh, Ramona; Zimmer, Andreas; Zimprich, Annemarie; Holmes, Chris; Steel, Karen P; Herault, Yann; Gailus-Durner, Valérie; Mallon, Ann-Marie; Brown, Steve Dm

    2015-09-01

    The function of the majority of genes in the mouse and human genomes remains unknown. The mouse embryonic stem cell knockout resource provides a basis for the characterization of relationships between genes and phenotypes. The EUMODIC consortium developed and validated robust methodologies for the broad-based phenotyping of knockouts through a pipeline comprising 20 disease-oriented platforms. We developed new statistical methods for pipeline design and data analysis aimed at detecting reproducible phenotypes with high power. We acquired phenotype data from 449 mutant alleles, representing 320 unique genes, of which half had no previous functional annotation. We captured data from over 27,000 mice, finding that 83% of the mutant lines are phenodeviant, with 65% demonstrating pleiotropy. Surprisingly, we found significant differences in phenotype annotation according to zygosity. New phenotypes were uncovered for many genes with previously unknown function, providing a powerful basis for hypothesis generation and further investigation in diverse systems.

  8. A computational approach to candidate gene prioritization for X-linked mental retardation using annotation-based binary filtering and motif-based linear discriminatory analysis

    PubMed Central

    2011-01-01

    Background Several computational candidate gene selection and prioritization methods have recently been developed. These in silico selection and prioritization techniques are usually based on two central approaches - the examination of similarities to known disease genes and/or the evaluation of functional annotation of genes. Each of these approaches has its own caveats. Here we employ a previously described method of candidate gene prioritization based mainly on gene annotation, in accompaniment with a technique based on the evaluation of pertinent sequence motifs or signatures, in an attempt to refine the gene prioritization approach. We apply this approach to X-linked mental retardation (XLMR), a group of heterogeneous disorders for which some of the underlying genetics is known. Results The gene annotation-based binary filtering method yielded a ranked list of putative XLMR candidate genes with good plausibility of being associated with the development of mental retardation. In parallel, a motif finding approach based on linear discriminatory analysis (LDA) was employed to identify short sequence patterns that may discriminate XLMR from non-XLMR genes. High rates (>80%) of correct classification was achieved, suggesting that the identification of these motifs effectively captures genomic signals associated with XLMR vs. non-XLMR genes. The computational tools developed for the motif-based LDA is integrated into the freely available genomic analysis portal Galaxy (http://main.g2.bx.psu.edu/). Nine genes (APLN, ZC4H2, MAGED4, MAGED4B, RAP2C, FAM156A, FAM156B, TBL1X, and UXT) were highlighted as highly-ranked XLMR methods. Conclusions The combination of gene annotation information and sequence motif-orientated computational candidate gene prediction methods highlight an added benefit in generating a list of plausible candidate genes, as has been demonstrated for XLMR. Reviewers: This article was reviewed by Dr Barbara Bardoni (nominated by Prof Juergen Brosius); Prof Neil Smalheiser and Dr Dustin Holloway (nominated by Prof Charles DeLisi). PMID:21668950

  9. GARNET--gene set analysis with exploration of annotation relations.

    PubMed

    Rho, Kyoohyoung; Kim, Bumjin; Jang, Youngjun; Lee, Sanghyun; Bae, Taejeong; Seo, Jihae; Seo, Chaehwa; Lee, Jihyun; Kang, Hyunjung; Yu, Ungsik; Kim, Sunghoon; Lee, Sanghyuk; Kim, Wan Kyu

    2011-02-15

    Gene set analysis is a powerful method of deducing biological meaning for an a priori defined set of genes. Numerous tools have been developed to test statistical enrichment or depletion in specific pathways or gene ontology (GO) terms. Major difficulties towards biological interpretation are integrating diverse types of annotation categories and exploring the relationships between annotation terms of similar information. GARNET (Gene Annotation Relationship NEtwork Tools) is an integrative platform for gene set analysis with many novel features. It includes tools for retrieval of genes from annotation database, statistical analysis & visualization of annotation relationships, and managing gene sets. In an effort to allow access to a full spectrum of amassed biological knowledge, we have integrated a variety of annotation data that include the GO, domain, disease, drug, chromosomal location, and custom-defined annotations. Diverse types of molecular networks (pathways, transcription and microRNA regulations, protein-protein interaction) are also included. The pair-wise relationship between annotation gene sets was calculated using kappa statistics. GARNET consists of three modules--gene set manager, gene set analysis and gene set retrieval, which are tightly integrated to provide virtually automatic analysis for gene sets. A dedicated viewer for annotation network has been developed to facilitate exploration of the related annotations. GARNET (gene annotation relationship network tools) is an integrative platform for diverse types of gene set analysis, where complex relationships among gene annotations can be easily explored with an intuitive network visualization tool (http://garnet.isysbio.org/ or http://ercsb.ewha.ac.kr/garnet/).

  10. Visualizing conserved gene location across microbe genomes

    NASA Astrophysics Data System (ADS)

    Shaw, Chris D.

    2009-01-01

    This paper introduces an analysis-based zoomable visualization technique for displaying the location of genes across many related species of microbes. The purpose of this visualizatiuon is to enable a biologist to examine the layout of genes in the organism of interest with respect to the gene organization of related organisms. During the genomic annotation process, the ability to observe gene organization in common with previously annotated genomes can help a biologist better confirm the structure and function of newly analyzed microbe DNA sequences. We have developed a visualization and analysis tool that enables the biologist to observe and examine gene organization among genomes, in the context of the primary sequence of interest. This paper describes the visualization and analysis steps, and presents a case study using a number of Rickettsia genomes.

  11. RATT: Rapid Annotation Transfer Tool

    PubMed Central

    Otto, Thomas D.; Dillon, Gary P.; Degrave, Wim S.; Berriman, Matthew

    2011-01-01

    Second-generation sequencing technologies have made large-scale sequencing projects commonplace. However, making use of these datasets often requires gene function to be ascribed genome wide. Although tool development has kept pace with the changes in sequence production, for tasks such as mapping, de novo assembly or visualization, genome annotation remains a challenge. We have developed a method to rapidly provide accurate annotation for new genomes using previously annotated genomes as a reference. The method, implemented in a tool called RATT (Rapid Annotation Transfer Tool), transfers annotations from a high-quality reference to a new genome on the basis of conserved synteny. We demonstrate that a Mycobacterium tuberculosis genome or a single 2.5 Mb chromosome from a malaria parasite can be annotated in less than five minutes with only modest computational resources. RATT is available at http://ratt.sourceforge.net. PMID:21306991

  12. Validation of MIMGO: a method to identify differentially expressed GO terms in a microarray dataset

    PubMed Central

    2012-01-01

    Background We previously proposed an algorithm for the identification of GO terms that commonly annotate genes whose expression is upregulated or downregulated in some microarray data compared with in other microarray data. We call these “differentially expressed GO terms” and have named the algorithm “matrix-assisted identification method of differentially expressed GO terms” (MIMGO). MIMGO can also identify microarray data in which genes annotated with a differentially expressed GO term are upregulated or downregulated. However, MIMGO has not yet been validated on a real microarray dataset using all available GO terms. Findings We combined Gene Set Enrichment Analysis (GSEA) with MIMGO to identify differentially expressed GO terms in a yeast cell cycle microarray dataset. GSEA followed by MIMGO (GSEA + MIMGO) correctly identified (p < 0.05) microarray data in which genes annotated to differentially expressed GO terms are upregulated. We found that GSEA + MIMGO was slightly less effective than, or comparable to, GSEA (Pearson), a method that uses Pearson’s correlation as a metric, at detecting true differentially expressed GO terms. However, unlike other methods including GSEA (Pearson), GSEA + MIMGO can comprehensively identify the microarray data in which genes annotated with a differentially expressed GO term are upregulated or downregulated. Conclusions MIMGO is a reliable method to identify differentially expressed GO terms comprehensively. PMID:23232071

  13. Transcriptome profile of a bovine respiratory disease pathogen: Mannheimia haemolytica PHL213

    PubMed Central

    2012-01-01

    Background Computational methods for structural gene annotation have propelled gene discovery but face certain drawbacks with regards to prokaryotic genome annotation. Identification of transcriptional start sites, demarcating overlapping gene boundaries, and identifying regulatory elements such as small RNA are not accurate using these approaches. In this study, we re-visit the structural annotation of Mannheimia haemolytica PHL213, a bovine respiratory disease pathogen. M. haemolytica is one of the causative agents of bovine respiratory disease that results in about $3 billion annual losses to the cattle industry. We used RNA-Seq and analyzed the data using freely-available computational methods and resources. The aim was to identify previously unannotated regions of the genome using RNA-Seq based expression profile to complement the existing annotation of this pathogen. Results Using the Illumina Genome Analyzer, we generated 9,055,826 reads (average length ~76 bp) and aligned them to the reference genome using Bowtie. The transcribed regions were analyzed using SAMTOOLS and custom Perl scripts in conjunction with BLAST searches and available gene annotation information. The single nucleotide resolution map enabled the identification of 14 novel protein coding regions as well as 44 potential novel sRNA. The basal transcription profile revealed that 2,506 of the 2,837 annotated regions were expressed in vitro, at 95.25% coverage, representing all broad functional gene categories in the genome. The expression profile also helped identify 518 potential operon structures involving 1,086 co-expressed pairs. We also identified 11 proteins with mutated/alternate start codons. Conclusions The application of RNA-Seq based transcriptome profiling to structural gene annotation helped correct existing annotation errors and identify potential novel protein coding regions and sRNA. We used computational tools to predict regulatory elements such as promoters and terminators associated with the novel expressed regions for further characterization of these novel functional elements. Our study complements the existing structural annotation of Mannheimia haemolytica PHL213 based on experimental evidence. Given the role of sRNA in virulence gene regulation and stress response, potential novel sRNA described in this study can form the framework for future studies to determine the role of sRNA, if any, in M. haemolytica pathogenesis. PMID:23046475

  14. Extending bicluster analysis to annotate unclassified ORFs and predict novel functional modules using expression data

    PubMed Central

    Bryan, Kenneth; Cunningham, Pádraig

    2008-01-01

    Background Microarrays have the capacity to measure the expressions of thousands of genes in parallel over many experimental samples. The unsupervised classification technique of bicluster analysis has been employed previously to uncover gene expression correlations over subsets of samples with the aim of providing a more accurate model of the natural gene functional classes. This approach also has the potential to aid functional annotation of unclassified open reading frames (ORFs). Until now this aspect of biclustering has been under-explored. In this work we illustrate how bicluster analysis may be extended into a 'semi-supervised' ORF annotation approach referred to as BALBOA. Results The efficacy of the BALBOA ORF classification technique is first assessed via cross validation and compared to a multi-class k-Nearest Neighbour (kNN) benchmark across three independent gene expression datasets. BALBOA is then used to assign putative functional annotations to unclassified yeast ORFs. These predictions are evaluated using existing experimental and protein sequence information. Lastly, we employ a related semi-supervised method to predict the presence of novel functional modules within yeast. Conclusion In this paper we demonstrate how unsupervised classification methods, such as bicluster analysis, may be extended using of available annotations to form semi-supervised approaches within the gene expression analysis domain. We show that such methods have the potential to improve upon supervised approaches and shed new light on the functions of unclassified ORFs and their co-regulation. PMID:18831786

  15. ConsPred: a rule-based (re-)annotation framework for prokaryotic genomes.

    PubMed

    Weinmaier, Thomas; Platzer, Alexander; Frank, Jeroen; Hellinger, Hans-Jörg; Tischler, Patrick; Rattei, Thomas

    2016-11-01

    The rapidly growing number of available prokaryotic genome sequences requires fully automated and high-quality software solutions for their initial and re-annotation. Here we present ConsPred, a prokaryotic genome annotation framework that performs intrinsic gene predictions, homology searches, predictions of non-coding genes as well as CRISPR repeats and integrates all evidence into a consensus annotation. ConsPred achieves comprehensive, high-quality annotations based on rules and priorities, similar to decision-making in manual curation and avoids conflicting predictions. Parameters controlling the annotation process are configurable by the user. ConsPred has been used in the institutions of the authors for longer than 5 years and can easily be extended and adapted to specific needs. The ConsPred algorithm for producing a consensus from the varying scores of multiple gene prediction programs approaches manual curation in accuracy. Its rule-based approach for choosing final predictions avoids overriding previous manual curations. ConsPred is implemented in Java, Perl and Shell and is freely available under the Creative Commons license as a stand-alone in-house pipeline or as an Amazon Machine Image for cloud computing, see https://sourceforge.net/projects/conspred/. thomas.rattei@univie.ac.atSupplementary information: Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.

  16. GeneFarm, structural and functional annotation of Arabidopsis gene and protein families by a network of experts

    PubMed Central

    Aubourg, Sébastien; Brunaud, Véronique; Bruyère, Clémence; Cock, Mark; Cooke, Richard; Cottet, Annick; Couloux, Arnaud; Déhais, Patrice; Deléage, Gilbert; Duclert, Aymeric; Echeverria, Manuel; Eschbach, Aimée; Falconet, Denis; Filippi, Ghislain; Gaspin, Christine; Geourjon, Christophe; Grienenberger, Jean-Michel; Houlné, Guy; Jamet, Elisabeth; Lechauve, Frédéric; Leleu, Olivier; Leroy, Philippe; Mache, Régis; Meyer, Christian; Nedjari, Hafed; Negrutiu, Ioan; Orsini, Valérie; Peyretaillade, Eric; Pommier, Cyril; Raes, Jeroen; Risler, Jean-Loup; Rivière, Stéphane; Rombauts, Stéphane; Rouzé, Pierre; Schneider, Michel; Schwob, Philippe; Small, Ian; Soumayet-Kampetenga, Ghislain; Stankovski, Darko; Toffano, Claire; Tognolli, Michael; Caboche, Michel; Lecharny, Alain

    2005-01-01

    Genomic projects heavily depend on genome annotations and are limited by the current deficiencies in the published predictions of gene structure and function. It follows that, improved annotation will allow better data mining of genomes, and more secure planning and design of experiments. The purpose of the GeneFarm project is to obtain homogeneous, reliable, documented and traceable annotations for Arabidopsis nuclear genes and gene products, and to enter them into an added-value database. This re-annotation project is being performed exhaustively on every member of each gene family. Performing a family-wide annotation makes the task easier and more efficient than a gene-by-gene approach since many features obtained for one gene can be extrapolated to some or all the other genes of a family. A complete annotation procedure based on the most efficient prediction tools available is being used by 16 partner laboratories, each contributing annotated families from its field of expertise. A database, named GeneFarm, and an associated user-friendly interface to query the annotations have been developed. More than 3000 genes distributed over 300 families have been annotated and are available at http://genoplante-info.infobiogen.fr/Genefarm/. Furthermore, collaboration with the Swiss Institute of Bioinformatics is underway to integrate the GeneFarm data into the protein knowledgebase Swiss-Prot. PMID:15608279

  17. NoGOA: predicting noisy GO annotations using evidences and sparse representation.

    PubMed

    Yu, Guoxian; Lu, Chang; Wang, Jun

    2017-07-21

    Gene Ontology (GO) is a community effort to represent functional features of gene products. GO annotations (GOA) provide functional associations between GO terms and gene products. Due to resources limitation, only a small portion of annotations are manually checked by curators, and the others are electronically inferred. Although quality control techniques have been applied to ensure the quality of annotations, the community consistently report that there are still considerable noisy (or incorrect) annotations. Given the wide application of annotations, however, how to identify noisy annotations is an important but yet seldom studied open problem. We introduce a novel approach called NoGOA to predict noisy annotations. NoGOA applies sparse representation on the gene-term association matrix to reduce the impact of noisy annotations, and takes advantage of sparse representation coefficients to measure the semantic similarity between genes. Secondly, it preliminarily predicts noisy annotations of a gene based on aggregated votes from semantic neighborhood genes of that gene. Next, NoGOA estimates the ratio of noisy annotations for each evidence code based on direct annotations in GOA files archived on different periods, and then weights entries of the association matrix via estimated ratios and propagates weights to ancestors of direct annotations using GO hierarchy. Finally, it integrates evidence-weighted association matrix and aggregated votes to predict noisy annotations. Experiments on archived GOA files of six model species (H. sapiens, A. thaliana, S. cerevisiae, G. gallus, B. Taurus and M. musculus) demonstrate that NoGOA achieves significantly better results than other related methods and removing noisy annotations improves the performance of gene function prediction. The comparative study justifies the effectiveness of integrating evidence codes with sparse representation for predicting noisy GO annotations. Codes and datasets are available at http://mlda.swu.edu.cn/codes.php?name=NoGOA .

  18. An integrative approach to inferring biologically meaningful gene modules.

    PubMed

    Cho, Ji-Hoon; Wang, Kai; Galas, David J

    2011-07-26

    The ability to construct biologically meaningful gene networks and modules is critical for contemporary systems biology. Though recent studies have demonstrated the power of using gene modules to shed light on the functioning of complex biological systems, most modules in these networks have shown little association with meaningful biological function. We have devised a method which directly incorporates gene ontology (GO) annotation in construction of gene modules in order to gain better functional association. We have devised a method, Semantic Similarity-Integrated approach for Modularization (SSIM) that integrates various gene-gene pairwise similarity values, including information obtained from gene expression, protein-protein interactions and GO annotations, in the construction of modules using affinity propagation clustering. We demonstrated the performance of the proposed method using data from two complex biological responses: 1. the osmotic shock response in Saccharomyces cerevisiae, and 2. the prion-induced pathogenic mouse model. In comparison with two previously reported algorithms, modules identified by SSIM showed significantly stronger association with biological functions. The incorporation of semantic similarity based on GO annotation with gene expression and protein-protein interaction data can greatly enhance the functional relevance of inferred gene modules. In addition, the SSIM approach can also reveal the hierarchical structure of gene modules to gain a broader functional view of the biological system. Hence, the proposed method can facilitate comprehensive and in-depth analysis of high throughput experimental data at the gene network level.

  19. PRGdb 3.0: a comprehensive platform for prediction and analysis of plant disease resistance genes.

    PubMed

    Osuna-Cruz, Cristina M; Paytuvi-Gallart, Andreu; Di Donato, Antimo; Sundesha, Vicky; Andolfo, Giuseppe; Aiese Cigliano, Riccardo; Sanseverino, Walter; Ercolano, Maria R

    2018-01-04

    The Plant Resistance Genes database (PRGdb; http://prgdb.org) has been redesigned with a new user interface, new sections, new tools and new data for genetic improvement, allowing easy access not only to the plant science research community but also to breeders who want to improve plant disease resistance. The home page offers an overview of easy-to-read search boxes that streamline data queries and directly show plant species for which data from candidate or cloned genes have been collected. Bulk data files and curated resistance gene annotations are made available for each plant species hosted. The new Gene Model view offers detailed information on each cloned resistance gene structure to highlight shared attributes with other genes. PRGdb 3.0 offers 153 reference resistance genes and 177 072 annotated candidate Pathogen Receptor Genes (PRGs). Compared to the previous release, the number of putative genes has been increased from 106 to 177 K from 76 sequenced Viridiplantae and algae genomes. The DRAGO 2 tool, which automatically annotates and predicts (PRGs) from DNA and amino acid with high accuracy and sensitivity, has been added. BLAST search has been implemented to offer users the opportunity to annotate and compare their own sequences. The improved section on plant diseases displays useful information linked to genes and genomes to connect complementary data and better address specific needs. Through, a revised and enlarged collection of data, the development of new tools and a renewed portal, PRGdb 3.0 engages the plant science community in developing a consensus plan to improve knowledge and strategies to fight diseases that afflict main crops and other plants. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.

  20. RNA-Seq analysis and annotation of a draft blueberry genome assembly identifies candidate genes involved in fruit ripening, biosynthesis of bioactive compounds, and stage-specific alternative splicing.

    PubMed

    Gupta, Vikas; Estrada, April D; Blakley, Ivory; Reid, Rob; Patel, Ketan; Meyer, Mason D; Andersen, Stig Uggerhøj; Brown, Allan F; Lila, Mary Ann; Loraine, Ann E

    2015-01-01

    Blueberries are a rich source of antioxidants and other beneficial compounds that can protect against disease. Identifying genes involved in synthesis of bioactive compounds could enable the breeding of berry varieties with enhanced health benefits. Toward this end, we annotated a previously sequenced draft blueberry genome assembly using RNA-Seq data from five stages of berry fruit development and ripening. Genome-guided assembly of RNA-Seq read alignments combined with output from ab initio gene finders produced around 60,000 gene models, of which more than half were similar to proteins from other species, typically the grape Vitis vinifera. Comparison of gene models to the PlantCyc database of metabolic pathway enzymes identified candidate genes involved in synthesis of bioactive compounds, including bixin, an apocarotenoid with potential disease-fighting properties, and defense-related cyanogenic glycosides, which are toxic. Cyanogenic glycoside (CG) biosynthetic enzymes were highly expressed in green fruit, and a candidate CG detoxification enzyme was up-regulated during fruit ripening. Candidate genes for ethylene, anthocyanin, and 400 other biosynthetic pathways were also identified. Homology-based annotation using Blast2GO and InterPro assigned Gene Ontology terms to around 15,000 genes. RNA-Seq expression profiling showed that blueberry growth, maturation, and ripening involve dynamic gene expression changes, including coordinated up- and down-regulation of metabolic pathway enzymes and transcriptional regulators. Analysis of RNA-seq alignments identified developmentally regulated alternative splicing, promoter use, and 3' end formation. We report genome sequence, gene models, functional annotations, and RNA-Seq expression data that provide an important new resource enabling high throughput studies in blueberry.

  1. Transcriptome characterization for genome annotation and functional genomics in Theobroma cacao

    USDA-ARS?s Scientific Manuscript database

    Evidence from leaf transcriptome sequencing using two technology platforms, in combination with protein homology and trained ab initio predictions, previously enabled us to build 35,000 gene models in T. cacao (www.cacaogenomedb.org). Here we review the contribution of each data type to cacao gene a...

  2. Candidate Gene Identification with SNP Marker-Based Fine Mapping of Anthracnose Resistance Gene Co-4 in Common Bean.

    PubMed

    Burt, Andrew J; William, H Manilal; Perry, Gregory; Khanal, Raja; Pauls, K Peter; Kelly, James D; Navabi, Alireza

    2015-01-01

    Anthracnose, caused by Colletotrichum lindemuthianum, is an important fungal disease of common bean (Phaseolus vulgaris). Alleles at the Co-4 locus confer resistance to a number of races of C. lindemuthianum. A population of 94 F4:5 recombinant inbred lines of a cross between resistant black bean genotype B09197 and susceptible navy bean cultivar Nautica was used to identify markers associated with resistance in bean chromosome 8 (Pv08) where Co-4 is localized. Three SCAR markers with known linkage to Co-4 and a panel of single nucleotide markers were used for genotyping. A refined physical region on Pv08 with significant association with anthracnose resistance identified by markers was used in BLAST searches with the genomic sequence of common bean accession G19833. Thirty two unique annotated candidate genes were identified that spanned a physical region of 936.46 kb. A majority of the annotated genes identified had functional similarity to leucine rich repeats/receptor like kinase domains. Three annotated genes had similarity to 1, 3-β-glucanase domains. There were sequence similarities between some of the annotated genes found in the study and the genes associated with phosphoinositide-specific phosphilipases C associated with Co-x and the COK-4 loci found in previous studies. It is possible that the Co-4 locus is structured as a group of genes with functional domains dominated by protein tyrosine kinase along with leucine rich repeats/nucleotide binding site, phosphilipases C as well as β-glucanases.

  3. Candidate Gene Identification with SNP Marker-Based Fine Mapping of Anthracnose Resistance Gene Co-4 in Common Bean

    PubMed Central

    Burt, Andrew J.; William, H. Manilal; Perry, Gregory; Khanal, Raja; Pauls, K. Peter; Kelly, James D.; Navabi, Alireza

    2015-01-01

    Anthracnose, caused by Colletotrichum lindemuthianum, is an important fungal disease of common bean (Phaseolus vulgaris). Alleles at the Co–4 locus confer resistance to a number of races of C. lindemuthianum. A population of 94 F4:5 recombinant inbred lines of a cross between resistant black bean genotype B09197 and susceptible navy bean cultivar Nautica was used to identify markers associated with resistance in bean chromosome 8 (Pv08) where Co–4 is localized. Three SCAR markers with known linkage to Co–4 and a panel of single nucleotide markers were used for genotyping. A refined physical region on Pv08 with significant association with anthracnose resistance identified by markers was used in BLAST searches with the genomic sequence of common bean accession G19833. Thirty two unique annotated candidate genes were identified that spanned a physical region of 936.46 kb. A majority of the annotated genes identified had functional similarity to leucine rich repeats/receptor like kinase domains. Three annotated genes had similarity to 1, 3-β-glucanase domains. There were sequence similarities between some of the annotated genes found in the study and the genes associated with phosphoinositide-specific phosphilipases C associated with Co-x and the COK–4 loci found in previous studies. It is possible that the Co–4 locus is structured as a group of genes with functional domains dominated by protein tyrosine kinase along with leucine rich repeats/nucleotide binding site, phosphilipases C as well as β-glucanases. PMID:26431031

  4. Variation analysis and gene annotation of eight MHC haplotypes: The MHC Haplotype Project

    PubMed Central

    Horton, Roger; Gibson, Richard; Coggill, Penny; Miretti, Marcos; Allcock, Richard J.; Almeida, Jeff; Forbes, Simon; Gilbert, James G. R.; Halls, Karen; Harrow, Jennifer L.; Hart, Elizabeth; Howe, Kevin; Jackson, David K.; Palmer, Sophie; Roberts, Anne N.; Sims, Sarah; Stewart, C. Andrew; Traherne, James A.; Trevanion, Steve; Wilming, Laurens; Rogers, Jane; de Jong, Pieter J.; Elliott, John F.; Sawcer, Stephen; Todd, John A.; Trowsdale, John

    2008-01-01

    The human major histocompatibility complex (MHC) is contained within about 4 Mb on the short arm of chromosome 6 and is recognised as the most variable region in the human genome. The primary aim of the MHC Haplotype Project was to provide a comprehensively annotated reference sequence of a single, human leukocyte antigen-homozygous MHC haplotype and to use it as a basis against which variations could be assessed from seven other similarly homozygous cell lines, representative of the most common MHC haplotypes in the European population. Comparison of the haplotype sequences, including four haplotypes not previously analysed, resulted in the identification of >44,000 variations, both substitutions and indels (insertions and deletions), which have been submitted to the dbSNP database. The gene annotation uncovered haplotype-specific differences and confirmed the presence of more than 300 loci, including over 160 protein-coding genes. Combined analysis of the variation and annotation datasets revealed 122 gene loci with coding substitutions of which 97 were non-synonymous. The haplotype (A3-B7-DR15; PGF cell line) designated as the new MHC reference sequence, has been incorporated into the human genome assembly (NCBI35 and subsequent builds), and constitutes the largest single-haplotype sequence of the human genome to date. The extensive variation and annotation data derived from the analysis of seven further haplotypes have been made publicly available and provide a framework and resource for future association studies of all MHC-associated diseases and transplant medicine. PMID:18193213

  5. Proteomics and transcriptomics of the BABA-induced resistance response in potato using a novel functional annotation approach

    PubMed Central

    2014-01-01

    Background Induced resistance (IR) can be part of a sustainable plant protection strategy against important plant diseases. β-aminobutyric acid (BABA) can induce resistance in a wide range of plants against several types of pathogens, including potato infected with Phytophthora infestans. However, the molecular mechanisms behind this are unclear and seem to be dependent on the system studied. To elucidate the defence responses activated by BABA in potato, a genome-wide transcript microarray analysis in combination with label-free quantitative proteomics analysis of the apoplast secretome were performed two days after treatment of the leaf canopy with BABA at two concentrations, 1 and 10 mM. Results Over 5000 transcripts were differentially expressed and over 90 secretome proteins changed in abundance indicating a massive activation of defence mechanisms with 10 mM BABA, the concentration effective against late blight disease. To aid analysis, we present a more comprehensive functional annotation of the microarray probes and gene models by retrieving information from orthologous gene families across 26 sequenced plant genomes. The new annotation provided GO terms to 8616 previously un-annotated probes. Conclusions BABA at 10 mM affected several processes related to plant hormones and amino acid metabolism. A major accumulation of PR proteins was also evident, and in the mevalonate pathway, genes involved in sterol biosynthesis were down-regulated, whereas several enzymes involved in the sesquiterpene phytoalexin biosynthesis were up-regulated. Interestingly, abscisic acid (ABA) responsive genes were not as clearly regulated by BABA in potato as previously reported in Arabidopsis. Together these findings provide candidates and markers for improved resistance in potato, one of the most important crops in the world. PMID:24773703

  6. Cross-organism learning method to discover new gene functionalities.

    PubMed

    Domeniconi, Giacomo; Masseroli, Marco; Moro, Gianluca; Pinoli, Pietro

    2016-04-01

    Knowledge of gene and protein functions is paramount for the understanding of physiological and pathological biological processes, as well as in the development of new drugs and therapies. Analyses for biomedical knowledge discovery greatly benefit from the availability of gene and protein functional feature descriptions expressed through controlled terminologies and ontologies, i.e., of gene and protein biomedical controlled annotations. In the last years, several databases of such annotations have become available; yet, these valuable annotations are incomplete, include errors and only some of them represent highly reliable human curated information. Computational techniques able to reliably predict new gene or protein annotations with an associated likelihood value are thus paramount. Here, we propose a novel cross-organisms learning approach to reliably predict new functionalities for the genes of an organism based on the known controlled annotations of the genes of another, evolutionarily related and better studied, organism. We leverage a new representation of the annotation discovery problem and a random perturbation of the available controlled annotations to allow the application of supervised algorithms to predict with good accuracy unknown gene annotations. Taking advantage of the numerous gene annotations available for a well-studied organism, our cross-organisms learning method creates and trains better prediction models, which can then be applied to predict new gene annotations of a target organism. We tested and compared our method with the equivalent single organism approach on different gene annotation datasets of five evolutionarily related organisms (Homo sapiens, Mus musculus, Bos taurus, Gallus gallus and Dictyostelium discoideum). Results show both the usefulness of the perturbation method of available annotations for better prediction model training and a great improvement of the cross-organism models with respect to the single-organism ones, without influence of the evolutionary distance between the considered organisms. The generated ranked lists of reliably predicted annotations, which describe novel gene functionalities and have an associated likelihood value, are very valuable both to complement available annotations, for better coverage in biomedical knowledge discovery analyses, and to quicken the annotation curation process, by focusing it on the prioritized novel annotations predicted. Copyright © 2015 Elsevier Ireland Ltd. All rights reserved.

  7. Similar Ratios of Introns to Intergenic Sequence across Animal Genomes

    PubMed Central

    Wörheide, Gert

    2017-01-01

    Abstract One central goal of genome biology is to understand how the usage of the genome differs between organisms. Our knowledge of genome composition, needed for downstream inferences, is critically dependent on gene annotations, yet problems associated with gene annotation and assembly errors are usually ignored in comparative genomics. Here, we analyze the genomes of 68 species across 12 animal phyla and some single-cell eukaryotes for general trends in genome composition and transcription, taking into account problems of gene annotation. We show that, regardless of genome size, the ratio of introns to intergenic sequence is comparable across essentially all animals, with nearly all deviations dominated by increased intergenic sequence. Genomes of model organisms have ratios much closer to 1:1, suggesting that the majority of published genomes of nonmodel organisms are underannotated and consequently omit substantial numbers of genes, with likely negative impact on evolutionary interpretations. Finally, our results also indicate that most animals transcribe half or more of their genomes arguing against differences in genome usage between animal groups, and also suggesting that the transcribed portion is more dependent on genome size than previously thought. PMID:28633296

  8. Fractal Clustering and Knowledge-driven Validation Assessment for Gene Expression Profiling.

    PubMed

    Wang, Lu-Yong; Balasubramanian, Ammaiappan; Chakraborty, Amit; Comaniciu, Dorin

    2005-01-01

    DNA microarray experiments generate a substantial amount of information about the global gene expression. Gene expression profiles can be represented as points in multi-dimensional space. It is essential to identify relevant groups of genes in biomedical research. Clustering is helpful in pattern recognition in gene expression profiles. A number of clustering techniques have been introduced. However, these traditional methods mainly utilize shape-based assumption or some distance metric to cluster the points in multi-dimension linear Euclidean space. Their results shows poor consistence with the functional annotation of genes in previous validation study. From a novel different perspective, we propose fractal clustering method to cluster genes using intrinsic (fractal) dimension from modern geometry. This method clusters points in such a way that points in the same clusters are more self-affine among themselves than to the points in other clusters. We assess this method using annotation-based validation assessment for gene clusters. It shows that this method is superior in identifying functional related gene groups than other traditional methods.

  9. Functional Annotation of the Arabidopsis Genome Using Controlled Vocabularies1

    PubMed Central

    Berardini, Tanya Z.; Mundodi, Suparna; Reiser, Leonore; Huala, Eva; Garcia-Hernandez, Margarita; Zhang, Peifen; Mueller, Lukas A.; Yoon, Jungwoon; Doyle, Aisling; Lander, Gabriel; Moseyko, Nick; Yoo, Danny; Xu, Iris; Zoeckler, Brandon; Montoya, Mary; Miller, Neil; Weems, Dan; Rhee, Seung Y.

    2004-01-01

    Controlled vocabularies are increasingly used by databases to describe genes and gene products because they facilitate identification of similar genes within an organism or among different organisms. One of The Arabidopsis Information Resource's goals is to associate all Arabidopsis genes with terms developed by the Gene Ontology Consortium that describe the molecular function, biological process, and subcellular location of a gene product. We have also developed terms describing Arabidopsis anatomy and developmental stages and use these to annotate published gene expression data. As of March 2004, we used computational and manual annotation methods to make 85,666 annotations representing 26,624 unique loci. We focus on associating genes to controlled vocabulary terms based on experimental data from the literature and use The Arabidopsis Information Resource-developed PubSearch software to facilitate this process. Each annotation is tagged with a combination of evidence codes, evidence descriptions, and references that provide a robust means to assess data quality. Annotation of all Arabidopsis genes will allow quantitative comparisons between sets of genes derived from sources such as microarray experiments. The Arabidopsis annotation data will also facilitate annotation of newly sequenced plant genomes by using sequence similarity to transfer annotations to homologous genes. In addition, complete and up-to-date annotations will make unknown genes easy to identify and target for experimentation. Here, we describe the process of Arabidopsis functional annotation using a variety of data sources and illustrate several ways in which this information can be accessed and used to infer knowledge about Arabidopsis and other plant species. PMID:15173566

  10. Annotation of the Asian Citrus Psyllid Genome Reveals a Reduced Innate Immune System

    PubMed Central

    Arp, Alex P.; Hunter, Wayne B.; Pelz-Stelinski, Kirsten S.

    2016-01-01

    Citrus production worldwide is currently facing significant losses due to citrus greening disease, also known as Huanglongbing. The citrus greening bacteria, Candidatus Liberibacter asiaticus (CLas), is a persistent propagative pathogen transmitted by the Asian citrus psyllid, Diaphorina citri Kuwayama (Hemiptera: Liviidae). Hemipterans characterized to date lack a number of insect immune genes, including those associated with the Imd pathway targeting Gram-negative bacteria. The D. citri draft genome was used to characterize the immune defense genes present in D. citri. Predicted mRNAs identified by screening the published D. citri annotated draft genome were manually searched using a custom database of immune genes from previously annotated insect genomes. Toll and JAK/STAT pathways, general defense genes Dual oxidase, Nitric oxide synthase, prophenoloxidase, and cellular immune defense genes were present in D. citri. In contrast, D. citri lacked genes for the Imd pathway, most antimicrobial peptides, 1,3-β-glucan recognition proteins (GNBPs), and complete peptidoglycan recognition proteins. These data suggest that D. citri has a reduced immune capability similar to that observed in A. pisum, P. humanus, and R. prolixus. The absence of immune system genes from the D. citri genome may facilitate CLas infections, and is possibly compensated for by their relationship with their microbial endosymbionts. PMID:27965582

  11. Improving the genome annotation of the acarbose producer Actinoplanes sp. SE50/110 by sequencing enriched 5'-ends of primary transcripts.

    PubMed

    Schwientek, Patrick; Neshat, Armin; Kalinowski, Jörn; Klein, Andreas; Rückert, Christian; Schneiker-Bekel, Susanne; Wendler, Sergej; Stoye, Jens; Pühler, Alfred

    2014-11-20

    Actinoplanes sp. SE50/110 is the producer of the alpha-glucosidase inhibitor acarbose, which is an economically relevant and potent drug in the treatment of type-2 diabetes mellitus. In this study, we present the detection of transcription start sites on this genome by sequencing enriched 5'-ends of primary transcripts. Altogether, 1427 putative transcription start sites were initially identified. With help of the annotated genome sequence, 661 transcription start sites were found to belong to the leader region of protein-coding genes with the surprising result that roughly 20% of these genes rank among the class of leaderless transcripts. Next, conserved promoter motifs were identified for protein-coding genes with and without leader sequences. The mapped transcription start sites were finally used to improve the annotation of the Actinoplanes sp. SE50/110 genome sequence. Concerning protein-coding genes, 41 translation start sites were corrected and 9 novel protein-coding genes could be identified. In addition to this, 122 previously undetermined non-coding RNA (ncRNA) genes of Actinoplanes sp. SE50/110 were defined. Focusing on antisense transcription start sites located within coding genes or their leader sequences, it was discovered that 96 of those ncRNA genes belong to the class of antisense RNA (asRNA) genes. The remaining 26 ncRNA genes were found outside of known protein-coding genes. Four chosen examples of prominent ncRNA genes, namely the transfer messenger RNA gene ssrA, the ribonuclease P class A RNA gene rnpB, the cobalamin riboswitch RNA gene cobRS, and the selenocysteine-specific tRNA gene selC, are presented in more detail. This study demonstrates that sequencing of enriched 5'-ends of primary transcripts and the identification of transcription start sites are valuable tools for advanced genome annotation of Actinoplanes sp. SE50/110 and most probably also for other bacteria. Copyright © 2014 Elsevier B.V. All rights reserved.

  12. proGenomes: a resource for consistent functional and taxonomic annotations of prokaryotic genomes.

    PubMed

    Mende, Daniel R; Letunic, Ivica; Huerta-Cepas, Jaime; Li, Simone S; Forslund, Kristoffer; Sunagawa, Shinichi; Bork, Peer

    2017-01-04

    The availability of microbial genomes has opened many new avenues of research within microbiology. This has been driven primarily by comparative genomics approaches, which rely on accurate and consistent characterization of genomic sequences. It is nevertheless difficult to obtain consistent taxonomic and integrated functional annotations for defined prokaryotic clades. Thus, we developed proGenomes, a resource that provides user-friendly access to currently 25 038 high-quality genomes whose sequences and consistent annotations can be retrieved individually or by taxonomic clade. These genomes are assigned to 5306 consistent and accurate taxonomic species clusters based on previously established methodology. proGenomes also contains functional information for almost 80 million protein-coding genes, including a comprehensive set of general annotations and more focused annotations for carbohydrate-active enzymes and antibiotic resistance genes. Additionally, broad habitat information is provided for many genomes. All genomes and associated information can be downloaded by user-selected clade or multiple habitat-specific sets of representative genomes. We expect that the availability of high-quality genomes with comprehensive functional annotations will promote advances in clinical microbial genomics, functional evolution and other subfields of microbiology. proGenomes is available at http://progenomes.embl.de. © The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research.

  13. BC4GO: a full-text corpus for the BioCreative IV GO Task

    USDA-ARS?s Scientific Manuscript database

    Gene function curation via Gene Ontology (GO) annotation is a common task among Model Organism Database (MOD) groups. Due to its manual nature, this task is time-consuming and labor-intensive, and thus considered one of the bottlenecks in literature curation. There have been many previous attempts a...

  14. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Kolker, Eugene

    Our project focused primarily on analysis of different types of data produced by global high-throughput technologies, data integration of gene annotation, and gene and protein expression information, as well as on getting a better functional annotation of Shewanella genes. Specifically, four of our numerous major activities and achievements include the development of: statistical models for identification and expression proteomics, superior to currently available approaches (including our own earlier ones); approaches to improve gene annotations on the whole-organism scale; standards for annotation, transcriptomics and proteomics approaches; and generalized approaches for data integration of gene annotation, gene and protein expression information.

  15. Automated Update, Revision, and Quality Control of the Maize Genome Annotations Using MAKER-P Improves the B73 RefGen_v3 Gene Models and Identifies New Genes1[OPEN

    PubMed Central

    Law, MeiYee; Childs, Kevin L.; Campbell, Michael S.; Stein, Joshua C.; Olson, Andrew J.; Holt, Carson; Panchy, Nicholas; Lei, Jikai; Jiao, Dian; Andorf, Carson M.; Lawrence, Carolyn J.; Ware, Doreen; Shiu, Shin-Han; Sun, Yanni; Jiang, Ning; Yandell, Mark

    2015-01-01

    The large size and relative complexity of many plant genomes make creation, quality control, and dissemination of high-quality gene structure annotations challenging. In response, we have developed MAKER-P, a fast and easy-to-use genome annotation engine for plants. Here, we report the use of MAKER-P to update and revise the maize (Zea mays) B73 RefGen_v3 annotation build (5b+) in less than 3 h using the iPlant Cyberinfrastructure. MAKER-P identified and annotated 4,466 additional, well-supported protein-coding genes not present in the 5b+ annotation build, added additional untranslated regions to 1,393 5b+ gene models, identified 2,647 5b+ gene models that lack any supporting evidence (despite the use of large and diverse evidence data sets), identified 104,215 pseudogene fragments, and created an additional 2,522 noncoding gene annotations. We also describe a method for de novo training of MAKER-P for the annotation of newly sequenced grass genomes. Collectively, these results lead to the 6a maize genome annotation and demonstrate the utility of MAKER-P for rapid annotation, management, and quality control of grasses and other difficult-to-annotate plant genomes. PMID:25384563

  16. Structural and functional annotation of the porcine immunome

    PubMed Central

    2013-01-01

    Background The domestic pig is known as an excellent model for human immunology and the two species share many pathogens. Susceptibility to infectious disease is one of the major constraints on swine performance, yet the structure and function of genes comprising the pig immunome are not well-characterized. The completion of the pig genome provides the opportunity to annotate the pig immunome, and compare and contrast pig and human immune systems. Results The Immune Response Annotation Group (IRAG) used computational curation and manual annotation of the swine genome assembly 10.2 (Sscrofa10.2) to refine the currently available automated annotation of 1,369 immunity-related genes through sequence-based comparison to genes in other species. Within these genes, we annotated 3,472 transcripts. Annotation provided evidence for gene expansions in several immune response families, and identified artiodactyl-specific expansions in the cathelicidin and type 1 Interferon families. We found gene duplications for 18 genes, including 13 immune response genes and five non-immune response genes discovered in the annotation process. Manual annotation provided evidence for many new alternative splice variants and 8 gene duplications. Over 1,100 transcripts without porcine sequence evidence were detected using cross-species annotation. We used a functional approach to discover and accurately annotate porcine immune response genes. A co-expression clustering analysis of transcriptomic data from selected experimental infections or immune stimulations of blood, macrophages or lymph nodes identified a large cluster of genes that exhibited a correlated positive response upon infection across multiple pathogens or immune stimuli. Interestingly, this gene cluster (cluster 4) is enriched for known general human immune response genes, yet contains many un-annotated porcine genes. A phylogenetic analysis of the encoded proteins of cluster 4 genes showed that 15% exhibited an accelerated evolution as compared to 4.1% across the entire genome. Conclusions This extensive annotation dramatically extends the genome-based knowledge of the molecular genetics and structure of a major portion of the porcine immunome. Our complementary functional approach using co-expression during immune response has provided new putative immune response annotation for over 500 porcine genes. Our phylogenetic analysis of this core immunome cluster confirms rapid evolutionary change in this set of genes, and that, as in other species, such genes are important components of the pig’s adaptation to pathogen challenge over evolutionary time. These comprehensive and integrated analyses increase the value of the porcine genome sequence and provide important tools for global analyses and data-mining of the porcine immune response. PMID:23676093

  17. snpGeneSets: An R Package for Genome-Wide Study Annotation

    PubMed Central

    Mei, Hao; Li, Lianna; Jiang, Fan; Simino, Jeannette; Griswold, Michael; Mosley, Thomas; Liu, Shijian

    2016-01-01

    Genome-wide studies (GWS) of SNP associations and differential gene expressions have generated abundant results; next-generation sequencing technology has further boosted the number of variants and genes identified. Effective interpretation requires massive annotation and downstream analysis of these genome-wide results, a computationally challenging task. We developed the snpGeneSets package to simplify annotation and analysis of GWS results. Our package integrates local copies of knowledge bases for SNPs, genes, and gene sets, and implements wrapper functions in the R language to enable transparent access to low-level databases for efficient annotation of large genomic data. The package contains functions that execute three types of annotations: (1) genomic mapping annotation for SNPs and genes and functional annotation for gene sets; (2) bidirectional mapping between SNPs and genes, and genes and gene sets; and (3) calculation of gene effect measures from SNP associations and performance of gene set enrichment analyses to identify functional pathways. We applied snpGeneSets to type 2 diabetes (T2D) results from the NHGRI genome-wide association study (GWAS) catalog, a Finnish GWAS, and a genome-wide expression study (GWES). These studies demonstrate the usefulness of snpGeneSets for annotating and performing enrichment analysis of GWS results. The package is open-source, free, and can be downloaded at: https://www.umc.edu/biostats_software/. PMID:27807048

  18. A guide to best practices for Gene Ontology (GO) manual annotation

    PubMed Central

    Balakrishnan, Rama; Harris, Midori A.; Huntley, Rachael; Van Auken, Kimberly; Cherry, J. Michael

    2013-01-01

    The Gene Ontology Consortium (GOC) is a community-based bioinformatics project that classifies gene product function through the use of structured controlled vocabularies. A fundamental application of the Gene Ontology (GO) is in the creation of gene product annotations, evidence-based associations between GO definitions and experimental or sequence-based analysis. Currently, the GOC disseminates 126 million annotations covering >374 000 species including all the kingdoms of life. This number includes two classes of GO annotations: those created manually by experienced biocurators reviewing the literature or by examination of biological data (1.1 million annotations covering 2226 species) and those generated computationally via automated methods. As manual annotations are often used to propagate functional predictions between related proteins within and between genomes, it is critical to provide accurate consistent manual annotations. Toward this goal, we present here the conventions defined by the GOC for the creation of manual annotation. This guide represents the best practices for manual annotation as established by the GOC project over the past 12 years. We hope this guide will encourage research communities to annotate gene products of their interest to enhance the corpus of GO annotations available to all. Database URL: http://www.geneontology.org PMID:23842463

  19. An integrative approach to inferring biologically meaningful gene modules

    PubMed Central

    2011-01-01

    Background The ability to construct biologically meaningful gene networks and modules is critical for contemporary systems biology. Though recent studies have demonstrated the power of using gene modules to shed light on the functioning of complex biological systems, most modules in these networks have shown little association with meaningful biological function. We have devised a method which directly incorporates gene ontology (GO) annotation in construction of gene modules in order to gain better functional association. Results We have devised a method, Semantic Similarity-Integrated approach for Modularization (SSIM) that integrates various gene-gene pairwise similarity values, including information obtained from gene expression, protein-protein interactions and GO annotations, in the construction of modules using affinity propagation clustering. We demonstrated the performance of the proposed method using data from two complex biological responses: 1. the osmotic shock response in Saccharomyces cerevisiae, and 2. the prion-induced pathogenic mouse model. In comparison with two previously reported algorithms, modules identified by SSIM showed significantly stronger association with biological functions. Conclusions The incorporation of semantic similarity based on GO annotation with gene expression and protein-protein interaction data can greatly enhance the functional relevance of inferred gene modules. In addition, the SSIM approach can also reveal the hierarchical structure of gene modules to gain a broader functional view of the biological system. Hence, the proposed method can facilitate comprehensive and in-depth analysis of high throughput experimental data at the gene network level. PMID:21791051

  20. Gene Ontology annotations at SGD: new data sources and annotation methods

    PubMed Central

    Hong, Eurie L.; Balakrishnan, Rama; Dong, Qing; Christie, Karen R.; Park, Julie; Binkley, Gail; Costanzo, Maria C.; Dwight, Selina S.; Engel, Stacia R.; Fisk, Dianna G.; Hirschman, Jodi E.; Hitz, Benjamin C.; Krieger, Cynthia J.; Livstone, Michael S.; Miyasato, Stuart R.; Nash, Robert S.; Oughtred, Rose; Skrzypek, Marek S.; Weng, Shuai; Wong, Edith D.; Zhu, Kathy K.; Dolinski, Kara; Botstein, David; Cherry, J. Michael

    2008-01-01

    The Saccharomyces Genome Database (SGD; http://www.yeastgenome.org/) collects and organizes biological information about the chromosomal features and gene products of the budding yeast Saccharomyces cerevisiae. Although published data from traditional experimental methods are the primary sources of evidence supporting Gene Ontology (GO) annotations for a gene product, high-throughput experiments and computational predictions can also provide valuable insights in the absence of an extensive body of literature. Therefore, GO annotations available at SGD now include high-throughput data as well as computational predictions provided by the GO Annotation Project (GOA UniProt; http://www.ebi.ac.uk/GOA/). Because the annotation method used to assign GO annotations varies by data source, GO resources at SGD have been modified to distinguish data sources and annotation methods. In addition to providing information for genes that have not been experimentally characterized, GO annotations from independent sources can be compared to those made by SGD to help keep the literature-based GO annotations current. PMID:17982175

  1. A curated catalog of canine and equine keratin genes

    PubMed Central

    Pujar, Shashikant; McGarvey, Kelly M.; Welle, Monika; Galichet, Arnaud; Müller, Eliane J.; Pruitt, Kim D.; Leeb, Tosso

    2017-01-01

    Keratins represent a large protein family with essential structural and functional roles in epithelial cells of skin, hair follicles, and other organs. During evolution the genes encoding keratins have undergone multiple rounds of duplication and humans have two clusters with a total of 55 functional keratin genes in their genomes. Due to the high similarity between different keratin paralogs and species-specific differences in gene content, the currently available keratin gene annotation in species with draft genome assemblies such as dog and horse is still imperfect. We compared the National Center for Biotechnology Information (NCBI) (dog annotation release 103, horse annotation release 101) and Ensembl (release 87) gene predictions for the canine and equine keratin gene clusters to RNA-seq data that were generated from adult skin of five dogs and two horses and from adult hair follicle tissue of one dog. Taking into consideration the knowledge on the conserved exon/intron structure of keratin genes, we annotated 61 putatively functional keratin genes in both the dog and horse, respectively. Subsequently, curators in the RefSeq group at NCBI reviewed their annotation of keratin genes in the dog and horse genomes (Annotation Release 104 and Annotation Release 102, respectively) and updated annotation and gene nomenclature of several keratin genes. The updates are now available in the NCBI Gene database (https://www.ncbi.nlm.nih.gov/gene). PMID:28846680

  2. Combining evidence, biomedical literature and statistical dependence: new insights for functional annotation of gene sets

    PubMed Central

    Aubry, Marc; Monnier, Annabelle; Chicault, Celine; de Tayrac, Marie; Galibert, Marie-Dominique; Burgun, Anita; Mosser, Jean

    2006-01-01

    Background Large-scale genomic studies based on transcriptome technologies provide clusters of genes that need to be functionally annotated. The Gene Ontology (GO) implements a controlled vocabulary organised into three hierarchies: cellular components, molecular functions and biological processes. This terminology allows a coherent and consistent description of the knowledge about gene functions. The GO terms related to genes come primarily from semi-automatic annotations made by trained biologists (annotation based on evidence) or text-mining of the published scientific literature (literature profiling). Results We report an original functional annotation method based on a combination of evidence and literature that overcomes the weaknesses and the limitations of each approach. It relies on the Gene Ontology Annotation database (GOA Human) and the PubGene biomedical literature index. We support these annotations with statistically associated GO terms and retrieve associative relations across the three GO hierarchies to emphasise the major pathways involved by a gene cluster. Both annotation methods and associative relations were quantitatively evaluated with a reference set of 7397 genes and a multi-cluster study of 14 clusters. We also validated the biological appropriateness of our hybrid method with the annotation of a single gene (cdc2) and that of a down-regulated cluster of 37 genes identified by a transcriptome study of an in vitro enterocyte differentiation model (CaCo-2 cells). Conclusion The combination of both approaches is more informative than either separate approach: literature mining can enrich an annotation based only on evidence. Text-mining of the literature can also find valuable associated MEDLINE references that confirm the relevance of the annotation. Eventually, GO terms networks can be built with associative relations in order to highlight cooperative and competitive pathways and their connected molecular functions. PMID:16674810

  3. Improved annotation of the insect vector of citrus greening disease: biocuration by a diverse genomics community

    PubMed Central

    Hosmani, Prashant S.; Villalobos-Ayala, Krystal; Miller, Sherry; Shippy, Teresa; Flores, Mirella; Rosendale, Andrew; Cordola, Chris; Bell, Tracey; Mann, Hannah; DeAvila, Gabe; DeAvila, Daniel; Moore, Zachary; Buller, Kyle; Ciolkevich, Kathryn; Nandyal, Samantha; Mahoney, Robert; Van Voorhis, Joshua; Dunlevy, Megan; Farrow, David; Hunter, David; Morgan, Taylar; Shore, Kayla; Guzman, Victoria; Izsak, Allison; Dixon, Danielle E.; Cridge, Andrew; Cano, Liliana; Cao, Xiaolong; Jiang, Haobo; Leng, Nan; Johnson, Shannon; Cantarel, Brandi L.; Richards, Stephen; English, Adam; Shatters, Robert G.; Childers, Chris; Chen, Mei-Ju; Hunter, Wayne; Cilia, Michelle; Mueller, Lukas A.; Munoz-Torres, Monica; Nelson, David; Poelchau, Monica F.; Benoit, Joshua B.; Wiersma-Koch, Helen; D’Elia, Tom; Brown, Susan J.

    2017-01-01

    Abstract The Asian citrus psyllid (Diaphorina citri Kuwayama) is the insect vector of the bacterium Candidatus Liberibacter asiaticus (CLas), the pathogen associated with citrus Huanglongbing (HLB, citrus greening). HLB threatens citrus production worldwide. Suppression or reduction of the insect vector using chemical insecticides has been the primary method to inhibit the spread of citrus greening disease. Accurate structural and functional annotation of the Asian citrus psyllid genome, as well as a clear understanding of the interactions between the insect and CLas, are required for development of new molecular-based HLB control methods. A draft assembly of the D. citri genome has been generated and annotated with automated pipelines. However, knowledge transfer from well-curated reference genomes such as that of Drosophila melanogaster to newly sequenced ones is challenging due to the complexity and diversity of insect genomes. To identify and improve gene models as potential targets for pest control, we manually curated several gene families with a focus on genes that have key functional roles in D. citri biology and CLas interactions. This community effort produced 530 manually curated gene models across developmental, physiological, RNAi regulatory and immunity-related pathways. As previously shown in the pea aphid, RNAi machinery genes putatively involved in the microRNA pathway have been specifically duplicated. A comprehensive transcriptome enabled us to identify a number of gene families that are either missing or misassembled in the draft genome. In order to develop biocuration as a training experience, we included undergraduate and graduate students from multiple institutions, as well as experienced annotators from the insect genomics research community. The resulting gene set (OGS v1.0) combines both automatically predicted and manually curated gene models. Database URL: https://citrusgreening.org/ PMID:29220441

  4. Complete genome sequence of Metallosphaera cuprina, a metal sulfide-oxidizing archaeon from a hot spring.

    PubMed

    Liu, Li-Jun; You, Xiao-Yan; Zheng, Huajun; Wang, Shengyue; Jiang, Cheng-Ying; Liu, Shuang-Jiang

    2011-07-01

    The genome of the metal sulfide-oxidizing, thermoacidophilic strain Metallosphaera cuprina Ar-4 has been completely sequenced and annotated. Originally isolated from a sulfuric hot spring, strain Ar-4 grows optimally at 65°C and a pH of 3.5. The M. cuprina genome has a 1,840,348-bp circular chromosome (2,029 open reading frames [ORFs]) and is 16% smaller than the previously sequenced Metallosphaera sedula genome. Compared to the M. sedula genome, there are no counterpart genes in the M. cuprina genome for about 480 ORFs in the M. sedula genome, of which 243 ORFs are annotated as hypothetical protein genes. Still, there are 233 ORFs uniquely occurring in M. cuprina. Genome annotation supports that M. cuprina lives a facultative life on CO(2) and organics and obtains energy from oxidation of sulfidic ores and reduced inorganic sulfuric compounds.

  5. Annotated genetic linkage maps of Pinus pinaster Ait. from a Central Spain population using microsatellite and gene based markers.

    PubMed

    de Miguel, Marina; de Maria, Nuria; Guevara, M Angeles; Diaz, Luis; Sáez-Laguna, Enrique; Sánchez-Gómez, David; Chancerel, Emilie; Aranda, Ismael; Collada, Carmen; Plomion, Christophe; Cabezas, José-Antonio; Cervera, María-Teresa

    2012-10-04

    Pinus pinaster Ait. is a major resin producing species in Spain. Genetic linkage mapping can facilitate marker-assisted selection (MAS) through the identification of Quantitative Trait Loci and selection of allelic variants of interest in breeding populations. In this study, we report annotated genetic linkage maps for two individuals (C14 and C15) belonging to a breeding program aiming to increase resin production. We use different types of DNA markers, including last-generation molecular markers. We obtained 13 and 14 linkage groups for C14 and C15 maps, respectively. A total of 211 and 215 markers were positioned on each map and estimated genome length was between 1,870 and 2,166 cM respectively, which represents near 65% of genome coverage. Comparative mapping with previously developed genetic linkage maps for P. pinaster based on about 60 common markers enabled aligning linkage groups to this reference map. The comparison of our annotated linkage maps and linkage maps reporting QTL information revealed 11 annotated SNPs in candidate genes that co-localized with previously reported QTLs for wood properties and water use efficiency. This study provides genetic linkage maps from a Spanish population that shows high levels of genetic divergence with French populations from which segregating progenies have been previously mapped. These genetic maps will be of interest to construct a reliable consensus linkage map for the species. The importance of developing functional genetic linkage maps is highlighted, especially when working with breeding populations for its future application in MAS for traits of interest.

  6. Annotated genetic linkage maps of Pinus pinaster Ait. from a Central Spain population using microsatellite and gene based markers

    PubMed Central

    2012-01-01

    Background Pinus pinaster Ait. is a major resin producing species in Spain. Genetic linkage mapping can facilitate marker-assisted selection (MAS) through the identification of Quantitative Trait Loci and selection of allelic variants of interest in breeding populations. In this study, we report annotated genetic linkage maps for two individuals (C14 and C15) belonging to a breeding program aiming to increase resin production. We use different types of DNA markers, including last-generation molecular markers. Results We obtained 13 and 14 linkage groups for C14 and C15 maps, respectively. A total of 211 and 215 markers were positioned on each map and estimated genome length was between 1,870 and 2,166 cM respectively, which represents near 65% of genome coverage. Comparative mapping with previously developed genetic linkage maps for P. pinaster based on about 60 common markers enabled aligning linkage groups to this reference map. The comparison of our annotated linkage maps and linkage maps reporting QTL information revealed 11 annotated SNPs in candidate genes that co-localized with previously reported QTLs for wood properties and water use efficiency. Conclusions This study provides genetic linkage maps from a Spanish population that shows high levels of genetic divergence with French populations from which segregating progenies have been previously mapped. These genetic maps will be of interest to construct a reliable consensus linkage map for the species. The importance of developing functional genetic linkage maps is highlighted, especially when working with breeding populations for its future application in MAS for traits of interest. PMID:23036012

  7. Automated update, revision, and quality control of the maize genome annotations using MAKER-P improves the B73 RefGen_v3 gene models and identifies new genes.

    PubMed

    Law, MeiYee; Childs, Kevin L; Campbell, Michael S; Stein, Joshua C; Olson, Andrew J; Holt, Carson; Panchy, Nicholas; Lei, Jikai; Jiao, Dian; Andorf, Carson M; Lawrence, Carolyn J; Ware, Doreen; Shiu, Shin-Han; Sun, Yanni; Jiang, Ning; Yandell, Mark

    2015-01-01

    The large size and relative complexity of many plant genomes make creation, quality control, and dissemination of high-quality gene structure annotations challenging. In response, we have developed MAKER-P, a fast and easy-to-use genome annotation engine for plants. Here, we report the use of MAKER-P to update and revise the maize (Zea mays) B73 RefGen_v3 annotation build (5b+) in less than 3 h using the iPlant Cyberinfrastructure. MAKER-P identified and annotated 4,466 additional, well-supported protein-coding genes not present in the 5b+ annotation build, added additional untranslated regions to 1,393 5b+ gene models, identified 2,647 5b+ gene models that lack any supporting evidence (despite the use of large and diverse evidence data sets), identified 104,215 pseudogene fragments, and created an additional 2,522 noncoding gene annotations. We also describe a method for de novo training of MAKER-P for the annotation of newly sequenced grass genomes. Collectively, these results lead to the 6a maize genome annotation and demonstrate the utility of MAKER-P for rapid annotation, management, and quality control of grasses and other difficult-to-annotate plant genomes. © 2015 American Society of Plant Biologists. All Rights Reserved.

  8. EuCAP, a Eukaryotic Community Annotation Package, and its application to the rice genome

    PubMed Central

    Thibaud-Nissen, Françoise; Campbell, Matthew; Hamilton, John P; Zhu, Wei; Buell, C Robin

    2007-01-01

    Background Despite the improvements of tools for automated annotation of genome sequences, manual curation at the structural and functional level can provide an increased level of refinement to genome annotation. The Institute for Genomic Research Rice Genome Annotation (hereafter named the Osa1 Genome Annotation) is the product of an automated pipeline and, for this reason, will benefit from the input of biologists with expertise in rice and/or particular gene families. Leveraging knowledge from a dispersed community of scientists is a demonstrated way of improving a genome annotation. This requires tools that facilitate 1) the submission of gene annotation to an annotation project, 2) the review of the submitted models by project annotators, and 3) the incorporation of the submitted models in the ongoing annotation effort. Results We have developed the Eukaryotic Community Annotation Package (EuCAP), an annotation tool, and have applied it to the rice genome. The primary level of curation by community annotators (CA) has been the annotation of gene families. Annotation can be submitted by email or through the EuCAP Web Tool. The CA models are aligned to the rice pseudomolecules and the coordinates of these alignments, along with functional annotation, are stored in the MySQL EuCAP Gene Model database. Web pages displaying the alignments of the CA models to the Osa1 Genome models are automatically generated from the EuCAP Gene Model database. The alignments are reviewed by the project annotators (PAs) in the context of experimental evidence. Upon approval by the PAs, the CA models, along with the corresponding functional annotations, are integrated into the Osa1 Genome Annotation. The CA annotations, grouped by family, are displayed on the Community Annotation pages of the project website , as well as in the Community Annotation track of the Genome Browser. Conclusion We have applied EuCAP to rice. As of July 2007, the structural and/or functional annotation of 1,094 genes representing 57 families have been deposited and integrated into the current gene set. All of the EuCAP components are open-source, thereby allowing the implementation of EuCAP for the annotation of other genomes. EuCAP is available at . PMID:17961238

  9. GeneTools--application for functional annotation and statistical hypothesis testing.

    PubMed

    Beisvag, Vidar; Jünge, Frode K R; Bergum, Hallgeir; Jølsum, Lars; Lydersen, Stian; Günther, Clara-Cecilie; Ramampiaro, Heri; Langaas, Mette; Sandvik, Arne K; Laegreid, Astrid

    2006-10-24

    Modern biology has shifted from "one gene" approaches to methods for genomic-scale analysis like microarray technology, which allow simultaneous measurement of thousands of genes. This has created a need for tools facilitating interpretation of biological data in "batch" mode. However, such tools often leave the investigator with large volumes of apparently unorganized information. To meet this interpretation challenge, gene-set, or cluster testing has become a popular analytical tool. Many gene-set testing methods and software packages are now available, most of which use a variety of statistical tests to assess the genes in a set for biological information. However, the field is still evolving, and there is a great need for "integrated" solutions. GeneTools is a web-service providing access to a database that brings together information from a broad range of resources. The annotation data are updated weekly, guaranteeing that users get data most recently available. Data submitted by the user are stored in the database, where it can easily be updated, shared between users and exported in various formats. GeneTools provides three different tools: i) NMC Annotation Tool, which offers annotations from several databases like UniGene, Entrez Gene, SwissProt and GeneOntology, in both single- and batch search mode. ii) GO Annotator Tool, where users can add new gene ontology (GO) annotations to genes of interest. These user defined GO annotations can be used in further analysis or exported for public distribution. iii) eGOn, a tool for visualization and statistical hypothesis testing of GO category representation. As the first GO tool, eGOn supports hypothesis testing for three different situations (master-target situation, mutually exclusive target-target situation and intersecting target-target situation). An important additional function is an evidence-code filter that allows users, to select the GO annotations for the analysis. GeneTools is the first "all in one" annotation tool, providing users with a rapid extraction of highly relevant gene annotation data for e.g. thousands of genes or clones at once. It allows a user to define and archive new GO annotations and it supports hypothesis testing related to GO category representations. GeneTools is freely available through www.genetools.no

  10. AGORA : Organellar genome annotation from the amino acid and nucleotide references.

    PubMed

    Jung, Jaehee; Kim, Jong Im; Jeong, Young-Sik; Yi, Gangman

    2018-03-29

    Next-generation sequencing (NGS) technologies have led to the accumulation of highthroughput sequence data from various organisms in biology. To apply gene annotation of organellar genomes for various organisms, more optimized tools for functional gene annotation are required. Almost all gene annotation tools are mainly focused on the chloroplast genome of land plants or the mitochondrial genome of animals.We have developed a web application AGORA for the fast, user-friendly, and improved annotations of organellar genomes. AGORA annotates genes based on a BLAST-based homology search and clustering with selected reference sequences from the NCBI database or user-defined uploaded data. AGORA can annotate the functional genes in almost all mitochondrion and plastid genomes of eukaryotes. The gene annotation of a genome with an exon-intron structure within a gene or inverted repeat region is also available. It provides information of start and end positions of each gene, BLAST results compared with the reference sequence, and visualization of gene map by OGDRAW. Users can freely use the software, and the accessible URL is https://bigdata.dongguk.edu/gene_project/AGORA/.The main module of the tool is implemented by the python and php, and the web page is built by the HTML and CSS to support all browsers. gangman@dongguk.edu.

  11. The Listeria monocytogenes strain 10403S BioCyc database

    PubMed Central

    Orsi, Renato H.; Bergholz, Teresa M.; Wiedmann, Martin; Boor, Kathryn J.

    2015-01-01

    Listeria monocytogenes is a food-borne pathogen of humans and other animals. The striking ability to survive several stresses usually used for food preservation makes L. monocytogenes one of the biggest concerns to the food industry, while the high mortality of listeriosis in specific groups of humans makes it a great concern for public health. Previous studies have shown that a regulatory network involving alternative sigma (σ) factors and transcription factors is pivotal to stress survival. However, few studies have evaluated at the metabolic networks controlled by these regulatory mechanisms. The L. monocytogenes BioCyc database uses the strain 10403S as a model. Computer-generated initial annotation for all genes also allowed for identification, annotation and display of predicted reactions and pathways carried out by a single cell. Further ongoing manual curation based on published data as well as database mining for selected genes allowed the more refined annotation of functions, which, in turn, allowed for annotation of new pathways and fine-tuning of previously defined pathways to more L. monocytogenes-specific pathways. Using RNA-Seq data, several transcription start sites and promoter regions were mapped to the 10403S genome and annotated within the database. Additionally, the identification of promoter regions and a comprehensive review of available literature allowed the annotation of several regulatory interactions involving σ factors and transcription factors. The L. monocytogenes 10403S BioCyc database is a new resource for researchers studying Listeria and related organisms. It allows users to (i) have a comprehensive view of all reactions and pathways predicted to take place within the cell in the cellular overview, as well as to (ii) upload their own data, such as differential expression data, to visualize the data in the scope of predicted pathways and regulatory networks and to carry on enrichment analyses using several different annotations available within the database. Database URL: http://biocyc.org/organism-summary?object=10403S_RAST PMID:25819074

  12. The Listeria monocytogenes strain 10403S BioCyc database.

    PubMed

    Orsi, Renato H; Bergholz, Teresa M; Wiedmann, Martin; Boor, Kathryn J

    2015-01-01

    Listeria monocytogenes is a food-borne pathogen of humans and other animals. The striking ability to survive several stresses usually used for food preservation makes L. monocytogenes one of the biggest concerns to the food industry, while the high mortality of listeriosis in specific groups of humans makes it a great concern for public health. Previous studies have shown that a regulatory network involving alternative sigma (σ) factors and transcription factors is pivotal to stress survival. However, few studies have evaluated at the metabolic networks controlled by these regulatory mechanisms. The L. monocytogenes BioCyc database uses the strain 10403S as a model. Computer-generated initial annotation for all genes also allowed for identification, annotation and display of predicted reactions and pathways carried out by a single cell. Further ongoing manual curation based on published data as well as database mining for selected genes allowed the more refined annotation of functions, which, in turn, allowed for annotation of new pathways and fine-tuning of previously defined pathways to more L. monocytogenes-specific pathways. Using RNA-Seq data, several transcription start sites and promoter regions were mapped to the 10403S genome and annotated within the database. Additionally, the identification of promoter regions and a comprehensive review of available literature allowed the annotation of several regulatory interactions involving σ factors and transcription factors. The L. monocytogenes 10403S BioCyc database is a new resource for researchers studying Listeria and related organisms. It allows users to (i) have a comprehensive view of all reactions and pathways predicted to take place within the cell in the cellular overview, as well as to (ii) upload their own data, such as differential expression data, to visualize the data in the scope of predicted pathways and regulatory networks and to carry on enrichment analyses using several different annotations available within the database. © The Author(s) 2015. Published by Oxford University Press.

  13. The pig X and Y Chromosomes: structure, sequence, and evolution

    PubMed Central

    Skinner, Benjamin M.; Sargent, Carole A.; Churcher, Carol; Hunt, Toby; Herrero, Javier; Loveland, Jane E.; Dunn, Matt; Louzada, Sandra; Fu, Beiyuan; Chow, William; Gilbert, James; Austin-Guest, Siobhan; Beal, Kathryn; Carvalho-Silva, Denise; Cheng, William; Gordon, Daria; Grafham, Darren; Hardy, Matt; Harley, Jo; Hauser, Heidi; Howden, Philip; Howe, Kerstin; Lachani, Kim; Ellis, Peter J.I.; Kelly, Daniel; Kerry, Giselle; Kerwin, James; Ng, Bee Ling; Threadgold, Glen; Wileman, Thomas; Wood, Jonathan M.D.; Yang, Fengtang; Harrow, Jen; Affara, Nabeel A.; Tyler-Smith, Chris

    2016-01-01

    We have generated an improved assembly and gene annotation of the pig X Chromosome, and a first draft assembly of the pig Y Chromosome, by sequencing BAC and fosmid clones from Duroc animals and incorporating information from optical mapping and fiber-FISH. The X Chromosome carries 1033 annotated genes, 690 of which are protein coding. Gene order closely matches that found in primates (including humans) and carnivores (including cats and dogs), which is inferred to be ancestral. Nevertheless, several protein-coding genes present on the human X Chromosome were absent from the pig, and 38 pig-specific X-chromosomal genes were annotated, 22 of which were olfactory receptors. The pig Y-specific Chromosome sequence generated here comprises 30 megabases (Mb). A 15-Mb subset of this sequence was assembled, revealing two clusters of male-specific low copy number genes, separated by an ampliconic region including the HSFY gene family, which together make up most of the short arm. Both clusters contain palindromes with high sequence identity, presumably maintained by gene conversion. Many of the ancestral X-related genes previously reported in at least one mammalian Y Chromosome are represented either as active genes or partial sequences. This sequencing project has allowed us to identify genes—both single copy and amplified—on the pig Y Chromosome, to compare the pig X and Y Chromosomes for homologous sequences, and thereby to reveal mechanisms underlying pig X and Y Chromosome evolution. PMID:26560630

  14. Discovering gene annotations in biomedical text databases

    PubMed Central

    Cakmak, Ali; Ozsoyoglu, Gultekin

    2008-01-01

    Background Genes and gene products are frequently annotated with Gene Ontology concepts based on the evidence provided in genomics articles. Manually locating and curating information about a genomic entity from the biomedical literature requires vast amounts of human effort. Hence, there is clearly a need forautomated computational tools to annotate the genes and gene products with Gene Ontology concepts by computationally capturing the related knowledge embedded in textual data. Results In this article, we present an automated genomic entity annotation system, GEANN, which extracts information about the characteristics of genes and gene products in article abstracts from PubMed, and translates the discoveredknowledge into Gene Ontology (GO) concepts, a widely-used standardized vocabulary of genomic traits. GEANN utilizes textual "extraction patterns", and a semantic matching framework to locate phrases matching to a pattern and produce Gene Ontology annotations for genes and gene products. In our experiments, GEANN has reached to the precision level of 78% at therecall level of 61%. On a select set of Gene Ontology concepts, GEANN either outperforms or is comparable to two other automated annotation studies. Use of WordNet for semantic pattern matching improves the precision and recall by 24% and 15%, respectively, and the improvement due to semantic pattern matching becomes more apparent as the Gene Ontology terms become more general. Conclusion GEANN is useful for two distinct purposes: (i) automating the annotation of genomic entities with Gene Ontology concepts, and (ii) providing existing annotations with additional "evidence articles" from the literature. The use of textual extraction patterns that are constructed based on the existing annotations achieve high precision. The semantic pattern matching framework provides a more flexible pattern matching scheme with respect to "exactmatching" with the advantage of locating approximate pattern occurrences with similar semantics. Relatively low recall performance of our pattern-based approach may be enhanced either by employing a probabilistic annotation framework based on the annotation neighbourhoods in textual data, or, alternatively, the statistical enrichment threshold may be adjusted to lower values for applications that put more value on achieving higher recall values. PMID:18325104

  15. Discovering gene annotations in biomedical text databases.

    PubMed

    Cakmak, Ali; Ozsoyoglu, Gultekin

    2008-03-06

    Genes and gene products are frequently annotated with Gene Ontology concepts based on the evidence provided in genomics articles. Manually locating and curating information about a genomic entity from the biomedical literature requires vast amounts of human effort. Hence, there is clearly a need forautomated computational tools to annotate the genes and gene products with Gene Ontology concepts by computationally capturing the related knowledge embedded in textual data. In this article, we present an automated genomic entity annotation system, GEANN, which extracts information about the characteristics of genes and gene products in article abstracts from PubMed, and translates the discoveredknowledge into Gene Ontology (GO) concepts, a widely-used standardized vocabulary of genomic traits. GEANN utilizes textual "extraction patterns", and a semantic matching framework to locate phrases matching to a pattern and produce Gene Ontology annotations for genes and gene products. In our experiments, GEANN has reached to the precision level of 78% at therecall level of 61%. On a select set of Gene Ontology concepts, GEANN either outperforms or is comparable to two other automated annotation studies. Use of WordNet for semantic pattern matching improves the precision and recall by 24% and 15%, respectively, and the improvement due to semantic pattern matching becomes more apparent as the Gene Ontology terms become more general. GEANN is useful for two distinct purposes: (i) automating the annotation of genomic entities with Gene Ontology concepts, and (ii) providing existing annotations with additional "evidence articles" from the literature. The use of textual extraction patterns that are constructed based on the existing annotations achieve high precision. The semantic pattern matching framework provides a more flexible pattern matching scheme with respect to "exactmatching" with the advantage of locating approximate pattern occurrences with similar semantics. Relatively low recall performance of our pattern-based approach may be enhanced either by employing a probabilistic annotation framework based on the annotation neighbourhoods in textual data, or, alternatively, the statistical enrichment threshold may be adjusted to lower values for applications that put more value on achieving higher recall values.

  16. The Aspergillus Genome Database: multispecies curation and incorporation of RNA-Seq data to improve structural gene annotations.

    PubMed

    Cerqueira, Gustavo C; Arnaud, Martha B; Inglis, Diane O; Skrzypek, Marek S; Binkley, Gail; Simison, Matt; Miyasato, Stuart R; Binkley, Jonathan; Orvis, Joshua; Shah, Prachi; Wymore, Farrell; Sherlock, Gavin; Wortman, Jennifer R

    2014-01-01

    The Aspergillus Genome Database (AspGD; http://www.aspgd.org) is a freely available web-based resource that was designed for Aspergillus researchers and is also a valuable source of information for the entire fungal research community. In addition to being a repository and central point of access to genome, transcriptome and polymorphism data, AspGD hosts a comprehensive comparative genomics toolbox that facilitates the exploration of precomputed orthologs among the 20 currently available Aspergillus genomes. AspGD curators perform gene product annotation based on review of the literature for four key Aspergillus species: Aspergillus nidulans, Aspergillus oryzae, Aspergillus fumigatus and Aspergillus niger. We have iteratively improved the structural annotation of Aspergillus genomes through the analysis of publicly available transcription data, mostly expressed sequenced tags, as described in a previous NAR Database article (Arnaud et al. 2012). In this update, we report substantive structural annotation improvements for A. nidulans, A. oryzae and A. fumigatus genomes based on recently available RNA-Seq data. Over 26 000 loci were updated across these species; although those primarily comprise the addition and extension of untranslated regions (UTRs), the new analysis also enabled over 1000 modifications affecting the coding sequence of genes in each target genome.

  17. Similar Ratios of Introns to Intergenic Sequence across Animal Genomes.

    PubMed

    Francis, Warren R; Wörheide, Gert

    2017-06-01

    One central goal of genome biology is to understand how the usage of the genome differs between organisms. Our knowledge of genome composition, needed for downstream inferences, is critically dependent on gene annotations, yet problems associated with gene annotation and assembly errors are usually ignored in comparative genomics. Here, we analyze the genomes of 68 species across 12 animal phyla and some single-cell eukaryotes for general trends in genome composition and transcription, taking into account problems of gene annotation. We show that, regardless of genome size, the ratio of introns to intergenic sequence is comparable across essentially all animals, with nearly all deviations dominated by increased intergenic sequence. Genomes of model organisms have ratios much closer to 1:1, suggesting that the majority of published genomes of nonmodel organisms are underannotated and consequently omit substantial numbers of genes, with likely negative impact on evolutionary interpretations. Finally, our results also indicate that most animals transcribe half or more of their genomes arguing against differences in genome usage between animal groups, and also suggesting that the transcribed portion is more dependent on genome size than previously thought. © The Author 2017. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution.

  18. GeneRIF indexing: sentence selection based on machine learning.

    PubMed

    Jimeno-Yepes, Antonio J; Sticco, J Caitlin; Mork, James G; Aronson, Alan R

    2013-05-31

    A Gene Reference Into Function (GeneRIF) describes novel functionality of genes. GeneRIFs are available from the National Center for Biotechnology Information (NCBI) Gene database. GeneRIF indexing is performed manually, and the intention of our work is to provide methods to support creating the GeneRIF entries. The creation of GeneRIF entries involves the identification of the genes mentioned in MEDLINE®; citations and the sentences describing a novel function. We have compared several learning algorithms and several features extracted or derived from MEDLINE sentences to determine if a sentence should be selected for GeneRIF indexing. Features are derived from the sentences or using mechanisms to augment the information provided by them: assigning a discourse label using a previously trained model, for example. We show that machine learning approaches with specific feature combinations achieve results close to one of the annotators. We have evaluated different feature sets and learning algorithms. In particular, Naïve Bayes achieves better performance with a selection of features similar to one used in related work, which considers the location of the sentence, the discourse of the sentence and the functional terminology in it. The current performance is at a level similar to human annotation and it shows that machine learning can be used to automate the task of sentence selection for GeneRIF annotation. The current experiments are limited to the human species. We would like to see how the methodology can be extended to other species, specifically the normalization of gene mentions in other species.

  19. JGI Plant Genomics Gene Annotation Pipeline

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Shu, Shengqiang; Rokhsar, Dan; Goodstein, David

    2014-07-14

    Plant genomes vary in size and are highly complex with a high amount of repeats, genome duplication and tandem duplication. Gene encodes a wealth of information useful in studying organism and it is critical to have high quality and stable gene annotation. Thanks to advancement of sequencing technology, many plant species genomes have been sequenced and transcriptomes are also sequenced. To use these vastly large amounts of sequence data to make gene annotation or re-annotation in a timely fashion, an automatic pipeline is needed. JGI plant genomics gene annotation pipeline, called integrated gene call (IGC), is our effort toward thismore » aim with aid of a RNA-seq transcriptome assembly pipeline. It utilizes several gene predictors based on homolog peptides and transcript ORFs. See Methods for detail. Here we present genome annotation of JGI flagship green plants produced by this pipeline plus Arabidopsis and rice except for chlamy which is done by a third party. The genome annotations of these species and others are used in our gene family build pipeline and accessible via JGI Phytozome portal whose URL and front page snapshot are shown below.« less

  20. Improved maize reference genome with single-molecule technologies.

    PubMed

    Jiao, Yinping; Peluso, Paul; Shi, Jinghua; Liang, Tiffany; Stitzer, Michelle C; Wang, Bo; Campbell, Michael S; Stein, Joshua C; Wei, Xuehong; Chin, Chen-Shan; Guill, Katherine; Regulski, Michael; Kumari, Sunita; Olson, Andrew; Gent, Jonathan; Schneider, Kevin L; Wolfgruber, Thomas K; May, Michael R; Springer, Nathan M; Antoniou, Eric; McCombie, W Richard; Presting, Gernot G; McMullen, Michael; Ross-Ibarra, Jeffrey; Dawe, R Kelly; Hastie, Alex; Rank, David R; Ware, Doreen

    2017-06-22

    Complete and accurate reference genomes and annotations provide fundamental tools for characterization of genetic and functional variation. These resources facilitate the determination of biological processes and support translation of research findings into improved and sustainable agricultural technologies. Many reference genomes for crop plants have been generated over the past decade, but these genomes are often fragmented and missing complex repeat regions. Here we report the assembly and annotation of a reference genome of maize, a genetic and agricultural model species, using single-molecule real-time sequencing and high-resolution optical mapping. Relative to the previous reference genome, our assembly features a 52-fold increase in contig length and notable improvements in the assembly of intergenic spaces and centromeres. Characterization of the repetitive portion of the genome revealed more than 130,000 intact transposable elements, allowing us to identify transposable element lineage expansions that are unique to maize. Gene annotations were updated using 111,000 full-length transcripts obtained by single-molecule real-time sequencing. In addition, comparative optical mapping of two other inbred maize lines revealed a prevalence of deletions in regions of low gene density and maize lineage-specific genes.

  1. Breaking the 1000-gene barrier for Mimivirus using ultra-deep genome and transcriptome sequencing.

    PubMed

    Legendre, Matthieu; Santini, Sébastien; Rico, Alain; Abergel, Chantal; Claverie, Jean-Michel

    2011-03-04

    Mimivirus, a giant dsDNA virus infecting Acanthamoeba, is the prototype of the mimiviridae family, the latest addition to the family of the nucleocytoplasmic large DNA viruses (NCLDVs). Its 1.2 Mb-genome was initially predicted to encode 917 genes. A subsequent RNA-Seq analysis precisely mapped many transcript boundaries and identified 75 new genes. We now report a much deeper analysis using the SOLiD™ technology combining RNA-Seq of the Mimivirus transcriptome during the infectious cycle (202.4 Million reads), and a complete genome re-sequencing (45.3 Million reads). This study corrected the genome sequence and identified several single nucleotide polymorphisms. Our results also provided clear evidence of previously overlooked transcription units, including an important RNA polymerase subunit distantly related to Euryarchea homologues. The total Mimivirus gene count is now 1018, 11% greater than the original annotation. This study highlights the huge progress brought about by ultra-deep sequencing for the comprehensive annotation of virus genomes, opening the door to a complete one-nucleotide resolution level description of their transcriptional activity, and to the realistic modeling of the viral genome expression at the ultimate molecular level. This work also illustrates the need to go beyond bioinformatics-only approaches for the annotation of short protein and non-coding genes in viral genomes.

  2. Methodology for the inference of gene function from phenotype data.

    PubMed

    Ascensao, Joao A; Dolan, Mary E; Hill, David P; Blake, Judith A

    2014-12-12

    Biomedical ontologies are increasingly instrumental in the advancement of biological research primarily through their use to efficiently consolidate large amounts of data into structured, accessible sets. However, ontology development and usage can be hampered by the segregation of knowledge by domain that occurs due to independent development and use of the ontologies. The ability to infer data associated with one ontology to data associated with another ontology would prove useful in expanding information content and scope. We here focus on relating two ontologies: the Gene Ontology (GO), which encodes canonical gene function, and the Mammalian Phenotype Ontology (MP), which describes non-canonical phenotypes, using statistical methods to suggest GO functional annotations from existing MP phenotype annotations. This work is in contrast to previous studies that have focused on inferring gene function from phenotype primarily through lexical or semantic similarity measures. We have designed and tested a set of algorithms that represents a novel methodology to define rules for predicting gene function by examining the emergent structure and relationships between the gene functions and phenotypes rather than inspecting the terms semantically. The algorithms inspect relationships among multiple phenotype terms to deduce if there are cases where they all arise from a single gene function. We apply this methodology to data about genes in the laboratory mouse that are formally represented in the Mouse Genome Informatics (MGI) resource. From the data, 7444 rule instances were generated from five generalized rules, resulting in 4818 unique GO functional predictions for 1796 genes. We show that our method is capable of inferring high-quality functional annotations from curated phenotype data. As well as creating inferred annotations, our method has the potential to allow for the elucidation of unforeseen, biologically significant associations between gene function and phenotypes that would be overlooked by a semantics-based approach. Future work will include the implementation of the described algorithms for a variety of other model organism databases, taking full advantage of the abundance of available high quality curated data.

  3. Recognition of Protein-coding Genes Based on Z-curve Algorithms

    PubMed Central

    -Biao Guo, Feng; Lin, Yan; -Ling Chen, Ling

    2014-01-01

    Recognition of protein-coding genes, a classical bioinformatics issue, is an absolutely needed step for annotating newly sequenced genomes. The Z-curve algorithm, as one of the most effective methods on this issue, has been successfully applied in annotating or re-annotating many genomes, including those of bacteria, archaea and viruses. Two Z-curve based ab initio gene-finding programs have been developed: ZCURVE (for bacteria and archaea) and ZCURVE_V (for viruses and phages). ZCURVE_C (for 57 bacteria) and Zfisher (for any bacterium) are web servers for re-annotation of bacterial and archaeal genomes. The above four tools can be used for genome annotation or re-annotation, either independently or combined with the other gene-finding programs. In addition to recognizing protein-coding genes and exons, Z-curve algorithms are also effective in recognizing promoters and translation start sites. Here, we summarize the applications of Z-curve algorithms in gene finding and genome annotation. PMID:24822027

  4. Co-complex protein membership evaluation using Maximum Entropy on GO ontology and InterPro annotation.

    PubMed

    Armean, Irina M; Lilley, Kathryn S; Trotter, Matthew W B; Pilkington, Nicholas C V; Holden, Sean B

    2018-06-01

    Protein-protein interactions (PPI) play a crucial role in our understanding of protein function and biological processes. The standardization and recording of experimental findings is increasingly stored in ontologies, with the Gene Ontology (GO) being one of the most successful projects. Several PPI evaluation algorithms have been based on the application of probabilistic frameworks or machine learning algorithms to GO properties. Here, we introduce a new training set design and machine learning based approach that combines dependent heterogeneous protein annotations from the entire ontology to evaluate putative co-complex protein interactions determined by empirical studies. PPI annotations are built combinatorically using corresponding GO terms and InterPro annotation. We use a S.cerevisiae high-confidence complex dataset as a positive training set. A series of classifiers based on Maximum Entropy and support vector machines (SVMs), each with a composite counterpart algorithm, are trained on a series of training sets. These achieve a high performance area under the ROC curve of ≤0.97, outperforming go2ppi-a previously established prediction tool for protein-protein interactions (PPI) based on Gene Ontology (GO) annotations. https://github.com/ima23/maxent-ppi. sbh11@cl.cam.ac.uk. Supplementary data are available at Bioinformatics online.

  5. Automated eukaryotic gene structure annotation using EVidenceModeler and the Program to Assemble Spliced Alignments

    PubMed Central

    Haas, Brian J; Salzberg, Steven L; Zhu, Wei; Pertea, Mihaela; Allen, Jonathan E; Orvis, Joshua; White, Owen; Buell, C Robin; Wortman, Jennifer R

    2008-01-01

    EVidenceModeler (EVM) is presented as an automated eukaryotic gene structure annotation tool that reports eukaryotic gene structures as a weighted consensus of all available evidence. EVM, when combined with the Program to Assemble Spliced Alignments (PASA), yields a comprehensive, configurable annotation system that predicts protein-coding genes and alternatively spliced isoforms. Our experiments on both rice and human genome sequences demonstrate that EVM produces automated gene structure annotation approaching the quality of manual curation. PMID:18190707

  6. Bovine Genome Database: supporting community annotation and analysis of the Bos taurus genome

    PubMed Central

    2010-01-01

    Background A goal of the Bovine Genome Database (BGD; http://BovineGenome.org) has been to support the Bovine Genome Sequencing and Analysis Consortium (BGSAC) in the annotation and analysis of the bovine genome. We were faced with several challenges, including the need to maintain consistent quality despite diversity in annotation expertise in the research community, the need to maintain consistent data formats, and the need to minimize the potential duplication of annotation effort. With new sequencing technologies allowing many more eukaryotic genomes to be sequenced, the demand for collaborative annotation is likely to increase. Here we present our approach, challenges and solutions facilitating a large distributed annotation project. Results and Discussion BGD has provided annotation tools that supported 147 members of the BGSAC in contributing 3,871 gene models over a fifteen-week period, and these annotations have been integrated into the bovine Official Gene Set. Our approach has been to provide an annotation system, which includes a BLAST site, multiple genome browsers, an annotation portal, and the Apollo Annotation Editor configured to connect directly to our Chado database. In addition to implementing and integrating components of the annotation system, we have performed computational analyses to create gene evidence tracks and a consensus gene set, which can be viewed on individual gene pages at BGD. Conclusions We have provided annotation tools that alleviate challenges associated with distributed annotation. Our system provides a consistent set of data to all annotators and eliminates the need for annotators to format data. Involving the bovine research community in genome annotation has allowed us to leverage expertise in various areas of bovine biology to provide biological insight into the genome sequence. PMID:21092105

  7. RNA Sequencing-Based Genome Reannotation of the Dermatophyte Arthroderma benhamiae and Characterization of Its Secretome and Whole Gene Expression Profile during Infection

    PubMed Central

    De Coi, Niccolò; Feuermann, Marc; Schmid-Siegert, Emanuel; Băguţ, Elena-Tatiana; Mignon, Bernard; Waridel, Patrice; Peter, Corinne; Pradervand, Sylvain

    2016-01-01

    ABSTRACT Dermatophytes are the most common agents of superficial mycoses in humans and animals. The aim of the present investigation was to systematically identify the extracellular, possibly secreted, proteins that are putative virulence factors and antigenic molecules of dermatophytes. A complete gene expression profile of Arthroderma benhamiae was obtained during infection of its natural host (guinea pig) using RNA sequencing (RNA-seq) technology. This profile was completed with those of the fungus cultivated in vitro in two media containing either keratin or soy meal protein as the sole source of nitrogen and in Sabouraud medium. More than 60% of transcripts deduced from RNA-seq data differ from those previously deposited for A. benhamiae. Using these RNA-seq data along with an automatic gene annotation procedure, followed by manual curation, we produced a new annotation of the A. benhamiae genome. This annotation comprised 7,405 coding sequences (CDSs), among which only 2,662 were identical to the currently available annotation, 383 were newly identified, and 15 secreted proteins were manually corrected. The expression profile of genes encoding proteins with a signal peptide in infected guinea pigs was found to be very different from that during in vitro growth when using keratin as the substrate. Especially, the sets of the 12 most highly expressed genes encoding proteases with a signal sequence had only the putative vacuolar aspartic protease gene PEP2 in common, during infection and in keratin medium. The most upregulated gene encoding a secreted protease during infection was that encoding subtilisin SUB6, which is a known major allergen in the related dermatophyte Trichophyton rubrum. IMPORTANCE Dermatophytoses (ringworm, jock itch, athlete’s foot, and nail infections) are the most common fungal infections, but their virulence mechanisms are poorly understood. Combining transcriptomic data obtained from growth under various culture conditions with data obtained during infection led to a significantly improved genome annotation. About 65% of the protein-encoding genes predicted with our protocol did not match the existing annotation for A. benhamiae. Comparing gene expression during infection on guinea pigs with keratin degradation in vitro, which is supposed to mimic the host environment, revealed the critical importance of using real in vivo conditions for investigating virulence mechanisms. The analysis of genes expressed in vivo, encoding cell surface and secreted proteins, particularly proteases, led to the identification of new allergen and virulence factor candidates. PMID:27822542

  8. RNA Sequencing-Based Genome Reannotation of the Dermatophyte Arthroderma benhamiae and Characterization of Its Secretome and Whole Gene Expression Profile during Infection.

    PubMed

    Tran, Van Du T; De Coi, Niccolò; Feuermann, Marc; Schmid-Siegert, Emanuel; Băguţ, Elena-Tatiana; Mignon, Bernard; Waridel, Patrice; Peter, Corinne; Pradervand, Sylvain; Pagni, Marco; Monod, Michel

    2016-01-01

    Dermatophytes are the most common agents of superficial mycoses in humans and animals. The aim of the present investigation was to systematically identify the extracellular, possibly secreted, proteins that are putative virulence factors and antigenic molecules of dermatophytes. A complete gene expression profile of Arthroderma benhamiae was obtained during infection of its natural host (guinea pig) using RNA sequencing (RNA-seq) technology. This profile was completed with those of the fungus cultivated in vitro in two media containing either keratin or soy meal protein as the sole source of nitrogen and in Sabouraud medium. More than 60% of transcripts deduced from RNA-seq data differ from those previously deposited for A. benhamiae . Using these RNA-seq data along with an automatic gene annotation procedure, followed by manual curation, we produced a new annotation of the A. benhamiae genome. This annotation comprised 7,405 coding sequences (CDSs), among which only 2,662 were identical to the currently available annotation, 383 were newly identified, and 15 secreted proteins were manually corrected. The expression profile of genes encoding proteins with a signal peptide in infected guinea pigs was found to be very different from that during in vitro growth when using keratin as the substrate. Especially, the sets of the 12 most highly expressed genes encoding proteases with a signal sequence had only the putative vacuolar aspartic protease gene PEP2 in common, during infection and in keratin medium. The most upregulated gene encoding a secreted protease during infection was that encoding subtilisin SUB6, which is a known major allergen in the related dermatophyte Trichophyton rubrum . IMPORTANCE Dermatophytoses (ringworm, jock itch, athlete's foot, and nail infections) are the most common fungal infections, but their virulence mechanisms are poorly understood. Combining transcriptomic data obtained from growth under various culture conditions with data obtained during infection led to a significantly improved genome annotation. About 65% of the protein-encoding genes predicted with our protocol did not match the existing annotation for A. benhamiae . Comparing gene expression during infection on guinea pigs with keratin degradation in vitro , which is supposed to mimic the host environment, revealed the critical importance of using real in vivo conditions for investigating virulence mechanisms. The analysis of genes expressed in vivo , encoding cell surface and secreted proteins, particularly proteases, led to the identification of new allergen and virulence factor candidates.

  9. Matrix factorization-based data fusion for gene function prediction in baker's yeast and slime mold.

    PubMed

    Zitnik, Marinka; Zupan, Blaž

    2014-01-01

    The development of effective methods for the characterization of gene functions that are able to combine diverse data sources in a sound and easily-extendible way is an important goal in computational biology. We have previously developed a general matrix factorization-based data fusion approach for gene function prediction. In this manuscript, we show that this data fusion approach can be applied to gene function prediction and that it can fuse various heterogeneous data sources, such as gene expression profiles, known protein annotations, interaction and literature data. The fusion is achieved by simultaneous matrix tri-factorization that shares matrix factors between sources. We demonstrate the effectiveness of the approach by evaluating its performance on predicting ontological annotations in slime mold D. discoideum and on recognizing proteins of baker's yeast S. cerevisiae that participate in the ribosome or are located in the cell membrane. Our approach achieves predictive performance comparable to that of the state-of-the-art kernel-based data fusion, but requires fewer data preprocessing steps.

  10. tRNAscan-SE On-line: integrating search and context for analysis of transfer RNA genes.

    PubMed

    Lowe, Todd M; Chan, Patricia P

    2016-07-08

    High-throughput genome sequencing continues to grow the need for rapid, accurate genome annotation and tRNA genes constitute the largest family of essential, ever-present non-coding RNA genes. Newly developed tRNAscan-SE 2.0 has advanced the state-of-the-art methodology in tRNA gene detection and functional prediction, captured by rich new content of the companion Genomic tRNA Database. Previously, web-server tRNA detection was isolated from knowledge of existing tRNAs and their annotation. In this update of the tRNAscan-SE On-line resource, we tie together improvements in tRNA classification with greatly enhanced biological context via dynamically generated links between web server search results, the most relevant genes in the GtRNAdb and interactive, rich genome context provided by UCSC genome browsers. The tRNAscan-SE On-line web server can be accessed at http://trna.ucsc.edu/tRNAscan-SE/. © The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research.

  11. BioGPS and MyGene.info: organizing online, gene-centric information.

    PubMed

    Wu, Chunlei; Macleod, Ian; Su, Andrew I

    2013-01-01

    Fast-evolving technologies have enabled researchers to easily generate data at genome scale, and using these technologies to compare biological states typically results in a list of candidate genes. Researchers are then faced with the daunting task of prioritizing these candidate genes for follow-up studies. There are hundreds, possibly even thousands, of web-based gene annotation resources available, but it quickly becomes impractical to manually access and review all of these sites for each gene in a candidate gene list. BioGPS (http://biogps.org) was created as a centralized gene portal for aggregating distributed gene annotation resources, emphasizing community extensibility and user customizability. BioGPS serves as a convenient tool for users to access known gene-centric resources, as well as a mechanism to discover new resources that were previously unknown to the user. This article describes updates to BioGPS made after its initial release in 2008. We summarize recent additions of features and data, as well as the robust user activity that underlies this community intelligence application. Finally, we describe MyGene.info (http://mygene.info) and related web services that provide programmatic access to BioGPS.

  12. Large-Scale Sequencing of Two Regions in Human Chromosome 7q22: Analysis of 650 kb of Genomic Sequence around the EPO and CUTL1 Loci Reveals 17 Genes

    PubMed Central

    Glöckner, Gernot; Scherer, Stephen; Schattevoy, Ruben; Boright, Andrew; Weber, Jacqueline; Tsui, Lap-Chee; Rosenthal, André

    1998-01-01

    We have sequenced and annotated two genomic regions located in the Giemsa negative band q22 of human chromosome 7. The first region defined by the erythropoietin (EPO) locus is 228 kb in length and contains 13 genes. Whereas 3 genes (GNB2, EPO, PCOLCE) were known previously on the mRNA level, we have been able to identify 10 novel genes using a newly developed automatic annotation tool RUMMAGE-DP, which comprises >26 different programs mainly for exon prediction, homology searches, and compositional and repeat analysis. For precise annotation we have also resequenced ESTs identified to the region and assembled them to build large cDNAs. In addition, we have investigated the differential splicing of genes. Using these tools we annotated 4 of the 10 genes as a zonadhesin, a transferrin homolog, a nucleoporin-like gene, and an actin gene. Two genes showed weak similarity to an insulin-like receptor and a neuronal protein with a leucine-rich amino-terminal domain. Four predicted genes (CDS1–CDS4) CDS that have been confirmed on the mRNA level showed no similarity to known proteins and a potential function could not be assigned. The second region in 7q22 defined by the CUTL1 (CCAAT displacement protein and its splice variant) locus is 416 kb in length and contains three known genes, including PMSL12, APS, CUTL1, and a novel gene (CDS5). The CUTL1 locus, consisting of two splice variants (CDP and CASP), occupies >300 kb. Based on the G,C profile an isochore switch can be defined between the CUTL1 gene and the APS and PMSL12 genes. [Clones 37G3, 164c7, and 235f8 are deposited in GenBank under accession no. AF053356; clone 123e15, accession no. AF024533; 186d2, accession no. AF024534; 46f6, accession no. AF006752; 50h2, accession no. AF047825; and 76h2, accession no. AF030453] PMID:9799793

  13. Interspecific and intraspecific gene variability in a 1-Mb region containing the highest density of NBS-LRR genes found in the melon genome.

    PubMed

    González, Víctor M; Aventín, Núria; Centeno, Emilio; Puigdomènech, Pere

    2014-12-17

    Plant NBS-LRR -resistance genes tend to be found in clusters, which have been shown to be hot spots of genome variability. In melon, half of the 81 predicted NBS-LRR genes group in nine clusters, and a 1 Mb region on linkage group V contains the highest density of R-genes and presence/absence gene polymorphisms found in the melon genome. This region is known to contain the locus of Vat, an agronomically important gene that confers resistance to aphids. However, the presence of duplications makes the sequencing and annotation of R-gene clusters difficult, usually resulting in multi-gapped sequences with higher than average errors. A 1-Mb sequence that contains the largest NBS-LRR gene cluster found in melon was improved using a strategy that combines Illumina paired-end mapping and PCR-based gap closing. Unknown sequence was decreased by 70% while about 3,000 SNPs and small indels were corrected. As a result, the annotations of 18 of a total of 23 NBS-LRR genes found in this region were modified, including additional coding sequences, amino acid changes, correction of splicing boundaries, or fussion of ORFs in common transcription units. A phylogeny analysis of the R-genes and their comparison with syntenic sequences in other cucurbits point to a pattern of local gene amplifications since the diversification of cucurbits from other families, and through speciation within the family. A candidate Vat gene is proposed based on the sequence similarity between a reported Vat gene from a Korean melon cultivar and a sequence fragment previously absent in the unrefined sequence. A sequence refinement strategy allowed substantial improvement of a 1 Mb fragment of the melon genome and the re-annotation of the largest cluster of NBS-LRR gene homologues found in melon. Analysis of the cluster revealed that resistance genes have been produced by sequence duplication in adjacent genome locations since the divergence of cucurbits from other close families, and through the process of speciation within the family a candidate Vat gene was also identified using sequence previously unavailable, which demonstrates the advantages of genome assembly refinements when analyzing complex regions such as those containing clusters of highly similar genes.

  14. Measuring semantic similarities by combining gene ontology annotations and gene co-function networks

    DOE PAGES

    Peng, Jiajie; Uygun, Sahra; Kim, Taehyong; ...

    2015-02-14

    Background: Gene Ontology (GO) has been used widely to study functional relationships between genes. The current semantic similarity measures rely only on GO annotations and GO structure. This limits the power of GO-based similarity because of the limited proportion of genes that are annotated to GO in most organisms. Results: We introduce a novel approach called NETSIM (network-based similarity measure) that incorporates information from gene co-function networks in addition to using the GO structure and annotations. Using metabolic reaction maps of yeast, Arabidopsis, and human, we demonstrate that NETSIM can improve the accuracy of GO term similarities. We also demonstratemore » that NETSIM works well even for genomes with sparser gene annotation data. We applied NETSIM on large Arabidopsis gene families such as cytochrome P450 monooxygenases to group the members functionally and show that this grouping could facilitate functional characterization of genes in these families. Conclusions: Using NETSIM as an example, we demonstrated that the performance of a semantic similarity measure could be significantly improved after incorporating genome-specific information. NETSIM incorporates both GO annotations and gene co-function network data as a priori knowledge in the model. Therefore, functional similarities of GO terms that are not explicitly encoded in GO but are relevant in a taxon-specific manner become measurable when GO annotations are limited.« less

  15. GONUTS: the Gene Ontology Normal Usage Tracking System

    PubMed Central

    Renfro, Daniel P.; McIntosh, Brenley K.; Venkatraman, Anand; Siegele, Deborah A.; Hu, James C.

    2012-01-01

    The Gene Ontology Normal Usage Tracking System (GONUTS) is a community-based browser and usage guide for Gene Ontology (GO) terms and a community system for general GO annotation of proteins. GONUTS uses wiki technology to allow registered users to share and edit notes on the use of each term in GO, and to contribute annotations for specific genes of interest. By providing a site for generation of third-party documentation at the granularity of individual terms, GONUTS complements the official documentation of the Gene Ontology Consortium. To provide examples for community users, GONUTS displays the complete GO annotations from seven model organisms: Saccharomyces cerevisiae, Dictyostelium discoideum, Caenorhabditis elegans, Drosophila melanogaster, Danio rerio, Mus musculus and Arabidopsis thaliana. To support community annotation, GONUTS allows automated creation of gene pages for gene products in UniProt. GONUTS will improve the consistency of annotation efforts across genome projects, and should be useful in training new annotators and consumers in the production of GO annotations and the use of GO terms. GONUTS can be accessed at http://gowiki.tamu.edu. The source code for generating the content of GONUTS is available upon request. PMID:22110029

  16. Stop codon readthrough generates a C-terminally extended variant of the human vitamin D receptor with reduced calcitriol response

    PubMed Central

    Loughran, Gary; Jungreis, Irwin; Tzani, Ioanna; Power, Michael; Dmitriev, Ruslan I.; Ivanov, Ivaylo P.; Kellis, Manolis; Atkins, John F.

    2018-01-01

    Although stop codon readthrough is used extensively by viruses to expand their gene expression, verified instances of mammalian readthrough have only recently been uncovered by systems biology and comparative genomics approaches. Previously, our analysis of conserved protein coding signatures that extend beyond annotated stop codons predicted stop codon readthrough of several mammalian genes, all of which have been validated experimentally. Four mRNAs display highly efficient stop codon readthrough, and these mRNAs have a UGA stop codon immediately followed by CUAG (UGA_CUAG) that is conserved throughout vertebrates. Extending on the identification of this readthrough motif, we here investigated stop codon readthrough, using tissue culture reporter assays, for all previously untested human genes containing UGA_CUAG. The readthrough efficiency of the annotated stop codon for the sequence encoding vitamin D receptor (VDR) was 6.7%. It was the highest of those tested but all showed notable levels of readthrough. The VDR is a member of the nuclear receptor superfamily of ligand-inducible transcription factors, and it binds its major ligand, calcitriol, via its C-terminal ligand-binding domain. Readthrough of the annotated VDR mRNA results in a 67 amino acid–long C-terminal extension that generates a VDR proteoform named VDRx. VDRx may form homodimers and heterodimers with VDR but, compared with VDR, VDRx displayed a reduced transcriptional response to calcitriol even in the presence of its partner retinoid X receptor. PMID:29386352

  17. Comparative Life Cycle Transcriptomics Revises Leishmania mexicana Genome Annotation and Links a Chromosome Duplication with Parasitism of Vertebrates

    PubMed Central

    Fiebig, Michael; Kelly, Steven; Gluenz, Eva

    2015-01-01

    Leishmania spp. are protozoan parasites that have two principal life cycle stages: the motile promastigote forms that live in the alimentary tract of the sandfly and the amastigote forms, which are adapted to survive and replicate in the harsh conditions of the phagolysosome of mammalian macrophages. Here, we used Illumina sequencing of poly-A selected RNA to characterise and compare the transcriptomes of L. mexicana promastigotes, axenic amastigotes and intracellular amastigotes. These data allowed the production of the first transcriptome evidence-based annotation of gene models for this species, including genome-wide mapping of trans-splice sites and poly-A addition sites. The revised genome annotation encompassed 9,169 protein-coding genes including 936 novel genes as well as modifications to previously existing gene models. Comparative analysis of gene expression across promastigote and amastigote forms revealed that 3,832 genes are differentially expressed between promastigotes and intracellular amastigotes. A large proportion of genes that were downregulated during differentiation to amastigotes were associated with the function of the motile flagellum. In contrast, those genes that were upregulated included cell surface proteins, transporters, peptidases and many uncharacterized genes, including 293 of the 936 novel genes. Genome-wide distribution analysis of the differentially expressed genes revealed that the tetraploid chromosome 30 is highly enriched for genes that were upregulated in amastigotes, providing the first evidence of a link between this whole chromosome duplication event and adaptation to the vertebrate host in this group. Peptide evidence for 42 proteins encoded by novel transcripts supports the idea of an as yet uncharacterised set of small proteins in Leishmania spp. with possible implications for host-pathogen interactions. PMID:26452044

  18. CodingQuarry: highly accurate hidden Markov model gene prediction in fungal genomes using RNA-seq transcripts.

    PubMed

    Testa, Alison C; Hane, James K; Ellwood, Simon R; Oliver, Richard P

    2015-03-11

    The impact of gene annotation quality on functional and comparative genomics makes gene prediction an important process, particularly in non-model species, including many fungi. Sets of homologous protein sequences are rarely complete with respect to the fungal species of interest and are often small or unreliable, especially when closely related species have not been sequenced or annotated in detail. In these cases, protein homology-based evidence fails to correctly annotate many genes, or significantly improve ab initio predictions. Generalised hidden Markov models (GHMM) have proven to be invaluable tools in gene annotation and, recently, RNA-seq has emerged as a cost-effective means to significantly improve the quality of automated gene annotation. As these methods do not require sets of homologous proteins, improving gene prediction from these resources is of benefit to fungal researchers. While many pipelines now incorporate RNA-seq data in training GHMMs, there has been relatively little investigation into additionally combining RNA-seq data at the point of prediction, and room for improvement in this area motivates this study. CodingQuarry is a highly accurate, self-training GHMM fungal gene predictor designed to work with assembled, aligned RNA-seq transcripts. RNA-seq data informs annotations both during gene-model training and in prediction. Our approach capitalises on the high quality of fungal transcript assemblies by incorporating predictions made directly from transcript sequences. Correct predictions are made despite transcript assembly problems, including those caused by overlap between the transcripts of adjacent gene loci. Stringent benchmarking against high-confidence annotation subsets showed CodingQuarry predicted 91.3% of Schizosaccharomyces pombe genes and 90.4% of Saccharomyces cerevisiae genes perfectly. These results are 4-5% better than those of AUGUSTUS, the next best performing RNA-seq driven gene predictor tested. Comparisons against whole genome Sc. pombe and S. cerevisiae annotations further substantiate a 4-5% improvement in the number of correctly predicted genes. We demonstrate the success of a novel method of incorporating RNA-seq data into GHMM fungal gene prediction. This shows that a high quality annotation can be achieved without relying on protein homology or a training set of genes. CodingQuarry is freely available ( https://sourceforge.net/projects/codingquarry/ ), and suitable for incorporation into genome annotation pipelines.

  19. BEACON: automated tool for Bacterial GEnome Annotation ComparisON.

    PubMed

    Kalkatawi, Manal; Alam, Intikhab; Bajic, Vladimir B

    2015-08-18

    Genome annotation is one way of summarizing the existing knowledge about genomic characteristics of an organism. There has been an increased interest during the last several decades in computer-based structural and functional genome annotation. Many methods for this purpose have been developed for eukaryotes and prokaryotes. Our study focuses on comparison of functional annotations of prokaryotic genomes. To the best of our knowledge there is no fully automated system for detailed comparison of functional genome annotations generated by different annotation methods (AMs). The presence of many AMs and development of new ones introduce needs to: a/ compare different annotations for a single genome, and b/ generate annotation by combining individual ones. To address these issues we developed an Automated Tool for Bacterial GEnome Annotation ComparisON (BEACON) that benefits both AM developers and annotation analysers. BEACON provides detailed comparison of gene function annotations of prokaryotic genomes obtained by different AMs and generates extended annotations through combination of individual ones. For the illustration of BEACON's utility, we provide a comparison analysis of multiple different annotations generated for four genomes and show on these examples that the extended annotation can increase the number of genes annotated by putative functions up to 27%, while the number of genes without any function assignment is reduced. We developed BEACON, a fast tool for an automated and a systematic comparison of different annotations of single genomes. The extended annotation assigns putative functions to many genes with unknown functions. BEACON is available under GNU General Public License version 3.0 and is accessible at: http://www.cbrc.kaust.edu.sa/BEACON/ .

  20. TriAnnot: A Versatile and High Performance Pipeline for the Automated Annotation of Plant Genomes

    PubMed Central

    Leroy, Philippe; Guilhot, Nicolas; Sakai, Hiroaki; Bernard, Aurélien; Choulet, Frédéric; Theil, Sébastien; Reboux, Sébastien; Amano, Naoki; Flutre, Timothée; Pelegrin, Céline; Ohyanagi, Hajime; Seidel, Michael; Giacomoni, Franck; Reichstadt, Mathieu; Alaux, Michael; Gicquello, Emmanuelle; Legeai, Fabrice; Cerutti, Lorenzo; Numa, Hisataka; Tanaka, Tsuyoshi; Mayer, Klaus; Itoh, Takeshi; Quesneville, Hadi; Feuillet, Catherine

    2012-01-01

    In support of the international effort to obtain a reference sequence of the bread wheat genome and to provide plant communities dealing with large and complex genomes with a versatile, easy-to-use online automated tool for annotation, we have developed the TriAnnot pipeline. Its modular architecture allows for the annotation and masking of transposable elements, the structural, and functional annotation of protein-coding genes with an evidence-based quality indexing, and the identification of conserved non-coding sequences and molecular markers. The TriAnnot pipeline is parallelized on a 712 CPU computing cluster that can run a 1-Gb sequence annotation in less than 5 days. It is accessible through a web interface for small scale analyses or through a server for large scale annotations. The performance of TriAnnot was evaluated in terms of sensitivity, specificity, and general fitness using curated reference sequence sets from rice and wheat. In less than 8 h, TriAnnot was able to predict more than 83% of the 3,748 CDS from rice chromosome 1 with a fitness of 67.4%. On a set of 12 reference Mb-sized contigs from wheat chromosome 3B, TriAnnot predicted and annotated 93.3% of the genes among which 54% were perfectly identified in accordance with the reference annotation. It also allowed the curation of 12 genes based on new biological evidences, increasing the percentage of perfect gene prediction to 63%. TriAnnot systematically showed a higher fitness than other annotation pipelines that are not improved for wheat. As it is easily adaptable to the annotation of other plant genomes, TriAnnot should become a useful resource for the annotation of large and complex genomes in the future. PMID:22645565

  1. Quality of Computationally Inferred Gene Ontology Annotations

    PubMed Central

    Škunca, Nives; Altenhoff, Adrian; Dessimoz, Christophe

    2012-01-01

    Gene Ontology (GO) has established itself as the undisputed standard for protein function annotation. Most annotations are inferred electronically, i.e. without individual curator supervision, but they are widely considered unreliable. At the same time, we crucially depend on those automated annotations, as most newly sequenced genomes are non-model organisms. Here, we introduce a methodology to systematically and quantitatively evaluate electronic annotations. By exploiting changes in successive releases of the UniProt Gene Ontology Annotation database, we assessed the quality of electronic annotations in terms of specificity, reliability, and coverage. Overall, we not only found that electronic annotations have significantly improved in recent years, but also that their reliability now rivals that of annotations inferred by curators when they use evidence other than experiments from primary literature. This work provides the means to identify the subset of electronic annotations that can be relied upon—an important outcome given that >98% of all annotations are inferred without direct curation. PMID:22693439

  2. miRBase: integrating microRNA annotation and deep-sequencing data.

    PubMed

    Kozomara, Ana; Griffiths-Jones, Sam

    2011-01-01

    miRBase is the primary online repository for all microRNA sequences and annotation. The current release (miRBase 16) contains over 15,000 microRNA gene loci in over 140 species, and over 17,000 distinct mature microRNA sequences. Deep-sequencing technologies have delivered a sharp rise in the rate of novel microRNA discovery. We have mapped reads from short RNA deep-sequencing experiments to microRNAs in miRBase and developed web interfaces to view these mappings. The user can view all read data associated with a given microRNA annotation, filter reads by experiment and count, and search for microRNAs by tissue- and stage-specific expression. These data can be used as a proxy for relative expression levels of microRNA sequences, provide detailed evidence for microRNA annotations and alternative isoforms of mature microRNAs, and allow us to revisit previous annotations. miRBase is available online at: http://www.mirbase.org/.

  3. Importing statistical measures into Artemis enhances gene identification in the Leishmania genome project.

    PubMed

    Aggarwal, Gautam; Worthey, E A; McDonagh, Paul D; Myler, Peter J

    2003-06-07

    Seattle Biomedical Research Institute (SBRI) as part of the Leishmania Genome Network (LGN) is sequencing chromosomes of the trypanosomatid protozoan species Leishmania major. At SBRI, chromosomal sequence is annotated using a combination of trained and untrained non-consensus gene-prediction algorithms with ARTEMIS, an annotation platform with rich and user-friendly interfaces. Here we describe a methodology used to import results from three different protein-coding gene-prediction algorithms (GLIMMER, TESTCODE and GENESCAN) into the ARTEMIS sequence viewer and annotation tool. Comparison of these methods, along with the CODONUSAGE algorithm built into ARTEMIS, shows the importance of combining methods to more accurately annotate the L. major genomic sequence. An improvised and powerful tool for gene prediction has been developed by importing data from widely-used algorithms into an existing annotation platform. This approach is especially fruitful in the Leishmania genome project where there is large proportion of novel genes requiring manual annotation.

  4. Metagenomic gene annotation by a homology-independent approach

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Froula, Jeff; Zhang, Tao; Salmeen, Annette

    2011-06-02

    Fully understanding the genetic potential of a microbial community requires functional annotation of all the genes it encodes. The recently developed deep metagenome sequencing approach has enabled rapid identification of millions of genes from a complex microbial community without cultivation. Current homology-based gene annotation fails to detect distantly-related or structural homologs. Furthermore, homology searches with millions of genes are very computational intensive. To overcome these limitations, we developed rhModeller, a homology-independent software pipeline to efficiently annotate genes from metagenomic sequencing projects. Using cellulases and carbonic anhydrases as two independent test cases, we demonstrated that rhModeller is much faster than HMMERmore » but with comparable accuracy, at 94.5percent and 99.9percent accuracy, respectively. More importantly, rhModeller has the ability to detect novel proteins that do not share significant homology to any known protein families. As {approx}50percent of the 2 million genes derived from the cow rumen metagenome failed to be annotated based on sequence homology, we tested whether rhModeller could be used to annotate these genes. Preliminary results suggest that rhModeller is robust in the presence of missense and frameshift mutations, two common errors in metagenomic genes. Applying the pipeline to the cow rumen genes identified 4,990 novel cellulases candidates and 8,196 novel carbonic anhydrase candidates.In summary, we expect rhModeller to dramatically increase the speed and quality of metagnomic gene annotation.« less

  5. Discovery of numerous novel small genes in the intergenic regions of the Escherichia coli O157:H7 Sakai genome

    PubMed Central

    Hücker, Sarah M.; Ardern, Zachary; Goldberg, Tatyana; Schafferhans, Andrea; Bernhofer, Michael; Vestergaard, Gisle; Nelson, Chase W.; Schloter, Michael; Rost, Burkhard; Scherer, Siegfried

    2017-01-01

    In the past, short protein-coding genes were often disregarded by genome annotation pipelines. Transcriptome sequencing (RNAseq) signals outside of annotated genes have usually been interpreted to indicate either ncRNA or pervasive transcription. Therefore, in addition to the transcriptome, the translatome (RIBOseq) of the enteric pathogen Escherichia coli O157:H7 strain Sakai was determined at two optimal growth conditions and a severe stress condition combining low temperature and high osmotic pressure. All intergenic open reading frames potentially encoding a protein of ≥ 30 amino acids were investigated with regard to coverage by transcription and translation signals and their translatability expressed by the ribosomal coverage value. This led to discovery of 465 unique, putative novel genes not yet annotated in this E. coli strain, which are evenly distributed over both DNA strands of the genome. For 255 of the novel genes, annotated homologs in other bacteria were found, and a machine-learning algorithm, trained on small protein-coding E. coli genes, predicted that 89% of these translated open reading frames represent bona fide genes. The remaining 210 putative novel genes without annotated homologs were compared to the 255 novel genes with homologs and to 250 short annotated genes of this E. coli strain. All three groups turned out to be similar with respect to their translatability distribution, fractions of differentially regulated genes, secondary structure composition, and the distribution of evolutionary constraint, suggesting that both novel groups represent legitimate genes. However, the machine-learning algorithm only recognized a small fraction of the 210 genes without annotated homologs. It is possible that these genes represent a novel group of genes, which have unusual features dissimilar to the genes of the machine-learning algorithm training set. PMID:28902868

  6. Mutant phenotypes for thousands of bacterial genes of unknown function

    DOE PAGES

    Price, Morgan N.; Wetmore, Kelly M.; Waters, R. Jordan; ...

    2018-05-16

    One-third of all protein-coding genes from bacterial genomes cannot be annotated with a function. Here, to investigate the functions of these genes, we present genome-wide mutant fitness data from 32 diverse bacteria across dozens of growth conditions. We identified mutant phenotypes for 11,779 protein-coding genes that had not been annotated with a specific function. Many genes could be associated with a specific condition because the gene affected fitness only in that condition, or with another gene in the same bacterium because they had similar mutant phenotypes. Of the poorly annotated genes, 2,316 had associations that have high confidence because theymore » are conserved in other bacteria. By combining these conserved associations with comparative genomics, we identified putative DNA repair proteins; in addition, we propose specific functions for poorly annotated enzymes and transporters and for uncharacterized protein families. Lastly, our study demonstrates the scalability of microbial genetics and its utility for improving gene annotations.« less

  7. Mutant phenotypes for thousands of bacterial genes of unknown function

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Price, Morgan N.; Wetmore, Kelly M.; Waters, R. Jordan

    One-third of all protein-coding genes from bacterial genomes cannot be annotated with a function. Here, to investigate the functions of these genes, we present genome-wide mutant fitness data from 32 diverse bacteria across dozens of growth conditions. We identified mutant phenotypes for 11,779 protein-coding genes that had not been annotated with a specific function. Many genes could be associated with a specific condition because the gene affected fitness only in that condition, or with another gene in the same bacterium because they had similar mutant phenotypes. Of the poorly annotated genes, 2,316 had associations that have high confidence because theymore » are conserved in other bacteria. By combining these conserved associations with comparative genomics, we identified putative DNA repair proteins; in addition, we propose specific functions for poorly annotated enzymes and transporters and for uncharacterized protein families. Lastly, our study demonstrates the scalability of microbial genetics and its utility for improving gene annotations.« less

  8. Integrative structural annotation of de novo RNA-Seq provides an accurate reference gene set of the enormous genome of the onion (Allium cepa L.)

    PubMed Central

    Kim, Seungill; Kim, Myung-Shin; Kim, Yong-Min; Yeom, Seon-In; Cheong, Kyeongchae; Kim, Ki-Tae; Jeon, Jongbum; Kim, Sunggil; Kim, Do-Sun; Sohn, Seong-Han; Lee, Yong-Hwan; Choi, Doil

    2015-01-01

    The onion (Allium cepa L.) is one of the most widely cultivated and consumed vegetable crops in the world. Although a considerable amount of onion transcriptome data has been deposited into public databases, the sequences of the protein-coding genes are not accurate enough to be used, owing to non-coding sequences intermixed with the coding sequences. We generated a high-quality, annotated onion transcriptome from de novo sequence assembly and intensive structural annotation using the integrated structural gene annotation pipeline (ISGAP), which identified 54,165 protein-coding genes among 165,179 assembled transcripts totalling 203.0 Mb by eliminating the intron sequences. ISGAP performed reliable annotation, recognizing accurate gene structures based on reference proteins, and ab initio gene models of the assembled transcripts. Integrative functional annotation and gene-based SNP analysis revealed a whole biological repertoire of genes and transcriptomic variation in the onion. The method developed in this study provides a powerful tool for the construction of reference gene sets for organisms based solely on de novo transcriptome data. Furthermore, the reference genes and their variation described here for the onion represent essential tools for molecular breeding and gene cloning in Allium spp. PMID:25362073

  9. Analysis of disease-associated objects at the Rat Genome Database

    PubMed Central

    Wang, Shur-Jen; Laulederkind, Stanley J. F.; Hayman, G. T.; Smith, Jennifer R.; Petri, Victoria; Lowry, Timothy F.; Nigam, Rajni; Dwinell, Melinda R.; Worthey, Elizabeth A.; Munzenmaier, Diane H.; Shimoyama, Mary; Jacob, Howard J.

    2013-01-01

    The Rat Genome Database (RGD) is the premier resource for genetic, genomic and phenotype data for the laboratory rat, Rattus norvegicus. In addition to organizing biological data from rats, the RGD team focuses on manual curation of gene–disease associations for rat, human and mouse. In this work, we have analyzed disease-associated strains, quantitative trait loci (QTL) and genes from rats. These disease objects form the basis for seven disease portals. Among disease portals, the cardiovascular disease and obesity/metabolic syndrome portals have the highest number of rat strains and QTL. These two portals share 398 rat QTL, and these shared QTL are highly concentrated on rat chromosomes 1 and 2. For disease-associated genes, we performed gene ontology (GO) enrichment analysis across portals using RatMine enrichment widgets. Fifteen GO terms, five from each GO aspect, were selected to profile enrichment patterns of each portal. Of the selected biological process (BP) terms, ‘regulation of programmed cell death’ was the top enriched term across all disease portals except in the obesity/metabolic syndrome portal where ‘lipid metabolic process’ was the most enriched term. ‘Cytosol’ and ‘nucleus’ were common cellular component (CC) annotations for disease genes, but only the cancer portal genes were highly enriched with ‘nucleus’ annotations. Similar enrichment patterns were observed in a parallel analysis using the DAVID functional annotation tool. The relationship between the preselected 15 GO terms and disease terms was examined reciprocally by retrieving rat genes annotated with these preselected terms. The individual GO term–annotated gene list showed enrichment in physiologically related diseases. For example, the ‘regulation of blood pressure’ genes were enriched with cardiovascular disease annotations, and the ‘lipid metabolic process’ genes with obesity annotations. Furthermore, we were able to enhance enrichment of neurological diseases by combining ‘G-protein coupled receptor binding’ annotated genes with ‘protein kinase binding’ annotated genes. Database URL: http://rgd.mcw.edu PMID:23794737

  10. Genome assembly and transcriptome resource for river buffalo, Bubalus bubalis (2n = 50)

    PubMed Central

    Iamartino, Daniela; Pruitt, Kim D; Sonstegard, Tad; Smith, Timothy P L; Low, Wai Yee; Biagini, Tommaso; Bomba, Lorenzo; Capomaccio, Stefano; Castiglioni, Bianca; Coletta, Angelo; Corrado, Federica; Ferré, Fabrizio; Iannuzzi, Leopoldo; Lawley, Cynthia; Macciotta, Nicolò; McClure, Matthew; Mancini, Giordano; Matassino, Donato; Mazza, Raffaele; Milanesi, Marco; Moioli, Bianca; Morandi, Nicola; Ramunno, Luigi; Peretti, Vincenzo; Pilla, Fabio; Ramelli, Paola; Schroeder, Steven; Strozzi, Francesco; Thibaud-Nissen, Francoise; Zicarelli, Luigi; Ajmone-Marsan, Paolo; Valentini, Alessio; Chillemi, Giovanni; Zimin, Aleksey

    2017-01-01

    Abstract Water buffalo is a globally important species for agriculture and local economies. A de novo assembled, well-annotated reference sequence for the water buffalo is an important prerequisite for studying the biology of this species, and is necessary to manage genetic diversity and to use modern breeding and genomic selection techniques. However, no such genome assembly has been previously reported. There are 2 species of domestic water buffalo, the river (2n = 50) and the swamp (2n = 48) buffalo. Here we describe a draft quality reference sequence for the river buffalo created from Illumina GA and Roche 454 short read sequences using the MaSuRCA assembler. The assembled sequence is 2.83 Gb, consisting of 366 983 scaffolds with a scaffold N50 of 1.41 Mb and contig N50 of 21 398 bp. Annotation of the genome was supported by transcriptome data from 30 tissues and identified 21 711 predicted protein coding genes. Searches for complete mammalian BUSCO gene groups found 98.6% of curated single copy orthologs present among predicted genes, which suggests a high level of completeness of the genome. The annotated sequence is available from NCBI at accession GCA_000471725.1. PMID:29048578

  11. Novel transcriptome assembly and improved annotation of the whiteleg shrimp (Litopenaeus vannamei), a dominant crustacean in global seafood mariculture.

    PubMed

    Ghaffari, Noushin; Sanchez-Flores, Alejandro; Doan, Ryan; Garcia-Orozco, Karina D; Chen, Patricia L; Ochoa-Leyva, Adrian; Lopez-Zavala, Alonso A; Carrasco, J Salvador; Hong, Chris; Brieba, Luis G; Rudiño-Piñera, Enrique; Blood, Philip D; Sawyer, Jason E; Johnson, Charles D; Dindot, Scott V; Sotelo-Mundo, Rogerio R; Criscitiello, Michael F

    2014-11-25

    We present a new transcriptome assembly of the Pacific whiteleg shrimp (Litopenaeus vannamei), the species most farmed for human consumption. Its functional annotation, a substantial improvement over previous ones, is provided freely. RNA-Seq with Illumina HiSeq technology was used to analyze samples extracted from shrimp abdominal muscle, hepatopancreas, gills and pleopods. We used the Trinity and Trinotate software suites for transcriptome assembly and annotation, respectively. The quality of this assembly and the affiliated targeted homology searches greatly enrich the curated transcripts currently available in public databases for this species. Comparison with the model arthropod Daphnia allows some insights into defining characteristics of decapod crustaceans. This large-scale gene discovery gives the broadest depth yet to the annotated transcriptome of this important species and should be of value to ongoing genomics and immunogenetic resistance studies in this shrimp of paramount global economic importance.

  12. Evaluation and integration of functional annotation pipelines for newly sequenced organisms: the potato genome as a test case.

    PubMed

    Amar, David; Frades, Itziar; Danek, Agnieszka; Goldberg, Tatyana; Sharma, Sanjeev K; Hedley, Pete E; Proux-Wera, Estelle; Andreasson, Erik; Shamir, Ron; Tzfadia, Oren; Alexandersson, Erik

    2014-12-05

    For most organisms, even if their genome sequence is available, little functional information about individual genes or proteins exists. Several annotation pipelines have been developed for functional analysis based on sequence, 'omics', and literature data. However, researchers encounter little guidance on how well they perform. Here, we used the recently sequenced potato genome as a case study. The potato genome was selected since its genome is newly sequenced and it is a non-model plant even if there is relatively ample information on individual potato genes, and multiple gene expression profiles are available. We show that the automatic gene annotations of potato have low accuracy when compared to a "gold standard" based on experimentally validated potato genes. Furthermore, we evaluate six state-of-the-art annotation pipelines and show that their predictions are markedly dissimilar (Jaccard similarity coefficient of 0.27 between pipelines on average). To overcome this discrepancy, we introduce a simple GO structure-based algorithm that reconciles the predictions of the different pipelines. We show that the integrated annotation covers more genes, increases by over 50% the number of highly co-expressed GO processes, and obtains much higher agreement with the gold standard. We find that different annotation pipelines produce different results, and show how to integrate them into a unified annotation that is of higher quality than each single pipeline. We offer an improved functional annotation of both PGSC and ITAG potato gene models, as well as tools that can be applied to additional pipelines and improve annotation in other organisms. This will greatly aid future functional analysis of '-omics' datasets from potato and other organisms with newly sequenced genomes. The new potato annotations are available with this paper.

  13. MGmapper: Reference based mapping and taxonomy annotation of metagenomics sequence reads

    PubMed Central

    Lukjancenko, Oksana; Thomsen, Martin Christen Frølund; Maddalena Sperotto, Maria; Lund, Ole; Møller Aarestrup, Frank; Sicheritz-Pontén, Thomas

    2017-01-01

    An increasing amount of species and gene identification studies rely on the use of next generation sequence analysis of either single isolate or metagenomics samples. Several methods are available to perform taxonomic annotations and a previous metagenomics benchmark study has shown that a vast number of false positive species annotations are a problem unless thresholds or post-processing are applied to differentiate between correct and false annotations. MGmapper is a package to process raw next generation sequence data and perform reference based sequence assignment, followed by a post-processing analysis to produce reliable taxonomy annotation at species and strain level resolution. An in-vitro bacterial mock community sample comprised of 8 genuses, 11 species and 12 strains was previously used to benchmark metagenomics classification methods. After applying a post-processing filter, we obtained 100% correct taxonomy assignments at species and genus level. A sensitivity and precision at 75% was obtained for strain level annotations. A comparison between MGmapper and Kraken at species level, shows MGmapper assigns taxonomy at species level using 84.8% of the sequence reads, compared to 70.5% for Kraken and both methods identified all species with no false positives. Extensive read count statistics are provided in plain text and excel sheets for both rejected and accepted taxonomy annotations. The use of custom databases is possible for the command-line version of MGmapper, and the complete pipeline is freely available as a bitbucked package (https://bitbucket.org/genomicepidemiology/mgmapper). A web-version (https://cge.cbs.dtu.dk/services/MGmapper) provides the basic functionality for analysis of small fastq datasets. PMID:28467460

  14. MGmapper: Reference based mapping and taxonomy annotation of metagenomics sequence reads.

    PubMed

    Petersen, Thomas Nordahl; Lukjancenko, Oksana; Thomsen, Martin Christen Frølund; Maddalena Sperotto, Maria; Lund, Ole; Møller Aarestrup, Frank; Sicheritz-Pontén, Thomas

    2017-01-01

    An increasing amount of species and gene identification studies rely on the use of next generation sequence analysis of either single isolate or metagenomics samples. Several methods are available to perform taxonomic annotations and a previous metagenomics benchmark study has shown that a vast number of false positive species annotations are a problem unless thresholds or post-processing are applied to differentiate between correct and false annotations. MGmapper is a package to process raw next generation sequence data and perform reference based sequence assignment, followed by a post-processing analysis to produce reliable taxonomy annotation at species and strain level resolution. An in-vitro bacterial mock community sample comprised of 8 genuses, 11 species and 12 strains was previously used to benchmark metagenomics classification methods. After applying a post-processing filter, we obtained 100% correct taxonomy assignments at species and genus level. A sensitivity and precision at 75% was obtained for strain level annotations. A comparison between MGmapper and Kraken at species level, shows MGmapper assigns taxonomy at species level using 84.8% of the sequence reads, compared to 70.5% for Kraken and both methods identified all species with no false positives. Extensive read count statistics are provided in plain text and excel sheets for both rejected and accepted taxonomy annotations. The use of custom databases is possible for the command-line version of MGmapper, and the complete pipeline is freely available as a bitbucked package (https://bitbucket.org/genomicepidemiology/mgmapper). A web-version (https://cge.cbs.dtu.dk/services/MGmapper) provides the basic functionality for analysis of small fastq datasets.

  15. Long-read sequencing of the coffee bean transcriptome reveals the diversity of full-length transcripts

    PubMed Central

    Cheng, Bing; Furtado, Agnelo

    2017-01-01

    Abstract Polyploidization contributes to the complexity of gene expression, resulting in numerous related but different transcripts. This study explored the transcriptome diversity and complexity of the tetraploid Arabica coffee (Coffea arabica) bean. Long-read sequencing (LRS) by Pacbio Isoform sequencing (Iso-seq) was used to obtain full-length transcripts without the difficulty and uncertainty of assembly required for reads from short-read technologies. The tetraploid transcriptome was annotated and compared with data from the sub-genome progenitors. Caffeine and sucrose genes were targeted for case analysis. An isoform-level tetraploid coffee bean reference transcriptome with 95 995 distinct transcripts (average 3236 bp) was obtained. A total of 88 715 sequences (92.42%) were annotated with BLASTx against NCBI non-redundant plant proteins, including 34 719 high-quality annotations. Further BLASTn analysis against NCBI non-redundant nucleotide sequences, Coffea canephora coding sequences with UTR, C. arabica ESTs, and Rfam resulted in 1213 sequences without hits, were potential novel genes in coffee. Longer UTRs were captured, especially in the 5΄UTRs, facilitating the identification of upstream open reading frames. The LRS also revealed more and longer transcript variants in key caffeine and sucrose metabolism genes from this polyploid genome. Long sequences (>10 kilo base) were poorly annotated. LRS technology shows the limitation of previous studies. It provides an important tool to produce a reference transcriptome including more of the diversity of full-length transcripts to help understand the biology and support the genetic improvement of polyploid species such as coffee. PMID:29048540

  16. DynGO: a tool for visualizing and mining of Gene Ontology and its associations

    PubMed Central

    Liu, Hongfang; Hu, Zhang-Zhi; Wu, Cathy H

    2005-01-01

    Background A large volume of data and information about genes and gene products has been stored in various molecular biology databases. A major challenge for knowledge discovery using these databases is to identify related genes and gene products in disparate databases. The development of Gene Ontology (GO) as a common vocabulary for annotation allows integrated queries across multiple databases and identification of semantically related genes and gene products (i.e., genes and gene products that have similar GO annotations). Meanwhile, dozens of tools have been developed for browsing, mining or editing GO terms, their hierarchical relationships, or their "associated" genes and gene products (i.e., genes and gene products annotated with GO terms). Tools that allow users to directly search and inspect relations among all GO terms and their associated genes and gene products from multiple databases are needed. Results We present a standalone package called DynGO, which provides several advanced functionalities in addition to the standard browsing capability of the official GO browsing tool (AmiGO). DynGO allows users to conduct batch retrieval of GO annotations for a list of genes and gene products, and semantic retrieval of genes and gene products sharing similar GO annotations. The result are shown in an association tree organized according to GO hierarchies and supported with many dynamic display options such as sorting tree nodes or changing orientation of the tree. For GO curators and frequent GO users, DynGO provides fast and convenient access to GO annotation data. DynGO is generally applicable to any data set where the records are annotated with GO terms, as illustrated by two examples. Conclusion We have presented a standalone package DynGO that provides functionalities to search and browse GO and its association databases as well as several additional functions such as batch retrieval and semantic retrieval. The complete documentation and software are freely available for download from the website . PMID:16091147

  17. Deep analysis of wild Vitis flower transcriptome reveals unexplored genome regions associated with sex specification.

    PubMed

    Ramos, Miguel Jesus Nunes; Coito, João Lucas; Fino, Joana; Cunha, Jorge; Silva, Helena; de Almeida, Patrícia Gomes; Costa, Maria Manuela Ribeiro; Amâncio, Sara; Paulo, Octávio S; Rocheta, Margarida

    2017-01-01

    RNA-seq of Vitis during early stages of bud development, in male, female and hermaphrodite flowers, identified new loci outside of annotated gene models, suggesting their involvement in sex establishment. The molecular mechanisms responsible for flower sex specification remain unclear for most plant species. In the case of V. vinifera ssp. vinifera, it is not fully understood what determines hermaphroditism in the domesticated subspecies and male or female flowers in wild dioecious relatives (Vitis vinifera ssp. sylvestris). Here, we describe a de novo assembly of the transcriptome of three flower developmental stages from the three Vitis vinifera flower types. The validation of de novo assembly showed a correlation of 0.825. The main goals of this work were the identification of V. v. sylvestris exclusive transcripts and the characterization of differential gene expression during flower development. RNA from several flower developmental stages was used previously to generate Illumina sequence reads. Through a sequential de novo assembly strategy one comprehensive transcriptome comprising 95,516 non-redundant transcripts was assembled. From this dataset 81,064 transcripts were annotated to V. v. vinifera reference transcriptome and 11,084 were annotated against V. v. vinifera reference genome. Moreover, we found 3368 transcripts that could not be mapped to Vitis reference genome. From all the non-redundant transcripts that were assembled, bioinformatics analysis identified 133 specific of V. v. sylvestris and 516 transcripts differentially expressed among the three flower types. The detection of transcription from areas of the genome not currently annotated suggests active transcription of previously unannotated genomic loci during early stages of bud development.

  18. Sma3s: a three-step modular annotator for large sequence datasets.

    PubMed

    Muñoz-Mérida, Antonio; Viguera, Enrique; Claros, M Gonzalo; Trelles, Oswaldo; Pérez-Pulido, Antonio J

    2014-08-01

    Automatic sequence annotation is an essential component of modern 'omics' studies, which aim to extract information from large collections of sequence data. Most existing tools use sequence homology to establish evolutionary relationships and assign putative functions to sequences. However, it can be difficult to define a similarity threshold that achieves sufficient coverage without sacrificing annotation quality. Defining the correct configuration is critical and can be challenging for non-specialist users. Thus, the development of robust automatic annotation techniques that generate high-quality annotations without needing expert knowledge would be very valuable for the research community. We present Sma3s, a tool for automatically annotating very large collections of biological sequences from any kind of gene library or genome. Sma3s is composed of three modules that progressively annotate query sequences using either: (i) very similar homologues, (ii) orthologous sequences or (iii) terms enriched in groups of homologous sequences. We trained the system using several random sets of known sequences, demonstrating average sensitivity and specificity values of ~85%. In conclusion, Sma3s is a versatile tool for high-throughput annotation of a wide variety of sequence datasets that outperforms the accuracy of other well-established annotation algorithms, and it can enrich existing database annotations and uncover previously hidden features. Importantly, Sma3s has already been used in the functional annotation of two published transcriptomes. © The Author 2014. Published by Oxford University Press on behalf of Kazusa DNA Research Institute.

  19. Integrative structural annotation of de novo RNA-Seq provides an accurate reference gene set of the enormous genome of the onion (Allium cepa L.).

    PubMed

    Kim, Seungill; Kim, Myung-Shin; Kim, Yong-Min; Yeom, Seon-In; Cheong, Kyeongchae; Kim, Ki-Tae; Jeon, Jongbum; Kim, Sunggil; Kim, Do-Sun; Sohn, Seong-Han; Lee, Yong-Hwan; Choi, Doil

    2015-02-01

    The onion (Allium cepa L.) is one of the most widely cultivated and consumed vegetable crops in the world. Although a considerable amount of onion transcriptome data has been deposited into public databases, the sequences of the protein-coding genes are not accurate enough to be used, owing to non-coding sequences intermixed with the coding sequences. We generated a high-quality, annotated onion transcriptome from de novo sequence assembly and intensive structural annotation using the integrated structural gene annotation pipeline (ISGAP), which identified 54,165 protein-coding genes among 165,179 assembled transcripts totalling 203.0 Mb by eliminating the intron sequences. ISGAP performed reliable annotation, recognizing accurate gene structures based on reference proteins, and ab initio gene models of the assembled transcripts. Integrative functional annotation and gene-based SNP analysis revealed a whole biological repertoire of genes and transcriptomic variation in the onion. The method developed in this study provides a powerful tool for the construction of reference gene sets for organisms based solely on de novo transcriptome data. Furthermore, the reference genes and their variation described here for the onion represent essential tools for molecular breeding and gene cloning in Allium spp. © The Author 2014. Published by Oxford University Press on behalf of Kazusa DNA Research Institute.

  20. Using comparative genome analysis to identify problems in annotated microbial genomes.

    PubMed

    Poptsova, Maria S; Gogarten, J Peter

    2010-07-01

    Genome annotation is a tedious task that is mostly done by automated methods; however, the accuracy of these approaches has been questioned since the beginning of the sequencing era. Genome annotation is a multilevel process, and errors can emerge at different stages: during sequencing, as a result of gene-calling procedures, and in the process of assigning gene functions. Missed or wrongly annotated genes differentially impact different types of analyses. Here we discuss and demonstrate how the methods of comparative genome analysis can refine annotations by locating missing orthologues. We also discuss possible reasons for errors and show that the second-generation annotation systems, which combine multiple gene-calling programs with similarity-based methods, perform much better than the first annotation tools. Since old errors may propagate to the newly sequenced genomes, we emphasize that the problem of continuously updating popular public databases is an urgent and unresolved one. Due to the progress in genome-sequencing technologies, automated annotation techniques will remain the main approach in the future. Researchers need to be aware of the existing errors in the annotation of even well-studied genomes, such as Escherichia coli, and consider additional quality control for their results.

  1. Compelling Evidence for a Prostate Cancer Gene at 22q12.3 by the International Consortium for Prostate Cancer Genetics

    PubMed Central

    Camp, Nicola J.; Cannon-Albright, Lisa A.; Farnham, James M.; Baffoe-Bonnie, Agnes B.; George, Asha; Powell, Isaac; Bailey-Wilson, Joan E.; Carpten, John D.; Giles, Graham G.; Hopper, John L.; Severi, Gianluca; English, Dallas R.; Foulkes, William D.; Maehle, Lovise; Moller, Pal; Eeles, Ros; Easton, Douglas; Badzioch, Michael D.; Whittemore, Alice S.; Oakley-Girvan, Ingrid; Hsieh, Chih-Lin; Dimitrov, Latchezar; Xu, Jianfeng; Stanford, Janet L.; Johanneson, Bo; Deutsch, Kerry; McIntosh, Laura; Ostrander, Elaine A.; Wiley, Kathleen E.; Isaacs, Sarah D.; Walsh, Patrick C.; Thibodeau, Stephen N.; McDonnell, Shannon K.; Hebbring, Scott; Schaid, Daniel J.; Lange, Ethan M.; Cooney, Kathleen A.; Tammela, Teuvo L.J.; Schleutker, Johanna; Paiss, Thomas; Maier, Christiane; Grönberg, Henrik; Wiklund, Fredrik; Emanuelsson, Monica; Isaacs, William B.

    2009-01-01

    Previously, an analysis of 14 extended, high-risk Utah pedigrees localized the chromosome 22q linkage region to 3.2 Mb at 22q12.3-13.1 (flanked on each side by three recombinants), which contained 31 annotated genes. In this large, multi-centered, collaborative study, we performed statistical recombinant mapping in fifty-four pedigrees selected to be informative for recombinant mapping from nine member groups of the International Consortium for Prostate Cancer Genetics (ICPCG). These 54 pedigrees included the 14 extended pedigrees from Utah and 40 pedigrees from eight other ICPCG member groups. The additional 40 pedigrees were selected from a total pool of 1,213 such that each pedigree was required to both contain at least four prostate cancer (PRCA) cases and exhibit evidence for linkage to the chromosome 22q region. The recombinant events in these 40 independent pedigrees confirmed the previously proposed region. Further, when all 54 pedigrees were considered, the three-recombinant consensus region was narrowed by more than a megabase to 2.2 Mb at chromosome 22q12.3 flanked by D22S281 and D22S683. This narrower region eliminated 20 annotated genes from that previously proposed, leaving only eleven genes. This region at 22q12.3 is the most consistently identified and smallest linkage region for PRCA. This collaborative study by the ICPCG illustrates the value of consortium efforts and the continued utility of linkage analysis using informative pedigrees to localize genes for complex diseases. PMID:17478474

  2. Multiconstrained gene clustering based on generalized projections

    PubMed Central

    2010-01-01

    Background Gene clustering for annotating gene functions is one of the fundamental issues in bioinformatics. The best clustering solution is often regularized by multiple constraints such as gene expressions, Gene Ontology (GO) annotations and gene network structures. How to integrate multiple pieces of constraints for an optimal clustering solution still remains an unsolved problem. Results We propose a novel multiconstrained gene clustering (MGC) method within the generalized projection onto convex sets (POCS) framework used widely in image reconstruction. Each constraint is formulated as a corresponding set. The generalized projector iteratively projects the clustering solution onto these sets in order to find a consistent solution included in the intersection set that satisfies all constraints. Compared with previous MGC methods, POCS can integrate multiple constraints from different nature without distorting the original constraints. To evaluate the clustering solution, we also propose a new performance measure referred to as Gene Log Likelihood (GLL) that considers genes having more than one function and hence in more than one cluster. Comparative experimental results show that our POCS-based gene clustering method outperforms current state-of-the-art MGC methods. Conclusions The POCS-based MGC method can successfully combine multiple constraints from different nature for gene clustering. Also, the proposed GLL is an effective performance measure for the soft clustering solutions. PMID:20356386

  3. High-throughput annotation of full-length long noncoding RNAs with capture long-read sequencing.

    PubMed

    Lagarde, Julien; Uszczynska-Ratajczak, Barbara; Carbonell, Silvia; Pérez-Lluch, Sílvia; Abad, Amaya; Davis, Carrie; Gingeras, Thomas R; Frankish, Adam; Harrow, Jennifer; Guigo, Roderic; Johnson, Rory

    2017-12-01

    Accurate annotation of genes and their transcripts is a foundation of genomics, but currently no annotation technique combines throughput and accuracy. As a result, reference gene collections remain incomplete-many gene models are fragmentary, and thousands more remain uncataloged, particularly for long noncoding RNAs (lncRNAs). To accelerate lncRNA annotation, the GENCODE consortium has developed RNA Capture Long Seq (CLS), which combines targeted RNA capture with third-generation long-read sequencing. Here we present an experimental reannotation of the GENCODE intergenic lncRNA populations in matched human and mouse tissues that resulted in novel transcript models for 3,574 and 561 gene loci, respectively. CLS approximately doubled the annotated complexity of targeted loci, outperforming existing short-read techniques. Full-length transcript models produced by CLS enabled us to definitively characterize the genomic features of lncRNAs, including promoter and gene structure, and protein-coding potential. Thus, CLS removes a long-standing bottleneck in transcriptome annotation and generates manual-quality full-length transcript models at high-throughput scales.

  4. Evaluating Computational Gene Ontology Annotations.

    PubMed

    Škunca, Nives; Roberts, Richard J; Steffen, Martin

    2017-01-01

    Two avenues to understanding gene function are complementary and often overlapping: experimental work and computational prediction. While experimental annotation generally produces high-quality annotations, it is low throughput. Conversely, computational annotations have broad coverage, but the quality of annotations may be variable, and therefore evaluating the quality of computational annotations is a critical concern.In this chapter, we provide an overview of strategies to evaluate the quality of computational annotations. First, we discuss why evaluating quality in this setting is not trivial. We highlight the various issues that threaten to bias the evaluation of computational annotations, most of which stem from the incompleteness of biological databases. Second, we discuss solutions that address these issues, for example, targeted selection of new experimental annotations and leveraging the existing experimental annotations.

  5. MATRIX FACTORIZATION-BASED DATA FUSION FOR GENE FUNCTION PREDICTION IN BAKER’S YEAST AND SLIME MOLD

    PubMed Central

    ŽITNIK, MARINKA; ZUPAN, BLAŽ

    2014-01-01

    The development of effective methods for the characterization of gene functions that are able to combine diverse data sources in a sound and easily-extendible way is an important goal in computational biology. We have previously developed a general matrix factorization-based data fusion approach for gene function prediction. In this manuscript, we show that this data fusion approach can be applied to gene function prediction and that it can fuse various heterogeneous data sources, such as gene expression profiles, known protein annotations, interaction and literature data. The fusion is achieved by simultaneous matrix tri-factorization that shares matrix factors between sources. We demonstrate the effectiveness of the approach by evaluating its performance on predicting ontological annotations in slime mold D. discoideum and on recognizing proteins of baker’s yeast S. cerevisiae that participate in the ribosome or are located in the cell membrane. Our approach achieves predictive performance comparable to that of the state-of-the-art kernel-based data fusion, but requires fewer data preprocessing steps. PMID:24297565

  6. Discovery of rare protein-coding genes in model methylotroph Methylobacterium extorquens AM1.

    PubMed

    Kumar, Dhirendra; Mondal, Anupam Kumar; Yadav, Amit Kumar; Dash, Debasis

    2014-12-01

    Proteogenomics involves the use of MS to refine annotation of protein-coding genes and discover genes in a genome. We carried out comprehensive proteogenomic analysis of Methylobacterium extorquens AM1 (ME-AM1) from publicly available proteomics data with a motive to improve annotation for methylotrophs; organisms capable of surviving in reduced carbon compounds such as methanol. Besides identifying 2482(50%) proteins, 29 new genes were discovered and 66 annotated gene models were revised in ME-AM1 genome. One such novel gene is identified with 75 peptides, lacks homolog in other methylobacteria but has glycosyl transferase and lipopolysaccharide biosynthesis protein domains, indicating its potential role in outer membrane synthesis. Many novel genes are present only in ME-AM1 among methylobacteria. Distant homologs of these genes in unrelated taxonomic classes and low GC-content of few genes suggest lateral gene transfer as a potential mode of their origin. Annotations of methylotrophy related genes were also improved by the discovery of a short gene in methylotrophy gene island and redefining a gene important for pyrroquinoline quinone synthesis, essential for methylotrophy. The combined use of proteogenomics and rigorous bioinformatics analysis greatly enhanced the annotation of protein-coding genes in model methylotroph ME-AM1 genome. © 2014 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.

  7. WEGO 2.0: a web tool for analyzing and plotting GO annotations, 2018 update.

    PubMed

    Ye, Jia; Zhang, Yong; Cui, Huihai; Liu, Jiawei; Wu, Yuqing; Cheng, Yun; Xu, Huixing; Huang, Xingxin; Li, Shengting; Zhou, An; Zhang, Xiuqing; Bolund, Lars; Chen, Qiang; Wang, Jian; Yang, Huanming; Fang, Lin; Shi, Chunmei

    2018-05-18

    WEGO (Web Gene Ontology Annotation Plot), created in 2006, is a simple but useful tool for visualizing, comparing and plotting GO (Gene Ontology) annotation results. Owing largely to the rapid development of high-throughput sequencing and the increasing acceptance of GO, WEGO has benefitted from outstanding performance regarding the number of users and citations in recent years, which motivated us to update to version 2.0. WEGO uses the GO annotation results as input. Based on GO's standardized DAG (Directed Acyclic Graph) structured vocabulary system, the number of genes corresponding to each GO ID is calculated and shown in a graphical format. WEGO 2.0 updates have targeted four aspects, aiming to provide a more efficient and up-to-date approach for comparative genomic analyses. First, the number of input files, previously limited to three, is now unlimited, allowing WEGO to analyze multiple datasets. Also added in this version are the reference datasets of nine model species that can be adopted as baselines in genomic comparative analyses. Furthermore, in the analyzing processes each Chi-square test is carried out for multiple datasets instead of every two samples. At last, WEGO 2.0 provides an additional output graph along with the traditional WEGO histogram, displaying the sorted P-values of GO terms and indicating their significant differences. At the same time, WEGO 2.0 features an entirely new user interface. WEGO is available for free at http://wego.genomics.org.cn.

  8. Analyzing gene expression data in mice with the Neuro Behavior Ontology.

    PubMed

    Hoehndorf, Robert; Hancock, John M; Hardy, Nigel W; Mallon, Ann-Marie; Schofield, Paul N; Gkoutos, Georgios V

    2014-02-01

    We have applied the Neuro Behavior Ontology (NBO), an ontology for the annotation of behavioral gene functions and behavioral phenotypes, to the annotation of more than 1,000 genes in the mouse that are known to play a role in behavior. These annotations can be explored by researchers interested in genes involved in particular behaviors and used computationally to provide insights into the behavioral phenotypes resulting from differences in gene expression. We developed the OntoFUNC tool and have applied it to enrichment analyses over the NBO to provide high-level behavioral interpretations of gene expression datasets. The resulting increase in the number of gene annotations facilitates the identification of behavioral or neurologic processes by assisting the formulation of hypotheses about the relationships between gene, processes, and phenotypic manifestations resulting from behavioral observations.

  9. Increased alignment sensitivity improves the usage of genome alignments for comparative gene annotation.

    PubMed

    Sharma, Virag; Hiller, Michael

    2017-08-21

    Genome alignments provide a powerful basis to transfer gene annotations from a well-annotated reference genome to many other aligned genomes. The completeness of these annotations crucially depends on the sensitivity of the underlying genome alignment. Here, we investigated the impact of the genome alignment parameters and found that parameters with a higher sensitivity allow the detection of thousands of novel alignments between orthologous exons that have been missed before. In particular, comparisons between species separated by an evolutionary distance of >0.75 substitutions per neutral site, like human and other non-placental vertebrates, benefit from increased sensitivity. To systematically test if increased sensitivity improves comparative gene annotations, we built a multiple alignment of 144 vertebrate genomes and used this alignment to map human genes to the other 143 vertebrates with CESAR. We found that higher alignment sensitivity substantially improves the completeness of comparative gene annotations by adding on average 2382 and 7440 novel exons and 117 and 317 novel genes for mammalian and non-mammalian species, respectively. Our results suggest a more sensitive alignment strategy that should generally be used for genome alignments between distantly-related species. Our 144-vertebrate genome alignment and the comparative gene annotations (https://bds.mpi-cbg.de/hillerlab/144VertebrateAlignment_CESAR/) are a valuable resource for comparative genomics. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.

  10. IMG ER: a system for microbial genome annotation expert review and curation.

    PubMed

    Markowitz, Victor M; Mavromatis, Konstantinos; Ivanova, Natalia N; Chen, I-Min A; Chu, Ken; Kyrpides, Nikos C

    2009-09-01

    A rapidly increasing number of microbial genomes are sequenced by organizations worldwide and are eventually included into various public genome data resources. The quality of the annotations depends largely on the original dataset providers, with erroneous or incomplete annotations often carried over into the public resources and difficult to correct. We have developed an Expert Review (ER) version of the Integrated Microbial Genomes (IMG) system, with the goal of supporting systematic and efficient revision of microbial genome annotations. IMG ER provides tools for the review and curation of annotations of both new and publicly available microbial genomes within IMG's rich integrated genome framework. New genome datasets are included into IMG ER prior to their public release either with their native annotations or with annotations generated by IMG ER's annotation pipeline. IMG ER tools allow addressing annotation problems detected with IMG's comparative analysis tools, such as genes missed by gene prediction pipelines or genes without an associated function. Over the past year, IMG ER was used for improving the annotations of about 150 microbial genomes.

  11. High-performance web services for querying gene and variant annotation.

    PubMed

    Xin, Jiwen; Mark, Adam; Afrasiabi, Cyrus; Tsueng, Ginger; Juchler, Moritz; Gopal, Nikhil; Stupp, Gregory S; Putman, Timothy E; Ainscough, Benjamin J; Griffith, Obi L; Torkamani, Ali; Whetzel, Patricia L; Mungall, Christopher J; Mooney, Sean D; Su, Andrew I; Wu, Chunlei

    2016-05-06

    Efficient tools for data management and integration are essential for many aspects of high-throughput biology. In particular, annotations of genes and human genetic variants are commonly used but highly fragmented across many resources. Here, we describe MyGene.info and MyVariant.info, high-performance web services for querying gene and variant annotation information. These web services are currently accessed more than three million times permonth. They also demonstrate a generalizable cloud-based model for organizing and querying biological annotation information. MyGene.info and MyVariant.info are provided as high-performance web services, accessible at http://mygene.info and http://myvariant.info . Both are offered free of charge to the research community.

  12. The emerging role of genomics in the diagnosis and workup of congenital urinary tract defects: a novel deletion syndrome on chromosome 3q13.31-22.1

    PubMed Central

    Materna-Kiryluk, Anna; Kiryluk, Krzysztof; Burgess, Katelyn E; Bieleninik, Arkadiusz; Sanna-Cherchi, Simone; Gharavi, Ali G.; Latos-Bielenska, Anna

    2014-01-01

    Background Copy number variants (CNVs) are increasingly recognized as an important cause of congenital malformations and likely explain over 16% cases of CAKUT. Here, we illustrate how a molecular diagnosis of CNV can inform the clinical management of a pediatric patient presenting with CAKUT and other organ defects. Methods We describe a 14 year-old girl with a large de novo deletion of chromosome 3q13.31-22.1 that disrupts 101 known genes and manifests with CAKUT, neurodevelopmental delay, agenesis of corpus callosum (ACC), cardiac malformations, electrolyte and endocrine disorders, skeletal abnormalities and dysmorphic features. We perform extensive annotation of the deleted region to prioritize genes for specific phenotypes and to predict future disease risk. Results Our case defined new minimal chromosomal candidate regions for both CAKUT and ACC. Moreover, the presence of the CASR gene in the deleted interval predicted a diagnosis of hypocalciuric hypercalcemia, which was confirmed by serum and urine chemistries. Our gene annotation explained clinical hypothyroidism and predicted that the index case is at increased risk of thoracic aortic aneurysm, renal cell carcinoma and myeloproliferative disorder. Conclusions Extended annotation of CNV regions refines diagnosis and uncovers previously unrecognized phenotypic features. This approach enables personalized treatment and prevention strategies in patients harboring genomic deletions. PMID:24292865

  13. Fuzzy measures on the Gene Ontology for gene product similarity.

    PubMed

    Popescu, Mihail; Keller, James M; Mitchell, Joyce A

    2006-01-01

    One of the most important objects in bioinformatics is a gene product (protein or RNA). For many gene products, functional information is summarized in a set of Gene Ontology (GO) annotations. For these genes, it is reasonable to include similarity measures based on the terms found in the GO or other taxonomy. In this paper, we introduce several novel measures for computing the similarity of two gene products annotated with GO terms. The fuzzy measure similarity (FMS) has the advantage that it takes into consideration the context of both complete sets of annotation terms when computing the similarity between two gene products. When the two gene products are not annotated by common taxonomy terms, we propose a method that avoids a zero similarity result. To account for the variations in the annotation reliability, we propose a similarity measure based on the Choquet integral. These similarity measures provide extra tools for the biologist in search of functional information for gene products. The initial testing on a group of 194 sequences representing three proteins families shows a higher correlation of the FMS and Choquet similarities to the BLAST sequence similarities than the traditional similarity measures such as pairwise average or pairwise maximum.

  14. Experimental annotation of post-translational features and translated coding regions in the pathogen Salmonella Typhimurium

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Ansong, Charles; Tolic, Nikola; Purvine, Samuel O.

    Complete and accurate genome annotation is crucial for comprehensive and systematic studies of biological systems. For example systems biology-oriented genome scale modeling efforts greatly benefit from accurate annotation of protein-coding genes to develop proper functioning models. However, determining protein-coding genes for most new genomes is almost completely performed by inference, using computational predictions with significant documented error rates (> 15%). Furthermore, gene prediction programs provide no information on biologically important post-translational processing events critical for protein function. With the ability to directly measure peptides arising from expressed proteins, mass spectrometry-based proteomics approaches can be used to augment and verify codingmore » regions of a genomic sequence and importantly detect post-translational processing events. In this study we utilized “shotgun” proteomics to guide accurate primary genome annotation of the bacterial pathogen Salmonella Typhimurium 14028 to facilitate a systems-level understanding of Salmonella biology. The data provides protein-level experimental confirmation for 44% of predicted protein-coding genes, suggests revisions to 48 genes assigned incorrect translational start sites, and uncovers 13 non-annotated genes missed by gene prediction programs. We also present a comprehensive analysis of post-translational processing events in Salmonella, revealing a wide range of complex chemical modifications (70 distinct modifications) and confirming more than 130 signal peptide and N-terminal methionine cleavage events in Salmonella. This study highlights several ways in which proteomics data applied during the primary stages of annotation can improve the quality of genome annotations, especially with regards to the annotation of mature protein products.« less

  15. Transcriptome de novo assembly from next-generation sequencing and comparative analyses in the hexaploid salt marsh species Spartina maritima and Spartina alterniflora (Poaceae)

    PubMed Central

    Ferreira de Carvalho, J; Poulain, J; Da Silva, C; Wincker, P; Michon-Coudouel, S; Dheilly, A; Naquin, D; Boutte, J; Salmon, A; Ainouche, M

    2013-01-01

    Spartina species have a critical ecological role in salt marshes and represent an excellent system to investigate recurrent polyploid speciation. Using the 454 GS-FLX pyrosequencer, we assembled and annotated the first reference transcriptome (from roots and leaves) for two related hexaploid Spartina species that hybridize in Western Europe, the East American invasive Spartina alterniflora and the Euro-African S. maritima. The de novo read assembly generated 38 478 consensus sequences and 99% found an annotation using Poaceae databases, representing a total of 16 753 non-redundant genes. Spartina expressed sequence tags were mapped onto the Sorghum bicolor genome, where they were distributed among the subtelomeric arms of the 10 S. bicolor chromosomes, with high gene density correlation. Normalization of the complementary DNA library improved the number of annotated genes. Ecologically relevant genes were identified among GO biological function categories in salt and heavy metal stress response, C4 photosynthesis and in lignin and cellulose metabolism. Expression of some of these genes had been found to be altered by hybridization and genome duplication in a previous microarray-based study in Spartina. As these species are hexaploid, up to three duplicated homoeologs may be expected per locus. When analyzing sequence polymorphism at four different loci in S. maritima and S. alterniflora, we found up to four haplotypes per locus, suggesting the presence of two expressed homoeologous sequences with one or two allelic variants each. This reference transcriptome will allow analysis of specific Spartina genes of ecological or evolutionary interest, estimation of homoeologous gene expression variation using RNA-seq and further gene expression evolution analyses in natural populations. PMID:23149455

  16. GeneMachine: gene prediction and sequence annotation.

    PubMed

    Makalowska, I; Ryan, J F; Baxevanis, A D

    2001-09-01

    A number of free-standing programs have been developed in order to help researchers find potential coding regions and deduce gene structure for long stretches of what is essentially 'anonymous DNA'. As these programs apply inherently different criteria to the question of what is and is not a coding region, multiple algorithms should be used in the course of positional cloning and positional candidate projects to assure that all potential coding regions within a previously-identified critical region are identified. We have developed a gene identification tool called GeneMachine which allows users to query multiple exon and gene prediction programs in an automated fashion. BLAST searches are also performed in order to see whether a previously-characterized coding region corresponds to a region in the query sequence. A suite of Perl programs and modules are used to run MZEF, GENSCAN, GRAIL 2, FGENES, RepeatMasker, Sputnik, and BLAST. The results of these runs are then parsed and written into ASN.1 format. Output files can be opened using NCBI Sequin, in essence using Sequin as both a workbench and as a graphical viewer. The main feature of GeneMachine is that the process is fully automated; the user is only required to launch GeneMachine and then open the resulting file with Sequin. Annotations can then be made to these results prior to submission to GenBank, thereby increasing the intrinsic value of these data. GeneMachine is freely-available for download at http://genome.nhgri.nih.gov/genemachine. A public Web interface to the GeneMachine server for academic and not-for-profit users is available at http://genemachine.nhgri.nih.gov. The Web supplement to this paper may be found at http://genome.nhgri.nih.gov/genemachine/supplement/.

  17. High-throughput comparison, functional annotation, and metabolic modeling of plant genomes using the PlantSEED resource

    PubMed Central

    Seaver, Samuel M. D.; Gerdes, Svetlana; Frelin, Océane; Lerma-Ortiz, Claudia; Bradbury, Louis M. T.; Zallot, Rémi; Hasnain, Ghulam; Niehaus, Thomas D.; El Yacoubi, Basma; Pasternak, Shiran; Olson, Robert; Pusch, Gordon; Overbeek, Ross; Stevens, Rick; de Crécy-Lagard, Valérie; Ware, Doreen; Hanson, Andrew D.; Henry, Christopher S.

    2014-01-01

    The increasing number of sequenced plant genomes is placing new demands on the methods applied to analyze, annotate, and model these genomes. Today’s annotation pipelines result in inconsistent gene assignments that complicate comparative analyses and prevent efficient construction of metabolic models. To overcome these problems, we have developed the PlantSEED, an integrated, metabolism-centric database to support subsystems-based annotation and metabolic model reconstruction for plant genomes. PlantSEED combines SEED subsystems technology, first developed for microbial genomes, with refined protein families and biochemical data to assign fully consistent functional annotations to orthologous genes, particularly those encoding primary metabolic pathways. Seamless integration with its parent, the prokaryotic SEED database, makes PlantSEED a unique environment for cross-kingdom comparative analysis of plant and bacterial genomes. The consistent annotations imposed by PlantSEED permit rapid reconstruction and modeling of primary metabolism for all plant genomes in the database. This feature opens the unique possibility of model-based assessment of the completeness and accuracy of gene annotation and thus allows computational identification of genes and pathways that are restricted to certain genomes or need better curation. We demonstrate the PlantSEED system by producing consistent annotations for 10 reference genomes. We also produce a functioning metabolic model for each genome, gapfilling to identify missing annotations and proposing gene candidates for missing annotations. Models are built around an extended biomass composition representing the most comprehensive published to date. To our knowledge, our models are the first to be published for seven of the genomes analyzed. PMID:24927599

  18. High-throughput comparison, functional annotation, and metabolic modeling of plant genomes using the PlantSEED resource.

    PubMed

    Seaver, Samuel M D; Gerdes, Svetlana; Frelin, Océane; Lerma-Ortiz, Claudia; Bradbury, Louis M T; Zallot, Rémi; Hasnain, Ghulam; Niehaus, Thomas D; El Yacoubi, Basma; Pasternak, Shiran; Olson, Robert; Pusch, Gordon; Overbeek, Ross; Stevens, Rick; de Crécy-Lagard, Valérie; Ware, Doreen; Hanson, Andrew D; Henry, Christopher S

    2014-07-01

    The increasing number of sequenced plant genomes is placing new demands on the methods applied to analyze, annotate, and model these genomes. Today's annotation pipelines result in inconsistent gene assignments that complicate comparative analyses and prevent efficient construction of metabolic models. To overcome these problems, we have developed the PlantSEED, an integrated, metabolism-centric database to support subsystems-based annotation and metabolic model reconstruction for plant genomes. PlantSEED combines SEED subsystems technology, first developed for microbial genomes, with refined protein families and biochemical data to assign fully consistent functional annotations to orthologous genes, particularly those encoding primary metabolic pathways. Seamless integration with its parent, the prokaryotic SEED database, makes PlantSEED a unique environment for cross-kingdom comparative analysis of plant and bacterial genomes. The consistent annotations imposed by PlantSEED permit rapid reconstruction and modeling of primary metabolism for all plant genomes in the database. This feature opens the unique possibility of model-based assessment of the completeness and accuracy of gene annotation and thus allows computational identification of genes and pathways that are restricted to certain genomes or need better curation. We demonstrate the PlantSEED system by producing consistent annotations for 10 reference genomes. We also produce a functioning metabolic model for each genome, gapfilling to identify missing annotations and proposing gene candidates for missing annotations. Models are built around an extended biomass composition representing the most comprehensive published to date. To our knowledge, our models are the first to be published for seven of the genomes analyzed.

  19. Transcriptome sequence analysis of an ornamental plant, Ananas comosus var. bracteatus, revealed the potential unigenes involved in terpenoid and phenylpropanoid biosynthesis.

    PubMed

    Ma, Jun; Kanakala, S; He, Yehua; Zhang, Junli; Zhong, Xiaolan

    2015-01-01

    Ananas comosus var. bracteatus (Red Pineapple) is an important ornamental plant for its colorful leaves and decorative red fruits. Because of its complex genome, it is difficult to understand the molecular mechanisms involved in the growth and development. Thus high-throughput transcriptome sequencing of Ananas comosus var. bracteatus is necessary to generate large quantities of transcript sequences for the purpose of gene discovery and functional genomic studies. The Ananas comosus var. bracteatus transcriptome was sequenced by the Illumina paired-end sequencing technology. We obtained a total of 23.5 million high quality sequencing reads, 1,555,808 contigs and 41,052 unigenes. In total 41,052 unigenes of Ananas comosus var. bracteatus, 23,275 unigenes were annotated in the NCBI non-redundant protein database and 23,134 unigenes were annotated in the Swiss-Port database. Out of these, 17,748 and 8,505 unigenes were assigned to gene ontology categories and clusters of orthologous groups, respectively. Functional annotation against Kyoto Encyclopedia of Genes and Genomes Pathway database identified 5,825 unigenes which were mapped to 117 pathways. The assembly predicted many unigenes that were previously unknown. The annotated unigenes were compared against pineapple, rice, maize, Arabidopsis, and sorghum. Unigenes that did not match any of those five sequence datasets are considered to be Ananas comosus var. bracteatus unique. We predicted unigenes encoding enzymes involved in terpenoid and phenylpropanoid biosynthesis. The sequence data provide the most comprehensive transcriptomic resource currently available for Ananas comosus var. bracteatus. To our knowledge; this is the first report on the de novo transcriptome sequencing of the Ananas comosus var. bracteatus. Unigenes obtained in this study, may help improve future gene expression, genetic and genomics studies in Ananas comosus var. bracteatus.

  20. Transcriptome Sequence Analysis of an Ornamental Plant, Ananas comosus var. bracteatus, Revealed the Potential Unigenes Involved in Terpenoid and Phenylpropanoid Biosynthesis

    PubMed Central

    Ma, Jun; Kanakala, S.; He, Yehua; Zhang, Junli; Zhong, Xiaolan

    2015-01-01

    Background Ananas comosus var. bracteatus (Red Pineapple) is an important ornamental plant for its colorful leaves and decorative red fruits. Because of its complex genome, it is difficult to understand the molecular mechanisms involved in the growth and development. Thus high-throughput transcriptome sequencing of Ananas comosus var. bracteatus is necessary to generate large quantities of transcript sequences for the purpose of gene discovery and functional genomic studies. Results The Ananas comosus var. bracteatus transcriptome was sequenced by the Illumina paired-end sequencing technology. We obtained a total of 23.5 million high quality sequencing reads, 1,555,808 contigs and 41,052 unigenes. In total 41,052 unigenes of Ananas comosus var. bracteatus, 23,275 unigenes were annotated in the NCBI non-redundant protein database and 23,134 unigenes were annotated in the Swiss-Port database. Out of these, 17,748 and 8,505 unigenes were assigned to gene ontology categories and clusters of orthologous groups, respectively. Functional annotation against Kyoto Encyclopedia of Genes and Genomes Pathway database identified 5,825 unigenes which were mapped to 117 pathways. The assembly predicted many unigenes that were previously unknown. The annotated unigenes were compared against pineapple, rice, maize, Arabidopsis, and sorghum. Unigenes that did not match any of those five sequence datasets are considered to be Ananas comosus var. bracteatus unique. We predicted unigenes encoding enzymes involved in terpenoid and phenylpropanoid biosynthesis. Conclusion The sequence data provide the most comprehensive transcriptomic resource currently available for Ananas comosus var. bracteatus. To our knowledge; this is the first report on the de novo transcriptome sequencing of the Ananas comosus var. bracteatus. Unigenes obtained in this study, may help improve future gene expression, genetic and genomics studies in Ananas comosus var. bracteatus. PMID:25769053

  1. Dizeez: An Online Game for Human Gene-Disease Annotation

    PubMed Central

    Loguercio, Salvatore; Good, Benjamin M.; Su, Andrew I.

    2013-01-01

    Structured gene annotations are a foundation upon which many bioinformatics and statistical analyses are built. However the structured annotations available in public databases are a sparse representation of biological knowledge as a whole. The rate of biomedical data generation is such that centralized biocuration efforts struggle to keep up. New models for gene annotation need to be explored that expand the pace at which we are able to structure biomedical knowledge. Recently, online games have emerged as an effective way to recruit, engage and organize large numbers of volunteers to help address difficult biological challenges. For example, games have been successfully developed for protein folding (Foldit), multiple sequence alignment (Phylo) and RNA structure design (EteRNA). Here we present Dizeez, a simple online game built with the purpose of structuring knowledge of gene-disease associations. Preliminary results from game play online and at scientific conferences suggest that Dizeez is producing valid gene-disease annotations not yet present in any public database. These early results provide a basic proof of principle that online games can be successfully applied to the challenge of gene annotation. Dizeez is available at http://genegames.org. PMID:23951102

  2. Determining Semantically Related Significant Genes.

    PubMed

    Taha, Kamal

    2014-01-01

    GO relation embodies some aspects of existence dependency. If GO term xis existence-dependent on GO term y, the presence of y implies the presence of x. Therefore, the genes annotated with the function of the GO term y are usually functionally and semantically related to the genes annotated with the function of the GO term x. A large number of gene set enrichment analysis methods have been developed in recent years for analyzing gene sets enrichment. However, most of these methods overlook the structural dependencies between GO terms in GO graph by not considering the concept of existence dependency. We propose in this paper a biological search engine called RSGSearch that identifies enriched sets of genes annotated with different functions using the concept of existence dependency. We observe that GO term xcannot be existence-dependent on GO term y, if x- and y- have the same specificity (biological characteristics). After encoding into a numeric format the contributions of GO terms annotating target genes to the semantics of their lowest common ancestors (LCAs), RSGSearch uses microarray experiment to identify the most significant LCA that annotates the result genes. We evaluated RSGSearch experimentally and compared it with five gene set enrichment systems. Results showed marked improvement.

  3. Genome assembly and transcriptome resource for river buffalo, Bubalus bubalis (2n = 50).

    PubMed

    Williams, John L; Iamartino, Daniela; Pruitt, Kim D; Sonstegard, Tad; Smith, Timothy P L; Low, Wai Yee; Biagini, Tommaso; Bomba, Lorenzo; Capomaccio, Stefano; Castiglioni, Bianca; Coletta, Angelo; Corrado, Federica; Ferré, Fabrizio; Iannuzzi, Leopoldo; Lawley, Cynthia; Macciotta, Nicolò; McClure, Matthew; Mancini, Giordano; Matassino, Donato; Mazza, Raffaele; Milanesi, Marco; Moioli, Bianca; Morandi, Nicola; Ramunno, Luigi; Peretti, Vincenzo; Pilla, Fabio; Ramelli, Paola; Schroeder, Steven; Strozzi, Francesco; Thibaud-Nissen, Francoise; Zicarelli, Luigi; Ajmone-Marsan, Paolo; Valentini, Alessio; Chillemi, Giovanni; Zimin, Aleksey

    2017-10-01

    Water buffalo is a globally important species for agriculture and local economies. A de novo assembled, well-annotated reference sequence for the water buffalo is an important prerequisite for studying the biology of this species, and is necessary to manage genetic diversity and to use modern breeding and genomic selection techniques. However, no such genome assembly has been previously reported. There are 2 species of domestic water buffalo, the river (2 n = 50) and the swamp (2 n = 48) buffalo. Here we describe a draft quality reference sequence for the river buffalo created from Illumina GA and Roche 454 short read sequences using the MaSuRCA assembler. The assembled sequence is 2.83 Gb, consisting of 366 983 scaffolds with a scaffold N50 of 1.41 Mb and contig N50 of 21 398 bp. Annotation of the genome was supported by transcriptome data from 30 tissues and identified 21 711 predicted protein coding genes. Searches for complete mammalian BUSCO gene groups found 98.6% of curated single copy orthologs present among predicted genes, which suggests a high level of completeness of the genome. The annotated sequence is available from NCBI at accession GCA_000471725.1. © The Author 2017. Published by Oxford University Press.

  4. Comprehensive coverage of cardiovascular disease data in the disease portals at the Rat Genome Database.

    PubMed

    Wang, Shur-Jen; Laulederkind, Stanley J F; Hayman, G Thomas; Petri, Victoria; Smith, Jennifer R; Tutaj, Marek; Nigam, Rajni; Dwinell, Melinda R; Shimoyama, Mary

    2016-08-01

    Cardiovascular diseases are complex diseases caused by a combination of genetic and environmental factors. To facilitate progress in complex disease research, the Rat Genome Database (RGD) provides the community with a disease portal where genome objects and biological data related to cardiovascular diseases are systematically organized. The purpose of this study is to present biocuration at RGD, including disease, genetic, and pathway data. The RGD curation team uses controlled vocabularies/ontologies to organize data curated from the published literature or imported from disease and pathway databases. These organized annotations are associated with genes, strains, and quantitative trait loci (QTLs), thus linking functional annotations to genome objects. Screen shots from the web pages are used to demonstrate the organization of annotations at RGD. The human cardiovascular disease genes identified by annotations were grouped according to data sources and their annotation profiles were compared by in-house tools and other enrichment tools available to the public. The analysis results show that the imported cardiovascular disease genes from ClinVar and OMIM are functionally different from the RGD manually curated genes in terms of pathway and Gene Ontology annotations. The inclusion of disease genes from other databases enriches the collection of disease genes not only in quantity but also in quality. Copyright © 2016 the American Physiological Society.

  5. NaviGO: interactive tool for visualization and functional similarity and coherence analysis with gene ontology.

    PubMed

    Wei, Qing; Khan, Ishita K; Ding, Ziyun; Yerneni, Satwica; Kihara, Daisuke

    2017-03-20

    The number of genomics and proteomics experiments is growing rapidly, producing an ever-increasing amount of data that are awaiting functional interpretation. A number of function prediction algorithms were developed and improved to enable fast and automatic function annotation. With the well-defined structure and manual curation, Gene Ontology (GO) is the most frequently used vocabulary for representing gene functions. To understand relationship and similarity between GO annotations of genes, it is important to have a convenient pipeline that quantifies and visualizes the GO function analyses in a systematic fashion. NaviGO is a web-based tool for interactive visualization, retrieval, and computation of functional similarity and associations of GO terms and genes. Similarity of GO terms and gene functions is quantified with six different scores including protein-protein interaction and context based association scores we have developed in our previous works. Interactive navigation of the GO function space provides intuitive and effective real-time visualization of functional groupings of GO terms and genes as well as statistical analysis of enriched functions. We developed NaviGO, which visualizes and analyses functional similarity and associations of GO terms and genes. The NaviGO webserver is freely available at: http://kiharalab.org/web/navigo .

  6. An efficient annotation and gene-expression derivation tool for Illumina Solexa datasets.

    PubMed

    Hosseini, Parsa; Tremblay, Arianne; Matthews, Benjamin F; Alkharouf, Nadim W

    2010-07-02

    The data produced by an Illumina flow cell with all eight lanes occupied, produces well over a terabyte worth of images with gigabytes of reads following sequence alignment. The ability to translate such reads into meaningful annotation is therefore of great concern and importance. Very easily, one can get flooded with such a great volume of textual, unannotated data irrespective of read quality or size. CASAVA, a optional analysis tool for Illumina sequencing experiments, enables the ability to understand INDEL detection, SNP information, and allele calling. To not only extract from such analysis, a measure of gene expression in the form of tag-counts, but furthermore to annotate such reads is therefore of significant value. We developed TASE (Tag counting and Analysis of Solexa Experiments), a rapid tag-counting and annotation software tool specifically designed for Illumina CASAVA sequencing datasets. Developed in Java and deployed using jTDS JDBC driver and a SQL Server backend, TASE provides an extremely fast means of calculating gene expression through tag-counts while annotating sequenced reads with the gene's presumed function, from any given CASAVA-build. Such a build is generated for both DNA and RNA sequencing. Analysis is broken into two distinct components: DNA sequence or read concatenation, followed by tag-counting and annotation. The end result produces output containing the homology-based functional annotation and respective gene expression measure signifying how many times sequenced reads were found within the genomic ranges of functional annotations. TASE is a powerful tool to facilitate the process of annotating a given Illumina Solexa sequencing dataset. Our results indicate that both homology-based annotation and tag-count analysis are achieved in very efficient times, providing researchers to delve deep in a given CASAVA-build and maximize information extraction from a sequencing dataset. TASE is specially designed to translate sequence data in a CASAVA-build into functional annotations while producing corresponding gene expression measurements. Achieving such analysis is executed in an ultrafast and highly efficient manner, whether the analysis be a single-read or paired-end sequencing experiment. TASE is a user-friendly and freely available application, allowing rapid analysis and annotation of any given Illumina Solexa sequencing dataset with ease.

  7. Widespread occurrence of organelle genome-encoded 5S rRNAs including permuted molecules

    PubMed Central

    Valach, Matus; Burger, Gertraud; Gray, Michael W.; Lang, B. Franz

    2014-01-01

    5S Ribosomal RNA (5S rRNA) is a universal component of ribosomes, and the corresponding gene is easily identified in archaeal, bacterial and nuclear genome sequences. However, organelle gene homologs (rrn5) appear to be absent from most mitochondrial and several chloroplast genomes. Here, we re-examine the distribution of organelle rrn5 by building mitochondrion- and plastid-specific covariance models (CMs) with which we screened organelle genome sequences. We not only recover all organelle rrn5 genes annotated in GenBank records, but also identify more than 50 previously unrecognized homologs in mitochondrial genomes of various stramenopiles, red algae, cryptomonads, malawimonads and apusozoans, and surprisingly, in the apicoplast (highly derived plastid) genomes of the coccidian pathogens Toxoplasma gondii and Eimeria tenella. Comparative modeling of RNA secondary structure reveals that mitochondrial 5S rRNAs from brown algae adopt a permuted triskelion shape that has not been seen elsewhere. Expression of the newly predicted rrn5 genes is confirmed experimentally in 10 instances, based on our own and published RNA-Seq data. This study establishes that particularly mitochondrial 5S rRNA has a much broader taxonomic distribution and a much larger structural variability than previously thought. The newly developed CMs will be made available via the Rfam database and the MFannot organelle genome annotator. PMID:25429974

  8. Widespread occurrence of organelle genome-encoded 5S rRNAs including permuted molecules.

    PubMed

    Valach, Matus; Burger, Gertraud; Gray, Michael W; Lang, B Franz

    2014-12-16

    5S Ribosomal RNA (5S rRNA) is a universal component of ribosomes, and the corresponding gene is easily identified in archaeal, bacterial and nuclear genome sequences. However, organelle gene homologs (rrn5) appear to be absent from most mitochondrial and several chloroplast genomes. Here, we re-examine the distribution of organelle rrn5 by building mitochondrion- and plastid-specific covariance models (CMs) with which we screened organelle genome sequences. We not only recover all organelle rrn5 genes annotated in GenBank records, but also identify more than 50 previously unrecognized homologs in mitochondrial genomes of various stramenopiles, red algae, cryptomonads, malawimonads and apusozoans, and surprisingly, in the apicoplast (highly derived plastid) genomes of the coccidian pathogens Toxoplasma gondii and Eimeria tenella. Comparative modeling of RNA secondary structure reveals that mitochondrial 5S rRNAs from brown algae adopt a permuted triskelion shape that has not been seen elsewhere. Expression of the newly predicted rrn5 genes is confirmed experimentally in 10 instances, based on our own and published RNA-Seq data. This study establishes that particularly mitochondrial 5S rRNA has a much broader taxonomic distribution and a much larger structural variability than previously thought. The newly developed CMs will be made available via the Rfam database and the MFannot organelle genome annotator. © The Author(s) 2014. Published by Oxford University Press on behalf of Nucleic Acids Research.

  9. A transducer for microbial sensory rhodopsin that adopts GTG as a start codon is identified in Haloarcula marismortui.

    PubMed

    Fu, Hsu-Yuan; Lu, Yen-Hsu; Yi, Hsiu-Ping; Yang, Chii-Shen

    2013-04-05

    Microbial sensory rhodopsins are known to mediate phototaxis, and all of the known sensory rhodopsins execute this function with a specific cognate transducer that has two-transmembrane (2-TM) regions. In the genome of Haloarcula marismortui, a total of six rhodopsin genes were annotated, and we previously showed three of them to be the ion type and suggested the other three as sensory type, even though the candidate transducer gene, htr, for HmSRI was missing the 2-TM region that is found in all of the other known transducers. Here we showed this htr gene featured a preceding 2-TM region when the alternative start codon GTG located 291 nucleotides upstream of the original annotated open reading frame (ORF) was introduced and it is named as htrI in this study. Overexpression of HmHtrI exhibited it existed as a membrane protein and several biophysical assays confirmed it functionally interacted with HmSRI. Together with our previous reverse-transcriptase-PCR results and phototaxis measurements, the new ORF of original predicted soluble htr gene product was a membrane protein with a 2-TM region, HmHtrI; and it serves as the cognate transducer for HmSRI. HmHtrI therefore is the first transducer for the sensory rhodopsin adopted start codon other than ATG. Copyright © 2013 Elsevier B.V. All rights reserved.

  10. Comprehensive annotation of secondary metabolite biosynthetic genes and gene clusters of Aspergillus nidulans, A. fumigatus, A. niger and A. oryzae

    PubMed Central

    2013-01-01

    Background Secondary metabolite production, a hallmark of filamentous fungi, is an expanding area of research for the Aspergilli. These compounds are potent chemicals, ranging from deadly toxins to therapeutic antibiotics to potential anti-cancer drugs. The genome sequences for multiple Aspergilli have been determined, and provide a wealth of predictive information about secondary metabolite production. Sequence analysis and gene overexpression strategies have enabled the discovery of novel secondary metabolites and the genes involved in their biosynthesis. The Aspergillus Genome Database (AspGD) provides a central repository for gene annotation and protein information for Aspergillus species. These annotations include Gene Ontology (GO) terms, phenotype data, gene names and descriptions and they are crucial for interpreting both small- and large-scale data and for aiding in the design of new experiments that further Aspergillus research. Results We have manually curated Biological Process GO annotations for all genes in AspGD with recorded functions in secondary metabolite production, adding new GO terms that specifically describe each secondary metabolite. We then leveraged these new annotations to predict roles in secondary metabolism for genes lacking experimental characterization. As a starting point for manually annotating Aspergillus secondary metabolite gene clusters, we used antiSMASH (antibiotics and Secondary Metabolite Analysis SHell) and SMURF (Secondary Metabolite Unknown Regions Finder) algorithms to identify potential clusters in A. nidulans, A. fumigatus, A. niger and A. oryzae, which we subsequently refined through manual curation. Conclusions This set of 266 manually curated secondary metabolite gene clusters will facilitate the investigation of novel Aspergillus secondary metabolites. PMID:23617571

  11. Proteogenomics produces comprehensive and highly accurate protein-coding gene annotation in a complete genome assembly of Malassezia sympodialis

    PubMed Central

    Tellgren-Roth, Christian; Baudo, Charles D.; Kennell, John C.; Sun, Sheng; Billmyre, R. Blake; Schröder, Markus S.; Andersson, Anna; Holm, Tina; Sigurgeirsson, Benjamin; Wu, Guangxi; Sankaranarayanan, Sundar Ram; Siddharthan, Rahul; Sanyal, Kaustuv; Lundeberg, Joakim; Nystedt, Björn; Boekhout, Teun; Dawson, Thomas L.; Heitman, Joseph

    2017-01-01

    Abstract Complete and accurate genome assembly and annotation is a crucial foundation for comparative and functional genomics. Despite this, few complete eukaryotic genomes are available, and genome annotation remains a major challenge. Here, we present a complete genome assembly of the skin commensal yeast Malassezia sympodialis and demonstrate how proteogenomics can substantially improve gene annotation. Through long-read DNA sequencing, we obtained a gap-free genome assembly for M. sympodialis (ATCC 42132), comprising eight nuclear and one mitochondrial chromosome. We also sequenced and assembled four M. sympodialis clinical isolates, and showed their value for understanding Malassezia reproduction by confirming four alternative allele combinations at the two mating-type loci. Importantly, we demonstrated how proteomics data could be readily integrated with transcriptomics data in standard annotation tools. This increased the number of annotated protein-coding genes by 14% (from 3612 to 4113), compared to using transcriptomics evidence alone. Manual curation further increased the number of protein-coding genes by 9% (to 4493). All of these genes have RNA-seq evidence and 87% were confirmed by proteomics. The M. sympodialis genome assembly and annotation presented here is at a quality yet achieved only for a few eukaryotic organisms, and constitutes an important reference for future host-microbe interaction studies. PMID:28100699

  12. Joint annotation of chromatin state and chromatin conformation reveals relationships among domain types and identifies domains of cell-type-specific expression

    PubMed Central

    Libbrecht, Maxwell W.; Ay, Ferhat; Hoffman, Michael M.; Gilbert, David M.; Bilmes, Jeffrey A.; Noble, William Stafford

    2015-01-01

    The genomic neighborhood of a gene influences its activity, a behavior that is attributable in part to domain-scale regulation. Previous genomic studies have identified many types of regulatory domains. However, due to the difficulty of integrating genomics data sets, the relationships among these domain types are poorly understood. Semi-automated genome annotation (SAGA) algorithms facilitate human interpretation of heterogeneous collections of genomics data by simultaneously partitioning the human genome and assigning labels to the resulting genomic segments. However, existing SAGA methods cannot integrate inherently pairwise chromatin conformation data. We developed a new computational method, called graph-based regularization (GBR), for expressing a pairwise prior that encourages certain pairs of genomic loci to receive the same label in a genome annotation. We used GBR to exploit chromatin conformation information during genome annotation by encouraging positions that are close in 3D to occupy the same type of domain. Using this approach, we produced a model of chromatin domains in eight human cell types, thereby revealing the relationships among known domain types. Through this model, we identified clusters of tightly regulated genes expressed in only a small number of cell types, which we term “specific expression domains.” We found that domain boundaries marked by promoters and CTCF motifs are consistent between cell types even when domain activity changes. Finally, we showed that GBR can be used to transfer information from well-studied cell types to less well-characterized cell types during genome annotation, making it possible to produce high-quality annotations of the hundreds of cell types with limited available data. PMID:25677182

  13. Joint annotation of chromatin state and chromatin conformation reveals relationships among domain types and identifies domains of cell-type-specific expression.

    PubMed

    Libbrecht, Maxwell W; Ay, Ferhat; Hoffman, Michael M; Gilbert, David M; Bilmes, Jeffrey A; Noble, William Stafford

    2015-04-01

    The genomic neighborhood of a gene influences its activity, a behavior that is attributable in part to domain-scale regulation. Previous genomic studies have identified many types of regulatory domains. However, due to the difficulty of integrating genomics data sets, the relationships among these domain types are poorly understood. Semi-automated genome annotation (SAGA) algorithms facilitate human interpretation of heterogeneous collections of genomics data by simultaneously partitioning the human genome and assigning labels to the resulting genomic segments. However, existing SAGA methods cannot integrate inherently pairwise chromatin conformation data. We developed a new computational method, called graph-based regularization (GBR), for expressing a pairwise prior that encourages certain pairs of genomic loci to receive the same label in a genome annotation. We used GBR to exploit chromatin conformation information during genome annotation by encouraging positions that are close in 3D to occupy the same type of domain. Using this approach, we produced a model of chromatin domains in eight human cell types, thereby revealing the relationships among known domain types. Through this model, we identified clusters of tightly regulated genes expressed in only a small number of cell types, which we term "specific expression domains." We found that domain boundaries marked by promoters and CTCF motifs are consistent between cell types even when domain activity changes. Finally, we showed that GBR can be used to transfer information from well-studied cell types to less well-characterized cell types during genome annotation, making it possible to produce high-quality annotations of the hundreds of cell types with limited available data. © 2015 Libbrecht et al.; Published by Cold Spring Harbor Laboratory Press.

  14. MicroScope—an integrated microbial resource for the curation and comparative analysis of genomic and metabolic data

    PubMed Central

    Vallenet, David; Belda, Eugeni; Calteau, Alexandra; Cruveiller, Stéphane; Engelen, Stefan; Lajus, Aurélie; Le Fèvre, François; Longin, Cyrille; Mornico, Damien; Roche, David; Rouy, Zoé; Salvignol, Gregory; Scarpelli, Claude; Thil Smith, Adam Alexander; Weiman, Marion; Médigue, Claudine

    2013-01-01

    MicroScope is an integrated platform dedicated to both the methodical updating of microbial genome annotation and to comparative analysis. The resource provides data from completed and ongoing genome projects (automatic and expert annotations), together with data sources from post-genomic experiments (i.e. transcriptomics, mutant collections) allowing users to perfect and improve the understanding of gene functions. MicroScope (http://www.genoscope.cns.fr/agc/microscope) combines tools and graphical interfaces to analyse genomes and to perform the manual curation of gene annotations in a comparative context. Since its first publication in January 2006, the system (previously named MaGe for Magnifying Genomes) has been continuously extended both in terms of data content and analysis tools. The last update of MicroScope was published in 2009 in the Database journal. Today, the resource contains data for >1600 microbial genomes, of which ∼300 are manually curated and maintained by biologists (1200 personal accounts today). Expert annotations are continuously gathered in the MicroScope database (∼50 000 a year), contributing to the improvement of the quality of microbial genomes annotations. Improved data browsing and searching tools have been added, original tools useful in the context of expert annotation have been developed and integrated and the website has been significantly redesigned to be more user-friendly. Furthermore, in the context of the European project Microme (Framework Program 7 Collaborative Project), MicroScope is becoming a resource providing for the curation and analysis of both genomic and metabolic data. An increasing number of projects are related to the study of environmental bacterial (meta)genomes that are able to metabolize a large variety of chemical compounds that may be of high industrial interest. PMID:23193269

  15. Resequencing and annotation of the Nostoc punctiforme ATTC 29133 genome: facilitating biofuel and high-value chemical production

    DOE PAGES

    Moraes, Luis E.; Blow, Matthew J.; Hawley, Erik R.; ...

    2017-02-16

    Cyanobacteria have the potential to produce bulk and fine chemicals and members belonging to Nostoc sp. have received particular attention due to their relatively fast growth rate and the relative ease with which they can be harvested. Nostoc punctiforme is an aerobic, motile, Gram-negative, filamentous cyanobacterium that has been studied intensively to enhance our understanding of microbial carbon and nitrogen fixation. The genome of the type strain N. punctiforme ATCC 29133 was sequenced in 2001 and the scientific community has used these genome data extensively since then. Advances in bioinformatics tools for sequence annotation and the importance of this organismmore » prompted us to resequence and reanalyze its genome and to make both, the initial and improved annotation, available to the scientific community. The new draft genome has a total size of 9.1 Mbp and consists of 65 contiguous pieces of DNA with a GC content of 41.38% and 7664 protein-coding genes. Furthermore, the resequenced genome is slightly (5152 bp) larger and contains 987 more genes with functional prediction when compared to the previously published version. We deposited the annotation of both genomes in the Department of Energy’s IMG database to facilitate easy genome exploration by the scientific community without the need of in-depth bioinformatics skills. We expect that an facilitated access and ability to search the N. punctiforme ATCC 29133 for genes of interest will significantly facilitate metabolic engineering and genome prospecting efforts and ultimately the synthesis of biofuels and natural products from this keystone organism and closely related cyanobacteria.« less

  16. Resequencing and annotation of the Nostoc punctiforme ATTC 29133 genome: facilitating biofuel and high-value chemical production

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Moraes, Luis E.; Blow, Matthew J.; Hawley, Erik R.

    Cyanobacteria have the potential to produce bulk and fine chemicals and members belonging to Nostoc sp. have received particular attention due to their relatively fast growth rate and the relative ease with which they can be harvested. Nostoc punctiforme is an aerobic, motile, Gram-negative, filamentous cyanobacterium that has been studied intensively to enhance our understanding of microbial carbon and nitrogen fixation. The genome of the type strain N. punctiforme ATCC 29133 was sequenced in 2001 and the scientific community has used these genome data extensively since then. Advances in bioinformatics tools for sequence annotation and the importance of this organismmore » prompted us to resequence and reanalyze its genome and to make both, the initial and improved annotation, available to the scientific community. The new draft genome has a total size of 9.1 Mbp and consists of 65 contiguous pieces of DNA with a GC content of 41.38% and 7664 protein-coding genes. Furthermore, the resequenced genome is slightly (5152 bp) larger and contains 987 more genes with functional prediction when compared to the previously published version. We deposited the annotation of both genomes in the Department of Energy’s IMG database to facilitate easy genome exploration by the scientific community without the need of in-depth bioinformatics skills. We expect that an facilitated access and ability to search the N. punctiforme ATCC 29133 for genes of interest will significantly facilitate metabolic engineering and genome prospecting efforts and ultimately the synthesis of biofuels and natural products from this keystone organism and closely related cyanobacteria.« less

  17. ADGO: analysis of differentially expressed gene sets using composite GO annotation.

    PubMed

    Nam, Dougu; Kim, Sang-Bae; Kim, Seon-Kyu; Yang, Sungjin; Kim, Seon-Young; Chu, In-Sun

    2006-09-15

    Genes are typically expressed in modular manners in biological processes. Recent studies reflect such features in analyzing gene expression patterns by directly scoring gene sets. Gene annotations have been used to define the gene sets, which have served to reveal specific biological themes from expression data. However, current annotations have limited analytical power, because they are classified by single categories providing only unary information for the gene sets. Here we propose a method for discovering composite biological themes from expression data. We intersected two annotated gene sets from different categories of Gene Ontology (GO). We then scored the expression changes of all the single and intersected sets. In this way, we were able to uncover, for example, a gene set with the molecular function F and the cellular component C that showed significant expression change, while the changes in individual gene sets were not significant. We provided an exemplary analysis for HIV-1 immune response. In addition, we tested the method on 20 public datasets where we found many 'filtered' composite terms the number of which reached approximately 34% (a strong criterion, 5% significance) of the number of significant unary terms on average. By using composite annotation, we can derive new and improved information about disease and biological processes from expression data. We provide a web application (ADGO: http://array.kobic.re.kr/ADGO) for the analysis of differentially expressed gene sets with composite GO annotations. The user can analyze Affymetrix and dual channel array (spotted cDNA and spotted oligo microarray) data for four species: human, mouse, rat and yeast. chu@kribb.re.kr http://array.kobic.re.kr/ADGO.

  18. Automated update, revision, and quality control of the maize genome annotations using MAKER-P improves the B73 RefGen_v3 gene models and identifies new genes

    USDA-ARS?s Scientific Manuscript database

    The large size and relative complexity of many plant genomes make creation, quality control, and dissemination of high-quality gene structure annotations challenging. In response, we have developed MAKER-P, a fast and easy-to-use genome annotation engine for plants. Here, we report the use of MAKER-...

  19. Using Gene Ontology to describe the role of the neurexin-neuroligin-SHANK complex in human, mouse and rat and its relevance to autism.

    PubMed

    Patel, Sejal; Roncaglia, Paola; Lovering, Ruth C

    2015-06-06

    People with an autistic spectrum disorder (ASD) display a variety of characteristic behavioral traits, including impaired social interaction, communication difficulties and repetitive behavior. This complex neurodevelopment disorder is known to be associated with a combination of genetic and environmental factors. Neurexins and neuroligins play a key role in synaptogenesis and neurexin-neuroligin adhesion is one of several processes that have been implicated in autism spectrum disorders. In this report we describe the manual annotation of a selection of gene products known to be associated with autism and/or the neurexin-neuroligin-SHANK complex and demonstrate how a focused annotation approach leads to the creation of more descriptive Gene Ontology (GO) terms, as well as an increase in both the number of gene product annotations and their granularity, thus improving the data available in the GO database. The manual annotations we describe will impact on the functional analysis of a variety of future autism-relevant datasets. Comprehensive gene annotation is an essential aspect of genomic and proteomic studies, as the quality of gene annotations incorporated into statistical analysis tools affects the effective interpretation of data obtained through genome wide association studies, next generation sequencing, proteomic and transcriptomic datasets.

  20. [Transcriptome analysis of Dunaliella viridis].

    PubMed

    Zhu, Shuai-qi; Gong, Yi-fu; Hang, Yu-qing; Liu, Hao; Wang, He-yu

    2015-08-01

    In order to understand the gene information, function, haloduric pathway (glycerolipid metabolism) and related key genes for Dunaliella viridis, we used Illumina HiSeqTM 2000 high-throughput sequencing technology to sequence its transcriptome. Trinity soft was used to assemble the data to form transcripts. Based on the Clusters of Orthologous Groups (COG), Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG ) databases, we carried out functional annotation and classification, pathway annotation, and the opening reading fragment (ORF) sequence prediction of transcripts. The key genes in the glycerolipid metabolism were analyzed. The results suggested that 81,593 transcripts were found, and 77,117 ORF sequences were predicted, accounting for 94.50% of all transcripts. COG classification results showed that 16,569 transcripts were assigned to 24 categories. GO classification annotated 76,436 transcripts. The number of transcripts for biologcial processes was 30,678, accounting for 40.14% of all transcripts. KEGG pathway analysis showed that 26,428 transcripts were annotated to 317 pathways, and 131 pathways were related to metabolism, accounting for 41.32% of all annotated pathways. Only one transcript was annotated as coding the key enzyme dihydroxyacetone kinase involved in the glycerolipid pathway. This enzyme could be related to glycerol biosynthesis under salt stress. This study further improved the gene information and laid the foundation of metabolic pathway research for Dunaliella viridis.

  1. Genome re-annotation of the wild strawberry Fragaria vesca using extensive Illumina- and SMRT-based RNA-seq datasets

    PubMed Central

    Li, Yongping; Wei, Wei; Feng, Jia; Luo, Huifeng; Pi, Mengting; Liu, Zhongchi; Kang, Chunying

    2018-01-01

    Abstract The genome of the wild diploid strawberry species Fragaria vesca, an ideal model system of cultivated strawberry (Fragaria × ananassa, octoploid) and other Rosaceae family crops, was first published in 2011 and followed by a new assembly (Fvb). However, the annotation for Fvb mainly relied on ab initio predictions and included only predicted coding sequences, therefore an improved annotation is highly desirable. Here, a new annotation version named v2.0.a2 was created for the Fvb genome by a pipeline utilizing one PacBio library, 90 Illumina RNA-seq libraries, and 9 small RNA-seq libraries. Altogether, 18,641 genes (55.6% out of 33,538 genes) were augmented with information on the 5′ and/or 3′ UTRs, 13,168 (39.3%) protein-coding genes were modified or newly identified, and 7,370 genes were found to possess alternative isoforms. In addition, 1,938 long non-coding RNAs, 171 miRNAs, and 51,714 small RNA clusters were integrated into the annotation. This new annotation of F. vesca is substantially improved in both accuracy and integrity of gene predictions, beneficial to the gene functional studies in strawberry and to the comparative genomic analysis of other horticultural crops in Rosaceae family. PMID:29036429

  2. Analysis of expressed sequence tags from a NaHCO(3)-treated alkali-tolerant plant, Chloris virgata.

    PubMed

    Nishiuchi, Shunsaku; Fujihara, Kazumasa; Liu, Shenkui; Takano, Tetsuo

    2010-04-01

    Chloris virgata Swartz (C. virgata) is a gramineous wild plant that can survive in saline-alkali areas in northeast China. To examine the tolerance mechanisms of C. virgata, we constructed a cDNA library from whole plants of C. virgata that had been treated with 100 mM NaHCO(3) for 24 h and sequenced 3168 randomly selected clones. Most (2590) of the expressed sequence tags (ESTs) showed significant similarity to sequences in the NCBI database. Of the 2590 genes, 1893 were unique. Gene Ontology (GO) Slim annotations were obtained for 1081 ESTs by BLAST2GO and it was found that 75 genes of them were annotated with GO terms "response to stress", "response to abiotic stimulus", and "response to biotic stimulus", indicating these genes were likely to function in tolerance mechanism of C. virgata. In a separate experiment, 24 genes that are known from previous studies to be associated with abiotic stress tolerance were further examined by real-time RT-PCR to see how their expressions were affected by NaHCO(3) stress. NaHCO(3) treatment up-regulated the expressions of pathogenesis-related gene (DC998527), Win1 precursor gene (DC998617), catalase gene (DC999385), ribosome inactivating protein 1 (DC999555), Na(+)/H(+) antiporter gene (DC998043), and two-component regulator gene (DC998236). Copyright 2010 Elsevier Masson SAS. All rights reserved.

  3. A New Chicken Genome Assembly Provides Insight into Avian Genome Structure.

    PubMed

    Warren, Wesley C; Hillier, LaDeana W; Tomlinson, Chad; Minx, Patrick; Kremitzki, Milinn; Graves, Tina; Markovic, Chris; Bouk, Nathan; Pruitt, Kim D; Thibaud-Nissen, Francoise; Schneider, Valerie; Mansour, Tamer A; Brown, C Titus; Zimin, Aleksey; Hawken, Rachel; Abrahamsen, Mitch; Pyrkosz, Alexis B; Morisson, Mireille; Fillon, Valerie; Vignal, Alain; Chow, William; Howe, Kerstin; Fulton, Janet E; Miller, Marcia M; Lovell, Peter; Mello, Claudio V; Wirthlin, Morgan; Mason, Andrew S; Kuo, Richard; Burt, David W; Dodgson, Jerry B; Cheng, Hans H

    2017-01-05

    The importance of the Gallus gallus (chicken) as a model organism and agricultural animal merits a continuation of sequence assembly improvement efforts. We present a new version of the chicken genome assembly (Gallus_gallus-5.0; GCA_000002315.3), built from combined long single molecule sequencing technology, finished BACs, and improved physical maps. In overall assembled bases, we see a gain of 183 Mb, including 16.4 Mb in placed chromosomes with a corresponding gain in the percentage of intact repeat elements characterized. Of the 1.21 Gb genome, we include three previously missing autosomes, GGA30, 31, and 33, and improve sequence contig length 10-fold over the previous Gallus_gallus-4.0. Despite the significant base representation improvements made, 138 Mb of sequence is not yet located to chromosomes. When annotated for gene content, Gallus_gallus-5.0 shows an increase of 4679 annotated genes (2768 noncoding and 1911 protein-coding) over those in Gallus_gallus-4.0. We also revisited the question of what genes are missing in the avian lineage, as assessed by the highest quality avian genome assembly to date, and found that a large fraction of the original set of missing genes are still absent in sequenced bird species. Finally, our new data support a detailed map of MHC-B, encompassing two segments: one with a highly stable gene copy number and another in which the gene copy number is highly variable. The chicken model has been a critical resource for many other fields of study, and this new reference assembly will substantially further these efforts. Copyright © 2017 Warren et al.

  4. Comparative genomics of grass EST libraries reveals previously uncharacterized splicing events in crop plants.

    PubMed

    Chuang, Trees-Juen; Yang, Min-Yu; Lin, Chuang-Chieh; Hsieh, Ping-Hung; Hung, Li-Yuan

    2015-02-05

    Crop plants such as rice, maize and sorghum play economically-important roles as main sources of food, fuel, and animal feed. However, current genome annotations of crop plants still suffer false-positive predictions; a more comprehensive registry of alternative splicing (AS) events is also in demand. Comparative genomics of crop plants is largely unexplored. We performed a large-scale comparative analysis (ExonFinder) of the expressed sequence tag (EST) library from nine grass plants against three crop genomes (rice, maize, and sorghum) and identified 2,879 previously-unannotated exons (i.e., novel exons) in the three crops. We validated 81% of the tested exons by RT-PCR-sequencing, supporting the effectiveness of our in silico strategy. Evolutionary analysis reveals that the novel exons, comparing with their flanking annotated ones, are generally under weaker selection pressure at the protein level, but under stronger pressure at the RNA level, suggesting that most of the novel exons also represent novel alternatively spliced variants (ASVs). However, we also observed the consistency of evolutionary rates between certain novel exons and their flanking exons, which provided further evidence of their co-occurrence in the transcripts, suggesting that previously-annotated isoforms might be subject to erroneous predictions. Our validation showed that 54% of the tested genes expressed the newly-identified isoforms that contained the novel exons, rather than the previously-annotated isoforms that excluded them. The consistent results were steadily observed across cultivated (Oryza sativa and O. glaberrima) and wild (O. rufipogon and O. nivara) rice species, asserting the necessity of our curation of the crop genome annotations. Our comparative analyses also inferred the common ancestral transcriptome of grass plants and gain- and loss-of-ASV events. We have reannotated the rice, maize, and sorghum genomes, and showed that evolutionary rates might serve as an indicator for determining whether the identified exons were alternatively spliced. This study not only presents an effective in silico strategy for the improvement of plant annotations, but also provides further insights into the role of AS events in the evolution and domestication of crop plants. ExonFinder and the novel exons/ASVs identified are publicly accessible at http://exonfinder.sourceforge.net/ .

  5. Escherichia coli K-12: a cooperatively developed annotation snapshot—2005

    PubMed Central

    Riley, Monica; Abe, Takashi; Arnaud, Martha B.; Berlyn, Mary K.B.; Blattner, Frederick R.; Chaudhuri, Roy R.; Glasner, Jeremy D.; Horiuchi, Takashi; Keseler, Ingrid M.; Kosuge, Takehide; Mori, Hirotada; Perna, Nicole T.; Plunkett, Guy; Rudd, Kenneth E.; Serres, Margrethe H.; Thomas, Gavin H.; Thomson, Nicholas R.; Wishart, David; Wanner, Barry L.

    2006-01-01

    The goal of this group project has been to coordinate and bring up-to-date information on all genes of Escherichia coli K-12. Annotation of the genome of an organism entails identification of genes, the boundaries of genes in terms of precise start and end sites, and description of the gene products. Known and predicted functions were assigned to each gene product on the basis of experimental evidence or sequence analysis. Since both kinds of evidence are constantly expanding, no annotation is complete at any moment in time. This is a snapshot analysis based on the most recent genome sequences of two E.coli K-12 bacteria. An accurate and up-to-date description of E.coli K-12 genes is of particular importance to the scientific community because experimentally determined properties of its gene products provide fundamental information for annotation of innumerable genes of other organisms. Availability of the complete genome sequence of two K-12 strains allows comparison of their genotypes and mutant status of alleles. PMID:16397293

  6. Approaches to Fungal Genome Annotation

    PubMed Central

    Haas, Brian J.; Zeng, Qiandong; Pearson, Matthew D.; Cuomo, Christina A.; Wortman, Jennifer R.

    2011-01-01

    Fungal genome annotation is the starting point for analysis of genome content. This generally involves the application of diverse methods to identify features on a genome assembly such as protein-coding and non-coding genes, repeats and transposable elements, and pseudogenes. Here we describe tools and methods leveraged for eukaryotic genome annotation with a focus on the annotation of fungal nuclear and mitochondrial genomes. We highlight the application of the latest technologies and tools to improve the quality of predicted gene sets. The Broad Institute eukaryotic genome annotation pipeline is described as one example of how such methods and tools are integrated into a sequencing center’s production genome annotation environment. PMID:22059117

  7. Assembly and comparison of two closely related Brassica napus genomes.

    PubMed

    Bayer, Philipp E; Hurgobin, Bhavna; Golicz, Agnieszka A; Chan, Chon-Kit Kenneth; Yuan, Yuxuan; Lee, HueyTyng; Renton, Michael; Meng, Jinling; Li, Ruiyuan; Long, Yan; Zou, Jun; Bancroft, Ian; Chalhoub, Boulos; King, Graham J; Batley, Jacqueline; Edwards, David

    2017-12-01

    As an increasing number of plant genome sequences become available, it is clear that gene content varies between individuals, and the challenge arises to predict the gene content of a species. However, genome comparison is often confounded by variation in assembly and annotation. Differentiating between true gene absence and variation in assembly or annotation is essential for the accurate identification of conserved and variable genes in a species. Here, we present the de novo assembly of the B. napus cultivar Tapidor and comparison with an improved assembly of the Brassica napus cultivar Darmor-bzh. Both cultivars were annotated using the same method to allow comparison of gene content. We identified genes unique to each cultivar and differentiate these from artefacts due to variation in the assembly and annotation. We demonstrate that using a common annotation pipeline can result in different gene predictions, even for closely related cultivars, and repeat regions which collapse during assembly impact whole genome comparison. After accounting for differences in assembly and annotation, we demonstrate that the genome of Darmor-bzh contains a greater number of genes than the genome of Tapidor. Our results are the first step towards comparison of the true differences between B. napus genomes and highlight the potential sources of error in future production of a B. napus pangenome. © 2017 The Authors. Plant Biotechnology Journal published by Society for Experimental Biology and The Association of Applied Biologists and John Wiley & Sons Ltd.

  8. The Co-regulation Data Harvester: Automating gene annotation starting from a transcriptome database

    NASA Astrophysics Data System (ADS)

    Tsypin, Lev M.; Turkewitz, Aaron P.

    Identifying co-regulated genes provides a useful approach for defining pathway-specific machinery in an organism. To be efficient, this approach relies on thorough genome annotation, a process much slower than genome sequencing per se. Tetrahymena thermophila, a unicellular eukaryote, has been a useful model organism and has a fully sequenced but sparsely annotated genome. One important resource for studying this organism has been an online transcriptomic database. We have developed an automated approach to gene annotation in the context of transcriptome data in T. thermophila, called the Co-regulation Data Harvester (CDH). Beginning with a gene of interest, the CDH identifies co-regulated genes by accessing the Tetrahymena transcriptome database. It then identifies their closely related genes (orthologs) in other organisms by using reciprocal BLAST searches. Finally, it collates the annotations of those orthologs' functions, which provides the user with information to help predict the cellular role of the initial query. The CDH, which is freely available, represents a powerful new tool for analyzing cell biological pathways in Tetrahymena. Moreover, to the extent that genes and pathways are conserved between organisms, the inferences obtained via the CDH should be relevant, and can be explored, in many other systems.

  9. Guidelines for the functional annotation of microRNAs using the Gene Ontology

    PubMed Central

    D'Eustachio, Peter; Smith, Jennifer R.; Zampetaki, Anna

    2016-01-01

    MicroRNA regulation of developmental and cellular processes is a relatively new field of study, and the available research data have not been organized to enable its inclusion in pathway and network analysis tools. The association of gene products with terms from the Gene Ontology is an effective method to analyze functional data, but until recently there has been no substantial effort dedicated to applying Gene Ontology terms to microRNAs. Consequently, when performing functional analysis of microRNA data sets, researchers have had to rely instead on the functional annotations associated with the genes encoding microRNA targets. In consultation with experts in the field of microRNA research, we have created comprehensive recommendations for the Gene Ontology curation of microRNAs. This curation manual will enable provision of a high-quality, reliable set of functional annotations for the advancement of microRNA research. Here we describe the key aspects of the work, including development of the Gene Ontology to represent this data, standards for describing the data, and guidelines to support curators making these annotations. The full microRNA curation guidelines are available on the GO Consortium wiki (http://wiki.geneontology.org/index.php/MicroRNA_GO_annotation_manual). PMID:26917558

  10. Generation, annotation and analysis of ESTs from Trichoderma harzianum CECT 2413

    PubMed Central

    Vizcaíno, Juan Antonio; González, Francisco Javier; Suárez, M Belén; Redondo, José; Heinrich, Julian; Delgado-Jarana, Jesús; Hermosa, Rosa; Gutiérrez, Santiago; Monte, Enrique; Llobell, Antonio; Rey, Manuel

    2006-01-01

    Background The filamentous fungus Trichoderma harzianum is used as biological control agent of several plant-pathogenic fungi. In order to study the genome of this fungus, a functional genomics project called "TrichoEST" was developed to give insights into genes involved in biological control activities using an approach based on the generation of expressed sequence tags (ESTs). Results Eight different cDNA libraries from T. harzianum strain CECT 2413 were constructed. Different growth conditions involving mainly different nutrient conditions and/or stresses were used. We here present the analysis of the 8,710 ESTs generated. A total of 3,478 unique sequences were identified of which 81.4% had sequence similarity with GenBank entries, using the BLASTX algorithm. Using the Gene Ontology hierarchy, we performed the annotation of 51.1% of the unique sequences and compared its distribution among the gene libraries. Additionally, the InterProScan algorithm was used in order to further characterize the sequences. The identification of the putatively secreted proteins was also carried out. Later, based on the EST abundance, we examined the highly expressed genes and a hydrophobin was identified as the gene expressed at the highest level. We compared our collection of ESTs with the previous collections obtained from Trichoderma species and we also compared our sequence set with different complete eukaryotic genomes from several animals, plants and fungi. Accordingly, the presence of similar sequences in different kingdoms was also studied. Conclusion This EST collection and its annotation provide a significant resource for basic and applied research on T. harzianum, a fungus with a high biotechnological interest. PMID:16872539

  11. A Resource of Quantitative Functional Annotation for Homo sapiens Genes.

    PubMed

    Taşan, Murat; Drabkin, Harold J; Beaver, John E; Chua, Hon Nian; Dunham, Julie; Tian, Weidong; Blake, Judith A; Roth, Frederick P

    2012-02-01

    The body of human genomic and proteomic evidence continues to grow at ever-increasing rates, while annotation efforts struggle to keep pace. A surprisingly small fraction of human genes have clear, documented associations with specific functions, and new functions continue to be found for characterized genes. Here we assembled an integrated collection of diverse genomic and proteomic data for 21,341 human genes and make quantitative associations of each to 4333 Gene Ontology terms. We combined guilt-by-profiling and guilt-by-association approaches to exploit features unique to the data types. Performance was evaluated by cross-validation, prospective validation, and by manual evaluation with the biological literature. Functional-linkage networks were also constructed, and their utility was demonstrated by identifying candidate genes related to a glioma FLN using a seed network from genome-wide association studies. Our annotations are presented-alongside existing validated annotations-in a publicly accessible and searchable web interface.

  12. A comprehensive transcript index of the human genome generated using microarrays and computational approaches

    PubMed Central

    Schadt, Eric E; Edwards, Stephen W; GuhaThakurta, Debraj; Holder, Dan; Ying, Lisa; Svetnik, Vladimir; Leonardson, Amy; Hart, Kyle W; Russell, Archie; Li, Guoya; Cavet, Guy; Castle, John; McDonagh, Paul; Kan, Zhengyan; Chen, Ronghua; Kasarskis, Andrew; Margarint, Mihai; Caceres, Ramon M; Johnson, Jason M; Armour, Christopher D; Garrett-Engele, Philip W; Tsinoremas, Nicholas F; Shoemaker, Daniel D

    2004-01-01

    Background Computational and microarray-based experimental approaches were used to generate a comprehensive transcript index for the human genome. Oligonucleotide probes designed from approximately 50,000 known and predicted transcript sequences from the human genome were used to survey transcription from a diverse set of 60 tissues and cell lines using ink-jet microarrays. Further, expression activity over at least six conditions was more generally assessed using genomic tiling arrays consisting of probes tiled through a repeat-masked version of the genomic sequence making up chromosomes 20 and 22. Results The combination of microarray data with extensive genome annotations resulted in a set of 28,456 experimentally supported transcripts. This set of high-confidence transcripts represents the first experimentally driven annotation of the human genome. In addition, the results from genomic tiling suggest that a large amount of transcription exists outside of annotated regions of the genome and serves as an example of how this activity could be measured on a genome-wide scale. Conclusions These data represent one of the most comprehensive assessments of transcriptional activity in the human genome and provide an atlas of human gene expression over a unique set of gene predictions. Before the annotation of the human genome is considered complete, however, the previously unannotated transcriptional activity throughout the genome must be fully characterized. PMID:15461792

  13. Annotation of the Clostridium Acetobutylicum Genome

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Daly, M. J.

    The genome sequence of the solvent producing bacterium Clostridium acetobutylicum ATCC824, has been determined by the shotgun approach. The genome consists of a 3.94 Mb chromosome and a 192 kb megaplasmid that contains the majority of genes responsible for solvent production. Comparison of C. acetobutylicum to Bacillus subtilis reveals significant local conservation of gene order, which has not been seen in comparisons of other genomes with similar, or, in some cases, closer, phylogenetic proximity. This conservation allows the prediction of many previously undetected operons in both bacteria.

  14. Rice Annotation Project Database (RAP-DB): an integrative and interactive database for rice genomics.

    PubMed

    Sakai, Hiroaki; Lee, Sung Shin; Tanaka, Tsuyoshi; Numa, Hisataka; Kim, Jungsok; Kawahara, Yoshihiro; Wakimoto, Hironobu; Yang, Ching-chia; Iwamoto, Masao; Abe, Takashi; Yamada, Yuko; Muto, Akira; Inokuchi, Hachiro; Ikemura, Toshimichi; Matsumoto, Takashi; Sasaki, Takuji; Itoh, Takeshi

    2013-02-01

    The Rice Annotation Project Database (RAP-DB, http://rapdb.dna.affrc.go.jp/) has been providing a comprehensive set of gene annotations for the genome sequence of rice, Oryza sativa (japonica group) cv. Nipponbare. Since the first release in 2005, RAP-DB has been updated several times along with the genome assembly updates. Here, we present our newest RAP-DB based on the latest genome assembly, Os-Nipponbare-Reference-IRGSP-1.0 (IRGSP-1.0), which was released in 2011. We detected 37,869 loci by mapping transcript and protein sequences of 150 monocot species. To provide plant researchers with highly reliable and up to date rice gene annotations, we have been incorporating literature-based manually curated data, and 1,626 loci currently incorporate literature-based annotation data, including commonly used gene names or gene symbols. Transcriptional activities are shown at the nucleotide level by mapping RNA-Seq reads derived from 27 samples. We also mapped the Illumina reads of a Japanese leading japonica cultivar, Koshihikari, and a Chinese indica cultivar, Guangluai-4, to the genome and show alignments together with the single nucleotide polymorphisms (SNPs) and gene functional annotations through a newly developed browser, Short-Read Assembly Browser (S-RAB). We have developed two satellite databases, Plant Gene Family Database (PGFD) and Integrative Database of Cereal Gene Phylogeny (IDCGP), which display gene family and homologous gene relationships among diverse plant species. RAP-DB and the satellite databases offer simple and user-friendly web interfaces, enabling plant and genome researchers to access the data easily and facilitating a broad range of plant research topics.

  15. Proteogenomics produces comprehensive and highly accurate protein-coding gene annotation in a complete genome assembly of Malassezia sympodialis.

    PubMed

    Zhu, Yafeng; Engström, Pär G; Tellgren-Roth, Christian; Baudo, Charles D; Kennell, John C; Sun, Sheng; Billmyre, R Blake; Schröder, Markus S; Andersson, Anna; Holm, Tina; Sigurgeirsson, Benjamin; Wu, Guangxi; Sankaranarayanan, Sundar Ram; Siddharthan, Rahul; Sanyal, Kaustuv; Lundeberg, Joakim; Nystedt, Björn; Boekhout, Teun; Dawson, Thomas L; Heitman, Joseph; Scheynius, Annika; Lehtiö, Janne

    2017-03-17

    Complete and accurate genome assembly and annotation is a crucial foundation for comparative and functional genomics. Despite this, few complete eukaryotic genomes are available, and genome annotation remains a major challenge. Here, we present a complete genome assembly of the skin commensal yeast Malassezia sympodialis and demonstrate how proteogenomics can substantially improve gene annotation. Through long-read DNA sequencing, we obtained a gap-free genome assembly for M. sympodialis (ATCC 42132), comprising eight nuclear and one mitochondrial chromosome. We also sequenced and assembled four M. sympodialis clinical isolates, and showed their value for understanding Malassezia reproduction by confirming four alternative allele combinations at the two mating-type loci. Importantly, we demonstrated how proteomics data could be readily integrated with transcriptomics data in standard annotation tools. This increased the number of annotated protein-coding genes by 14% (from 3612 to 4113), compared to using transcriptomics evidence alone. Manual curation further increased the number of protein-coding genes by 9% (to 4493). All of these genes have RNA-seq evidence and 87% were confirmed by proteomics. The M. sympodialis genome assembly and annotation presented here is at a quality yet achieved only for a few eukaryotic organisms, and constitutes an important reference for future host-microbe interaction studies. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.

  16. Complete genome analysis of Serratia marcescens RSC-14: A plant growth-promoting bacterium that alleviates cadmium stress in host plants

    PubMed Central

    Khan, Abdur Rahim; Park, Gun-Seok; Asaf, Sajjad; Hong, Sung-Jun; Jung, Byung Kwon

    2017-01-01

    Serratia marcescens RSC-14 is a Gram-negative bacterium that was previously isolated from the surface-sterilized roots of the Cd-hyperaccumulator Solanum nigrum. The strain stimulates plant growth and alleviates Cd stress in host plants. To investigate the genetic basis for these traits, the complete genome of RSC-14 was obtained by single-molecule real-time sequencing. The genome of S. marcescens RSC-14 comprised a 5.12-Mbp-long circular chromosome containing 4,593 predicted protein-coding genes, 22 rRNA genes, 88 tRNA genes, and 41 pseudogenes. It contained genes with potential functions in plant growth promotion, including genes involved in indole-3-acetic acid (IAA) biosynthesis, acetoin synthesis, and phosphate solubilization. Moreover, annotation using NCBI and Rapid Annotation using Subsystem Technology identified several genes that encode antioxidant enzymes as well as genes involved in antioxidant production, supporting the observed resistance towards heavy metals, such as Cd. The presence of IAA pathway-related genes and oxidative stress-responsive enzyme genes may explain the plant growth-promoting potential and Cd tolerance, respectively. This is the first report of a complete genome sequence of Cd-tolerant S. marcescens and its plant growth promotion pathway. The whole-genome analysis of this strain clarified the genetic basis underlying its phenotypic and biochemical characteristics, underpinning the beneficial interactions between RSC-14 and plants. PMID:28187139

  17. An efficient annotation and gene-expression derivation tool for Illumina Solexa datasets

    PubMed Central

    2010-01-01

    Background The data produced by an Illumina flow cell with all eight lanes occupied, produces well over a terabyte worth of images with gigabytes of reads following sequence alignment. The ability to translate such reads into meaningful annotation is therefore of great concern and importance. Very easily, one can get flooded with such a great volume of textual, unannotated data irrespective of read quality or size. CASAVA, a optional analysis tool for Illumina sequencing experiments, enables the ability to understand INDEL detection, SNP information, and allele calling. To not only extract from such analysis, a measure of gene expression in the form of tag-counts, but furthermore to annotate such reads is therefore of significant value. Findings We developed TASE (Tag counting and Analysis of Solexa Experiments), a rapid tag-counting and annotation software tool specifically designed for Illumina CASAVA sequencing datasets. Developed in Java and deployed using jTDS JDBC driver and a SQL Server backend, TASE provides an extremely fast means of calculating gene expression through tag-counts while annotating sequenced reads with the gene's presumed function, from any given CASAVA-build. Such a build is generated for both DNA and RNA sequencing. Analysis is broken into two distinct components: DNA sequence or read concatenation, followed by tag-counting and annotation. The end result produces output containing the homology-based functional annotation and respective gene expression measure signifying how many times sequenced reads were found within the genomic ranges of functional annotations. Conclusions TASE is a powerful tool to facilitate the process of annotating a given Illumina Solexa sequencing dataset. Our results indicate that both homology-based annotation and tag-count analysis are achieved in very efficient times, providing researchers to delve deep in a given CASAVA-build and maximize information extraction from a sequencing dataset. TASE is specially designed to translate sequence data in a CASAVA-build into functional annotations while producing corresponding gene expression measurements. Achieving such analysis is executed in an ultrafast and highly efficient manner, whether the analysis be a single-read or paired-end sequencing experiment. TASE is a user-friendly and freely available application, allowing rapid analysis and annotation of any given Illumina Solexa sequencing dataset with ease. PMID:20598141

  18. Comparison of three microarray probe annotation pipelines: differences in strategies and their effect on downstream analysis

    PubMed Central

    Neerincx, Pieter BT; Casel, Pierrot; Prickett, Dennis; Nie, Haisheng; Watson, Michael; Leunissen, Jack AM; Groenen, Martien AM; Klopp, Christophe

    2009-01-01

    Background Reliable annotation linking oligonucleotide probes to target genes is essential for functional biological analysis of microarray experiments. We used the IMAD, OligoRAP and sigReannot pipelines to update the annotation for the ARK-Genomics Chicken 20 K array as part of a joined EADGENE/SABRE workshop. In this manuscript we compare their annotation strategies and results. Furthermore, we analyse the effect of differences in updated annotation on functional analysis for an experiment involving Eimeria infected chickens and finally we propose guidelines for optimal annotation strategies. Results IMAD, OligoRAP and sigReannot update both annotation and estimated target specificity. The 3 pipelines can assign oligos to target specificity categories although with varying degrees of resolution. Target specificity is judged based on the amount and type of oligo versus target-gene alignments (hits), which are determined by filter thresholds that users can adjust based on their experimental conditions. Linking oligos to annotation on the other hand is based on rigid rules, which differ between pipelines. For 52.7% of the oligos from a subset selected for in depth comparison all pipelines linked to one or more Ensembl genes with consensus on 44.0%. In 31.0% of the cases none of the pipelines could assign an Ensembl gene to an oligo and for the remaining 16.3% the coverage differed between pipelines. Differences in updated annotation were mainly due to different thresholds for hybridisation potential filtering of oligo versus target-gene alignments and different policies for expanding annotation using indirect links. The differences in updated annotation packages had a significant effect on GO term enrichment analysis with consensus on only 67.2% of the enriched terms. Conclusion In addition to flexible thresholds to determine target specificity, annotation tools should provide metadata describing the relationships between oligos and the annotation assigned to them. These relationships can then be used to judge the varying degrees of reliability allowing users to fine-tune the balance between reliability and coverage. This is important as it can have a significant effect on functional microarray analysis as exemplified by the lack of consensus on almost one third of the terms found with GO term enrichment analysis based on updated IMAD, OligoRAP or sigReannot annotation. PMID:19615109

  19. Comparison of three microarray probe annotation pipelines: differences in strategies and their effect on downstream analysis.

    PubMed

    Neerincx, Pieter Bt; Casel, Pierrot; Prickett, Dennis; Nie, Haisheng; Watson, Michael; Leunissen, Jack Am; Groenen, Martien Am; Klopp, Christophe

    2009-07-16

    Reliable annotation linking oligonucleotide probes to target genes is essential for functional biological analysis of microarray experiments. We used the IMAD, OligoRAP and sigReannot pipelines to update the annotation for the ARK-Genomics Chicken 20 K array as part of a joined EADGENE/SABRE workshop. In this manuscript we compare their annotation strategies and results. Furthermore, we analyse the effect of differences in updated annotation on functional analysis for an experiment involving Eimeria infected chickens and finally we propose guidelines for optimal annotation strategies. IMAD, OligoRAP and sigReannot update both annotation and estimated target specificity. The 3 pipelines can assign oligos to target specificity categories although with varying degrees of resolution. Target specificity is judged based on the amount and type of oligo versus target-gene alignments (hits), which are determined by filter thresholds that users can adjust based on their experimental conditions. Linking oligos to annotation on the other hand is based on rigid rules, which differ between pipelines.For 52.7% of the oligos from a subset selected for in depth comparison all pipelines linked to one or more Ensembl genes with consensus on 44.0%. In 31.0% of the cases none of the pipelines could assign an Ensembl gene to an oligo and for the remaining 16.3% the coverage differed between pipelines. Differences in updated annotation were mainly due to different thresholds for hybridisation potential filtering of oligo versus target-gene alignments and different policies for expanding annotation using indirect links. The differences in updated annotation packages had a significant effect on GO term enrichment analysis with consensus on only 67.2% of the enriched terms. In addition to flexible thresholds to determine target specificity, annotation tools should provide metadata describing the relationships between oligos and the annotation assigned to them. These relationships can then be used to judge the varying degrees of reliability allowing users to fine-tune the balance between reliability and coverage. This is important as it can have a significant effect on functional microarray analysis as exemplified by the lack of consensus on almost one third of the terms found with GO term enrichment analysis based on updated IMAD, OligoRAP or sigReannot annotation.

  20. Towards a complete map of the human long non-coding RNA transcriptome.

    PubMed

    Uszczynska-Ratajczak, Barbara; Lagarde, Julien; Frankish, Adam; Guigó, Roderic; Johnson, Rory

    2018-05-23

    Gene maps, or annotations, enable us to navigate the functional landscape of our genome. They are a resource upon which virtually all studies depend, from single-gene to genome-wide scales and from basic molecular biology to medical genetics. Yet present-day annotations suffer from trade-offs between quality and size, with serious but often unappreciated consequences for downstream studies. This is particularly true for long non-coding RNAs (lncRNAs), which are poorly characterized compared to protein-coding genes. Long-read sequencing technologies promise to improve current annotations, paving the way towards a complete annotation of lncRNAs expressed throughout a human lifetime.

  1. Annotation of Ehux ESTs

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Kuo, Alan; Grigoriev, Igor

    2009-06-12

    22 percent ESTs do no align with scaffolds. EST Pipeleine assembles 17126 consensi from the noaligned ESTs. Annotation Pipeline predicts 8564 ORFS on the consensi. Domain analysis of ORFs reveals missing genes. Cluster analysis reveals missing genes. Expression analysis reveals potential strain specific genes.

  2. Improved annotation with de novo transcriptome assembly in four social amoeba species.

    PubMed

    Singh, Reema; Lawal, Hajara M; Schilde, Christina; Glöckner, Gernot; Barton, Geoffrey J; Schaap, Pauline; Cole, Christian

    2017-01-31

    Annotation of gene models and transcripts is a fundamental step in genome sequencing projects. Often this is performed with automated prediction pipelines, which can miss complex and atypical genes or transcripts. RNA sequencing (RNA-seq) data can aid the annotation with empirical data. Here we present de novo transcriptome assemblies generated from RNA-seq data in four Dictyostelid species: D. discoideum, P. pallidum, D. fasciculatum and D. lacteum. The assemblies were incorporated with existing gene models to determine corrections and improvement on a whole-genome scale. This is the first time this has been performed in these eukaryotic species. An initial de novo transcriptome assembly was generated by Trinity for each species and then refined with Program to Assemble Spliced Alignments (PASA). The completeness and quality were assessed with the Benchmarking Universal Single-Copy Orthologs (BUSCO) and Transrate tools at each stage of the assemblies. The final datasets of 11,315-12,849 transcripts contained 5,610-7,712 updates and corrections to >50% of existing gene models including changes to hundreds or thousands of protein products. Putative novel genes are also identified and alternative splice isoforms were observed for the first time in P. pallidum, D. lacteum and D. fasciculatum. In taking a whole transcriptome approach to genome annotation with empirical data we have been able to enrich the annotations of four existing genome sequencing projects. In doing so we have identified updates to the majority of the gene annotations across all four species under study and found putative novel genes and transcripts which could be worthy for follow-up. The new transcriptome data we present here will be a valuable resource for genome curators in the Dictyostelia and we propose this effective methodology for use in other genome annotation projects.

  3. GNormPlus: An Integrative Approach for Tagging Genes, Gene Families, and Protein Domains

    PubMed Central

    Lu, Zhiyong

    2015-01-01

    The automatic recognition of gene names and their associated database identifiers from biomedical text has been widely studied in recent years, as these tasks play an important role in many downstream text-mining applications. Despite significant previous research, only a small number of tools are publicly available and these tools are typically restricted to detecting only mention level gene names or only document level gene identifiers. In this work, we report GNormPlus: an end-to-end and open source system that handles both gene mention and identifier detection. We created a new corpus of 694 PubMed articles to support our development of GNormPlus, containing manual annotations for not only gene names and their identifiers, but also closely related concepts useful for gene name disambiguation, such as gene families and protein domains. GNormPlus integrates several advanced text-mining techniques, including SimConcept for resolving composite gene names. As a result, GNormPlus compares favorably to other state-of-the-art methods when evaluated on two widely used public benchmarking datasets, achieving 86.7% F1-score on the BioCreative II Gene Normalization task dataset and 50.1% F1-score on the BioCreative III Gene Normalization task dataset. The GNormPlus source code and its annotated corpus are freely available, and the results of applying GNormPlus to the entire PubMed are freely accessible through our web-based tool PubTator. PMID:26380306

  4. Introducing meta-services for biomedical information extraction

    PubMed Central

    Leitner, Florian; Krallinger, Martin; Rodriguez-Penagos, Carlos; Hakenberg, Jörg; Plake, Conrad; Kuo, Cheng-Ju; Hsu, Chun-Nan; Tsai, Richard Tzong-Han; Hung, Hsi-Chuan; Lau, William W; Johnson, Calvin A; Sætre, Rune; Yoshida, Kazuhiro; Chen, Yan Hua; Kim, Sun; Shin, Soo-Yong; Zhang, Byoung-Tak; Baumgartner, William A; Hunter, Lawrence; Haddow, Barry; Matthews, Michael; Wang, Xinglong; Ruch, Patrick; Ehrler, Frédéric; Özgür, Arzucan; Erkan, Güneş; Radev, Dragomir R; Krauthammer, Michael; Luong, ThaiBinh; Hoffmann, Robert; Sander, Chris; Valencia, Alfonso

    2008-01-01

    We introduce the first meta-service for information extraction in molecular biology, the BioCreative MetaServer (BCMS; ). This prototype platform is a joint effort of 13 research groups and provides automatically generated annotations for PubMed/Medline abstracts. Annotation types cover gene names, gene IDs, species, and protein-protein interactions. The annotations are distributed by the meta-server in both human and machine readable formats (HTML/XML). This service is intended to be used by biomedical researchers and database annotators, and in biomedical language processing. The platform allows direct comparison, unified access, and result aggregation of the annotations. PMID:18834497

  5. Prediction of gene expression in embryonic structures of Drosophila melanogaster.

    PubMed

    Samsonova, Anastasia A; Niranjan, Mahesan; Russell, Steven; Brazma, Alvis

    2007-07-01

    Understanding how sets of genes are coordinately regulated in space and time to generate the diversity of cell types that characterise complex metazoans is a major challenge in modern biology. The use of high-throughput approaches, such as large-scale in situ hybridisation and genome-wide expression profiling via DNA microarrays, is beginning to provide insights into the complexities of development. However, in many organisms the collection and annotation of comprehensive in situ localisation data is a difficult and time-consuming task. Here, we present a widely applicable computational approach, integrating developmental time-course microarray data with annotated in situ hybridisation studies, that facilitates the de novo prediction of tissue-specific expression for genes that have no in vivo gene expression localisation data available. Using a classification approach, trained with data from microarray and in situ hybridisation studies of gene expression during Drosophila embryonic development, we made a set of predictions on the tissue-specific expression of Drosophila genes that have not been systematically characterised by in situ hybridisation experiments. The reliability of our predictions is confirmed by literature-derived annotations in FlyBase, by overrepresentation of Gene Ontology biological process annotations, and, in a selected set, by detailed gene-specific studies from the literature. Our novel organism-independent method will be of considerable utility in enriching the annotation of gene function and expression in complex multicellular organisms.

  6. Prediction of Gene Expression in Embryonic Structures of Drosophila melanogaster

    PubMed Central

    Samsonova, Anastasia A; Niranjan, Mahesan; Russell, Steven; Brazma, Alvis

    2007-01-01

    Understanding how sets of genes are coordinately regulated in space and time to generate the diversity of cell types that characterise complex metazoans is a major challenge in modern biology. The use of high-throughput approaches, such as large-scale in situ hybridisation and genome-wide expression profiling via DNA microarrays, is beginning to provide insights into the complexities of development. However, in many organisms the collection and annotation of comprehensive in situ localisation data is a difficult and time-consuming task. Here, we present a widely applicable computational approach, integrating developmental time-course microarray data with annotated in situ hybridisation studies, that facilitates the de novo prediction of tissue-specific expression for genes that have no in vivo gene expression localisation data available. Using a classification approach, trained with data from microarray and in situ hybridisation studies of gene expression during Drosophila embryonic development, we made a set of predictions on the tissue-specific expression of Drosophila genes that have not been systematically characterised by in situ hybridisation experiments. The reliability of our predictions is confirmed by literature-derived annotations in FlyBase, by overrepresentation of Gene Ontology biological process annotations, and, in a selected set, by detailed gene-specific studies from the literature. Our novel organism-independent method will be of considerable utility in enriching the annotation of gene function and expression in complex multicellular organisms. PMID:17658945

  7. Re-annotation, improved large-scale assembly and establishment of a catalogue of noncoding loci for the genome of the model brown alga Ectocarpus.

    PubMed

    Cormier, Alexandre; Avia, Komlan; Sterck, Lieven; Derrien, Thomas; Wucher, Valentin; Andres, Gwendoline; Monsoor, Misharl; Godfroy, Olivier; Lipinska, Agnieszka; Perrineau, Marie-Mathilde; Van De Peer, Yves; Hitte, Christophe; Corre, Erwan; Coelho, Susana M; Cock, J Mark

    2017-04-01

    The genome of the filamentous brown alga Ectocarpus was the first to be completely sequenced from within the brown algal group and has served as a key reference genome both for this lineage and for the stramenopiles. We present a complete structural and functional reannotation of the Ectocarpus genome. The large-scale assembly of the Ectocarpus genome was significantly improved and genome-wide gene re-annotation using extensive RNA-seq data improved the structure of 11 108 existing protein-coding genes and added 2030 new loci. A genome-wide analysis of splicing isoforms identified an average of 1.6 transcripts per locus. A large number of previously undescribed noncoding genes were identified and annotated, including 717 loci that produce long noncoding RNAs. Conservation of lncRNAs between Ectocarpus and another brown alga, the kelp Saccharina japonica, suggests that at least a proportion of these loci serve a function. Finally, a large collection of single nucleotide polymorphism-based markers was developed for genetic analyses. These resources are available through an updated and improved genome database. This study significantly improves the utility of the Ectocarpus genome as a high-quality reference for the study of many important aspects of brown algal biology and as a reference for genomic analyses across the stramenopiles. © 2016 The Authors. New Phytologist © 2016 New Phytologist Trust.

  8. A graph-based semantic similarity measure for the gene ontology.

    PubMed

    Alvarez, Marco A; Yan, Changhui

    2011-12-01

    Existing methods for calculating semantic similarities between pairs of Gene Ontology (GO) terms and gene products often rely on external databases like Gene Ontology Annotation (GOA) that annotate gene products using the GO terms. This dependency leads to some limitations in real applications. Here, we present a semantic similarity algorithm (SSA), that relies exclusively on the GO. When calculating the semantic similarity between a pair of input GO terms, SSA takes into account the shortest path between them, the depth of their nearest common ancestor, and a novel similarity score calculated between the definitions of the involved GO terms. In our work, we use SSA to calculate semantic similarities between pairs of proteins by combining pairwise semantic similarities between the GO terms that annotate the involved proteins. The reliability of SSA was evaluated by comparing the resulting semantic similarities between proteins with the functional similarities between proteins derived from expert annotations or sequence similarity. Comparisons with existing state-of-the-art methods showed that SSA is highly competitive with the other methods. SSA provides a reliable measure for semantics similarity independent of external databases of functional-annotation observations.

  9. Functional Annotations of Paralogs: A Blessing and a Curse

    PubMed Central

    Zallot, Rémi; Harrison, Katherine J.; Kolaczkowski, Bryan; de Crécy-Lagard, Valérie

    2016-01-01

    Gene duplication followed by mutation is a classic mechanism of neofunctionalization, producing gene families with functional diversity. In some cases, a single point mutation is sufficient to change the substrate specificity and/or the chemistry performed by an enzyme, making it difficult to accurately separate enzymes with identical functions from homologs with different functions. Because sequence similarity is often used as a basis for assigning functional annotations to genes, non-isofunctional gene families pose a great challenge for genome annotation pipelines. Here we describe how integrating evolutionary and functional information such as genome context, phylogeny, metabolic reconstruction and signature motifs may be required to correctly annotate multifunctional families. These integrative analyses can also lead to the discovery of novel gene functions, as hints from specific subgroups can guide the functional characterization of other members of the family. We demonstrate how careful manual curation processes using comparative genomics can disambiguate subgroups within large multifunctional families and discover their functions. We present the COG0720 protein family as a case study. We also discuss strategies to automate this process to improve the accuracy of genome functional annotation pipelines. PMID:27618105

  10. GoGene: gene annotation in the fast lane.

    PubMed

    Plake, Conrad; Royer, Loic; Winnenburg, Rainer; Hakenberg, Jörg; Schroeder, Michael

    2009-07-01

    High-throughput screens such as microarrays and RNAi screens produce huge amounts of data. They typically result in hundreds of genes, which are often further explored and clustered via enriched GeneOntology terms. The strength of such analyses is that they build on high-quality manual annotations provided with the GeneOntology. However, the weakness is that annotations are restricted to process, function and location and that they do not cover all known genes in model organisms. GoGene addresses this weakness by complementing high-quality manual annotation with high-throughput text mining extracting co-occurrences of genes and ontology terms from literature. GoGene contains over 4,000,000 associations between genes and gene-related terms for 10 model organisms extracted from more than 18,000,000 PubMed entries. It does not cover only process, function and location of genes, but also biomedical categories such as diseases, compounds, techniques and mutations. By bringing it all together, GoGene provides the most recent and most complete facts about genes and can rank them according to novelty and importance. GoGene accepts keywords, gene lists, gene sequences and protein sequences as input and supports search for genes in PubMed, EntrezGene and via BLAST. Since all associations of genes to terms are supported by evidence in the literature, the results are transparent and can be verified by the user. GoGene is available at http://gopubmed.org/gogene.

  11. Inspection of the grapevine BURP superfamily highlights an expansion of RD22 genes with distinctive expression features in berry development and ABA-mediated stress responses.

    PubMed

    Matus, José Tomás; Aquea, Felipe; Espinoza, Carmen; Vega, Andrea; Cavallini, Erika; Dal Santo, Silvia; Cañón, Paola; Rodríguez-Hoces de la Guardia, Amparo; Serrano, Jennifer; Tornielli, Giovanni Battista; Arce-Johnson, Patricio

    2014-01-01

    The RESPONSIVE TO DEHYDRATION 22 (RD22) gene is a molecular link between abscisic acid (ABA) signalling and abiotic stress responses. Its expression has been used as a reliable ABA early response marker. In Arabidopsis, the single copy RD22 gene possesses a BURP domain also located at the C-terminus of USP embryonic proteins and the beta subunit of polygalacturonases. In grapevine, a RD22 gene has been identified but putative paralogs are also found in the grape genome, possibly forming a large RD22 family in this species. In this work, we searched for annotations containing BURP domains in the Vitis vinifera genome. Nineteen proteins were defined by a comparative analysis between the two genome predictions and RNA-Seq data. These sequences were compared to other plant BURPs identified in previous genome surveys allowing us to reconceive group classifications based on phylogenetic relationships and protein motif occurrence. We observed a lineage-specific evolution of the RD22 family, with the biggest expansion in grapevine and poplar. In contrast, rice, sorghum and maize presented highly expanded monocot-specific groups. The Vitis RD22 group may have expanded from segmental duplications as most of its members are confined to a region in chromosome 4. The inspection of transcriptomic data revealed variable expression of BURP genes in vegetative and reproductive organs. Many genes were induced in specific tissues or by abiotic and biotic stresses. Three RD22 genes were further studied showing that they responded oppositely to ABA and to stress conditions. Our results show that the inclusion of RNA-Seq data is essential while describing gene families and improving gene annotations. Robust phylogenetic analyses including all BURP members from other sequenced species helped us redefine previous relationships that were erroneously established. This work provides additional evidence for RD22 genes serving as marker genes for different organs or stresses in grapevine.

  12. GENCODE: the reference human genome annotation for The ENCODE Project.

    PubMed

    Harrow, Jennifer; Frankish, Adam; Gonzalez, Jose M; Tapanari, Electra; Diekhans, Mark; Kokocinski, Felix; Aken, Bronwen L; Barrell, Daniel; Zadissa, Amonida; Searle, Stephen; Barnes, If; Bignell, Alexandra; Boychenko, Veronika; Hunt, Toby; Kay, Mike; Mukherjee, Gaurab; Rajan, Jeena; Despacio-Reyes, Gloria; Saunders, Gary; Steward, Charles; Harte, Rachel; Lin, Michael; Howald, Cédric; Tanzer, Andrea; Derrien, Thomas; Chrast, Jacqueline; Walters, Nathalie; Balasubramanian, Suganthi; Pei, Baikang; Tress, Michael; Rodriguez, Jose Manuel; Ezkurdia, Iakes; van Baren, Jeltje; Brent, Michael; Haussler, David; Kellis, Manolis; Valencia, Alfonso; Reymond, Alexandre; Gerstein, Mark; Guigó, Roderic; Hubbard, Tim J

    2012-09-01

    The GENCODE Consortium aims to identify all gene features in the human genome using a combination of computational analysis, manual annotation, and experimental validation. Since the first public release of this annotation data set, few new protein-coding loci have been added, yet the number of alternative splicing transcripts annotated has steadily increased. The GENCODE 7 release contains 20,687 protein-coding and 9640 long noncoding RNA loci and has 33,977 coding transcripts not represented in UCSC genes and RefSeq. It also has the most comprehensive annotation of long noncoding RNA (lncRNA) loci publicly available with the predominant transcript form consisting of two exons. We have examined the completeness of the transcript annotation and found that 35% of transcriptional start sites are supported by CAGE clusters and 62% of protein-coding genes have annotated polyA sites. Over one-third of GENCODE protein-coding genes are supported by peptide hits derived from mass spectrometry spectra submitted to Peptide Atlas. New models derived from the Illumina Body Map 2.0 RNA-seq data identify 3689 new loci not currently in GENCODE, of which 3127 consist of two exon models indicating that they are possibly unannotated long noncoding loci. GENCODE 7 is publicly available from gencodegenes.org and via the Ensembl and UCSC Genome Browsers.

  13. Identification of functional elements and regulatory circuits by Drosophila modENCODE

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Roy, Sushmita; Ernst, Jason; Kharchenko, Peter V.

    2010-12-22

    To gain insight into how genomic information is translated into cellular and developmental programs, the Drosophila model organism Encyclopedia of DNA Elements (modENCODE) project is comprehensively mapping transcripts, histone modifications, chromosomal proteins, transcription factors, replication proteins and intermediates, and nucleosome properties across a developmental time course and in multiple cell lines. We have generated more than 700 data sets and discovered protein-coding, noncoding, RNA regulatory, replication, and chromatin elements, more than tripling the annotated portion of the Drosophila genome. Correlated activity patterns of these elements reveal a functional regulatory network, which predicts putative new functions for genes, reveals stage- andmore » tissue-specific regulators, and enables gene-expression prediction. Our results provide a foundation for directed experimental and computational studies in Drosophila and related species and also a model for systematic data integration toward comprehensive genomic and functional annotation. Several years after the complete genetic sequencing of many species, it is still unclear how to translate genomic information into a functional map of cellular and developmental programs. The Encyclopedia of DNA Elements (ENCODE) (1) and model organism ENCODE (modENCODE) (2) projects use diverse genomic assays to comprehensively annotate the Homo sapiens (human), Drosophila melanogaster (fruit fly), and Caenorhabditis elegans (worm) genomes, through systematic generation and computational integration of functional genomic data sets. Previous genomic studies in flies have made seminal contributions to our understanding of basic biological mechanisms and genome functions, facilitated by genetic, experimental, computational, and manual annotation of the euchromatic and heterochromatic genome (3), small genome size, short life cycle, and a deep knowledge of development, gene function, and chromosome biology. The functions of {approx}40% of the protein and nonprotein-coding genes [FlyBase 5.12 (4)] have been determined from cDNA collections (5, 6), manual curation of gene models (7), gene mutations and comprehensive genome-wide RNA interference screens (8-10), and comparative genomic analyses (11, 12). The Drosophila modENCODE project has generated more than 700 data sets that profile transcripts, histone modifications and physical nucleosome properties, general and specific transcription factors (TFs), and replication programs in cell lines, isolated tissues, and whole organisms across several developmental stages (Fig. 1). Here, we computationally integrate these data sets and report (i) improved and additional genome annotations, including full-length proteincoding genes and peptides as short as 21 amino acids; (ii) noncoding transcripts, including 132 candidate structural RNAs and 1608 nonstructural transcripts; (iii) additional Argonaute (Ago)-associated small RNA genes and pathways, including new microRNAs (miRNAs) encoded within protein-coding exons and endogenous small interfering RNAs (siRNAs) from 3-inch untranslated regions; (iv) chromatin 'states' defined by combinatorial patterns of 18 chromatin marks that are associated with distinct functions and properties; (v) regions of high TF occupancy and replication activity with likely epigenetic regulation; (vi)mixed TF and miRNA regulatory networks with hierarchical structure and enriched feed-forward loops; (vii) coexpression- and co-regulation-based functional annotations for nearly 3000 genes; (viii) stage- and tissue-specific regulators; and (ix) predictive models of gene expression levels and regulator function.« less

  14. nGASP--the nematode genome annotation assessment project.

    PubMed

    Coghlan, Avril; Fiedler, Tristan J; McKay, Sheldon J; Flicek, Paul; Harris, Todd W; Blasiar, Darin; Stein, Lincoln D

    2008-12-19

    While the C. elegans genome is extensively annotated, relatively little information is available for other Caenorhabditis species. The nematode genome annotation assessment project (nGASP) was launched to objectively assess the accuracy of protein-coding gene prediction software in C. elegans, and to apply this knowledge to the annotation of the genomes of four additional Caenorhabditis species and other nematodes. Seventeen groups worldwide participated in nGASP, and submitted 47 prediction sets across 10 Mb of the C. elegans genome. Predictions were compared to reference gene sets consisting of confirmed or manually curated gene models from WormBase. The most accurate gene-finders were 'combiner' algorithms, which made use of transcript- and protein-alignments and multi-genome alignments, as well as gene predictions from other gene-finders. Gene-finders that used alignments of ESTs, mRNAs and proteins came in second. There was a tie for third place between gene-finders that used multi-genome alignments and ab initio gene-finders. The median gene level sensitivity of combiners was 78% and their specificity was 42%, which is nearly the same accuracy reported for combiners in the human genome. C. elegans genes with exons of unusual hexamer content, as well as those with unusually many exons, short exons, long introns, a weak translation start signal, weak splice sites, or poorly conserved orthologs posed the greatest difficulty for gene-finders. This experiment establishes a baseline of gene prediction accuracy in Caenorhabditis genomes, and has guided the choice of gene-finders for the annotation of newly sequenced genomes of Caenorhabditis and other nematode species. We have created new gene sets for C. briggsae, C. remanei, C. brenneri, C. japonica, and Brugia malayi using some of the best-performing gene-finders.

  15. Sequence- and Structure-Based Functional Annotation and Assessment of Metabolic Transporters in Aspergillus oryzae: A Representative Case Study

    PubMed Central

    Raethong, Nachon; Wong-ekkabut, Jirasak; Laoteng, Kobkul; Vongsangnak, Wanwipa

    2016-01-01

    Aspergillus oryzae is widely used for the industrial production of enzymes. In A. oryzae metabolism, transporters appear to play crucial roles in controlling the flux of molecules for energy generation, nutrients delivery, and waste elimination in the cell. While the A. oryzae genome sequence is available, transporter annotation remains limited and thus the connectivity of metabolic networks is incomplete. In this study, we developed a metabolic annotation strategy to understand the relationship between the sequence, structure, and function for annotation of A. oryzae metabolic transporters. Sequence-based analysis with manual curation showed that 58 genes of 12,096 total genes in the A. oryzae genome encoded metabolic transporters. Under consensus integrative databases, 55 unambiguous metabolic transporter genes were distributed into channels and pores (7 genes), electrochemical potential-driven transporters (33 genes), and primary active transporters (15 genes). To reveal the transporter functional role, a combination of homology modeling and molecular dynamics simulation was implemented to assess the relationship between sequence to structure and structure to function. As in the energy metabolism of A. oryzae, the H+-ATPase encoded by the AO090005000842 gene was selected as a representative case study of multilevel linkage annotation. Our developed strategy can be used for enhancing metabolic network reconstruction. PMID:27274991

  16. Sequence- and Structure-Based Functional Annotation and Assessment of Metabolic Transporters in Aspergillus oryzae: A Representative Case Study.

    PubMed

    Raethong, Nachon; Wong-Ekkabut, Jirasak; Laoteng, Kobkul; Vongsangnak, Wanwipa

    2016-01-01

    Aspergillus oryzae is widely used for the industrial production of enzymes. In A. oryzae metabolism, transporters appear to play crucial roles in controlling the flux of molecules for energy generation, nutrients delivery, and waste elimination in the cell. While the A. oryzae genome sequence is available, transporter annotation remains limited and thus the connectivity of metabolic networks is incomplete. In this study, we developed a metabolic annotation strategy to understand the relationship between the sequence, structure, and function for annotation of A. oryzae metabolic transporters. Sequence-based analysis with manual curation showed that 58 genes of 12,096 total genes in the A. oryzae genome encoded metabolic transporters. Under consensus integrative databases, 55 unambiguous metabolic transporter genes were distributed into channels and pores (7 genes), electrochemical potential-driven transporters (33 genes), and primary active transporters (15 genes). To reveal the transporter functional role, a combination of homology modeling and molecular dynamics simulation was implemented to assess the relationship between sequence to structure and structure to function. As in the energy metabolism of A. oryzae, the H(+)-ATPase encoded by the AO090005000842 gene was selected as a representative case study of multilevel linkage annotation. Our developed strategy can be used for enhancing metabolic network reconstruction.

  17. A genome scale metabolic network for rice and accompanying analysis of tryptophan, auxin and serotonin biosynthesis regulation under biotic stress

    USDA-ARS?s Scientific Manuscript database

    Functional annotations of large plant genome projects mostly provide information on gene function and gene families based on the presence of protein domains and gene homology, but not necessarily in association with gene expression or metabolic and regulatory networks. These additional annotations a...

  18. A new family of β-helix proteins with similarities to the polysaccharide lyases

    DOE PAGES

    Close, Devin W.; D'Angelo, Sara; Bradbury, Andrew R. M.

    2014-09-27

    Microorganisms that degrade biomass produce diverse assortments of carbohydrate-active enzymes and binding modules. Despite tremendous advances in the genomic sequencing of these organisms, many genes do not have an ascribed function owing to low sequence identity to genes that have been annotated. Consequently, biochemical and structural characterization of genes with unknown function is required to complement the rapidly growing pool of genomic sequencing data. A protein with previously unknown function (Cthe_2159) was recently isolated in a genome-wide screen using phage display to identify cellulose-binding protein domains from the biomass-degrading bacterium Clostridium thermocellum. Here, the crystal structure of Cthe_2159 is presentedmore » and it is shown that it is a unique right-handed parallel β-helix protein. Despite very low sequence identity to known β-helix or carbohydrate-active proteins, Cthe_2159 displays structural features that are very similar to those of polysaccharide lyase (PL) families 1, 3, 6 and 9. Cthe_2159 is conserved across bacteria and some archaea and is a member of the domain of unknown function family DUF4353. This suggests that Cthe_2159 is the first representative of a previously unknown family of cellulose and/or acid-sugar binding β-helix proteins that share structural similarities with PLs. More importantly, these results demonstrate how functional annotation by biochemical and structural analysis remains a critical tool in the characterization of new gene products.« less

  19. A new family of β-helix proteins with similarities to the polysaccharide lyases

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Close, Devin W.; D'Angelo, Sara; Bradbury, Andrew R. M.

    Microorganisms that degrade biomass produce diverse assortments of carbohydrate-active enzymes and binding modules. Despite tremendous advances in the genomic sequencing of these organisms, many genes do not have an ascribed function owing to low sequence identity to genes that have been annotated. Consequently, biochemical and structural characterization of genes with unknown function is required to complement the rapidly growing pool of genomic sequencing data. A protein with previously unknown function (Cthe_2159) was recently isolated in a genome-wide screen using phage display to identify cellulose-binding protein domains from the biomass-degrading bacterium Clostridium thermocellum. Here, the crystal structure of Cthe_2159 is presentedmore » and it is shown that it is a unique right-handed parallel β-helix protein. Despite very low sequence identity to known β-helix or carbohydrate-active proteins, Cthe_2159 displays structural features that are very similar to those of polysaccharide lyase (PL) families 1, 3, 6 and 9. Cthe_2159 is conserved across bacteria and some archaea and is a member of the domain of unknown function family DUF4353. This suggests that Cthe_2159 is the first representative of a previously unknown family of cellulose and/or acid-sugar binding β-helix proteins that share structural similarities with PLs. More importantly, these results demonstrate how functional annotation by biochemical and structural analysis remains a critical tool in the characterization of new gene products.« less

  20. The developmental transcriptome of Drosophila melanogaster

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    University of Connecticut; Graveley, Brenton R.; Brooks, Angela N.

    Drosophila melanogaster is one of the most well studied genetic model organisms; nonetheless, its genome still contains unannotated coding and non-coding genes, transcripts, exons and RNA editing sites. Full discovery and annotation are pre-requisites for understanding how the regulation of transcription, splicing and RNA editing directs the development of this complex organism. Here we used RNA-Seq, tiling microarrays and cDNA sequencing to explore the transcriptome in 30 distinct developmental stages. We identified 111,195 new elements, including thousands of genes, coding and non-coding transcripts, exons, splicing and editing events, and inferred protein isoforms that previously eluded discovery using established experimental, predictionmore » and conservation-based approaches. These data substantially expand the number of known transcribed elements in the Drosophila genome and provide a high-resolution view of transcriptome dynamics throughout development. Drosophila melanogaster is an important non-mammalian model system that has had a critical role in basic biological discoveries, such as identifying chromosomes as the carriers of genetic information and uncovering the role of genes in development. Because it shares a substantial genic content with humans, Drosophila is increasingly used as a translational model for human development, homeostasis and disease. High-quality maps are needed for all functional genomic elements. Previous studies demonstrated that a rich collection of genes is deployed during the life cycle of the fly. Although expression profiling using microarrays has revealed the expression of, 13,000 annotated genes, it is difficult to map splice junctions and individual base modifications generated by RNA editing using such approaches. Single-base resolution is essential to define precisely the elements that comprise the Drosophila transcriptome. Estimates of the number of transcript isoforms are less accurate than estimates of the number of genes. Whereas, 20% of Drosophila genes are annotated as encoding alternatively spliced premRNAs, splice-junction microarray experiments indicate that this number is at least 40% (ref. 7). Determining the diversity of mRNAs generated by alternative promoters, alternative splicing and RNA editing will substantially increase the inferred protein repertoire. Non-coding RNA genes (ncRNAs) including short interfering RNAs (siRNAs) and microRNAS (miRNAs) (reviewed in ref. 10), and longer ncRNAs such as bxd (ref. 11) and rox (ref. 12), have important roles in gene regulation, whereas others such as small nucleolar RNAs (snoRNAs)and small nuclear RNAs (snRNAs) are important components of macromolecular machines such as the ribosome and spliceosome. The transcription and processing of these ncRNAs must also be fully documented and mapped. As part of the modENCODE project to annotate the functional elements of the D. melanogaster and Caenorhabditis elegans genomes, we used RNA-Seq and tiling microarrays to sample the Drosophila transcriptome at unprecedented depth throughout development from early embryo to ageing male and female adults. We report on a high-resolution view of the discovery, structure and dynamic expression of the D. melanogaster transcriptome.« less

  1. Improved Annotation of 3′ Untranslated Regions and Complex Loci by Combination of Strand-Specific Direct RNA Sequencing, RNA-Seq and ESTs

    PubMed Central

    Song, Junfang; Duc, Céline; Storey, Kate G.; McLean, W. H. Irwin; Brown, Sara J.; Simpson, Gordon G.; Barton, Geoffrey J.

    2014-01-01

    The reference annotations made for a genome sequence provide the framework for all subsequent analyses of the genome. Correct and complete annotation in addition to the underlying genomic sequence is particularly important when interpreting the results of RNA-seq experiments where short sequence reads are mapped against the genome and assigned to genes according to the annotation. Inconsistencies in annotations between the reference and the experimental system can lead to incorrect interpretation of the effect on RNA expression of an experimental treatment or mutation in the system under study. Until recently, the genome-wide annotation of 3′ untranslated regions received less attention than coding regions and the delineation of intron/exon boundaries. In this paper, data produced for samples in Human, Chicken and A. thaliana by the novel single-molecule, strand-specific, Direct RNA Sequencing technology from Helicos Biosciences which locates 3′ polyadenylation sites to within +/− 2 nt, were combined with archival EST and RNA-Seq data. Nine examples are illustrated where this combination of data allowed: (1) gene and 3′ UTR re-annotation (including extension of one 3′ UTR by 5.9 kb); (2) disentangling of gene expression in complex regions; (3) clearer interpretation of small RNA expression and (4) identification of novel genes. While the specific examples displayed here may become obsolete as genome sequences and their annotations are refined, the principles laid out in this paper will be of general use both to those annotating genomes and those seeking to interpret existing publically available annotations in the context of their own experimental data. PMID:24722185

  2. High-Throughput Sequence Analysis of Turbot (Scophthalmus maximus) Transcriptome Using 454-Pyrosequencing for the Discovery of Antiviral Immune Genes

    PubMed Central

    Pereiro, Patricia; Balseiro, Pablo; Romero, Alejandro; Dios, Sonia; Forn-Cuni, Gabriel; Fuste, Berta; Planas, Josep V.; Beltran, Sergi; Novoa, Beatriz; Figueras, Antonio

    2012-01-01

    Background Turbot (Scophthalmus maximus L.) is an important aquacultural resource both in Europe and Asia. However, there is little information on gene sequences available in public databases. Currently, one of the main problems affecting the culture of this flatfish is mortality due to several pathogens, especially viral diseases which are not treatable. In order to identify new genes involved in immune defense, we conducted 454-pyrosequencing of the turbot transcriptome after different immune stimulations. Methodology/Principal Findings Turbot were injected with viral stimuli to increase the expression level of immune-related genes. High-throughput deep sequencing using 454-pyrosequencing technology yielded 915,256 high-quality reads. These sequences were assembled into 55,404 contigs that were subjected to annotation steps. Intriguingly, 55.16% of the deduced protein was not significantly similar to any sequences in the databases used for the annotation and only 0.85% of the BLASTx top-hits matched S. maximus protein sequences. This relatively low level of annotation is possibly due to the limited information for this specie and other flatfish in the database. These results suggest the identification of a large number of new genes in turbot and in fish in general. A more detailed analysis showed the presence of putative members of several innate and specific immune pathways. Conclusions/Significance To our knowledge, this study is the first transcriptome analysis using 454-pyrosequencing for turbot. Previously, there were only 12,471 EST and less of 1,500 nucleotide sequences for S. maximus in NCBI database. Our results provide a rich source of data (55,404 contigs and 181,845 singletons) for discovering and identifying new genes, which will serve as a basis for microarray construction, gene expression characterization and for identification of genetic markers to be used in several applications. Immune stimulation in turbot was very effective, obtaining an enormous variety of sequences belonging to genes involved in the defense mechanisms. PMID:22629298

  3. Neuroprotective changes of thalamic degeneration-related gene expression by acupuncture in an MPTP mouse model of parkinsonism: microarray analysis.

    PubMed

    Yeo, Sujung; Choi, Yeong-Gon; Hong, Yeon-Mi; Lim, Sabina

    2013-02-25

    Acupuncture stimulations at GB34 and LR3 inhibit the reduction of tyrosine hydroxylase in the nigrostriatal dopaminergic neurons in the parkinsonism animal models. Especially, behavioral tests showed that acupuncture stimulations improved the motor dysfunction in a previous study by almost 87.7%. The thalamus is a crucial area for the motor circuit and has been identified as one of the most markedly damaged areas in Parkinson's disease (PD), so acupuncture stimulations might also have an effect on the thalamic damage. In this study, gene expression changes following acupuncture at the acupoints were investigated in the thalamus of a 1-methyl-4-phenyl-1,2,3,6-tetrahydropyridine (MPTP)-induced parkinsonism model using a whole transcript array. It was confirmed that acupuncture at these acupoints could inhibit the decrease of tyrosine hydroxylase in the thalamic regions of the MPTP model, while acupuncture at the non-acupoints could not suppress this decrease by its level shown in the acupoints. GeneChip gene array analysis showed that 18 (5 annotated genes: Dnase1l2, Dusp4, Mafg, Ndph and Pgm5) of the probes down-regulated in MPTP, as compared to the control, were exclusively up-regulated by acupuncture at the acupoints, but not at the non-acupoints. In addition, 14 (3 annotated genes; Serinc2, Sp2 and Ucp2) of the probes up-regulated in MPTP, as compared to the control, were exclusively down-regulated by acupuncture at the acupoints, but not at the non-acupoints. The expression levels of the representative genes in the microarray were validated by real-time RT-PCR. These results suggest that the 32 probes (8 annotated genes) which are affected by MPTP and acupuncture may be responsible for exerting the inhibitory effect of acupuncture in the thalamus which can be damaged by MPTP intoxication. Copyright © 2012 Elsevier B.V. All rights reserved.

  4. Evaluating Functional Annotations of Enzymes Using the Gene Ontology.

    PubMed

    Holliday, Gemma L; Davidson, Rebecca; Akiva, Eyal; Babbitt, Patricia C

    2017-01-01

    The Gene Ontology (GO) (Ashburner et al., Nat Genet 25(1):25-29, 2000) is a powerful tool in the informatics arsenal of methods for evaluating annotations in a protein dataset. From identifying the nearest well annotated homologue of a protein of interest to predicting where misannotation has occurred to knowing how confident you can be in the annotations assigned to those proteins is critical. In this chapter we explore what makes an enzyme unique and how we can use GO to infer aspects of protein function based on sequence similarity. These can range from identification of misannotation or other errors in a predicted function to accurate function prediction for an enzyme of entirely unknown function. Although GO annotation applies to any gene products, we focus here a describing our approach for hierarchical classification of enzymes in the Structure-Function Linkage Database (SFLD) (Akiva et al., Nucleic Acids Res 42(Database issue):D521-530, 2014) as a guide for informed utilisation of annotation transfer based on GO terms.

  5. APPRIS: annotation of principal and alternative splice isoforms

    PubMed Central

    Rodriguez, Jose Manuel; Maietta, Paolo; Ezkurdia, Iakes; Pietrelli, Alessandro; Wesselink, Jan-Jaap; Lopez, Gonzalo; Valencia, Alfonso; Tress, Michael L.

    2013-01-01

    Here, we present APPRIS (http://appris.bioinfo.cnio.es), a database that houses annotations of human splice isoforms. APPRIS has been designed to provide value to manual annotations of the human genome by adding reliable protein structural and functional data and information from cross-species conservation. The visual representation of the annotations provided by APPRIS for each gene allows annotators and researchers alike to easily identify functional changes brought about by splicing events. In addition to collecting, integrating and analyzing reliable predictions of the effect of splicing events, APPRIS also selects a single reference sequence for each gene, here termed the principal isoform, based on the annotations of structure, function and conservation for each transcript. APPRIS identifies a principal isoform for 85% of the protein-coding genes in the GENCODE 7 release for ENSEMBL. Analysis of the APPRIS data shows that at least 70% of the alternative (non-principal) variants would lose important functional or structural information relative to the principal isoform. PMID:23161672

  6. MorphDB: Prioritizing Genes for Specialized Metabolism Pathways and Gene Ontology Categories in Plants.

    PubMed

    Zwaenepoel, Arthur; Diels, Tim; Amar, David; Van Parys, Thomas; Shamir, Ron; Van de Peer, Yves; Tzfadia, Oren

    2018-01-01

    Recent times have seen an enormous growth of "omics" data, of which high-throughput gene expression data are arguably the most important from a functional perspective. Despite huge improvements in computational techniques for the functional classification of gene sequences, common similarity-based methods often fall short of providing full and reliable functional information. Recently, the combination of comparative genomics with approaches in functional genomics has received considerable interest for gene function analysis, leveraging both gene expression based guilt-by-association methods and annotation efforts in closely related model organisms. Besides the identification of missing genes in pathways, these methods also typically enable the discovery of biological regulators (i.e., transcription factors or signaling genes). A previously built guilt-by-association method is MORPH, which was proven to be an efficient algorithm that performs particularly well in identifying and prioritizing missing genes in plant metabolic pathways. Here, we present MorphDB, a resource where MORPH-based candidate genes for large-scale functional annotations (Gene Ontology, MapMan bins) are integrated across multiple plant species. Besides a gene centric query utility, we present a comparative network approach that enables researchers to efficiently browse MORPH predictions across functional gene sets and species, facilitating efficient gene discovery and candidate gene prioritization. MorphDB is available at http://bioinformatics.psb.ugent.be/webtools/morphdb/morphDB/index/. We also provide a toolkit, named "MORPH bulk" (https://github.com/arzwa/morph-bulk), for running MORPH in bulk mode on novel data sets, enabling researchers to apply MORPH to their own species of interest.

  7. Gene Ontology annotation of the rice blast fungus, Magnaporthe oryzae

    PubMed Central

    Meng, Shaowu; Brown, Douglas E; Ebbole, Daniel J; Torto-Alalibo, Trudy; Oh, Yeon Yee; Deng, Jixin; Mitchell, Thomas K; Dean, Ralph A

    2009-01-01

    Background Magnaporthe oryzae, the causal agent of blast disease of rice, is the most destructive disease of rice worldwide. The genome of this fungal pathogen has been sequenced and an automated annotation has recently been updated to Version 6 . However, a comprehensive manual curation remains to be performed. Gene Ontology (GO) annotation is a valuable means of assigning functional information using standardized vocabulary. We report an overview of the GO annotation for Version 5 of M. oryzae genome assembly. Methods A similarity-based (i.e., computational) GO annotation with manual review was conducted, which was then integrated with a literature-based GO annotation with computational assistance. For similarity-based GO annotation a stringent reciprocal best hits method was used to identify similarity between predicted proteins of M. oryzae and GO proteins from multiple organisms with published associations to GO terms. Significant alignment pairs were manually reviewed. Functional assignments were further cross-validated with manually reviewed data, conserved domains, or data determined by wet lab experiments. Additionally, biological appropriateness of the functional assignments was manually checked. Results In total, 6,286 proteins received GO term assignment via the homology-based annotation, including 2,870 hypothetical proteins. Literature-based experimental evidence, such as microarray, MPSS, T-DNA insertion mutation, or gene knockout mutation, resulted in 2,810 proteins being annotated with GO terms. Of these, 1,673 proteins were annotated with new terms developed for Plant-Associated Microbe Gene Ontology (PAMGO). In addition, 67 experiment-determined secreted proteins were annotated with PAMGO terms. Integration of the two data sets resulted in 7,412 proteins (57%) being annotated with 1,957 distinct and specific GO terms. Unannotated proteins were assigned to the 3 root terms. The Version 5 GO annotation is publically queryable via the GO site . Additionally, the genome of M. oryzae is constantly being refined and updated as new information is incorporated. For the latest GO annotation of Version 6 genome, please visit our website . The preliminary GO annotation of Version 6 genome is placed at a local MySql database that is publically queryable via a user-friendly interface Adhoc Query System. Conclusion Our analysis provides comprehensive and robust GO annotations of the M. oryzae genome assemblies that will be solid foundations for further functional interrogation of M. oryzae. PMID:19278556

  8. On the Use of Gene Ontology Annotations to Assess Functional Similarity among Orthologs and Paralogs: A Short Report

    PubMed Central

    Thomas, Paul D.; Wood, Valerie; Mungall, Christopher J.; Lewis, Suzanna E.; Blake, Judith A.

    2012-01-01

    A recent paper (Nehrt et al., PLoS Comput. Biol. 7:e1002073, 2011) has proposed a metric for the “functional similarity” between two genes that uses only the Gene Ontology (GO) annotations directly derived from published experimental results. Applying this metric, the authors concluded that paralogous genes within the mouse genome or the human genome are more functionally similar on average than orthologous genes between these genomes, an unexpected result with broad implications if true. We suggest, based on both theoretical and empirical considerations, that this proposed metric should not be interpreted as a functional similarity, and therefore cannot be used to support any conclusions about the “ortholog conjecture” (or, more properly, the “ortholog functional conservation hypothesis”). First, we reexamine the case studies presented by Nehrt et al. as examples of orthologs with divergent functions, and come to a very different conclusion: they actually exemplify how GO annotations for orthologous genes provide complementary information about conserved biological functions. We then show that there is a global ascertainment bias in the experiment-based GO annotations for human and mouse genes: particular types of experiments tend to be performed in different model organisms. We conclude that the reported statistical differences in annotations between pairs of orthologous genes do not reflect differences in biological function, but rather complementarity in experimental approaches. Our results underscore two general considerations for researchers proposing novel types of analysis based on the GO: 1) that GO annotations are often incomplete, potentially in a biased manner, and subject to an “open world assumption” (absence of an annotation does not imply absence of a function), and 2) that conclusions drawn from a novel, large-scale GO analysis should whenever possible be supported by careful, in-depth examination of examples, to help ensure the conclusions have a justifiable biological basis. PMID:22359495

  9. The Representation of Heart Development in the Gene Ontology

    PubMed Central

    Khodiyar, Varsha K.; Hill, David P.; Howe, Doug; Berardini, Tanya Z.; Tweedie, Susan; Talmud, Philippa J.; Breckenridge, Ross; Bhattarcharya, Shoumo; Riley, Paul; Scambler, Peter; Lovering, Ruth C.

    2012-01-01

    An understanding of heart development is critical in any systems biology approach to cardiovascular disease. The interpretation of data generated from high-throughput technologies (such as microarray and proteomics) is also essential to this approach. However, characterizing the role of genes in the processes underlying heart development and cardiovascular disease involves the non-trivial task of data analysis and integration of previous knowledge. The Gene Ontology (GO) Consortium provides structured controlled biological vocabularies that are used to summarize previous functional knowledge for gene products across all species. One aspect of GO describes biological processes, such as development and signaling. In order to support high-throughput cardiovascular research, we have initiated an effort to fully describe heart development in GO; expanding the number of GO terms describing heart development from 12 to over 280. This new ontology describes heart morphogenesis, the differentiation of specific cardiac cell types, and the involvement of signaling pathways in heart development and aligns GO with the current views of the heart development research community and its representation in the literature. This extension of GO allows gene product annotators to comprehensively capture the genetic program leading to the developmental progression of the heart. This will enable users to integrate heart development data across species, resulting in the comprehensive retrieval of information about this subject. The revised GO structure, combined with gene product annotations, should improve the interpretation of data from high-throughput methods in a variety of cardiovascular research areas, including heart development, congenital cardiac disease, and cardiac stem cell research. Additionally, we invite the heart development community to contribute to the expansion of this important dataset for the benefit of future research in this area. PMID:21419760

  10. RNA-Seq of Arabidopsis Pollen Uncovers Novel Transcription and Alternative Splicing1[C][W][OA

    PubMed Central

    Loraine, Ann E.; McCormick, Sheila; Estrada, April; Patel, Ketan; Qin, Peng

    2013-01-01

    Pollen grains of Arabidopsis (Arabidopsis thaliana) contain two haploid sperm cells enclosed in a haploid vegetative cell. Upon germination, the vegetative cell extrudes a pollen tube that carries the sperm to an ovule for fertilization. Knowing the identity, relative abundance, and splicing patterns of pollen transcripts will improve our understanding of pollen and allow investigation of tissue-specific splicing in plants. Most Arabidopsis pollen transcriptome studies have used the ATH1 microarray, which does not assay splice variants and lacks specific probe sets for many genes. To investigate the pollen transcriptome, we performed high-throughput sequencing (RNA-Seq) of Arabidopsis pollen and seedlings for comparison. Gene expression was more diverse in seedling, and genes involved in cell wall biogenesis were highly expressed in pollen. RNA-Seq detected at least 4,172 protein-coding genes expressed in pollen, including 289 assayed only by nonspecific probe sets. Additional exons and previously unannotated 5′ and 3′ untranslated regions for pollen-expressed genes were revealed. We detected regions in the genome not previously annotated as expressed; 14 were tested and 12 were confirmed by polymerase chain reaction. Gapped read alignments revealed 1,908 high-confidence new splicing events supported by 10 or more spliced read alignments. Alternative splicing patterns in pollen and seedling were highly correlated. For most alternatively spliced genes, the ratio of variants in pollen and seedling was similar, except for some encoding proteins involved in RNA splicing. This study highlights the robustness of splicing patterns in plants and the importance of ongoing annotation and visualization of RNA-Seq data using interactive tools such as Integrated Genome Browser. PMID:23590974

  11. Simultaneous gene finding in multiple genomes.

    PubMed

    König, Stefanie; Romoth, Lars W; Gerischer, Lizzy; Stanke, Mario

    2016-11-15

    As the tree of life is populated with sequenced genomes ever more densely, the new challenge is the accurate and consistent annotation of entire clades of genomes. We address this problem with a new approach to comparative gene finding that takes a multiple genome alignment of closely related species and simultaneously predicts the location and structure of protein-coding genes in all input genomes, thereby exploiting negative selection and sequence conservation. The model prefers potential gene structures in the different genomes that are in agreement with each other, or-if not-where the exon gains and losses are plausible given the species tree. We formulate the multi-species gene finding problem as a binary labeling problem on a graph. The resulting optimization problem is NP hard, but can be efficiently approximated using a subgradient-based dual decomposition approach. The proposed method was tested on whole-genome alignments of 12 vertebrate and 12 Drosophila species. The accuracy was evaluated for human, mouse and Drosophila melanogaster and compared to competing methods. Results suggest that our method is well-suited for annotation of (a large number of) genomes of closely related species within a clade, in particular, when RNA-Seq data are available for many of the genomes. The transfer of existing annotations from one genome to another via the genome alignment is more accurate than previous approaches that are based on protein-spliced alignments, when the genomes are at close to medium distances. The method is implemented in C ++ as part of Augustus and available open source at http://bioinf.uni-greifswald.de/augustus/ CONTACT: stefaniekoenig@ymail.com or mario.stanke@uni-greifswald.deSupplementary information: Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.

  12. Maize GO annotation—methods, evaluation, and review (maize-GAMER)

    USDA-ARS?s Scientific Manuscript database

    We created a new high-coverage, robust, and reproducible functional annotation of maize protein-coding genes based on Gene Ontology (GO) term assignments. Whereas the existing Phytozome and Gramene maize GO annotation sets only cover 41% and 56% of maize protein-coding genes, respectively, this stu...

  13. The De Novo Transcriptome and Its Functional Annotation in the Seed Beetle Callosobruchus maculatus.

    PubMed

    Sayadi, Ahmed; Immonen, Elina; Bayram, Helen; Arnqvist, Göran

    2016-01-01

    Despite their unparalleled biodiversity, the genomic resources available for beetles (Coleoptera) remain relatively scarce. We present an integrative and high quality annotated transcriptome of the beetle Callosobruchus maculatus, an important and cosmopolitan agricultural pest as well as an emerging model species in ecology and evolutionary biology. Using Illumina sequencing technology, we sequenced 492 million read pairs generated from 51 samples of different developmental stages (larvae, pupae and adults) of C. maculatus. Reads were de novo assembled using the Trinity software, into a single combined assembly as well as into three separate assemblies based on data from the different developmental stages. The combined assembly generated 218,192 transcripts and 145,883 putative genes. Putative genes were annotated with the Blast2GO software and the Trinotate pipeline. In total, 33,216 putative genes were successfully annotated using Blastx against the Nr (non-redundant) database and 13,382 were assigned to 34,100 Gene Ontology (GO) terms. We classified 5,475 putative genes into Clusters of Orthologous Groups (COG) and 116 metabolic pathways maps were predicted based on the annotation. Our analyses suggested that the transcriptional specificity increases with ontogeny. For example, out of 33,216 annotated putative genes, 51 were only expressed in larvae, 63 only in pupae and 171 only in adults. Our study illustrates the importance of including samples from several developmental stages when the aim is to provide an integrative and high quality annotated transcriptome. Our results will represent an invaluable resource for those working with the ecology, evolution and pest control of C. maculatus, as well for comparative studies of the transcriptomics and genomics of beetles more generally.

  14. The De Novo Transcriptome and Its Functional Annotation in the Seed Beetle Callosobruchus maculatus

    PubMed Central

    Sayadi, Ahmed; Immonen, Elina; Bayram, Helen

    2016-01-01

    Despite their unparalleled biodiversity, the genomic resources available for beetles (Coleoptera) remain relatively scarce. We present an integrative and high quality annotated transcriptome of the beetle Callosobruchus maculatus, an important and cosmopolitan agricultural pest as well as an emerging model species in ecology and evolutionary biology. Using Illumina sequencing technology, we sequenced 492 million read pairs generated from 51 samples of different developmental stages (larvae, pupae and adults) of C. maculatus. Reads were de novo assembled using the Trinity software, into a single combined assembly as well as into three separate assemblies based on data from the different developmental stages. The combined assembly generated 218,192 transcripts and 145,883 putative genes. Putative genes were annotated with the Blast2GO software and the Trinotate pipeline. In total, 33,216 putative genes were successfully annotated using Blastx against the Nr (non-redundant) database and 13,382 were assigned to 34,100 Gene Ontology (GO) terms. We classified 5,475 putative genes into Clusters of Orthologous Groups (COG) and 116 metabolic pathways maps were predicted based on the annotation. Our analyses suggested that the transcriptional specificity increases with ontogeny. For example, out of 33,216 annotated putative genes, 51 were only expressed in larvae, 63 only in pupae and 171 only in adults. Our study illustrates the importance of including samples from several developmental stages when the aim is to provide an integrative and high quality annotated transcriptome. Our results will represent an invaluable resource for those working with the ecology, evolution and pest control of C. maculatus, as well for comparative studies of the transcriptomics and genomics of beetles more generally. PMID:27442123

  15. The Association of Multiple Interacting Genes with Specific Phenotypes in Rice Using Gene Coexpression Networks1[C][W][OA

    PubMed Central

    Ficklin, Stephen P.; Luo, Feng; Feltus, F. Alex

    2010-01-01

    Discovering gene sets underlying the expression of a given phenotype is of great importance, as many phenotypes are the result of complex gene-gene interactions. Gene coexpression networks, built using a set of microarray samples as input, can help elucidate tightly coexpressed gene sets (modules) that are mixed with genes of known and unknown function. Functional enrichment analysis of modules further subdivides the coexpressed gene set into cofunctional gene clusters that may coexist in the module with other functionally related gene clusters. In this study, 45 coexpressed gene modules and 76 cofunctional gene clusters were discovered for rice (Oryza sativa) using a global, knowledge-independent paradigm and the combination of two network construction methodologies. Some clusters were enriched for previously characterized mutant phenotypes, providing evidence for specific gene sets (and their annotated molecular functions) that underlie specific phenotypes. PMID:20668062

  16. The association of multiple interacting genes with specific phenotypes in rice using gene coexpression networks.

    PubMed

    Ficklin, Stephen P; Luo, Feng; Feltus, F Alex

    2010-09-01

    Discovering gene sets underlying the expression of a given phenotype is of great importance, as many phenotypes are the result of complex gene-gene interactions. Gene coexpression networks, built using a set of microarray samples as input, can help elucidate tightly coexpressed gene sets (modules) that are mixed with genes of known and unknown function. Functional enrichment analysis of modules further subdivides the coexpressed gene set into cofunctional gene clusters that may coexist in the module with other functionally related gene clusters. In this study, 45 coexpressed gene modules and 76 cofunctional gene clusters were discovered for rice (Oryza sativa) using a global, knowledge-independent paradigm and the combination of two network construction methodologies. Some clusters were enriched for previously characterized mutant phenotypes, providing evidence for specific gene sets (and their annotated molecular functions) that underlie specific phenotypes.

  17. Analysis of transcriptomes of three orb-web spider species reveals gene profiles involved in silk and toxin.

    PubMed

    Zhao, Ying-Jun; Zeng, Yan; Chen, Lei; Dong, Yang; Wang, Wen

    2014-12-01

    As an ancient arthropod with a history of 390 million years, spiders evolved numerous morphological forms resulting from adaptation to different environments. The venom and silk of spiders, which have promising commercial applications in agriculture, medicine and engineering fields, are of special interests to researchers. However, little is known about their genomic components, which hinders not only understanding spider biology but also utilizing their valuable genes. Here we report on deep sequenced and de novo assembled transcriptomes of three orb-web spider species, Gasteracantha arcuata, Nasoonaria sinensis and Gasteracantha hasselti which are distributed in tropical forests of south China. With Illumina paired-end RNA-seq technology, 54 871, 101 855 and 75 455 unigenes for the three spider species were obtained, respectively, among which 9 300, 10 001 and 10 494 unique genes are annotated, respectively. From these annotated unigenes, we comprehensively analyzed silk and toxin gene components and structures for the three spider species. Our study provides valuable transcriptome data for three spider species which previously lacked any genetic/genomic data. The results have laid the first fundamental genomic basis for exploiting gene resources from these spiders. © 2013 Institute of Zoology, Chinese Academy of Sciences.

  18. Leveraging Comparative Genomics to Identify and Functionally Characterize Genes Associated with Sperm Phenotypes in Python bivittatus (Burmese Python)

    PubMed Central

    Rutllant, Josep

    2016-01-01

    Comparative genomics approaches provide a means of leveraging functional genomics information from a highly annotated model organism's genome (such as the mouse genome) in order to make physiological inferences about the role of genes and proteins in a less characterized organism's genome (such as the Burmese python). We employed a comparative genomics approach to produce the functional annotation of Python bivittatus genes encoding proteins associated with sperm phenotypes. We identify 129 gene-phenotype relationships in the python which are implicated in 10 specific sperm phenotypes. Results obtained through our systematic analysis identified subsets of python genes exhibiting associations with gene ontology annotation terms. Functional annotation data was represented in a semantic scatter plot. Together, these newly annotated Python bivittatus genome resources provide a high resolution framework from which the biology relating to reptile spermatogenesis, fertility, and reproduction can be further investigated. Applications of our research include (1) production of genetic diagnostics for assessing fertility in domestic and wild reptiles; (2) enhanced assisted reproduction technology for endangered and captive reptiles; and (3) novel molecular targets for biotechnology-based approaches aimed at reducing fertility and reproduction of invasive reptiles. Additional enhancements to reptile genomic resources will further enhance their value. PMID:27200191

  19. Leveraging Comparative Genomics to Identify and Functionally Characterize Genes Associated with Sperm Phenotypes in Python bivittatus (Burmese Python).

    PubMed

    Irizarry, Kristopher J L; Rutllant, Josep

    2016-01-01

    Comparative genomics approaches provide a means of leveraging functional genomics information from a highly annotated model organism's genome (such as the mouse genome) in order to make physiological inferences about the role of genes and proteins in a less characterized organism's genome (such as the Burmese python). We employed a comparative genomics approach to produce the functional annotation of Python bivittatus genes encoding proteins associated with sperm phenotypes. We identify 129 gene-phenotype relationships in the python which are implicated in 10 specific sperm phenotypes. Results obtained through our systematic analysis identified subsets of python genes exhibiting associations with gene ontology annotation terms. Functional annotation data was represented in a semantic scatter plot. Together, these newly annotated Python bivittatus genome resources provide a high resolution framework from which the biology relating to reptile spermatogenesis, fertility, and reproduction can be further investigated. Applications of our research include (1) production of genetic diagnostics for assessing fertility in domestic and wild reptiles; (2) enhanced assisted reproduction technology for endangered and captive reptiles; and (3) novel molecular targets for biotechnology-based approaches aimed at reducing fertility and reproduction of invasive reptiles. Additional enhancements to reptile genomic resources will further enhance their value.

  20. Creating reference gene annotation for the mouse C57BL6/J genome assembly.

    PubMed

    Mudge, Jonathan M; Harrow, Jennifer

    2015-10-01

    Annotation on the reference genome of the C57BL6/J mouse has been an ongoing project ever since the draft genome was first published. Initially, the principle focus was on the identification of all protein-coding genes, although today the importance of describing long non-coding RNAs, small RNAs, and pseudogenes is recognized. Here, we describe the progress of the GENCODE mouse annotation project, which combines manual annotation from the HAVANA group with Ensembl computational annotation, alongside experimental and in silico validation pipelines from other members of the consortium. We discuss the more recent incorporation of next-generation sequencing datasets into this workflow, including the usage of mass-spectrometry data to potentially identify novel protein-coding genes. Finally, we will outline how the C57BL6/J genebuild can be used to gain insights into the variant sites that distinguish different mouse strains and species.

  1. Enhanced annotations and features for comparing thousands of Pseudomonas genomes in the Pseudomonas genome database.

    PubMed

    Winsor, Geoffrey L; Griffiths, Emma J; Lo, Raymond; Dhillon, Bhavjinder K; Shay, Julie A; Brinkman, Fiona S L

    2016-01-04

    The Pseudomonas Genome Database (http://www.pseudomonas.com) is well known for the application of community-based annotation approaches for producing a high-quality Pseudomonas aeruginosa PAO1 genome annotation, and facilitating whole-genome comparative analyses with other Pseudomonas strains. To aid analysis of potentially thousands of complete and draft genome assemblies, this database and analysis platform was upgraded to integrate curated genome annotations and isolate metadata with enhanced tools for larger scale comparative analysis and visualization. Manually curated gene annotations are supplemented with improved computational analyses that help identify putative drug targets and vaccine candidates or assist with evolutionary studies by identifying orthologs, pathogen-associated genes and genomic islands. The database schema has been updated to integrate isolate metadata that will facilitate more powerful analysis of genomes across datasets in the future. We continue to place an emphasis on providing high-quality updates to gene annotations through regular review of the scientific literature and using community-based approaches including a major new Pseudomonas community initiative for the assignment of high-quality gene ontology terms to genes. As we further expand from thousands of genomes, we plan to provide enhancements that will aid data visualization and analysis arising from whole-genome comparative studies including more pan-genome and population-based approaches. © The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research.

  2. Computational prediction of over-annotated protein-coding genes in the genome of Agrobacterium tumefaciens strain C58

    NASA Astrophysics Data System (ADS)

    Yu, Jia-Feng; Sui, Tian-Xiang; Wang, Hong-Mei; Wang, Chun-Ling; Jing, Li; Wang, Ji-Hua

    2015-12-01

    Agrobacterium tumefaciens strain C58 is a type of pathogen that can cause tumors in some dicotyledonous plants. Ever since the genome of A. tumefaciens strain C58 was sequenced, the quality of annotation of its protein-coding genes has been queried continually, because the annotation varies greatly among different databases. In this paper, the questionable hypothetical genes were re-predicted by integrating the TN curve and Z curve methods. As a result, 30 genes originally annotated as “hypothetical” were discriminated as being non-coding sequences. By testing the re-prediction program 10 times on data sets composed of the function-known genes, the mean accuracy of 99.99% and mean Matthews correlation coefficient value of 0.9999 were obtained. Further sequence analysis and COG analysis showed that the re-annotation results were very reliable. This work can provide an efficient tool and data resources for future studies of A. tumefaciens strain C58. Project supported by the National Natural Science Foundation of China (Grant Nos. 61302186 and 61271378) and the Funding from the State Key Laboratory of Bioelectronics of Southeast University.

  3. Improved annotation through genome-scale metabolic modeling of Aspergillus oryzae

    PubMed Central

    Vongsangnak, Wanwipa; Olsen, Peter; Hansen, Kim; Krogsgaard, Steen; Nielsen, Jens

    2008-01-01

    Background Since ancient times the filamentous fungus Aspergillus oryzae has been used in the fermentation industry for the production of fermented sauces and the production of industrial enzymes. Recently, the genome sequence of A. oryzae with 12,074 annotated genes was released but the number of hypothetical proteins accounted for more than 50% of the annotated genes. Considering the industrial importance of this fungus, it is therefore valuable to improve the annotation and further integrate genomic information with biochemical and physiological information available for this microorganism and other related fungi. Here we proposed the gene prediction by construction of an A. oryzae Expressed Sequence Tag (EST) library, sequencing and assembly. We enhanced the function assignment by our developed annotation strategy. The resulting better annotation was used to reconstruct the metabolic network leading to a genome scale metabolic model of A. oryzae. Results Our assembled EST sequences we identified 1,046 newly predicted genes in the A. oryzae genome. Furthermore, it was possible to assign putative protein functions to 398 of the newly predicted genes. Noteworthy, our annotation strategy resulted in assignment of new putative functions to 1,469 hypothetical proteins already present in the A. oryzae genome database. Using the substantially improved annotated genome we reconstructed the metabolic network of A. oryzae. This network contains 729 enzymes, 1,314 enzyme-encoding genes, 1,073 metabolites and 1,846 (1,053 unique) biochemical reactions. The metabolic reactions are compartmentalized into the cytosol, the mitochondria, the peroxisome and the extracellular space. Transport steps between the compartments and the extracellular space represent 281 reactions, of which 161 are unique. The metabolic model was validated and shown to correctly describe the phenotypic behavior of A. oryzae grown on different carbon sources. Conclusion A much enhanced annotation of the A. oryzae genome was performed and a genome-scale metabolic model of A. oryzae was reconstructed. The model accurately predicted the growth and biomass yield on different carbon sources. The model serves as an important resource for gaining further insight into our understanding of A. oryzae physiology. PMID:18500999

  4. Alternative splicing and differential gene expression in colon cancer detected by a whole genome exon array

    PubMed Central

    Gardina, Paul J; Clark, Tyson A; Shimada, Brian; Staples, Michelle K; Yang, Qing; Veitch, James; Schweitzer, Anthony; Awad, Tarif; Sugnet, Charles; Dee, Suzanne; Davies, Christopher; Williams, Alan; Turpaz, Yaron

    2006-01-01

    Background Alternative splicing is a mechanism for increasing protein diversity by excluding or including exons during post-transcriptional processing. Alternatively spliced proteins are particularly relevant in oncology since they may contribute to the etiology of cancer, provide selective drug targets, or serve as a marker set for cancer diagnosis. While conventional identification of splice variants generally targets individual genes, we present here a new exon-centric array (GeneChip Human Exon 1.0 ST) that allows genome-wide identification of differential splice variation, and concurrently provides a flexible and inclusive analysis of gene expression. Results We analyzed 20 paired tumor-normal colon cancer samples using a microarray designed to detect over one million putative exons that can be virtually assembled into potential gene-level transcripts according to various levels of prior supporting evidence. Analysis of high confidence (empirically supported) transcripts identified 160 differentially expressed genes, with 42 genes occupying a network impacting cell proliferation and another twenty nine genes with unknown functions. A more speculative analysis, including transcripts based solely on computational prediction, produced another 160 differentially expressed genes, three-fourths of which have no previous annotation. We also present a comparison of gene signal estimations from the Exon 1.0 ST and the U133 Plus 2.0 arrays. Novel splicing events were predicted by experimental algorithms that compare the relative contribution of each exon to the cognate transcript intensity in each tissue. The resulting candidate splice variants were validated with RT-PCR. We found nine genes that were differentially spliced between colon tumors and normal colon tissues, several of which have not been previously implicated in cancer. Top scoring candidates from our analysis were also found to substantially overlap with EST-based bioinformatic predictions of alternative splicing in cancer. Conclusion Differential expression of high confidence transcripts correlated extremely well with known cancer genes and pathways, suggesting that the more speculative transcripts, largely based solely on computational prediction and mostly with no previous annotation, might be novel targets in colon cancer. Five of the identified splicing events affect mediators of cytoskeletal organization (ACTN1, VCL, CALD1, CTTN, TPM1), two affect extracellular matrix proteins (FN1, COL6A3) and another participates in integrin signaling (SLC3A2). Altogether they form a pattern of colon-cancer specific alterations that may particularly impact cell motility. PMID:17192196

  5. Neurocarta: aggregating and sharing disease-gene relations for the neurosciences.

    PubMed

    Portales-Casamar, Elodie; Ch'ng, Carolyn; Lui, Frances; St-Georges, Nicolas; Zoubarev, Anton; Lai, Artemis Y; Lee, Mark; Kwok, Cathy; Kwok, Willie; Tseng, Luchia; Pavlidis, Paul

    2013-02-26

    Understanding the genetic basis of diseases is key to the development of better diagnoses and treatments. Unfortunately, only a small fraction of the existing data linking genes to phenotypes is available through online public resources and, when available, it is scattered across multiple access tools. Neurocarta is a knowledgebase that consolidates information on genes and phenotypes across multiple resources and allows tracking and exploring of the associations. The system enables automatic and manual curation of evidence supporting each association, as well as user-enabled entry of their own annotations. Phenotypes are recorded using controlled vocabularies such as the Disease Ontology to facilitate computational inference and linking to external data sources. The gene-to-phenotype associations are filtered by stringent criteria to focus on the annotations most likely to be relevant. Neurocarta is constantly growing and currently holds more than 30,000 lines of evidence linking over 7,000 genes to 2,000 different phenotypes. Neurocarta is a one-stop shop for researchers looking for candidate genes for any disorder of interest. In Neurocarta, they can review the evidence linking genes to phenotypes and filter out the evidence they're not interested in. In addition, researchers can enter their own annotations from their experiments and analyze them in the context of existing public annotations. Neurocarta's in-depth annotation of neurodevelopmental disorders makes it a unique resource for neuroscientists working on brain development.

  6. Functional cohesion of gene sets determined by latent semantic indexing of PubMed abstracts.

    PubMed

    Xu, Lijing; Furlotte, Nicholas; Lin, Yunyue; Heinrich, Kevin; Berry, Michael W; George, Ebenezer O; Homayouni, Ramin

    2011-04-14

    High-throughput genomic technologies enable researchers to identify genes that are co-regulated with respect to specific experimental conditions. Numerous statistical approaches have been developed to identify differentially expressed genes. Because each approach can produce distinct gene sets, it is difficult for biologists to determine which statistical approach yields biologically relevant gene sets and is appropriate for their study. To address this issue, we implemented Latent Semantic Indexing (LSI) to determine the functional coherence of gene sets. An LSI model was built using over 1 million Medline abstracts for over 20,000 mouse and human genes annotated in Entrez Gene. The gene-to-gene LSI-derived similarities were used to calculate a literature cohesion p-value (LPv) for a given gene set using a Fisher's exact test. We tested this method against genes in more than 6,000 functional pathways annotated in Gene Ontology (GO) and found that approximately 75% of gene sets in GO biological process category and 90% of the gene sets in GO molecular function and cellular component categories were functionally cohesive (LPv<0.05). These results indicate that the LPv methodology is both robust and accurate. Application of this method to previously published microarray datasets demonstrated that LPv can be helpful in selecting the appropriate feature extraction methods. To enable real-time calculation of LPv for mouse or human gene sets, we developed a web tool called Gene-set Cohesion Analysis Tool (GCAT). GCAT can complement other gene set enrichment approaches by determining the overall functional cohesion of data sets, taking into account both explicit and implicit gene interactions reported in the biomedical literature. GCAT is freely available at http://binf1.memphis.edu/gcat.

  7. xGDBvm: A Web GUI-Driven Workflow for Annotating Eukaryotic Genomes in the Cloud[OPEN

    PubMed Central

    Merchant, Nirav

    2016-01-01

    Genome-wide annotation of gene structure requires the integration of numerous computational steps. Currently, annotation is arguably best accomplished through collaboration of bioinformatics and domain experts, with broad community involvement. However, such a collaborative approach is not scalable at today’s pace of sequence generation. To address this problem, we developed the xGDBvm software, which uses an intuitive graphical user interface to access a number of common genome analysis and gene structure tools, preconfigured in a self-contained virtual machine image. Once their virtual machine instance is deployed through iPlant’s Atmosphere cloud services, users access the xGDBvm workflow via a unified Web interface to manage inputs, set program parameters, configure links to high-performance computing (HPC) resources, view and manage output, apply analysis and editing tools, or access contextual help. The xGDBvm workflow will mask the genome, compute spliced alignments from transcript and/or protein inputs (locally or on a remote HPC cluster), predict gene structures and gene structure quality, and display output in a public or private genome browser complete with accessory tools. Problematic gene predictions are flagged and can be reannotated using the integrated yrGATE annotation tool. xGDBvm can also be configured to append or replace existing data or load precomputed data. Multiple genomes can be annotated and displayed, and outputs can be archived for sharing or backup. xGDBvm can be adapted to a variety of use cases including de novo genome annotation, reannotation, comparison of different annotations, and training or teaching. PMID:27020957

  8. xGDBvm: A Web GUI-Driven Workflow for Annotating Eukaryotic Genomes in the Cloud.

    PubMed

    Duvick, Jon; Standage, Daniel S; Merchant, Nirav; Brendel, Volker P

    2016-04-01

    Genome-wide annotation of gene structure requires the integration of numerous computational steps. Currently, annotation is arguably best accomplished through collaboration of bioinformatics and domain experts, with broad community involvement. However, such a collaborative approach is not scalable at today's pace of sequence generation. To address this problem, we developed the xGDBvm software, which uses an intuitive graphical user interface to access a number of common genome analysis and gene structure tools, preconfigured in a self-contained virtual machine image. Once their virtual machine instance is deployed through iPlant's Atmosphere cloud services, users access the xGDBvm workflow via a unified Web interface to manage inputs, set program parameters, configure links to high-performance computing (HPC) resources, view and manage output, apply analysis and editing tools, or access contextual help. The xGDBvm workflow will mask the genome, compute spliced alignments from transcript and/or protein inputs (locally or on a remote HPC cluster), predict gene structures and gene structure quality, and display output in a public or private genome browser complete with accessory tools. Problematic gene predictions are flagged and can be reannotated using the integrated yrGATE annotation tool. xGDBvm can also be configured to append or replace existing data or load precomputed data. Multiple genomes can be annotated and displayed, and outputs can be archived for sharing or backup. xGDBvm can be adapted to a variety of use cases including de novo genome annotation, reannotation, comparison of different annotations, and training or teaching. © 2016 American Society of Plant Biologists. All rights reserved.

  9. Genome Wide Re-Annotation of Caldicellulosiruptor saccharolyticus with New Insights into Genes Involved in Biomass Degradation and Hydrogen Production

    PubMed Central

    Chowdhary, Nupoor; Selvaraj, Ashok; KrishnaKumaar, Lakshmi; Kumar, Gopal Ramesh

    2015-01-01

    Caldicellulosiruptor saccharolyticus has proven itself to be an excellent candidate for biological hydrogen (H2) production, but still it has major drawbacks like sensitivity to high osmotic pressure and low volumetric H2 productivity, which should be considered before it can be used industrially. A whole genome re-annotation work has been carried out as an attempt to update the incomplete genome information that causes gap in the knowledge especially in the area of metabolic engineering, to improve the H2 producing capabilities of C. saccharolyticus. Whole genome re-annotation was performed through manual means for 2,682 Coding Sequences (CDSs). Bioinformatics tools based on sequence similarity, motif search, phylogenetic analysis and fold recognition were employed for re-annotation. Our methodology could successfully add functions for 409 hypothetical proteins (HPs), 46 proteins previously annotated as putative and assigned more accurate functions for the known protein sequences. Homology based gene annotation has been used as a standard method for assigning function to novel proteins, but over the past few years many non-homology based methods such as genomic context approaches for protein function prediction have been developed. Using non-homology based functional prediction methods, we were able to assign cellular processes or physical complexes for 249 hypothetical sequences. Our re-annotation pipeline highlights the addition of 231 new CDSs generated from MicroScope Platform, to the original genome with functional prediction for 49 of them. The re-annotation of HPs and new CDSs is stored in the relational database that is available on the MicroScope web-based platform. In parallel, a comparative genome analyses were performed among the members of genus Caldicellulosiruptor to understand the function and evolutionary processes. Further, with results from integrated re-annotation studies (homology and genomic context approach), we strongly suggest that Csac_0437 and Csac_0424 encode for glycoside hydrolases (GH) and are proposed to be involved in the decomposition of recalcitrant plant polysaccharides. Similarly, HPs: Csac_0732, Csac_1862, Csac_1294 and Csac_0668 are suggested to play a significant role in biohydrogen production. Function prediction of these HPs by using our integrated approach will considerably enhance the interpretation of large-scale experiments targeting this industrially important organism. PMID:26196387

  10. Genome Wide Re-Annotation of Caldicellulosiruptor saccharolyticus with New Insights into Genes Involved in Biomass Degradation and Hydrogen Production.

    PubMed

    Chowdhary, Nupoor; Selvaraj, Ashok; KrishnaKumaar, Lakshmi; Kumar, Gopal Ramesh

    2015-01-01

    Caldicellulosiruptor saccharolyticus has proven itself to be an excellent candidate for biological hydrogen (H2) production, but still it has major drawbacks like sensitivity to high osmotic pressure and low volumetric H2 productivity, which should be considered before it can be used industrially. A whole genome re-annotation work has been carried out as an attempt to update the incomplete genome information that causes gap in the knowledge especially in the area of metabolic engineering, to improve the H2 producing capabilities of C. saccharolyticus. Whole genome re-annotation was performed through manual means for 2,682 Coding Sequences (CDSs). Bioinformatics tools based on sequence similarity, motif search, phylogenetic analysis and fold recognition were employed for re-annotation. Our methodology could successfully add functions for 409 hypothetical proteins (HPs), 46 proteins previously annotated as putative and assigned more accurate functions for the known protein sequences. Homology based gene annotation has been used as a standard method for assigning function to novel proteins, but over the past few years many non-homology based methods such as genomic context approaches for protein function prediction have been developed. Using non-homology based functional prediction methods, we were able to assign cellular processes or physical complexes for 249 hypothetical sequences. Our re-annotation pipeline highlights the addition of 231 new CDSs generated from MicroScope Platform, to the original genome with functional prediction for 49 of them. The re-annotation of HPs and new CDSs is stored in the relational database that is available on the MicroScope web-based platform. In parallel, a comparative genome analyses were performed among the members of genus Caldicellulosiruptor to understand the function and evolutionary processes. Further, with results from integrated re-annotation studies (homology and genomic context approach), we strongly suggest that Csac_0437 and Csac_0424 encode for glycoside hydrolases (GH) and are proposed to be involved in the decomposition of recalcitrant plant polysaccharides. Similarly, HPs: Csac_0732, Csac_1862, Csac_1294 and Csac_0668 are suggested to play a significant role in biohydrogen production. Function prediction of these HPs by using our integrated approach will considerably enhance the interpretation of large-scale experiments targeting this industrially important organism.

  11. EGASP: the human ENCODE Genome Annotation Assessment Project

    PubMed Central

    Guigó, Roderic; Flicek, Paul; Abril, Josep F; Reymond, Alexandre; Lagarde, Julien; Denoeud, France; Antonarakis, Stylianos; Ashburner, Michael; Bajic, Vladimir B; Birney, Ewan; Castelo, Robert; Eyras, Eduardo; Ucla, Catherine; Gingeras, Thomas R; Harrow, Jennifer; Hubbard, Tim; Lewis, Suzanna E; Reese, Martin G

    2006-01-01

    Background We present the results of EGASP, a community experiment to assess the state-of-the-art in genome annotation within the ENCODE regions, which span 1% of the human genome sequence. The experiment had two major goals: the assessment of the accuracy of computational methods to predict protein coding genes; and the overall assessment of the completeness of the current human genome annotations as represented in the ENCODE regions. For the computational prediction assessment, eighteen groups contributed gene predictions. We evaluated these submissions against each other based on a 'reference set' of annotations generated as part of the GENCODE project. These annotations were not available to the prediction groups prior to the submission deadline, so that their predictions were blind and an external advisory committee could perform a fair assessment. Results The best methods had at least one gene transcript correctly predicted for close to 70% of the annotated genes. Nevertheless, the multiple transcript accuracy, taking into account alternative splicing, reached only approximately 40% to 50% accuracy. At the coding nucleotide level, the best programs reached an accuracy of 90% in both sensitivity and specificity. Programs relying on mRNA and protein sequences were the most accurate in reproducing the manually curated annotations. Experimental validation shows that only a very small percentage (3.2%) of the selected 221 computationally predicted exons outside of the existing annotation could be verified. Conclusion This is the first such experiment in human DNA, and we have followed the standards established in a similar experiment, GASP1, in Drosophila melanogaster. We believe the results presented here contribute to the value of ongoing large-scale annotation projects and should guide further experimental methods when being scaled up to the entire human genome sequence. PMID:16925836

  12. Using GO-WAR for mining cross-ontology weighted association rules.

    PubMed

    Agapito, Giuseppe; Cannataro, Mario; Guzzi, Pietro Hiram; Milano, Marianna

    2015-07-01

    The Gene Ontology (GO) is a structured repository of concepts (GO terms) that are associated to one or more gene products. The process of association is referred to as annotation. The relevance and the specificity of both GO terms and annotations are evaluated by a measure defined as information content (IC). The analysis of annotated data is thus an important challenge for bioinformatics. There exist different approaches of analysis. From those, the use of association rules (AR) may provide useful knowledge, and it has been used in some applications, e.g. improving the quality of annotations. Nevertheless classical association rules algorithms do not take into account the source of annotation nor the importance yielding to the generation of candidate rules with low IC. This paper presents GO-WAR (Gene Ontology-based Weighted Association Rules) a methodology for extracting weighted association rules. GO-WAR can extract association rules with a high level of IC without loss of support and confidence from a dataset of annotated data. A case study on using of GO-WAR on publicly available GO annotation datasets is used to demonstrate that our method outperforms current state of the art approaches. Copyright © 2015 Elsevier Ireland Ltd. All rights reserved.

  13. MEETING: Chlamydomonas Annotation Jamboree - October 2003

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Grossman, Arthur R

    2007-04-13

    Shotgun sequencing of the nuclear genome of Chlamydomonas reinhardtii (Chlamydomonas throughout) was performed at an approximate 10X coverage by JGI. Roughly half of the genome is now contained on 26 scaffolds, all of which are at least 1.6 Mb, and the coverage of the genome is ~95%. There are now over 200,000 cDNA sequence reads that we have generated as part of the Chlamydomonas genome project (Grossman, 2003; Shrager et al., 2003; Grossman et al. 2007; Merchant et al., 2007); other sequences have also been generated by the Kasuza sequence group (Asamizu et al., 1999; Asamizu et al., 2000) ormore » individual laboratories that have focused on specific genes. Shrager et al. (2003) placed the reads into distinct contigs (an assemblage of reads with overlapping nucleotide sequences), and contigs that group together as part of the same genes have been designated ACEs (assembly of contigs generated from EST information). All of the reads have also been mapped to the Chlamydomonas nuclear genome and the cDNAs and their corresponding genomic sequences have been reassembled, and the resulting assemblage is called an ACEG (an Assembly of contiguous EST sequences supported by genomic sequence) (Jain et al., 2007). Most of the unique genes or ACEGs are also represented by gene models that have been generated by the Joint Genome Institute (JGI, Walnut Creek, CA). These gene models have been placed onto the DNA scaffolds and are presented as a track on the Chlamydomonas genome browser associated with the genome portal (http://genome.jgi-psf.org/Chlre3/Chlre3.home.html). Ultimately, the meeting grant awarded by DOE has helped enormously in the development of an annotation pipeline (a set of guidelines used in the annotation of genes) and resulted in high quality annotation of over 4,000 genes; the annotators were from both Europe and the USA. Some of the people who led the annotation initiative were Arthur Grossman, Olivier Vallon, and Sabeeha Merchant (with many individual annotators from Europe and the USA). Olivier Vallon has been most active in continued input of annotation information.« less

  14. Gene annotation from scientific literature using mappings between keyword systems.

    PubMed

    Pérez, Antonio J; Perez-Iratxeta, Carolina; Bork, Peer; Thode, Guillermo; Andrade, Miguel A

    2004-09-01

    The description of genes in databases by keywords helps the non-specialist to quickly grasp the properties of a gene and increases the efficiency of computational tools that are applied to gene data (e.g. searching a gene database for sequences related to a particular biological process). However, the association of keywords to genes or protein sequences is a difficult process that ultimately implies examination of the literature related to a gene. To support this task, we present a procedure to derive keywords from the set of scientific abstracts related to a gene. Our system is based on the automated extraction of mappings between related terms from different databases using a model of fuzzy associations that can be applied with all generality to any pair of linked databases. We tested the system by annotating genes of the SWISS-PROT database with keywords derived from the abstracts linked to their entries (stored in the MEDLINE database of scientific references). The performance of the annotation procedure was much better for SWISS-PROT keywords (recall of 47%, precision of 68%) than for Gene Ontology terms (recall of 8%, precision of 67%). The algorithm can be publicly accessed and used for the annotation of sequences through a web server at http://www.bork.embl.de/kat

  15. Genic insights from integrated human proteomics in GeneCards.

    PubMed

    Fishilevich, Simon; Zimmerman, Shahar; Kohn, Asher; Iny Stein, Tsippi; Olender, Tsviya; Kolker, Eugene; Safran, Marilyn; Lancet, Doron

    2016-01-01

    GeneCards is a one-stop shop for searchable human gene annotations (http://www.genecards.org/). Data are automatically mined from ∼120 sources and presented in an integrated web card for every human gene. We report the application of recent advances in proteomics to enhance gene annotation and classification in GeneCards. First, we constructed the Human Integrated Protein Expression Database (HIPED), a unified database of protein abundance in human tissues, based on the publically available mass spectrometry (MS)-based proteomics sources ProteomicsDB, Multi-Omics Profiling Expression Database, Protein Abundance Across Organisms and The MaxQuant DataBase. The integrated database, residing within GeneCards, compares favourably with its individual sources, covering nearly 90% of human protein-coding genes. For gene annotation and comparisons, we first defined a protein expression vector for each gene, based on normalized abundances in 69 normal human tissues. This vector is portrayed in the GeneCards expression section as a bar graph, allowing visual inspection and comparison. These data are juxtaposed with transcriptome bar graphs. Using the protein expression vectors, we further defined a pairwise metric that helps assess expression-based pairwise proximity. This new metric for finding functional partners complements eight others, including sharing of pathways, gene ontology (GO) terms and domains, implemented in the GeneCards Suite. In parallel, we calculated proteome-based differential expression, highlighting a subset of tissues that overexpress a gene and subserving gene classification. This textual annotation allows users of VarElect, the suite's next-generation phenotyper, to more effectively discover causative disease variants. Finally, we define the protein-RNA expression ratio and correlation as yet another attribute of every gene in each tissue, adding further annotative information. The results constitute a significant enhancement of several GeneCards sections and help promote and organize the genome-wide structural and functional knowledge of the human proteome. Database URL:http://www.genecards.org/. © The Author(s) 2016. Published by Oxford University Press.

  16. Towards an informative mutant phenotype for every bacterial gene

    DOE PAGES

    Deutschbauer, Adam; Price, Morgan N.; Wetmore, Kelly M.; ...

    2014-08-11

    Mutant phenotypes provide strong clues to the functions of the underlying genes and could allow annotation of the millions of sequenced yet uncharacterized bacterial genes. However, it is not known how many genes have a phenotype under laboratory conditions, how many phenotypes are biologically interpretable for predicting gene function, and what experimental conditions are optimal to maximize the number of genes with a phenotype. To address these issues, we measured the mutant fitness of 1,586 genes of the ethanol-producing bacterium Zymomonas mobilis ZM4 across 492 diverse experiments and found statistically significant phenotypes for 89% of all assayed genes. Thus, inmore » Z. mobilis, most genes have a functional consequence under laboratory conditions. We demonstrate that 41% of Z. mobilis genes have both a strong phenotype and a similar fitness pattern (cofitness) to another gene, and are therefore good candidates for functional annotation using mutant fitness. Among 502 poorly characterized Z. mobilis genes, we identified a significant cofitness relationship for 174. For 57 of these genes without a specific functional annotation, we found additional evidence to support the biological significance of these gene-gene associations, and in 33 instances, we were able to predict specific physiological or biochemical roles for the poorly characterized genes. Last, we identified a set of 79 diverse mutant fitness experiments in Z. mobilis that are nearly as biologically informative as the entire set of 492 experiments. Therefore, our work provides a blueprint for the functional annotation of diverse bacteria using mutant fitness.« less

  17. APAtrap: identification and quantification of alternative polyadenylation sites from RNA-seq data.

    PubMed

    Ye, Congting; Long, Yuqi; Ji, Guoli; Li, Qingshun Quinn; Wu, Xiaohui

    2018-06-01

    Alternative polyadenylation (APA) has been increasingly recognized as a crucial mechanism that contributes to transcriptome diversity and gene expression regulation. As RNA-seq has become a routine protocol for transcriptome analysis, it is of great interest to leverage such unprecedented collection of RNA-seq data by new computational methods to extract and quantify APA dynamics in these transcriptomes. However, research progress in this area has been relatively limited. Conventional methods rely on either transcript assembly to determine transcript 3' ends or annotated poly(A) sites. Moreover, they can neither identify more than two poly(A) sites in a gene nor detect dynamic APA site usage considering more than two poly(A) sites. We developed an approach called APAtrap based on the mean squared error model to identify and quantify APA sites from RNA-seq data. APAtrap is capable of identifying novel 3' UTRs and 3' UTR extensions, which contributes to locating potential poly(A) sites in previously overlooked regions and improving genome annotations. APAtrap also aims to tally all potential poly(A) sites and detect genes with differential APA site usages between conditions. Extensive comparisons of APAtrap with two other latest methods, ChangePoint and DaPars, using various RNA-seq datasets from simulation studies, human and Arabidopsis demonstrate the efficacy and flexibility of APAtrap for any organisms with an annotated genome. Freely available for download at https://apatrap.sourceforge.io. liqq@xmu.edu.cn or xhuister@xmu.edu.cn. Supplementary data are available at Bioinformatics online.

  18. Xander: employing a novel method for efficient gene-targeted metagenomic assembly

    DOE PAGES

    Wang, Qiong; Fish, Jordan A.; Gilman, Mariah; ...

    2015-08-05

    Here, metagenomics can provide important insight into microbial communities. However, assembling metagenomic datasets has proven to be computationally challenging. Current methods often assemble only fragmented partial genes. We present a novel method for targeting assembly of specific protein-coding genes. This method combines a de Bruijn graph, as used in standard assembly approaches, and a protein profile hidden Markov model (HMM) for the gene of interest, as used in standard annotation approaches. These are used to create a novel combined weighted assembly graph. Xander performs both assembly and annotation concomitantly using information incorporated in this graph. We demonstrate the utility ofmore » this approach by assembling contigs for one phylogenetic marker gene and for two functional marker genes, first on Human Microbiome Project (HMP)-defined community Illumina data and then on 21 rhizosphere soil metagenomic datasets from three different crops totaling over 800 Gbp of unassembled data. We compared our method to a recently published bulk metagenome assembly method and a recently published gene-targeted assembler and found our method produced more, longer, and higher quality gene sequences. In conclusion, xander combines gene assignment with the rapid assembly of full-length or near full-length functional genes from metagenomic data without requiring bulk assembly or post-processing to find genes of interest. HMMs used for assembly can be tailored to the targeted genes, allowing flexibility to improve annotation over generic annotation pipelines.« less

  19. Fast gene ontology based clustering for microarray experiments.

    PubMed

    Ovaska, Kristian; Laakso, Marko; Hautaniemi, Sampsa

    2008-11-21

    Analysis of a microarray experiment often results in a list of hundreds of disease-associated genes. In order to suggest common biological processes and functions for these genes, Gene Ontology annotations with statistical testing are widely used. However, these analyses can produce a very large number of significantly altered biological processes. Thus, it is often challenging to interpret GO results and identify novel testable biological hypotheses. We present fast software for advanced gene annotation using semantic similarity for Gene Ontology terms combined with clustering and heat map visualisation. The methodology allows rapid identification of genes sharing the same Gene Ontology cluster. Our R based semantic similarity open-source package has a speed advantage of over 2000-fold compared to existing implementations. From the resulting hierarchical clustering dendrogram genes sharing a GO term can be identified, and their differences in the gene expression patterns can be seen from the heat map. These methods facilitate advanced annotation of genes resulting from data analysis.

  20. Functional sequencing read annotation for high precision microbiome analysis

    PubMed Central

    Zhu, Chengsheng; Miller, Maximilian; Marpaka, Srinayani; Vaysberg, Pavel; Rühlemann, Malte C; Wu, Guojun; Heinsen, Femke-Anouska; Tempel, Marie; Zhao, Liping; Lieb, Wolfgang; Franke, Andre; Bromberg, Yana

    2018-01-01

    Abstract The vast majority of microorganisms on Earth reside in often-inseparable environment-specific communities—microbiomes. Meta-genomic/-transcriptomic sequencing could reveal the otherwise inaccessible functionality of microbiomes. However, existing analytical approaches focus on attributing sequencing reads to known genes/genomes, often failing to make maximal use of available data. We created faser (functional annotation of sequencing reads), an algorithm that is optimized to map reads to molecular functions encoded by the read-correspondent genes. The mi-faser microbiome analysis pipeline, combining faser with our manually curated reference database of protein functions, accurately annotates microbiome molecular functionality. mi-faser’s minutes-per-microbiome processing speed is significantly faster than that of other methods, allowing for large scale comparisons. Microbiome function vectors can be compared between different conditions to highlight environment-specific and/or time-dependent changes in functionality. Here, we identified previously unseen oil degradation-specific functions in BP oil-spill data, as well as functional signatures of individual-specific gut microbiome responses to a dietary intervention in children with Prader–Willi syndrome. Our method also revealed variability in Crohn's Disease patient microbiomes and clearly distinguished them from those of related healthy individuals. Our analysis highlighted the microbiome role in CD pathogenicity, demonstrating enrichment of patient microbiomes in functions that promote inflammation and that help bacteria survive it. PMID:29194524

  1. The identification of complete domains within protein sequences using accurate E-values for semi-global alignment

    PubMed Central

    Kann, Maricel G.; Sheetlin, Sergey L.; Park, Yonil; Bryant, Stephen H.; Spouge, John L.

    2007-01-01

    The sequencing of complete genomes has created a pressing need for automated annotation of gene function. Because domains are the basic units of protein function and evolution, a gene can be annotated from a domain database by aligning domains to the corresponding protein sequence. Ideally, complete domains are aligned to protein subsequences, in a ‘semi-global alignment’. Local alignment, which aligns pieces of domains to subsequences, is common in high-throughput annotation applications, however. It is a mature technique, with the heuristics and accurate E-values required for screening large databases and evaluating the screening results. Hidden Markov models (HMMs) provide an alternative theoretical framework for semi-global alignment, but their use is limited because they lack heuristic acceleration and accurate E-values. Our new tool, GLOBAL, overcomes some limitations of previous semi-global HMMs: it has accurate E-values and the possibility of the heuristic acceleration required for high-throughput applications. Moreover, according to a standard of truth based on protein structure, two semi-global HMM alignment tools (GLOBAL and HMMer) had comparable performance in identifying complete domains, but distinctly outperformed two tools based on local alignment. When searching for complete protein domains, therefore, GLOBAL avoids disadvantages commonly associated with HMMs, yet maintains their superior retrieval performance. PMID:17596268

  2. nGASP - the nematode genome annotation assessment project

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Coghlan, A; Fiedler, T J; McKay, S J

    2008-12-19

    While the C. elegans genome is extensively annotated, relatively little information is available for other Caenorhabditis species. The nematode genome annotation assessment project (nGASP) was launched to objectively assess the accuracy of protein-coding gene prediction software in C. elegans, and to apply this knowledge to the annotation of the genomes of four additional Caenorhabditis species and other nematodes. Seventeen groups worldwide participated in nGASP, and submitted 47 prediction sets for 10 Mb of the C. elegans genome. Predictions were compared to reference gene sets consisting of confirmed or manually curated gene models from WormBase. The most accurate gene-finders were 'combiner'more » algorithms, which made use of transcript- and protein-alignments and multi-genome alignments, as well as gene predictions from other gene-finders. Gene-finders that used alignments of ESTs, mRNAs and proteins came in second place. There was a tie for third place between gene-finders that used multi-genome alignments and ab initio gene-finders. The median gene level sensitivity of combiners was 78% and their specificity was 42%, which is nearly the same accuracy as reported for combiners in the human genome. C. elegans genes with exons of unusual hexamer content, as well as those with many exons, short exons, long introns, a weak translation start signal, weak splice sites, or poorly conserved orthologs were the most challenging for gene-finders. While the C. elegans genome is extensively annotated, relatively little information is available for other Caenorhabditis species. The nematode genome annotation assessment project (nGASP) was launched to objectively assess the accuracy of protein-coding gene prediction software in C. elegans, and to apply this knowledge to the annotation of the genomes of four additional Caenorhabditis species and other nematodes. Seventeen groups worldwide participated in nGASP, and submitted 47 prediction sets for 10 Mb of the C. elegans genome. Predictions were compared to reference gene sets consisting of confirmed or manually curated gene models from WormBase. The most accurate gene-finders were 'combiner' algorithms, which made use of transcript- and protein-alignments and multi-genome alignments, as well as gene predictions from other gene-finders. Gene-finders that used alignments of ESTs, mRNAs and proteins came in second place. There was a tie for third place between gene-finders that used multi-genome alignments and ab initio gene-finders. The median gene level sensitivity of combiners was 78% and their specificity was 42%, which is nearly the same accuracy as reported for combiners in the human genome. C. elegans genes with exons of unusual hexamer content, as well as those with many exons, short exons, long introns, a weak translation start signal, weak splice sites, or poorly conserved orthologs were the most challenging for gene-finders.« less

  3. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Deutschbauer, Adam; Price, Morgan N.; Wetmore, Kelly M.

    Mutant phenotypes provide strong clues to the functions of the underlying genes and could allow annotation of the millions of sequenced yet uncharacterized bacterial genes. However, it is not known how many genes have a phenotype under laboratory conditions, how many phenotypes are biologically interpretable for predicting gene function, and what experimental conditions are optimal to maximize the number of genes with a phenotype. To address these issues, we measured the mutant fitness of 1,586 genes of the ethanol-producing bacterium Zymomonas mobilis ZM4 across 492 diverse experiments and found statistically significant phenotypes for 89% of all assayed genes. Thus, inmore » Z. mobilis, most genes have a functional consequence under laboratory conditions. We demonstrate that 41% of Z. mobilis genes have both a strong phenotype and a similar fitness pattern (cofitness) to another gene, and are therefore good candidates for functional annotation using mutant fitness. Among 502 poorly characterized Z. mobilis genes, we identified a significant cofitness relationship for 174. For 57 of these genes without a specific functional annotation, we found additional evidence to support the biological significance of these gene-gene associations, and in 33 instances, we were able to predict specific physiological or biochemical roles for the poorly characterized genes. Last, we identified a set of 79 diverse mutant fitness experiments in Z. mobilis that are nearly as biologically informative as the entire set of 492 experiments. Therefore, our work provides a blueprint for the functional annotation of diverse bacteria using mutant fitness.« less

  4. The annotation of repetitive elements in the genome of channel catfish (Ictalurus punctatus).

    PubMed

    Yuan, Zihao; Zhou, Tao; Bao, Lisui; Liu, Shikai; Shi, Huitong; Yang, Yujia; Gao, Dongya; Dunham, Rex; Waldbieser, Geoff; Liu, Zhanjiang

    2018-01-01

    Channel catfish (Ictalurus punctatus) is a highly adaptive species and has been used as a research model for comparative immunology, physiology, and toxicology among ectothermic vertebrates. It is also economically important for aquaculture. As such, its reference genome was generated and annotated with protein coding genes. However, the repetitive elements in the catfish genome are less well understood. In this study, over 417.8 Megabase (MB) of repetitive elements were identified and characterized in the channel catfish genome. Among them, the DNA/TcMar-Tc1 transposons are the most abundant type, making up ~20% of the total repetitive elements, followed by the microsatellites (14%). The prevalence of repetitive elements, especially the mobile elements, may have provided a driving force for the evolution of the catfish genome. A number of catfish-specific repetitive elements were identified including the previously reported Xba elements whose divergence rate was relatively low, slower than that in untranslated regions of genes but faster than the protein coding sequences, suggesting its evolutionary restrictions.

  5. The annotation of repetitive elements in the genome of channel catfish (Ictalurus punctatus)

    PubMed Central

    Yuan, Zihao; Zhou, Tao; Bao, Lisui; Liu, Shikai; Shi, Huitong; Yang, Yujia; Gao, Dongya; Dunham, Rex; Waldbieser, Geoff

    2018-01-01

    Channel catfish (Ictalurus punctatus) is a highly adaptive species and has been used as a research model for comparative immunology, physiology, and toxicology among ectothermic vertebrates. It is also economically important for aquaculture. As such, its reference genome was generated and annotated with protein coding genes. However, the repetitive elements in the catfish genome are less well understood. In this study, over 417.8 Megabase (MB) of repetitive elements were identified and characterized in the channel catfish genome. Among them, the DNA/TcMar-Tc1 transposons are the most abundant type, making up ~20% of the total repetitive elements, followed by the microsatellites (14%). The prevalence of repetitive elements, especially the mobile elements, may have provided a driving force for the evolution of the catfish genome. A number of catfish-specific repetitive elements were identified including the previously reported Xba elements whose divergence rate was relatively low, slower than that in untranslated regions of genes but faster than the protein coding sequences, suggesting its evolutionary restrictions. PMID:29763462

  6. An analytical framework for whole-genome sequence association studies and its implications for autism spectrum disorder.

    PubMed

    Werling, Donna M; Brand, Harrison; An, Joon-Yong; Stone, Matthew R; Zhu, Lingxue; Glessner, Joseph T; Collins, Ryan L; Dong, Shan; Layer, Ryan M; Markenscoff-Papadimitriou, Eirene; Farrell, Andrew; Schwartz, Grace B; Wang, Harold Z; Currall, Benjamin B; Zhao, Xuefang; Dea, Jeanselle; Duhn, Clif; Erdman, Carolyn A; Gilson, Michael C; Yadav, Rachita; Handsaker, Robert E; Kashin, Seva; Klei, Lambertus; Mandell, Jeffrey D; Nowakowski, Tomasz J; Liu, Yuwen; Pochareddy, Sirisha; Smith, Louw; Walker, Michael F; Waterman, Matthew J; He, Xin; Kriegstein, Arnold R; Rubenstein, John L; Sestan, Nenad; McCarroll, Steven A; Neale, Benjamin M; Coon, Hilary; Willsey, A Jeremy; Buxbaum, Joseph D; Daly, Mark J; State, Matthew W; Quinlan, Aaron R; Marth, Gabor T; Roeder, Kathryn; Devlin, Bernie; Talkowski, Michael E; Sanders, Stephan J

    2018-05-01

    Genomic association studies of common or rare protein-coding variation have established robust statistical approaches to account for multiple testing. Here we present a comparable framework to evaluate rare and de novo noncoding single-nucleotide variants, insertion/deletions, and all classes of structural variation from whole-genome sequencing (WGS). Integrating genomic annotations at the level of nucleotides, genes, and regulatory regions, we define 51,801 annotation categories. Analyses of 519 autism spectrum disorder families did not identify association with any categories after correction for 4,123 effective tests. Without appropriate correction, biologically plausible associations are observed in both cases and controls. Despite excluding previously identified gene-disrupting mutations, coding regions still exhibited the strongest associations. Thus, in autism, the contribution of de novo noncoding variation is probably modest in comparison to that of de novo coding variants. Robust results from future WGS studies will require large cohorts and comprehensive analytical strategies that consider the substantial multiple-testing burden.

  7. Benchmarking database performance for genomic data.

    PubMed

    Khushi, Matloob

    2015-06-01

    Genomic regions represent features such as gene annotations, transcription factor binding sites and epigenetic modifications. Performing various genomic operations such as identifying overlapping/non-overlapping regions or nearest gene annotations are common research needs. The data can be saved in a database system for easy management, however, there is no comprehensive database built-in algorithm at present to identify overlapping regions. Therefore I have developed a novel region-mapping (RegMap) SQL-based algorithm to perform genomic operations and have benchmarked the performance of different databases. Benchmarking identified that PostgreSQL extracts overlapping regions much faster than MySQL. Insertion and data uploads in PostgreSQL were also better, although general searching capability of both databases was almost equivalent. In addition, using the algorithm pair-wise, overlaps of >1000 datasets of transcription factor binding sites and histone marks, collected from previous publications, were reported and it was found that HNF4G significantly co-locates with cohesin subunit STAG1 (SA1).Inc. © 2015 Wiley Periodicals, Inc.

  8. Trans-ethnic Meta-analysis and Functional Annotation Illuminates the Genetic Architecture of Fasting Glucose and Insulin.

    PubMed

    Liu, Ching-Ti; Raghavan, Sridharan; Maruthur, Nisa; Kabagambe, Edmond Kato; Hong, Jaeyoung; Ng, Maggie C Y; Hivert, Marie-France; Lu, Yingchang; An, Ping; Bentley, Amy R; Drolet, Anne M; Gaulton, Kyle J; Guo, Xiuqing; Armstrong, Loren L; Irvin, Marguerite R; Li, Man; Lipovich, Leonard; Rybin, Denis V; Taylor, Kent D; Agyemang, Charles; Palmer, Nicholette D; Cade, Brian E; Chen, Wei-Min; Dauriz, Marco; Delaney, Joseph A C; Edwards, Todd L; Evans, Daniel S; Evans, Michele K; Lange, Leslie A; Leong, Aaron; Liu, Jingmin; Liu, Yongmei; Nayak, Uma; Patel, Sanjay R; Porneala, Bianca C; Rasmussen-Torvik, Laura J; Snijder, Marieke B; Stallings, Sarah C; Tanaka, Toshiko; Yanek, Lisa R; Zhao, Wei; Becker, Diane M; Bielak, Lawrence F; Biggs, Mary L; Bottinger, Erwin P; Bowden, Donald W; Chen, Guanjie; Correa, Adolfo; Couper, David J; Crawford, Dana C; Cushman, Mary; Eicher, John D; Fornage, Myriam; Franceschini, Nora; Fu, Yi-Ping; Goodarzi, Mark O; Gottesman, Omri; Hara, Kazuo; Harris, Tamara B; Jensen, Richard A; Johnson, Andrew D; Jhun, Min A; Karter, Andrew J; Keller, Margaux F; Kho, Abel N; Kizer, Jorge R; Krauss, Ronald M; Langefeld, Carl D; Li, Xiaohui; Liang, Jingling; Liu, Simin; Lowe, William L; Mosley, Thomas H; North, Kari E; Pacheco, Jennifer A; Peyser, Patricia A; Patrick, Alan L; Rice, Kenneth M; Selvin, Elizabeth; Sims, Mario; Smith, Jennifer A; Tajuddin, Salman M; Vaidya, Dhananjay; Wren, Mary P; Yao, Jie; Zhu, Xiaofeng; Ziegler, Julie T; Zmuda, Joseph M; Zonderman, Alan B; Zwinderman, Aeilko H; Adeyemo, Adebowale; Boerwinkle, Eric; Ferrucci, Luigi; Hayes, M Geoffrey; Kardia, Sharon L R; Miljkovic, Iva; Pankow, James S; Rotimi, Charles N; Sale, Michele M; Wagenknecht, Lynne E; Arnett, Donna K; Chen, Yii-Der Ida; Nalls, Michael A; Province, Michael A; Kao, W H Linda; Siscovick, David S; Psaty, Bruce M; Wilson, James G; Loos, Ruth J F; Dupuis, Josée; Rich, Stephen S; Florez, Jose C; Rotter, Jerome I; Morris, Andrew P; Meigs, James B

    2016-07-07

    Knowledge of the genetic basis of the type 2 diabetes (T2D)-related quantitative traits fasting glucose (FG) and insulin (FI) in African ancestry (AA) individuals has been limited. In non-diabetic subjects of AA (n = 20,209) and European ancestry (EA; n = 57,292), we performed trans-ethnic (AA+EA) fine-mapping of 54 established EA FG or FI loci with detailed functional annotation, assessed their relevance in AA individuals, and sought previously undescribed loci through trans-ethnic (AA+EA) meta-analysis. We narrowed credible sets of variants driving association signals for 22/54 EA-associated loci; 18/22 credible sets overlapped with active islet-specific enhancers or transcription factor (TF) binding sites, and 21/22 contained at least one TF motif. Of the 54 EA-associated loci, 23 were shared between EA and AA. Replication with an additional 10,096 AA individuals identified two previously undescribed FI loci, chrX FAM133A (rs213676) and chr5 PELO (rs6450057). Trans-ethnic analyses with regulatory annotation illuminate the genetic architecture of glycemic traits and suggest gene regulation as a target to advance precision medicine for T2D. Our approach to utilize state-of-the-art functional annotation and implement trans-ethnic association analysis for discovery and fine-mapping offers a framework for further follow-up and characterization of GWAS signals of complex trait loci. Copyright © 2016 American Society of Human Genetics. All rights reserved.

  9. NCBI prokaryotic genome annotation pipeline.

    PubMed

    Tatusova, Tatiana; DiCuccio, Michael; Badretdin, Azat; Chetvernin, Vyacheslav; Nawrocki, Eric P; Zaslavsky, Leonid; Lomsadze, Alexandre; Pruitt, Kim D; Borodovsky, Mark; Ostell, James

    2016-08-19

    Recent technological advances have opened unprecedented opportunities for large-scale sequencing and analysis of populations of pathogenic species in disease outbreaks, as well as for large-scale diversity studies aimed at expanding our knowledge across the whole domain of prokaryotes. To meet the challenge of timely interpretation of structure, function and meaning of this vast genetic information, a comprehensive approach to automatic genome annotation is critically needed. In collaboration with Georgia Tech, NCBI has developed a new approach to genome annotation that combines alignment based methods with methods of predicting protein-coding and RNA genes and other functional elements directly from sequence. A new gene finding tool, GeneMarkS+, uses the combined evidence of protein and RNA placement by homology as an initial map of annotation to generate and modify ab initio gene predictions across the whole genome. Thus, the new NCBI's Prokaryotic Genome Annotation Pipeline (PGAP) relies more on sequence similarity when confident comparative data are available, while it relies more on statistical predictions in the absence of external evidence. The pipeline provides a framework for generation and analysis of annotation on the full breadth of prokaryotic taxonomy. For additional information on PGAP see https://www.ncbi.nlm.nih.gov/genome/annotation_prok/ and the NCBI Handbook, https://www.ncbi.nlm.nih.gov/books/NBK174280/. Published by Oxford University Press on behalf of Nucleic Acids Research 2016. This work is written by (a) US Government employee(s) and is in the public domain in the US.

  10. A Weighted Multipath Measurement Based on Gene Ontology for Estimating Gene Products Similarity

    PubMed Central

    Liu, Lizhen; Dai, Xuemin; Song, Wei; Lu, Jingli

    2014-01-01

    Abstract Many different methods have been proposed for calculating the semantic similarity of term pairs based on gene ontology (GO). Most existing methods are based on information content (IC), and the methods based on IC are used more commonly than those based on the structure of GO. However, most IC-based methods not only fail to handle identical annotations but also show a strong bias toward well-annotated proteins. We propose a new method called weighted multipath measurement (WMM) for estimating the semantic similarity of gene products based on the structure of the GO. We not only considered the contribution of every path between two GO terms but also took the depth of the lowest common ancestors into account. We assigned different weights for different kinds of edges in GO graph. The similarity values calculated by WMM can be reused because they are only relative to the characteristics of GO terms. Experimental results showed that the similarity values obtained by WMM have a higher accuracy. We compared the performance of WMM with that of other methods using GO data and gene annotation datasets for yeast and humans downloaded from the GO database. We found that WMM is more suited for prediction of gene function than most existing IC-based methods and that it can distinguish proteins with identical annotations (two proteins are annotated with the same terms) from each other. PMID:25229994

  11. Multi-label literature classification based on the Gene Ontology graph.

    PubMed

    Jin, Bo; Muller, Brian; Zhai, Chengxiang; Lu, Xinghua

    2008-12-08

    The Gene Ontology is a controlled vocabulary for representing knowledge related to genes and proteins in a computable form. The current effort of manually annotating proteins with the Gene Ontology is outpaced by the rate of accumulation of biomedical knowledge in literature, which urges the development of text mining approaches to facilitate the process by automatically extracting the Gene Ontology annotation from literature. The task is usually cast as a text classification problem, and contemporary methods are confronted with unbalanced training data and the difficulties associated with multi-label classification. In this research, we investigated the methods of enhancing automatic multi-label classification of biomedical literature by utilizing the structure of the Gene Ontology graph. We have studied three graph-based multi-label classification algorithms, including a novel stochastic algorithm and two top-down hierarchical classification methods for multi-label literature classification. We systematically evaluated and compared these graph-based classification algorithms to a conventional flat multi-label algorithm. The results indicate that, through utilizing the information from the structure of the Gene Ontology graph, the graph-based multi-label classification methods can significantly improve predictions of the Gene Ontology terms implied by the analyzed text. Furthermore, the graph-based multi-label classifiers are capable of suggesting Gene Ontology annotations (to curators) that are closely related to the true annotations even if they fail to predict the true ones directly. A software package implementing the studied algorithms is available for the research community. Through utilizing the information from the structure of the Gene Ontology graph, the graph-based multi-label classification methods have better potential than the conventional flat multi-label classification approach to facilitate protein annotation based on the literature.

  12. Metabolic Pathway Assignment of Plant Genes based on Phylogenetic Profiling–A Feasibility Study

    PubMed Central

    Weißenborn, Sandra; Walther, Dirk

    2017-01-01

    Despite many developed experimental and computational approaches, functional gene annotation remains challenging. With the rapidly growing number of sequenced genomes, the concept of phylogenetic profiling, which predicts functional links between genes that share a common co-occurrence pattern across different genomes, has gained renewed attention as it promises to annotate gene functions based on presence/absence calls alone. We applied phylogenetic profiling to the problem of metabolic pathway assignments of plant genes with a particular focus on secondary metabolism pathways. We determined phylogenetic profiles for 40,960 metabolic pathway enzyme genes with assigned EC numbers from 24 plant species based on sequence and pathway annotation data from KEGG and Ensembl Plants. For gene sequence family assignments, needed to determine the presence or absence of particular gene functions in the given plant species, we included data of all 39 species available at the Ensembl Plants database and established gene families based on pairwise sequence identities and annotation information. Aside from performing profiling comparisons, we used machine learning approaches to predict pathway associations from phylogenetic profiles alone. Selected metabolic pathways were indeed found to be composed of gene families of greater than expected phylogenetic profile similarity. This was particularly evident for primary metabolism pathways, whereas for secondary pathways, both the available annotation in different species as well as the abstraction of functional association via distinct pathways proved limiting. While phylogenetic profile similarity was generally not found to correlate with gene co-expression, direct physical interactions of proteins were reflected by a significantly increased profile similarity suggesting an application of phylogenetic profiling methods as a filtering step in the identification of protein-protein interactions. This feasibility study highlights the potential and challenges associated with phylogenetic profiling methods for the detection of functional relationships between genes as well as the need to enlarge the set of plant genes with proven secondary metabolism involvement as well as the limitations of distinct pathways as abstractions of relationships between genes. PMID:29163570

  13. The language of gene ontology: a Zipf's law analysis.

    PubMed

    Kalankesh, Leila Ranandeh; Stevens, Robert; Brass, Andy

    2012-06-07

    Most major genome projects and sequence databases provide a GO annotation of their data, either automatically or through human annotators, creating a large corpus of data written in the language of GO. Texts written in natural language show a statistical power law behaviour, Zipf's law, the exponent of which can provide useful information on the nature of the language being used. We have therefore explored the hypothesis that collections of GO annotations will show similar statistical behaviours to natural language. Annotations from the Gene Ontology Annotation project were found to follow Zipf's law. Surprisingly, the measured power law exponents were consistently different between annotation captured using the three GO sub-ontologies in the corpora (function, process and component). On filtering the corpora using GO evidence codes we found that the value of the measured power law exponent responded in a predictable way as a function of the evidence codes used to support the annotation. Techniques from computational linguistics can provide new insights into the annotation process. GO annotations show similar statistical behaviours to those seen in natural language with measured exponents that provide a signal which correlates with the nature of the evidence codes used to support the annotations, suggesting that the measured exponent might provide a signal regarding the information content of the annotation.

  14. Metabolic network prediction through pairwise rational kernels.

    PubMed

    Roche-Lima, Abiel; Domaratzki, Michael; Fristensky, Brian

    2014-09-26

    Metabolic networks are represented by the set of metabolic pathways. Metabolic pathways are a series of biochemical reactions, in which the product (output) from one reaction serves as the substrate (input) to another reaction. Many pathways remain incompletely characterized. One of the major challenges of computational biology is to obtain better models of metabolic pathways. Existing models are dependent on the annotation of the genes. This propagates error accumulation when the pathways are predicted by incorrectly annotated genes. Pairwise classification methods are supervised learning methods used to classify new pair of entities. Some of these classification methods, e.g., Pairwise Support Vector Machines (SVMs), use pairwise kernels. Pairwise kernels describe similarity measures between two pairs of entities. Using pairwise kernels to handle sequence data requires long processing times and large storage. Rational kernels are kernels based on weighted finite-state transducers that represent similarity measures between sequences or automata. They have been effectively used in problems that handle large amount of sequence information such as protein essentiality, natural language processing and machine translations. We create a new family of pairwise kernels using weighted finite-state transducers (called Pairwise Rational Kernel (PRK)) to predict metabolic pathways from a variety of biological data. PRKs take advantage of the simpler representations and faster algorithms of transducers. Because raw sequence data can be used, the predictor model avoids the errors introduced by incorrect gene annotations. We then developed several experiments with PRKs and Pairwise SVM to validate our methods using the metabolic network of Saccharomyces cerevisiae. As a result, when PRKs are used, our method executes faster in comparison with other pairwise kernels. Also, when we use PRKs combined with other simple kernels that include evolutionary information, the accuracy values have been improved, while maintaining lower construction and execution times. The power of using kernels is that almost any sort of data can be represented using kernels. Therefore, completely disparate types of data can be combined to add power to kernel-based machine learning methods. When we compared our proposal using PRKs with other similar kernel, the execution times were decreased, with no compromise of accuracy. We also proved that by combining PRKs with other kernels that include evolutionary information, the accuracy can also also be improved. As our proposal can use any type of sequence data, genes do not need to be properly annotated, avoiding accumulation errors because of incorrect previous annotations.

  15. Comparative Omics-Driven Genome Annotation Refinement: Application across Yersiniae

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Rutledge, Alexandra C.; Jones, Marcus B.; Chauhan, Sadhana

    2012-03-27

    Genome sequencing continues to be a rapidly evolving technology, yet most downstream aspects of genome annotation pipelines remain relatively stable or are even being abandoned. To date, the perceived value of manual curation for genome annotations is not offset by the real cost and time associated with the process. In order to balance the large number of sequences generated, the annotation process is now performed almost exclusively in an automated fashion for most genome sequencing projects. One possible way to reduce errors inherent to automated computational annotations is to apply data from 'omics' measurements (i.e. transcriptional and proteomic) to themore » un-annotated genome with a proteogenomic-based approach. This approach does require additional experimental and bioinformatics methods to include omics technologies; however, the approach is readily automatable and can benefit from rapid developments occurring in those research domains as well. The annotation process can be improved by experimental validation of transcription and translation and aid in the discovery of annotation errors. Here the concept of annotation refinement has been extended to include a comparative assessment of genomes across closely related species, as is becoming common in sequencing efforts. Transcriptomic and proteomic data derived from three highly similar pathogenic Yersiniae (Y. pestis CO92, Y. pestis pestoides F, and Y. pseudotuberculosis PB1/+) was used to demonstrate a comprehensive comparative omic-based annotation methodology. Peptide and oligo measurements experimentally validated the expression of nearly 40% of each strain's predicted proteome and revealed the identification of 28 novel and 68 previously incorrect protein-coding sequences (e.g., observed frameshifts, extended start sites, and translated pseudogenes) within the three current Yersinia genome annotations. Gene loss is presumed to play a major role in Y. pestis acquiring its niche as a virulent pathogen, thus the discovery of many translated pseudogenes underscores a need for functional analyses to investigate hypotheses related to divergence. Refinements included the discovery of a seemingly essential ribosomal protein, several virulence-associated factors, and a transcriptional regulator, among other proteins, most of which are annotated as hypothetical, that were missed during annotation.« less

  16. Evaluating Hierarchical Structure in Music Annotations

    PubMed Central

    McFee, Brian; Nieto, Oriol; Farbood, Morwaread M.; Bello, Juan Pablo

    2017-01-01

    Music exhibits structure at multiple scales, ranging from motifs to large-scale functional components. When inferring the structure of a piece, different listeners may attend to different temporal scales, which can result in disagreements when they describe the same piece. In the field of music informatics research (MIR), it is common to use corpora annotated with structural boundaries at different levels. By quantifying disagreements between multiple annotators, previous research has yielded several insights relevant to the study of music cognition. First, annotators tend to agree when structural boundaries are ambiguous. Second, this ambiguity seems to depend on musical features, time scale, and genre. Furthermore, it is possible to tune current annotation evaluation metrics to better align with these perceptual differences. However, previous work has not directly analyzed the effects of hierarchical structure because the existing methods for comparing structural annotations are designed for “flat” descriptions, and do not readily generalize to hierarchical annotations. In this paper, we extend and generalize previous work on the evaluation of hierarchical descriptions of musical structure. We derive an evaluation metric which can compare hierarchical annotations holistically across multiple levels. sing this metric, we investigate inter-annotator agreement on the multilevel annotations of two different music corpora, investigate the influence of acoustic properties on hierarchical annotations, and evaluate existing hierarchical segmentation algorithms against the distribution of inter-annotator agreement. PMID:28824514

  17. PathFinder: reconstruction and dynamic visualization of metabolic pathways.

    PubMed

    Goesmann, Alexander; Haubrock, Martin; Meyer, Folker; Kalinowski, Jörn; Giegerich, Robert

    2002-01-01

    Beyond methods for a gene-wise annotation and analysis of sequenced genomes new automated methods for functional analysis on a higher level are needed. The identification of realized metabolic pathways provides valuable information on gene expression and regulation. Detection of incomplete pathways helps to improve a constantly evolving genome annotation or discover alternative biochemical pathways. To utilize automated genome analysis on the level of metabolic pathways new methods for the dynamic representation and visualization of pathways are needed. PathFinder is a tool for the dynamic visualization of metabolic pathways based on annotation data. Pathways are represented as directed acyclic graphs, graph layout algorithms accomplish the dynamic drawing and visualization of the metabolic maps. A more detailed analysis of the input data on the level of biochemical pathways helps to identify genes and detect improper parts of annotations. As an Relational Database Management System (RDBMS) based internet application PathFinder reads a list of EC-numbers or a given annotation in EMBL- or Genbank-format and dynamically generates pathway graphs.

  18. Large-scale identification and characterization of alternative splicing variants of human gene transcripts using 56 419 completely sequenced and manually annotated full-length cDNAs

    PubMed Central

    Takeda, Jun-ichi; Suzuki, Yutaka; Nakao, Mitsuteru; Barrero, Roberto A.; Koyanagi, Kanako O.; Jin, Lihua; Motono, Chie; Hata, Hiroko; Isogai, Takao; Nagai, Keiichi; Otsuki, Tetsuji; Kuryshev, Vladimir; Shionyu, Masafumi; Yura, Kei; Go, Mitiko; Thierry-Mieg, Jean; Thierry-Mieg, Danielle; Wiemann, Stefan; Nomura, Nobuo; Sugano, Sumio; Gojobori, Takashi; Imanishi, Tadashi

    2006-01-01

    We report the first genome-wide identification and characterization of alternative splicing in human gene transcripts based on analysis of the full-length cDNAs. Applying both manual and computational analyses for 56 419 completely sequenced and precisely annotated full-length cDNAs selected for the H-Invitational human transcriptome annotation meetings, we identified 6877 alternative splicing genes with 18 297 different alternative splicing variants. A total of 37 670 exons were involved in these alternative splicing events. The encoded protein sequences were affected in 6005 of the 6877 genes. Notably, alternative splicing affected protein motifs in 3015 genes, subcellular localizations in 2982 genes and transmembrane domains in 1348 genes. We also identified interesting patterns of alternative splicing, in which two distinct genes seemed to be bridged, nested or having overlapping protein coding sequences (CDSs) of different reading frames (multiple CDS). In these cases, completely unrelated proteins are encoded by a single locus. Genome-wide annotations of alternative splicing, relying on full-length cDNAs, should lay firm groundwork for exploring in detail the diversification of protein function, which is mediated by the fast expanding universe of alternative splicing variants. PMID:16914452

  19. A transversal approach to predict gene product networks from ontology-based similarity

    PubMed Central

    Chabalier, Julie; Mosser, Jean; Burgun, Anita

    2007-01-01

    Background Interpretation of transcriptomic data is usually made through a "standard" approach which consists in clustering the genes according to their expression patterns and exploiting Gene Ontology (GO) annotations within each expression cluster. This approach makes it difficult to underline functional relationships between gene products that belong to different expression clusters. To address this issue, we propose a transversal analysis that aims to predict functional networks based on a combination of GO processes and data expression. Results The transversal approach presented in this paper consists in computing the semantic similarity between gene products in a Vector Space Model. Through a weighting scheme over the annotations, we take into account the representativity of the terms that annotate a gene product. Comparing annotation vectors results in a matrix of gene product similarities. Combined with expression data, the matrix is displayed as a set of functional gene networks. The transversal approach was applied to 186 genes related to the enterocyte differentiation stages. This approach resulted in 18 functional networks proved to be biologically relevant. These results were compared with those obtained through a standard approach and with an approach based on information content similarity. Conclusion Complementary to the standard approach, the transversal approach offers new insight into the cellular mechanisms and reveals new research hypotheses by combining gene product networks based on semantic similarity, and data expression. PMID:17605807

  20. rSNPBase 3.0: an updated database of SNP-related regulatory elements, element-gene pairs and SNP-based gene regulatory networks.

    PubMed

    Guo, Liyuan; Wang, Jing

    2018-01-04

    Here, we present the updated rSNPBase 3.0 database (http://rsnp3.psych.ac.cn), which provides human SNP-related regulatory elements, element-gene pairs and SNP-based regulatory networks. This database is the updated version of the SNP regulatory annotation database rSNPBase and rVarBase. In comparison to the last two versions, there are both structural and data adjustments in rSNPBase 3.0: (i) The most significant new feature is the expansion of analysis scope from SNP-related regulatory elements to include regulatory element-target gene pairs (E-G pairs), therefore it can provide SNP-based gene regulatory networks. (ii) Web function was modified according to data content and a new network search module is provided in the rSNPBase 3.0 in addition to the previous regulatory SNP (rSNP) search module. The two search modules support data query for detailed information (related-elements, element-gene pairs, and other extended annotations) on specific SNPs and SNP-related graphic networks constructed by interacting transcription factors (TFs), miRNAs and genes. (3) The type of regulatory elements was modified and enriched. To our best knowledge, the updated rSNPBase 3.0 is the first data tool supports SNP functional analysis from a regulatory network prospective, it will provide both a comprehensive understanding and concrete guidance for SNP-related regulatory studies. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.

  1. rSNPBase 3.0: an updated database of SNP-related regulatory elements, element-gene pairs and SNP-based gene regulatory networks

    PubMed Central

    2018-01-01

    Abstract Here, we present the updated rSNPBase 3.0 database (http://rsnp3.psych.ac.cn), which provides human SNP-related regulatory elements, element-gene pairs and SNP-based regulatory networks. This database is the updated version of the SNP regulatory annotation database rSNPBase and rVarBase. In comparison to the last two versions, there are both structural and data adjustments in rSNPBase 3.0: (i) The most significant new feature is the expansion of analysis scope from SNP-related regulatory elements to include regulatory element–target gene pairs (E–G pairs), therefore it can provide SNP-based gene regulatory networks. (ii) Web function was modified according to data content and a new network search module is provided in the rSNPBase 3.0 in addition to the previous regulatory SNP (rSNP) search module. The two search modules support data query for detailed information (related-elements, element-gene pairs, and other extended annotations) on specific SNPs and SNP-related graphic networks constructed by interacting transcription factors (TFs), miRNAs and genes. (3) The type of regulatory elements was modified and enriched. To our best knowledge, the updated rSNPBase 3.0 is the first data tool supports SNP functional analysis from a regulatory network prospective, it will provide both a comprehensive understanding and concrete guidance for SNP-related regulatory studies. PMID:29140525

  2. Multiplex ligation-dependent probe amplification analysis on capillary electrophoresis instruments for a rapid gene copy number study.

    PubMed

    Jankowski, Stéphane; Currie-Fraser, Erica; Xu, Licen; Coffa, Jordy

    2008-09-01

    Annotated DNA samples that had been previously analyzed were tested using multiplex ligation-dependent probe amplification (MLPA) assays containing probes targeting BRCA1, BRCA2, and MMR (MLH1/MSH2 genes) and the 9p21 chromosomal region. MLPA polymerase chain reaction products were separated on a capillary electrophoresis platform, and the data were analyzed using GeneMapper v4.0 software (Applied Biosystems, Foster City, CA). After signal normalization, loci regions that had undergone deletions or duplications were identified using the GeneMapper Report Manager and verified using the DyeScale functionality. The results highlight an easy-to-use, optimal sample preparation and analysis workflow that can be used for both small- and large-scale studies.

  3. The Genome Sequence of Mannheimia haemolytica A1: Insights into Virulence, Natural Competence, and Pasteurellaceae Phylogeny†

    PubMed Central

    Gioia, Jason; Qin, Xiang; Jiang, Huaiyang; Clinkenbeard, Kenneth; Lo, Reggie; Liu, Yamei; Fox, George E.; Yerrapragada, Shailaja; McLeod, Michael P.; McNeill, Thomas Z.; Hemphill, Lisa; Sodergren, Erica; Wang, Qiaoyan; Muzny, Donna M.; Homsi, Farah J.; Weinstock, George M.; Highlander, Sarah K.

    2006-01-01

    The draft genome sequence of Mannheimia haemolytica A1, the causative agent of bovine respiratory disease complex (BRDC), is presented. Strain ATCC BAA-410, isolated from the lung of a calf with BRDC, was the DNA source. The annotated genome includes 2,839 coding sequences, 1,966 of which were assigned a function and 436 of which are unique to M. haemolytica. Through genome annotation many features of interest were identified, including bacteriophages and genes related to virulence, natural competence, and transcriptional regulation. In addition to previously described virulence factors, M. haemolytica encodes adhesins, including the filamentous hemagglutinin FhaB and two trimeric autotransporter adhesins. Two dual-function immunoglobulin-protease/adhesins are also present, as is a third immunoglobulin protease. Genes related to iron acquisition and drug resistance were identified and are likely important for survival in the host and virulence. Analysis of the genome indicates that M. haemolytica is naturally competent, as genes for natural competence and DNA uptake signal sequences (USS) are present. Comparison of competence loci and USS in other species in the family Pasteurellaceae indicates that M. haemolytica, Actinobacillus pleuropneumoniae, and Haemophilus ducreyi form a lineage distinct from other Pasteurellaceae. This observation was supported by a phylogenetic analysis using sequences of predicted housekeeping genes. PMID:17015664

  4. Mapping annotations with textual evidence using an scLDA model.

    PubMed

    Jin, Bo; Chen, Vicky; Chen, Lujia; Lu, Xinghua

    2011-01-01

    Most of the knowledge regarding genes and proteins is stored in biomedical literature as free text. Extracting information from complex biomedical texts demands techniques capable of inferring biological concepts from local text regions and mapping them to controlled vocabularies. To this end, we present a sentence-based correspondence latent Dirichlet allocation (scLDA) model which, when trained with a corpus of PubMed documents with known GO annotations, performs the following tasks: 1) learning major biological concepts from the corpus, 2) inferring the biological concepts existing within text regions (sentences), and 3) identifying the text regions in a document that provides evidence for the observed annotations. When applied to new gene-related documents, a trained scLDA model is capable of predicting GO annotations and identifying text regions as textual evidence supporting the predicted annotations. This study uses GO annotation data as a testbed; the approach can be generalized to other annotated data, such as MeSH and MEDLINE documents.

  5. MalaCards: an integrated compendium for diseases and their annotation

    PubMed Central

    Rappaport, Noa; Nativ, Noam; Stelzer, Gil; Twik, Michal; Guan-Golan, Yaron; Iny Stein, Tsippi; Bahir, Iris; Belinky, Frida; Morrey, C. Paul; Safran, Marilyn; Lancet, Doron

    2013-01-01

    Comprehensive disease classification, integration and annotation are crucial for biomedical discovery. At present, disease compilation is incomplete, heterogeneous and often lacking systematic inquiry mechanisms. We introduce MalaCards, an integrated database of human maladies and their annotations, modeled on the architecture and strategy of the GeneCards database of human genes. MalaCards mines and merges 44 data sources to generate a computerized card for each of 16 919 human diseases. Each MalaCard contains disease-specific prioritized annotations, as well as inter-disease connections, empowered by the GeneCards relational database, its searches and GeneDecks set analyses. First, we generate a disease list from 15 ranked sources, using disease-name unification heuristics. Next, we use four schemes to populate MalaCards sections: (i) directly interrogating disease resources, to establish integrated disease names, synonyms, summaries, drugs/therapeutics, clinical features, genetic tests and anatomical context; (ii) searching GeneCards for related publications, and for associated genes with corresponding relevance scores; (iii) analyzing disease-associated gene sets in GeneDecks to yield affiliated pathways, phenotypes, compounds and GO terms, sorted by a composite relevance score and presented with GeneCards links; and (iv) searching within MalaCards itself, e.g. for additional related diseases and anatomical context. The latter forms the basis for the construction of a disease network, based on shared MalaCards annotations, embodying associations based on etiology, clinical features and clinical conditions. This broadly disposed network has a power-law degree distribution, suggesting that this might be an inherent property of such networks. Work in progress includes hierarchical malady classification, ontological mapping and disease set analyses, striving to make MalaCards an even more effective tool for biomedical research. Database URL: http://www.malacards.org/ PMID:23584832

  6. Xander: employing a novel method for efficient gene-targeted metagenomic assembly.

    PubMed

    Wang, Qiong; Fish, Jordan A; Gilman, Mariah; Sun, Yanni; Brown, C Titus; Tiedje, James M; Cole, James R

    2015-01-01

    Metagenomics can provide important insight into microbial communities. However, assembling metagenomic datasets has proven to be computationally challenging. Current methods often assemble only fragmented partial genes. We present a novel method for targeting assembly of specific protein-coding genes. This method combines a de Bruijn graph, as used in standard assembly approaches, and a protein profile hidden Markov model (HMM) for the gene of interest, as used in standard annotation approaches. These are used to create a novel combined weighted assembly graph. Xander performs both assembly and annotation concomitantly using information incorporated in this graph. We demonstrate the utility of this approach by assembling contigs for one phylogenetic marker gene and for two functional marker genes, first on Human Microbiome Project (HMP)-defined community Illumina data and then on 21 rhizosphere soil metagenomic datasets from three different crops totaling over 800 Gbp of unassembled data. We compared our method to a recently published bulk metagenome assembly method and a recently published gene-targeted assembler and found our method produced more, longer, and higher quality gene sequences. Xander combines gene assignment with the rapid assembly of full-length or near full-length functional genes from metagenomic data without requiring bulk assembly or post-processing to find genes of interest. HMMs used for assembly can be tailored to the targeted genes, allowing flexibility to improve annotation over generic annotation pipelines. This method is implemented as open source software and is available at https://github.com/rdpstaff/Xander_assembler.

  7. ExpTreeDB: web-based query and visualization of manually annotated gene expression profiling experiments of human and mouse from GEO.

    PubMed

    Ni, Ming; Ye, Fuqiang; Zhu, Juanjuan; Li, Zongwei; Yang, Shuai; Yang, Bite; Han, Lu; Wu, Yongge; Chen, Ying; Li, Fei; Wang, Shengqi; Bo, Xiaochen

    2014-12-01

    Numerous public microarray datasets are valuable resources for the scientific communities. Several online tools have made great steps to use these data by querying related datasets with users' own gene signatures or expression profiles. However, dataset annotation and result exhibition still need to be improved. ExpTreeDB is a database that allows for queries on human and mouse microarray experiments from Gene Expression Omnibus with gene signatures or profiles. Compared with similar applications, ExpTreeDB pays more attention to dataset annotations and result visualization. We introduced a multiple-level annotation system to depict and organize original experiments. For example, a tamoxifen-treated cell line experiment is hierarchically annotated as 'agent→drug→estrogen receptor antagonist→tamoxifen'. Consequently, retrieved results are exhibited by an interactive tree-structured graphics, which provide an overview for related experiments and might enlighten users on key items of interest. The database is freely available at http://biotech.bmi.ac.cn/ExpTreeDB. Web site is implemented in Perl, PHP, R, MySQL and Apache. © The Author 2014. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.

  8. Global Identification and Characterization of Transcriptionally Active Regions in the Rice Genome

    PubMed Central

    Stolc, Viktor; Deng, Wei; He, Hang; Korbel, Jan; Chen, Xuewei; Tongprasit, Waraporn; Ronald, Pamela; Chen, Runsheng; Gerstein, Mark; Wang Deng, Xing

    2007-01-01

    Genome tiling microarray studies have consistently documented rich transcriptional activity beyond the annotated genes. However, systematic characterization and transcriptional profiling of the putative novel transcripts on the genome scale are still lacking. We report here the identification of 25,352 and 27,744 transcriptionally active regions (TARs) not encoded by annotated exons in the rice (Oryza. sativa) subspecies japonica and indica, respectively. The non-exonic TARs account for approximately two thirds of the total TARs detected by tiling arrays and represent transcripts likely conserved between japonica and indica. Transcription of 21,018 (83%) japonica non-exonic TARs was verified through expression profiling in 10 tissue types using a re-array in which annotated genes and TARs were each represented by five independent probes. Subsequent analyses indicate that about 80% of the japonica TARs that were not assigned to annotated exons can be assigned to various putatively functional or structural elements of the rice genome, including splice variants, uncharacterized portions of incompletely annotated genes, antisense transcripts, duplicated gene fragments, and potential non-coding RNAs. These results provide a systematic characterization of non-exonic transcripts in rice and thus expand the current view of the complexity and dynamics of the rice transcriptome. PMID:17372628

  9. Genome-wide transcription start site profiling in biofilm-grown Burkholderia cenocepacia J2315.

    PubMed

    Sass, Andrea M; Van Acker, Heleen; Förstner, Konrad U; Van Nieuwerburgh, Filip; Deforce, Dieter; Vogel, Jörg; Coenye, Tom

    2015-10-13

    Burkholderia cenocepacia is a soil-dwelling Gram-negative Betaproteobacterium with an important role as opportunistic pathogen in humans. Infections with B. cenocepacia are very difficult to treat due to their high intrinsic resistance to most antibiotics. Biofilm formation further adds to their antibiotic resistance. B. cenocepacia harbours a large, multi-replicon genome with a high GC-content, the reference genome of strain J2315 includes 7374 annotated genes. This study aims to annotate transcription start sites and identify novel transcripts on a whole genome scale. RNA extracted from B. cenocepacia J2315 biofilms was analysed by differential RNA-sequencing and the resulting dataset compared to data derived from conventional, global RNA-sequencing. Transcription start sites were annotated and further analysed according to their position relative to annotated genes. Four thousand ten transcription start sites were mapped over the whole B. cenocepacia genome and the primary transcription start site of 2089 genes expressed in B. cenocepacia biofilms were defined. For 64 genes a start codon alternative to the annotated one was proposed. Substantial antisense transcription for 105 genes and two novel protein coding sequences were identified. The distribution of internal transcription start sites can be used to identify genomic islands in B. cenocepacia. A potassium pump strongly induced only under biofilm conditions was found and 15 non-coding small RNAs highly expressed in biofilms were discovered. Mapping transcription start sites across the B. cenocepacia genome added relevant information to the J2315 annotation. Genes and novel regulatory RNAs putatively involved in B. cenocepacia biofilm formation were identified. These findings will help in understanding regulation of B. cenocepacia biofilm formation.

  10. Exogean: a framework for annotating protein-coding genes in eukaryotic genomic DNA

    PubMed Central

    Djebali, Sarah; Delaplace, Franck; Crollius, Hugues Roest

    2006-01-01

    Background Accurate and automatic gene identification in eukaryotic genomic DNA is more than ever of crucial importance to efficiently exploit the large volume of assembled genome sequences available to the community. Automatic methods have always been considered less reliable than human expertise. This is illustrated in the EGASP project, where reference annotations against which all automatic methods are measured are generated by human annotators and experimentally verified. We hypothesized that replicating the accuracy of human annotators in an automatic method could be achieved by formalizing the rules and decisions that they use, in a mathematical formalism. Results We have developed Exogean, a flexible framework based on directed acyclic colored multigraphs (DACMs) that can represent biological objects (for example, mRNA, ESTs, protein alignments, exons) and relationships between them. Graphs are analyzed to process the information according to rules that replicate those used by human annotators. Simple individual starting objects given as input to Exogean are thus combined and synthesized into complex objects such as protein coding transcripts. Conclusion We show here, in the context of the EGASP project, that Exogean is currently the method that best reproduces protein coding gene annotations from human experts, in terms of identifying at least one exact coding sequence per gene. We discuss current limitations of the method and several avenues for improvement. PMID:16925841

  11. Solving the Problem: Genome Annotation Standards before the Data Deluge.

    PubMed

    Klimke, William; O'Donovan, Claire; White, Owen; Brister, J Rodney; Clark, Karen; Fedorov, Boris; Mizrachi, Ilene; Pruitt, Kim D; Tatusova, Tatiana

    2011-10-15

    The promise of genome sequencing was that the vast undiscovered country would be mapped out by comparison of the multitude of sequences available and would aid researchers in deciphering the role of each gene in every organism. Researchers recognize that there is a need for high quality data. However, different annotation procedures, numerous databases, and a diminishing percentage of experimentally determined gene functions have resulted in a spectrum of annotation quality. NCBI in collaboration with sequencing centers, archival databases, and researchers, has developed the first international annotation standards, a fundamental step in ensuring that high quality complete prokaryotic genomes are available as gold standard references. Highlights include the development of annotation assessment tools, community acceptance of protein naming standards, comparison of annotation resources to provide consistent annotation, and improved tracking of the evidence used to generate a particular annotation. The development of a set of minimal standards, including the requirement for annotated complete prokaryotic genomes to contain a full set of ribosomal RNAs, transfer RNAs, and proteins encoding core conserved functions, is an historic milestone. The use of these standards in existing genomes and future submissions will increase the quality of databases, enabling researchers to make accurate biological discoveries.

  12. Solving the Problem: Genome Annotation Standards before the Data Deluge

    PubMed Central

    Klimke, William; O'Donovan, Claire; White, Owen; Brister, J. Rodney; Clark, Karen; Fedorov, Boris; Mizrachi, Ilene; Pruitt, Kim D.; Tatusova, Tatiana

    2011-01-01

    The promise of genome sequencing was that the vast undiscovered country would be mapped out by comparison of the multitude of sequences available and would aid researchers in deciphering the role of each gene in every organism. Researchers recognize that there is a need for high quality data. However, different annotation procedures, numerous databases, and a diminishing percentage of experimentally determined gene functions have resulted in a spectrum of annotation quality. NCBI in collaboration with sequencing centers, archival databases, and researchers, has developed the first international annotation standards, a fundamental step in ensuring that high quality complete prokaryotic genomes are available as gold standard references. Highlights include the development of annotation assessment tools, community acceptance of protein naming standards, comparison of annotation resources to provide consistent annotation, and improved tracking of the evidence used to generate a particular annotation. The development of a set of minimal standards, including the requirement for annotated complete prokaryotic genomes to contain a full set of ribosomal RNAs, transfer RNAs, and proteins encoding core conserved functions, is an historic milestone. The use of these standards in existing genomes and future submissions will increase the quality of databases, enabling researchers to make accurate biological discoveries. PMID:22180819

  13. Towards understanding the first genome sequence of a crenarchaeon by genome annotation using clusters of orthologous groups of proteins (COGs).

    PubMed

    Natale, D A; Shankavaram, U T; Galperin, M Y; Wolf, Y I; Aravind, L; Koonin, E V

    2000-01-01

    Standard archival sequence databases have not been designed as tools for genome annotation and are far from being optimal for this purpose. We used the database of Clusters of Orthologous Groups of proteins (COGs) to reannotate the genomes of two archaea, Aeropyrum pernix, the first member of the Crenarchaea to be sequenced, and Pyrococcus abyssi. A. pernix and P. abyssi proteins were assigned to COGs using the COGNITOR program; the results were verified on a case-by-case basis and augmented by additional database searches using the PSI-BLAST and TBLASTN programs. Functions were predicted for over 300 proteins from A. pernix, which could not be assigned a function using conventional methods with a conservative sequence similarity threshold, an approximately 50% increase compared to the original annotation. A. pernix shares most of the conserved core of proteins that were previously identified in the Euryarchaeota. Cluster analysis or distance matrix tree construction based on the co-occurrence of genomes in COGs showed that A. pernix forms a distinct group within the archaea, although grouping with the two species of Pyrococci, indicative of similar repertoires of conserved genes, was observed. No indication of a specific relationship between Crenarchaeota and eukaryotes was obtained in these analyses. Several proteins that are conserved in Euryarchaeota and most bacteria are unexpectedly missing in A. pernix, including the entire set of de novo purine biosynthesis enzymes, the GTPase FtsZ (a key component of the bacterial and euryarchaeal cell-division machinery), and the tRNA-specific pseudouridine synthase, previously considered universal. A. pernix is represented in 48 COGs that do not contain any euryarchaeal members. Many of these proteins are TCA cycle and electron transport chain enzymes, reflecting the aerobic lifestyle of A. pernix. Special-purpose databases organized on the basis of phylogenetic analysis and carefully curated with respect to known and predicted protein functions provide for a significant improvement in genome annotation. A differential genome display approach helps in a systematic investigation of common and distinct features of gene repertoires and in some cases reveals unexpected connections that may be indicative of functional similarities between phylogenetically distant organisms and of lateral gene exchange.

  14. Towards understanding the first genome sequence of a crenarchaeon by genome annotation using clusters of orthologous groups of proteins (COGs)

    PubMed Central

    Natale, Darren A; Shankavaram, Uma T; Galperin, Michael Y; Wolf, Yuri I; Aravind, L; Koonin, Eugene V

    2000-01-01

    Background: Standard archival sequence databases have not been designed as tools for genome annotation and are far from being optimal for this purpose. We used the database of Clusters of Orthologous Groups of proteins (COGs) to reannotate the genomes of two archaea, Aeropyrum pernix, the first member of the Crenarchaea to be sequenced, and Pyrococcus abyssi. Results: A. pernix and P. abyssi proteins were assigned to COGs using the COGNITOR program; the results were verified on a case-by-case basis and augmented by additional database searches using the PSI-BLAST and TBLASTN programs. Functions were predicted for over 300 proteins from A. pernix, which could not be assigned a function using conventional methods with a conservative sequence similarity threshold, an approximately 50% increase compared to the original annotation. A. pernix shares most of the conserved core of proteins that were previously identified in the Euryarchaeota. Cluster analysis or distance matrix tree construction based on the co-occurrence of genomes in COGs showed that A. pernix forms a distinct group within the archaea, although grouping with the two species of Pyrococci, indicative of similar repertoires of conserved genes, was observed. No indication of a specific relationship between Crenarchaeota and eukaryotes was obtained in these analyses. Several proteins that are conserved in Euryarchaeota and most bacteria are unexpectedly missing in A. pernix, including the entire set of de novo purine biosynthesis enzymes, the GTPase FtsZ (a key component of the bacterial and euryarchaeal cell-division machinery), and the tRNA-specific pseudouridine synthase, previously considered universal. A. pernix is represented in 48 COGs that do not contain any euryarchaeal members. Many of these proteins are TCA cycle and electron transport chain enzymes, reflecting the aerobic lifestyle of A. pernix. Conclusions: Special-purpose databases organized on the basis of phylogenetic analysis and carefully curated with respect to known and predicted protein functions provide for a significant improvement in genome annotation. A differential genome display approach helps in a systematic investigation of common and distinct features of gene repertoires and in some cases reveals unexpected connections that may be indicative of functional similarities between phylogenetically distant organisms and of lateral gene exchange. PMID:11178258

  15. Genome-Wide Identification and Mapping of NBS-Encoding Resistance Genes in Solanum tuberosum Group Phureja

    PubMed Central

    Lozano, Roberto; Ponce, Olga; Ramirez, Manuel; Mostajo, Nelly; Orjeda, Gisella

    2012-01-01

    The majority of disease resistance (R) genes identified to date in plants encode a nucleotide-binding site (NBS) and leucine-rich repeat (LRR) domain containing protein. Additional domains such as coiled-coil (CC) and TOLL/interleukin-1 receptor (TIR) domains can also be present. In the recently sequenced Solanum tuberosum group phureja genome we used HMM models and manual curation to annotate 435 NBS-encoding R gene homologs and 142 NBS-derived genes that lack the NBS domain. Highly similar homologs for most previously documented Solanaceae R genes were identified. A surprising ∼41% (179) of the 435 NBS-encoding genes are pseudogenes primarily caused by premature stop codons or frameshift mutations. Alignment of 81.80% of the 577 homologs to S. tuberosum group phureja pseudomolecules revealed non-random distribution of the R-genes; 362 of 470 genes were found in high density clusters on 11 chromosomes. PMID:22493716

  16. A-WINGS: an integrated genome database for Pleurocybella porrigens (Angel's wing oyster mushroom, Sugihiratake).

    PubMed

    Yamamoto, Naoki; Suzuki, Tomohiro; Kobayashi, Masaaki; Dohra, Hideo; Sasaki, Yohei; Hirai, Hirofumi; Yokoyama, Koji; Kawagishi, Hirokazu; Yano, Kentaro

    2014-12-03

    The angel's wing oyster mushroom (Pleurocybella porrigens, Sugihiratake) is a well-known delicacy. However, its potential risk in acute encephalopathy was recently revealed by a food poisoning incident. To disclose the genes underlying the accident and provide mechanistic insight, we seek to develop an information infrastructure containing omics data. In our previous work, we sequenced the genome and transcriptome using next-generation sequencing techniques. The next step in achieving our goal is to develop a web database to facilitate the efficient mining of large-scale omics data and identification of genes specifically expressed in the mushroom. This paper introduces a web database A-WINGS (http://bioinf.mind.meiji.ac.jp/a-wings/) that provides integrated genomic and transcriptomic information for the angel's wing oyster mushroom. The database contains structure and functional annotations of transcripts and gene expressions. Functional annotations contain information on homologous sequences from NCBI nr and UniProt, Gene Ontology, and KEGG Orthology. Digital gene expression profiles were derived from RNA sequencing (RNA-seq) analysis in the fruiting bodies and mycelia. The omics information stored in the database is freely accessible through interactive and graphical interfaces by search functions that include 'GO TREE VIEW' browsing, keyword searches, and BLAST searches. The A-WINGS database will accelerate omics studies on specific aspects of the angel's wing oyster mushroom and the family Tricholomataceae.

  17. Transcriptome sequencing and annotation of the microalgae Dunaliella tertiolecta: Pathway description and gene discovery for production of next-generation biofuels

    PubMed Central

    2011-01-01

    Background Biodiesel or ethanol derived from lipids or starch produced by microalgae may overcome many of the sustainability challenges previously ascribed to petroleum-based fuels and first generation plant-based biofuels. The paucity of microalgae genome sequences, however, limits gene-based biofuel feedstock optimization studies. Here we describe the sequencing and de novo transcriptome assembly for the non-model microalgae species, Dunaliella tertiolecta, and identify pathways and genes of importance related to biofuel production. Results Next generation DNA pyrosequencing technology applied to D. tertiolecta transcripts produced 1,363,336 high quality reads with an average length of 400 bases. Following quality and size trimming, ~ 45% of the high quality reads were assembled into 33,307 isotigs with a 31-fold coverage and 376,482 singletons. Assembled sequences and singletons were subjected to BLAST similarity searches and annotated with Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) orthology (KO) identifiers. These analyses identified the majority of lipid and starch biosynthesis and catabolism pathways in D. tertiolecta. Conclusions The construction of metabolic pathways involved in the biosynthesis and catabolism of fatty acids, triacylglycrols, and starch in D. tertiolecta as well as the assembled transcriptome provide a foundation for the molecular genetics and functional genomics required to direct metabolic engineering efforts that seek to enhance the quantity and character of microalgae-based biofuel feedstock. PMID:21401935

  18. Initial description of primate-specific cystine-knot Prometheus genes and differential gene expansions of D-dopachrome tautomerase genes

    PubMed Central

    Premzl, Marko

    2015-01-01

    Using eutherian comparative genomic analysis protocol and public genomic sequence data sets, the present work attempted to update and revise two gene data sets. The most comprehensive third party annotation gene data sets of eutherian adenohypophysis cystine-knot genes (128 complete coding sequences), and d-dopachrome tautomerases and macrophage migration inhibitory factor genes (30 complete coding sequences) were annotated. For example, the present study first described primate-specific cystine-knot Prometheus genes, as well as differential gene expansions of D-dopachrome tautomerase genes. Furthermore, new frameworks of future experiments of two eutherian gene data sets were proposed. PMID:25941635

  19. BG7: A New Approach for Bacterial Genome Annotation Designed for Next Generation Sequencing Data

    PubMed Central

    Pareja-Tobes, Pablo; Manrique, Marina; Pareja-Tobes, Eduardo; Pareja, Eduardo; Tobes, Raquel

    2012-01-01

    BG7 is a new system for de novo bacterial, archaeal and viral genome annotation based on a new approach specifically designed for annotating genomes sequenced with next generation sequencing technologies. The system is versatile and able to annotate genes even in the step of preliminary assembly of the genome. It is especially efficient detecting unexpected genes horizontally acquired from bacterial or archaeal distant genomes, phages, plasmids, and mobile elements. From the initial phases of the gene annotation process, BG7 exploits the massive availability of annotated protein sequences in databases. BG7 predicts ORFs and infers their function based on protein similarity with a wide set of reference proteins, integrating ORF prediction and functional annotation phases in just one step. BG7 is especially tolerant to sequencing errors in start and stop codons, to frameshifts, and to assembly or scaffolding errors. The system is also tolerant to the high level of gene fragmentation which is frequently found in not fully assembled genomes. BG7 current version – which is developed in Java, takes advantage of Amazon Web Services (AWS) cloud computing features, but it can also be run locally in any operating system. BG7 is a fast, automated and scalable system that can cope with the challenge of analyzing the huge amount of genomes that are being sequenced with NGS technologies. Its capabilities and efficiency were demonstrated in the 2011 EHEC Germany outbreak in which BG7 was used to get the first annotations right the next day after the first entero-hemorrhagic E. coli genome sequences were made publicly available. The suitability of BG7 for genome annotation has been proved for Illumina, 454, Ion Torrent, and PacBio sequencing technologies. Besides, thanks to its plasticity, our system could be very easily adapted to work with new technologies in the future. PMID:23185310

  20. Challenges in Whole-Genome Annotation of Pyrosequenced Eukaryotic Genomes

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Kuo, Alan; Grigoriev, Igor

    2009-04-17

    Pyrosequencing technologies such as 454/Roche and Solexa/Illumina vastly lower the cost of nucleotide sequencing compared to the traditional Sanger method, and thus promise to greatly expand the number of sequenced eukaryotic genomes. However, the new technologies also bring new challenges such as shorter reads and new kinds and higher rates of sequencing errors, which complicate genome assembly and gene prediction. At JGI we are deploying 454 technology for the sequencing and assembly of ever-larger eukaryotic genomes. Here we describe our first whole-genome annotation of a purely 454-sequenced fungal genome that is larger than a yeast (>30 Mbp). The pezizomycotine (filamentousmore » ascomycote) Aspergillus carbonarius belongs to the Aspergillus section Nigri species complex, members of which are significant as platforms for bioenergy and bioindustrial technology, as members of soil microbial communities and players in the global carbon cycle, and as agricultural toxigens. Application of a modified version of the standard JGI Annotation Pipeline has so far predicted ~;;10k genes. ~;;12percent of these preliminary annotations suffer a potential frameshift error, which is somewhat higher than the ~;;9percent rate in the Sanger-sequenced and conventionally assembled and annotated genome of fellow Aspergillus section Nigri member A. niger. Also,>90percent of A. niger genes have potential homologs in the A. carbonarius preliminary annotation. Weconclude, and with further annotation and comparative analysis expect to confirm, that 454 sequencing strategies provide a promising substrate for annotation of modestly sized eukaryotic genomes. We will also present results of annotation of a number of other pyrosequenced fungal genomes of bioenergy interest.« less

  1. Genomic assessment of the evolution of the prion protein gene family in vertebrates.

    PubMed

    Harrison, Paul M; Khachane, Amit; Kumar, Manish

    2010-05-01

    Prion diseases are devastating neurological disorders caused by the propagation of particles containing an alternative beta-sheet-rich form of the prion protein (PrP). Genes paralogous to PrP, called Doppel and Shadoo, have been identified, that also have neuropathological relevance. To aid in the further functional characterization of PrP and its relatives, we annotated completely the PrP gene family (PrP-GF), in the genomes of 42 vertebrates, through combined strategic application of gene prediction programs and advanced remote homology detection techniques (such as HMMs, PSI-TBLASTN and pGenThreader). We have uncovered several previously undescribed paralogous genes and pseudogenes. We find that current high-quality genomic evidence indicates that the PrP relative Doppel, was likely present in the last common ancestor of present-day Tetrapoda, but was lost in the bird lineage, since its divergence from reptiles. Using the new gene annotations, we have defined the consensus of structural features that are characteristic of the PrP and Doppel structures, across diverse Tetrapoda clades. Furthermore, we describe in detail a transcribed pseudogene derived from Shadoo that is conserved across primates, and that overlaps the meiosis gene, SYCE1, thus possibly regulating its expression. In addition, we analysed the locus of PRNP/PRND for significant conservation across the genomic DNA of eleven mammals, and determined the phylogenetic penetration of non-coding exons. The genomic evidence indicates that the second PRNP non-coding exon found in even-toed ungulates and rodents, is conserved in all high-coverage genome assemblies of primates (human, chimp, orang utan and macaque), and is, at least, likely to have fallen out of use during primate speciation. Furthermore, we have demonstrated that the PRNT gene (at the PRNP human locus) is conserved across at least sixteen mammals, and evolves like a long non-coding RNA, fashioned from fragments of ancient, long, interspersed elements. These annotations and evolutionary analyses will be of further use for functional characterisation of the PrP-GF, and will be updatable in a semi-automated fashion as more genomes accumulate. Copyright 2010 Elsevier Inc. All rights reserved.

  2. Evolutionary interrogation of human biology in well-annotated genomic framework of rhesus macaque.

    PubMed

    Zhang, Shi-Jian; Liu, Chu-Jun; Yu, Peng; Zhong, Xiaoming; Chen, Jia-Yu; Yang, Xinzhuang; Peng, Jiguang; Yan, Shouyu; Wang, Chenqu; Zhu, Xiaotong; Xiong, Jingwei; Zhang, Yong E; Tan, Bertrand Chin-Ming; Li, Chuan-Yun

    2014-05-01

    With genome sequence and composition highly analogous to human, rhesus macaque represents a unique reference for evolutionary studies of human biology. Here, we developed a comprehensive genomic framework of rhesus macaque, the RhesusBase2, for evolutionary interrogation of human genes and the associated regulations. A total of 1,667 next-generation sequencing (NGS) data sets were processed, integrated, and evaluated, generating 51.2 million new functional annotation records. With extensive NGS annotations, RhesusBase2 refined the fine-scale structures in 30% of the macaque Ensembl transcripts, reporting an accurate, up-to-date set of macaque gene models. On the basis of these annotations and accurate macaque gene models, we further developed an NGS-oriented Molecular Evolution Gateway to access and visualize macaque annotations in reference to human orthologous genes and associated regulations (www.rhesusbase.org/molEvo). We highlighted the application of this well-annotated genomic framework in generating hypothetical link of human-biased regulations to human-specific traits, by using mechanistic characterization of the DIEXF gene as an example that provides novel clues to the understanding of digestive system reduction in human evolution. On a global scale, we also identified a catalog of 9,295 human-biased regulatory events, which may represent novel elements that have a substantial impact on shaping human transcriptome and possibly underpin recent human phenotypic evolution. Taken together, we provide an NGS data-driven, information-rich framework that will broadly benefit genomics research in general and serves as an important resource for in-depth evolutionary studies of human biology.

  3. Gene expression profiling of human alveolar macrophages infected by B. anthracis spores demonstrates TNF-α and NF-κb are key components of the innate immune response to the pathogen

    PubMed Central

    2009-01-01

    Background Bacillus anthracis, the etiologic agent of anthrax, has recently been used as an agent of bioterrorism. The innate immune system initially appears to contain the pathogen at the site of entry. Because the human alveolar macrophage (HAM) plays a key role in lung innate immune responses, studying the HAM response to B. anthracis is important in understanding the pathogenesis of the pulmonary form of this disease. Methods In this paper, the transcriptional profile of B. anthracis spore-treated HAM was compared with that of mock-infected cells, and differentially expressed genes were identified by Affymetrix microarray analysis. A portion of the results were verified by Luminex protein analysis. Results The majority of genes modulated by spores were upregulated, and a lesser number were downregulated. The differentially expressed genes were subjected to Ingenuity Pathway analysis, the Database for Annotation, Visualization and Integrated Discovery (DAVID) analysis, the Promoter Analysis and Interaction Network Toolset (PAINT) and Oncomine analysis. Among the upregulated genes, we identified a group of chemokine ligand, apoptosis, and, interestingly, keratin filament genes. Central hubs regulating the activated genes were TNF-α, NF-κB and their ligands/receptors. In addition to TNF-α, a broad range of cytokines was induced, and this was confirmed at the level of translation by Luminex multiplex protein analysis. PAINT analysis revealed that many of the genes affected by spores contain the binding site for c-Rel, a member of the NF-κB family of transcription factors. Other transcription regulatory elements contained in many of the upregulated genes were c-Myb, CP2, Barbie Box, E2F and CRE-BP1. However, many of the genes are poorly annotated, indicating that they represent novel functions. Four of the genes most highly regulated by spores have only previously been associated with head and neck and lung carcinomas. Conclusion The results demonstrate not only that TNF-α and NF-κb are key components of the innate immune response to the pathogen, but also that a large part of the mechanisms by which the alveolar macrophage responds to B. anthracis are still unknown as many of the genes involved are poorly annotated. PMID:19744333

  4. The what, where, how and why of gene ontology—a primer for bioinformaticians

    PubMed Central

    du Plessis, Louis; Škunca, Nives

    2011-01-01

    With high-throughput technologies providing vast amounts of data, it has become more important to provide systematic, quality annotations. The Gene Ontology (GO) project is the largest resource for cataloguing gene function. Nonetheless, its use is not yet ubiquitous and is still fraught with pitfalls. In this review, we provide a short primer to the GO for bioinformaticians. We summarize important aspects of the structure of the ontology, describe sources and types of functional annotations, survey measures of GO annotation similarity, review typical uses of GO and discuss other important considerations pertaining to the use of GO in bioinformatics applications. PMID:21330331

  5. EST Express: PHP/MySQL based automated annotation of ESTs from expression libraries

    PubMed Central

    Smith, Robin P; Buchser, William J; Lemmon, Marcus B; Pardinas, Jose R; Bixby, John L; Lemmon, Vance P

    2008-01-01

    Background Several biological techniques result in the acquisition of functional sets of cDNAs that must be sequenced and analyzed. The emergence of redundant databases such as UniGene and centralized annotation engines such as Entrez Gene has allowed the development of software that can analyze a great number of sequences in a matter of seconds. Results We have developed "EST Express", a suite of analytical tools that identify and annotate ESTs originating from specific mRNA populations. The software consists of a user-friendly GUI powered by PHP and MySQL that allows for online collaboration between researchers and continuity with UniGene, Entrez Gene and RefSeq. Two key features of the software include a novel, simplified Entrez Gene parser and tools to manage cDNA library sequencing projects. We have tested the software on a large data set (2,016 samples) produced by subtractive hybridization. Conclusion EST Express is an open-source, cross-platform web server application that imports sequences from cDNA libraries, such as those generated through subtractive hybridization or yeast two-hybrid screens. It then provides several layers of annotation based on Entrez Gene and RefSeq to allow the user to highlight useful genes and manage cDNA library projects. PMID:18402700

  6. Sequencing, annotation and comparative analysis of nine BACs of giant panda (Ailuropoda melanoleuca).

    PubMed

    Zheng, Yang; Cai, Jing; Li, JianWen; Li, Bo; Lin, Runmao; Tian, Feng; Wang, XiaoLing; Wang, Jun

    2010-01-01

    A 10-fold BAC library for giant panda was constructed and nine BACs were selected to generate finish sequences. These BACs could be used as a validation resource for the de novo assembly accuracy of the whole genome shotgun sequencing reads of giant panda newly generated by the Illumina GA sequencing technology. Complete sanger sequencing, assembly, annotation and comparative analysis were carried out on the selected BACs of a joint length 878 kb. Homologue search and de novo prediction methods were used to annotate genes and repeats. Twelve protein coding genes were predicted, seven of which could be functionally annotated. The seven genes have an average gene size of about 41 kb, an average coding size of about 1.2 kb and an average exon number of 6 per gene. Besides, seven tRNA genes were found. About 27 percent of the BAC sequence is composed of repeats. A phylogenetic tree was constructed using neighbor-join algorithm across five species, including giant panda, human, dog, cat and mouse, which reconfirms dog as the most related species to giant panda. Our results provide detailed sequence and structure information for new genes and repeats of giant panda, which will be helpful for further studies on the giant panda.

  7. EST Express: PHP/MySQL based automated annotation of ESTs from expression libraries.

    PubMed

    Smith, Robin P; Buchser, William J; Lemmon, Marcus B; Pardinas, Jose R; Bixby, John L; Lemmon, Vance P

    2008-04-10

    Several biological techniques result in the acquisition of functional sets of cDNAs that must be sequenced and analyzed. The emergence of redundant databases such as UniGene and centralized annotation engines such as Entrez Gene has allowed the development of software that can analyze a great number of sequences in a matter of seconds. We have developed "EST Express", a suite of analytical tools that identify and annotate ESTs originating from specific mRNA populations. The software consists of a user-friendly GUI powered by PHP and MySQL that allows for online collaboration between researchers and continuity with UniGene, Entrez Gene and RefSeq. Two key features of the software include a novel, simplified Entrez Gene parser and tools to manage cDNA library sequencing projects. We have tested the software on a large data set (2,016 samples) produced by subtractive hybridization. EST Express is an open-source, cross-platform web server application that imports sequences from cDNA libraries, such as those generated through subtractive hybridization or yeast two-hybrid screens. It then provides several layers of annotation based on Entrez Gene and RefSeq to allow the user to highlight useful genes and manage cDNA library projects.

  8. Large-scale, multi-genome analysis of alternate open reading frames in bacteria and archaea.

    PubMed

    Veloso, Felipe; Riadi, Gonzalo; Aliaga, Daniela; Lieph, Ryan; Holmes, David S

    2005-01-01

    Analysis of over 300,000 annotated genes in 105 bacterial and archaeal genomes reveals an unexpectedly high frequency of large (>300 nucleotides) alternate open reading frames (ORFs). Especially notable is the very high frequency of alternate ORFs in frames +3 and -1 (where the annotated gene is defined as frame +1). The occurrence of alternate ORFs is correlated with genomic G+C content and is strongly influenced by synonymous codon usage bias. The frequency of alternate ORFs in frame -1 is also influenced by the occurrence of codons encoding leucine and serine in frame +1. Although some alternate ORFs have been shown to encode proteins, many others are probably not expressed because they lack appropriate signals for transcription and translation. These latter can be mis-annotated by automatic gene finding programs leading to errors in public databases. Especially prone to mis-annotation is frame -1, because it exhibits a potential codon usage and theoretical capacity to encode proteins with an amino acid composition most similar to real genes. Some alternate ORFs are conserved across bacterial or archaeal species, and can give rise to misannotated "conserved hypothetical" genes, while others are unique to a genome and are misidentified as "hypothetical orphan" genes, contributing significantly to the orphan gene paradox.

  9. Fine-scale genetic mapping of a hybrid sterility factor between Drosophila simulans and D. mauritiana: the varied and elusive functions of "speciation genes".

    PubMed

    Araripe, Luciana O; Montenegro, Horácio; Lemos, Bernardo; Hartl, Daniel L

    2010-12-14

    Hybrid male sterility (HMS) is a usual outcome of hybridization between closely related animal species. It arises because interactions between alleles that are functional within one species may be disrupted in hybrids. The identification of genes leading to hybrid sterility is of great interest for understanding the evolutionary process of speciation. In the current work we used marked P-element insertions as dominant markers to efficiently locate one genetic factor causing a severe reduction in fertility in hybrid males of Drosophila simulans and D. mauritiana. Our mapping effort identified a region of 9 kb on chromosome 3, containing three complete and one partial coding sequences. Within this region, two annotated genes are suggested as candidates for the HMS factor, based on the comparative molecular characterization and public-source information. Gene Taf1 is partially contained in the region, but yet shows high polymorphism with four fixed non-synonymous substitutions between the two species. Its molecular functions involve sequence-specific DNA binding and transcription factor activity. Gene agt is a small, intronless gene, whose molecular function is annotated as methylated-DNA-protein-cysteine S-methyltransferase activity. High polymorphism and one fixed non-synonymous substitution suggest this is a fast evolving gene. The gene trees of both genes perfectly separate D. simulans and D. mauritiana into monophyletic groups. Analysis of gene expression using microarray revealed trends that were similar to those previously found in comparisons between whole-genome hybrids and parental species. The identification following confirmation of the HMS candidate gene will add another case study leading to understanding the evolutionary process of hybrid incompatibility.

  10. RNA sequencing reveals sexually dimorphic gene expression before gonadal differentiation in chicken and allows comprehensive annotation of the W-chromosome

    PubMed Central

    2013-01-01

    Background Birds have a ZZ male: ZW female sex chromosome system and while the Z-linked DMRT1 gene is necessary for testis development, the exact mechanism of sex determination in birds remains unsolved. This is partly due to the poor annotation of the W chromosome, which is speculated to carry a female determinant. Few genes have been mapped to the W and little is known of their expression. Results We used RNA-seq to produce a comprehensive profile of gene expression in chicken blastoderms and embryonic gonads prior to sexual differentiation. We found robust sexually dimorphic gene expression in both tissues pre-dating gonadogenesis, including sex-linked and autosomal genes. This supports the hypothesis that sexual differentiation at the molecular level is at least partly cell autonomous in birds. Different sets of genes were sexually dimorphic in the two tissues, indicating that molecular sexual differentiation is tissue specific. Further analyses allowed the assembly of full-length transcripts for 26 W chromosome genes, providing a view of the W transcriptome in embryonic tissues. This is the first extensive analysis of W-linked genes and their expression profiles in early avian embryos. Conclusion Sexual differentiation at the molecular level is established in chicken early in embryogenesis, before gonadal sex differentiation. We find that the W chromosome is more transcriptionally active than previously thought, expand the number of known genes to 26 and present complete coding sequences for these W genes. This includes two novel W-linked sequences and three small RNAs reassigned to the W from the Un_Random chromosome. PMID:23531366

  11. Dense Subgraphs with Restrictions and Applications to Gene Annotation Graphs

    NASA Astrophysics Data System (ADS)

    Saha, Barna; Hoch, Allison; Khuller, Samir; Raschid, Louiqa; Zhang, Xiao-Ning

    In this paper, we focus on finding complex annotation patterns representing novel and interesting hypotheses from gene annotation data. We define a generalization of the densest subgraph problem by adding an additional distance restriction (defined by a separate metric) to the nodes of the subgraph. We show that while this generalization makes the problem NP-hard for arbitrary metrics, when the metric comes from the distance metric of a tree, or an interval graph, the problem can be solved optimally in polynomial time. We also show that the densest subgraph problem with a specified subset of vertices that have to be included in the solution can be solved optimally in polynomial time. In addition, we consider other extensions when not just one solution needs to be found, but we wish to list all subgraphs of almost maximum density as well. We apply this method to a dataset of genes and their annotations obtained from The Arabidopsis Information Resource (TAIR). A user evaluation confirms that the patterns found in the distance restricted densest subgraph for a dataset of photomorphogenesis genes are indeed validated in the literature; a control dataset validates that these are not random patterns. Interestingly, the complex annotation patterns potentially lead to new and as yet unknown hypotheses. We perform experiments to determine the properties of the dense subgraphs, as we vary parameters, including the number of genes and the distance.

  12. Microarray data mining using Bioconductor packages.

    PubMed

    Nie, Haisheng; Neerincx, Pieter B T; van der Poel, Jan; Ferrari, Francesco; Bicciato, Silvio; Leunissen, Jack A M; Groenen, Martien A M

    2009-07-16

    This paper describes the results of a Gene Ontology (GO) term enrichment analysis of chicken microarray data using the Bioconductor packages. By checking the enriched GO terms in three contrasts, MM8-PM8, MM8-MA8, and MM8-MM24, of the provided microarray data during this workshop, this analysis aimed to investigate the host reactions in chickens occurring shortly after a secondary challenge with either a homologous or heterologous species of Eimeria. The results of GO enrichment analysis using GO terms annotated to chicken genes and GO terms annotated to chicken-human orthologous genes were also compared. Furthermore, a locally adaptive statistical procedure (LAP) was performed to test differentially expressed chromosomal regions, rather than individual genes, in the chicken genome after Eimeria challenge. GO enrichment analysis identified significant (raw p-value < 0.05) GO terms for all three contrasts included in the analysis. Some of the GO terms linked to, generally, primary immune responses or secondary immune responses indicating the GO enrichment analysis is a useful approach to analyze microarray data. The comparisons of GO enrichment results using chicken gene information and chicken-human orthologous gene information showed more refined GO terms related to immune responses when using chicken-human orthologous gene information, this suggests that using chicken-human orthologous gene information has higher power to detect significant GO terms with more refined functionality. Furthermore, three chromosome regions were identified to be significantly up-regulated in contrast MM8-PM8 (q-value < 0.01). Overall, this paper describes a practical approach to analyze microarray data in farm animals where the genome information is still incomplete. For farm animals, such as chicken, with currently limited gene annotation, borrowing gene annotation information from orthologous genes in well-annotated species, such as human, will help improve the pathway analysis results substantially. Furthermore, LAP analysis approach is a relatively new and very useful way to be applied in microarray analysis.

  13. Finding the missing honey bee genes: lessons learned from a genome upgrade.

    PubMed

    Elsik, Christine G; Worley, Kim C; Bennett, Anna K; Beye, Martin; Camara, Francisco; Childers, Christopher P; de Graaf, Dirk C; Debyser, Griet; Deng, Jixin; Devreese, Bart; Elhaik, Eran; Evans, Jay D; Foster, Leonard J; Graur, Dan; Guigo, Roderic; Hoff, Katharina Jasmin; Holder, Michael E; Hudson, Matthew E; Hunt, Greg J; Jiang, Huaiyang; Joshi, Vandita; Khetani, Radhika S; Kosarev, Peter; Kovar, Christie L; Ma, Jian; Maleszka, Ryszard; Moritz, Robin F A; Munoz-Torres, Monica C; Murphy, Terence D; Muzny, Donna M; Newsham, Irene F; Reese, Justin T; Robertson, Hugh M; Robinson, Gene E; Rueppell, Olav; Solovyev, Victor; Stanke, Mario; Stolle, Eckart; Tsuruda, Jennifer M; Vaerenbergh, Matthias Van; Waterhouse, Robert M; Weaver, Daniel B; Whitfield, Charles W; Wu, Yuanqing; Zdobnov, Evgeny M; Zhang, Lan; Zhu, Dianhui; Gibbs, Richard A

    2014-01-30

    The first generation of genome sequence assemblies and annotations have had a significant impact upon our understanding of the biology of the sequenced species, the phylogenetic relationships among species, the study of populations within and across species, and have informed the biology of humans. As only a few Metazoan genomes are approaching finished quality (human, mouse, fly and worm), there is room for improvement of most genome assemblies. The honey bee (Apis mellifera) genome, published in 2006, was noted for its bimodal GC content distribution that affected the quality of the assembly in some regions and for fewer genes in the initial gene set (OGSv1.0) compared to what would be expected based on other sequenced insect genomes. Here, we report an improved honey bee genome assembly (Amel_4.5) with a new gene annotation set (OGSv3.2), and show that the honey bee genome contains a number of genes similar to that of other insect genomes, contrary to what was suggested in OGSv1.0. The new genome assembly is more contiguous and complete and the new gene set includes ~5000 more protein-coding genes, 50% more than previously reported. About 1/6 of the additional genes were due to improvements to the assembly, and the remaining were inferred based on new RNAseq and protein data. Lessons learned from this genome upgrade have important implications for future genome sequencing projects. Furthermore, the improvements significantly enhance genomic resources for the honey bee, a key model for social behavior and essential to global ecology through pollination.

  14. Finding the missing honey bee genes: lessons learned from a genome upgrade

    PubMed Central

    2014-01-01

    Background The first generation of genome sequence assemblies and annotations have had a significant impact upon our understanding of the biology of the sequenced species, the phylogenetic relationships among species, the study of populations within and across species, and have informed the biology of humans. As only a few Metazoan genomes are approaching finished quality (human, mouse, fly and worm), there is room for improvement of most genome assemblies. The honey bee (Apis mellifera) genome, published in 2006, was noted for its bimodal GC content distribution that affected the quality of the assembly in some regions and for fewer genes in the initial gene set (OGSv1.0) compared to what would be expected based on other sequenced insect genomes. Results Here, we report an improved honey bee genome assembly (Amel_4.5) with a new gene annotation set (OGSv3.2), and show that the honey bee genome contains a number of genes similar to that of other insect genomes, contrary to what was suggested in OGSv1.0. The new genome assembly is more contiguous and complete and the new gene set includes ~5000 more protein-coding genes, 50% more than previously reported. About 1/6 of the additional genes were due to improvements to the assembly, and the remaining were inferred based on new RNAseq and protein data. Conclusions Lessons learned from this genome upgrade have important implications for future genome sequencing projects. Furthermore, the improvements significantly enhance genomic resources for the honey bee, a key model for social behavior and essential to global ecology through pollination. PMID:24479613

  15. Identification of genetic elements in metabolism by high-throughput mouse phenotyping.

    PubMed

    Rozman, Jan; Rathkolb, Birgit; Oestereicher, Manuela A; Schütt, Christine; Ravindranath, Aakash Chavan; Leuchtenberger, Stefanie; Sharma, Sapna; Kistler, Martin; Willershäuser, Monja; Brommage, Robert; Meehan, Terrence F; Mason, Jeremy; Haselimashhadi, Hamed; Hough, Tertius; Mallon, Ann-Marie; Wells, Sara; Santos, Luis; Lelliott, Christopher J; White, Jacqueline K; Sorg, Tania; Champy, Marie-France; Bower, Lynette R; Reynolds, Corey L; Flenniken, Ann M; Murray, Stephen A; Nutter, Lauryl M J; Svenson, Karen L; West, David; Tocchini-Valentini, Glauco P; Beaudet, Arthur L; Bosch, Fatima; Braun, Robert B; Dobbie, Michael S; Gao, Xiang; Herault, Yann; Moshiri, Ala; Moore, Bret A; Kent Lloyd, K C; McKerlie, Colin; Masuya, Hiroshi; Tanaka, Nobuhiko; Flicek, Paul; Parkinson, Helen E; Sedlacek, Radislav; Seong, Je Kyung; Wang, Chi-Kuang Leo; Moore, Mark; Brown, Steve D; Tschöp, Matthias H; Wurst, Wolfgang; Klingenspor, Martin; Wolf, Eckhard; Beckers, Johannes; Machicao, Fausto; Peter, Andreas; Staiger, Harald; Häring, Hans-Ulrich; Grallert, Harald; Campillos, Monica; Maier, Holger; Fuchs, Helmut; Gailus-Durner, Valerie; Werner, Thomas; Hrabe de Angelis, Martin

    2018-01-18

    Metabolic diseases are a worldwide problem but the underlying genetic factors and their relevance to metabolic disease remain incompletely understood. Genome-wide research is needed to characterize so-far unannotated mammalian metabolic genes. Here, we generate and analyze metabolic phenotypic data of 2016 knockout mouse strains under the aegis of the International Mouse Phenotyping Consortium (IMPC) and find 974 gene knockouts with strong metabolic phenotypes. 429 of those had no previous link to metabolism and 51 genes remain functionally completely unannotated. We compared human orthologues of these uncharacterized genes in five GWAS consortia and indeed 23 candidate genes are associated with metabolic disease. We further identify common regulatory elements in promoters of candidate genes. As each regulatory element is composed of several transcription factor binding sites, our data reveal an extensive metabolic phenotype-associated network of co-regulated genes. Our systematic mouse phenotype analysis thus paves the way for full functional annotation of the genome.

  16. Draft genome of the leopard gecko, Eublepharis macularius.

    PubMed

    Xiong, Zijun; Li, Fang; Li, Qiye; Zhou, Long; Gamble, Tony; Zheng, Jiao; Kui, Ling; Li, Cai; Li, Shengbin; Yang, Huanming; Zhang, Guojie

    2016-10-26

    Geckos are among the most species-rich reptile groups and the sister clade to all other lizards and snakes. Geckos possess a suite of distinctive characteristics, including adhesive digits, nocturnal activity, hard, calcareous eggshells, and a lack of eyelids. However, one gecko clade, the Eublepharidae, appears to be the exception to most of these 'rules' and lacks adhesive toe pads, has eyelids, and lays eggs with soft, leathery eggshells. These differences make eublepharids an important component of any investigation into the underlying genomic innovations contributing to the distinctive phenotypes in 'typical' geckos. We report high-depth genome sequencing, assembly, and annotation for a male leopard gecko, Eublepharis macularius (Eublepharidae). Illumina sequence data were generated from seven insert libraries (ranging from 170 to 20 kb), representing a raw sequencing depth of 136X from 303 Gb of data, reduced to 84X and 187 Gb after filtering. The assembled genome of 2.02 Gb was close to the 2.23 Gb estimated by k-mer analysis. Scaffold and contig N50 sizes of 664 and 20 kb, respectively, were comparable to the previously published Gekko japonicus genome. Repetitive elements accounted for 42 % of the genome. Gene annotation yielded 24,755 protein-coding genes, of which 93 % were functionally annotated. CEGMA and BUSCO assessment showed that our assembly captured 91 % (225 of 248) of the core eukaryotic genes, and 76 % of vertebrate universal single-copy orthologs. Assembly of the leopard gecko genome provides a valuable resource for future comparative genomic studies of geckos and other squamate reptiles.

  17. Gene co-expression network analysis in Rhodobacter capsulatus and application to comparative expression analysis of Rhodobacter sphaeroides

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Pena-Castillo, Lourdes; Mercer, Ryan; Gurinovich, Anastasia

    2014-08-28

    The genus Rhodobacter contains purple nonsulfur bacteria found mostly in freshwater environments. Representative strains of two Rhodobacter species, R. capsulatus and R. sphaeroides, have had their genomes fully sequenced and both have been the subject of transcriptional profiling studies. Gene co-expression networks can be used to identify modules of genes with similar expression profiles. Functional analysis of gene modules can then associate co-expressed genes with biological pathways, and network statistics can determine the degree of module preservation in related networks. In this paper, we constructed an R. capsulatus gene co-expression network, performed functional analysis of identified gene modules, and investigatedmore » preservation of these modules in R. capsulatus proteomics data and in R. sphaeroides transcriptomics data. Results: The analysis identified 40 gene co-expression modules in R. capsulatus. Investigation of the module gene contents and expression profiles revealed patterns that were validated based on previous studies supporting the biological relevance of these modules. We identified two R. capsulatus gene modules preserved in the protein abundance data. We also identified several gene modules preserved between both Rhodobacter species, which indicate that these cellular processes are conserved between the species and are candidates for functional information transfer between species. Many gene modules were non-preserved, providing insight into processes that differentiate the two species. In addition, using Local Network Similarity (LNS), a recently proposed metric for expression divergence, we assessed the expression conservation of between-species pairs of orthologs, and within-species gene-protein expression profiles. Conclusions: Our analyses provide new sources of information for functional annotation in R. capsulatus because uncharacterized genes in modules are now connected with groups of genes that constitute a joint functional annotation. We identified R. capsulatus modules enriched with genes for ribosomal proteins, porphyrin and bacteriochlorophyll anabolism, and biosynthesis of secondary metabolites to be preserved in R. sphaeroides whereas modules related to RcGTA production and signalling showed lack of preservation in R. sphaeroides. In addition, we demonstrated that network statistics may also be applied within-species to identify congruence between mRNA expression and protein abundance data for which simple correlation measurements have previously had mixed results.« less

  18. CSTminer: a web tool for the identification of coding and noncoding conserved sequence tags through cross-species genome comparison

    PubMed Central

    Castrignanò, Tiziana; Canali, Alessandro; Grillo, Giorgio; Liuni, Sabino; Mignone, Flavio; Pesole, Graziano

    2004-01-01

    The identification and characterization of genome tracts that are highly conserved across species during evolution may contribute significantly to the functional annotation of whole-genome sequences. Indeed, such sequences are likely to correspond to known or unknown coding exons or regulatory motifs. Here, we present a web server implementing a previously developed algorithm that, by comparing user-submitted genome sequences, is able to identify statistically significant conserved blocks and assess their coding or noncoding nature through the measure of a coding potential score. The web tool, available at http://www.caspur.it/CSTminer/, is dynamically interconnected with the Ensembl genome resources and produces a graphical output showing a map of detected conserved sequences and annotated gene features. PMID:15215464

  19. Lynx web services for annotations and systems analysis of multi-gene disorders.

    PubMed

    Sulakhe, Dinanath; Taylor, Andrew; Balasubramanian, Sandhya; Feng, Bo; Xie, Bingqing; Börnigen, Daniela; Dave, Utpal J; Foster, Ian T; Gilliam, T Conrad; Maltsev, Natalia

    2014-07-01

    Lynx is a web-based integrated systems biology platform that supports annotation and analysis of experimental data and generation of weighted hypotheses on molecular mechanisms contributing to human phenotypes and disorders of interest. Lynx has integrated multiple classes of biomedical data (genomic, proteomic, pathways, phenotypic, toxicogenomic, contextual and others) from various public databases as well as manually curated data from our group and collaborators (LynxKB). Lynx provides tools for gene list enrichment analysis using multiple functional annotations and network-based gene prioritization. Lynx provides access to the integrated database and the analytical tools via REST based Web Services (http://lynx.ci.uchicago.edu/webservices.html). This comprises data retrieval services for specific functional annotations, services to search across the complete LynxKB (powered by Lucene), and services to access the analytical tools built within the Lynx platform. © The Author(s) 2014. Published by Oxford University Press on behalf of Nucleic Acids Research.

  20. Issues with RNA-seq analysis in non-model organisms: A salmonid example.

    PubMed

    Sundaram, Arvind; Tengs, Torstein; Grimholt, Unni

    2017-10-01

    High throughput sequencing (HTS) is useful for many purposes as exemplified by the other topics included in this special issue. The purpose of this paper is to look into the unique challenges of using this technology in non-model organisms where resources such as genomes, functional genome annotations or genome complexity provide obstacles not met in model organisms. To describe these challenges, we narrow our scope to RNA sequencing used to study differential gene expression in response to pathogen challenge. As a demonstration species we chose Atlantic salmon, which has a sequenced genome with poor annotation and an added complexity due to many duplicated genes. We find that our RNA-seq analysis pipeline deciphers between duplicates despite high sequence identity. However, annotation issues provide problems in linking differentially expressed genes to pathways. Also, comparing results between approaches and species are complicated due to lack of standardized annotation. Copyright © 2017 Elsevier Ltd. All rights reserved.

  1. Exome Sequencing Analysis Reveals Variants in Primary Immunodeficiency Genes in Patients With Very Early Onset Inflammatory Bowel Disease

    PubMed Central

    Kelsen, Judith R.; Dawany, Noor; Moran, Christopher J.; Petersen, Britt-Sabina; Sarmady, Mahdi; Sasson, Ariella; Pauly-Hubbard, Helen; Martinez, Alejandro; Maurer, Kelly; Soong, Joanne; Rappaport, Eric; Franke, Andre; Keller, Andreas; Winter, Harland S.; Mamula, Petar; Piccoli, David; Artis, David; Sonnenberg, Gregory F.; Daly, Mark; Sullivan, Kathleen E.; Baldassano, Robert N.; Devoto, Marcella

    2016-01-01

    Background & Aims Very early onset inflammatory bowel disease (VEO-IBD), IBD diagnosed ≤5 y of age, frequently presents with a different and more severe phenotype than older-onset IBD. We investigated whether patients with VEO-IBD carry rare or novel variants in genes associated with immunodeficiencies that might contribute to disease development. Methods Patients with VEO-IBD and parents (when available) were recruited from the Children's Hospital of Philadelphia from March 2013 through July 2014. We analyzed DNA from 125 patients with VEO-IBD (ages 3 weeks to 4 y) and 19 parents, 4 of whom also had IBD. Exome capture was performed by Agilent SureSelect V4, and sequencing was performed using the Illumina HiSeq platform. Alignment to human genome GRCh37 was achieved followed by post-processing and variant calling. Following functional annotation, candidate variants were analyzed for change in protein function, minor allele frequency <0.1%, and scaled combined annotation dependent depletion scores ≤10. We focused on genes associated with primary immunodeficiencies and related pathways. An additional 210 exome samples from patients with pediatric IBD (n=45) or adult-onset Crohn's disease (n=20) and healthy individuals (controls, n=145) were obtained from the University of Kiel, Germany and used as control groups. Results Four-hundred genes and regions associated with primary immunodeficiency, covering approximately 6500 coding exons totaling > 1 Mbp of coding sequence, were selected from the whole exome data. Our analysis revealed novel and rare variants within these genes that could contribute to the development of VEO-IBD, including rare heterozygous missense variants in IL10RA and previously unidentified variants in MSH5 and CD19. Conclusions In an exome sequence analysis of patients with VEO-IBD and their parents, we identified variants in genes that regulate B- and T-cell functions and could contribute to pathogenesis. Our analysis could lead to the identification of previously unidentified IBD-associated variants. PMID:26193622

  2. Automated Gene Ontology annotation for anonymous sequence data.

    PubMed

    Hennig, Steffen; Groth, Detlef; Lehrach, Hans

    2003-07-01

    Gene Ontology (GO) is the most widely accepted attempt to construct a unified and structured vocabulary for the description of genes and their products in any organism. Annotation by GO terms is performed in most of the current genome projects, which besides generality has the advantage of being very convenient for computer based classification methods. However, direct use of GO in small sequencing projects is not easy, especially for species not commonly represented in public databases. We present a software package (GOblet), which performs annotation based on GO terms for anonymous cDNA or protein sequences. It uses the species independent GO structure and vocabulary together with a series of protein databases collected from various sites, to perform a detailed GO annotation by sequence similarity searches. The sensitivity and the reference protein sets can be selected by the user. GOblet runs automatically and is available as a public service on our web server. The paper also addresses the reliability of automated GO annotations by using a reference set of more than 6000 human proteins. The GOblet server is accessible at http://goblet.molgen.mpg.de.

  3. A manually annotated Actinidia chinensis var. chinensis (kiwifruit) genome highlights the challenges associated with draft genomes and gene prediction in plants.

    PubMed

    Pilkington, Sarah M; Crowhurst, Ross; Hilario, Elena; Nardozza, Simona; Fraser, Lena; Peng, Yongyan; Gunaseelan, Kularajathevan; Simpson, Robert; Tahir, Jibran; Deroles, Simon C; Templeton, Kerry; Luo, Zhiwei; Davy, Marcus; Cheng, Canhong; McNeilage, Mark; Scaglione, Davide; Liu, Yifei; Zhang, Qiong; Datson, Paul; De Silva, Nihal; Gardiner, Susan E; Bassett, Heather; Chagné, David; McCallum, John; Dzierzon, Helge; Deng, Cecilia; Wang, Yen-Yi; Barron, Lorna; Manako, Kelvina; Bowen, Judith; Foster, Toshi M; Erridge, Zoe A; Tiffin, Heather; Waite, Chethi N; Davies, Kevin M; Grierson, Ella P; Laing, William A; Kirk, Rebecca; Chen, Xiuyin; Wood, Marion; Montefiori, Mirco; Brummell, David A; Schwinn, Kathy E; Catanach, Andrew; Fullerton, Christina; Li, Dawei; Meiyalaghan, Sathiyamoorthy; Nieuwenhuizen, Niels; Read, Nicola; Prakash, Roneel; Hunter, Don; Zhang, Huaibi; McKenzie, Marian; Knäbel, Mareike; Harris, Alastair; Allan, Andrew C; Gleave, Andrew; Chen, Angela; Janssen, Bart J; Plunkett, Blue; Ampomah-Dwamena, Charles; Voogd, Charlotte; Leif, Davin; Lafferty, Declan; Souleyre, Edwige J F; Varkonyi-Gasic, Erika; Gambi, Francesco; Hanley, Jenny; Yao, Jia-Long; Cheung, Joey; David, Karine M; Warren, Ben; Marsh, Ken; Snowden, Kimberley C; Lin-Wang, Kui; Brian, Lara; Martinez-Sanchez, Marcela; Wang, Mindy; Ileperuma, Nadeesha; Macnee, Nikolai; Campin, Robert; McAtee, Peter; Drummond, Revel S M; Espley, Richard V; Ireland, Hilary S; Wu, Rongmei; Atkinson, Ross G; Karunairetnam, Sakuntala; Bulley, Sean; Chunkath, Shayhan; Hanley, Zac; Storey, Roy; Thrimawithana, Amali H; Thomson, Susan; David, Charles; Testolin, Raffaele; Huang, Hongwen; Hellens, Roger P; Schaffer, Robert J

    2018-04-16

    Most published genome sequences are drafts, and most are dominated by computational gene prediction. Draft genomes typically incorporate considerable sequence data that are not assigned to chromosomes, and predicted genes without quality confidence measures. The current Actinidia chinensis (kiwifruit) 'Hongyang' draft genome has 164 Mb of sequences unassigned to pseudo-chromosomes, and omissions have been identified in the gene models. A second genome of an A. chinensis (genotype Red5) was fully sequenced. This new sequence resulted in a 554.0 Mb assembly with all but 6 Mb assigned to pseudo-chromosomes. Pseudo-chromosomal comparisons showed a considerable number of translocation events have occurred following a whole genome duplication (WGD) event some consistent with centromeric Robertsonian-like translocations. RNA sequencing data from 12 tissues and ab initio analysis informed a genome-wide manual annotation, using the WebApollo tool. In total, 33,044 gene loci represented by 33,123 isoforms were identified, named and tagged for quality of evidential support. Of these 3114 (9.4%) were identical to a protein within 'Hongyang' The Kiwifruit Information Resource (KIR v2). Some proportion of the differences will be varietal polymorphisms. However, as most computationally predicted Red5 models required manual re-annotation this proportion is expected to be small. The quality of the new gene models was tested by fully sequencing 550 cloned 'Hort16A' cDNAs and comparing with the predicted protein models for Red5 and both the original 'Hongyang' assembly and the revised annotation from KIR v2. Only 48.9% and 63.5% of the cDNAs had a match with 90% identity or better to the original and revised 'Hongyang' annotation, respectively, compared with 90.9% to the Red5 models. Our study highlights the need to take a cautious approach to draft genomes and computationally predicted genes. Our use of the manual annotation tool WebApollo facilitated manual checking and correction of gene models enabling improvement of computational prediction. This utility was especially relevant for certain types of gene families such as the EXPANSIN like genes. Finally, this high quality gene set will supply the kiwifruit and general plant community with a new tool for genomics and other comparative analysis.

  4. A Putative Gene Cluster from a Lyngbya wollei Bloom that Encodes Paralytic Shellfish Toxin Biosynthesis

    PubMed Central

    Mihali, Troco K.; Carmichael, Wayne W.; Neilan, Brett A.

    2011-01-01

    Saxitoxin and its analogs cause the paralytic shellfish-poisoning syndrome, adversely affecting human health and coastal shellfish industries worldwide. Here we report the isolation, sequencing, annotation, and predicted pathway of the saxitoxin biosynthetic gene cluster in the cyanobacterium Lyngbya wollei. The gene cluster spans 36 kb and encodes enzymes for the biosynthesis and export of the toxins. The Lyngbya wollei saxitoxin gene cluster differs from previously identified saxitoxin clusters as it contains genes that are unique to this cluster, whereby the carbamoyltransferase is truncated and replaced by an acyltransferase, explaining the unique toxin profile presented by Lyngbya wollei. These findings will enable the creation of toxin probes, for water monitoring purposes, as well as proof-of-concept for the combinatorial biosynthesis of these natural occurring alkaloids for the production of novel, biologically active compounds. PMID:21347365

  5. Multiplex Ligation-Dependent Probe Amplification Analysis on Capillary Electrophoresis Instruments for a Rapid Gene Copy Number Study

    PubMed Central

    Jankowski, Stéphane; Currie-Fraser, Erica; Xu, Licen; Coffa, Jordy

    2008-01-01

    Annotated DNA samples that had been previously analyzed were tested using multiplex ligation-dependent probe amplification (MLPA) assays containing probes targeting BRCA1, BRCA2, and MMR (MLH1/MSH2 genes) and the 9p21 chromosomal region. MLPA polymerase chain reaction products were separated on a capillary electrophoresis platform, and the data were analyzed using GeneMapper v4.0 software (Applied Biosystems, Foster City, CA). After signal normalization, loci regions that had undergone deletions or duplications were identified using the GeneMapper Report Manager and verified using the DyeScale functionality. The results highlight an easy-to-use, optimal sample preparation and analysis workflow that can be used for both small- and large-scale studies. PMID:19137113

  6. The SEED and the Rapid Annotation of microbial genomes using Subsystems Technology (RAST)

    PubMed Central

    Overbeek, Ross; Olson, Robert; Pusch, Gordon D.; Olsen, Gary J.; Davis, James J.; Disz, Terry; Edwards, Robert A.; Gerdes, Svetlana; Parrello, Bruce; Shukla, Maulik; Vonstein, Veronika; Wattam, Alice R.; Xia, Fangfang; Stevens, Rick

    2014-01-01

    In 2004, the SEED (http://pubseed.theseed.org/) was created to provide consistent and accurate genome annotations across thousands of genomes and as a platform for discovering and developing de novo annotations. The SEED is a constantly updated integration of genomic data with a genome database, web front end, API and server scripts. It is used by many scientists for predicting gene functions and discovering new pathways. In addition to being a powerful database for bioinformatics research, the SEED also houses subsystems (collections of functionally related protein families) and their derived FIGfams (protein families), which represent the core of the RAST annotation engine (http://rast.nmpdr.org/). When a new genome is submitted to RAST, genes are called and their annotations are made by comparison to the FIGfam collection. If the genome is made public, it is then housed within the SEED and its proteins populate the FIGfam collection. This annotation cycle has proven to be a robust and scalable solution to the problem of annotating the exponentially increasing number of genomes. To date, >12 000 users worldwide have annotated >60 000 distinct genomes using RAST. Here we describe the interconnectedness of the SEED database and RAST, the RAST annotation pipeline and updates to both resources. PMID:24293654

  7. The SEED and the Rapid Annotation of microbial genomes using Subsystems Technology (RAST).

    PubMed

    Overbeek, Ross; Olson, Robert; Pusch, Gordon D; Olsen, Gary J; Davis, James J; Disz, Terry; Edwards, Robert A; Gerdes, Svetlana; Parrello, Bruce; Shukla, Maulik; Vonstein, Veronika; Wattam, Alice R; Xia, Fangfang; Stevens, Rick

    2014-01-01

    In 2004, the SEED (http://pubseed.theseed.org/) was created to provide consistent and accurate genome annotations across thousands of genomes and as a platform for discovering and developing de novo annotations. The SEED is a constantly updated integration of genomic data with a genome database, web front end, API and server scripts. It is used by many scientists for predicting gene functions and discovering new pathways. In addition to being a powerful database for bioinformatics research, the SEED also houses subsystems (collections of functionally related protein families) and their derived FIGfams (protein families), which represent the core of the RAST annotation engine (http://rast.nmpdr.org/). When a new genome is submitted to RAST, genes are called and their annotations are made by comparison to the FIGfam collection. If the genome is made public, it is then housed within the SEED and its proteins populate the FIGfam collection. This annotation cycle has proven to be a robust and scalable solution to the problem of annotating the exponentially increasing number of genomes. To date, >12 000 users worldwide have annotated >60 000 distinct genomes using RAST. Here we describe the interconnectedness of the SEED database and RAST, the RAST annotation pipeline and updates to both resources.

  8. Gene calling and bacterial genome annotation with BG7.

    PubMed

    Tobes, Raquel; Pareja-Tobes, Pablo; Manrique, Marina; Pareja-Tobes, Eduardo; Kovach, Evdokim; Alekhin, Alexey; Pareja, Eduardo

    2015-01-01

    New massive sequencing technologies are providing many bacterial genome sequences from diverse taxa but a refined annotation of these genomes is crucial for obtaining scientific findings and new knowledge. Thus, bacterial genome annotation has emerged as a key point to investigate in bacteria. Any efficient tool designed specifically to annotate bacterial genomes sequenced with massively parallel technologies has to consider the specific features of bacterial genomes (absence of introns and scarcity of nonprotein-coding sequence) and of next-generation sequencing (NGS) technologies (presence of errors and not perfectly assembled genomes). These features make it convenient to focus on coding regions and, hence, on protein sequences that are the elements directly related with biological functions. In this chapter we describe how to annotate bacterial genomes with BG7, an open-source tool based on a protein-centered gene calling/annotation paradigm. BG7 is specifically designed for the annotation of bacterial genomes sequenced with NGS. This tool is sequence error tolerant maintaining their capabilities for the annotation of highly fragmented genomes or for annotating mixed sequences coming from several genomes (as those obtained through metagenomics samples). BG7 has been designed with scalability as a requirement, with a computing infrastructure completely based on cloud computing (Amazon Web Services).

  9. Characterizing gene sets using discriminative random walks with restart on heterogeneous biological networks.

    PubMed

    Blatti, Charles; Sinha, Saurabh

    2016-07-15

    Analysis of co-expressed gene sets typically involves testing for enrichment of different annotations or 'properties' such as biological processes, pathways, transcription factor binding sites, etc., one property at a time. This common approach ignores any known relationships among the properties or the genes themselves. It is believed that known biological relationships among genes and their many properties may be exploited to more accurately reveal commonalities of a gene set. Previous work has sought to achieve this by building biological networks that combine multiple types of gene-gene or gene-property relationships, and performing network analysis to identify other genes and properties most relevant to a given gene set. Most existing network-based approaches for recognizing genes or annotations relevant to a given gene set collapse information about different properties to simplify (homogenize) the networks. We present a network-based method for ranking genes or properties related to a given gene set. Such related genes or properties are identified from among the nodes of a large, heterogeneous network of biological information. Our method involves a random walk with restarts, performed on an initial network with multiple node and edge types that preserve more of the original, specific property information than current methods that operate on homogeneous networks. In this first stage of our algorithm, we find the properties that are the most relevant to the given gene set and extract a subnetwork of the original network, comprising only these relevant properties. We then re-rank genes by their similarity to the given gene set, based on a second random walk with restarts, performed on the above subnetwork. We demonstrate the effectiveness of this algorithm for ranking genes related to Drosophila embryonic development and aggressive responses in the brains of social animals. DRaWR was implemented as an R package available at veda.cs.illinois.edu/DRaWR. blatti@illinois.edu Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press.

  10. Construction and EST sequencing of full-length, drought stress cDNA libraries for common beans (Phaseolus vulgaris L.)

    PubMed Central

    2011-01-01

    Background Common bean is an important legume crop with only a moderate number of short expressed sequence tags (ESTs) made with traditional methods. The goal of this research was to use full-length cDNA technology to develop ESTs that would overlap with the beginning of open reading frames and therefore be useful for gene annotation of genomic sequences. The library was also constructed to represent genes expressed under drought, low soil phosphorus and high soil aluminum toxicity. We also undertook comparisons of the full-length cDNA library to two previous non-full clone EST sets for common bean. Results Two full-length cDNA libraries were constructed: one for the drought tolerant Mesoamerican genotype BAT477 and the other one for the acid-soil tolerant Andean genotype G19833 which has been selected for genome sequencing. Plants were grown in three soil types using deep rooting cylinders subjected to drought and non-drought stress and tissues were collected from both roots and above ground parts. A total of 20,000 clones were selected robotically, half from each library. Then, nearly 10,000 clones from the G19833 library were sequenced with an average read length of 850 nucleotides. A total of 4,219 unigenes were identified consisting of 2,981 contigs and 1,238 singletons. These were functionally annotated with gene ontology terms and placed into KEGG pathways. Compared to other EST sequencing efforts in common bean, about half of the sequences were novel or represented the 5' ends of known genes. Conclusions The present full-length cDNA libraries add to the technological toolbox available for common bean and our sequencing of these clones substantially increases the number of unique EST sequences available for the common bean genome. All of this should be useful for both functional gene annotation, analysis of splice site variants and intron/exon boundary determination by comparison to soybean genes or with common bean whole-genome sequences. In addition the library has a large number of transcription factors and will be interesting for discovery and validation of drought or abiotic stress related genes in common bean. PMID:22118559

  11. Exploiting proteomic data for genome annotation and gene model validation in Aspergillus niger.

    PubMed

    Wright, James C; Sugden, Deana; Francis-McIntyre, Sue; Riba-Garcia, Isabel; Gaskell, Simon J; Grigoriev, Igor V; Baker, Scott E; Beynon, Robert J; Hubbard, Simon J

    2009-02-04

    Proteomic data is a potentially rich, but arguably unexploited, data source for genome annotation. Peptide identifications from tandem mass spectrometry provide prima facie evidence for gene predictions and can discriminate over a set of candidate gene models. Here we apply this to the recently sequenced Aspergillus niger fungal genome from the Joint Genome Institutes (JGI) and another predicted protein set from another A.niger sequence. Tandem mass spectra (MS/MS) were acquired from 1d gel electrophoresis bands and searched against all available gene models using Average Peptide Scoring (APS) and reverse database searching to produce confident identifications at an acceptable false discovery rate (FDR). 405 identified peptide sequences were mapped to 214 different A.niger genomic loci to which 4093 predicted gene models clustered, 2872 of which contained the mapped peptides. Interestingly, 13 (6%) of these loci either had no preferred predicted gene model or the genome annotators' chosen "best" model for that genomic locus was not found to be the most parsimonious match to the identified peptides. The peptides identified also boosted confidence in predicted gene structures spanning 54 introns from different gene models. This work highlights the potential of integrating experimental proteomics data into genomic annotation pipelines much as expressed sequence tag (EST) data has been. A comparison of the published genome from another strain of A.niger sequenced by DSM showed that a number of the gene models or proteins with proteomics evidence did not occur in both genomes, further highlighting the utility of the method.

  12. Unique Features of the Loblolly Pine (Pinus taeda L.) Megagenome Revealed Through Sequence Annotation

    PubMed Central

    Wegrzyn, Jill L.; Liechty, John D.; Stevens, Kristian A.; Wu, Le-Shin; Loopstra, Carol A.; Vasquez-Gross, Hans A.; Dougherty, William M.; Lin, Brian Y.; Zieve, Jacob J.; Martínez-García, Pedro J.; Holt, Carson; Yandell, Mark; Zimin, Aleksey V.; Yorke, James A.; Crepeau, Marc W.; Puiu, Daniela; Salzberg, Steven L.; de Jong, Pieter J.; Mockaitis, Keithanne; Main, Doreen; Langley, Charles H.; Neale, David B.

    2014-01-01

    The largest genus in the conifer family Pinaceae is Pinus, with over 100 species. The size and complexity of their genomes (∼20–40 Gb, 2n = 24) have delayed the arrival of a well-annotated reference sequence. In this study, we present the annotation of the first whole-genome shotgun assembly of loblolly pine (Pinus taeda L.), which comprises 20.1 Gb of sequence. The MAKER-P annotation pipeline combined evidence-based alignments and ab initio predictions to generate 50,172 gene models, of which 15,653 are classified as high confidence. Clustering these gene models with 13 other plant species resulted in 20,646 gene families, of which 1554 are predicted to be unique to conifers. Among the conifer gene families, 159 are composed exclusively of loblolly pine members. The gene models for loblolly pine have the highest median and mean intron lengths of 24 fully sequenced plant genomes. Conifer genomes are full of repetitive DNA, with the most significant contributions from long-terminal-repeat retrotransposons. In depth analysis of the tandem and interspersed repetitive content yielded a combined estimate of 82%. PMID:24653211

  13. Construction of ontology augmented networks for protein complex prediction.

    PubMed

    Zhang, Yijia; Lin, Hongfei; Yang, Zhihao; Wang, Jian

    2013-01-01

    Protein complexes are of great importance in understanding the principles of cellular organization and function. The increase in available protein-protein interaction data, gene ontology and other resources make it possible to develop computational methods for protein complex prediction. Most existing methods focus mainly on the topological structure of protein-protein interaction networks, and largely ignore the gene ontology annotation information. In this article, we constructed ontology augmented networks with protein-protein interaction data and gene ontology, which effectively unified the topological structure of protein-protein interaction networks and the similarity of gene ontology annotations into unified distance measures. After constructing ontology augmented networks, a novel method (clustering based on ontology augmented networks) was proposed to predict protein complexes, which was capable of taking into account the topological structure of the protein-protein interaction network, as well as the similarity of gene ontology annotations. Our method was applied to two different yeast protein-protein interaction datasets and predicted many well-known complexes. The experimental results showed that (i) ontology augmented networks and the unified distance measure can effectively combine the structure closeness and gene ontology annotation similarity; (ii) our method is valuable in predicting protein complexes and has higher F1 and accuracy compared to other competing methods.

  14. First generation annotations for the fathead minnow (Pimephales promelas) genome

    EPA Science Inventory

    Ab initio gene prediction and evidence alignment were used to produce the first annotations for the fathead minnow SOAPdenovo genome assembly. Additionally, a genome browser hosted at genome.setac.org provides simplified access to the annotation data in context with fathead minno...

  15. Construction of an annotated corpus to support biomedical information extraction

    PubMed Central

    Thompson, Paul; Iqbal, Syed A; McNaught, John; Ananiadou, Sophia

    2009-01-01

    Background Information Extraction (IE) is a component of text mining that facilitates knowledge discovery by automatically locating instances of interesting biomedical events from huge document collections. As events are usually centred on verbs and nominalised verbs, understanding the syntactic and semantic behaviour of these words is highly important. Corpora annotated with information concerning this behaviour can constitute a valuable resource in the training of IE components and resources. Results We have defined a new scheme for annotating sentence-bound gene regulation events, centred on both verbs and nominalised verbs. For each event instance, all participants (arguments) in the same sentence are identified and assigned a semantic role from a rich set of 13 roles tailored to biomedical research articles, together with a biological concept type linked to the Gene Regulation Ontology. To our knowledge, our scheme is unique within the biomedical field in terms of the range of event arguments identified. Using the scheme, we have created the Gene Regulation Event Corpus (GREC), consisting of 240 MEDLINE abstracts, in which events relating to gene regulation and expression have been annotated by biologists. A novel method of evaluating various different facets of the annotation task showed that average inter-annotator agreement rates fall within the range of 66% - 90%. Conclusion The GREC is a unique resource within the biomedical field, in that it annotates not only core relationships between entities, but also a range of other important details about these relationships, e.g., location, temporal, manner and environmental conditions. As such, it is specifically designed to support bio-specific tool and resource development. It has already been used to acquire semantic frames for inclusion within the BioLexicon (a lexical, terminological resource to aid biomedical text mining). Initial experiments have also shown that the corpus may viably be used to train IE components, such as semantic role labellers. The corpus and annotation guidelines are freely available for academic purposes. PMID:19852798

  16. Comparative Reannotation of 21 Aspergillus Genomes

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Salamov, Asaf; Riley, Robert; Kuo, Alan

    2013-03-08

    We used comparative gene modeling to reannotate 21 Aspergillus genomes. Initial automatic annotation of individual genomes may contain some errors of different nature, e.g. missing genes, incorrect exon-intron structures, 'chimeras', which fuse 2 or more real genes or alternatively splitting some real genes into 2 or more models. The main premise behind the comparative modeling approach is that for closely related genomes most orthologous families have the same conserved gene structure. The algorithm maps all gene models predicted in each individual Aspergillus genome to the other genomes and, for each locus, selects from potentially many competing models, the one whichmore » most closely resembles the orthologous genes from other genomes. This procedure is iterated until no further change in gene models is observed. For Aspergillus genomes we predicted in total 4503 new gene models ( ~;;2percent per genome), supported by comparative analysis, additionally correcting ~;;18percent of old gene models. This resulted in a total of 4065 more genes with annotated PFAM domains (~;;3percent increase per genome). Analysis of a few genomes with EST/transcriptomics data shows that the new annotation sets also have a higher number of EST-supported splice sites at exon-intron boundaries.« less

  17. Improved systematic tRNA gene annotation allows new insights into the evolution of mitochondrial tRNA structures and into the mechanisms of mitochondrial genome rearrangements

    PubMed Central

    Jühling, Frank; Pütz, Joern; Bernt, Matthias; Donath, Alexander; Middendorf, Martin; Florentz, Catherine; Stadler, Peter F.

    2012-01-01

    Transfer RNAs (tRNAs) are present in all types of cells as well as in organelles. tRNAs of animal mitochondria show a low level of primary sequence conservation and exhibit ‘bizarre’ secondary structures, lacking complete domains of the common cloverleaf. Such sequences are hard to detect and hence frequently missed in computational analyses and mitochondrial genome annotation. Here, we introduce an automatic annotation procedure for mitochondrial tRNA genes in Metazoa based on sequence and structural information in manually curated covariance models. The method, applied to re-annotate 1876 available metazoan mitochondrial RefSeq genomes, allows to distinguish between remaining functional genes and degrading ‘pseudogenes’, even at early stages of divergence. The subsequent analysis of a comprehensive set of mitochondrial tRNA genes gives new insights into the evolution of structures of mitochondrial tRNA sequences as well as into the mechanisms of genome rearrangements. We find frequent losses of tRNA genes concentrated in basal Metazoa, frequent independent losses of individual parts of tRNA genes, particularly in Arthropoda, and wide-spread conserved overlaps of tRNAs in opposite reading direction. Direct evidence for several recent Tandem Duplication-Random Loss events is gained, demonstrating that this mechanism has an impact on the appearance of new mitochondrial gene orders. PMID:22139921

  18. Bioinformatics for spermatogenesis: annotation of male reproduction based on proteomics

    PubMed Central

    Zhou, Tao; Zhou, Zuo-Min; Guo, Xue-Jiang

    2013-01-01

    Proteomics strategies have been widely used in the field of male reproduction, both in basic and clinical research. Bioinformatics methods are indispensable in proteomics-based studies and are used for data presentation, database construction and functional annotation. In the present review, we focus on the functional annotation of gene lists obtained through qualitative or quantitative methods, summarizing the common and male reproduction specialized proteomics databases. We introduce several integrated tools used to find the hidden biological significance from the data obtained. We further describe in detail the information on male reproduction derived from Gene Ontology analyses, pathway analyses and biomedical analyses. We provide an overview of bioinformatics annotations in spermatogenesis, from gene function to biological function and from biological function to clinical application. On the basis of recently published proteomics studies and associated data, we show that bioinformatics methods help us to discover drug targets for sperm motility and to scan for cancer-testis genes. In addition, we summarize the online resources relevant to male reproduction research for the exploration of the regulation of spermatogenesis. PMID:23852026

  19. PanCoreGen - Profiling, detecting, annotating protein-coding genes in microbial genomes.

    PubMed

    Paul, Sandip; Bhardwaj, Archana; Bag, Sumit K; Sokurenko, Evgeni V; Chattopadhyay, Sujay

    2015-12-01

    A large amount of genomic data, especially from multiple isolates of a single species, has opened new vistas for microbial genomics analysis. Analyzing the pan-genome (i.e. the sum of genetic repertoire) of microbial species is crucial in understanding the dynamics of molecular evolution, where virulence evolution is of major interest. Here we present PanCoreGen - a standalone application for pan- and core-genomic profiling of microbial protein-coding genes. PanCoreGen overcomes key limitations of the existing pan-genomic analysis tools, and develops an integrated annotation-structure for a species-specific pan-genomic profile. It provides important new features for annotating draft genomes/contigs and detecting unidentified genes in annotated genomes. It also generates user-defined group-specific datasets within the pan-genome. Interestingly, analyzing an example-set of Salmonella genomes, we detect potential footprints of adaptive convergence of horizontally transferred genes in two human-restricted pathogenic serovars - Typhi and Paratyphi A. Overall, PanCoreGen represents a state-of-the-art tool for microbial phylogenomics and pathogenomics study. Copyright © 2015 Elsevier Inc. All rights reserved.

  20. FusionHub: A unified web platform for annotation and visualization of gene fusion events in human cancer.

    PubMed

    Panigrahi, Priyabrata; Jere, Abhay; Anamika, Krishanpal

    2018-01-01

    Gene fusion is a chromosomal rearrangement event which plays a significant role in cancer due to the oncogenic potential of the chimeric protein generated through fusions. At present many databases are available in public domain which provides detailed information about known gene fusion events and their functional role. Existing gene fusion detection tools, based on analysis of transcriptomics data usually report a large number of fusion genes as potential candidates, which could be either known or novel or false positives. Manual annotation of these putative genes is indeed time-consuming. We have developed a web platform FusionHub, which acts as integrated search engine interfacing various fusion gene databases and simplifies large scale annotation of fusion genes in a seamless way. In addition, FusionHub provides three ways of visualizing fusion events: circular view, domain architecture view and network view. Design of potential siRNA molecules through ensemble method is another utility integrated in FusionHub that could aid in siRNA-based targeted therapy. FusionHub is freely available at https://fusionhub.persistent.co.in.

  1. The GermOnline cross-species systems browser provides comprehensive information on genes and gene products relevant for sexual reproduction.

    PubMed

    Gattiker, Alexandre; Niederhauser-Wiederkehr, Christa; Moore, James; Hermida, Leandro; Primig, Michael

    2007-01-01

    We report a novel release of the GermOnline knowledgebase covering genes relevant for the cell cycle, gametogenesis and fertility. GermOnline was extended into a cross-species systems browser including information on DNA sequence annotation, gene expression and the function of gene products. The database covers eight model organisms and Homo sapiens, for which complete genome annotation data are available. The database is now built around a sophisticated genome browser (Ensembl), our own microarray information management and annotation system (MIMAS) used to extensively describe experimental data obtained with high-density oligonucleotide microarrays (GeneChips) and a comprehensive system for online editing of database entries (MediaWiki). The RNA data include results from classical microarrays as well as tiling arrays that yield information on RNA expression levels, transcript start sites and lengths as well as exon composition. Members of the research community are solicited to help GermOnline curators keep database entries on genes and gene products complete and accurate. The database is accessible at http://www.germonline.org/.

  2. MIPS Arabidopsis thaliana Database (MAtDB): an integrated biological knowledge resource for plant genomics

    PubMed Central

    Schoof, Heiko; Ernst, Rebecca; Nazarov, Vladimir; Pfeifer, Lukas; Mewes, Hans-Werner; Mayer, Klaus F. X.

    2004-01-01

    Arabidopsis thaliana is the most widely studied model plant. Functional genomics is intensively underway in many laboratories worldwide. Beyond the basic annotation of the primary sequence data, the annotated genetic elements of Arabidopsis must be linked to diverse biological data and higher order information such as metabolic or regulatory pathways. The MIPS Arabidopsis thaliana database MAtDB aims to provide a comprehensive resource for Arabidopsis as a genome model that serves as a primary reference for research in plants and is suitable for transfer of knowledge to other plants, especially crops. The genome sequence as a common backbone serves as a scaffold for the integration of data, while, in a complementary effort, these data are enhanced through the application of state-of-the-art bioinformatics tools. This information is visualized on a genome-wide and a gene-by-gene basis with access both for web users and applications. This report updates the information given in a previous report and provides an outlook on further developments. The MAtDB web interface can be accessed at http://mips.gsf.de/proj/thal/db. PMID:14681437

  3. EuroPineDB: a high-coverage web database for maritime pine transcriptome

    PubMed Central

    2011-01-01

    Background Pinus pinaster is an economically and ecologically important species that is becoming a woody gymnosperm model. Its enormous genome size makes whole-genome sequencing approaches are hard to apply. Therefore, the expressed portion of the genome has to be characterised and the results and annotations have to be stored in dedicated databases. Description EuroPineDB is the largest sequence collection available for a single pine species, Pinus pinaster (maritime pine), since it comprises 951 641 raw sequence reads obtained from non-normalised cDNA libraries and high-throughput sequencing from adult (xylem, phloem, roots, stem, needles, cones, strobili) and embryonic (germinated embryos, buds, callus) maritime pine tissues. Using open-source tools, sequences were optimally pre-processed, assembled, and extensively annotated (GO, EC and KEGG terms, descriptions, SNPs, SSRs, ORFs and InterPro codes). As a result, a 10.5× P. pinaster genome was covered and assembled in 55 322 UniGenes. A total of 32 919 (59.5%) of P. pinaster UniGenes were annotated with at least one description, revealing at least 18 466 different genes. The complete database, which is designed to be scalable, maintainable, and expandable, is freely available at: http://www.scbi.uma.es/pindb/. It can be retrieved by gene libraries, pine species, annotations, UniGenes and microarrays (i.e., the sequences are distributed in two-colour microarrays; this is the only conifer database that provides this information) and will be periodically updated. Small assemblies can be viewed using a dedicated visualisation tool that connects them with SNPs. Any sequence or annotation set shown on-screen can be downloaded. Retrieval mechanisms for sequences and gene annotations are provided. Conclusions The EuroPineDB with its integrated information can be used to reveal new knowledge, offers an easy-to-use collection of information to directly support experimental work (including microarray hybridisation), and provides deeper knowledge on the maritime pine transcriptome. PMID:21762488

  4. Mining a database of single amplified genomes from Red Sea brine pool extremophiles—improving reliability of gene function prediction using a profile and pattern matching algorithm (PPMA)

    PubMed Central

    Grötzinger, Stefan W.; Alam, Intikhab; Ba Alawi, Wail; Bajic, Vladimir B.; Stingl, Ulrich; Eppinger, Jörg

    2014-01-01

    Reliable functional annotation of genomic data is the key-step in the discovery of novel enzymes. Intrinsic sequencing data quality problems of single amplified genomes (SAGs) and poor homology of novel extremophile's genomes pose significant challenges for the attribution of functions to the coding sequences identified. The anoxic deep-sea brine pools of the Red Sea are a promising source of novel enzymes with unique evolutionary adaptation. Sequencing data from Red Sea brine pool cultures and SAGs are annotated and stored in the Integrated Data Warehouse of Microbial Genomes (INDIGO) data warehouse. Low sequence homology of annotated genes (no similarity for 35% of these genes) may translate into false positives when searching for specific functions. The Profile and Pattern Matching (PPM) strategy described here was developed to eliminate false positive annotations of enzyme function before progressing to labor-intensive hyper-saline gene expression and characterization. It utilizes InterPro-derived Gene Ontology (GO)-terms (which represent enzyme function profiles) and annotated relevant PROSITE IDs (which are linked to an amino acid consensus pattern). The PPM algorithm was tested on 15 protein families, which were selected based on scientific and commercial potential. An initial list of 2577 enzyme commission (E.C.) numbers was translated into 171 GO-terms and 49 consensus patterns. A subset of INDIGO-sequences consisting of 58 SAGs from six different taxons of bacteria and archaea were selected from six different brine pool environments. Those SAGs code for 74,516 genes, which were independently scanned for the GO-terms (profile filter) and PROSITE IDs (pattern filter). Following stringent reliability filtering, the non-redundant hits (106 profile hits and 147 pattern hits) are classified as reliable, if at least two relevant descriptors (GO-terms and/or consensus patterns) are present. Scripts for annotation, as well as for the PPM algorithm, are available through the INDIGO website. PMID:24778629

  5. PANDA: pathway and annotation explorer for visualizing and interpreting gene-centric data.

    PubMed

    Hart, Steven N; Moore, Raymond M; Zimmermann, Michael T; Oliver, Gavin R; Egan, Jan B; Bryce, Alan H; Kocher, Jean-Pierre A

    2015-01-01

    Objective. Bringing together genomics, transcriptomics, proteomics, and other -omics technologies is an important step towards developing highly personalized medicine. However, instrumentation has advances far beyond expectations and now we are able to generate data faster than it can be interpreted. Materials and Methods. We have developed PANDA (Pathway AND Annotation) Explorer, a visualization tool that integrates gene-level annotation in the context of biological pathways to help interpret complex data from disparate sources. PANDA is a web-based application that displays data in the context of well-studied pathways like KEGG, BioCarta, and PharmGKB. PANDA represents data/annotations as icons in the graph while maintaining the other data elements (i.e., other columns for the table of annotations). Custom pathways from underrepresented diseases can be imported when existing data sources are inadequate. PANDA also allows sharing annotations among collaborators. Results. In our first use case, we show how easy it is to view supplemental data from a manuscript in the context of a user's own data. Another use-case is provided describing how PANDA was leveraged to design a treatment strategy from the somatic variants found in the tumor of a patient with metastatic sarcomatoid renal cell carcinoma. Conclusion. PANDA facilitates the interpretation of gene-centric annotations by visually integrating this information with context of biological pathways. The application can be downloaded or used directly from our website: http://bioinformaticstools.mayo.edu/research/panda-viewer/.

  6. Fault tolerance in protein interaction networks: stable bipartite subgraphs and redundant pathways.

    PubMed

    Brady, Arthur; Maxwell, Kyle; Daniels, Noah; Cowen, Lenore J

    2009-01-01

    As increasing amounts of high-throughput data for the yeast interactome become available, more system-wide properties are uncovered. One interesting question concerns the fault tolerance of protein interaction networks: whether there exist alternative pathways that can perform some required function if a gene essential to the main mechanism is defective, absent or suppressed. A signature pattern for redundant pathways is the BPM (between-pathway model) motif, introduced by Kelley and Ideker. Past methods proposed to search the yeast interactome for BPM motifs have had several important limitations. First, they have been driven heuristically by local greedy searches, which can lead to the inclusion of extra genes that may not belong in the motif; second, they have been validated solely by functional coherence of the putative pathways using GO enrichment, making it difficult to evaluate putative BPMs in the absence of already known biological annotation. We introduce stable bipartite subgraphs, and show they form a clean and efficient way of generating meaningful BPMs which naturally discard extra genes included by local greedy methods. We show by GO enrichment measures that our BPM set outperforms previous work, covering more known complexes and functional pathways. Perhaps most importantly, since our BPMs are initially generated by examining the genetic-interaction network only, the location of edges in the protein-protein physical interaction network can then be used to statistically validate each candidate BPM, even with sparse GO annotation (or none at all). We uncover some interesting biological examples of previously unknown putative redundant pathways in such areas as vesicle-mediated transport and DNA repair.

  7. Fault Tolerance in Protein Interaction Networks: Stable Bipartite Subgraphs and Redundant Pathways

    PubMed Central

    Brady, Arthur; Maxwell, Kyle; Daniels, Noah; Cowen, Lenore J.

    2009-01-01

    As increasing amounts of high-throughput data for the yeast interactome become available, more system-wide properties are uncovered. One interesting question concerns the fault tolerance of protein interaction networks: whether there exist alternative pathways that can perform some required function if a gene essential to the main mechanism is defective, absent or suppressed. A signature pattern for redundant pathways is the BPM (between-pathway model) motif, introduced by Kelley and Ideker. Past methods proposed to search the yeast interactome for BPM motifs have had several important limitations. First, they have been driven heuristically by local greedy searches, which can lead to the inclusion of extra genes that may not belong in the motif; second, they have been validated solely by functional coherence of the putative pathways using GO enrichment, making it difficult to evaluate putative BPMs in the absence of already known biological annotation. We introduce stable bipartite subgraphs, and show they form a clean and efficient way of generating meaningful BPMs which naturally discard extra genes included by local greedy methods. We show by GO enrichment measures that our BPM set outperforms previous work, covering more known complexes and functional pathways. Perhaps most importantly, since our BPMs are initially generated by examining the genetic-interaction network only, the location of edges in the protein-protein physical interaction network can then be used to statistically validate each candidate BPM, even with sparse GO annotation (or none at all). We uncover some interesting biological examples of previously unknown putative redundant pathways in such areas as vesicle-mediated transport and DNA repair. PMID:19399174

  8. Premature terminator analysis sheds light on a hidden world of bacterial transcriptional attenuation.

    PubMed

    Naville, Magali; Gautheret, Daniel

    2010-01-01

    Bacterial transcription attenuation occurs through a variety of cis-regulatory elements that control gene expression in response to a wide range of signals. The signal-sensing structures in attenuators are so diverse and rapidly evolving that only a small fraction have been properly annotated and characterized to date. Here we apply a broad-spectrum detection tool in order to achieve a more complete view of the transcriptional attenuation complement of key bacterial species. Our protocol seeks gene families with an unusual frequency of 5' terminators found across multiple species. Many of the detected attenuators are part of annotated elements, such as riboswitches or T-boxes, which often operate through transcriptional attenuation. However, a significant fraction of candidates were not previously characterized in spite of their unmistakable footprint. We further characterized some of these new elements using sequence and secondary structure analysis. We also present elements that may control the expression of several non-homologous genes, suggesting co-transcription and response to common signals. An important class of such elements, which we called mobile attenuators, is provided by 3' terminators of insertion sequences or prophages that may be exapted as 5' regulators when inserted directly upstream of a cellular gene. We show here that attenuators involve a complex landscape of signal-detection structures spanning the entire bacterial domain. We discuss possible scenarios through which these diverse 5' regulatory structures may arise or evolve.

  9. Draft genome of the American Eel (Anguilla rostrata).

    PubMed

    Pavey, Scott A; Laporte, Martin; Normandeau, Eric; Gaudin, Jérémy; Letourneau, Louis; Boisvert, Sébastien; Corbeil, Jacques; Audet, Céline; Bernatchez, Louis

    2017-07-01

    Freshwater eels (Anguilla sp.) have large economic, cultural, ecological and aesthetic importance worldwide, but they suffered more than 90% decline in global stocks over the past few decades. Proper genetic resources, such as sequenced, assembled and annotated genomes, are essential to help plan sustainable recoveries by identifying physiological, biochemical and genetic mechanisms that caused the declines or that may lead to recoveries. Here, we present the first sequenced genome of the American eel. This genome contained 305 043 contigs (N50 = 7397) and 79 209 scaffolds (N50 = 86 641) for a total size of 1.41 Gb, which is in the middle of the range of previous estimations for this species. In addition, protein-coding regions, including introns and flanking regions, are very well represented in the genome, as 95.2% of the 458 core eukaryotic genes and 98.8% of the 248 ultra-conserved subset were represented in the assembly and a total of 26 564 genes were annotated for future functional genomics studies. We performed a candidate gene analysis to compare three genes among all three freshwater eel species and, congruent with the phylogenetic relationships, Japanese eel (A. japanica) exhibited the most divergence. Overall, the sequenced genome presented in this study is a crucial addition to the presently available genetic tools to help guide future conservation efforts of freshwater eels. © 2016 John Wiley & Sons Ltd.

  10. The UCSC Genome Browser database: extensions and updates 2013.

    PubMed

    Meyer, Laurence R; Zweig, Ann S; Hinrichs, Angie S; Karolchik, Donna; Kuhn, Robert M; Wong, Matthew; Sloan, Cricket A; Rosenbloom, Kate R; Roe, Greg; Rhead, Brooke; Raney, Brian J; Pohl, Andy; Malladi, Venkat S; Li, Chin H; Lee, Brian T; Learned, Katrina; Kirkup, Vanessa; Hsu, Fan; Heitner, Steve; Harte, Rachel A; Haeussler, Maximilian; Guruvadoo, Luvina; Goldman, Mary; Giardine, Belinda M; Fujita, Pauline A; Dreszer, Timothy R; Diekhans, Mark; Cline, Melissa S; Clawson, Hiram; Barber, Galt P; Haussler, David; Kent, W James

    2013-01-01

    The University of California Santa Cruz (UCSC) Genome Browser (http://genome.ucsc.edu) offers online public access to a growing database of genomic sequence and annotations for a wide variety of organisms. The Browser is an integrated tool set for visualizing, comparing, analysing and sharing both publicly available and user-generated genomic datasets. As of September 2012, genomic sequence and a basic set of annotation 'tracks' are provided for 63 organisms, including 26 mammals, 13 non-mammal vertebrates, 3 invertebrate deuterostomes, 13 insects, 6 worms, yeast and sea hare. In the past year 19 new genome assemblies have been added, and we anticipate releasing another 28 in early 2013. Further, a large number of annotation tracks have been either added, updated by contributors or remapped to the latest human reference genome. Among these are an updated UCSC Genes track for human and mouse assemblies. We have also introduced several features to improve usability, including new navigation menus. This article provides an update to the UCSC Genome Browser database, which has been previously featured in the Database issue of this journal.

  11. An Updated Collection of Sequence Barcoded Temperature-Sensitive Alleles of Yeast Essential Genes

    PubMed Central

    Kofoed, Megan; Milbury, Karissa L.; Chiang, Jennifer H.; Sinha, Sunita; Ben-Aroya, Shay; Giaever, Guri; Nislow, Corey; Hieter, Philip; Stirling, Peter C.

    2015-01-01

    Systematic analyses of essential gene function using mutant collections in Saccharomyces cerevisiae have been conducted using collections of heterozygous diploids, promoter shut-off alleles, through alleles with destabilized mRNA, destabilized protein, or bearing mutations that lead to a temperature-sensitive (ts) phenotype. We previously described a method for construction of barcoded ts alleles in a systematic fashion. Here we report the completion of this collection of alleles covering 600 essential yeast genes. This resource covers a larger gene repertoire than previous collections and provides a complementary set of strains suitable for single gene and genomic analyses. We use deep sequencing to characterize the amino acid changes leading to the ts phenotype in half of the alleles. We also use high-throughput approaches to describe the relative ts behavior of the alleles. Finally, we demonstrate the experimental usefulness of the collection in a high-content, functional genomic screen for ts alleles that increase spontaneous P-body formation. By increasing the number of alleles and improving the annotation, this ts collection will serve as a community resource for probing new aspects of biology for essential yeast genes. PMID:26175450

  12. An Updated Collection of Sequence Barcoded Temperature-Sensitive Alleles of Yeast Essential Genes.

    PubMed

    Kofoed, Megan; Milbury, Karissa L; Chiang, Jennifer H; Sinha, Sunita; Ben-Aroya, Shay; Giaever, Guri; Nislow, Corey; Hieter, Philip; Stirling, Peter C

    2015-07-14

    Systematic analyses of essential gene function using mutant collections in Saccharomyces cerevisiae have been conducted using collections of heterozygous diploids, promoter shut-off alleles, through alleles with destabilized mRNA, destabilized protein, or bearing mutations that lead to a temperature-sensitive (ts) phenotype. We previously described a method for construction of barcoded ts alleles in a systematic fashion. Here we report the completion of this collection of alleles covering 600 essential yeast genes. This resource covers a larger gene repertoire than previous collections and provides a complementary set of strains suitable for single gene and genomic analyses. We use deep sequencing to characterize the amino acid changes leading to the ts phenotype in half of the alleles. We also use high-throughput approaches to describe the relative ts behavior of the alleles. Finally, we demonstrate the experimental usefulness of the collection in a high-content, functional genomic screen for ts alleles that increase spontaneous P-body formation. By increasing the number of alleles and improving the annotation, this ts collection will serve as a community resource for probing new aspects of biology for essential yeast genes. Copyright © 2015 Kofoed et al.

  13. High-throughput comparison, functional annotation, and metabolic modeling of plant genomes using the PlantSEED resource

    USDA-ARS?s Scientific Manuscript database

    The increasing number of sequenced plant genomes is placing new demands on the methods applied to analyze, annotate, and model these genomes. Today's annotation pipelines result in inconsistent gene assignments that complicate comparative analyses and prevent efficient construction of metabolic mode...

  14. Improving the measurement of semantic similarity by combining gene ontology and co-functional network: a random walk based approach.

    PubMed

    Peng, Jiajie; Zhang, Xuanshuo; Hui, Weiwei; Lu, Junya; Li, Qianqian; Liu, Shuhui; Shang, Xuequn

    2018-03-19

    Gene Ontology (GO) is one of the most popular bioinformatics resources. In the past decade, Gene Ontology-based gene semantic similarity has been effectively used to model gene-to-gene interactions in multiple research areas. However, most existing semantic similarity approaches rely only on GO annotations and structure, or incorporate only local interactions in the co-functional network. This may lead to inaccurate GO-based similarity resulting from the incomplete GO topology structure and gene annotations. We present NETSIM2, a new network-based method that allows researchers to measure GO-based gene functional similarities by considering the global structure of the co-functional network with a random walk with restart (RWR)-based method, and by selecting the significant term pairs to decrease the noise information. Based on the EC number (Enzyme Commission)-based groups of yeast and Arabidopsis, evaluation test shows that NETSIM2 can enhance the accuracy of Gene Ontology-based gene functional similarity. Using NETSIM2 as an example, we found that the accuracy of semantic similarities can be significantly improved after effectively incorporating the global gene-to-gene interactions in the co-functional network, especially on the species that gene annotations in GO are far from complete.

  15. Using hidden Markov models and observed evolution to annotate viral genomes.

    PubMed

    McCauley, Stephen; Hein, Jotun

    2006-06-01

    ssRNA (single stranded) viral genomes are generally constrained in length and utilize overlapping reading frames to maximally exploit the coding potential within the genome length restrictions. This overlapping coding phenomenon leads to complex evolutionary constraints operating on the genome. In regions which code for more than one protein, silent mutations in one reading frame generally have a protein coding effect in another. To maximize coding flexibility in all reading frames, overlapping regions are often compositionally biased towards amino acids which are 6-fold degenerate with respect to the 64 codon alphabet. Previous methodologies have used this fact in an ad hoc manner to look for overlapping genes by motif matching. In this paper differentiated nucleotide compositional patterns in overlapping regions are incorporated into a probabilistic hidden Markov model (HMM) framework which is used to annotate ssRNA viral genomes. This work focuses on single sequence annotation and applies an HMM framework to ssRNA viral annotation. A description of how the HMM is parameterized, whilst annotating within a missing data framework is given. A Phylogenetic HMM (Phylo-HMM) extension, as applied to 14 aligned HIV2 sequences is also presented. This evolutionary extension serves as an illustration of the potential of the Phylo-HMM framework for ssRNA viral genomic annotation. The single sequence annotation procedure (SSA) is applied to 14 different strains of the HIV2 virus. Further results on alternative ssRNA viral genomes are presented to illustrate more generally the performance of the method. The results of the SSA method are encouraging however there is still room for improvement, and since there is overwhelming evidence to indicate that comparative methods can improve coding sequence (CDS) annotation, the SSA method is extended to a Phylo-HMM to incorporate evolutionary information. The Phylo-HMM extension is applied to the same set of 14 HIV2 sequences which are pre-aligned. The performance improvement that results from including the evolutionary information in the analysis is illustrated.

  16. Computational annotation of genes differentially expressed along olive fruit development

    PubMed Central

    Galla, Giulio; Barcaccia, Gianni; Ramina, Angelo; Collani, Silvio; Alagna, Fiammetta; Baldoni, Luciana; Cultrera, Nicolò GM; Martinelli, Federico; Sebastiani, Luca; Tonutti, Pietro

    2009-01-01

    Background Olea europaea L. is a traditional tree crop of the Mediterranean basin with a worldwide economical high impact. Differently from other fruit tree species, little is known about the physiological and molecular basis of the olive fruit development and a few sequences of genes and gene products are available for olive in public databases. This study deals with the identification of large sets of differentially expressed genes in developing olive fruits and the subsequent computational annotation by means of different software. Results mRNA from fruits of the cv. Leccino sampled at three different stages [i.e., initial fruit set (stage 1), completed pit hardening (stage 2) and veraison (stage 3)] was used for the identification of differentially expressed genes putatively involved in main processes along fruit development. Four subtractive hybridization libraries were constructed: forward and reverse between stage 1 and 2 (libraries A and B), and 2 and 3 (libraries C and D). All sequenced clones (1,132 in total) were analyzed through BlastX against non-redundant NCBI databases and about 60% of them showed similarity to known proteins. A total of 89 out of 642 differentially expressed unique sequences was further investigated by Real-Time PCR, showing a validation of the SSH results as high as 69%. Library-specific cDNA repertories were annotated according to the three main vocabularies of the gene ontology (GO): cellular component, biological process and molecular function. BlastX analysis, GO terms mapping and annotation analysis were performed using the Blast2GO software, a research tool designed with the main purpose of enabling GO based data mining on sequence sets for which no GO annotation is yet available. Bioinformatic analysis pointed out a significantly different distribution of the annotated sequences for each GO category, when comparing the three fruit developmental stages. The olive fruit-specific transcriptome dataset was used to query all known KEGG (Kyoto Encyclopaedia of Genes and Genomes) metabolic pathways for characterizing and positioning retrieved EST records. The integration of the olive sequence datasets within the MapMan platform for microarray analysis allowed the identification of specific biosynthetic pathways useful for the definition of key functional categories in time course analyses for gene groups. Conclusion The bioinformatic annotation of all gene sequences was useful to shed light on metabolic pathways and transcriptional aspects related to carbohydrates, fatty acids, secondary metabolites, transcription factors and hormones as well as response to biotic and abiotic stresses throughout olive drupe development. These results represent a first step toward both functional genomics and systems biology research for understanding the gene functions and regulatory networks in olive fruit growth and ripening. PMID:19852839

  17. Ontology design patterns to disambiguate relations between genes and gene products in GENIA

    PubMed Central

    2011-01-01

    Motivation Annotated reference corpora play an important role in biomedical information extraction. A semantic annotation of the natural language texts in these reference corpora using formal ontologies is challenging due to the inherent ambiguity of natural language. The provision of formal definitions and axioms for semantic annotations offers the means for ensuring consistency as well as enables the development of verifiable annotation guidelines. Consistent semantic annotations facilitate the automatic discovery of new information through deductive inferences. Results We provide a formal characterization of the relations used in the recent GENIA corpus annotations. For this purpose, we both select existing axiom systems based on the desired properties of the relations within the domain and develop new axioms for several relations. To apply this ontology of relations to the semantic annotation of text corpora, we implement two ontology design patterns. In addition, we provide a software application to convert annotated GENIA abstracts into OWL ontologies by combining both the ontology of relations and the design patterns. As a result, the GENIA abstracts become available as OWL ontologies and are amenable for automated verification, deductive inferences and other knowledge-based applications. Availability Documentation, implementation and examples are available from http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/. PMID:22166341

  18. Exploratory Visual Analysis of Statistical Results from Microarray Experiments Comparing High and Low Grade Glioma

    PubMed Central

    Reif, David M.; Israel, Mark A.; Moore, Jason H.

    2007-01-01

    The biological interpretation of gene expression microarray results is a daunting challenge. For complex diseases such as cancer, wherein the body of published research is extensive, the incorporation of expert knowledge provides a useful analytical framework. We have previously developed the Exploratory Visual Analysis (EVA) software for exploring data analysis results in the context of annotation information about each gene, as well as biologically relevant groups of genes. We present EVA as a flexible combination of statistics and biological annotation that provides a straightforward visual interface for the interpretation of microarray analyses of gene expression in the most commonly occuring class of brain tumors, glioma. We demonstrate the utility of EVA for the biological interpretation of statistical results by analyzing publicly available gene expression profiles of two important glial tumors. The results of a statistical comparison between 21 malignant, high-grade glioblastoma multiforme (GBM) tumors and 19 indolent, low-grade pilocytic astrocytomas were analyzed using EVA. By using EVA to examine the results of a relatively simple statistical analysis, we were able to identify tumor class-specific gene expression patterns having both statistical and biological significance. Our interactive analysis highlighted the potential importance of genes involved in cell cycle progression, proliferation, signaling, adhesion, migration, motility, and structure, as well as candidate gene loci on a region of Chromosome 7 that has been implicated in glioma. Because EVA does not require statistical or computational expertise and has the flexibility to accommodate any type of statistical analysis, we anticipate EVA will prove a useful addition to the repertoire of computational methods used for microarray data analysis. EVA is available at no charge to academic users and can be found at http://www.epistasis.org. PMID:19390666

  19. Late Multiple Organ Surge in Interferon-Regulated Target Genes Characterizes Staphylococcal Enterotoxin B Lethality

    PubMed Central

    Ferreyra, Gabriela A.; Elinoff, Jason M.; Demirkale, Cumhur Y.; Starost, Matthew F.; Buckley, Marilyn; Munson, Peter J.; Krakauer, Teresa; Danner, Robert L.

    2014-01-01

    Background Bacterial superantigens are virulence factors that cause toxic shock syndrome. Here, the genome-wide, temporal response of mice to lethal intranasal staphylococcal enterotoxin B (SEB) challenge was investigated in six tissues. Results The earliest responses and largest number of affected genes occurred in peripheral blood mononuclear cells (PBMC), spleen, and lung tissues with the highest content of both T-cells and monocyte/macrophages, the direct cellular targets of SEB. In contrast, the response of liver, kidney, and heart was delayed and involved fewer genes, but revealed a dominant genetic program that was seen in all 6 tissues. Many of the 85 uniquely annotated transcripts participating in this shared genomic response have not been previously linked to SEB. Nine of the 85 genes were subsequently confirmed by RT-PCR in every tissue/organ at 24 h. These 85 transcripts, up-regulated in all tissues, annotated to the interferon (IFN)/antiviral-response and included genes belonging to the DNA/RNA sensing system, DNA damage repair, the immunoproteasome, and the ER/metabolic stress-response and apoptosis pathways. Overall, this shared program was identified as a type I and II interferon (IFN)-response and the promoters of these genes were highly enriched for IFN regulatory matrices. Several genes whose secreted products induce the IFN pathway were up-regulated at early time points in PBMCs, spleen, and/or lung. Furthermore, IFN regulatory factors including Irf1, Irf7 and Irf8, and Zbp1, a DNA sensor/transcription factor that can directly elicit an IFN innate immune response, participated in this host-wide SEB signature. Conclusion Global gene-expression changes across multiple organs implicated a host-wide IFN-response in SEB-induced death. Therapies aimed at IFN-associated innate immunity may improve outcome in toxic shock syndromes. PMID:24551153

  20. Profiling and Quantifying Differential Gene Transcription Provide Insights into Ganoderic Acid Biosynthesis in Ganoderma lucidum in Response to Methyl Jasmonate

    PubMed Central

    Shi, Liang; Mu, Da-Shuai; Jiang, Ai-Liang; Han, Qin; Zhao, Ming-Wen

    2013-01-01

    Ganoderma lucidum is a mushroom with traditional medicinal properties that has been widely used in China and other countries in Eastern Asia. Ganoderic acids (GA) produced by G. lucidum exhibit important pharmacological activities. Previous studies have demonstrated that methyl jasmonate (MeJA) is a potent inducer of GA biosynthesis and the expression of genes involved in the GA biosynthesis pathway in G. lucidum. To further explore the mechanism of GA biosynthesis, cDNA-Amplified Fragment Length Polymorphism (cDNA-AFLP) was used to identify genes that are differentially expressed in response to MeJA. Using 64 primer combinations, over 3910 transcriptionally derived fragments (TDFs) were obtained. Reliable sequence data were obtained for 390 of 458 selected TDFs. Ninety of these TDFs were annotated with known functions through BLASTX searching the GenBank database, and 12 annotated TDFs were assigned into secondary metabolic pathways by searching the KEGGPATHWAY database. Twenty-five TDFs were selected for qRT-PCR analysis to confirm the expression patterns observed with cDNA-AFLP. The qRT-PCR results were consistent with the altered patterns of gene expression revealed by the cDNA-AFLP technique. Additionally, the transcript levels of 10 genes were measured at the mycelium, primordia, and fruiting body developmental stages of G. lucidum. The greatest expression levels were reached during primordia for all of the genes except cytochrome b2 reached its highest expression level in the mycelium stage. This study not only identifies new candidate genes involved in the regulation of GA biosynthesis but also provides further insight into MeJA-induced gene expression and secondary metabolic response in G. lucidum. PMID:23762280

  1. Network topologies and convergent aetiologies arising from deletions and duplications observed in individuals with autism.

    PubMed

    Noh, Hyun Ji; Ponting, Chris P; Boulding, Hannah C; Meader, Stephen; Betancur, Catalina; Buxbaum, Joseph D; Pinto, Dalila; Marshall, Christian R; Lionel, Anath C; Scherer, Stephen W; Webber, Caleb

    2013-06-01

    Autism Spectrum Disorders (ASD) are highly heritable and characterised by impairments in social interaction and communication, and restricted and repetitive behaviours. Considering four sets of de novo copy number variants (CNVs) identified in 181 individuals with autism and exploiting mouse functional genomics and known protein-protein interactions, we identified a large and significantly interconnected interaction network. This network contains 187 genes affected by CNVs drawn from 45% of the patients we considered and 22 genes previously implicated in ASD, of which 192 form a single interconnected cluster. On average, those patients with copy number changed genes from this network possess changes in 3 network genes, suggesting that epistasis mediated through the network is extensive. Correspondingly, genes that are highly connected within the network, and thus whose copy number change is predicted by the network to be more phenotypically consequential, are significantly enriched among patients that possess only a single ASD-associated network copy number changed gene (p = 0.002). Strikingly, deleted or disrupted genes from the network are significantly enriched in GO-annotated positive regulators (2.3-fold enrichment, corrected p = 2×10(-5)), whereas duplicated genes are significantly enriched in GO-annotated negative regulators (2.2-fold enrichment, corrected p = 0.005). The direction of copy change is highly informative in the context of the network, providing the means through which perturbations arising from distinct deletions or duplications can yield a common outcome. These findings reveal an extensive ASD-associated molecular network, whose topology indicates ASD-relevant mutational deleteriousness and that mechanistically details how convergent aetiologies can result extensively from CNVs affecting pathways causally implicated in ASD.

  2. 'RetinoGenetics': a comprehensive mutation database for genes related to inherited retinal degeneration.

    PubMed

    Ran, Xia; Cai, Wei-Jun; Huang, Xiu-Feng; Liu, Qi; Lu, Fan; Qu, Jia; Wu, Jinyu; Jin, Zi-Bing

    2014-01-01

    Inherited retinal degeneration (IRD), a leading cause of human blindness worldwide, is exceptionally heterogeneous with clinical heterogeneity and genetic variety. During the past decades, tremendous efforts have been made to explore the complex heterogeneity, and massive mutations have been identified in different genes underlying IRD with the significant advancement of sequencing technology. In this study, we developed a comprehensive database, 'RetinoGenetics', which contains informative knowledge about all known IRD-related genes and mutations for IRD. 'RetinoGenetics' currently contains 4270 mutations in 186 genes, with detailed information associated with 164 phenotypes from 934 publications and various types of functional annotations. Then extensive annotations were performed to each gene using various resources, including Gene Ontology, KEGG pathways, protein-protein interaction, mutational annotations and gene-disease network. Furthermore, by using the search functions, convenient browsing ways and intuitive graphical displays, 'RetinoGenetics' could serve as a valuable resource for unveiling the genetic basis of IRD. Taken together, 'RetinoGenetics' is an integrative, informative and updatable resource for IRD-related genetic predispositions. Database URL: http://www.retinogenetics.org/. © The Author(s) 2014. Published by Oxford University Press.

  3. Exploiting proteomic data for genome annotation and gene model validation in Aspergillus niger

    PubMed Central

    Wright, James C; Sugden, Deana; Francis-McIntyre, Sue; Riba-Garcia, Isabel; Gaskell, Simon J; Grigoriev, Igor V; Baker, Scott E; Beynon, Robert J; Hubbard, Simon J

    2009-01-01

    Background Proteomic data is a potentially rich, but arguably unexploited, data source for genome annotation. Peptide identifications from tandem mass spectrometry provide prima facie evidence for gene predictions and can discriminate over a set of candidate gene models. Here we apply this to the recently sequenced Aspergillus niger fungal genome from the Joint Genome Institutes (JGI) and another predicted protein set from another A.niger sequence. Tandem mass spectra (MS/MS) were acquired from 1d gel electrophoresis bands and searched against all available gene models using Average Peptide Scoring (APS) and reverse database searching to produce confident identifications at an acceptable false discovery rate (FDR). Results 405 identified peptide sequences were mapped to 214 different A.niger genomic loci to which 4093 predicted gene models clustered, 2872 of which contained the mapped peptides. Interestingly, 13 (6%) of these loci either had no preferred predicted gene model or the genome annotators' chosen "best" model for that genomic locus was not found to be the most parsimonious match to the identified peptides. The peptides identified also boosted confidence in predicted gene structures spanning 54 introns from different gene models. Conclusion This work highlights the potential of integrating experimental proteomics data into genomic annotation pipelines much as expressed sequence tag (EST) data has been. A comparison of the published genome from another strain of A.niger sequenced by DSM showed that a number of the gene models or proteins with proteomics evidence did not occur in both genomes, further highlighting the utility of the method. PMID:19193216

  4. Using phylogenetically-informed annotation (PIA) to search for light-interacting genes in transcriptomes from non-model organisms.

    PubMed

    Speiser, Daniel I; Pankey, M Sabrina; Zaharoff, Alexander K; Battelle, Barbara A; Bracken-Grissom, Heather D; Breinholt, Jesse W; Bybee, Seth M; Cronin, Thomas W; Garm, Anders; Lindgren, Annie R; Patel, Nipam H; Porter, Megan L; Protas, Meredith E; Rivera, Ajna S; Serb, Jeanne M; Zigler, Kirk S; Crandall, Keith A; Oakley, Todd H

    2014-11-19

    Tools for high throughput sequencing and de novo assembly make the analysis of transcriptomes (i.e. the suite of genes expressed in a tissue) feasible for almost any organism. Yet a challenge for biologists is that it can be difficult to assign identities to gene sequences, especially from non-model organisms. Phylogenetic analyses are one useful method for assigning identities to these sequences, but such methods tend to be time-consuming because of the need to re-calculate trees for every gene of interest and each time a new data set is analyzed. In response, we employed existing tools for phylogenetic analysis to produce a computationally efficient, tree-based approach for annotating transcriptomes or new genomes that we term Phylogenetically-Informed Annotation (PIA), which places uncharacterized genes into pre-calculated phylogenies of gene families. We generated maximum likelihood trees for 109 genes from a Light Interaction Toolkit (LIT), a collection of genes that underlie the function or development of light-interacting structures in metazoans. To do so, we searched protein sequences predicted from 29 fully-sequenced genomes and built trees using tools for phylogenetic analysis in the Osiris package of Galaxy (an open-source workflow management system). Next, to rapidly annotate transcriptomes from organisms that lack sequenced genomes, we repurposed a maximum likelihood-based Evolutionary Placement Algorithm (implemented in RAxML) to place sequences of potential LIT genes on to our pre-calculated gene trees. Finally, we implemented PIA in Galaxy and used it to search for LIT genes in 28 newly-sequenced transcriptomes from the light-interacting tissues of a range of cephalopod mollusks, arthropods, and cubozoan cnidarians. Our new trees for LIT genes are available on the Bitbucket public repository ( http://bitbucket.org/osiris_phylogenetics/pia/ ) and we demonstrate PIA on a publicly-accessible web server ( http://galaxy-dev.cnsi.ucsb.edu/pia/ ). Our new trees for LIT genes will be a valuable resource for researchers studying the evolution of eyes or other light-interacting structures. We also introduce PIA, a high throughput method for using phylogenetic relationships to identify LIT genes in transcriptomes from non-model organisms. With simple modifications, our methods may be used to search for different sets of genes or to annotate data sets from taxa outside of Metazoa.

  5. Evidence-based gene models for structural and functional annotations of the oil palm genome.

    PubMed

    Chan, Kuang-Lim; Tatarinova, Tatiana V; Rosli, Rozana; Amiruddin, Nadzirah; Azizi, Norazah; Halim, Mohd Amin Ab; Sanusi, Nik Shazana Nik Mohd; Jayanthi, Nagappan; Ponomarenko, Petr; Triska, Martin; Solovyev, Victor; Firdaus-Raih, Mohd; Sambanthamurthi, Ravigadevi; Murphy, Denis; Low, Eng-Ti Leslie

    2017-09-08

    Oil palm is an important source of edible oil. The importance of the crop, as well as its long breeding cycle (10-12 years) has led to the sequencing of its genome in 2013 to pave the way for genomics-guided breeding. Nevertheless, the first set of gene predictions, although useful, had many fragmented genes. Classification and characterization of genes associated with traits of interest, such as those for fatty acid biosynthesis and disease resistance, were also limited. Lipid-, especially fatty acid (FA)-related genes are of particular interest for the oil palm as they specify oil yields and quality. This paper presents the characterization of the oil palm genome using different gene prediction methods and comparative genomics analysis, identification of FA biosynthesis and disease resistance genes, and the development of an annotation database and bioinformatics tools. Using two independent gene-prediction pipelines, Fgenesh++ and Seqping, 26,059 oil palm genes with transcriptome and RefSeq support were identified from the oil palm genome. These coding regions of the genome have a characteristic broad distribution of GC 3 (fraction of cytosine and guanine in the third position of a codon) with over half the GC 3 -rich genes (GC 3  ≥ 0.75286) being intronless. In comparison, only one-seventh of the oil palm genes identified are intronless. Using comparative genomics analysis, characterization of conserved domains and active sites, and expression analysis, 42 key genes involved in FA biosynthesis in oil palm were identified. For three of them, namely EgFABF, EgFABH and EgFAD3, segmental duplication events were detected. Our analysis also identified 210 candidate resistance genes in six classes, grouped by their protein domain structures. We present an accurate and comprehensive annotation of the oil palm genome, focusing on analysis of important categories of genes (GC 3 -rich and intronless), as well as those associated with important functions, such as FA biosynthesis and disease resistance. The study demonstrated the advantages of having an integrated approach to gene prediction and developed a computational framework for combining multiple genome annotations. These results, available in the oil palm annotation database ( http://palmxplore.mpob.gov.my ), will provide important resources for studies on the genomes of oil palm and related crops. This article was reviewed by Alexander Kel, Igor Rogozin, and Vladimir A. Kuznetsov.

  6. CpGAVAS, an integrated web server for the annotation, visualization, analysis, and GenBank submission of completely sequenced chloroplast genome sequences

    PubMed Central

    2012-01-01

    Background The complete sequences of chloroplast genomes provide wealthy information regarding the evolutionary history of species. With the advance of next-generation sequencing technology, the number of completely sequenced chloroplast genomes is expected to increase exponentially, powerful computational tools annotating the genome sequences are in urgent need. Results We have developed a web server CPGAVAS. The server accepts a complete chloroplast genome sequence as input. First, it predicts protein-coding and rRNA genes based on the identification and mapping of the most similar, full-length protein, cDNA and rRNA sequences by integrating results from Blastx, Blastn, protein2genome and est2genome programs. Second, tRNA genes and inverted repeats (IR) are identified using tRNAscan, ARAGORN and vmatch respectively. Third, it calculates the summary statistics for the annotated genome. Fourth, it generates a circular map ready for publication. Fifth, it can create a Sequin file for GenBank submission. Last, it allows the extractions of protein and mRNA sequences for given list of genes and species. The annotation results in GFF3 format can be edited using any compatible annotation editing tools. The edited annotations can then be uploaded to CPGAVAS for update and re-analyses repeatedly. Using known chloroplast genome sequences as test set, we show that CPGAVAS performs comparably to another application DOGMA, while having several superior functionalities. Conclusions CPGAVAS allows the semi-automatic and complete annotation of a chloroplast genome sequence, and the visualization, editing and analysis of the annotation results. It will become an indispensible tool for researchers studying chloroplast genomes. The software is freely accessible from http://www.herbalgenomics.org/cpgavas. PMID:23256920

  7. Manual Gene Ontology annotation workflow at the Mouse Genome Informatics Database.

    PubMed

    Drabkin, Harold J; Blake, Judith A

    2012-01-01

    The Mouse Genome Database, the Gene Expression Database and the Mouse Tumor Biology database are integrated components of the Mouse Genome Informatics (MGI) resource (http://www.informatics.jax.org). The MGI system presents both a consensus view and an experimental view of the knowledge concerning the genetics and genomics of the laboratory mouse. From genotype to phenotype, this information resource integrates information about genes, sequences, maps, expression analyses, alleles, strains and mutant phenotypes. Comparative mammalian data are also presented particularly in regards to the use of the mouse as a model for the investigation of molecular and genetic components of human diseases. These data are collected from literature curation as well as downloads of large datasets (SwissProt, LocusLink, etc.). MGI is one of the founding members of the Gene Ontology (GO) and uses the GO for functional annotation of genes. Here, we discuss the workflow associated with manual GO annotation at MGI, from literature collection to display of the annotations. Peer-reviewed literature is collected mostly from a set of journals available electronically. Selected articles are entered into a master bibliography and indexed to one of eight areas of interest such as 'GO' or 'homology' or 'phenotype'. Each article is then either indexed to a gene already contained in the database or funneled through a separate nomenclature database to add genes. The master bibliography and associated indexing provide information for various curator-reports such as 'papers selected for GO that refer to genes with NO GO annotation'. Once indexed, curators who have expertise in appropriate disciplines enter pertinent information. MGI makes use of several controlled vocabularies that ensure uniform data encoding, enable robust analysis and support the construction of complex queries. These vocabularies range from pick-lists to structured vocabularies such as the GO. All data associations are supported with statements of evidence as well as access to source publications.

  8. Identification and analysis of unitary pseudogenes: historic and contemporary gene losses in humans and other primates

    PubMed Central

    2010-01-01

    Background Unitary pseudogenes are a class of unprocessed pseudogenes without functioning counterparts in the genome. They constitute only a small fraction of annotated pseudogenes in the human genome. However, as they represent distinct functional losses over time, they shed light on the unique features of humans in primate evolution. Results We have developed a pipeline to detect human unitary pseudogenes through analyzing the global inventory of orthologs between the human genome and its mammalian relatives. We focus on gene losses along the human lineage after the divergence from rodents about 75 million years ago. In total, we identify 76 unitary pseudogenes, including previously annotated ones, and many novel ones. By comparing each of these to its functioning ortholog in other mammals, we can approximately date the creation of each unitary pseudogene (that is, the gene 'death date') and show that for our group of 76, the functional genes appear to be disabled at a fairly uniform rate throughout primate evolution - not all at once, correlated, for instance, with the 'Alu burst'. Furthermore, we identify 11 unitary pseudogenes that are polymorphic - that is, they have both nonfunctional and functional alleles currently segregating in the human population. Comparing them with their orthologs in other primates, we find that two of them are in fact pseudogenes in non-human primates, suggesting that they represent cases of a gene being resurrected in the human lineage. Conclusions This analysis of unitary pseudogenes provides insights into the evolutionary constraints faced by different organisms and the timescales of functional gene loss in humans. PMID:20210993

  9. The pathway not taken: understanding 'omics data in the perinatal context.

    PubMed

    Edlow, Andrea G; Slonim, Donna K; Wick, Heather C; Hui, Lisa; Bianchi, Diana W

    2015-07-01

    'Omics analysis of large datasets has an increasingly important role in perinatal research, but understanding gene expression analyses in the fetal context remains a challenge. We compared the interpretation provided by a widely used systems biology resource (ingenuity pathway analysis [IPA]) with that from gene set enrichment analysis (GSEA) with functional annotation curated specifically for the fetus (Developmental FunctionaL Annotation at Tufts [DFLAT]). Using amniotic fluid supernatant transcriptome datasets previously produced by our group, we analyzed 3 different developmental perturbations: aneuploidy (Trisomy 21 [T21]), hemodynamic (twin-twin transfusion syndrome [TTTS]), and metabolic (maternal obesity) vs sex- and gestational age-matched control subjects. Differentially expressed probe sets were identified with the use of paired t-tests with the Benjamini-Hochberg correction for multiple testing (P < .05). Functional analyses were performed with IPA and GSEA/DFLAT. Outputs were compared for biologic relevance to the fetus. Compared with control subjects, there were 414 significantly dysregulated probe sets in T21 fetuses, 2226 in TTTS recipient twins, and 470 in fetuses of obese women. Each analytic output was unique but complementary. For T21, both IPA and GSEA/DFLAT identified dysregulation of brain, cardiovascular, and integumentary system development. For TTTS, both analytic tools identified dysregulation of cell growth/proliferation, immune and inflammatory signaling, brain, and cardiovascular development. For maternal obesity, both tools identified dysregulation of immune and inflammatory signaling, brain and musculoskeletal development, and cell death. GSEA/DFLAT identified substantially more dysregulated biologic functions in fetuses of obese women (1203 vs 151). For all 3 datasets, GSEA/DFLAT provided more comprehensive information about brain development. IPA consistently provided more detailed annotation about cell death. IPA produced many dysregulated terms that pertained to cancer (14 in T21, 109 in TTTS, 26 in maternal obesity); GSEA/DFLAT did not. Interpretation of the fetal amniotic fluid supernatant transcriptome depends on the analytic program, which suggests that >1 resource should be used. Within IPA, physiologic cellular proliferation in the fetus produced many "false positive" annotations that pertained to cancer, which reflects its bias toward adult diseases. This study supports the use of gene annotation resources with a developmental focus, such as DFLAT, for 'omics studies in perinatal medicine. Copyright © 2015 Elsevier Inc. All rights reserved.

  10. A systematic analysis of genomic changes in Tg2576 mice.

    PubMed

    Tan, Lu; Wang, Xiong; Ni, Zhong-Fei; Zhu, Xiuming; Wu, Wei; Zhu, Ling-Qiang; Liu, Dan

    2013-06-01

    Alzheimer's disease (AD) is an age-related neurodegenerative disorder characterized by intelligence decline, behavioral disorders and cognitive disability. The purpose of this study was to investigate gene expression in AD, based on published microarray data on Tg2576 mice. Hierarchical Cluster Analysis and Gene Ontology were employed to group genes together on the basis of their product characteristics and annotation data. Genes with prominent alterations were clustered into apoptosis and axon guidance pathways. Based on our findings and those of previous studies, we propose that the mitochondria-mediated apoptotic pathway plays a crucial role in the neuronal loss and synaptic dysfunction associated with AD. Furthermore, based on the findings of Positional Gene Enrichment analysis and Gene Set Enrichment analysis, we propose that the regulation of transcription of AD genes may be an important pathogenic factor in this neurodegenerative disease. Our results highlight the importance of genes that could subsequently be examined for their potential as prognostic markers for AD.

  11. Systems genetics identifies a convergent gene network for cognition and neurodevelopmental disease.

    PubMed

    Johnson, Michael R; Shkura, Kirill; Langley, Sarah R; Delahaye-Duriez, Andree; Srivastava, Prashant; Hill, W David; Rackham, Owen J L; Davies, Gail; Harris, Sarah E; Moreno-Moral, Aida; Rotival, Maxime; Speed, Doug; Petrovski, Slavé; Katz, Anaïs; Hayward, Caroline; Porteous, David J; Smith, Blair H; Padmanabhan, Sandosh; Hocking, Lynne J; Starr, John M; Liewald, David C; Visconti, Alessia; Falchi, Mario; Bottolo, Leonardo; Rossetti, Tiziana; Danis, Bénédicte; Mazzuferi, Manuela; Foerch, Patrik; Grote, Alexander; Helmstaedter, Christoph; Becker, Albert J; Kaminski, Rafal M; Deary, Ian J; Petretto, Enrico

    2016-02-01

    Genetic determinants of cognition are poorly characterized, and their relationship to genes that confer risk for neurodevelopmental disease is unclear. Here we performed a systems-level analysis of genome-wide gene expression data to infer gene-regulatory networks conserved across species and brain regions. Two of these networks, M1 and M3, showed replicable enrichment for common genetic variants underlying healthy human cognitive abilities, including memory. Using exome sequence data from 6,871 trios, we found that M3 genes were also enriched for mutations ascertained from patients with neurodevelopmental disease generally, and intellectual disability and epileptic encephalopathy in particular. M3 consists of 150 genes whose expression is tightly developmentally regulated, but which are collectively poorly annotated for known functional pathways. These results illustrate how systems-level analyses can reveal previously unappreciated relationships between neurodevelopmental disease-associated genes in the developed human brain, and provide empirical support for a convergent gene-regulatory network influencing cognition and neurodevelopmental disease.

  12. Genome Neighborhood Network Reveals Insights into Enediyne Biosynthesis and Facilitates Prediction and Prioritization for Discovery

    PubMed Central

    Rudolf, Jeffrey D.; Yan, Xiaohui; Shen, Ben

    2015-01-01

    The enediynes are one of the most fascinating families of bacterial natural products given their unprecedented molecular architecture and extraordinary cytotoxicity. Enediynes are rare with only 11 structurally characterized members and four additional members isolated in their cycloaromatized form. Recent advances in DNA sequencing have resulted in an explosion of microbial genomes. A virtual survey of the GenBank and JGI genome databases revealed 87 enediyne biosynthetic gene clusters from 78 bacteria strains, implying enediynes are more common than previously thought. Here we report the construction and analysis of an enediyne genome neighborhood network (GNN) as a high-throughput approach to analyze secondary metabolite gene clusters. Analysis of the enediyne GNN facilitated rapid gene cluster annotation, revealed genetic trends in enediyne biosynthetic gene clusters resulting in a simple prediction scheme to determine 9- vs 10-membered enediyne gene clusters, and supported a genomic-based strain prioritization method for enediyne discovery. PMID:26318027

  13. Rj4, a Gene Controlling Nodulation Specificity in Soybeans, Encodes a Thaumatin-Like Protein But Not the One Previously Reported1

    PubMed Central

    Tang, Fang; Yang, Shengming; Liu, Jinge

    2016-01-01

    Rj4 is a dominant gene in soybeans (Glycine max) that restricts nodulation by many strains of Bradyrhizobium elkanii. The soybean-B. elkanii symbiosis has a low nitrogen-fixation efficiency, but B. elkanii strains are highly competitive for nodulation; thus, cultivars harboring an Rj4 allele are considered favorable. Cloning the Rj4 gene is the first step in understanding the molecular basis of Rj4-mediated nodulation restriction and facilitates the development of molecular tools for genetic improvement of nitrogen fixation in soybeans. We finely mapped the Rj4 locus within a small genomic region on soybean chromosome 1, and validated one of the candidate genes as Rj4 using both complementation tests and CRISPR/Cas9-based gene knockout experiments. We demonstrated that Rj4 encodes a thaumatin-like protein, for which a corresponding allele is not present in the surveyed rj4 genotypes, including the reference genome Williams 82. Our conclusion disagrees with the previous report that Rj4 is the Glyma.01G165800 gene (previously annotated as Glyma01g37060). Instead, we provide convincing evidence that Rj4 is Glyma.01g165800-D, a duplicated and unique version of Glyma.01g165800, that has evolved the ability to control symbiotic specificity. PMID:26582727

  14. De novo characterisation of the greenlip abalone transcriptome (Haliotis laevigata) with a focus on the heat shock protein 70 (HSP70) family.

    PubMed

    Shiel, Brett P; Hall, Nathan E; Cooke, Ira R; Robinson, Nicholas A; Strugnell, Jan M

    2015-02-01

    Abalone (Haliotis) are economically important molluscs for fisheries and aquaculture industries worldwide. Despite this, genomic resources for abalone and molluscs are still limited. Here we present a description and functional annotation of the greenlip abalone (Haliotis laevigata) transcriptome. We present a focused analysis on the heat shock protein 70 (HSP70) family of genes with putative functions affecting temperature stress and immunity. A total of ~38 million paired end Illumina reads were obtained, resulting in a Trinity assembly of 222,172 contigs with minimum length of 200 base pairs and maximum length of 33 kilobases. The 20,702 contigs were annotated with gene descriptions by BLAST. We created a program to maximise the number of functionally annotated genes, and over 10,000 contigs were assigned Gene ontologies (GO terms). By using CateGOrizer, immunity related GO terms for stressors such as heat, hypoxia, oxidative stress and wounding received the highest counts. Twenty-six contigs with homology to the HSP70 family of genes were identified. Ninety-one putative single-nucleotide polymorphisms were observed in the abalone HSP70 contigs. Eleven of these were considered non-synonymous. The annotated transcriptome described in this study will be a useful basis for future work investigating the genetic response of abalone to stress.

  15. Brain transcriptome sequencing and assembly of three songbird model systems for the study of social behavior

    PubMed Central

    Mukai, Motoko; Gonser, Rusty A.; Wingfield, John C.; London, Sarah E.; Tuttle, Elaina M.; Clayton, David F.

    2014-01-01

    Emberizid sparrows (emberizidae) have played a prominent role in the study of avian vocal communication and social behavior. We present here brain transcriptomes for three emberizid model systems, song sparrow Melospiza melodia, white-throated sparrow Zonotrichia albicollis, and Gambel’s white-crowned sparrow Zonotrichia leucophrys gambelii. Each of the assemblies covered fully or in part, over 89% of the previously annotated protein coding genes in the zebra finch Taeniopygia guttata, with 16,846, 15,805, and 16,646 unique BLAST hits in song, white-throated and white-crowned sparrows, respectively. As in previous studies, we find tissue of origin (auditory forebrain versus hypothalamus and whole brain) as an important determinant of overall expression profile. We also demonstrate the successful isolation of RNA and RNA-sequencing from post-mortem samples from building strikes and suggest that such an approach could be useful when traditional sampling opportunities are limited. These transcriptomes will be an important resource for the study of social behavior in birds and for data driven annotation of forthcoming whole genome sequences for these and other bird species. PMID:24883256

  16. Network analysis of oyster transcriptome revealed a cascade of cellular responses during recovery after heat shock.

    PubMed

    Zhang, Lingling; Hou, Rui; Su, Hailin; Hu, Xiaoli; Wang, Shi; Bao, Zhenmin

    2012-01-01

    Oysters, as a major group of marine bivalves, can tolerate a wide range of natural and anthropogenic stressors including heat stress. Recent studies have shown that oysters pretreated with heat shock can result in induced heat tolerance. A systematic study of cellular recovery from heat shock may provide insights into the mechanism of acquired thermal tolerance. In this study, we performed the first network analysis of oyster transcriptome by reanalyzing microarray data from a previous study. Network analysis revealed a cascade of cellular responses during oyster recovery after heat shock and identified responsive gene modules and key genes. Our study demonstrates the power of network analysis in a non-model organism with poor gene annotations, which can lead to new discoveries that go beyond the focus on individual genes.

  17. CuGene as a tool to view and explore genomic data

    NASA Astrophysics Data System (ADS)

    Haponiuk, Michał; Pawełkowicz, Magdalena; Przybecki, Zbigniew; Nowak, Robert M.

    2017-08-01

    Integrated CuGene is an easy-to-use, open-source, on-line tool that can be used to browse, analyze, and query genomic data and annotations. It places annotation tracks beneath genome coordinate positions, allowing rapid visual correlation of different types of information. It also allows users to upload and display their own experimental results or annotation sets. An important functionality of the application is a possibility to find similarity between sequences by applying four different algorithms of different accuracy. The presented tool was tested on real genomic data and is extensively used by Polish Consortium of Cucumber Genome Sequencing.

  18. FARO server: Meta-analysis of gene expression by matching gene expression signatures to a compendium of public gene expression data.

    PubMed

    Manijak, Mieszko P; Nielsen, Henrik B

    2011-06-11

    Although, systematic analysis of gene annotation is a powerful tool for interpreting gene expression data, it sometimes is blurred by incomplete gene annotation, missing expression response of key genes and secondary gene expression responses. These shortcomings may be partially circumvented by instead matching gene expression signatures to signatures of other experiments. To facilitate this we present the Functional Association Response by Overlap (FARO) server, that match input signatures to a compendium of 242 gene expression signatures, extracted from more than 1700 Arabidopsis microarray experiments. Hereby we present a publicly available tool for robust characterization of Arabidopsis gene expression experiments which can point to similar experimental factors in other experiments. The server is available at http://www.cbs.dtu.dk/services/faro/.

  19. Comprehensive Annotation of the Parastagonospora nodorum Reference Genome Using Next-Generation Genomics, Transcriptomics and Proteogenomics

    PubMed Central

    Dodhia, Kejal; Stoll, Thomas; Hastie, Marcus; Furuki, Eiko; Ellwood, Simon R.; Williams, Angela H.; Tan, Yew-Foon; Testa, Alison C.; Gorman, Jeffrey J.; Oliver, Richard P.

    2016-01-01

    Parastagonospora nodorum, the causal agent of Septoria nodorum blotch (SNB), is an economically important pathogen of wheat (Triticum spp.), and a model for the study of necrotrophic pathology and genome evolution. The reference P. nodorum strain SN15 was the first Dothideomycete with a published genome sequence, and has been used as the basis for comparison within and between species. Here we present an updated reference genome assembly with corrections of SNP and indel errors in the underlying genome assembly from deep resequencing data as well as extensive manual annotation of gene models using transcriptomic and proteomic sources of evidence (https://github.com/robsyme/Parastagonospora_nodorum_SN15). The updated assembly and annotation includes 8,366 genes with modified protein sequence and 866 new genes. This study shows the benefits of using a wide variety of experimental methods allied to expert curation to generate a reliable set of gene models. PMID:26840125

  20. AgBase: supporting functional modeling in agricultural organisms

    PubMed Central

    McCarthy, Fiona M.; Gresham, Cathy R.; Buza, Teresia J.; Chouvarine, Philippe; Pillai, Lakshmi R.; Kumar, Ranjit; Ozkan, Seval; Wang, Hui; Manda, Prashanti; Arick, Tony; Bridges, Susan M.; Burgess, Shane C.

    2011-01-01

    AgBase (http://www.agbase.msstate.edu/) provides resources to facilitate modeling of functional genomics data and structural and functional annotation of agriculturally important animal, plant, microbe and parasite genomes. The website is redesigned to improve accessibility and ease of use, including improved search capabilities. Expanded capabilities include new dedicated pages for horse, cat, dog, cotton, rice and soybean. We currently provide 590 240 Gene Ontology (GO) annotations to 105 454 gene products in 64 different species, including GO annotations linked to transcripts represented on agricultural microarrays. For many of these arrays, this provides the only functional annotation available. GO annotations are available for download and we provide comprehensive, species-specific GO annotation files for 18 different organisms. The tools available at AgBase have been expanded and several existing tools improved based upon user feedback. One of seven new tools available at AgBase, GOModeler, supports hypothesis testing from functional genomics data. We host several associated databases and provide genome browsers for three agricultural pathogens. Moreover, we provide comprehensive training resources (including worked examples and tutorials) via links to Educational Resources at the AgBase website. PMID:21075795

  1. A-MADMAN: Annotation-based microarray data meta-analysis tool

    PubMed Central

    Bisognin, Andrea; Coppe, Alessandro; Ferrari, Francesco; Risso, Davide; Romualdi, Chiara; Bicciato, Silvio; Bortoluzzi, Stefania

    2009-01-01

    Background Publicly available datasets of microarray gene expression signals represent an unprecedented opportunity for extracting genomic relevant information and validating biological hypotheses. However, the exploitation of this exceptionally rich mine of information is still hampered by the lack of appropriate computational tools, able to overcome the critical issues raised by meta-analysis. Results This work presents A-MADMAN, an open source web application which allows the retrieval, annotation, organization and meta-analysis of gene expression datasets obtained from Gene Expression Omnibus. A-MADMAN addresses and resolves several open issues in the meta-analysis of gene expression data. Conclusion A-MADMAN allows i) the batch retrieval from Gene Expression Omnibus and the local organization of raw data files and of any related meta-information, ii) the re-annotation of samples to fix incomplete, or otherwise inadequate, metadata and to create user-defined batches of data, iii) the integrative analysis of data obtained from different Affymetrix platforms through custom chip definition files and meta-normalization. Software and documentation are available on-line at . PMID:19563634

  2. Patome: a database server for biological sequence annotation and analysis in issued patents and published patent applications.

    PubMed

    Lee, Byungwook; Kim, Taehyung; Kim, Seon-Kyu; Lee, Kwang H; Lee, Doheon

    2007-01-01

    With the advent of automated and high-throughput techniques, the number of patent applications containing biological sequences has been increasing rapidly. However, they have attracted relatively little attention compared to other sequence resources. We have built a database server called Patome, which contains biological sequence data disclosed in patents and published applications, as well as their analysis information. The analysis is divided into two steps. The first is an annotation step in which the disclosed sequences were annotated with RefSeq database. The second is an association step where the sequences were linked to Entrez Gene, OMIM and GO databases, and their results were saved as a gene-patent table. From the analysis, we found that 55% of human genes were associated with patenting. The gene-patent table can be used to identify whether a particular gene or disease is related to patenting. Patome is available at http://www.patome.org/; the information is updated bimonthly.

  3. RRE: a tool for the extraction of non-coding regions surrounding annotated genes from genomic datasets.

    PubMed

    Lazzarato, F; Franceschinis, G; Botta, M; Cordero, F; Calogero, R A

    2004-11-01

    RRE allows the extraction of non-coding regions surrounding a coding sequence [i.e. gene upstream region, 5'-untranslated region (5'-UTR), introns, 3'-UTR, downstream region] from annotated genomic datasets available at NCBI. RRE parser and web-based interface are accessible at http://www.bioinformatica.unito.it/bioinformatics/rre/rre.html

  4. The Planteome database: an integrated resource for reference ontologies, plant genomics and phenomics

    PubMed Central

    Cooper, Laurel; Meier, Austin; Laporte, Marie-Angélique; Elser, Justin L; Mungall, Chris; Sinn, Brandon T; Cavaliere, Dario; Carbon, Seth; Dunn, Nathan A; Smith, Barry; Qu, Botong; Preece, Justin; Zhang, Eugene; Todorovic, Sinisa; Gkoutos, Georgios; Doonan, John H; Stevenson, Dennis W; Arnaud, Elizabeth

    2018-01-01

    Abstract The Planteome project (http://www.planteome.org) provides a suite of reference and species-specific ontologies for plants and annotations to genes and phenotypes. Ontologies serve as common standards for semantic integration of a large and growing corpus of plant genomics, phenomics and genetics data. The reference ontologies include the Plant Ontology, Plant Trait Ontology and the Plant Experimental Conditions Ontology developed by the Planteome project, along with the Gene Ontology, Chemical Entities of Biological Interest, Phenotype and Attribute Ontology, and others. The project also provides access to species-specific Crop Ontologies developed by various plant breeding and research communities from around the world. We provide integrated data on plant traits, phenotypes, and gene function and expression from 95 plant taxa, annotated with reference ontology terms. The Planteome project is developing a plant gene annotation platform; Planteome Noctua, to facilitate community engagement. All the Planteome ontologies are publicly available and are maintained at the Planteome GitHub site (https://github.com/Planteome) for sharing, tracking revisions and new requests. The annotated data are freely accessible from the ontology browser (http://browser.planteome.org/amigo) and our data repository. PMID:29186578

  5. Bacillus anthracis genome organization in light of whole transcriptome sequencing

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Martin, Jeffrey; Zhu, Wenhan; Passalacqua, Karla D.

    2010-03-22

    Emerging knowledge of whole prokaryotic transcriptomes could validate a number of theoretical concepts introduced in the early days of genomics. What are the rules connecting gene expression levels with sequence determinants such as quantitative scores of promoters and terminators? Are translation efficiency measures, e.g. codon adaptation index and RBS score related to gene expression? We used the whole transcriptome shotgun sequencing of a bacterial pathogen Bacillus anthracis to assess correlation of gene expression level with promoter, terminator and RBS scores, codon adaptation index, as well as with a new measure of gene translational efficiency, average translation speed. We compared computationalmore » predictions of operon topologies with the transcript borders inferred from RNA-Seq reads. Transcriptome mapping may also improve existing gene annotation. Upon assessment of accuracy of current annotation of protein-coding genes in the B. anthracis genome we have shown that the transcriptome data indicate existence of more than a hundred genes missing in the annotation though predicted by an ab initio gene finder. Interestingly, we observed that many pseudogenes possess not only a sequence with detectable coding potential but also promoters that maintain transcriptional activity.« less

  6. Towards long-read metagenomics: complete assembly of three novel genomes from bacteria dependent on a diazotrophic cyanobacterium in a freshwater lake co-culture.

    PubMed

    Driscoll, Connor B; Otten, Timothy G; Brown, Nathan M; Dreher, Theo W

    2017-01-01

    Here we report three complete bacterial genome assemblies from a PacBio shotgun metagenome of a co-culture from Upper Klamath Lake, OR. Genome annotations and culture conditions indicate these bacteria are dependent on carbon and nitrogen fixation from the cyanobacterium Aphanizomenon flos-aquae, whose genome was assembled to draft-quality . Due to their taxonomic novelty relative to previously sequenced bacteria, we have temporarily designated these bacteria as incertae sedis Hyphomonadaceae strain UKL13-1 (3,501,508 bp and 56.12% GC), incertae sedis Betaproteobacterium strain UKL13-2 (3,387,087 bp and 54.98% GC), and incertae sedis Bacteroidetes strain UKL13-3 (3,236,529 bp and 37.33% GC). Each genome consists of a single circular chromosome with no identified plasmids. When compared with binned Illumina assemblies of the same three genomes, there was ~7% discrepancy in total genome length. Gaps where Illumina assemblies broke were often due to repetitive elements. Within these missing sequences were essential genes and genes associated with a variety of functional categories. Annotated gene content reveals that both Proteobacteria are aerobic anoxygenic phototrophs, with Betaproteobacterium UKL13-2 potentially capable of phototrophic oxidation of sulfur compounds. Both proteobacterial genomes contain transporters suggesting they are scavenging fixed nitrogen from A. flos-aquae in the form of ammonium. Bacteroidetes UKL13-3 has few completely annotated biosynthetic pathways, and has a comparatively higher proportion of unannotated genes. The genomes were detected in only a few other freshwater metagenomes, suggesting that these bacteria are not ubiquitous in freshwater systems. Our results indicate that long-read sequencing is a viable method for sequencing dominant members from low-diversity microbial communities, and should be considered for environmental metagenomics when conditions meet these requirements.

  7. Visual Display of 5p-arm and 3p-arm miRNA Expression with a Mobile Application.

    PubMed

    Pan, Chao-Yu; Kuo, Wei-Ting; Chiu, Chien-Yuan; Lin, Wen-Chang

    2017-01-01

    MicroRNAs (miRNAs) play important roles in human cancers. In previous studies, we have demonstrated that both 5p-arm and 3p-arm of mature miRNAs could be expressed from the same precursor and we further interrogated the 5p-arm and 3p-arm miRNA expression with a comprehensive arm feature annotation list. To assist biologists to visualize the differential 5p-arm and 3p-arm miRNA expression patterns, we utilized a user-friendly mobile App to display. The Cancer Genome Atlas (TCGA) miRNA-Seq expression information. We have collected over 4,500 miRNA-Seq datasets from 15 TCGA cancer types and further processed them with the 5p-arm and 3p-arm annotation analysis pipeline. In order to be displayed with the RNA-Seq Viewer App, annotated 5p-arm and 3p-arm miRNA expression information and miRNA gene loci information were converted into SQLite tables. In this distinct application, for any given miRNA gene, 5p-arm miRNA is illustrated on the top of chromosome ideogram and 3p-arm miRNA is illustrated on the bottom of chromosome ideogram. Users can then easily interrogate the differentially 5p-arm/3p-arm expressed miRNAs with their mobile devices. This study demonstrates the feasibility and utility of RNA-Seq Viewer App in addition to mRNA-Seq data visualization.

  8. Finding the Subcellular Location of Barley, Wheat, Rice and Maize Proteins: The Compendium of Crop Proteins with Annotated Locations (cropPAL).

    PubMed

    Hooper, Cornelia M; Castleden, Ian R; Aryamanesh, Nader; Jacoby, Richard P; Millar, A Harvey

    2016-01-01

    Barley, wheat, rice and maize provide the bulk of human nutrition and have extensive industrial use as agricultural products. The genomes of these crops each contains >40,000 genes encoding proteins; however, the major genome databases for these species lack annotation information of protein subcellular location for >80% of these gene products. We address this gap, by constructing the compendium of crop protein subcellular locations called crop Proteins with Annotated Locations (cropPAL). Subcellular location is most commonly determined by fluorescent protein tagging of live cells or mass spectrometry detection in subcellular purifications, but can also be predicted from amino acid sequence or protein expression patterns. The cropPAL database collates 556 published studies, from >300 research institutes in >30 countries that have been previously published, as well as compiling eight pre-computed subcellular predictions for all Hordeum vulgare, Triticum aestivum, Oryza sativa and Zea mays protein sequences. The data collection including metadata for proteins and published studies can be accessed through a search portal http://crop-PAL.org. The subcellular localization information housed in cropPAL helps to depict plant cells as compartmentalized protein networks that can be investigated for improving crop yield and quality, and developing new biotechnological solutions to agricultural challenges. © The Author 2015. Published by Oxford University Press on behalf of Japanese Society of Plant Physiologists. All rights reserved. For permissions, please email: journals.permissions@oup.com.

  9. Scanning the Human Genome for Novel Therapeutic Targets for Breast Cancer

    DTIC Science & Technology

    2005-04-01

    colon cancer genome, in sum representing only 34 annotated genes (Figure 3A). Consistent with its role in the pathogenesis of human cancers ( Ruas and...high-confidence list includes two previously established tumor suppressors, p16INKaA and TGFI3RII (Derynck et al., 2001; Ruas and Peters, 1998; Siegel...cancer. Nat Rev Cancer 4, 118-132. Chong, J. A., Tapia- Ramirez , J., Kim, S., Toledo-Aral, J. J., Zheng, Y., Boutros, M. C.. Altshuller, Y. M., Frohman

  10. RGmatch: matching genomic regions to proximal genes in omics data integration.

    PubMed

    Furió-Tarí, Pedro; Conesa, Ana; Tarazona, Sonia

    2016-11-22

    The integrative analysis of multiple genomics data often requires that genome coordinates-based signals have to be associated with proximal genes. The relative location of a genomic region with respect to the gene (gene area) is important for functional data interpretation; hence algorithms that match regions to genes should be able to deliver insight into this information. In this work we review the tools that are publicly available for making region-to-gene associations. We also present a novel method, RGmatch, a flexible and easy-to-use Python tool that computes associations either at the gene, transcript, or exon level, applying a set of rules to annotate each region-gene association with the region location within the gene. RGmatch can be applied to any organism as long as genome annotation is available. Furthermore, we qualitatively and quantitatively compare RGmatch to other tools. RGmatch simplifies the association of a genomic region with its closest gene. At the same time, it is a powerful tool because the rules used to annotate these associations are very easy to modify according to the researcher's specific interests. Some important differences between RGmatch and other similar tools already in existence are RGmatch's flexibility, its wide range of user options, compatibility with any annotatable organism, and its comprehensive and user-friendly output.

  11. Integrating alternative splicing detection into gene prediction.

    PubMed

    Foissac, Sylvain; Schiex, Thomas

    2005-02-10

    Alternative splicing (AS) is now considered as a major actor in transcriptome/proteome diversity and it cannot be neglected in the annotation process of a new genome. Despite considerable progresses in term of accuracy in computational gene prediction, the ability to reliably predict AS variants when there is local experimental evidence of it remains an open challenge for gene finders. We have used a new integrative approach that allows to incorporate AS detection into ab initio gene prediction. This method relies on the analysis of genomically aligned transcript sequences (ESTs and/or cDNAs), and has been implemented in the dynamic programming algorithm of the graph-based gene finder EuGENE. Given a genomic sequence and a set of aligned transcripts, this new version identifies the set of transcripts carrying evidence of alternative splicing events, and provides, in addition to the classical optimal gene prediction, alternative optimal predictions (among those which are consistent with the AS events detected). This allows for multiple annotations of a single gene in a way such that each predicted variant is supported by a transcript evidence (but not necessarily with a full-length coverage). This automatic combination of experimental data analysis and ab initio gene finding offers an ideal integration of alternatively spliced gene prediction inside a single annotation pipeline.

  12. PhytoPath: an integrative resource for plant pathogen genomics.

    PubMed

    Pedro, Helder; Maheswari, Uma; Urban, Martin; Irvine, Alistair George; Cuzick, Alayne; McDowall, Mark D; Staines, Daniel M; Kulesha, Eugene; Hammond-Kosack, Kim Elizabeth; Kersey, Paul Julian

    2016-01-04

    PhytoPath (www.phytopathdb.org) is a resource for genomic and phenotypic data from plant pathogen species, that integrates phenotypic data for genes from PHI-base, an expertly curated catalog of genes with experimentally verified pathogenicity, with the Ensembl tools for data visualization and analysis. The resource is focused on fungi, protists (oomycetes) and bacterial plant pathogens that have genomes that have been sequenced and annotated. Genes with associated PHI-base data can be easily identified across all plant pathogen species using a BioMart-based query tool and visualized in their genomic context on the Ensembl genome browser. The PhytoPath resource contains data for 135 genomic sequences from 87 plant pathogen species, and 1364 genes curated for their role in pathogenicity and as targets for chemical intervention. Support for community annotation of gene models is provided using the WebApollo online gene editor, and we are working with interested communities to improve reference annotation for selected species. © The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research.

  13. APPRIS 2017: principal isoforms for multiple gene sets

    PubMed Central

    Rodriguez-Rivas, Juan; Di Domenico, Tomás; Vázquez, Jesús; Valencia, Alfonso

    2018-01-01

    Abstract The APPRIS database (http://appris-tools.org) uses protein structural and functional features and information from cross-species conservation to annotate splice isoforms in protein-coding genes. APPRIS selects a single protein isoform, the ‘principal’ isoform, as the reference for each gene based on these annotations. A single main splice isoform reflects the biological reality for most protein coding genes and APPRIS principal isoforms are the best predictors of these main proteins isoforms. Here, we present the updates to the database, new developments that include the addition of three new species (chimpanzee, Drosophila melangaster and Caenorhabditis elegans), the expansion of APPRIS to cover the RefSeq gene set and the UniProtKB proteome for six species and refinements in the core methods that make up the annotation pipeline. In addition APPRIS now provides a measure of reliability for individual principal isoforms and updates with each release of the GENCODE/Ensembl and RefSeq reference sets. The individual GENCODE/Ensembl, RefSeq and UniProtKB reference gene sets for six organisms have been merged to produce common sets of splice variants. PMID:29069475

  14. Mining functionally relevant gene sets for analyzing physiologically novel clinical expression data.

    PubMed

    Turcan, Sevin; Vetter, Douglas E; Maron, Jill L; Wei, Xintao; Slonim, Donna K

    2011-01-01

    Gene set analyses have become a standard approach for increasing the sensitivity of transcriptomic studies. However, analytical methods incorporating gene sets require the availability of pre-defined gene sets relevant to the underlying physiology being studied. For novel physiological problems, relevant gene sets may be unavailable or existing gene set databases may bias the results towards only the best-studied of the relevant biological processes. We describe a successful attempt to mine novel functional gene sets for translational projects where the underlying physiology is not necessarily well characterized in existing annotation databases. We choose targeted training data from public expression data repositories and define new criteria for selecting biclusters to serve as candidate gene sets. Many of the discovered gene sets show little or no enrichment for informative Gene Ontology terms or other functional annotation. However, we observe that such gene sets show coherent differential expression in new clinical test data sets, even if derived from different species, tissues, and disease states. We demonstrate the efficacy of this method on a human metabolic data set, where we discover novel, uncharacterized gene sets that are diagnostic of diabetes, and on additional data sets related to neuronal processes and human development. Our results suggest that our approach may be an efficient way to generate a collection of gene sets relevant to the analysis of data for novel clinical applications where existing functional annotation is relatively incomplete.

  15. NeuroTransDB: highly curated and structured transcriptomic metadata for neurodegenerative diseases.

    PubMed

    Bagewadi, Shweta; Adhikari, Subash; Dhrangadhariya, Anjani; Irin, Afroza Khanam; Ebeling, Christian; Namasivayam, Aishwarya Alex; Page, Matthew; Hofmann-Apitius, Martin; Senger, Philipp

    2015-01-01

    Neurodegenerative diseases are chronic debilitating conditions, characterized by progressive loss of neurons that represent a significant health care burden as the global elderly population continues to grow. Over the past decade, high-throughput technologies such as the Affymetrix GeneChip microarrays have provided new perspectives into the pathomechanisms underlying neurodegeneration. Public transcriptomic data repositories, namely Gene Expression Omnibus and curated ArrayExpress, enable researchers to conduct integrative meta-analysis; increasing the power to detect differentially regulated genes in disease and explore patterns of gene dysregulation across biologically related studies. The reliability of retrospective, large-scale integrative analyses depends on an appropriate combination of related datasets, in turn requiring detailed meta-annotations capturing the experimental setup. In most cases, we observe huge variation in compliance to defined standards for submitted metadata in public databases. Much of the information to complete, or refine meta-annotations are distributed in the associated publications. For example, tissue preparation or comorbidity information is frequently described in an article's supplementary tables. Several value-added databases have employed additional manual efforts to overcome this limitation. However, none of these databases explicate annotations that distinguish human and animal models in neurodegeneration context. Therefore, adopting a more specific disease focus, in combination with dedicated disease ontologies, will better empower the selection of comparable studies with refined annotations to address the research question at hand. In this article, we describe the detailed development of NeuroTransDB, a manually curated database containing metadata annotations for neurodegenerative studies. The database contains more than 20 dimensions of metadata annotations within 31 mouse, 5 rat and 45 human studies, defined in collaboration with domain disease experts. We elucidate the step-by-step guidelines used to critically prioritize studies from public archives and their metadata curation and discuss the key challenges encountered. Curated metadata for Alzheimer's disease gene expression studies are available for download. Database URL: www.scai.fraunhofer.de/NeuroTransDB.html. © The Author(s) 2015. Published by Oxford University Press.

  16. NeuroTransDB: highly curated and structured transcriptomic metadata for neurodegenerative diseases

    PubMed Central

    Bagewadi, Shweta; Adhikari, Subash; Dhrangadhariya, Anjani; Irin, Afroza Khanam; Ebeling, Christian; Namasivayam, Aishwarya Alex; Page, Matthew; Hofmann-Apitius, Martin

    2015-01-01

    Neurodegenerative diseases are chronic debilitating conditions, characterized by progressive loss of neurons that represent a significant health care burden as the global elderly population continues to grow. Over the past decade, high-throughput technologies such as the Affymetrix GeneChip microarrays have provided new perspectives into the pathomechanisms underlying neurodegeneration. Public transcriptomic data repositories, namely Gene Expression Omnibus and curated ArrayExpress, enable researchers to conduct integrative meta-analysis; increasing the power to detect differentially regulated genes in disease and explore patterns of gene dysregulation across biologically related studies. The reliability of retrospective, large-scale integrative analyses depends on an appropriate combination of related datasets, in turn requiring detailed meta-annotations capturing the experimental setup. In most cases, we observe huge variation in compliance to defined standards for submitted metadata in public databases. Much of the information to complete, or refine meta-annotations are distributed in the associated publications. For example, tissue preparation or comorbidity information is frequently described in an article’s supplementary tables. Several value-added databases have employed additional manual efforts to overcome this limitation. However, none of these databases explicate annotations that distinguish human and animal models in neurodegeneration context. Therefore, adopting a more specific disease focus, in combination with dedicated disease ontologies, will better empower the selection of comparable studies with refined annotations to address the research question at hand. In this article, we describe the detailed development of NeuroTransDB, a manually curated database containing metadata annotations for neurodegenerative studies. The database contains more than 20 dimensions of metadata annotations within 31 mouse, 5 rat and 45 human studies, defined in collaboration with domain disease experts. We elucidate the step-by-step guidelines used to critically prioritize studies from public archives and their metadata curation and discuss the key challenges encountered. Curated metadata for Alzheimer’s disease gene expression studies are available for download. Database URL: www.scai.fraunhofer.de/NeuroTransDB.html PMID:26475471

  17. ChIPpeakAnno: a Bioconductor package to annotate ChIP-seq and ChIP-chip data

    PubMed Central

    2010-01-01

    Background Chromatin immunoprecipitation (ChIP) followed by high-throughput sequencing (ChIP-seq) or ChIP followed by genome tiling array analysis (ChIP-chip) have become standard technologies for genome-wide identification of DNA-binding protein target sites. A number of algorithms have been developed in parallel that allow identification of binding sites from ChIP-seq or ChIP-chip datasets and subsequent visualization in the University of California Santa Cruz (UCSC) Genome Browser as custom annotation tracks. However, summarizing these tracks can be a daunting task, particularly if there are a large number of binding sites or the binding sites are distributed widely across the genome. Results We have developed ChIPpeakAnno as a Bioconductor package within the statistical programming environment R to facilitate batch annotation of enriched peaks identified from ChIP-seq, ChIP-chip, cap analysis of gene expression (CAGE) or any experiments resulting in a large number of enriched genomic regions. The binding sites annotated with ChIPpeakAnno can be viewed easily as a table, a pie chart or plotted in histogram form, i.e., the distribution of distances to the nearest genes for each set of peaks. In addition, we have implemented functionalities for determining the significance of overlap between replicates or binding sites among transcription factors within a complex, and for drawing Venn diagrams to visualize the extent of the overlap between replicates. Furthermore, the package includes functionalities to retrieve sequences flanking putative binding sites for PCR amplification, cloning, or motif discovery, and to identify Gene Ontology (GO) terms associated with adjacent genes. Conclusions ChIPpeakAnno enables batch annotation of the binding sites identified from ChIP-seq, ChIP-chip, CAGE or any technology that results in a large number of enriched genomic regions within the statistical programming environment R. Allowing users to pass their own annotation data such as a different Chromatin immunoprecipitation (ChIP) preparation and a dataset from literature, or existing annotation packages, such as GenomicFeatures and BSgenome, provides flexibility. Tight integration to the biomaRt package enables up-to-date annotation retrieval from the BioMart database. PMID:20459804

  18. A Factor Graph Approach to Automated GO Annotation

    PubMed Central

    Spetale, Flavio E.; Tapia, Elizabeth; Krsticevic, Flavia; Roda, Fernando; Bulacio, Pilar

    2016-01-01

    As volume of genomic data grows, computational methods become essential for providing a first glimpse onto gene annotations. Automated Gene Ontology (GO) annotation methods based on hierarchical ensemble classification techniques are particularly interesting when interpretability of annotation results is a main concern. In these methods, raw GO-term predictions computed by base binary classifiers are leveraged by checking the consistency of predefined GO relationships. Both formal leveraging strategies, with main focus on annotation precision, and heuristic alternatives, with main focus on scalability issues, have been described in literature. In this contribution, a factor graph approach to the hierarchical ensemble formulation of the automated GO annotation problem is presented. In this formal framework, a core factor graph is first built based on the GO structure and then enriched to take into account the noisy nature of GO-term predictions. Hence, starting from raw GO-term predictions, an iterative message passing algorithm between nodes of the factor graph is used to compute marginal probabilities of target GO-terms. Evaluations on Saccharomyces cerevisiae, Arabidopsis thaliana and Drosophila melanogaster protein sequences from the GO Molecular Function domain showed significant improvements over competing approaches, even when protein sequences were naively characterized by their physicochemical and secondary structure properties or when loose noisy annotation datasets were considered. Based on these promising results and using Arabidopsis thaliana annotation data, we extend our approach to the identification of most promising molecular function annotations for a set of proteins of unknown function in Solanum lycopersicum. PMID:26771463

  19. A Factor Graph Approach to Automated GO Annotation.

    PubMed

    Spetale, Flavio E; Tapia, Elizabeth; Krsticevic, Flavia; Roda, Fernando; Bulacio, Pilar

    2016-01-01

    As volume of genomic data grows, computational methods become essential for providing a first glimpse onto gene annotations. Automated Gene Ontology (GO) annotation methods based on hierarchical ensemble classification techniques are particularly interesting when interpretability of annotation results is a main concern. In these methods, raw GO-term predictions computed by base binary classifiers are leveraged by checking the consistency of predefined GO relationships. Both formal leveraging strategies, with main focus on annotation precision, and heuristic alternatives, with main focus on scalability issues, have been described in literature. In this contribution, a factor graph approach to the hierarchical ensemble formulation of the automated GO annotation problem is presented. In this formal framework, a core factor graph is first built based on the GO structure and then enriched to take into account the noisy nature of GO-term predictions. Hence, starting from raw GO-term predictions, an iterative message passing algorithm between nodes of the factor graph is used to compute marginal probabilities of target GO-terms. Evaluations on Saccharomyces cerevisiae, Arabidopsis thaliana and Drosophila melanogaster protein sequences from the GO Molecular Function domain showed significant improvements over competing approaches, even when protein sequences were naively characterized by their physicochemical and secondary structure properties or when loose noisy annotation datasets were considered. Based on these promising results and using Arabidopsis thaliana annotation data, we extend our approach to the identification of most promising molecular function annotations for a set of proteins of unknown function in Solanum lycopersicum.

  20. Non-Gaussian Distributions Affect Identification of Expression Patterns, Functional Annotation, and Prospective Classification in Human Cancer Genomes

    PubMed Central

    Marko, Nicholas F.; Weil, Robert J.

    2012-01-01

    Introduction Gene expression data is often assumed to be normally-distributed, but this assumption has not been tested rigorously. We investigate the distribution of expression data in human cancer genomes and study the implications of deviations from the normal distribution for translational molecular oncology research. Methods We conducted a central moments analysis of five cancer genomes and performed empiric distribution fitting to examine the true distribution of expression data both on the complete-experiment and on the individual-gene levels. We used a variety of parametric and nonparametric methods to test the effects of deviations from normality on gene calling, functional annotation, and prospective molecular classification using a sixth cancer genome. Results Central moments analyses reveal statistically-significant deviations from normality in all of the analyzed cancer genomes. We observe as much as 37% variability in gene calling, 39% variability in functional annotation, and 30% variability in prospective, molecular tumor subclassification associated with this effect. Conclusions Cancer gene expression profiles are not normally-distributed, either on the complete-experiment or on the individual-gene level. Instead, they exhibit complex, heavy-tailed distributions characterized by statistically-significant skewness and kurtosis. The non-Gaussian distribution of this data affects identification of differentially-expressed genes, functional annotation, and prospective molecular classification. These effects may be reduced in some circumstances, although not completely eliminated, by using nonparametric analytics. This analysis highlights two unreliable assumptions of translational cancer gene expression analysis: that “small” departures from normality in the expression data distributions are analytically-insignificant and that “robust” gene-calling algorithms can fully compensate for these effects. PMID:23118863

  1. Supporting community annotation and user collaboration in the integrated microbial genomes (IMG) system.

    PubMed

    Chen, I-Min A; Markowitz, Victor M; Palaniappan, Krishna; Szeto, Ernest; Chu, Ken; Huang, Jinghua; Ratner, Anna; Pillay, Manoj; Hadjithomas, Michalis; Huntemann, Marcel; Mikhailova, Natalia; Ovchinnikova, Galina; Ivanova, Natalia N; Kyrpides, Nikos C

    2016-04-26

    The exponential growth of genomic data from next generation technologies renders traditional manual expert curation effort unsustainable. Many genomic systems have included community annotation tools to address the problem. Most of these systems adopted a "Wiki-based" approach to take advantage of existing wiki technologies, but encountered obstacles in issues such as usability, authorship recognition, information reliability and incentive for community participation. Here, we present a different approach, relying on tightly integrated method rather than "Wiki-based" method, to support community annotation and user collaboration in the Integrated Microbial Genomes (IMG) system. The IMG approach allows users to use existing IMG data warehouse and analysis tools to add gene, pathway and biosynthetic cluster annotations, to analyze/reorganize contigs, genes and functions using workspace datasets, and to share private user annotations and workspace datasets with collaborators. We show that the annotation effort using IMG can be part of the research process to overcome the user incentive and authorship recognition problems thus fostering collaboration among domain experts. The usability and reliability issues are addressed by the integration of curated information and analysis tools in IMG, together with DOE Joint Genome Institute (JGI) expert review. By incorporating annotation operations into IMG, we provide an integrated environment for users to perform deeper and extended data analysis and annotation in a single system that can lead to publications and community knowledge sharing as shown in the case studies.

  2. Using deep RNA sequencing for the structural annotation of the laccaria bicolor mycorrhizal transcriptome.

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Larsen, P. E.; Trivedi, G.; Sreedasyam, A.

    2010-07-06

    Accurate structural annotation is important for prediction of function and required for in vitro approaches to characterize or validate the gene expression products. Despite significant efforts in the field, determination of the gene structure from genomic data alone is a challenging and inaccurate process. The ease of acquisition of transcriptomic sequence provides a direct route to identify expressed sequences and determine the correct gene structure. We developed methods to utilize RNA-seq data to correct errors in the structural annotation and extend the boundaries of current gene models using assembly approaches. The methods were validated with a transcriptomic data set derivedmore » from the fungus Laccaria bicolor, which develops a mycorrhizal symbiotic association with the roots of many tree species. Our analysis focused on the subset of 1501 gene models that are differentially expressed in the free living vs. mycorrhizal transcriptome and are expected to be important elements related to carbon metabolism, membrane permeability and transport, and intracellular signaling. Of the set of 1501 gene models, 1439 (96%) successfully generated modified gene models in which all error flags were successfully resolved and the sequences aligned to the genomic sequence. The remaining 4% (62 gene models) either had deviations from transcriptomic data that could not be spanned or generated sequence that did not align to genomic sequence. The outcome of this process is a set of high confidence gene models that can be reliably used for experimental characterization of protein function. 69% of expressed mycorrhizal JGI 'best' gene models deviated from the transcript sequence derived by this method. The transcriptomic sequence enabled correction of a majority of the structural inconsistencies and resulted in a set of validated models for 96% of the mycorrhizal genes. The method described here can be applied to improve gene structural annotation in other species, provided that there is a sequenced genome and a set of gene models.« less

  3. First Pass Annotation of Promoters on Human Chromosome 22

    PubMed Central

    Scherf, Matthias; Klingenhoff, Andreas; Frech, Kornelie; Quandt, Kerstin; Schneider, Ralf; Grote, Korbinian; Frisch, Matthias; Gailus-Durner, Valérie; Seidel, Alexander; Brack-Werner, Ruth; Werner, Thomas

    2001-01-01

    The publication of the first almost complete sequence of a human chromosome (chromosome 22) is a major milestone in human genomics. Together with the sequence, an excellent annotation of genes was published which certainly will serve as an information resource for numerous future projects. We noted that the annotation did not cover regulatory regions; in particular, no promoter annotation has been provided. Here we present an analysis of the complete published chromosome 22 sequence for promoters. A recent breakthrough in specific in silico prediction of promoter regions enabled us to attempt large-scale prediction of promoter regions on chromosome 22. Scanning of sequence databases revealed only 20 experimentally verified promoters, of which 10 were correctly predicted by our approach. Nearly 40% of our 465 predicted promoter regions are supported by the currently available gene annotation. Promoter finding also provides a biologically meaningful method for “chromosomal scaffolding”, by which long genomic sequences can be divided into segments starting with a gene. As one example, the combination of promoter region prediction with exon/intron structure predictions greatly enhances the specificity of de novo gene finding. The present study demonstrates that it is possible to identify promoters in silico on the chromosomal level with sufficient reliability for experimental planning and indicates that a wealth of information about regulatory regions can be extracted from current large-scale (megabase) sequencing projects. Results are available on-line at http://genomatix.gsf.de/chr22/. PMID:11230158

  4. Using microarrays to identify positional candidate genes for QTL: the case study of ACTH response in pigs.

    PubMed

    Jouffe, Vincent; Rowe, Suzanne; Liaubet, Laurence; Buitenhuis, Bart; Hornshøj, Henrik; SanCristobal, Magali; Mormède, Pierre; de Koning, D J

    2009-07-16

    Microarray studies can supplement QTL studies by suggesting potential candidate genes in the QTL regions, which by themselves are too large to provide a limited selection of candidate genes. Here we provide a case study where we explore ways to integrate QTL data and microarray data for the pig, which has only a partial genome sequence. We outline various procedures to localize differentially expressed genes on the pig genome and link this with information on published QTL. The starting point is a set of 237 differentially expressed cDNA clones in adrenal tissue from two pig breeds, before and after treatment with adrenocorticotropic hormone (ACTH). Different approaches to localize the differentially expressed (DE) genes to the pig genome showed different levels of success and a clear lack of concordance for some genes between the various approaches. For a focused analysis on 12 genes, overlapping QTL from the public domain were presented. Also, differentially expressed genes underlying QTL for ACTH response were described. Using the latest version of the draft sequence, the differentially expressed genes were mapped to the pig genome. This enabled co-location of DE genes and previously studied QTL regions, but the draft genome sequence is still incomplete and will contain many errors. A further step to explore links between DE genes and QTL at the pathway level was largely unsuccessful due to the lack of annotation of the pig genome. This could be improved by further comparative mapping analyses but this would be time consuming. This paper provides a case study for the integration of QTL data and microarray data for a species with limited genome sequence information and annotation. The results illustrate the challenges that must be addressed but also provide a roadmap for future work that is applicable to other non-model species.

  5. Lessons from non-canonical splicing

    PubMed Central

    Ule, Jernej

    2016-01-01

    Recent improvements in experimental and computational techniques used to study the transcriptome have enabled an unprecedented view of RNA processing, revealing many previously unknown non-canonical splicing events. This includes cryptic events located far from the currently annotated exons, and unconventional splicing mechanisms that have important roles in regulating gene expression. These non-canonical splicing events are a major source of newly emerging transcripts during evolution, especially when they involve sequences derived from transposable elements. They are therefore under precise regulation and quality control, which minimises their potential to disrupt gene expression. While non-canonical splicing can lead to aberrant transcripts that cause many diseases, we also explain how it can be exploited for new therapeutic strategies. PMID:27240813

  6. GFam: a platform for automatic annotation of gene families.

    PubMed

    Sasidharan, Rajkumar; Nepusz, Tamás; Swarbreck, David; Huala, Eva; Paccanaro, Alberto

    2012-10-01

    We have developed GFam, a platform for automatic annotation of gene/protein families. GFam provides a framework for genome initiatives and model organism resources to build domain-based families, derive meaningful functional labels and offers a seamless approach to propagate functional annotation across periodic genome updates. GFam is a hybrid approach that uses a greedy algorithm to chain component domains from InterPro annotation provided by its 12 member resources followed by a sequence-based connected component analysis of un-annotated sequence regions to derive consensus domain architecture for each sequence and subsequently generate families based on common architectures. Our integrated approach increases sequence coverage by 7.2 percentage points and residue coverage by 14.6 percentage points higher than the coverage relative to the best single-constituent database within InterPro for the proteome of Arabidopsis. The true power of GFam lies in maximizing annotation provided by the different InterPro data sources that offer resource-specific coverage for different regions of a sequence. GFam's capability to capture higher sequence and residue coverage can be useful for genome annotation, comparative genomics and functional studies. GFam is a general-purpose software and can be used for any collection of protein sequences. The software is open source and can be obtained from http://www.paccanarolab.org/software/gfam/.

  7. Consistent prediction of GO protein localization.

    PubMed

    Spetale, Flavio E; Arce, Debora; Krsticevic, Flavia; Bulacio, Pilar; Tapia, Elizabeth

    2018-05-17

    The GO-Cellular Component (GO-CC) ontology provides a controlled vocabulary for the consistent description of the subcellular compartments or macromolecular complexes where proteins may act. Current machine learning-based methods used for the automated GO-CC annotation of proteins suffer from the inconsistency of individual GO-CC term predictions. Here, we present FGGA-CC + , a class of hierarchical graph-based classifiers for the consistent GO-CC annotation of protein coding genes at the subcellular compartment or macromolecular complex levels. Aiming to boost the accuracy of GO-CC predictions, we make use of the protein localization knowledge in the GO-Biological Process (GO-BP) annotations to boost the accuracy of GO-CC prediction. As a result, FGGA-CC + classifiers are built from annotation data in both the GO-CC and GO-BP ontologies. Due to their graph-based design, FGGA-CC + classifiers are fully interpretable and their predictions amenable to expert analysis. Promising results on protein annotation data from five model organisms were obtained. Additionally, successful validation results in the annotation of a challenging subset of tandem duplicated genes in the tomato non-model organism were accomplished. Overall, these results suggest that FGGA-CC + classifiers can indeed be useful for satisfying the huge demand of GO-CC annotation arising from ubiquitous high throughout sequencing and proteomic projects.

  8. Interestingness measures and strategies for mining multi-ontology multi-level association rules from gene ontology annotations for the discovery of new GO relationships.

    PubMed

    Manda, Prashanti; McCarthy, Fiona; Bridges, Susan M

    2013-10-01

    The Gene Ontology (GO), a set of three sub-ontologies, is one of the most popular bio-ontologies used for describing gene product characteristics. GO annotation data containing terms from multiple sub-ontologies and at different levels in the ontologies is an important source of implicit relationships between terms from the three sub-ontologies. Data mining techniques such as association rule mining that are tailored to mine from multiple ontologies at multiple levels of abstraction are required for effective knowledge discovery from GO annotation data. We present a data mining approach, Multi-ontology data mining at All Levels (MOAL) that uses the structure and relationships of the GO to mine multi-ontology multi-level association rules. We introduce two interestingness measures: Multi-ontology Support (MOSupport) and Multi-ontology Confidence (MOConfidence) customized to evaluate multi-ontology multi-level association rules. We also describe a variety of post-processing strategies for pruning uninteresting rules. We use publicly available GO annotation data to demonstrate our methods with respect to two applications (1) the discovery of co-annotation suggestions and (2) the discovery of new cross-ontology relationships. Copyright © 2013 The Authors. Published by Elsevier Inc. All rights reserved.

  9. EcoGene 3.0

    PubMed Central

    Zhou, Jindan; Rudd, Kenneth E.

    2013-01-01

    EcoGene (http://ecogene.org) is a database and website devoted to continuously improving the structural and functional annotation of Escherichia coli K-12, one of the most well understood model organisms, represented by the MG1655(Seq) genome sequence and annotations. Major improvements to EcoGene in the past decade include (i) graphic presentations of genome map features; (ii) ability to design Boolean queries and Venn diagrams from EcoArray, EcoTopics or user-provided GeneSets; (iii) the genome-wide clone and deletion primer design tool, PrimerPairs; (iv) sequence searches using a customized EcoBLAST; (v) a Cross Reference table of synonymous gene and protein identifiers; (vi) proteome-wide indexing with GO terms; (vii) EcoTools access to >2000 complete bacterial genomes in EcoGene-RefSeq; (viii) establishment of a MySql relational database; and (ix) use of web content management systems. The biomedical literature is surveyed daily to provide citation and gene function updates. As of September 2012, the review of 37 397 abstracts and articles led to creation of 98 425 PubMed-Gene links and 5415 PubMed-Topic links. Annotation updates to Genbank U00096 are transmitted from EcoGene to NCBI. Experimental verifications include confirmation of a CTG start codon, pseudogene restoration and quality assurance of the Keio strain collection. PMID:23197660

  10. pGenN, a gene normalization tool for plant genes and proteins in scientific literature.

    PubMed

    Ding, Ruoyao; Arighi, Cecilia N; Lee, Jung-Youn; Wu, Cathy H; Vijay-Shanker, K

    2015-01-01

    Automatically detecting gene/protein names in the literature and connecting them to databases records, also known as gene normalization, provides a means to structure the information buried in free-text literature. Gene normalization is critical for improving the coverage of annotation in the databases, and is an essential component of many text mining systems and database curation pipelines. In this manuscript, we describe a gene normalization system specifically tailored for plant species, called pGenN (pivot-based Gene Normalization). The system consists of three steps: dictionary-based gene mention detection, species assignment, and intra species normalization. We have developed new heuristics to improve each of these phases. We evaluated the performance of pGenN on an in-house expertly annotated corpus consisting of 104 plant relevant abstracts. Our system achieved an F-value of 88.9% (Precision 90.9% and Recall 87.2%) on this corpus, outperforming state-of-art systems presented in BioCreative III. We have processed over 440,000 plant-related Medline abstracts using pGenN. The gene normalization results are stored in a local database for direct query from the pGenN web interface (proteininformationresource.org/pgenn/). The annotated literature corpus is also publicly available through the PIR text mining portal (proteininformationresource.org/iprolink/).

  11. EcoGene 3.0.

    PubMed

    Zhou, Jindan; Rudd, Kenneth E

    2013-01-01

    EcoGene (http://ecogene.org) is a database and website devoted to continuously improving the structural and functional annotation of Escherichia coli K-12, one of the most well understood model organisms, represented by the MG1655(Seq) genome sequence and annotations. Major improvements to EcoGene in the past decade include (i) graphic presentations of genome map features; (ii) ability to design Boolean queries and Venn diagrams from EcoArray, EcoTopics or user-provided GeneSets; (iii) the genome-wide clone and deletion primer design tool, PrimerPairs; (iv) sequence searches using a customized EcoBLAST; (v) a Cross Reference table of synonymous gene and protein identifiers; (vi) proteome-wide indexing with GO terms; (vii) EcoTools access to >2000 complete bacterial genomes in EcoGene-RefSeq; (viii) establishment of a MySql relational database; and (ix) use of web content management systems. The biomedical literature is surveyed daily to provide citation and gene function updates. As of September 2012, the review of 37 397 abstracts and articles led to creation of 98 425 PubMed-Gene links and 5415 PubMed-Topic links. Annotation updates to Genbank U00096 are transmitted from EcoGene to NCBI. Experimental verifications include confirmation of a CTG start codon, pseudogene restoration and quality assurance of the Keio strain collection.

  12. Characterization of gonadal transcriptomes from the turbot (Scophthalmus maximus).

    PubMed

    Hu, Yulong; Huang, Meng; Wang, Weiji; Guan, Jiantao; Kong, Jie

    2016-01-01

    The mechanisms underlying sexual reproduction and sex ratio determination remains unclear in turbot, a flatfish of great commercial value. And there is limited information in the turbot database regarding genes related to the reproductive system. Here, we conducted high-throughput transcriptome profiling of turbot gonad tissues to better understand their reproductive functions and to supply essential gene sequence information for marker-assisted selection programs in the turbot industry. In this study, two gonad libraries representing sex differences in Scophthalmus maximus yielded 453 818 high-quality reads that were assembled into 24 611 contigs and 33 713 singletons by using 454 pyrosequencing, 13 936 contigs and singletons (CS) of which were annotated using BLASTx. GO (Gene Ontology) and KEGG (Kyoto Encyclopedia of Genes and Genomes) pathway analyses revealed that various biological functions and processes were associated with many of the annotated CS. Expression analyses showed that 510 genes were differentially expressed in males versus females; 80% of these genes were annotated. In addition, 6484 and 6036 single nucleotide polymorphisms (SNPs) were identified in male and female libraries, respectively. This transcriptome resource will serve as the foundation for cDNA or SNP microarray construction, gene expression characterization, and sex-specific linkage mapping in turbot.

  13. SigmoID: a user-friendly tool for improving bacterial genome annotation through analysis of transcription control signals

    PubMed Central

    Damienikan, Aliaksandr U.

    2016-01-01

    The majority of bacterial genome annotations are currently automated and based on a ‘gene by gene’ approach. Regulatory signals and operon structures are rarely taken into account which often results in incomplete and even incorrect gene function assignments. Here we present SigmoID, a cross-platform (OS X, Linux and Windows) open-source application aiming at simplifying the identification of transcription regulatory sites (promoters, transcription factor binding sites and terminators) in bacterial genomes and providing assistance in correcting annotations in accordance with regulatory information. SigmoID combines a user-friendly graphical interface to well known command line tools with a genome browser for visualising regulatory elements in genomic context. Integrated access to online databases with regulatory information (RegPrecise and RegulonDB) and web-based search engines speeds up genome analysis and simplifies correction of genome annotation. We demonstrate some features of SigmoID by constructing a series of regulatory protein binding site profiles for two groups of bacteria: Soft Rot Enterobacteriaceae (Pectobacterium and Dickeya spp.) and Pseudomonas spp. Furthermore, we inferred over 900 transcription factor binding sites and alternative sigma factor promoters in the annotated genome of Pectobacterium atrosepticum. These regulatory signals control putative transcription units covering about 40% of the P. atrosepticum chromosome. Reviewing the annotation in cases where it didn’t fit with regulatory information allowed us to correct product and gene names for over 300 loci. PMID:27257541

  14. The Plant Ontology as a Tool for Comparative Plant Anatomy and Genomic Analyses

    PubMed Central

    Cooper, Laurel; Walls, Ramona L.; Elser, Justin; Gandolfo, Maria A.; Stevenson, Dennis W.; Smith, Barry; Preece, Justin; Athreya, Balaji; Mungall, Christopher J.; Rensing, Stefan; Hiss, Manuel; Lang, Daniel; Reski, Ralf; Berardini, Tanya Z.; Li, Donghui; Huala, Eva; Schaeffer, Mary; Menda, Naama; Arnaud, Elizabeth; Shrestha, Rosemary; Yamazaki, Yukiko; Jaiswal, Pankaj

    2013-01-01

    The Plant Ontology (PO; http://www.plantontology.org/) is a publicly available, collaborative effort to develop and maintain a controlled, structured vocabulary (‘ontology’) of terms to describe plant anatomy, morphology and the stages of plant development. The goals of the PO are to link (annotate) gene expression and phenotype data to plant structures and stages of plant development, using the data model adopted by the Gene Ontology. From its original design covering only rice, maize and Arabidopsis, the scope of the PO has been expanded to include all green plants. The PO was the first multispecies anatomy ontology developed for the annotation of genes and phenotypes. Also, to our knowledge, it was one of the first biological ontologies that provides translations (via synonyms) in non-English languages such as Japanese and Spanish. As of Release #18 (July 2012), there are about 2.2 million annotations linking PO terms to >110,000 unique data objects representing genes or gene models, proteins, RNAs, germplasm and quantitative trait loci (QTLs) from 22 plant species. In this paper, we focus on the plant anatomical entity branch of the PO, describing the organizing principles, resources available to users and examples of how the PO is integrated into other plant genomics databases and web portals. We also provide two examples of comparative analyses, demonstrating how the ontology structure and PO-annotated data can be used to discover the patterns of expression of the LEAFY (LFY) and terpene synthase (TPS) gene homologs. PMID:23220694

  15. Isoform Evolution in Primates through Independent Combination of Alternative RNA Processing Events

    PubMed Central

    Zhang, Shi-Jian; Wang, Chenqu; Yan, Shouyu; Fu, Aisi; Luan, Xuke; Li, Yumei; Sunny Shen, Qing; Zhong, Xiaoming; Chen, Jia-Yu; Wang, Xiangfeng; Chin-Ming Tan, Bertrand; He, Aibin; Li, Chuan-Yun

    2017-01-01

    Abstract Recent RNA-seq technology revealed thousands of splicing events that are under rapid evolution in primates, whereas the reliability of these events, as well as their combination on the isoform level, have not been adequately addressed due to its limited sequencing length. Here, we performed comparative transcriptome analyses in human and rhesus macaque cerebellum using single molecule long-read sequencing (Iso-seq) and matched RNA-seq. Besides 359 million RNA-seq reads, 4,165,527 Iso-seq reads were generated with a mean length of 14,875 bp, covering 11,466 human genes, and 10,159 macaque genes. With Iso-seq data, we substantially expanded the repertoire of alternative RNA processing events in primates, and found that intron retention and alternative polyadenylation are surprisingly more prevalent in primates than previously estimated. We then investigated the combinatorial mode of these alternative events at the whole-transcript level, and found that the combination of these events is largely independent along the transcript, leading to thousands of novel isoforms missed by current annotations. Notably, these novel isoforms are selectively constrained in general, and 1,119 isoforms have even higher expression than the previously annotated major isoforms in human, indicating that the complexity of the human transcriptome is still significantly underestimated. Comparative transcriptome analysis further revealed 502 genes encoding selectively constrained, lineage-specific isoforms in human but not in rhesus macaque, linking them to some lineage-specific functions. Overall, we propose that the independent combination of alternative RNA processing events has contributed to complex isoform evolution in primates, which provides a new foundation for the study of phenotypic difference among primates. PMID:28957512

  16. Antibiotic Resistance Markers in Burkholderia pseudomallei Strain Bp1651 Identified by Genome Sequence Analysis

    PubMed Central

    Sue, David; Gee, Jay E.; Elrod, Mindy G.; Hoffmaster, Alex R.; Randall, Linnell B.; Chirakul, Sunisa; Tuanyok, Apichai; Schweizer, Herbert P.; Weigel, Linda M.

    2017-01-01

    ABSTRACT Burkholderia pseudomallei Bp1651 is resistant to several classes of antibiotics that are usually effective for treatment of melioidosis, including tetracyclines, sulfonamides, and β-lactams such as penicillins (amoxicillin-clavulanic acid), cephalosporins (ceftazidime), and carbapenems (imipenem and meropenem). We sequenced, assembled, and annotated the Bp1651 genome and analyzed the sequence using comparative genomic analyses with susceptible strains, keyword searches of the annotation, publicly available antimicrobial resistance prediction tools, and published reports. More than 100 genes in the Bp1651 sequence were identified as potentially contributing to antimicrobial resistance. Most notably, we identified three previously uncharacterized point mutations in penA, which codes for a class A β-lactamase and was previously implicated in resistance to β-lactam antibiotics. The mutations result in amino acid changes T147A, D240G, and V261I. When individually introduced into select agent-excluded B. pseudomallei strain Bp82, D240G was found to contribute to ceftazidime resistance and T147A contributed to amoxicillin-clavulanic acid and imipenem resistance. This study provides the first evidence that mutations in penA may alter susceptibility to carbapenems in B. pseudomallei. Another mutation of interest was a point mutation affecting the dihydrofolate reductase gene folA, which likely explains the trimethoprim resistance of this strain. Bp1651 was susceptible to aminoglycosides likely because of a frameshift in the amrB gene, the transporter subunit of the AmrAB-OprA efflux pump. These findings expand the role of penA to include resistance to carbapenems and may assist in the development of molecular diagnostics that predict antimicrobial resistance and provide guidance for treatment of melioidosis. PMID:28396541

  17. Contribution of Three Bile-Associated Loci, bsh, pva, and btlB, to Gastrointestinal Persistence and Bile Tolerance of Listeria monocytogenes

    PubMed Central

    Begley, Máire; Sleator, Roy D.; Gahan, Cormac G. M.; Hill, Colin

    2005-01-01

    Listeria monocytogenes must resist the deleterious actions of bile in order to infect and subsequently colonize the human gastrointestinal tract. The molecular mechanisms used by the bacterium to resist bile and the influence of bile on pathogenesis are as yet largely unexplored. This study describes the analysis of three genes—bsh, pva, and btlB—previously annotated as bile-associated loci in the sequenced L. monocytogenes EGDe genome (lmo2067, lmo0446, and lmo0754, respectively). Analysis of deletion mutants revealed a role for all three genes in resisting the acute toxicity of bile and bile salts, particularly glycoconjugated bile salts at low pH. Mutants were unaffected in the other stress responses examined (acid, salt, and detergents). Bile hydrolysis assays demonstrate that L. monocytogenes possesses only one bile salt hydrolase gene, namely, bsh. Transcriptional analyses and activity assays revealed that, although it is regulated by both PrfA and σB, the latter appears to play the greater role in modulating bsh expression. In addition to being incapable of bile hydrolysis, a sigB mutant was shown to be exquisitely sensitive to bile salts. Furthermore, increased expression of sigB was detected under anaerobic conditions and during murine infection. A gene previously annotated as a possible penicillin V amidase (pva) or bile salt hydrolase was shown to be required for resistance to penicillin V but not penicillin G but did not demonstrate a role in bile hydrolysis. Finally, animal (murine) studies revealed an important role for both bsh and btlB in the intestinal persistence of L. monocytogenes. PMID:15664931

  18. Phylogenomics of MADS-Box Genes in Plants - Two Opposing Life Styles in One Gene Family.

    PubMed

    Gramzow, Lydia; Theißen, Günter

    2013-09-12

    The development of multicellular eukaryotes, according to their body plan, is often directed by members of multigene families that encode transcription factors. MADS (for MINICHROMOSOME MAINTENANCE1, AGAMOUS, DEFICIENS and SERUM RESPONSE FACTOR)-box genes form one of those families controlling nearly all major aspects of plant development. Knowing the complete complement of MADS-box genes in sequenced plant genomes will allow a better understanding of the evolutionary patterns of these genes and the association of their evolution with the evolution of plant morphologies. Here, we have applied a combination of automatic and manual annotations to identify the complete set of MADS-box genes in 17 plant genomes. Furthermore, three plant genomes were reanalyzed and published datasets were used for four genomes such that more than 2,600 genes from 24 species were classified into the two types of MADS-box genes, Type I and Type II. Our results extend previous studies, highlighting the remarkably different evolutionary patterns of Type I and Type II genes and provide a basis for further studies on the evolution and function of MADS-box genes.

  19. WikiPathways: a multifaceted pathway database bridging metabolomics to other omics research.

    PubMed

    Slenter, Denise N; Kutmon, Martina; Hanspers, Kristina; Riutta, Anders; Windsor, Jacob; Nunes, Nuno; Mélius, Jonathan; Cirillo, Elisa; Coort, Susan L; Digles, Daniela; Ehrhart, Friederike; Giesbertz, Pieter; Kalafati, Marianthi; Martens, Marvin; Miller, Ryan; Nishida, Kozo; Rieswijk, Linda; Waagmeester, Andra; Eijssen, Lars M T; Evelo, Chris T; Pico, Alexander R; Willighagen, Egon L

    2018-01-04

    WikiPathways (wikipathways.org) captures the collective knowledge represented in biological pathways. By providing a database in a curated, machine readable way, omics data analysis and visualization is enabled. WikiPathways and other pathway databases are used to analyze experimental data by research groups in many fields. Due to the open and collaborative nature of the WikiPathways platform, our content keeps growing and is getting more accurate, making WikiPathways a reliable and rich pathway database. Previously, however, the focus was primarily on genes and proteins, leaving many metabolites with only limited annotation. Recent curation efforts focused on improving the annotation of metabolism and metabolic pathways by associating unmapped metabolites with database identifiers and providing more detailed interaction knowledge. Here, we report the outcomes of the continued growth and curation efforts, such as a doubling of the number of annotated metabolite nodes in WikiPathways. Furthermore, we introduce an OpenAPI documentation of our web services and the FAIR (Findable, Accessible, Interoperable and Reusable) annotation of resources to increase the interoperability of the knowledge encoded in these pathways and experimental omics data. New search options, monthly downloads, more links to metabolite databases, and new portals make pathway knowledge more effortlessly accessible to individual researchers and research communities. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.

  20. The standard operating procedure of the DOE-JGI Microbial Genome Annotation Pipeline (MGAP v.4).

    PubMed

    Huntemann, Marcel; Ivanova, Natalia N; Mavromatis, Konstantinos; Tripp, H James; Paez-Espino, David; Palaniappan, Krishnaveni; Szeto, Ernest; Pillay, Manoj; Chen, I-Min A; Pati, Amrita; Nielsen, Torben; Markowitz, Victor M; Kyrpides, Nikos C

    2015-01-01

    The DOE-JGI Microbial Genome Annotation Pipeline performs structural and functional annotation of microbial genomes that are further included into the Integrated Microbial Genome comparative analysis system. MGAP is applied to assembled nucleotide sequence datasets that are provided via the IMG submission site. Dataset submission for annotation first requires project and associated metadata description in GOLD. The MGAP sequence data processing consists of feature prediction including identification of protein-coding genes, non-coding RNAs and regulatory RNA features, as well as CRISPR elements. Structural annotation is followed by assignment of protein product names and functions.

  1. DaGO-Fun: tool for Gene Ontology-based functional analysis using term information content measures.

    PubMed

    Mazandu, Gaston K; Mulder, Nicola J

    2013-09-25

    The use of Gene Ontology (GO) data in protein analyses have largely contributed to the improved outcomes of these analyses. Several GO semantic similarity measures have been proposed in recent years and provide tools that allow the integration of biological knowledge embedded in the GO structure into different biological analyses. There is a need for a unified tool that provides the scientific community with the opportunity to explore these different GO similarity measure approaches and their biological applications. We have developed DaGO-Fun, an online tool available at http://web.cbio.uct.ac.za/ITGOM, which incorporates many different GO similarity measures for exploring, analyzing and comparing GO terms and proteins within the context of GO. It uses GO data and UniProt proteins with their GO annotations as provided by the Gene Ontology Annotation (GOA) project to precompute GO term information content (IC), enabling rapid response to user queries. The DaGO-Fun online tool presents the advantage of integrating all the relevant IC-based GO similarity measures, including topology- and annotation-based approaches to facilitate effective exploration of these measures, thus enabling users to choose the most relevant approach for their application. Furthermore, this tool includes several biological applications related to GO semantic similarity scores, including the retrieval of genes based on their GO annotations, the clustering of functionally related genes within a set, and term enrichment analysis.

  2. The Genomes of Three Uneven Siblings: Footprints of the Lifestyles of Three Trichoderma Species

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Schmoll, Monika; Dattenböck, Christoph; Carreras-Villaseñor, Nohemí

    SUMMARYThe genusTrichodermacontains fungi with high relevance for humans, with applications in enzyme production for plant cell wall degradation and use in biocontrol. Here, we provide a broad, comprehensive overview of the genomic content of these species for “hot topic” research aspects, including CAZymes, transport, transcription factors, and development, along with a detailed analysis and annotation of less-studied topics, such as signal transduction, genome integrity, chromatin, photobiology, or lipid, sulfur, and nitrogen metabolism inT. reesei,T. atroviride, andT. virens, and we open up new perspectives to those topics discussed previously. In total, we covered more than 2,000 of the predicted 9,000 tomore » 11,000 genes of eachTrichodermaspecies discussed, which is >20% of the respective gene content. Additionally, we considered available transcriptome data for the annotated genes. Highlights of our analyses include overall carbohydrate cleavage preferences due to the different genomic contents and regulation of the respective genes. We found light regulation of many sulfur metabolic genes. Additionally, a new Golgi 1,2-mannosidase likely involved inN-linked glycosylation was detected, as were indications for the ability ofTrichodermaspp. to generate hybrid galactose-containingN-linked glycans. The genomic inventory of effector proteins revealed numerous compounds unique toTrichoderma, and these warrant further investigation. We found interesting expansions in theTrichodermagenus in several signaling pathways, such as G-protein-coupled receptors, RAS GTPases, and casein kinases. A particularly interesting feature absolutely unique toT. atrovirideis the duplication of the alternative sulfur amino acid synthesis pathway.« less

  3. The developmental transcriptome of the mosquito Aedes aegypti, an invasive species and major arbovirus vector.

    PubMed

    Akbari, Omar S; Antoshechkin, Igor; Amrhein, Henry; Williams, Brian; Diloreto, Race; Sandler, Jeremy; Hay, Bruce A

    2013-09-04

    Mosquitoes are vectors of a number of important human and animal diseases. The development of novel vector control strategies requires a thorough understanding of mosquito biology. To facilitate this, we used RNA-seq to identify novel genes and provide the first high-resolution view of the transcriptome throughout development and in response to blood feeding in a mosquito vector of human disease, Aedes aegypti, the primary vector for Dengue and yellow fever. We characterized mRNA expression at 34 distinct time points throughout Aedes development, including adult somatic and germline tissues, by using polyA+ RNA-seq. We identify a total of 14,238 novel new transcribed regions corresponding to 12,597 new loci, as well as many novel transcript isoforms of previously annotated genes. Altogether these results increase the annotated fraction of the transcribed genome into long polyA+ RNAs by more than twofold. We also identified a number of patterns of shared gene expression, as well as genes and/or exons expressed sex-specifically or sex-differentially. Expression profiles of small RNAs in ovaries, early embryos, testes, and adult male and female somatic tissues also were determined, resulting in the identification of 38 new Aedes-specific miRNAs, and ~291,000 small RNA new transcribed regions, many of which are likely to be endogenous small-interfering RNAs and Piwi-interacting RNAs. Genes of potential interest for transgene-based vector control strategies also are highlighted. Our data have been incorporated into a user-friendly genome browser located at www.Aedes.caltech.edu, with relevant links to Vectorbase (www.vectorbase.org).

  4. The Genomes of Three Uneven Siblings: Footprints of the Lifestyles of Three Trichoderma Species

    PubMed Central

    Dattenböck, Christoph; Carreras-Villaseñor, Nohemí; Mendoza-Mendoza, Artemio; Tisch, Doris; Alemán, Mario Ivan; Baker, Scott E.; Brown, Christopher; Cervantes-Badillo, Mayte Guadalupe; Cetz-Chel, José; Cristobal-Mondragon, Gema Rosa; Delaye, Luis; Esquivel-Naranjo, Edgardo Ulises; Frischmann, Alexa; Gallardo-Negrete, Jose de Jesus; García-Esquivel, Monica; Gomez-Rodriguez, Elida Yazmin; Greenwood, David R.; Hernández-Oñate, Miguel; Kruszewska, Joanna S.; Lawry, Robert; Mora-Montes, Hector M.; Muñoz-Centeno, Tania; Nieto-Jacobo, Maria Fernanda; Nogueira Lopez, Guillermo; Olmedo-Monfil, Vianey; Osorio-Concepcion, Macario; Piłsyk, Sebastian; Pomraning, Kyle R.; Rodriguez-Iglesias, Aroa; Rosales-Saavedra, Maria Teresa; Sánchez-Arreguín, J. Alejandro; Seidl-Seiboth, Verena; Stewart, Alison; Uresti-Rivera, Edith Elena; Wang, Chih-Li; Wang, Ting-Fang; Zeilinger, Susanne; Casas-Flores, Sergio

    2016-01-01

    SUMMARY The genus Trichoderma contains fungi with high relevance for humans, with applications in enzyme production for plant cell wall degradation and use in biocontrol. Here, we provide a broad, comprehensive overview of the genomic content of these species for “hot topic” research aspects, including CAZymes, transport, transcription factors, and development, along with a detailed analysis and annotation of less-studied topics, such as signal transduction, genome integrity, chromatin, photobiology, or lipid, sulfur, and nitrogen metabolism in T. reesei, T. atroviride, and T. virens, and we open up new perspectives to those topics discussed previously. In total, we covered more than 2,000 of the predicted 9,000 to 11,000 genes of each Trichoderma species discussed, which is >20% of the respective gene content. Additionally, we considered available transcriptome data for the annotated genes. Highlights of our analyses include overall carbohydrate cleavage preferences due to the different genomic contents and regulation of the respective genes. We found light regulation of many sulfur metabolic genes. Additionally, a new Golgi 1,2-mannosidase likely involved in N-linked glycosylation was detected, as were indications for the ability of Trichoderma spp. to generate hybrid galactose-containing N-linked glycans. The genomic inventory of effector proteins revealed numerous compounds unique to Trichoderma, and these warrant further investigation. We found interesting expansions in the Trichoderma genus in several signaling pathways, such as G-protein-coupled receptors, RAS GTPases, and casein kinases. A particularly interesting feature absolutely unique to T. atroviride is the duplication of the alternative sulfur amino acid synthesis pathway. PMID:26864432

  5. The Biofuel Feedstock Genomics Resource: a web-based portal and database to enable functional genomics of plant biofuel feedstock species.

    PubMed

    Childs, Kevin L; Konganti, Kranti; Buell, C Robin

    2012-01-01

    Major feedstock sources for future biofuel production are likely to be high biomass producing plant species such as poplar, pine, switchgrass, sorghum and maize. One active area of research in these species is genome-enabled improvement of lignocellulosic biofuel feedstock quality and yield. To facilitate genomic-based investigations in these species, we developed the Biofuel Feedstock Genomic Resource (BFGR), a database and web-portal that provides high-quality, uniform and integrated functional annotation of gene and transcript assembly sequences from species of interest to lignocellulosic biofuel feedstock researchers. The BFGR includes sequence data from 54 species and permits researchers to view, analyze and obtain annotation at the gene, transcript, protein and genome level. Annotation of biochemical pathways permits the identification of key genes and transcripts central to the improvement of lignocellulosic properties in these species. The integrated nature of the BFGR in terms of annotation methods, orthologous/paralogous relationships and linkage to seven species with complete genome sequences allows comparative analyses for biofuel feedstock species with limited sequence resources. Database URL: http://bfgr.plantbiology.msu.edu.

  6. Reduce Manual Curation by Combining Gene Predictions from Multiple Annotation Engines, a Case Study of Start Codon Prediction

    PubMed Central

    Ederveen, Thomas H. A.; Overmars, Lex; van Hijum, Sacha A. F. T.

    2013-01-01

    Nowadays, prokaryotic genomes are sequenced faster than the capacity to manually curate gene annotations. Automated genome annotation engines provide users a straight-forward and complete solution for predicting ORF coordinates and function. For many labs, the use of AGEs is therefore essential to decrease the time necessary for annotating a given prokaryotic genome. However, it is not uncommon for AGEs to provide different and sometimes conflicting predictions. Combining multiple AGEs might allow for more accurate predictions. Here we analyzed the ab initio open reading frame (ORF) calling performance of different AGEs based on curated genome annotations of eight strains from different bacterial species with GC% ranging from 35–52%. We present a case study which demonstrates a novel way of comparative genome annotation, using combinations of AGEs in a pre-defined order (or path) to predict ORF start codons. The order of AGE combinations is from high to low specificity, where the specificity is based on the eight genome annotations. For each AGE combination we are able to derive a so-called projected confidence value, which is the average specificity of ORF start codon prediction based on the eight genomes. The projected confidence enables estimating likeliness of a correct prediction for a particular ORF start codon by a particular AGE combination, pinpointing ORFs notoriously difficult to predict start codons. We correctly predict start codons for 90.5±4.8% of the genes in a genome (based on the eight genomes) with an accuracy of 81.1±7.6%. Our consensus-path methodology allows a marked improvement over majority voting (9.7±4.4%) and with an optimal path ORF start prediction sensitivity is gained while maintaining a high specificity. PMID:23675487

  7. Effects of Annotations and Homework on Learning Achievement: An Empirical Study of Scratch Programming Pedagogy

    ERIC Educational Resources Information Center

    Su, Addison Y. S.; Huang, Chester S. J.; Yang, Stephen J. H.; Ding, T. J.; Hsieh, Y. Z.

    2015-01-01

    In Taiwan elementary schools, Scratch programming has been taught for more than four years. Previous studies have shown that personal annotations is a useful learning method that improve learning performance. An annotation-based Scratch programming (ASP) system provides for the creation, share, and review of annotations and homework solutions in…

  8. Exome sequencing analysis reveals variants in primary immunodeficiency genes in patients with very early onset inflammatory bowel disease.

    PubMed

    Kelsen, Judith R; Dawany, Noor; Moran, Christopher J; Petersen, Britt-Sabina; Sarmady, Mahdi; Sasson, Ariella; Pauly-Hubbard, Helen; Martinez, Alejandro; Maurer, Kelly; Soong, Joanne; Rappaport, Eric; Franke, Andre; Keller, Andreas; Winter, Harland S; Mamula, Petar; Piccoli, David; Artis, David; Sonnenberg, Gregory F; Daly, Mark; Sullivan, Kathleen E; Baldassano, Robert N; Devoto, Marcella

    2015-11-01

    Very early onset inflammatory bowel disease (VEO-IBD), IBD diagnosed at 5 years of age or younger, frequently presents with a different and more severe phenotype than older-onset IBD. We investigated whether patients with VEO-IBD carry rare or novel variants in genes associated with immunodeficiencies that might contribute to disease development. Patients with VEO-IBD and parents (when available) were recruited from the Children's Hospital of Philadelphia from March 2013 through July 2014. We analyzed DNA from 125 patients with VEO-IBD (age, 3 wk to 4 y) and 19 parents, 4 of whom also had IBD. Exome capture was performed by Agilent SureSelect V4, and sequencing was performed using the Illumina HiSeq platform. Alignment to human genome GRCh37 was achieved followed by postprocessing and variant calling. After functional annotation, candidate variants were analyzed for change in protein function, minor allele frequency less than 0.1%, and scaled combined annotation-dependent depletion scores of 10 or less. We focused on genes associated with primary immunodeficiencies and related pathways. An additional 210 exome samples from patients with pediatric IBD (n = 45) or adult-onset Crohn's disease (n = 20) and healthy individuals (controls, n = 145) were obtained from the University of Kiel, Germany, and used as control groups. Four hundred genes and regions associated with primary immunodeficiency, covering approximately 6500 coding exons totaling more than 1 Mbp of coding sequence, were selected from the whole-exome data. Our analysis showed novel and rare variants within these genes that could contribute to the development of VEO-IBD, including rare heterozygous missense variants in IL10RA and previously unidentified variants in MSH5 and CD19. In an exome sequence analysis of patients with VEO-IBD and their parents, we identified variants in genes that regulate B- and T-cell functions and could contribute to pathogenesis. Our analysis could lead to the identification of previously unidentified IBD-associated variants. Copyright © 2015 AGA Institute. Published by Elsevier Inc. All rights reserved.

  9. Pcal_0111, a highly thermostable bifunctional fructose-1,6-bisphosphate aldolase/phosphatase from Pyrobaculum calidifontis.

    PubMed

    Aziz, Iram; Rashid, Naeem; Ashraf, Raza; Bashir, Qamar; Imanaka, Tadayuki; Akhtar, Muhammad

    2017-05-01

    Pyrobaculum calidifontis genome harbors an open reading frame Pcal_0111 annotated as fructose bisphosphate aldolase. Although the gene is annotated as fructose bisphosphate aldolase, it exhibits a high homology with previously reported fructose-1,6-bisphosphate aldolase/phosphatase from Thermoproteus neutrophilus. To examine the biochemical properties of Pcal_0111, we have cloned and expressed the gene in Escherichia coli. Purified recombinant Pcal_0111 catalyzed both phosphatase and aldolase reactions with specific activity values of 4 U and 1.3 U, respectively. These values are highest among the fructose 1,6-bisphosphatases/aldolases characterized from archaea. The enzyme activity increased linearly with the increase in temperature until 100 °C. Recombinant Pcal_0111 is highly stable with a half-life of 120 min at 100 °C. There was no significant change in the circular dichroism spectra of the protein up to 90 °C. The enzyme activity was not affected by AMP but strongly inhibited by ATP with an IC 50 value of 0.75 mM and mildly by ADP. High thermostability and inhibition by ATP make Pcal_0111 a unique fructose 1,6-bisphosphatase/aldolase.

  10. De Novo Transcriptomic Analysis of Peripheral Blood Lymphocytes from the Chinese Goose: Gene Discovery and Immune System Pathway Description

    PubMed Central

    Tariq, Mansoor; Chen, Rong; Yuan, Hongyu; Liu, Yanjie; Wu, Yanan; Wang, Junya; Xia, Chun

    2015-01-01

    Background The Chinese goose is one of the most economically important poultry birds and is a natural reservoir for many avian viruses. However, the nature and regulation of the innate and adaptive immune systems of this waterfowl species are not completely understood due to limited information on the goose genome. Recently, transcriptome sequencing technology was applied in the genomic studies focused on novel gene discovery. Thus, this study described the transcriptome of the goose peripheral blood lymphocytes to identify immunity relevant genes. Principal Findings De novo transcriptome assembly of the goose peripheral blood lymphocytes was sequenced by Illumina-Solexa technology. In total, 211,198 unigenes were assembled from the 69.36 million cleaned reads. The average length, N50 size and the maximum length of the assembled unigenes were 687 bp, 1,298 bp and 18,992 bp, respectively. A total of 36,854 unigenes showed similarity by BLAST search against the NCBI non-redundant (Nr) protein database. For functional classification, 163,161 unigenes were comprised of three Gene Ontology (Go) categories and 67 subcategories. A total of 15,334 unigenes were annotated into 25 eukaryotic orthologous groups (KOGs) categories. Kyoto Encyclopedia of Genes and Genomes (KEGG) database annotated 39,585 unigenes into six biological functional groups and 308 pathways. Among the 2,757 unigenes that participated in the 15 immune system KEGG pathways, 125 of the most important immune relevant genes were summarized and analyzed by STRING analysis to identify gene interactions and relationships. Moreover, 10 genes were confirmed by PCR and analyzed. Of these 125 unigenes, 109 unigenes, approximately 87%, were not previously identified in the goose. Conclusion This de novo transcriptome analysis could provide important Chinese goose sequence information and highlights the value of new gene discovery, pathways investigation and immune system gene identification, and comparison with other avian species as useful tools to understand the goose immune system. PMID:25816068

  11. Functional annotation of regulatory pathways.

    PubMed

    Pandey, Jayesh; Koyutürk, Mehmet; Kim, Yohan; Szpankowski, Wojciech; Subramaniam, Shankar; Grama, Ananth

    2007-07-01

    Standardized annotations of biomolecules in interaction networks (e.g. Gene Ontology) provide comprehensive understanding of the function of individual molecules. Extending such annotations to pathways is a critical component of functional characterization of cellular signaling at the systems level. We propose a framework for projecting gene regulatory networks onto the space of functional attributes using multigraph models, with the objective of deriving statistically significant pathway annotations. We first demonstrate that annotations of pairwise interactions do not generalize to indirect relationships between processes. Motivated by this result, we formalize the problem of identifying statistically overrepresented pathways of functional attributes. We establish the hardness of this problem by demonstrating the non-monotonicity of common statistical significance measures. We propose a statistical model that emphasizes the modularity of a pathway, evaluating its significance based on the coupling of its building blocks. We complement the statistical model by an efficient algorithm and software, Narada, for computing significant pathways in large regulatory networks. Comprehensive results from our methods applied to the Escherichia coli transcription network demonstrate that our approach is effective in identifying known, as well as novel biological pathway annotations. Narada is implemented in Java and is available at http://www.cs.purdue.edu/homes/jpandey/narada/.

  12. Representing virus-host interactions and other multi-organism processes in the Gene Ontology.

    PubMed

    Foulger, R E; Osumi-Sutherland, D; McIntosh, B K; Hulo, C; Masson, P; Poux, S; Le Mercier, P; Lomax, J

    2015-07-28

    The Gene Ontology project is a collaborative effort to provide descriptions of gene products in a consistent and computable language, and in a species-independent manner. The Gene Ontology is designed to be applicable to all organisms but up to now has been largely under-utilized for prokaryotes and viruses, in part because of a lack of appropriate ontology terms. To address this issue, we have developed a set of Gene Ontology classes that are applicable to microbes and their hosts, improving both coverage and quality in this area of the Gene Ontology. Describing microbial and viral gene products brings with it the additional challenge of capturing both the host and the microbe. Recognising this, we have worked closely with annotation groups to test and optimize the GO classes, and we describe here a set of annotation guidelines that allow the controlled description of two interacting organisms. Building on the microbial resources already in existence such as ViralZone, UniProtKB keywords and MeGO, this project provides an integrated ontology to describe interactions between microbial species and their hosts, with mappings to the external resources above. Housing this information within the freely-accessible Gene Ontology project allows the classes and annotation structure to be utilized by a large community of biologists and users.

  13. Elucidation of primary metabolic pathways in Aspergillus species: orphaned research in characterizing orphan genes.

    PubMed

    Andersen, Mikael Rørdam

    2014-11-01

    Primary metabolism affects all phenotypical traits of filamentous fungi. Particular examples include reacting to extracellular stimuli, producing precursor molecules required for cell division and morphological changes as well as providing monomer building blocks for production of secondary metabolites and extracellular enzymes. In this review, all annotated genes from four Aspergillus species have been examined. In this process, it becomes evident that 80-96% of the genes (depending on the species) are still without verified function. A significant proportion of the genes with verified metabolic functions are assigned to secondary or extracellular metabolism, leaving only 2-4% of the annotated genes within primary metabolism. It is clear that primary metabolism has not received the same attention in the post-genomic area as many other research areas--despite its role at the very centre of cellular function. However, several methods can be employed to use the metabolic networks in tandem with comparative genomics to accelerate functional assignment of genes in primary metabolism. In particular, gaps in metabolic pathways can be used to assign functions to orphan genes. In this review, applications of this from the Aspergillus genes will be examined, and it is proposed that, where feasible, this should be a standard part of functional annotation of fungal genomes. © The Author 2014. Published by Oxford University Press.

  14. RASTtk: A modular and extensible implementation of the RAST algorithm for building custom annotation pipelines and annotating batches of genomes

    DOE PAGES

    Brettin, Thomas; Davis, James J.; Disz, Terry; ...

    2015-02-10

    The RAST (Rapid Annotation using Subsystem Technology) annotation engine was built in 2008 to annotate bacterial and archaeal genomes. It works by offering a standard software pipeline for identifying genomic features (i.e., protein-encoding genes and RNA) and annotating their functions. Recently, in order to make RAST a more useful research tool and to keep pace with advancements in bioinformatics, it has become desirable to build a version of RAST that is both customizable and extensible. In this paper, we describe the RAST tool kit (RASTtk), a modular version of RAST that enables researchers to build custom annotation pipelines. RASTtk offersmore » a choice of software for identifying and annotating genomic features as well as the ability to add custom features to an annotation job. RASTtk also accommodates the batch submission of genomes and the ability to customize annotation protocols for batch submissions. This is the first major software restructuring of RAST since its inception.« less

  15. Management and analysis of genomic functional and phenotypic controlled annotations to support biomedical investigation and practice.

    PubMed

    Masseroli, Marco

    2007-07-01

    The growing available genomic information provides new opportunities for novel research approaches and original biomedical applications that can provide effective data management and analysis support. In fact, integration and comprehensive evaluation of available controlled data can highlight information patterns leading to unveil new biomedical knowledge. Here, we describe Genome Function INtegrated Discover (GFINDer), a Web-accessible three-tier multidatabase system we developed to automatically enrich lists of user-classified genes with several functional and phenotypic controlled annotations, and to statistically evaluate them in order to identify annotation categories significantly over- or underrepresented in each considered gene class. Genomic controlled annotations from Gene Ontology (GO), KEGG, Pfam, InterPro, and Online Mendelian Inheritance in Man (OMIM) were integrated in GFINDer and several categorical tests were implemented for their analysis. A controlled vocabulary of inherited disorder phenotypes was obtained by normalizing and hierarchically structuring disease accompanying signs and symptoms from OMIM Clinical Synopsis sections. GFINDer modular architecture is well suited for further system expansion and for sustaining increasing workload. Testing results showed that GFINDer analyses can highlight gene functional and phenotypic characteristics and differences, demonstrating its value in supporting genomic biomedical approaches aiming at understanding the complex biomolecular mechanisms underlying patho-physiological phenotypes, and in helping the transfer of genomic results to medical practice.

  16. Annotating ebony on the fly.

    PubMed

    Kohn, Michael H; Wittkopp, Patricia J

    2007-07-01

    The distinctive black phenotype of ebony mutants has made it one of the most widely used phenotypic markers in Drosophila genetics. Without doubt, ebony showcases the fruits of the fly community's labours to annotate gene function. As of this writing, FlyBase lists 142 references, 1277 fly stocks, 15 phenotypes and 44 alleles. In addition to its namesake pigmentation phenotype, ebony mutants affect other traits, including phototaxis and courtship. With phenotypic consequences of ebony variants readily apparent in the laboratory, does natural selection also see them in the wild? In this issue of Molecular Ecology, Pool & Aquadro investigate this question and found signs of natural selection on the ebony gene that appear to have resulted from selection for darker pigmentation at higher elevations in sub-Saharan populations of Drosophila melanogaster. Such findings from population genomic analysis of wild-derived strains should be included in gene annotations to provide a more holistic view of a gene's function. The evolutionary annotation of ebony added by Pool & Aquadro substantiates that pigmentation can be adaptive and implicates elevation as an important selective factor. This is important progress because the selective factors seem to differ between populations and species. In addition, the study raises issues to consider when extrapolating from selection at the molecular level to selection at the phenotypic level.

  17. A Uniform System For The Annotation Of Human microRNA Genes And The Evolution Of The Human microRNAome

    PubMed Central

    Fromm, Bastian; Billipp, Tyler; Peck, Liam E.; Johansen, Morten; Tarver, James E.; King, Benjamin L.; Newcomb, James M.; Sempere, Lorenzo F.; Flatmark, Kjersti; Hovig, Eivind; Peterson, Kevin J.

    2016-01-01

    Although microRNAs (miRNAs) are among the most intensively studied molecules of the past 20 years, determining what is and what is not a miRNA has not been straightforward. Here, we present a uniform system for the annotation and nomenclature of miRNA genes. We show that fewer than a third of the 1,881 human miRBase entries, and only approximately 16% of the 7,095 metazoan miRBase entries, are robustly supported as miRNA genes. Furthermore, we show that the human repertoire of miRNAs has been shaped by periods of intense miRNA innovation, and that mature gene products show a very different tempo and mode of sequence evolution than star products. We establish a new open access database -- MirGeneDB (http://mirgenedb.org) -- to catalog this set of robustly supported miRNAs, which complements the efforts of miRBase, but differs from it by annotating the mature versus star products, and by imposing an evolutionary hierarchy upon this curated and consistently named repertoire. PMID:26473382

  18. Phage phenomics: Physiological approaches to characterize novel viral proteins

    ScienceCinema

    Sanchez, Savannah E. [San Diego State Univ., San Diego, CA (United States); Cuevas, Daniel A. [San Diego State Univ., San Diego, CA (United States); Rostron, Jason E. [San Diego State Univ., San Diego, CA (United States); Liang, Tiffany Y. [San Diego State Univ., San Diego, CA (United States); Pivaroff, Cullen G. [San Diego State Univ., San Diego, CA (United States); Haynes, Matthew R. [San Diego State Univ., San Diego, CA (United States); Nulton, Jim [San Diego State Univ., San Diego, CA (United States); Felts, Ben [San Diego State Univ., San Diego, CA (United States); Bailey, Barbara A. [San Diego State Univ., San Diego, CA (United States); Salamon, Peter [San Diego State Univ., San Diego, CA (United States); Edwards, Robert A. [San Diego State Univ., San Diego, CA (United States); Argonne National Lab. (ANL), Argonne, IL (United States); Burgin, Alex B. [Broad Institute, Cambridge, MA (United States); Segall, Anca M. [San Diego State Univ., San Diego, CA (United States); Rohwer, Forest [San Diego State Univ., San Diego, CA (United States)

    2018-06-21

    Current investigations into phage-host interactions are dependent on extrapolating knowledge from (meta)genomes. Interestingly, 60 - 95% of all phage sequences share no homology to current annotated proteins. As a result, a large proportion of phage genes are annotated as hypothetical. This reality heavily affects the annotation of both structural and auxiliary metabolic genes. Here we present phenomic methods designed to capture the physiological response(s) of a selected host during expression of one of these unknown phage genes. Multi-phenotype Assay Plates (MAPs) are used to monitor the diversity of host substrate utilization and subsequent biomass formation, while metabolomics provides bi-product analysis by monitoring metabolite abundance and diversity. Both tools are used simultaneously to provide a phenotypic profile associated with expression of a single putative phage open reading frame (ORF). Thus, representative results for both methods are compared, highlighting the phenotypic profile differences of a host carrying either putative structural or metabolic phage genes. In addition, the visualization techniques and high throughput computational pipelines that facilitated experimental analysis are presented.

  19. PubSearch and PubFetch: a simple management system for semiautomated retrieval and annotation of biological information from the literature.

    PubMed

    Yoo, Danny; Xu, Iris; Berardini, Tanya Z; Rhee, Seung Yon; Narayanasamy, Vijay; Twigger, Simon

    2006-03-01

    For most systems in biology, a large body of literature exists that describes the complexity of the system based on experimental results. Manual review of this literature to extract targeted information into biological databases is difficult and time consuming. To address this problem, we developed PubSearch and PubFetch, which store literature, keyword, and gene information in a relational database, index the literature with keywords and gene names, and provide a Web user interface for annotating the genes from experimental data found in the associated literature. A set of protocols is provided in this unit for installing, populating, running, and using PubSearch and PubFetch. In addition, we provide support protocols for performing controlled vocabulary annotations. Intended users of PubSearch and PubFetch are database curators and biology researchers interested in tracking the literature and capturing information about genes of interest in a more effective way than with conventional spreadsheets and lab notebooks.

  20. The coffee genome hub: a resource for coffee genomes

    PubMed Central

    Dereeper, Alexis; Bocs, Stéphanie; Rouard, Mathieu; Guignon, Valentin; Ravel, Sébastien; Tranchant-Dubreuil, Christine; Poncet, Valérie; Garsmeur, Olivier; Lashermes, Philippe; Droc, Gaëtan

    2015-01-01

    The whole genome sequence of Coffea canephora, the perennial diploid species known as Robusta, has been recently released. In the context of the C. canephora genome sequencing project and to support post-genomics efforts, we developed the Coffee Genome Hub (http://coffee-genome.org/), an integrative genome information system that allows centralized access to genomics and genetics data and analysis tools to facilitate translational and applied research in coffee. We provide the complete genome sequence of C. canephora along with gene structure, gene product information, metabolism, gene families, transcriptomics, syntenic blocks, genetic markers and genetic maps. The hub relies on generic software (e.g. GMOD tools) for easy querying, visualizing and downloading research data. It includes a Genome Browser enhanced by a Community Annotation System, enabling the improvement of automatic gene annotation through an annotation editor. In addition, the hub aims at developing interoperability among other existing South Green tools managing coffee data (phylogenomics resources, SNPs) and/or supporting data analyses with the Galaxy workflow manager. PMID:25392413

  1. Experimental annotation of the human genome using microarray technology.

    PubMed

    Shoemaker, D D; Schadt, E E; Armour, C D; He, Y D; Garrett-Engele, P; McDonagh, P D; Loerch, P M; Leonardson, A; Lum, P Y; Cavet, G; Wu, L F; Altschuler, S J; Edwards, S; King, J; Tsang, J S; Schimmack, G; Schelter, J M; Koch, J; Ziman, M; Marton, M J; Li, B; Cundiff, P; Ward, T; Castle, J; Krolewski, M; Meyer, M R; Mao, M; Burchard, J; Kidd, M J; Dai, H; Phillips, J W; Linsley, P S; Stoughton, R; Scherer, S; Boguski, M S

    2001-02-15

    The most important product of the sequencing of a genome is a complete, accurate catalogue of genes and their products, primarily messenger RNA transcripts and their cognate proteins. Such a catalogue cannot be constructed by computational annotation alone; it requires experimental validation on a genome scale. Using 'exon' and 'tiling' arrays fabricated by ink-jet oligonucleotide synthesis, we devised an experimental approach to validate and refine computational gene predictions and define full-length transcripts on the basis of co-regulated expression of their exons. These methods can provide more accurate gene numbers and allow the detection of mRNA splice variants and identification of the tissue- and disease-specific conditions under which genes are expressed. We apply our technique to chromosome 22q under 69 experimental condition pairs, and to the entire human genome under two experimental conditions. We discuss implications for more comprehensive, consistent and reliable genome annotation, more efficient, full-length complementary DNA cloning strategies and application to complex diseases.

  2. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Sanchez, Savannah E.; Cuevas, Daniel A.; Rostron, Jason E.

    Current investigations into phage-host interactions are dependent on extrapolating knowledge from (meta)genomes. Interestingly, 60 - 95% of all phage sequences share no homology to current annotated proteins. As a result, a large proportion of phage genes are annotated as hypothetical. This reality heavily affects the annotation of both structural and auxiliary metabolic genes. Here we present phenomic methods designed to capture the physiological response(s) of a selected host during expression of one of these unknown phage genes. Multi-phenotype Assay Plates (MAPs) are used to monitor the diversity of host substrate utilization and subsequent biomass formation, while metabolomics provides bi-product analysismore » by monitoring metabolite abundance and diversity. Both tools are used simultaneously to provide a phenotypic profile associated with expression of a single putative phage open reading frame (ORF). Thus, representative results for both methods are compared, highlighting the phenotypic profile differences of a host carrying either putative structural or metabolic phage genes. In addition, the visualization techniques and high throughput computational pipelines that facilitated experimental analysis are presented.« less

  3. Next Generation Models for Storage and Representation of Microbial Biological Annotation

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Quest, Daniel J; Land, Miriam L; Brettin, Thomas S

    2010-01-01

    Background Traditional genome annotation systems were developed in a very different computing era, one where the World Wide Web was just emerging. Consequently, these systems are built as centralized black boxes focused on generating high quality annotation submissions to GenBank/EMBL supported by expert manual curation. The exponential growth of sequence data drives a growing need for increasingly higher quality and automatically generated annotation. Typical annotation pipelines utilize traditional database technologies, clustered computing resources, Perl, C, and UNIX file systems to process raw sequence data, identify genes, and predict and categorize gene function. These technologies tightly couple the annotation software systemmore » to hardware and third party software (e.g. relational database systems and schemas). This makes annotation systems hard to reproduce, inflexible to modification over time, difficult to assess, difficult to partition across multiple geographic sites, and difficult to understand for those who are not domain experts. These systems are not readily open to scrutiny and therefore not scientifically tractable. The advent of Semantic Web standards such as Resource Description Framework (RDF) and OWL Web Ontology Language (OWL) enables us to construct systems that address these challenges in a new comprehensive way. Results Here, we develop a framework for linking traditional data to OWL-based ontologies in genome annotation. We show how data standards can decouple hardware and third party software tools from annotation pipelines, thereby making annotation pipelines easier to reproduce and assess. An illustrative example shows how TURTLE (Terse RDF Triple Language) can be used as a human readable, but also semantically-aware, equivalent to GenBank/EMBL files. Conclusions The power of this approach lies in its ability to assemble annotation data from multiple databases across multiple locations into a representation that is understandable to researchers. In this way, all researchers, experimental and computational, will more easily understand the informatics processes constructing genome annotation and ultimately be able to help improve the systems that produce them.« less

  4. Global Transcriptional Responses to Osmotic, Oxidative, and Imipenem Stress Conditions in Pseudomonas putida

    PubMed Central

    Bojanovič, Klara; D'Arrigo, Isotta

    2017-01-01

    ABSTRACT Bacteria cope with and adapt to stress by modulating gene expression in response to specific environmental cues. In this study, the transcriptional response of Pseudomonas putida KT2440 to osmotic, oxidative, and imipenem stress conditions at two time points was investigated via identification of differentially expressed mRNAs and small RNAs (sRNAs). A total of 440 sRNA transcripts were detected, of which 10% correspond to previously annotated sRNAs, 40% to novel intergenic transcripts, and 50% to novel transcripts antisense to annotated genes. Each stress elicits a unique response as far as the extent and dynamics of the transcriptional changes. Nearly 200 protein-encoding genes exhibited significant changes in all stress types, implicating their participation in a general stress response. Almost half of the sRNA transcripts were differentially expressed under at least one condition, suggesting possible functional roles in the cellular response to stress conditions. The data show a larger fraction of differentially expressed sRNAs than of mRNAs with >5-fold expression changes. The work provides detailed insights into the mechanisms through which P. putida responds to different stress conditions and increases understanding of bacterial adaptation in natural and industrial settings. IMPORTANCE This study maps the complete transcriptional response of P. putida KT2440 to osmotic, oxidative, and imipenem stress conditions at short and long exposure times. Over 400 sRNA transcripts, consisting of both intergenic and antisense transcripts, were detected, increasing the number of identified sRNA transcripts in the strain by a factor of 10. Unique responses to each type of stress are documented, including both the extent and dynamics of the gene expression changes. The work adds rich detail to previous knowledge of stress response mechanisms due to the depth of the RNA sequencing data. Almost half of the sRNAs exhibit significant expression changes under at least one condition, suggesting their involvement in adaptation to stress conditions and identifying interesting candidates for further functional characterization. PMID:28130298

  5. Literature-based concept profiles for gene annotation: the issue of weighting.

    PubMed

    Jelier, Rob; Schuemie, Martijn J; Roes, Peter-Jan; van Mulligen, Erik M; Kors, Jan A

    2008-05-01

    Text-mining has been used to link biomedical concepts, such as genes or biological processes, to each other for annotation purposes or the generation of new hypotheses. To relate two concepts to each other several authors have used the vector space model, as vectors can be compared efficiently and transparently. Using this model, a concept is characterized by a list of associated concepts, together with weights that indicate the strength of the association. The associated concepts in the vectors and their weights are derived from a set of documents linked to the concept of interest. An important issue with this approach is the determination of the weights of the associated concepts. Various schemes have been proposed to determine these weights, but no comparative studies of the different approaches are available. Here we compare several weighting approaches in a large scale classification experiment. Three different techniques were evaluated: (1) weighting based on averaging, an empirical approach; (2) the log likelihood ratio, a test-based measure; (3) the uncertainty coefficient, an information-theory based measure. The weighting schemes were applied in a system that annotates genes with Gene Ontology codes. As the gold standard for our study we used the annotations provided by the Gene Ontology Annotation project. Classification performance was evaluated by means of the receiver operating characteristics (ROC) curve using the area under the curve (AUC) as the measure of performance. All methods performed well with median AUC scores greater than 0.84, and scored considerably higher than a binary approach without any weighting. Especially for the more specific Gene Ontology codes excellent performance was observed. The differences between the methods were small when considering the whole experiment. However, the number of documents that were linked to a concept proved to be an important variable. When larger amounts of texts were available for the generation of the concepts' vectors, the performance of the methods diverged considerably, with the uncertainty coefficient then outperforming the two other methods.

  6. Supporting community annotation and user collaboration in the integrated microbial genomes (IMG) system

    DOE PAGES

    Chen, I-Min A.; Markowitz, Victor M.; Palaniappan, Krishna; ...

    2016-04-26

    Background: The exponential growth of genomic data from next generation technologies renders traditional manual expert curation effort unsustainable. Many genomic systems have included community annotation tools to address the problem. Most of these systems adopted a "Wiki-based" approach to take advantage of existing wiki technologies, but encountered obstacles in issues such as usability, authorship recognition, information reliability and incentive for community participation. Results: Here, we present a different approach, relying on tightly integrated method rather than "Wiki-based" method, to support community annotation and user collaboration in the Integrated Microbial Genomes (IMG) system. The IMG approach allows users to use existingmore » IMG data warehouse and analysis tools to add gene, pathway and biosynthetic cluster annotations, to analyze/reorganize contigs, genes and functions using workspace datasets, and to share private user annotations and workspace datasets with collaborators. We show that the annotation effort using IMG can be part of the research process to overcome the user incentive and authorship recognition problems thus fostering collaboration among domain experts. The usability and reliability issues are addressed by the integration of curated information and analysis tools in IMG, together with DOE Joint Genome Institute (JGI) expert review. Conclusion: By incorporating annotation operations into IMG, we provide an integrated environment for users to perform deeper and extended data analysis and annotation in a single system that can lead to publications and community knowledge sharing as shown in the case studies.« less

  7. Supporting community annotation and user collaboration in the integrated microbial genomes (IMG) system

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Chen, I-Min A.; Markowitz, Victor M.; Palaniappan, Krishna

    Background: The exponential growth of genomic data from next generation technologies renders traditional manual expert curation effort unsustainable. Many genomic systems have included community annotation tools to address the problem. Most of these systems adopted a "Wiki-based" approach to take advantage of existing wiki technologies, but encountered obstacles in issues such as usability, authorship recognition, information reliability and incentive for community participation. Results: Here, we present a different approach, relying on tightly integrated method rather than "Wiki-based" method, to support community annotation and user collaboration in the Integrated Microbial Genomes (IMG) system. The IMG approach allows users to use existingmore » IMG data warehouse and analysis tools to add gene, pathway and biosynthetic cluster annotations, to analyze/reorganize contigs, genes and functions using workspace datasets, and to share private user annotations and workspace datasets with collaborators. We show that the annotation effort using IMG can be part of the research process to overcome the user incentive and authorship recognition problems thus fostering collaboration among domain experts. The usability and reliability issues are addressed by the integration of curated information and analysis tools in IMG, together with DOE Joint Genome Institute (JGI) expert review. Conclusion: By incorporating annotation operations into IMG, we provide an integrated environment for users to perform deeper and extended data analysis and annotation in a single system that can lead to publications and community knowledge sharing as shown in the case studies.« less

  8. Muscle Research and Gene Ontology: New standards for improved data integration

    PubMed Central

    Feltrin, Erika; Campanaro, Stefano; Diehl, Alexander D; Ehler, Elisabeth; Faulkner, Georgine; Fordham, Jennifer; Gardin, Chiara; Harris, Midori; Hill, David; Knoell, Ralph; Laveder, Paolo; Mittempergher, Lorenza; Nori, Alessandra; Reggiani, Carlo; Sorrentino, Vincenzo; Volpe, Pompeo; Zara, Ivano; Valle, Giorgio; Deegan née Clark, Jennifer

    2009-01-01

    Background The Gene Ontology Project provides structured controlled vocabularies for molecular biology that can be used for the functional annotation of genes and gene products. In a collaboration between the Gene Ontology (GO) Consortium and the muscle biology community, we have made large-scale additions to the GO biological process and cellular component ontologies. The main focus of this ontology development work concerns skeletal muscle, with specific consideration given to the processes of muscle contraction, plasticity, development, and regeneration, and to the sarcomere and membrane-delimited compartments. Our aims were to update the existing structure to reflect current knowledge, and to resolve, in an accommodating manner, the ambiguity in the language used by the community. Results The updated muscle terminologies have been incorporated into the GO. There are now 159 new terms covering critical research areas, and 57 existing terms have been improved and reorganized to follow their usage in muscle literature. Conclusion The revised GO structure should improve the interpretation of data from high-throughput (e.g. microarray and proteomic) experiments in the area of muscle science and muscle disease. We actively encourage community feedback on, and gene product annotation with these new terms. Please visit the Muscle Community Annotation Wiki . PMID:19178689

  9. Recognition of the polycistronic nature of human genes is critical to understanding the genotype-phenotype relationship.

    PubMed

    Brunet, Marie A; Levesque, Sébastien A; Hunting, Darel J; Cohen, Alan A; Roucou, Xavier

    2018-05-01

    Technological advances promise unprecedented opportunities for whole exome sequencing and proteomic analyses of populations. Currently, data from genome and exome sequencing or proteomic studies are searched against reference genome annotations. This provides the foundation for research and clinical screening for genetic causes of pathologies. However, current genome annotations substantially underestimate the proteomic information encoded within a gene. Numerous studies have now demonstrated the expression and function of alternative (mainly small, sometimes overlapping) ORFs within mature gene transcripts. This has important consequences for the correlation of phenotypes and genotypes. Most alternative ORFs are not yet annotated because of a lack of evidence, and this absence from databases precludes their detection by standard proteomic methods, such as mass spectrometry. Here, we demonstrate how current approaches tend to overlook alternative ORFs, hindering the discovery of new genetic drivers and fundamental research. We discuss available tools and techniques to improve identification of proteins from alternative ORFs and finally suggest a novel annotation system to permit a more complete representation of the transcriptomic and proteomic information contained within a gene. Given the crucial challenge of distinguishing functional ORFs from random ones, the suggested pipeline emphasizes both experimental data and conservation signatures. The addition of alternative ORFs in databases will render identification less serendipitous and advance the pace of research and genomic knowledge. This review highlights the urgent medical and research need to incorporate alternative ORFs in current genome annotations and thus permit their inclusion in hypotheses and models, which relate phenotypes and genotypes. © 2018 Brunet et al.; Published by Cold Spring Harbor Laboratory Press.

  10. The BioC-BioGRID corpus: full text articles annotated for curation of protein–protein and genetic interactions

    PubMed Central

    Kim, Sun; Chatr-aryamontri, Andrew; Chang, Christie S.; Oughtred, Rose; Rust, Jennifer; Wilbur, W. John; Comeau, Donald C.; Dolinski, Kara; Tyers, Mike

    2017-01-01

    A great deal of information on the molecular genetics and biochemistry of model organisms has been reported in the scientific literature. However, this data is typically described in free text form and is not readily amenable to computational analyses. To this end, the BioGRID database systematically curates the biomedical literature for genetic and protein interaction data. This data is provided in a standardized computationally tractable format and includes structured annotation of experimental evidence. BioGRID curation necessarily involves substantial human effort by expert curators who must read each publication to extract the relevant information. Computational text-mining methods offer the potential to augment and accelerate manual curation. To facilitate the development of practical text-mining strategies, a new challenge was organized in BioCreative V for the BioC task, the collaborative Biocurator Assistant Task. This was a non-competitive, cooperative task in which the participants worked together to build BioC-compatible modules into an integrated pipeline to assist BioGRID curators. As an integral part of this task, a test collection of full text articles was developed that contained both biological entity annotations (gene/protein and organism/species) and molecular interaction annotations (protein–protein and genetic interactions (PPIs and GIs)). This collection, which we call the BioC-BioGRID corpus, was annotated by four BioGRID curators over three rounds of annotation and contains 120 full text articles curated in a dataset representing two major model organisms, namely budding yeast and human. The BioC-BioGRID corpus contains annotations for 6409 mentions of genes and their Entrez Gene IDs, 186 mentions of organism names and their NCBI Taxonomy IDs, 1867 mentions of PPIs and 701 annotations of PPI experimental evidence statements, 856 mentions of GIs and 399 annotations of GI evidence statements. The purpose, characteristics and possible future uses of the BioC-BioGRID corpus are detailed in this report. Database URL: http://bioc.sourceforge.net/BioC-BioGRID.html PMID:28077563

  11. LS-SNP/PDB: annotated non-synonymous SNPs mapped to Protein Data Bank structures.

    PubMed

    Ryan, Michael; Diekhans, Mark; Lien, Stephanie; Liu, Yun; Karchin, Rachel

    2009-06-01

    LS-SNP/PDB is a new WWW resource for genome-wide annotation of human non-synonymous (amino acid changing) SNPs. It serves high-quality protein graphics rendered with UCSF Chimera molecular visualization software. The system is kept up-to-date by an automated, high-throughput build pipeline that systematically maps human nsSNPs onto Protein Data Bank structures and annotates several biologically relevant features. LS-SNP/PDB is available at (http://ls-snp.icm.jhu.edu/ls-snp-pdb) and via links from protein data bank (PDB) biology and chemistry tabs, UCSC Genome Browser Gene Details and SNP Details pages and PharmGKB Gene Variants Downloads/Cross-References pages.

  12. The standard operating procedure of the DOE-JGI Microbial Genome Annotation Pipeline (MGAP v.4)

    DOE PAGES

    Huntemann, Marcel; Ivanova, Natalia N.; Mavromatis, Konstantinos; ...

    2015-10-26

    The DOE-JGI Microbial Genome Annotation Pipeline performs structural and functional annotation of microbial genomes that are further included into the Integrated Microbial Genome comparative analysis system. MGAP is applied to assembled nucleotide sequence datasets that are provided via the IMG submission site. Dataset submission for annotation first requires project and associated metadata description in GOLD. The MGAP sequence data processing consists of feature prediction including identification of protein-coding genes, non-coding RNAs and regulatory RNA features, as well as CRISPR elements. In conclusion, structural annotation is followed by assignment of protein product names and functions.

  13. The standard operating procedure of the DOE-JGI Microbial Genome Annotation Pipeline (MGAP v.4)

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Huntemann, Marcel; Ivanova, Natalia N.; Mavromatis, Konstantinos

    The DOE-JGI Microbial Genome Annotation Pipeline performs structural and functional annotation of microbial genomes that are further included into the Integrated Microbial Genome comparative analysis system. MGAP is applied to assembled nucleotide sequence datasets that are provided via the IMG submission site. Dataset submission for annotation first requires project and associated metadata description in GOLD. The MGAP sequence data processing consists of feature prediction including identification of protein-coding genes, non-coding RNAs and regulatory RNA features, as well as CRISPR elements. In conclusion, structural annotation is followed by assignment of protein product names and functions.

  14. Genome-wide profiling of 24 hr diel rhythmicity in the water flea, Daphnia pulex: network analysis reveals rhythmic gene expression and enhances functional gene annotation.

    PubMed

    Rund, Samuel S C; Yoo, Boyoung; Alam, Camille; Green, Taryn; Stephens, Melissa T; Zeng, Erliang; George, Gary F; Sheppard, Aaron D; Duffield, Giles E; Milenković, Tijana; Pfrender, Michael E

    2016-08-18

    Marine and freshwater zooplankton exhibit daily rhythmic patterns of behavior and physiology which may be regulated directly by the light:dark (LD) cycle and/or a molecular circadian clock. One of the best-studied zooplankton taxa, the freshwater crustacean Daphnia, has a 24 h diel vertical migration (DVM) behavior whereby the organism travels up and down through the water column daily. DVM plays a critical role in resource tracking and the behavioral avoidance of predators and damaging ultraviolet radiation. However, there is little information at the transcriptional level linking the expression patterns of genes to the rhythmic physiology/behavior of Daphnia. Here we analyzed genome-wide temporal transcriptional patterns from Daphnia pulex collected over a 44 h time period under a 12:12 LD cycle (diel) conditions using a cosine-fitting algorithm. We used a comprehensive network modeling and analysis approach to identify novel co-regulated rhythmic genes that have similar network topological properties and functional annotations as rhythmic genes identified by the cosine-fitting analyses. Furthermore, we used the network approach to predict with high accuracy novel gene-function associations, thus enhancing current functional annotations available for genes in this ecologically relevant model species. Our results reveal that genes in many functional groupings exhibit 24 h rhythms in their expression patterns under diel conditions. We highlight the rhythmic expression of immunity, oxidative detoxification, and sensory process genes. We discuss differences in the chronobiology of D. pulex from other well-characterized terrestrial arthropods. This research adds to a growing body of literature suggesting the genetic mechanisms governing rhythmicity in crustaceans may be divergent from other arthropod lineages including insects. Lastly, these results highlight the power of using a network analysis approach to identify differential gene expression and provide novel functional annotation.

  15. Lessons learned from the initial sequencing of the pig genome: comparative analysis of an 8 Mb region of pig chromosome 17

    PubMed Central

    Hart, Elizabeth A; Caccamo, Mario; Harrow, Jennifer L; Humphray, Sean J; Gilbert, James GR; Trevanion, Steve; Hubbard, Tim; Rogers, Jane; Rothschild, Max F

    2007-01-01

    Background We describe here the sequencing, annotation and comparative analysis of an 8 Mb region of pig chromosome 17, which provides a useful test region to assess coverage and quality for the pig genome sequencing project. We report our findings comparing the annotation of draft sequence assembled at different depths of coverage. Results Within this region we annotated 71 loci, of which 53 are orthologous to human known coding genes. When compared to the syntenic regions in human (20q13.13-q13.33) and mouse (chromosome 2, 167.5 Mb-178.3 Mb), this region was found to be highly conserved with respect to gene order. The most notable difference between the three species is the presence of a large expansion of zinc finger coding genes and pseudogenes on mouse chromosome 2 between Edn3 and Phactr3 that is absent from pig and human. All of our annotation has been made publicly available in the Vertebrate Genome Annotation browser, VEGA. We assessed the impact of coverage on sequence assembly across this region and found, as expected, that increased sequence depth resulted in fewer, longer contigs. One-third of our annotated loci could not be fully re-aligned back to the low coverage version of the sequence, principally because the transcripts are fragmented over several contigs. Conclusion We have demonstrated the considerable advantages of sequencing at increased read depths and discuss the implications that lower coverage sequence may have on subsequent comparative and functional studies, particularly those involving complex loci such as GNAS. PMID:17705864

  16. Linking microarray reporters with protein functions.

    PubMed

    Gaj, Stan; van Erk, Arie; van Haaften, Rachel I M; Evelo, Chris T A

    2007-09-26

    The analysis of microarray experiments requires accurate and up-to-date functional annotation of the microarray reporters to optimize the interpretation of the biological processes involved. Pathway visualization tools are used to connect gene expression data with existing biological pathways by using specific database identifiers that link reporters with elements in the pathways. This paper proposes a novel method that aims to improve microarray reporter annotation by BLASTing the original reporter sequences against a species-specific EMBL subset, that was derived from and crosslinked back to the highly curated UniProt database. The resulting alignments were filtered using high quality alignment criteria and further compared with the outcome of a more traditional approach, where reporter sequences were BLASTed against EnsEMBL followed by locating the corresponding protein (UniProt) entry for the high quality hits. Combining the results of both methods resulted in successful annotation of > 58% of all reporter sequences with UniProt IDs on two commercial array platforms, increasing the amount of Incyte reporters that could be coupled to Gene Ontology terms from 32.7% to 58.3% and to a local GenMAPP pathway from 9.6% to 16.7%. For Agilent, 35.3% of the total reporters are now linked towards GO nodes and 7.1% on local pathways. Our methods increased the annotation quality of microarray reporter sequences and allowed us to visualize more reporters using pathway visualization tools. Even in cases where the original reporter annotation showed the correct description the new identifiers often allowed improved pathway and Gene Ontology linking. These methods are freely available at http://www.bigcat.unimaas.nl/public/publications/Gaj_Annotation/.

  17. Highlighting the Need for Systems-Level Experimental Characterization of Plant Metabolic Enzymes.

    PubMed

    Engqvist, Martin K M

    2016-01-01

    The biology of living organisms is determined by the action and interaction of a large number of individual gene products, each with specific functions. Discovering and annotating the function of gene products is key to our understanding of these organisms. Controlled experiments and bioinformatic predictions both contribute to functional gene annotation. For most species it is difficult to gain an overview of what portion of gene annotations are based on experiments and what portion represent predictions. Here, I survey the current state of experimental knowledge of enzymes and metabolism in Arabidopsis thaliana as well as eleven economically important crops and forestry trees - with a particular focus on reactions involving organic acids in central metabolism. I illustrate the limited availability of experimental data for functional annotation of enzymes in most of these species. Many enzymes involved in metabolism of citrate, malate, fumarate, lactate, and glycolate in crops and forestry trees have not been characterized. Furthermore, enzymes involved in key biosynthetic pathways which shape important traits in crops and forestry trees have not been characterized. I argue for the development of novel high-throughput platforms with which limited functional characterization of gene products can be performed quickly and relatively cheaply. I refer to this approach as systems-level experimental characterization. The data collected from such platforms would form a layer intermediate between bioinformatic gene function predictions and in-depth experimental studies of these functions. Such a data layer would greatly aid in the pursuit of understanding a multiplicity of biological processes in living organisms.

  18. Archaeal Clusters of Orthologous Genes (arCOGs): An Update and Application for Analysis of Shared Features between Thermococcales, Methanococcales, and Methanobacteriales

    PubMed Central

    Makarova, Kira S.; Wolf, Yuri I.; Koonin, Eugene V.

    2015-01-01

    With the continuously accelerating genome sequencing from diverse groups of archaea and bacteria, accurate identification of gene orthology and availability of readily expandable clusters of orthologous genes are essential for the functional annotation of new genomes. We report an update of the collection of archaeal Clusters of Orthologous Genes (arCOGs) to cover, on average, 91% of the protein-coding genes in 168 archaeal genomes. The new arCOGs were constructed using refined algorithms for orthology identification combined with extensive manual curation, including incorporation of the results of several completed and ongoing research projects in archaeal genomics. A new level of classification is introduced, superclusters that unit two or more arCOGs and more completely reflect gene family evolution than individual, disconnected arCOGs. Assessment of the current archaeal genome annotation in public databases indicates that consistent use of arCOGs can significantly improve the annotation quality. In addition to their utility for genome annotation, arCOGs also are a platform for phylogenomic analysis. We explore this aspect of arCOGs by performing a phylogenomic study of the Thermococci that are traditionally viewed as the basal branch of the Euryarchaeota. The results of phylogenomic analysis that involved both comparison of multiple phylogenetic trees and a search for putative derived shared characters by using phyletic patterns extracted from the arCOGs reveal a likely evolutionary relationship between the Thermococci, Methanococci, and Methanobacteria. The arCOGs are expected to be instrumental for a comprehensive phylogenomic study of the archaea. PMID:25764277

  19. Metagenomics of rumen bacteriophage from thirteen lactating dairy cattle

    PubMed Central

    2013-01-01

    Background The bovine rumen hosts a diverse and complex community of Eukarya, Bacteria, Archea and viruses (including bacteriophage). The rumen viral population (the rumen virome) has received little attention compared to the rumen microbial population (the rumen microbiome). We used massively parallel sequencing of virus like particles to investigate the diversity of the rumen virome in thirteen lactating Australian Holstein dairy cattle all housed in the same location, 12 of which were sampled on the same day. Results Fourteen putative viral sequence fragments over 30 Kbp in length were assembled and annotated. Many of the putative genes in the assembled contigs showed no homology to previously annotated genes, highlighting the large amount of work still required to fully annotate the functions encoded in viral genomes. The abundance of the contig sequences varied widely between animals, even though the cattle were of the same age, stage of lactation and fed the same diets. Additionally the twelve animals which were co-habited shared a number of their dominant viral contigs. We compared the functional characteristics of our bovine viromes with that of other viromes, as well as rumen microbiomes. At the functional level, we found strong similarities between all of the viral samples, which were highly distinct from the rumen microbiome samples. Conclusions Our findings suggest a large amount of between animal variation in the bovine rumen virome and that co-habiting animals may have more similar viromes than non co-habited animals. We report the deepest sequencing to date of the rumen virome. This work highlights the enormous amount of novelty and variation present in the rumen virome. PMID:24180266

  20. pGenN, a Gene Normalization Tool for Plant Genes and Proteins in Scientific Literature

    PubMed Central

    Ding, Ruoyao; Arighi, Cecilia N.; Lee, Jung-Youn; Wu, Cathy H.; Vijay-Shanker, K.

    2015-01-01

    Background Automatically detecting gene/protein names in the literature and connecting them to databases records, also known as gene normalization, provides a means to structure the information buried in free-text literature. Gene normalization is critical for improving the coverage of annotation in the databases, and is an essential component of many text mining systems and database curation pipelines. Methods In this manuscript, we describe a gene normalization system specifically tailored for plant species, called pGenN (pivot-based Gene Normalization). The system consists of three steps: dictionary-based gene mention detection, species assignment, and intra species normalization. We have developed new heuristics to improve each of these phases. Results We evaluated the performance of pGenN on an in-house expertly annotated corpus consisting of 104 plant relevant abstracts. Our system achieved an F-value of 88.9% (Precision 90.9% and Recall 87.2%) on this corpus, outperforming state-of-art systems presented in BioCreative III. We have processed over 440,000 plant-related Medline abstracts using pGenN. The gene normalization results are stored in a local database for direct query from the pGenN web interface (proteininformationresource.org/pgenn/). The annotated literature corpus is also publicly available through the PIR text mining portal (proteininformationresource.org/iprolink/). PMID:26258475

  1. MicroScope-an integrated resource for community expertise of gene functions and comparative analysis of microbial genomic and metabolic data.

    PubMed

    Médigue, Claudine; Calteau, Alexandra; Cruveiller, Stéphane; Gachet, Mathieu; Gautreau, Guillaume; Josso, Adrien; Lajus, Aurélie; Langlois, Jordan; Pereira, Hugo; Planel, Rémi; Roche, David; Rollin, Johan; Rouy, Zoe; Vallenet, David

    2017-09-12

    The overwhelming list of new bacterial genomes becoming available on a daily basis makes accurate genome annotation an essential step that ultimately determines the relevance of thousands of genomes stored in public databanks. The MicroScope platform (http://www.genoscope.cns.fr/agc/microscope) is an integrative resource that supports systematic and efficient revision of microbial genome annotation, data management and comparative analysis. Starting from the results of our syntactic, functional and relational annotation pipelines, MicroScope provides an integrated environment for the expert annotation and comparative analysis of prokaryotic genomes. It combines tools and graphical interfaces to analyze genomes and to perform the manual curation of gene function in a comparative genomics and metabolic context. In this article, we describe the free-of-charge MicroScope services for the annotation and analysis of microbial (meta)genomes, transcriptomic and re-sequencing data. Then, the functionalities of the platform are presented in a way providing practical guidance and help to the nonspecialists in bioinformatics. Newly integrated analysis tools (i.e. prediction of virulence and resistance genes in bacterial genomes) and original method recently developed (the pan-genome graph representation) are also described. Integrated environments such as MicroScope clearly contribute, through the user community, to help maintaining accurate resources. © The Author 2017. Published by Oxford University Press.

  2. GRN2SBML: automated encoding and annotation of inferred gene regulatory networks complying with SBML.

    PubMed

    Vlaic, Sebastian; Hoffmann, Bianca; Kupfer, Peter; Weber, Michael; Dräger, Andreas

    2013-09-01

    GRN2SBML automatically encodes gene regulatory networks derived from several inference tools in systems biology markup language. Providing a graphical user interface, the networks can be annotated via the simple object access protocol (SOAP)-based application programming interface of BioMart Central Portal and minimum information required in the annotation of models registry. Additionally, we provide an R-package, which processes the output of supported inference algorithms and automatically passes all required parameters to GRN2SBML. Therefore, GRN2SBML closes a gap in the processing pipeline between the inference of gene regulatory networks and their subsequent analysis, visualization and storage. GRN2SBML is freely available under the GNU Public License version 3 and can be downloaded from http://www.hki-jena.de/index.php/0/2/490. General information on GRN2SBML, examples and tutorials are available at the tool's web page.

  3. Sheep genome functional annotation reveals proximal regulatory elements contributed to the evolution of modern breeds.

    PubMed

    Naval-Sanchez, Marina; Nguyen, Quan; McWilliam, Sean; Porto-Neto, Laercio R; Tellam, Ross; Vuocolo, Tony; Reverter, Antonio; Perez-Enciso, Miguel; Brauning, Rudiger; Clarke, Shannon; McCulloch, Alan; Zamani, Wahid; Naderi, Saeid; Rezaei, Hamid Reza; Pompanon, Francois; Taberlet, Pierre; Worley, Kim C; Gibbs, Richard A; Muzny, Donna M; Jhangiani, Shalini N; Cockett, Noelle; Daetwyler, Hans; Kijas, James

    2018-02-28

    Domestication fundamentally reshaped animal morphology, physiology and behaviour, offering the opportunity to investigate the molecular processes driving evolutionary change. Here we assess sheep domestication and artificial selection by comparing genome sequence from 43 modern breeds (Ovis aries) and their Asian mouflon ancestor (O. orientalis) to identify selection sweeps. Next, we provide a comparative functional annotation of the sheep genome, validated using experimental ChIP-Seq of sheep tissue. Using these annotations, we evaluate the impact of selection and domestication on regulatory sequences and find that sweeps are significantly enriched for protein coding genes, proximal regulatory elements of genes and genome features associated with active transcription. Finally, we find individual sites displaying strong allele frequency divergence are enriched for the same regulatory features. Our data demonstrate that remodelling of gene expression is likely to have been one of the evolutionary forces that drove phenotypic diversification of this common livestock species.

  4. Statistical assessment of crosstalk enrichment between gene groups in biological networks.

    PubMed

    McCormack, Theodore; Frings, Oliver; Alexeyenko, Andrey; Sonnhammer, Erik L L

    2013-01-01

    Analyzing groups of functionally coupled genes or proteins in the context of global interaction networks has become an important aspect of bioinformatic investigations. Assessing the statistical significance of crosstalk enrichment between or within groups of genes can be a valuable tool for functional annotation of experimental gene sets. Here we present CrossTalkZ, a statistical method and software to assess the significance of crosstalk enrichment between pairs of gene or protein groups in large biological networks. We demonstrate that the standard z-score is generally an appropriate and unbiased statistic. We further evaluate the ability of four different methods to reliably recover crosstalk within known biological pathways. We conclude that the methods preserving the second-order topological network properties perform best. Finally, we show how CrossTalkZ can be used to annotate experimental gene sets using known pathway annotations and that its performance at this task is superior to gene enrichment analysis (GEA). CrossTalkZ (available at http://sonnhammer.sbc.su.se/download/software/CrossTalkZ/) is implemented in C++, easy to use, fast, accepts various input file formats, and produces a number of statistics. These include z-score, p-value, false discovery rate, and a test of normality for the null distributions.

  5. DaGO-Fun: tool for Gene Ontology-based functional analysis using term information content measures

    PubMed Central

    2013-01-01

    Background The use of Gene Ontology (GO) data in protein analyses have largely contributed to the improved outcomes of these analyses. Several GO semantic similarity measures have been proposed in recent years and provide tools that allow the integration of biological knowledge embedded in the GO structure into different biological analyses. There is a need for a unified tool that provides the scientific community with the opportunity to explore these different GO similarity measure approaches and their biological applications. Results We have developed DaGO-Fun, an online tool available at http://web.cbio.uct.ac.za/ITGOM, which incorporates many different GO similarity measures for exploring, analyzing and comparing GO terms and proteins within the context of GO. It uses GO data and UniProt proteins with their GO annotations as provided by the Gene Ontology Annotation (GOA) project to precompute GO term information content (IC), enabling rapid response to user queries. Conclusions The DaGO-Fun online tool presents the advantage of integrating all the relevant IC-based GO similarity measures, including topology- and annotation-based approaches to facilitate effective exploration of these measures, thus enabling users to choose the most relevant approach for their application. Furthermore, this tool includes several biological applications related to GO semantic similarity scores, including the retrieval of genes based on their GO annotations, the clustering of functionally related genes within a set, and term enrichment analysis. PMID:24067102

  6. Data on the genome-wide identification of CNL R-genes in Setaria italica (L.) P. Beauv.

    PubMed

    Andersen, Ethan J; Nepal, Madhav P

    2017-08-01

    We report data associated with the identification of 242 disease resistance genes (R-genes) in the genome of Setaria italica as presented in "Genetic diversity of disease resistance genes in foxtail millet ( Setaria italica L.)" (Andersen and Nepal, 2017) [1]. Our data describe the structure and evolution of the Coiled-coil, Nucleotide-binding site, Leucine-rich repeat (CNL) R-genes in foxtail millet. The CNL genes were identified through rigorous extraction and analysis of recently available plant genome sequences using cutting-edge analytical software. Data visualization includes gene structure diagrams, chromosomal syntenic maps, a chromosomal density plot, and a maximum-likelihood phylogenetic tree comparing Sorghum bicolor , Panicum virgatum , Setaria italica , and Arabidopsis thaliana . Compilation of InterProScan annotations, Gene Ontology (GO) annotations, and Basic Local Alignment Search Tool (BLAST) results for the 242 R-genes identified in the foxtail millet genome are also included in tabular format.

  7. Genome content analysis yields new insights into the relationship between the human malaria parasite Plasmodium falciparum and its anopheline vectors.

    PubMed

    Oppenheim, Sara J; Rosenfeld, Jeffrey A; DeSalle, Rob

    2017-02-27

    The persistent and growing gap between the availability of sequenced genomes and the ability to assign functions to sequenced genes led us to explore ways to maximize the information content of automated annotation for studies of anopheline mosquitos. Specifically, we use genome content analysis of a large number of previously sequenced anopheline mosquitos to follow the loss and gain of protein families over the evolutionary history of this group. The importance of this endeavor lies in the potential for comparative genomic studies between Anopheles and closely related non-vector species to reveal ancestral genome content dynamics involved in vector competence. In addition, comparisons within Anopheles could identify genome content changes responsible for variation in the vectorial capacity of this family of important parasite vectors. The competence and capacity of P. falciparum vectors do not appear to be phylogenetically constrained within the Anophelinae. Instead, using ancestral reconstruction methods, we suggest that a previously unexamined component of vector biology, anopheline nucleotide metabolism, may contribute to the unique status of anophelines as P. falciparum vectors. While the fitness effects of nucleotide co-option by P. falciparum parasites on their anopheline hosts are not yet known, our results suggest that anopheline genome content may be responding to selection pressure from P. falciparum. Whether this response is defensive, in an attempt to redress improper nucleotide balance resulting from P. falciparum infection, or perhaps symbiotic, resulting from an as-yet-unknown mutualism between anophelines and P. falciparum, is an open question that deserves further study. Clearly, there is a wealth of functional information to be gained from detailed manual genome annotation, yet the rapid increase in the number of available sequences means that most researchers will not have the time or resources to manually annotate all the sequence data they generate. We believe that efforts to maximize the amount of information obtained from automated annotation can help address the functional annotation deficit that most evolutionary biologists now face, and here demonstrate the value of such an approach.

  8. Comparative genomics approaches to understanding and manipulating plant metabolism.

    PubMed

    Bradbury, Louis M T; Niehaus, Tom D; Hanson, Andrew D

    2013-04-01

    Over 3000 genomes, including numerous plant genomes, are now sequenced. However, their annotation remains problematic as illustrated by the many conserved genes with no assigned function, vague annotations such as 'kinase', or even wrong ones. Around 40% of genes of unknown function that are conserved between plants and microbes are probably metabolic enzymes or transporters; finding functions for these genes is a major challenge. Comparative genomics has correctly predicted functions for many such genes by analyzing genomic context, and gene fusions, distributions and co-expression. Comparative genomics complements genetic and biochemical approaches to dissect metabolism, continues to increase in power and decrease in cost, and has a pivotal role in modeling and engineering by helping identify functions for all metabolic genes. Copyright © 2012 Elsevier Ltd. All rights reserved.

  9. Concept annotation in the CRAFT corpus.

    PubMed

    Bada, Michael; Eckert, Miriam; Evans, Donald; Garcia, Kristin; Shipley, Krista; Sitnikov, Dmitry; Baumgartner, William A; Cohen, K Bretonnel; Verspoor, Karin; Blake, Judith A; Hunter, Lawrence E

    2012-07-09

    Manually annotated corpora are critical for the training and evaluation of automated methods to identify concepts in biomedical text. This paper presents the concept annotations of the Colorado Richly Annotated Full-Text (CRAFT) Corpus, a collection of 97 full-length, open-access biomedical journal articles that have been annotated both semantically and syntactically to serve as a research resource for the biomedical natural-language-processing (NLP) community. CRAFT identifies all mentions of nearly all concepts from nine prominent biomedical ontologies and terminologies: the Cell Type Ontology, the Chemical Entities of Biological Interest ontology, the NCBI Taxonomy, the Protein Ontology, the Sequence Ontology, the entries of the Entrez Gene database, and the three subontologies of the Gene Ontology. The first public release includes the annotations for 67 of the 97 articles, reserving two sets of 15 articles for future text-mining competitions (after which these too will be released). Concept annotations were created based on a single set of guidelines, which has enabled us to achieve consistently high interannotator agreement. As the initial 67-article release contains more than 560,000 tokens (and the full set more than 790,000 tokens), our corpus is among the largest gold-standard annotated biomedical corpora. Unlike most others, the journal articles that comprise the corpus are drawn from diverse biomedical disciplines and are marked up in their entirety. Additionally, with a concept-annotation count of nearly 100,000 in the 67-article subset (and more than 140,000 in the full collection), the scale of conceptual markup is also among the largest of comparable corpora. The concept annotations of the CRAFT Corpus have the potential to significantly advance biomedical text mining by providing a high-quality gold standard for NLP systems. The corpus, annotation guidelines, and other associated resources are freely available at http://bionlp-corpora.sourceforge.net/CRAFT/index.shtml.

  10. Concept annotation in the CRAFT corpus

    PubMed Central

    2012-01-01

    Background Manually annotated corpora are critical for the training and evaluation of automated methods to identify concepts in biomedical text. Results This paper presents the concept annotations of the Colorado Richly Annotated Full-Text (CRAFT) Corpus, a collection of 97 full-length, open-access biomedical journal articles that have been annotated both semantically and syntactically to serve as a research resource for the biomedical natural-language-processing (NLP) community. CRAFT identifies all mentions of nearly all concepts from nine prominent biomedical ontologies and terminologies: the Cell Type Ontology, the Chemical Entities of Biological Interest ontology, the NCBI Taxonomy, the Protein Ontology, the Sequence Ontology, the entries of the Entrez Gene database, and the three subontologies of the Gene Ontology. The first public release includes the annotations for 67 of the 97 articles, reserving two sets of 15 articles for future text-mining competitions (after which these too will be released). Concept annotations were created based on a single set of guidelines, which has enabled us to achieve consistently high interannotator agreement. Conclusions As the initial 67-article release contains more than 560,000 tokens (and the full set more than 790,000 tokens), our corpus is among the largest gold-standard annotated biomedical corpora. Unlike most others, the journal articles that comprise the corpus are drawn from diverse biomedical disciplines and are marked up in their entirety. Additionally, with a concept-annotation count of nearly 100,000 in the 67-article subset (and more than 140,000 in the full collection), the scale of conceptual markup is also among the largest of comparable corpora. The concept annotations of the CRAFT Corpus have the potential to significantly advance biomedical text mining by providing a high-quality gold standard for NLP systems. The corpus, annotation guidelines, and other associated resources are freely available at http://bionlp-corpora.sourceforge.net/CRAFT/index.shtml. PMID:22776079

  11. Annotated Bibliography on the Teaching of Psychology: 2007

    ERIC Educational Resources Information Center

    Johnson, David E.; Schroder, Simone I.; Erickson, Jonathan P.; Grimes, Katherine N.

    2008-01-01

    This annotated bibliography is a continuation of those previously published in "Teaching of Psychology" (e.g., Berry & Daniel, 1984; Fulkerson & Wise, 1987; Johnson & Schroder, 1997; Wise & Fulkerson, 1996). In this paper, the authors maintained similar search methods and criteria for inclusion that were used in previous bibliographies. They also…

  12. Gene Ontology-Based Analysis of Zebrafish Omics Data Using the Web Tool Comparative Gene Ontology.

    PubMed

    Ebrahimie, Esmaeil; Fruzangohar, Mario; Moussavi Nik, Seyyed Hani; Newman, Morgan

    2017-10-01

    Gene Ontology (GO) analysis is a powerful tool in systems biology, which uses a defined nomenclature to annotate genes/proteins within three categories: "Molecular Function," "Biological Process," and "Cellular Component." GO analysis can assist in revealing functional mechanisms underlying observed patterns in transcriptomic, genomic, and proteomic data. The already extensive and increasing use of zebrafish for modeling genetic and other diseases highlights the need to develop a GO analytical tool for this organism. The web tool Comparative GO was originally developed for GO analysis of bacterial data in 2013 ( www.comparativego.com ). We have now upgraded and elaborated this web tool for analysis of zebrafish genetic data using GOs and annotations from the Gene Ontology Consortium.

  13. GenomeRNAi: a database for cell-based RNAi phenotypes.

    PubMed

    Horn, Thomas; Arziman, Zeynep; Berger, Juerg; Boutros, Michael

    2007-01-01

    RNA interference (RNAi) has emerged as a powerful tool to generate loss-of-function phenotypes in a variety of organisms. Combined with the sequence information of almost completely annotated genomes, RNAi technologies have opened new avenues to conduct systematic genetic screens for every annotated gene in the genome. As increasing large datasets of RNAi-induced phenotypes become available, an important challenge remains the systematic integration and annotation of functional information. Genome-wide RNAi screens have been performed both in Caenorhabditis elegans and Drosophila for a variety of phenotypes and several RNAi libraries have become available to assess phenotypes for almost every gene in the genome. These screens were performed using different types of assays from visible phenotypes to focused transcriptional readouts and provide a rich data source for functional annotation across different species. The GenomeRNAi database provides access to published RNAi phenotypes obtained from cell-based screens and maps them to their genomic locus, including possible non-specific regions. The database also gives access to sequence information of RNAi probes used in various screens. It can be searched by phenotype, by gene, by RNAi probe or by sequence and is accessible at http://rnai.dkfz.de.

  14. GenomeRNAi: a database for cell-based RNAi phenotypes

    PubMed Central

    Horn, Thomas; Arziman, Zeynep; Berger, Juerg; Boutros, Michael

    2007-01-01

    RNA interference (RNAi) has emerged as a powerful tool to generate loss-of-function phenotypes in a variety of organisms. Combined with the sequence information of almost completely annotated genomes, RNAi technologies have opened new avenues to conduct systematic genetic screens for every annotated gene in the genome. As increasing large datasets of RNAi-induced phenotypes become available, an important challenge remains the systematic integration and annotation of functional information. Genome-wide RNAi screens have been performed both in Caenorhabditis elegans and Drosophila for a variety of phenotypes and several RNAi libraries have become available to assess phenotypes for almost every gene in the genome. These screens were performed using different types of assays from visible phenotypes to focused transcriptional readouts and provide a rich data source for functional annotation across different species. The GenomeRNAi database provides access to published RNAi phenotypes obtained from cell-based screens and maps them to their genomic locus, including possible non-specific regions. The database also gives access to sequence information of RNAi probes used in various screens. It can be searched by phenotype, by gene, by RNAi probe or by sequence and is accessible at PMID:17135194

  15. Comparative Metabolomics of Mycoplasma bovis and Mycoplasma gallisepticum Reveals Fundamental Differences in Active Metabolic Pathways and Suggests Novel Gene Annotations.

    PubMed

    Masukagami, Y; De Souza, D P; Dayalan, S; Bowen, C; O'Callaghan, S; Kouremenos, K; Nijagal, B; Tull, D; Tivendale, K A; Markham, P F; McConville, M J; Browning, G F; Sansom, F M

    2017-01-01

    Mycoplasmas are simple, but successful parasites that have the smallest genome of any free-living cell and are thought to have a highly streamlined cellular metabolism. Here, we have undertaken a detailed metabolomic analysis of two species, Mycoplasma bovis and Mycoplasma gallisepticum , which cause economically important diseases in cattle and poultry, respectively. Untargeted gas chromatography-mass spectrometry and liquid chromatography-mass spectrometry analyses of mycoplasma metabolite extracts revealed significant differences in the steady-state levels of many metabolites in central carbon metabolism, while 13 C stable isotope labeling studies revealed marked differences in carbon source utilization. These data were mapped onto in silico metabolic networks predicted from genome wide annotations. The analyses elucidated distinct differences, including a clear difference in glucose utilization, with a marked decrease in glucose uptake and glycolysis in M. bovis compared to M. gallisepticum , which may reflect differing host nutrient availabilities. The 13 C-labeling patterns also revealed several functional metabolic pathways that were previously unannotated in these species, allowing us to assign putative enzyme functions to the products of a number of genes of unknown function, especially in M. bovis . This study demonstrates the considerable potential of metabolomic analyses to assist in characterizing significant differences in the metabolism of different bacterial species and in improving genome annotation. IMPORTANCE Mycoplasmas are pathogenic bacteria that cause serious chronic infections in production animals, resulting in considerable losses worldwide, as well as causing disease in humans. These bacteria have extremely reduced genomes and are thought to have limited metabolic flexibility, even though they are highly successful persistent parasites in a diverse number of species. The extent to which different Mycoplasma species are capable of catabolizing host carbon sources and nutrients, or synthesizing essential metabolites, remains poorly defined. We have used advanced metabolomic techniques to identify metabolic pathways that are active in two species of Mycoplasma that infect distinct hosts (poultry and cattle). We show that these species exhibit marked differences in metabolite steady-state levels and carbon source utilization. This information has been used to functionally characterize previously unknown genes in the genomes of these pathogens. These species-specific differences are likely to reflect important differences in host nutrient levels and pathogenic mechanisms.

  16. Differential Gene Expression Profiling of Mouse Uterine Luminal Epithelium During Periimplantation

    PubMed Central

    Xiao, Shuo; Diao, Honglu; Zhao, Fei; Li, Rong; He, Naya

    2014-01-01

    Uterine luminal epithelium (LE) is critical for establishing uterine receptivity. Microarray analysis of gestation day 3.5 (D3.5, preimplantation) and D4.5 (postimplantation) LE from natural pregnant mice identified 382 upregulated and 245 downregulated genes in the D4.5 LE. Gene Ontology annotation grouped 186 upregulated and 103 downregulated genes into 22 and 15 enriched subcategories, respectively, in regulating DNA-dependent transcription, metabolism, cell morphology, ion transport, immune response, apoptosis, signal transduction, and so on. Signaling pathway analysis revealed 99 genes in 21 significantly changed signaling pathways, with 14 of these pathways involved in metabolism. In situ hybridization confirmed the temporal expression of 12 previously uncharacterized genes, including Atp6v0a4, Atp6v0d2, F3, Ggh, Tmprss11d, Tmprss13, Anpep, Fxyd4, Naip5, Npl, Nudt19, and Tpm1 in the periimplantation uterus. This study provides a comprehensive picture of the differentially expressed genes in the periimplantation LE to help understand the molecular mechanism of LE transformation upon establishment of uterine receptivity. PMID:23885106

  17. Using scale and feather traits for module construction provides a functional approach to chicken epidermal development.

    PubMed

    Bao, Weier; Greenwold, Matthew J; Sawyer, Roger H

    2017-11-01

    Gene co-expression network analysis has been a research method widely used in systematically exploring gene function and interaction. Using the Weighted Gene Co-expression Network Analysis (WGCNA) approach to construct a gene co-expression network using data from a customized 44K microarray transcriptome of chicken epidermal embryogenesis, we have identified two distinct modules that are highly correlated with scale or feather development traits. Signaling pathways related to feather development were enriched in the traditional KEGG pathway analysis and functional terms relating specifically to embryonic epidermal development were also enriched in the Gene Ontology analysis. Significant enrichment annotations were discovered from customized enrichment tools such as Modular Single-Set Enrichment Test (MSET) and Medical Subject Headings (MeSH). Hub genes in both trait-correlated modules showed strong specific functional enrichment toward epidermal development. Also, regulatory elements, such as transcription factors and miRNAs, were targeted in the significant enrichment result. This work highlights the advantage of this methodology for functional prediction of genes not previously associated with scale- and feather trait-related modules.

  18. Altered gene expression in dry age-related macular degeneration suggests early loss of choroidal endothelial cells.

    PubMed

    Whitmore, S Scott; Braun, Terry A; Skeie, Jessica M; Haas, Christine M; Sohn, Elliott H; Stone, Edwin M; Scheetz, Todd E; Mullins, Robert F

    2013-01-01

    Age-related macular degeneration (AMD) is a major cause of blindness in developed countries. The molecular pathogenesis of early events in AMD is poorly understood. We investigated differential gene expression in samples of human retinal pigment epithelium (RPE) and choroid from early AMD and control maculas with exon-based arrays. Gene expression levels in nine human donor eyes with early AMD and nine control human donor eyes were assessed using Affymetrix Human Exon ST 1.0 arrays. Two controls did not pass quality control and were removed. Differentially expressed genes were annotated using the Database for Annotation, Visualization and Integrated Discovery (DAVID), and gene set enrichment analysis (GSEA) was performed on RPE-specific and endothelium-associated gene sets. The complement factor H (CFH) genotype was also assessed, and differential expression was analyzed regarding high AMD risk (YH/HH) and low AMD risk (YY) genotypes. Seventy-five genes were identified as differentially expressed (raw p value <0.01; ≥50% fold change, mean log2 expression level in AMD or control ≥ median of all average gene expression values); however, no genes were significant (adj. p value <0.01) after correction for multiple hypothesis testing. Of 52 genes with decreased expression in AMD (fold change <0.5; raw p value <0.01), 18 genes were identified by DAVID analysis as associated with vision or neurologic processes. The GSEA of the RPE-associated and endothelium-associated genes revealed a significant decrease in genes typically expressed by endothelial cells in the early AMD group compared to controls, consistent with previous histologic and proteomic studies. Analysis of the CFH genotype indicated decreased expression of ADAMTS9 in eyes with high-risk genotypes (fold change = -2.61; raw p value=0.0008). GSEA results suggest that RPE transcripts are preserved or elevated in early AMD, concomitant with loss of endothelial cell marker expression. These results are consistent with the notion that choroidal endothelial cell dropout or dedifferentiation occurs early in the pathogenesis of AMD.

  19. Transcriptome map of plant mitochondria reveals islands of unexpected transcribed regions.

    PubMed

    Fujii, Sota; Toda, Takushi; Kikuchi, Shunsuke; Suzuki, Ryutaro; Yokoyama, Koji; Tsuchida, Hiroko; Yano, Kentaro; Toriyama, Kinya

    2011-06-01

    Plant mitochondria contain a relatively large amount of genetic information, suggesting that their functional regulation may not be as straightforward as that of metazoans. We used a genomic tiling array to draw a transcriptomic atlas of Oryza sativa japonica (rice) mitochondria, which was predicted to be approximately 490-kb long. Whereas statistical analysis verified the transcription of all previously known functional genes such as the ones related to oxidative phosphorylation, a similar extent of RNA expression was frequently observed in the inter-genic regions where none of the previously annotated genes are located. The newly identified open reading frames (ORFs) predicted in these transcribed inter-genic regions were generally not conserved among flowering plant species, suggesting that these ORFs did not play a role in mitochondrial principal functions. We also identified two partial fragments of retrotransposon sequences as being transcribed in rice mitochondria. The present study indicated the previously unexpected complexity of plant mitochondrial RNA metabolism. Our transcriptomic data (Oryza sativa Mitochondrial rna Expression Server: OsMES) is publicly accessible at [http://bioinf.mind.meiji.ac.jp/cgi-bin/gbrowse/OsMes/#search].

  20. Plant Omics Data Center: An Integrated Web Repository for Interspecies Gene Expression Networks with NLP-Based Curation

    PubMed Central

    Ohyanagi, Hajime; Takano, Tomoyuki; Terashima, Shin; Kobayashi, Masaaki; Kanno, Maasa; Morimoto, Kyoko; Kanegae, Hiromi; Sasaki, Yohei; Saito, Misa; Asano, Satomi; Ozaki, Soichi; Kudo, Toru; Yokoyama, Koji; Aya, Koichiro; Suwabe, Keita; Suzuki, Go; Aoki, Koh; Kubo, Yasutaka; Watanabe, Masao; Matsuoka, Makoto; Yano, Kentaro

    2015-01-01

    Comprehensive integration of large-scale omics resources such as genomes, transcriptomes and metabolomes will provide deeper insights into broader aspects of molecular biology. For better understanding of plant biology, we aim to construct a next-generation sequencing (NGS)-derived gene expression network (GEN) repository for a broad range of plant species. So far we have incorporated information about 745 high-quality mRNA sequencing (mRNA-Seq) samples from eight plant species (Arabidopsis thaliana, Oryza sativa, Solanum lycopersicum, Sorghum bicolor, Vitis vinifera, Solanum tuberosum, Medicago truncatula and Glycine max) from the public short read archive, digitally profiled the entire set of gene expression profiles, and drawn GENs by using correspondence analysis (CA) to take advantage of gene expression similarities. In order to understand the evolutionary significance of the GENs from multiple species, they were linked according to the orthology of each node (gene) among species. In addition to other gene expression information, functional annotation of the genes will facilitate biological comprehension. Currently we are improving the given gene annotations with natural language processing (NLP) techniques and manual curation. Here we introduce the current status of our analyses and the web database, PODC (Plant Omics Data Center; http://bioinf.mind.meiji.ac.jp/podc/), now open to the public, providing GENs, functional annotations and additional comprehensive omics resources. PMID:25505034

  1. Large-Scale Collection and Analysis of Full-Length cDNAs from Brachypodium distachyon and Integration with Pooideae Sequence Resources

    PubMed Central

    Mochida, Keiichi; Uehara-Yamaguchi, Yukiko; Takahashi, Fuminori; Yoshida, Takuhiro; Sakurai, Tetsuya; Shinozaki, Kazuo

    2013-01-01

    A comprehensive collection of full-length cDNAs is essential for correct structural gene annotation and functional analyses of genes. We constructed a mixed full-length cDNA library from 21 different tissues of Brachypodium distachyon Bd21, and obtained 78,163 high quality expressed sequence tags (ESTs) from both ends of ca. 40,000 clones (including 16,079 contigs). We updated gene structure annotations of Brachypodium genes based on full-length cDNA sequences in comparison with the latest publicly available annotations. About 10,000 non-redundant gene models were supported by full-length cDNAs; ca. 6,000 showed some transcription unit modifications. We also found ca. 580 novel gene models, including 362 newly identified in Bd21. Using the updated transcription start sites, we searched a total of 580 plant cis-motifs in the −3 kb promoter regions and determined a genome-wide Brachypodium promoter architecture. Furthermore, we integrated the Brachypodium full-length cDNAs and updated gene structures with available sequence resources in wheat and barley in a web-accessible database, the RIKEN Brachypodium FL cDNA database. The database represents a “one-stop” information resource for all genomic information in the Pooideae, facilitating functional analysis of genes in this model grass plant and seamless knowledge transfer to the Triticeae crops. PMID:24130698

  2. GOGrapher: A Python library for GO graph representation and analysis.

    PubMed

    Muller, Brian; Richards, Adam J; Jin, Bo; Lu, Xinghua

    2009-07-07

    The Gene Ontology is the most commonly used controlled vocabulary for annotating proteins. The concepts in the ontology are organized as a directed acyclic graph, in which a node corresponds to a biological concept and a directed edge denotes the parent-child semantic relationship between a pair of terms. A large number of protein annotations further create links between proteins and their functional annotations, reflecting the contemporary knowledge about proteins and their functional relationships. This leads to a complex graph consisting of interleaved biological concepts and their associated proteins. What is needed is a simple, open source library that provides tools to not only create and view the Gene Ontology graph, but to analyze and manipulate it as well. Here we describe the development and use of GOGrapher, a Python library that can be used for the creation, analysis, manipulation, and visualization of Gene Ontology related graphs. An object-oriented approach was adopted to organize the hierarchy of the graphs types and associated classes. An Application Programming Interface is provided through which different types of graphs can be pragmatically created, manipulated, and visualized. GOGrapher has been successfully utilized in multiple research projects, e.g., a graph-based multi-label text classifier for protein annotation. The GOGrapher project provides a reusable programming library designed for the manipulation and analysis of Gene Ontology graphs. The library is freely available for the scientific community to use and improve.

  3. Transcriptome sequencing and annotation for the Jamaican fruit bat (Artibeus jamaicensis).

    PubMed

    Shaw, Timothy I; Srivastava, Anuj; Chou, Wen-Chi; Liu, Liang; Hawkinson, Ann; Glenn, Travis C; Adams, Rick; Schountz, Tony

    2012-01-01

    The Jamaican fruit bat (Artibeus jamaicensis) is one of the most common bats in the tropical Americas. It is thought to be a potential reservoir host of Tacaribe virus, an arenavirus closely related to the South American hemorrhagic fever viruses. We performed transcriptome sequencing and annotation from lung, kidney and spleen tissues using 454 and Illumina platforms to develop this species as an animal model. More than 100,000 contigs were assembled, with 25,000 genes that were functionally annotated. Of the remaining unannotated contigs, 80% were found within bat genomes or transcriptomes. Annotated genes are involved in a broad range of activities ranging from cellular metabolism to genome regulation through ncRNAs. Reciprocal BLAST best hits yielded 8,785 sequences that are orthologous to mouse, rat, cattle, horse and human. Species tree analysis of sequences from 2,378 loci was used to achieve 95% bootstrap support for the placement of bat as sister to the clade containing horse, dog, and cattle. Through substitution rate estimation between bat and human, 32 genes were identified with evidence for positive selection. We also identified 466 immune-related genes, which may be useful for studying Tacaribe virus infection of this species. The Jamaican fruit bat transcriptome dataset is a resource that should provide additional candidate markers for studying bat evolution and ecology, and tools for analysis of the host response and pathology of disease.

  4. Improved detection of overrepresentation of Gene-Ontology annotations with parent child analysis.

    PubMed

    Grossmann, Steffen; Bauer, Sebastian; Robinson, Peter N; Vingron, Martin

    2007-11-15

    High-throughput experiments such as microarray hybridizations often yield long lists of genes found to share a certain characteristic such as differential expression. Exploring Gene Ontology (GO) annotations for such lists of genes has become a widespread practice to get first insights into the potential biological meaning of the experiment. The standard statistical approach to measuring overrepresentation of GO terms cannot cope with the dependencies resulting from the structure of GO because they analyze each term in isolation. Especially the fact that annotations are inherited from more specific descendant terms can result in certain types of false-positive results with potentially misleading biological interpretation, a phenomenon which we term the inheritance problem. We present here a novel approach to analysis of GO term overrepresentation that determines overrepresentation of terms in the context of annotations to the term's parents. This approach reduces the dependencies between the individual term's measurements, and thereby avoids producing false-positive results owing to the inheritance problem. ROC analysis using study sets with overrepresented GO terms showed a clear advantage for our approach over the standard algorithm with respect to the inheritance problem. Although there can be no gold standard for exploratory methods such as analysis of GO term overrepresentation, analysis of biological datasets suggests that our algorithm tends to identify the core GO terms that are most characteristic of the dataset being analyzed.

  5. Introduction to the fathead minnow genome browser and ...

    EPA Pesticide Factsheets

    Ab initio gene prediction and evidence alignment were used to produce the first annotations for the fathead minnow SOAPdenovo genome assembly. Additionally, a genome browser hosted at genome.setac.org provides simplified access to the annotation data in context with fathead minnow genomic sequence. This work is meant to extend the utility of fathead minnow genome as a resource and enable the continued development of this species as a model organism. The fathead minnow (Pimephales promelas) is a laboratory model organism widely used in regulatory toxicity testing and ecotoxicology research. Despite, the wealth of toxicological data for this organism, until recently genome scale information was lacking for the species, which limited the utility of the species for pathway-based toxicity testing and research. As part of a EPA Pathfinder Innovation Project, next generation sequencing was applied to generate a draft genome assembly, which was published in 2016. However, application of those genome-scale sequencing resources was still limited by the lack of available gene annotations for fathead minnow. Here we report on development of a first generation genome annotation for fathead minnow and the dissemination of that information through a web-based browser that makes it easy to search for genes of interest, extract the corresponding sequence, identify intron and exon boundaries and regulatory regions, and align the computationally predicted genes with other supporti

  6. Gene Discovery through Transcriptome Sequencing for the Invasive Mussel Limnoperna fortunei

    PubMed Central

    Uliano-Silva, Marcela; Americo, Juliana Alves; Brindeiro, Rodrigo; Dondero, Francesco; Prosdocimi, Francisco; de Freitas Rebelo, Mauro

    2014-01-01

    The success of the Asian bivalve Limnoperna fortunei as an invader in South America is related to its high acclimation capability. It can inhabit waters with a wide range of temperatures and salinity and handle long-term periods of air exposure. We describe the transcriptome of L. fortunei aiming to give a first insight into the phenotypic plasticity that allows non-native taxa to become established and widespread. We sequenced 95,219 reads from five main tissues of the mussel L. fortunei using Roche’s 454 and assembled them to form a set of 84,063 unigenes (contigs and singletons) representing partial or complete gene sequences. We annotated 24,816 unigenes using a BLAST sequence similarity search against a NCBI nr database. Unigenes were divided into 20 eggNOG functional categories and 292 KEGG metabolic pathways. From the total unigenes, 1,351 represented putative full-length genes of which 73.2% were functionally annotated. We described the first partial and complete gene sequences in order to start understanding bivalve invasiveness. An expansion of the hsp70 gene family, seen also in other bivalves, is present in L. fortunei and could be involved in its adaptation to extreme environments, e.g. during intertidal periods. The presence of toll-like receptors gives a first insight into an immune system that could be more complex than previously assumed and may be involved in the prevention of disease and extinction when population densities are high. Finally, the apparent lack of special adaptations to extremely low O2 levels is a target worth pursuing for the development of a molecular control approach. PMID:25047650

  7. An unsupervised classification scheme for improving predictions of prokaryotic TIS.

    PubMed

    Tech, Maike; Meinicke, Peter

    2006-03-09

    Although it is not difficult for state-of-the-art gene finders to identify coding regions in prokaryotic genomes, exact prediction of the corresponding translation initiation sites (TIS) is still a challenging problem. Recently a number of post-processing tools have been proposed for improving the annotation of prokaryotic TIS. However, inherent difficulties of these approaches arise from the considerable variation of TIS characteristics across different species. Therefore prior assumptions about the properties of prokaryotic gene starts may cause suboptimal predictions for newly sequenced genomes with TIS signals differing from those of well-investigated genomes. We introduce a clustering algorithm for completely unsupervised scoring of potential TIS, based on positionally smoothed probability matrices. The algorithm requires an initial gene prediction and the genomic sequence of the organism to perform the reannotation. As compared with other methods for improving predictions of gene starts in bacterial genomes, our approach is not based on any specific assumptions about prokaryotic TIS. Despite the generality of the underlying algorithm, the prediction rate of our method is competitive on experimentally verified test data from E. coli and B. subtilis. Regarding genomes with high G+C content, in contrast to some previously proposed methods, our algorithm also provides good performance on P. aeruginosa, B. pseudomallei and R. solanacearum. On reliable test data we showed that our method provides good results in post-processing the predictions of the widely-used program GLIMMER. The underlying clustering algorithm is robust with respect to variations in the initial TIS annotation and does not require specific assumptions about prokaryotic gene starts. These features are particularly useful on genomes with high G+C content. The algorithm has been implemented in the tool "TICO" (TIs COrrector) which is publicly available from our web site.

  8. Genetic resources for methane production from biomass described with the Gene Ontology.

    PubMed

    Purwantini, Endang; Torto-Alalibo, Trudy; Lomax, Jane; Setubal, João C; Tyler, Brett M; Mukhopadhyay, Biswarup

    2014-01-01

    Methane (CH4) is a valuable fuel, constituting 70-95% of natural gas, and a potent greenhouse gas. Release of CH4 into the atmosphere contributes to climate change. Biological CH4 production or methanogenesis is mostly performed by methanogens, a group of strictly anaerobic archaea. The direct substrates for methanogenesis are H2 plus CO2, acetate, formate, methylamines, methanol, methyl sulfides, and ethanol or a secondary alcohol plus CO2. In numerous anaerobic niches in nature, methanogenesis facilitates mineralization of complex biopolymers such as carbohydrates, lipids and proteins generated by primary producers. Thus, methanogens are critical players in the global carbon cycle. The same process is used in anaerobic treatment of municipal, industrial and agricultural wastes, reducing the biological pollutants in the wastes and generating methane. It also holds potential for commercial production of natural gas from renewable resources. This process operates in digestive systems of many animals, including cattle, and humans. In contrast, in deep-sea hydrothermal vents methanogenesis is a primary production process, allowing chemosynthesis of biomaterials from H2 plus CO2. In this report we present Gene Ontology (GO) terms that can be used to describe processes, functions and cellular components involved in methanogenic biodegradation and biosynthesis of specialized coenzymes that methanogens use. Some of these GO terms were previously available and the rest were generated in our Microbial Energy Gene Ontology (MENGO) project. A recently discovered non-canonical CH4 production process is also described. We have performed manual GO annotation of selected methanogenesis genes, based on experimental evidence, providing "gold standards" for machine annotation and automated discovery of methanogenesis genes or systems in diverse genomes. Most of the GO-related information presented in this report is available at the MENGO website (http://www.mengo.biochem.vt.edu/).

  9. The Genomes of Three Uneven Siblings: Footprints of the Lifestyles of Three Trichoderma Species.

    PubMed

    Schmoll, Monika; Dattenböck, Christoph; Carreras-Villaseñor, Nohemí; Mendoza-Mendoza, Artemio; Tisch, Doris; Alemán, Mario Ivan; Baker, Scott E; Brown, Christopher; Cervantes-Badillo, Mayte Guadalupe; Cetz-Chel, José; Cristobal-Mondragon, Gema Rosa; Delaye, Luis; Esquivel-Naranjo, Edgardo Ulises; Frischmann, Alexa; Gallardo-Negrete, Jose de Jesus; García-Esquivel, Monica; Gomez-Rodriguez, Elida Yazmin; Greenwood, David R; Hernández-Oñate, Miguel; Kruszewska, Joanna S; Lawry, Robert; Mora-Montes, Hector M; Muñoz-Centeno, Tania; Nieto-Jacobo, Maria Fernanda; Nogueira Lopez, Guillermo; Olmedo-Monfil, Vianey; Osorio-Concepcion, Macario; Piłsyk, Sebastian; Pomraning, Kyle R; Rodriguez-Iglesias, Aroa; Rosales-Saavedra, Maria Teresa; Sánchez-Arreguín, J Alejandro; Seidl-Seiboth, Verena; Stewart, Alison; Uresti-Rivera, Edith Elena; Wang, Chih-Li; Wang, Ting-Fang; Zeilinger, Susanne; Casas-Flores, Sergio; Herrera-Estrella, Alfredo

    2016-03-01

    The genus Trichoderma contains fungi with high relevance for humans, with applications in enzyme production for plant cell wall degradation and use in biocontrol. Here, we provide a broad, comprehensive overview of the genomic content of these species for "hot topic" research aspects, including CAZymes, transport, transcription factors, and development, along with a detailed analysis and annotation of less-studied topics, such as signal transduction, genome integrity, chromatin, photobiology, or lipid, sulfur, and nitrogen metabolism in T. reesei, T. atroviride, and T. virens, and we open up new perspectives to those topics discussed previously. In total, we covered more than 2,000 of the predicted 9,000 to 11,000 genes of each Trichoderma species discussed, which is >20% of the respective gene content. Additionally, we considered available transcriptome data for the annotated genes. Highlights of our analyses include overall carbohydrate cleavage preferences due to the different genomic contents and regulation of the respective genes. We found light regulation of many sulfur metabolic genes. Additionally, a new Golgi 1,2-mannosidase likely involved in N-linked glycosylation was detected, as were indications for the ability of Trichoderma spp. to generate hybrid galactose-containing N-linked glycans. The genomic inventory of effector proteins revealed numerous compounds unique to Trichoderma, and these warrant further investigation. We found interesting expansions in the Trichoderma genus in several signaling pathways, such as G-protein-coupled receptors, RAS GTPases, and casein kinases. A particularly interesting feature absolutely unique to T. atroviride is the duplication of the alternative sulfur amino acid synthesis pathway. Copyright © 2016, American Society for Microbiology. All Rights Reserved.

  10. Transcriptomic and metabolomic analyses of cucumber fruit peels reveal a developmental increase in terpenoid glycosides associated with age-related resistance to Phytophthora capsici

    PubMed Central

    Mansfeld, Ben N; Colle, Marivi; Kang, Yunyan; Jones, A Daniel; Grumet, Rebecca

    2017-01-01

    The oomycete, Phytophthora capsici, infects cucumber (Cucumis sativus L.) fruit. An age-related resistance (ARR) to this pathogen was previously observed in fruit of cultivar ‘Vlaspik’ and shown to be associated with the peel. Young fruits are highly susceptible, but develop resistance at ~10–12 days post pollination (dpp). Peels from resistant (16 dpp) versus susceptible (8 dpp) age fruit are enriched with genes associated with defense, and methanolic extracts from resistant age peels inhibit pathogen growth. Here we compared developing fruits from ‘Vlaspik’ with those of ‘Gy14’, a line that does not exhibit ARR. Transcriptomic analysis of peels of the two lines at 8 and 16 dpp identified 80 genes that were developmentally upregulated in resistant ‘Vlaspik’ 16 dpp versus 8 dpp, but not in susceptible ‘Gy14’ at 16 dpp. A large number of these genes are annotated to be associated with defense and/or specialized metabolism, including four putative resistance (R) genes, and numerous genes involved in flavonoid and terpenoid synthesis and decoration. Untargeted metabolomic analysis was performed on extracts from 8 and 16 dpp ‘Vlaspik’ and ‘Gy14’ fruit peels using Ultra-Performance Liquid Chromatography and Quadrupole Time-of-Flight Mass Spectrometry. Multivariate analysis of the metabolomes identified 113 ions uniquely abundant in resistant ‘Vlaspik’ 16 dpp peel extracts. The most abundant compounds in this group had relative mass defects consistent with terpenoid glycosides. Two of the three most abundant ions were annotated as glycosylated nor-terpenoid esters. Together, these analyses reveal potential mechanisms by which ARR to P. capsici may be conferred. PMID:28580151

  11. Genetic resources for methane production from biomass described with the Gene Ontology

    PubMed Central

    Purwantini, Endang; Torto-Alalibo, Trudy; Lomax, Jane; Setubal, João C.; Tyler, Brett M.; Mukhopadhyay, Biswarup

    2014-01-01

    Methane (CH4) is a valuable fuel, constituting 70–95% of natural gas, and a potent greenhouse gas. Release of CH4 into the atmosphere contributes to climate change. Biological CH4 production or methanogenesis is mostly performed by methanogens, a group of strictly anaerobic archaea. The direct substrates for methanogenesis are H2 plus CO2, acetate, formate, methylamines, methanol, methyl sulfides, and ethanol or a secondary alcohol plus CO2. In numerous anaerobic niches in nature, methanogenesis facilitates mineralization of complex biopolymers such as carbohydrates, lipids and proteins generated by primary producers. Thus, methanogens are critical players in the global carbon cycle. The same process is used in anaerobic treatment of municipal, industrial and agricultural wastes, reducing the biological pollutants in the wastes and generating methane. It also holds potential for commercial production of natural gas from renewable resources. This process operates in digestive systems of many animals, including cattle, and humans. In contrast, in deep-sea hydrothermal vents methanogenesis is a primary production process, allowing chemosynthesis of biomaterials from H2 plus CO2. In this report we present Gene Ontology (GO) terms that can be used to describe processes, functions and cellular components involved in methanogenic biodegradation and biosynthesis of specialized coenzymes that methanogens use. Some of these GO terms were previously available and the rest were generated in our Microbial Energy Gene Ontology (MENGO) project. A recently discovered non-canonical CH4 production process is also described. We have performed manual GO annotation of selected methanogenesis genes, based on experimental evidence, providing “gold standards” for machine annotation and automated discovery of methanogenesis genes or systems in diverse genomes. Most of the GO-related information presented in this report is available at the MENGO website (http://www.mengo.biochem.vt.edu/). PMID:25520705

  12. DOSim: an R package for similarity between diseases based on Disease Ontology.

    PubMed

    Li, Jiang; Gong, Binsheng; Chen, Xi; Liu, Tao; Wu, Chao; Zhang, Fan; Li, Chunquan; Li, Xiang; Rao, Shaoqi; Li, Xia

    2011-06-29

    The construction of the Disease Ontology (DO) has helped promote the investigation of diseases and disease risk factors. DO enables researchers to analyse disease similarity by adopting semantic similarity measures, and has expanded our understanding of the relationships between different diseases and to classify them. Simultaneously, similarities between genes can also be analysed by their associations with similar diseases. As a result, disease heterogeneity is better understood and insights into the molecular pathogenesis of similar diseases have been gained. However, bioinformatics tools that provide easy and straight forward ways to use DO to study disease and gene similarity simultaneously are required. We have developed an R-based software package (DOSim) to compute the similarity between diseases and to measure the similarity between human genes in terms of diseases. DOSim incorporates a DO-based enrichment analysis function that can be used to explore the disease feature of an independent gene set. A multilayered enrichment analysis (GO and KEGG annotation) annotation function that helps users explore the biological meaning implied in a newly detected gene module is also part of the DOSim package. We used the disease similarity application to demonstrate the relationship between 128 different DO cancer terms. The hierarchical clustering of these 128 different cancers showed modular characteristics. In another case study, we used the gene similarity application on 361 obesity-related genes. The results revealed the complex pathogenesis of obesity. In addition, the gene module detection and gene module multilayered annotation functions in DOSim when applied on these 361 obesity-related genes helped extend our understanding of the complex pathogenesis of obesity risk phenotypes and the heterogeneity of obesity-related diseases. DOSim can be used to detect disease-driven gene modules, and to annotate the modules for functions and pathways. The DOSim package can also be used to visualise DO structure. DOSim can reflect the modular characteristic of disease related genes and promote our understanding of the complex pathogenesis of diseases. DOSim is available on the Comprehensive R Archive Network (CRAN) or http://bioinfo.hrbmu.edu.cn/dosim.

  13. Structuring osteosarcoma knowledge: an osteosarcoma-gene association database based on literature mining and manual annotation.

    PubMed

    Poos, Kathrin; Smida, Jan; Nathrath, Michaela; Maugg, Doris; Baumhoer, Daniel; Neumann, Anna; Korsching, Eberhard

    2014-01-01

    Osteosarcoma (OS) is the most common primary bone cancer exhibiting high genomic instability. This genomic instability affects multiple genes and microRNAs to a varying extent depending on patient and tumor subtype. Massive research is ongoing to identify genes including their gene products and microRNAs that correlate with disease progression and might be used as biomarkers for OS. However, the genomic complexity hampers the identification of reliable biomarkers. Up to now, clinico-pathological factors are the key determinants to guide prognosis and therapeutic treatments. Each day, new studies about OS are published and complicate the acquisition of information to support biomarker discovery and therapeutic improvements. Thus, it is necessary to provide a structured and annotated view on the current OS knowledge that is quick and easily accessible to researchers of the field. Therefore, we developed a publicly available database and Web interface that serves as resource for OS-associated genes and microRNAs. Genes and microRNAs were collected using an automated dictionary-based gene recognition procedure followed by manual review and annotation by experts of the field. In total, 911 genes and 81 microRNAs related to 1331 PubMed abstracts were collected (last update: 29 October 2013). Users can evaluate genes and microRNAs according to their potential prognostic and therapeutic impact, the experimental procedures, the sample types, the biological contexts and microRNA target gene interactions. Additionally, a pathway enrichment analysis of the collected genes highlights different aspects of OS progression. OS requires pathways commonly deregulated in cancer but also features OS-specific alterations like deregulated osteoclast differentiation. To our knowledge, this is the first effort of an OS database containing manual reviewed and annotated up-to-date OS knowledge. It might be a useful resource especially for the bone tumor research community, as specific information about genes or microRNAs is quick and easily accessible. Hence, this platform can support the ongoing OS research and biomarker discovery. Database URL: http://osteosarcoma-db.uni-muenster.de. © The Author(s) 2014. Published by Oxford University Press.

  14. Structuring osteosarcoma knowledge: an osteosarcoma-gene association database based on literature mining and manual annotation

    PubMed Central

    Poos, Kathrin; Smida, Jan; Nathrath, Michaela; Maugg, Doris; Baumhoer, Daniel; Neumann, Anna; Korsching, Eberhard

    2014-01-01

    Osteosarcoma (OS) is the most common primary bone cancer exhibiting high genomic instability. This genomic instability affects multiple genes and microRNAs to a varying extent depending on patient and tumor subtype. Massive research is ongoing to identify genes including their gene products and microRNAs that correlate with disease progression and might be used as biomarkers for OS. However, the genomic complexity hampers the identification of reliable biomarkers. Up to now, clinico-pathological factors are the key determinants to guide prognosis and therapeutic treatments. Each day, new studies about OS are published and complicate the acquisition of information to support biomarker discovery and therapeutic improvements. Thus, it is necessary to provide a structured and annotated view on the current OS knowledge that is quick and easily accessible to researchers of the field. Therefore, we developed a publicly available database and Web interface that serves as resource for OS-associated genes and microRNAs. Genes and microRNAs were collected using an automated dictionary-based gene recognition procedure followed by manual review and annotation by experts of the field. In total, 911 genes and 81 microRNAs related to 1331 PubMed abstracts were collected (last update: 29 October 2013). Users can evaluate genes and microRNAs according to their potential prognostic and therapeutic impact, the experimental procedures, the sample types, the biological contexts and microRNA target gene interactions. Additionally, a pathway enrichment analysis of the collected genes highlights different aspects of OS progression. OS requires pathways commonly deregulated in cancer but also features OS-specific alterations like deregulated osteoclast differentiation. To our knowledge, this is the first effort of an OS database containing manual reviewed and annotated up-to-date OS knowledge. It might be a useful resource especially for the bone tumor research community, as specific information about genes or microRNAs is quick and easily accessible. Hence, this platform can support the ongoing OS research and biomarker discovery. Database URL: http://osteosarcoma-db.uni-muenster.de PMID:24865352

  15. Deep sequencing of the Camellia sinensis transcriptome revealed candidate genes for major metabolic pathways of tea-specific compounds

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Shi, CY; Yang, H; Wei, CL

    Tea is one of the most popular non-alcoholic beverages worldwide. However, the tea plant, Camellia sinensis, is difficult to culture in vitro, to transform, and has a large genome, rendering little genomic information available. Recent advances in large-scale RNA sequencing (RNA-seq) provide a fast, cost-effective, and reliable approach to generate large expression datasets for functional genomic analysis, which is especially suitable for non-model species with un-sequenced genomes. Using high-throughput Illumina RNA-seq, the transcriptome from poly (A){sup +} RNA of C. sinensis was analyzed at an unprecedented depth (2.59 gigabase pairs). Approximate 34.5 million reads were obtained, trimmed, and assembled intomore » 127,094 unigenes, with an average length of 355 bp and an N50 of 506 bp, which consisted of 788 contig clusters and 126,306 singletons. This number of unigenes was 10-fold higher than existing C. sinensis sequences deposited in GenBank (as of August 2010). Sequence similarity analyses against six public databases (Uniprot, NR and COGs at NCBI, Pfam, InterPro and KEGG) found 55,088 unigenes that could be annotated with gene descriptions, conserved protein domains, or gene ontology terms. Some of the unigenes were assigned to putative metabolic pathways. Targeted searches using these annotations identified the majority of genes associated with several primary metabolic pathways and natural product pathways that are important to tea quality, such as flavonoid, theanine and caffeine biosynthesis pathways. Novel candidate genes of these secondary pathways were discovered. Comparisons with four previously prepared cDNA libraries revealed that this transcriptome dataset has both a high degree of consistency with previous EST data and an approximate 20 times increase in coverage. Thirteen unigenes related to theanine and flavonoid synthesis were validated. Their expression patterns in different organs of the tea plant were analyzed by RT-PCR and quantitative real time PCR (qRT-PCR). An extensive transcriptome dataset has been obtained from the deep sequencing of tea plant. The coverage of the transcriptome is comprehensive enough to discover all known genes of several major metabolic pathways. This transcriptome dataset can serve as an important public information platform for gene expression, genomics, and functional genomic studies in C. sinensis.« less

  16. Deep sequencing of the Camellia sinensis transcriptome revealed candidate genes for major metabolic pathways of tea-specific compounds

    PubMed Central

    2011-01-01

    Background Tea is one of the most popular non-alcoholic beverages worldwide. However, the tea plant, Camellia sinensis, is difficult to culture in vitro, to transform, and has a large genome, rendering little genomic information available. Recent advances in large-scale RNA sequencing (RNA-seq) provide a fast, cost-effective, and reliable approach to generate large expression datasets for functional genomic analysis, which is especially suitable for non-model species with un-sequenced genomes. Results Using high-throughput Illumina RNA-seq, the transcriptome from poly (A)+ RNA of C. sinensis was analyzed at an unprecedented depth (2.59 gigabase pairs). Approximate 34.5 million reads were obtained, trimmed, and assembled into 127,094 unigenes, with an average length of 355 bp and an N50 of 506 bp, which consisted of 788 contig clusters and 126,306 singletons. This number of unigenes was 10-fold higher than existing C. sinensis sequences deposited in GenBank (as of August 2010). Sequence similarity analyses against six public databases (Uniprot, NR and COGs at NCBI, Pfam, InterPro and KEGG) found 55,088 unigenes that could be annotated with gene descriptions, conserved protein domains, or gene ontology terms. Some of the unigenes were assigned to putative metabolic pathways. Targeted searches using these annotations identified the majority of genes associated with several primary metabolic pathways and natural product pathways that are important to tea quality, such as flavonoid, theanine and caffeine biosynthesis pathways. Novel candidate genes of these secondary pathways were discovered. Comparisons with four previously prepared cDNA libraries revealed that this transcriptome dataset has both a high degree of consistency with previous EST data and an approximate 20 times increase in coverage. Thirteen unigenes related to theanine and flavonoid synthesis were validated. Their expression patterns in different organs of the tea plant were analyzed by RT-PCR and quantitative real time PCR (qRT-PCR). Conclusions An extensive transcriptome dataset has been obtained from the deep sequencing of tea plant. The coverage of the transcriptome is comprehensive enough to discover all known genes of several major metabolic pathways. This transcriptome dataset can serve as an important public information platform for gene expression, genomics, and functional genomic studies in C. sinensis. PMID:21356090

  17. Detection of alternative splice variants at the proteome level in Aspergillus flavus.

    PubMed

    Chang, Kung-Yen; Georgianna, D Ryan; Heber, Steffen; Payne, Gary A; Muddiman, David C

    2010-03-05

    Identification of proteins from proteolytic peptides or intact proteins plays an essential role in proteomics. Researchers use search engines to match the acquired peptide sequences to the target proteins. However, search engines depend on protein databases to provide candidates for consideration. Alternative splicing (AS), the mechanism where the exon of pre-mRNAs can be spliced and rearranged to generate distinct mRNA and therefore protein variants, enable higher eukaryotic organisms, with only a limited number of genes, to have the requisite complexity and diversity at the proteome level. Multiple alternative isoforms from one gene often share common segments of sequences. However, many protein databases only include a limited number of isoforms to keep minimal redundancy. As a result, the database search might not identify a target protein even with high quality tandem MS data and accurate intact precursor ion mass. We computationally predicted an exhaustive list of putative isoforms of Aspergillus flavus proteins from 20 371 expressed sequence tags to investigate whether an alternative splicing protein database can assign a greater proportion of mass spectrometry data. The newly constructed AS database provided 9807 new alternatively spliced variants in addition to 12 832 previously annotated proteins. The searches of the existing tandem MS spectra data set using the AS database identified 29 new proteins encoded by 26 genes. Nine fungal genes appeared to have multiple protein isoforms. In addition to the discovery of splice variants, AS database also showed potential to improve genome annotation. In summary, the introduction of an alternative splicing database helps identify more proteins and unveils more information about a proteome.

  18. A new approach for annotation of transposable elements using small RNA mapping

    PubMed Central

    El Baidouri, Moaine; Kim, Kyung Do; Abernathy, Brian; Arikit, Siwaret; Maumus, Florian; Panaud, Olivier; Meyers, Blake C.; Jackson, Scott A.

    2015-01-01

    Transposable elements (TEs) are mobile genomic DNA sequences found in most organisms. They so densely populate the genomes of many eukaryotic species that they are often the major constituents. With the rapid generation of many plant genome sequencing projects over the past few decades, there is an urgent need for improved TE annotation as a prerequisite for genome-wide studies. Analogous to the use of RNA-seq for gene annotation, we propose a new method for de novo TE annotation that uses as a guide 24 nt-siRNAs that are a part of TE silencing pathways. We use this new approach, called TASR (for Transposon Annotation using Small RNAs), for de novo annotation of TEs in Arabidopsis, rice and soybean and demonstrate that this strategy can be successfully applied for de novo TE annotation in plants. Executable PERL is available for download from: http://tasr-pipeline.sourceforge.net/ PMID:25813049

  19. The transcriptome of the NZ endemic sea urchin Kina (Evechinus chloroticus).

    PubMed

    Gillard, Gareth B; Garama, Daniel J; Brown, Chris M

    2014-01-20

    Sea urchins are studied as model organisms for developmental and systems biology and also produce highly valued food products. Evechinus chloroticus (Kina) is a sea urchin species that is indigenous to New Zealand. It is the type member of the Evechinus genus based on its morphological characteristics. Previous research has focused on identifying physical factors affecting commercial roe quality of E. chloroticus, but there is almost no genetic information available for E. chloroticus. E. chloroticus is the only species in its genus and has yet to be subject to molecular phylogenetic analysis. In this study we performed a de novo transcriptome assembly of Illumina sequencing data. A total of 123 million 100 base length paired-end reads were generated using RNA-Seq libraries from a range of E. chloroticus tissues from two individuals obtained from Fiordland, New Zealand. The assembly resulted in a set of 75,002 transcripts with an accepted read coverage and length, of which 24,655 transcripts could be functionally annotated using protein similarity. Transcripts could be further annotated with Gene Ontology, KEGG Orthology and InterPro terms. With this sequence data we could perform the first phylogenetic analysis of E. chloroticus to other species of its family using multiple genes. When sequences for the mitochondrial nitrogen dehydrogenase genes were compared, E. chloroticus remained outside of a family level clade, which indicated E. chloroticus is indeed a genetically distinct genus within its family. This study has produced a large set of E. chloroticus transcripts/proteins along with functional annotations, vastly increasing the amount of genomic data available for this species. This provides a resource for current and future studies on E. chloroticus, either to increase its commercial value, or its use as a model organism. The phylogenetic results provide a basis for further analysis of relationships between E. chloroticus, its family members, and its evolutionary history.

  20. Rice DB: an Oryza Information Portal linking annotation, subcellular location, function, expression, regulation, and evolutionary information for rice and Arabidopsis

    PubMed Central

    Narsai, Reena; Devenish, James; Castleden, Ian; Narsai, Kabir; Xu, Lin; Shou, Huixia; Whelan, James

    2013-01-01

    Omics research in Oryza sativa (rice) relies on the use of multiple databases to obtain different types of information to define gene function. We present Rice DB, an Oryza information portal that is a functional genomics database, linking gene loci to comprehensive annotations, expression data and the subcellular location of encoded proteins. Rice DB has been designed to integrate the direct comparison of rice with Arabidopsis (Arabidopsis thaliana), based on orthology or ‘expressology’, thus using and combining available information from two pre-eminent plant models. To establish Rice DB, gene identifiers (more than 40 types) and annotations from a variety of sources were compiled, functional information based on large-scale and individual studies was manually collated, hundreds of microarrays were analysed to generate expression annotations, and the occurrences of potential functional regulatory motifs in promoter regions were calculated. A range of computational subcellular localization predictions were also run for all putative proteins encoded in the rice genome, and experimentally confirmed protein localizations have been collated, curated and linked to functional studies in rice. A single search box allows anything from gene identifiers (for rice and/or Arabidopsis), motif sequences, subcellular location, to keyword searches to be entered, with the capability of Boolean searches (such as AND/OR). To demonstrate the utility of Rice DB, several examples are presented including a rice mitochondrial proteome, which draws on a variety of sources for subcellular location data within Rice DB. Comparisons of subcellular location, functional annotations, as well as transcript expression in parallel with Arabidopsis reveals examples of conservation between rice and Arabidopsis, using Rice DB (http://ricedb.plantenergy.uwa.edu.au). PMID:24147765

  1. Rice DB: an Oryza Information Portal linking annotation, subcellular location, function, expression, regulation, and evolutionary information for rice and Arabidopsis.

    PubMed

    Narsai, Reena; Devenish, James; Castleden, Ian; Narsai, Kabir; Xu, Lin; Shou, Huixia; Whelan, James

    2013-12-01

    Omics research in Oryza sativa (rice) relies on the use of multiple databases to obtain different types of information to define gene function. We present Rice DB, an Oryza information portal that is a functional genomics database, linking gene loci to comprehensive annotations, expression data and the subcellular location of encoded proteins. Rice DB has been designed to integrate the direct comparison of rice with Arabidopsis (Arabidopsis thaliana), based on orthology or 'expressology', thus using and combining available information from two pre-eminent plant models. To establish Rice DB, gene identifiers (more than 40 types) and annotations from a variety of sources were compiled, functional information based on large-scale and individual studies was manually collated, hundreds of microarrays were analysed to generate expression annotations, and the occurrences of potential functional regulatory motifs in promoter regions were calculated. A range of computational subcellular localization predictions were also run for all putative proteins encoded in the rice genome, and experimentally confirmed protein localizations have been collated, curated and linked to functional studies in rice. A single search box allows anything from gene identifiers (for rice and/or Arabidopsis), motif sequences, subcellular location, to keyword searches to be entered, with the capability of Boolean searches (such as AND/OR). To demonstrate the utility of Rice DB, several examples are presented including a rice mitochondrial proteome, which draws on a variety of sources for subcellular location data within Rice DB. Comparisons of subcellular location, functional annotations, as well as transcript expression in parallel with Arabidopsis reveals examples of conservation between rice and Arabidopsis, using Rice DB (http://ricedb.plantenergy.uwa.edu.au). © 2013 The Authors The Plant Journal © 2013 John Wiley & Sons Ltd.

  2. A high resolution atlas of gene expression in the domestic sheep (Ovis aries)

    PubMed Central

    Farquhar, Iseabail L.; Young, Rachel; Lefevre, Lucas; Pridans, Clare; Tsang, Hiu G.; Afrasiabi, Cyrus; Watson, Mick; Whitelaw, C. Bruce; Freeman, Tom C.; Archibald, Alan L.; Hume, David A.

    2017-01-01

    Sheep are a key source of meat, milk and fibre for the global livestock sector, and an important biomedical model. Global analysis of gene expression across multiple tissues has aided genome annotation and supported functional annotation of mammalian genes. We present a large-scale RNA-Seq dataset representing all the major organ systems from adult sheep and from several juvenile, neonatal and prenatal developmental time points. The Ovis aries reference genome (Oar v3.1) includes 27,504 genes (20,921 protein coding), of which 25,350 (19,921 protein coding) had detectable expression in at least one tissue in the sheep gene expression atlas dataset. Network-based cluster analysis of this dataset grouped genes according to their expression pattern. The principle of ‘guilt by association’ was used to infer the function of uncharacterised genes from their co-expression with genes of known function. We describe the overall transcriptional signatures present in the sheep gene expression atlas and assign those signatures, where possible, to specific cell populations or pathways. The findings are related to innate immunity by focusing on clusters with an immune signature, and to the advantages of cross-breeding by examining the patterns of genes exhibiting the greatest expression differences between purebred and crossbred animals. This high-resolution gene expression atlas for sheep is, to our knowledge, the largest transcriptomic dataset from any livestock species to date. It provides a resource to improve the annotation of the current reference genome for sheep, presenting a model transcriptome for ruminants and insight into gene, cell and tissue function at multiple developmental stages. PMID:28915238

  3. A high resolution atlas of gene expression in the domestic sheep (Ovis aries).

    PubMed

    Clark, Emily L; Bush, Stephen J; McCulloch, Mary E B; Farquhar, Iseabail L; Young, Rachel; Lefevre, Lucas; Pridans, Clare; Tsang, Hiu G; Wu, Chunlei; Afrasiabi, Cyrus; Watson, Mick; Whitelaw, C Bruce; Freeman, Tom C; Summers, Kim M; Archibald, Alan L; Hume, David A

    2017-09-01

    Sheep are a key source of meat, milk and fibre for the global livestock sector, and an important biomedical model. Global analysis of gene expression across multiple tissues has aided genome annotation and supported functional annotation of mammalian genes. We present a large-scale RNA-Seq dataset representing all the major organ systems from adult sheep and from several juvenile, neonatal and prenatal developmental time points. The Ovis aries reference genome (Oar v3.1) includes 27,504 genes (20,921 protein coding), of which 25,350 (19,921 protein coding) had detectable expression in at least one tissue in the sheep gene expression atlas dataset. Network-based cluster analysis of this dataset grouped genes according to their expression pattern. The principle of 'guilt by association' was used to infer the function of uncharacterised genes from their co-expression with genes of known function. We describe the overall transcriptional signatures present in the sheep gene expression atlas and assign those signatures, where possible, to specific cell populations or pathways. The findings are related to innate immunity by focusing on clusters with an immune signature, and to the advantages of cross-breeding by examining the patterns of genes exhibiting the greatest expression differences between purebred and crossbred animals. This high-resolution gene expression atlas for sheep is, to our knowledge, the largest transcriptomic dataset from any livestock species to date. It provides a resource to improve the annotation of the current reference genome for sheep, presenting a model transcriptome for ruminants and insight into gene, cell and tissue function at multiple developmental stages.

  4. Complete genome sequence and the expression pattern of plasmids of the model ethanologen Zymomonas mobilis ZM4 and its xylose-utilizing derivatives 8b and 2032.

    PubMed

    Yang, Shihui; Vera, Jessica M; Grass, Jeff; Savvakis, Giannis; Moskvin, Oleg V; Yang, Yongfu; McIlwain, Sean J; Lyu, Yucai; Zinonos, Irene; Hebert, Alexander S; Coon, Joshua J; Bates, Donna M; Sato, Trey K; Brown, Steven D; Himmel, Michael E; Zhang, Min; Landick, Robert; Pappas, Katherine M; Zhang, Yaoping

    2018-01-01

    Zymomonas mobilis is a natural ethanologen being developed and deployed as an industrial biofuel producer. To date, eight Z. mobilis strains have been completely sequenced and found to contain 2-8 native plasmids. However, systematic verification of predicted Z. mobilis plasmid genes and their contribution to cell fitness has not been hitherto addressed. Moreover, the precise number and identities of plasmids in Z. mobilis model strain ZM4 have been unclear. The lack of functional information about plasmid genes in ZM4 impedes ongoing studies for this model biofuel-producing strain. In this study, we determined the complete chromosome and plasmid sequences of ZM4 and its engineered xylose-utilizing derivatives 2032 and 8b. Compared to previously published and revised ZM4 chromosome sequences, the ZM4 chromosome sequence reported here contains 65 nucleotide sequence variations as well as a 2400-bp insertion. Four plasmids were identified in all three strains, with 150 plasmid genes predicted in strain ZM4 and 2032, and 153 plasmid genes predicted in strain 8b due to the insertion of heterologous DNA for expanded substrate utilization. Plasmid genes were then annotated using Blast2GO, InterProScan, and systems biology data analyses, and most genes were found to have apparent orthologs in other organisms or identifiable conserved domains. To verify plasmid gene prediction, RNA-Seq was used to map transcripts and also compare relative gene expression under various growth conditions, including anaerobic and aerobic conditions, or growth in different concentrations of biomass hydrolysates. Overall, plasmid genes were more responsive to varying hydrolysate concentrations than to oxygen availability. Additionally, our results indicated that although all plasmids were present in low copy number (about 1-2 per cell), the copy number of some plasmids varied under specific growth conditions or due to heterologous gene insertion. The complete genome of ZM4 and two xylose-utilizing derivatives is reported in this study, with an emphasis on identifying and characterizing plasmid genes. Plasmid gene annotation, validation, expression levels at growth conditions of interest, and contribution to host fitness are reported for the first time.

  5. Analysis of Antisense Expression by Whole Genome Tiling Microarrays and siRNAs Suggests Mis-Annotation of Arabidopsis Orphan Protein-Coding Genes

    PubMed Central

    Richardson, Casey R.; Luo, Qing-Jun; Gontcharova, Viktoria; Jiang, Ying-Wen; Samanta, Manoj; Youn, Eunseog; Rock, Christopher D.

    2010-01-01

    Background MicroRNAs (miRNAs) and trans-acting small-interfering RNAs (tasi-RNAs) are small (20–22 nt long) RNAs (smRNAs) generated from hairpin secondary structures or antisense transcripts, respectively, that regulate gene expression by Watson-Crick pairing to a target mRNA and altering expression by mechanisms related to RNA interference. The high sequence homology of plant miRNAs to their targets has been the mainstay of miRNA prediction algorithms, which are limited in their predictive power for other kingdoms because miRNA complementarity is less conserved yet transitive processes (production of antisense smRNAs) are active in eukaryotes. We hypothesize that antisense transcription and associated smRNAs are biomarkers which can be computationally modeled for gene discovery. Principal Findings We explored rice (Oryza sativa) sense and antisense gene expression in publicly available whole genome tiling array transcriptome data and sequenced smRNA libraries (as well as C. elegans) and found evidence of transitivity of MIRNA genes similar to that found in Arabidopsis. Statistical analysis of antisense transcript abundances, presence of antisense ESTs, and association with smRNAs suggests several hundred Arabidopsis ‘orphan’ hypothetical genes are non-coding RNAs. Consistent with this hypothesis, we found novel Arabidopsis homologues of some MIRNA genes on the antisense strand of previously annotated protein-coding genes. A Support Vector Machine (SVM) was applied using thermodynamic energy of binding plus novel expression features of sense/antisense transcription topology and siRNA abundances to build a prediction model of miRNA targets. The SVM when trained on targets could predict the “ancient” (deeply conserved) class of validated Arabidopsis MIRNA genes with an accuracy of 84%, and 76% for “new” rapidly-evolving MIRNA genes. Conclusions Antisense and smRNA expression features and computational methods may identify novel MIRNA genes and other non-coding RNAs in plants and potentially other kingdoms, which can provide insight into antisense transcription, miRNA evolution, and post-transcriptional gene regulation. PMID:20520764

  6. The Essential Genome of Escherichia coli K-12

    PubMed Central

    2018-01-01

    ABSTRACT Transposon-directed insertion site sequencing (TraDIS) is a high-throughput method coupling transposon mutagenesis with short-fragment DNA sequencing. It is commonly used to identify essential genes. Single gene deletion libraries are considered the gold standard for identifying essential genes. Currently, the TraDIS method has not been benchmarked against such libraries, and therefore, it remains unclear whether the two methodologies are comparable. To address this, a high-density transposon library was constructed in Escherichia coli K-12. Essential genes predicted from sequencing of this library were compared to existing essential gene databases. To decrease false-positive identification of essential genes, statistical data analysis included corrections for both gene length and genome length. Through this analysis, new essential genes and genes previously incorrectly designated essential were identified. We show that manual analysis of TraDIS data reveals novel features that would not have been detected by statistical analysis alone. Examples include short essential regions within genes, orientation-dependent effects, and fine-resolution identification of genome and protein features. Recognition of these insertion profiles in transposon mutagenesis data sets will assist genome annotation of less well characterized genomes and provides new insights into bacterial physiology and biochemistry. PMID:29463657

  7. Phage phenomics: Physiological approaches to characterize novel viral proteins

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Sanchez, Savannah E.; Cuevas, Daniel A.; Rostron, Jason E.

    Current investigations into phage-host interactions are dependent on extrapolating knowledge from (meta)genomes. Interestingly, 60 - 95% of all phage sequences share no homology to current annotated proteins. As a result, a large proportion of phage genes are annotated as hypothetical. This reality heavily affects the annotation of both structural and auxiliary metabolic genes. Here we present phenomic methods designed to capture the physiological response(s) of a selected host during expression of one of these unknown phage genes. Multi-phenotype Assay Plates (MAPs) are used to monitor the diversity of host substrate utilization and subsequent biomass formation, while metabolomics provides bi-product analysismore » by monitoring metabolite abundance and diversity. Both tools are used simultaneously to provide a phenotypic profile associated with expression of a single putative phage open reading frame (ORF). Thus, representative results for both methods are compared, highlighting the phenotypic profile differences of a host carrying either putative structural or metabolic phage genes. In addition, the visualization techniques and high throughput computational pipelines that facilitated experimental analysis are presented.« less

  8. Phage phenomics: Physiological approaches to characterize novel viral proteins

    DOE PAGES

    Sanchez, Savannah E.; Cuevas, Daniel A.; Rostron, Jason E.; ...

    2015-06-11

    Current investigations into phage-host interactions are dependent on extrapolating knowledge from (meta)genomes. Interestingly, 60 - 95% of all phage sequences share no homology to current annotated proteins. As a result, a large proportion of phage genes are annotated as hypothetical. This reality heavily affects the annotation of both structural and auxiliary metabolic genes. Here we present phenomic methods designed to capture the physiological response(s) of a selected host during expression of one of these unknown phage genes. Multi-phenotype Assay Plates (MAPs) are used to monitor the diversity of host substrate utilization and subsequent biomass formation, while metabolomics provides bi-product analysismore » by monitoring metabolite abundance and diversity. Both tools are used simultaneously to provide a phenotypic profile associated with expression of a single putative phage open reading frame (ORF). Thus, representative results for both methods are compared, highlighting the phenotypic profile differences of a host carrying either putative structural or metabolic phage genes. In addition, the visualization techniques and high throughput computational pipelines that facilitated experimental analysis are presented.« less

  9. Changes in the transcriptomic profiles of maize roots in response to iron-deficiency stress.

    PubMed

    Li, Yan; Wang, Nian; Zhao, Fengtao; Song, Xuejiao; Yin, Zhaohua; Huang, Rong; Zhang, Chunqing

    2014-07-01

    Plants are often subjected to iron (Fe)-deficiency stress because of its low solubility. Plants have evolved two distinct strategies to solubilize and transport Fe to acclimate to this abiotic stress condition. Transcriptomic profiling analysis was performed using Illumina digital gene expression to understand the mechanism underlying resistance responses of roots to Fe starvation in maize, an important Strategy II plant. A total of 3,427, 4,069, 4,881, and 2,610 genes had significantly changed expression levels after Fe-deficiency treatments of 1, 2, 4 or 7 days, respectively. Genes involved in 2'-deoxymugineic acid (DMA) synthesis, secretion, and Fe(III)-DMA uptake were significantly induced. Many genes related to plant hormones, protein kinases, and protein phosphatases responded to Fe-deficiency stress, suggesting their regulatory roles in response to the Fe-deficiency stress. Functional annotation clustering analysis, using the Database for Annotation, Visualization and Integrated Discovery, revealed maize root responses to Fe starvation. This resulted in 38 functional annotation clusters: 25 for up-regulated genes, and 13 for down-regulated ones. These included genes encoding enzymes involved in the metabolism of carboxylic acids, isoprenoids and aromatic compounds, transporters, and stress response proteins. Our work provides integrated information for understanding maize response to Fe-deficiency stress.

  10. A detailed transcript-level probe annotation reveals alternative splicing based microarray platform differences

    PubMed Central

    Lee, Joseph C; Stiles, David; Lu, Jun; Cam, Margaret C

    2007-01-01

    Background Microarrays are a popular tool used in experiments to measure gene expression levels. Improving the reproducibility of microarray results produced by different chips from various manufacturers is important to create comparable and combinable experimental results. Alternative splicing has been cited as a possible cause of differences in expression measurements across platforms, though no study to this point has been conducted to show its influence in cross-platform differences. Results Using probe sequence data, a new microarray probe/transcript annotation was created based on the AceView Aug05 release that allowed for the categorization of genes based on their expression measurements' susceptibility to alternative splicing differences across microarray platforms. Examining gene expression data from multiple platforms in light of the new categorization, genes unsusceptible to alternative splicing differences showed higher signal agreement than those genes most susceptible to alternative splicing differences. The analysis gave rise to a different probe-level visualization method that can highlight probe differences according to transcript specificity. Conclusion The results highlight the need for detailed probe annotation at the transcriptome level. The presence of alternative splicing within a given sample can affect gene expression measurements and is a contributing factor to overall technical differences across platforms. PMID:17708771

  11. GeneView: a comprehensive semantic search engine for PubMed.

    PubMed

    Thomas, Philippe; Starlinger, Johannes; Vowinkel, Alexander; Arzt, Sebastian; Leser, Ulf

    2012-07-01

    Research results are primarily published in scientific literature and curation efforts cannot keep up with the rapid growth of published literature. The plethora of knowledge remains hidden in large text repositories like MEDLINE. Consequently, life scientists have to spend a great amount of time searching for specific information. The enormous ambiguity among most names of biomedical objects such as genes, chemicals and diseases often produces too large and unspecific search results. We present GeneView, a semantic search engine for biomedical knowledge. GeneView is built upon a comprehensively annotated version of PubMed abstracts and openly available PubMed Central full texts. This semi-structured representation of biomedical texts enables a number of features extending classical search engines. For instance, users may search for entities using unique database identifiers or they may rank documents by the number of specific mentions they contain. Annotation is performed by a multitude of state-of-the-art text-mining tools for recognizing mentions from 10 entity classes and for identifying protein-protein interactions. GeneView currently contains annotations for >194 million entities from 10 classes for ∼21 million citations with 271,000 full text bodies. GeneView can be searched at http://bc3.informatik.hu-berlin.de/.

  12. Draft genome of the red harvester ant Pogonomyrmex barbatus.

    PubMed

    Smith, Chris R; Smith, Christopher D; Robertson, Hugh M; Helmkampf, Martin; Zimin, Aleksey; Yandell, Mark; Holt, Carson; Hu, Hao; Abouheif, Ehab; Benton, Richard; Cash, Elizabeth; Croset, Vincent; Currie, Cameron R; Elhaik, Eran; Elsik, Christine G; Favé, Marie-Julie; Fernandes, Vilaiwan; Gibson, Joshua D; Graur, Dan; Gronenberg, Wulfila; Grubbs, Kirk J; Hagen, Darren E; Viniegra, Ana Sofia Ibarraran; Johnson, Brian R; Johnson, Reed M; Khila, Abderrahman; Kim, Jay W; Mathis, Kaitlyn A; Munoz-Torres, Monica C; Murphy, Marguerite C; Mustard, Julie A; Nakamura, Rin; Niehuis, Oliver; Nigam, Surabhi; Overson, Rick P; Placek, Jennifer E; Rajakumar, Rajendhran; Reese, Justin T; Suen, Garret; Tao, Shu; Torres, Candice W; Tsutsui, Neil D; Viljakainen, Lumi; Wolschin, Florian; Gadau, Jürgen

    2011-04-05

    We report the draft genome sequence of the red harvester ant, Pogonomyrmex barbatus. The genome was sequenced using 454 pyrosequencing, and the current assembly and annotation were completed in less than 1 y. Analyses of conserved gene groups (more than 1,200 manually annotated genes to date) suggest a high-quality assembly and annotation comparable to recently sequenced insect genomes using Sanger sequencing. The red harvester ant is a model for studying reproductive division of labor, phenotypic plasticity, and sociogenomics. Although the genome of P. barbatus is similar to other sequenced hymenopterans (Apis mellifera and Nasonia vitripennis) in GC content and compositional organization, and possesses a complete CpG methylation toolkit, its predicted genomic CpG content differs markedly from the other hymenopterans. Gene networks involved in generating key differences between the queen and worker castes (e.g., wings and ovaries) show signatures of increased methylation and suggest that ants and bees may have independently co-opted the same gene regulatory mechanisms for reproductive division of labor. Gene family expansions (e.g., 344 functional odorant receptors) and pseudogene accumulation in chemoreception and P450 genes compared with A. mellifera and N. vitripennis are consistent with major life-history changes during the adaptive radiation of Pogonomyrmex spp., perhaps in parallel with the development of the North American deserts.

  13. Muscle Research and Gene Ontology: New standards for improved data integration.

    PubMed

    Feltrin, Erika; Campanaro, Stefano; Diehl, Alexander D; Ehler, Elisabeth; Faulkner, Georgine; Fordham, Jennifer; Gardin, Chiara; Harris, Midori; Hill, David; Knoell, Ralph; Laveder, Paolo; Mittempergher, Lorenza; Nori, Alessandra; Reggiani, Carlo; Sorrentino, Vincenzo; Volpe, Pompeo; Zara, Ivano; Valle, Giorgio; Deegan, Jennifer

    2009-01-29

    The Gene Ontology Project provides structured controlled vocabularies for molecular biology that can be used for the functional annotation of genes and gene products. In a collaboration between the Gene Ontology (GO) Consortium and the muscle biology community, we have made large-scale additions to the GO biological process and cellular component ontologies. The main focus of this ontology development work concerns skeletal muscle, with specific consideration given to the processes of muscle contraction, plasticity, development, and regeneration, and to the sarcomere and membrane-delimited compartments. Our aims were to update the existing structure to reflect current knowledge, and to resolve, in an accommodating manner, the ambiguity in the language used by the community. The updated muscle terminologies have been incorporated into the GO. There are now 159 new terms covering critical research areas, and 57 existing terms have been improved and reorganized to follow their usage in muscle literature. The revised GO structure should improve the interpretation of data from high-throughput (e.g. microarray and proteomic) experiments in the area of muscle science and muscle disease. We actively encourage community feedback on, and gene product annotation with these new terms. Please visit the Muscle Community Annotation Wiki http://wiki.geneontology.org/index.php/Muscle_Biology.

  14. Linking microarray reporters with protein functions

    PubMed Central

    Gaj, Stan; van Erk, Arie; van Haaften, Rachel IM; Evelo, Chris TA

    2007-01-01

    Background The analysis of microarray experiments requires accurate and up-to-date functional annotation of the microarray reporters to optimize the interpretation of the biological processes involved. Pathway visualization tools are used to connect gene expression data with existing biological pathways by using specific database identifiers that link reporters with elements in the pathways. Results This paper proposes a novel method that aims to improve microarray reporter annotation by BLASTing the original reporter sequences against a species-specific EMBL subset, that was derived from and crosslinked back to the highly curated UniProt database. The resulting alignments were filtered using high quality alignment criteria and further compared with the outcome of a more traditional approach, where reporter sequences were BLASTed against EnsEMBL followed by locating the corresponding protein (UniProt) entry for the high quality hits. Combining the results of both methods resulted in successful annotation of > 58% of all reporter sequences with UniProt IDs on two commercial array platforms, increasing the amount of Incyte reporters that could be coupled to Gene Ontology terms from 32.7% to 58.3% and to a local GenMAPP pathway from 9.6% to 16.7%. For Agilent, 35.3% of the total reporters are now linked towards GO nodes and 7.1% on local pathways. Conclusion Our methods increased the annotation quality of microarray reporter sequences and allowed us to visualize more reporters using pathway visualization tools. Even in cases where the original reporter annotation showed the correct description the new identifiers often allowed improved pathway and Gene Ontology linking. These methods are freely available at http://www.bigcat.unimaas.nl/public/publications/Gaj_Annotation/. PMID:17897448

  15. RNA-Seq analysis and transcriptome assembly for blackberry (Rubus sp. Var. Lochness) fruit.

    PubMed

    Garcia-Seco, Daniel; Zhang, Yang; Gutierrez-Mañero, Francisco J; Martin, Cathie; Ramos-Solano, Beatriz

    2015-01-22

    There is an increasing interest in berries, especially blackberries in the diet, because of recent reports of their health benefits due to their high content of flavonoids. A broad range of genomic tools are available for other Rosaceae species but these tools are still lacking in the Rubus genus, thus limiting gene discovery and the breeding of improved varieties. De novo RNA-seq of ripe blackberries grown under field conditions was performed using Illumina Hiseq 2000. Almost 9 billion nucleotide bases were sequenced in total. Following assembly, 42,062 consensus sequences were detected. For functional annotation, 33,040 (NR), 32,762 (NT), 21,932 (Swiss-Prot), 20,134 (KEGG), 13,676 (COG), 24,168 (GO) consensus sequences were annotated using different databases; in total 34,552 annotated sequences were identified. For protein prediction analysis, the number of coding DNA sequences (CDS) that mapped to the protein database was 32,540. Non redundant (NR), annotation showed that 25,418 genes (73.5%) has the highest similarity with Fragaria vesca subspecies vesca. Reanalysis was undertaken by aligning the reads with this reference genome for a deeper analysis of the transcriptome. We demonstrated that de novo assembly, using Trinity and later annotation with Blast using different databases, were complementary to alignment to the reference sequence using SOAPaligner/SOAP2. The Fragaria reference genome belongs to a species in the same family as blackberry (Rosaceae) but to a different genus. Since blackberries are tetraploids, the possibility of artefactual gene chimeras resulting from mis-assembly was tested with one of the genes sequenced by RNAseq, Chalcone Synthase (CHS). cDNAs encoding this protein were cloned and sequenced. Primers designed to the assembled sequences accurately distinguished different contigs, at least for chalcone synthase genes. We prepared and analysed transcriptome data from ripe blackberries, for which prior genomic information was limited. This new sequence information will improve the knowledge of this important and healthy fruit, providing an invaluable new tool for biological research.

  16. A comprehensive collection of annotations to interpret sequence variation in human mitochondrial transfer RNAs.

    PubMed

    Diroma, Maria Angela; Lubisco, Paolo; Attimonelli, Marcella

    2016-11-08

    The abundance of biological data characterizing the genomics era is contributing to a comprehensive understanding of human mitochondrial genetics. Nevertheless, many aspects are still unclear, specifically about the variability of the 22 human mitochondrial transfer RNA (tRNA) genes and their involvement in diseases. The complex enrichment and isolation of tRNAs in vitro leads to an incomplete knowledge of their post-transcriptional modifications and three-dimensional folding, essential for correct tRNA functioning. An accurate annotation of mitochondrial tRNA variants would be definitely useful and appreciated by mitochondrial researchers and clinicians since the most of bioinformatics tools for variant annotation and prioritization available so far cannot shed light on the functional role of tRNA variations. To this aim, we updated our MToolBox pipeline for mitochondrial DNA analysis of high throughput and Sanger sequencing data by integrating tRNA variant annotations in order to identify and characterize relevant variants not only in protein coding regions, but also in tRNA genes. The annotation step in the pipeline now provides detailed information for variants mapping onto the 22 mitochondrial tRNAs. For each mt-tRNA position along the entire genome, the relative tRNA numbering, tRNA type, cloverleaf secondary domains (loops and stems), mature nucleotide and interactions in the three-dimensional folding were reported. Moreover, pathogenicity predictions for tRNA and rRNA variants were retrieved from the literature and integrated within the annotations provided by MToolBox, both in the stand-alone version and web-based tool at the Mitochondrial Disease Sequence Data Resource (MSeqDR) website. All the information available in the annotation step of MToolBox were exploited to generate custom tracks which can be displayed in the GBrowse instance at MSeqDR website. To the best of our knowledge, specific data regarding mitochondrial variants in tRNA genes were introduced for the first time in a tool for mitochondrial genome analysis, supporting the interpretation of genetic variants in specific genomic contexts.

  17. The Pathway Coexpression Network: Revealing pathway relationships

    PubMed Central

    Tanzi, Rudolph E.

    2018-01-01

    A goal of genomics is to understand the relationships between biological processes. Pathways contribute to functional interplay within biological processes through complex but poorly understood interactions. However, limited functional references for global pathway relationships exist. Pathways from databases such as KEGG and Reactome provide discrete annotations of biological processes. Their relationships are currently either inferred from gene set enrichment within specific experiments, or by simple overlap, linking pathway annotations that have genes in common. Here, we provide a unifying interpretation of functional interaction between pathways by systematically quantifying coexpression between 1,330 canonical pathways from the Molecular Signatures Database (MSigDB) to establish the Pathway Coexpression Network (PCxN). We estimated the correlation between canonical pathways valid in a broad context using a curated collection of 3,207 microarrays from 72 normal human tissues. PCxN accounts for shared genes between annotations to estimate significant correlations between pathways with related functions rather than with similar annotations. We demonstrate that PCxN provides novel insight into mechanisms of complex diseases using an Alzheimer’s Disease (AD) case study. PCxN retrieved pathways significantly correlated with an expert curated AD gene list. These pathways have known associations with AD and were significantly enriched for genes independently associated with AD. As a further step, we show how PCxN complements the results of gene set enrichment methods by revealing relationships between enriched pathways, and by identifying additional highly correlated pathways. PCxN revealed that correlated pathways from an AD expression profiling study include functional clusters involved in cell adhesion and oxidative stress. PCxN provides expanded connections to pathways from the extracellular matrix. PCxN provides a powerful new framework for interrogation of global pathway relationships. Comprehensive exploration of PCxN can be performed at http://pcxn.org/. PMID:29554099

  18. Towards the integration, annotation and association of historical microarray experiments with RNA-seq.

    PubMed

    Chavan, Shweta S; Bauer, Michael A; Peterson, Erich A; Heuck, Christoph J; Johann, Donald J

    2013-01-01

    Transcriptome analysis by microarrays has produced important advances in biomedicine. For instance in multiple myeloma (MM), microarray approaches led to the development of an effective disease subtyping via cluster assignment, and a 70 gene risk score. Both enabled an improved molecular understanding of MM, and have provided prognostic information for the purposes of clinical management. Many researchers are now transitioning to Next Generation Sequencing (NGS) approaches and RNA-seq in particular, due to its discovery-based nature, improved sensitivity, and dynamic range. Additionally, RNA-seq allows for the analysis of gene isoforms, splice variants, and novel gene fusions. Given the voluminous amounts of historical microarray data, there is now a need to associate and integrate microarray and RNA-seq data via advanced bioinformatic approaches. Custom software was developed following a model-view-controller (MVC) approach to integrate Affymetrix probe set-IDs, and gene annotation information from a variety of sources. The tool/approach employs an assortment of strategies to integrate, cross reference, and associate microarray and RNA-seq datasets. Output from a variety of transcriptome reconstruction and quantitation tools (e.g., Cufflinks) can be directly integrated, and/or associated with Affymetrix probe set data, as well as necessary gene identifiers and/or symbols from a diversity of sources. Strategies are employed to maximize the annotation and cross referencing process. Custom gene sets (e.g., MM 70 risk score (GEP-70)) can be specified, and the tool can be directly assimilated into an RNA-seq pipeline. A novel bioinformatic approach to aid in the facilitation of both annotation and association of historic microarray data, in conjunction with richer RNA-seq data, is now assisting with the study of MM cancer biology.

  19. Expanded microbial genome coverage and improved protein family annotation in the COG database

    PubMed Central

    Galperin, Michael Y.; Makarova, Kira S.; Wolf, Yuri I.; Koonin, Eugene V.

    2015-01-01

    Microbial genome sequencing projects produce numerous sequences of deduced proteins, only a small fraction of which have been or will ever be studied experimentally. This leaves sequence analysis as the only feasible way to annotate these proteins and assign to them tentative functions. The Clusters of Orthologous Groups of proteins (COGs) database (http://www.ncbi.nlm.nih.gov/COG/), first created in 1997, has been a popular tool for functional annotation. Its success was largely based on (i) its reliance on complete microbial genomes, which allowed reliable assignment of orthologs and paralogs for most genes; (ii) orthology-based approach, which used the function(s) of the characterized member(s) of the protein family (COG) to assign function(s) to the entire set of carefully identified orthologs and describe the range of potential functions when there were more than one; and (iii) careful manual curation of the annotation of the COGs, aimed at detailed prediction of the biological function(s) for each COG while avoiding annotation errors and overprediction. Here we present an update of the COGs, the first since 2003, and a comprehensive revision of the COG annotations and expansion of the genome coverage to include representative complete genomes from all bacterial and archaeal lineages down to the genus level. This re-analysis of the COGs shows that the original COG assignments had an error rate below 0.5% and allows an assessment of the progress in functional genomics in the past 12 years. During this time, functions of many previously uncharacterized COGs have been elucidated and tentative functional assignments of many COGs have been validated, either by targeted experiments or through the use of high-throughput methods. A particularly important development is the assignment of functions to several widespread, conserved proteins many of which turned out to participate in translation, in particular rRNA maturation and tRNA modification. The new version of the COGs is expected to become an important tool for microbial genomics. PMID:25428365

  20. Distinct mutations with different inheritance mode caused similar retinal dystrophies in one family: a demonstration of the importance of genetic annotations in complicated pedigrees.

    PubMed

    Chen, Xue; Sheng, Xunlun; Liu, Yani; Li, Zili; Sun, Xiantao; Jiang, Chao; Qi, Rui; Yuan, Shiqin; Wang, Xuhui; Zhou, Ge; Zhen, Yanyan; Xie, Ping; Liu, Qinghuai; Yan, Biao; Zhao, Chen

    2018-05-29

    Retinitis pigmentosa (RP) is the most common form of inherited retinal dystrophy presenting remarkable genetic heterogeneity. Genetic annotations would help with better clinical assessments and benefit gene therapy, and therefore should be recommended for RP patients. This report reveals the disease causing mutations in two RP pedigrees with confusing inheritance patterns using whole exome sequencing (WES). Twenty-five participants including eight patients from two families were recruited and received comprehensive ophthalmic evaluations. WES was applied for mutation identification. Bioinformatics annotations, intrafamilial co-segregation tests, and in silico analyses were subsequently conducted for mutation verification. All patients were clinically diagnosed with RP. The first family included two siblings born to parents with consanguineous marriage; however, no potential pathogenic variant was found shared by both patients. Further analysis revealed that the female patient carried a recurrent homozygous C8ORF37 p.W185*, while the male patient had hemizygous OFD1 p.T120A. The second family was found to segregate mutations in two genes, TULP1 and RP1. Two patients born to consanguineous marriage carried homozygous TULP1 p.R419W, while a recurrent heterozygous RP1 p.L762Yfs*17 was found in another four patients presenting an autosomal dominant inheritance pattern. Crystal structural analysis further indicated that the substitution from arginine to tryptophan at the highly conserved residue 419 of TULP1 could lead to the elimination of two hydrogen bonds between residue 419 and residues V488 and S534. All four genes, including C8ORF37, OFD1, TULP1 and RP1, have been previously implicated in RP etiology. Our study demonstrates the coexistence of diverse inheritance modes and mutations affecting distinct disease causing genes in two RP families with consanguineous marriage. Our data provide novel insights into assessments of complicated pedigrees, reinforce the genetic complexity of RP, and highlight the need for extensive molecular evaluations in such challenging families with diverse inheritance modes and mutations.

Top