Sample records for existing gene annotations

  1. Determining Semantically Related Significant Genes.

    PubMed

    Taha, Kamal

    2014-01-01

    GO relation embodies some aspects of existence dependency. If GO term x is existence-dependent on GO term y, the presence of y implies the presence of x. Therefore, the genes annotated with the function of the GO term y are usually functionally and semantically related to the genes annotated with the function of the GO term x. A large number of gene set enrichment analysis methods have been developed in recent years for analyzing gene set enrichment. However, most of these methods overlook the structural dependencies between GO terms in the GO graph by not considering the concept of existence dependency. We propose in this paper a biological search engine called RSGSearch that identifies enriched sets of genes annotated with different functions using the concept of existence dependency. We observe that GO term x cannot be existence-dependent on GO term y if x and y have the same specificity (biological characteristics). After encoding into a numeric format the contributions of GO terms annotating target genes to the semantics of their lowest common ancestors (LCAs), RSGSearch uses microarray experiment to identify the most significant LCA that annotates the result genes. We evaluated RSGSearch experimentally and compared it with five gene set enrichment systems. Results showed marked improvement.

  2. Similarity-based gene detection: using COGs to find evolutionarily-conserved ORFs.

    PubMed

    Powell, Bradford C; Hutchison, Clyde A

    2006-01-19

    Experimental verification of gene products has not kept pace with the rapid growth of microbial sequence information. However, existing annotations of gene locations contain sufficient information to screen for probable errors. Furthermore, comparisons among genomes become more informative as more genomes are examined. We studied all open reading frames (ORFs) of at least 30 codons from the genomes of 27 sequenced bacterial strains. We grouped the potential peptide sequences encoded from the ORFs by forming Clusters of Orthologous Groups (COGs). We used this grouping in order to find homologous relationships that would not be distinguishable from noise when using simple BLAST searches. Although COG analysis was initially developed to group annotated genes, we applied it to the task of grouping anonymous DNA sequences that may encode proteins. "Mixed COGs" of ORFs (clusters in which some sequences correspond to annotated genes and some do not) are attractive targets when seeking errors of gene prediction. Examination of mixed COGs reveals some situations in which genes appear to have been missed in current annotations and a smaller number of regions that appear to have been annotated as gene loci erroneously. This technique can also be used to detect potential pseudogenes or sequencing errors. Our method uses an adjustable parameter for degree of conservation among the studied genomes (stringency). We detail results for one level of stringency at which we found 83 potential genes which had not previously been identified, 60 potential pseudogenes, and 7 sequences with existing gene annotations that are probably incorrect. Systematic study of sequence conservation offers a way to improve existing annotations by identifying potentially homologous regions where the annotation of the presence or absence of a gene is inconsistent among genomes.
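
    The first step described above, collecting every ORF of at least 30 codons, can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes an ORF is any stop-free run of codons between two stop codons (or a sequence end) in one of the six reading frames, and the names `find_orfs` and `reverse_complement` are hypothetical.

```python
# Hypothetical sketch: enumerate stop-to-stop ORFs in all six reading frames.
# Assumption: an ORF is a stop-free run of codons; no start codon is required.
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=30):
    """Yield (strand, frame, start, end) for each stop-free run of codons.

    Coordinates are 0-based and end-exclusive on the strand being scanned;
    the stop codon itself is not included in the reported interval.
    """
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            run_start = frame
            for i in range(frame, len(s) - 2, 3):
                if s[i:i + 3] in STOP_CODONS:
                    if (i - run_start) // 3 >= min_codons:
                        yield strand, frame, run_start, i
                    run_start = i + 3
            # Handle a run that reaches the end of the sequence without a stop.
            last = frame + ((len(s) - frame) // 3) * 3
            if (last - run_start) // 3 >= min_codons:
                yield strand, frame, run_start, last
```

    The peptides translated from such intervals would then be the input to the clustering step; that step is not sketched here because the abstract does not give its details.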

  3. Similarity-based gene detection: using COGs to find evolutionarily-conserved ORFs

    PubMed Central

    Powell, Bradford C; Hutchison, Clyde A

    2006-01-01

    Background Experimental verification of gene products has not kept pace with the rapid growth of microbial sequence information. However, existing annotations of gene locations contain sufficient information to screen for probable errors. Furthermore, comparisons among genomes become more informative as more genomes are examined. We studied all open reading frames (ORFs) of at least 30 codons from the genomes of 27 sequenced bacterial strains. We grouped the potential peptide sequences encoded from the ORFs by forming Clusters of Orthologous Groups (COGs). We used this grouping in order to find homologous relationships that would not be distinguishable from noise when using simple BLAST searches. Although COG analysis was initially developed to group annotated genes, we applied it to the task of grouping anonymous DNA sequences that may encode proteins. Results "Mixed COGs" of ORFs (clusters in which some sequences correspond to annotated genes and some do not) are attractive targets when seeking errors of gene prediction. Examination of mixed COGs reveals some situations in which genes appear to have been missed in current annotations and a smaller number of regions that appear to have been annotated as gene loci erroneously. This technique can also be used to detect potential pseudogenes or sequencing errors. Our method uses an adjustable parameter for degree of conservation among the studied genomes (stringency). We detail results for one level of stringency at which we found 83 potential genes which had not previously been identified, 60 potential pseudogenes, and 7 sequences with existing gene annotations that are probably incorrect. Conclusion Systematic study of sequence conservation offers a way to improve existing annotations by identifying potentially homologous regions where the annotation of the presence or absence of a gene is inconsistent among genomes. PMID:16423288

  4. Discovering gene annotations in biomedical text databases

    PubMed Central

    Cakmak, Ali; Ozsoyoglu, Gultekin

    2008-01-01

    Background Genes and gene products are frequently annotated with Gene Ontology concepts based on the evidence provided in genomics articles. Manually locating and curating information about a genomic entity from the biomedical literature requires vast amounts of human effort. Hence, there is clearly a need for automated computational tools to annotate the genes and gene products with Gene Ontology concepts by computationally capturing the related knowledge embedded in textual data. Results In this article, we present an automated genomic entity annotation system, GEANN, which extracts information about the characteristics of genes and gene products in article abstracts from PubMed, and translates the discovered knowledge into Gene Ontology (GO) concepts, a widely-used standardized vocabulary of genomic traits. GEANN utilizes textual "extraction patterns", and a semantic matching framework to locate phrases matching a pattern and produce Gene Ontology annotations for genes and gene products. In our experiments, GEANN has reached a precision level of 78% at the recall level of 61%. On a select set of Gene Ontology concepts, GEANN either outperforms or is comparable to two other automated annotation studies. Use of WordNet for semantic pattern matching improves the precision and recall by 24% and 15%, respectively, and the improvement due to semantic pattern matching becomes more apparent as the Gene Ontology terms become more general. Conclusion GEANN is useful for two distinct purposes: (i) automating the annotation of genomic entities with Gene Ontology concepts, and (ii) providing existing annotations with additional "evidence articles" from the literature. The use of textual extraction patterns that are constructed based on the existing annotations achieves high precision. 
The semantic pattern matching framework provides a more flexible pattern matching scheme with respect to "exact matching" with the advantage of locating approximate pattern occurrences with similar semantics. Relatively low recall performance of our pattern-based approach may be enhanced either by employing a probabilistic annotation framework based on the annotation neighbourhoods in textual data, or, alternatively, the statistical enrichment threshold may be adjusted to lower values for applications that put more value on achieving higher recall values. PMID:18325104
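
    The precision and recall figures quoted above compare predicted annotations against curated ones. A minimal sketch of that comparison, assuming each annotation is represented as a (gene, GO term) pair, is given below; the function name `precision_recall` and the identifiers in the usage note are made up for illustration and are not taken from GEANN.

```python
def precision_recall(predicted, curated):
    """Return (precision, recall) of predicted annotations against a curated set.

    Both arguments are sets of hashable annotations, for example
    (gene, GO term) pairs. An empty denominator yields 0.0 by convention.
    """
    true_positives = len(predicted & curated)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(curated) if curated else 0.0
    return precision, recall
```

    For example, with four predicted pairs of which two also appear among three curated pairs, the function returns a precision of 2/4 and a recall of 2/3.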

  5. Discovering gene annotations in biomedical text databases.

    PubMed

    Cakmak, Ali; Ozsoyoglu, Gultekin

    2008-03-06

    Genes and gene products are frequently annotated with Gene Ontology concepts based on the evidence provided in genomics articles. Manually locating and curating information about a genomic entity from the biomedical literature requires vast amounts of human effort. Hence, there is clearly a need for automated computational tools to annotate the genes and gene products with Gene Ontology concepts by computationally capturing the related knowledge embedded in textual data. In this article, we present an automated genomic entity annotation system, GEANN, which extracts information about the characteristics of genes and gene products in article abstracts from PubMed, and translates the discovered knowledge into Gene Ontology (GO) concepts, a widely-used standardized vocabulary of genomic traits. GEANN utilizes textual "extraction patterns", and a semantic matching framework to locate phrases matching a pattern and produce Gene Ontology annotations for genes and gene products. In our experiments, GEANN has reached a precision level of 78% at the recall level of 61%. On a select set of Gene Ontology concepts, GEANN either outperforms or is comparable to two other automated annotation studies. Use of WordNet for semantic pattern matching improves the precision and recall by 24% and 15%, respectively, and the improvement due to semantic pattern matching becomes more apparent as the Gene Ontology terms become more general. GEANN is useful for two distinct purposes: (i) automating the annotation of genomic entities with Gene Ontology concepts, and (ii) providing existing annotations with additional "evidence articles" from the literature. The use of textual extraction patterns that are constructed based on the existing annotations achieves high precision. The semantic pattern matching framework provides a more flexible pattern matching scheme with respect to "exact matching" with the advantage of locating approximate pattern occurrences with similar semantics. 
Relatively low recall performance of our pattern-based approach may be enhanced either by employing a probabilistic annotation framework based on the annotation neighbourhoods in textual data, or, alternatively, the statistical enrichment threshold may be adjusted to lower values for applications that put more value on achieving higher recall values.

  6. Improved annotation with de novo transcriptome assembly in four social amoeba species.

    PubMed

    Singh, Reema; Lawal, Hajara M; Schilde, Christina; Glöckner, Gernot; Barton, Geoffrey J; Schaap, Pauline; Cole, Christian

    2017-01-31

    Annotation of gene models and transcripts is a fundamental step in genome sequencing projects. Often this is performed with automated prediction pipelines, which can miss complex and atypical genes or transcripts. RNA sequencing (RNA-seq) data can aid the annotation with empirical data. Here we present de novo transcriptome assemblies generated from RNA-seq data in four Dictyostelid species: D. discoideum, P. pallidum, D. fasciculatum and D. lacteum. The assemblies were incorporated with existing gene models to determine corrections and improvement on a whole-genome scale. This is the first time this has been performed in these eukaryotic species. An initial de novo transcriptome assembly was generated by Trinity for each species and then refined with Program to Assemble Spliced Alignments (PASA). The completeness and quality were assessed with the Benchmarking Universal Single-Copy Orthologs (BUSCO) and Transrate tools at each stage of the assemblies. The final datasets of 11,315-12,849 transcripts contained 5,610-7,712 updates and corrections to >50% of existing gene models including changes to hundreds or thousands of protein products. Putative novel genes are also identified and alternative splice isoforms were observed for the first time in P. pallidum, D. lacteum and D. fasciculatum. In taking a whole transcriptome approach to genome annotation with empirical data we have been able to enrich the annotations of four existing genome sequencing projects. In doing so we have identified updates to the majority of the gene annotations across all four species under study and found putative novel genes and transcripts which could be worthy for follow-up. The new transcriptome data we present here will be a valuable resource for genome curators in the Dictyostelia and we propose this effective methodology for use in other genome annotation projects.

  7. A graph-based semantic similarity measure for the gene ontology.

    PubMed

    Alvarez, Marco A; Yan, Changhui

    2011-12-01

    Existing methods for calculating semantic similarities between pairs of Gene Ontology (GO) terms and gene products often rely on external databases like Gene Ontology Annotation (GOA) that annotate gene products using the GO terms. This dependency leads to some limitations in real applications. Here, we present a semantic similarity algorithm (SSA), that relies exclusively on the GO. When calculating the semantic similarity between a pair of input GO terms, SSA takes into account the shortest path between them, the depth of their nearest common ancestor, and a novel similarity score calculated between the definitions of the involved GO terms. In our work, we use SSA to calculate semantic similarities between pairs of proteins by combining pairwise semantic similarities between the GO terms that annotate the involved proteins. The reliability of SSA was evaluated by comparing the resulting semantic similarities between proteins with the functional similarities between proteins derived from expert annotations or sequence similarity. Comparisons with existing state-of-the-art methods showed that SSA is highly competitive with the other methods. SSA provides a reliable measure for semantics similarity independent of external databases of functional-annotation observations.
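
    Two of the ingredients named above, the shortest path between two terms and the depth of their nearest common ancestor, can be computed from the ontology graph alone. The sketch below works on a toy hierarchy with made-up term names, not real GO identifiers. It assumes the path runs up from each term to a shared ancestor and that depth is the minimum number of edges to the root. It does not reproduce the SSA scoring formula or the definition-based score, which the abstract does not specify.

```python
from collections import deque

# Toy is_a hierarchy (child -> list of parents); all term names are hypothetical.
PARENTS = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "C": ["A"],
    "D": ["A", "B"],
    "E": ["C"],
}


def upward_distances(term, parents):
    """Map each ancestor of `term` (and the term itself) to its minimum edge distance."""
    dist = {term: 0}
    queue = deque([term])
    while queue:
        node = queue.popleft()
        for parent in parents[node]:
            if parent not in dist:
                dist[parent] = dist[node] + 1
                queue.append(parent)
    return dist


def path_and_ancestor_depth(a, b, parents, root="root"):
    """Return (nearest common ancestor, path length via it, its depth below root)."""
    dist_a = upward_distances(a, parents)
    dist_b = upward_distances(b, parents)
    common = set(dist_a) & set(dist_b)
    # Nearest common ancestor: smallest combined distance; ties broken by name.
    nca = min(common, key=lambda t: (dist_a[t] + dist_b[t], t))
    path_length = dist_a[nca] + dist_b[nca]
    depth = upward_distances(nca, parents)[root]
    return nca, path_length, depth
```

    In the toy graph, terms "E" and "D" meet at "A", which lies three edges from the pair combined and one edge below the root.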

  8. Maize GO annotation—methods, evaluation, and review (maize-GAMER)

    USDA-ARS's Scientific Manuscript database

    We created a new high-coverage, robust, and reproducible functional annotation of maize protein-coding genes based on Gene Ontology (GO) term assignments. Whereas the existing Phytozome and Gramene maize GO annotation sets only cover 41% and 56% of maize protein-coding genes, respectively, this stu...

  9. Importing statistical measures into Artemis enhances gene identification in the Leishmania genome project.

    PubMed

    Aggarwal, Gautam; Worthey, E A; McDonagh, Paul D; Myler, Peter J

    2003-06-07

    Seattle Biomedical Research Institute (SBRI) as part of the Leishmania Genome Network (LGN) is sequencing chromosomes of the trypanosomatid protozoan species Leishmania major. At SBRI, chromosomal sequence is annotated using a combination of trained and untrained non-consensus gene-prediction algorithms with ARTEMIS, an annotation platform with rich and user-friendly interfaces. Here we describe a methodology used to import results from three different protein-coding gene-prediction algorithms (GLIMMER, TESTCODE and GENESCAN) into the ARTEMIS sequence viewer and annotation tool. Comparison of these methods, along with the CODONUSAGE algorithm built into ARTEMIS, shows the importance of combining methods to more accurately annotate the L. major genomic sequence. An improvised and powerful tool for gene prediction has been developed by importing data from widely-used algorithms into an existing annotation platform. This approach is especially fruitful in the Leishmania genome project where there is a large proportion of novel genes requiring manual annotation.

  10. Evaluating Computational Gene Ontology Annotations.

    PubMed

    Škunca, Nives; Roberts, Richard J; Steffen, Martin

    2017-01-01

    Two avenues to understanding gene function are complementary and often overlapping: experimental work and computational prediction. While experimental annotation generally produces high-quality annotations, it is low throughput. Conversely, computational annotations have broad coverage, but the quality of annotations may be variable, and therefore evaluating the quality of computational annotations is a critical concern. In this chapter, we provide an overview of strategies to evaluate the quality of computational annotations. First, we discuss why evaluating quality in this setting is not trivial. We highlight the various issues that threaten to bias the evaluation of computational annotations, most of which stem from the incompleteness of biological databases. Second, we discuss solutions that address these issues, for example, targeted selection of new experimental annotations and leveraging the existing experimental annotations.

  11. High-throughput annotation of full-length long noncoding RNAs with capture long-read sequencing.

    PubMed

    Lagarde, Julien; Uszczynska-Ratajczak, Barbara; Carbonell, Silvia; Pérez-Lluch, Sílvia; Abad, Amaya; Davis, Carrie; Gingeras, Thomas R; Frankish, Adam; Harrow, Jennifer; Guigo, Roderic; Johnson, Rory

    2017-12-01

    Accurate annotation of genes and their transcripts is a foundation of genomics, but currently no annotation technique combines throughput and accuracy. As a result, reference gene collections remain incomplete: many gene models are fragmentary, and thousands more remain uncataloged, particularly for long noncoding RNAs (lncRNAs). To accelerate lncRNA annotation, the GENCODE consortium has developed RNA Capture Long Seq (CLS), which combines targeted RNA capture with third-generation long-read sequencing. Here we present an experimental reannotation of the GENCODE intergenic lncRNA populations in matched human and mouse tissues that resulted in novel transcript models for 3,574 and 561 gene loci, respectively. CLS approximately doubled the annotated complexity of targeted loci, outperforming existing short-read techniques. Full-length transcript models produced by CLS enabled us to definitively characterize the genomic features of lncRNAs, including promoter and gene structure, and protein-coding potential. Thus, CLS removes a long-standing bottleneck in transcriptome annotation and generates manual-quality full-length transcript models at high-throughput scales.

  12. Neurocarta: aggregating and sharing disease-gene relations for the neurosciences.

    PubMed

    Portales-Casamar, Elodie; Ch'ng, Carolyn; Lui, Frances; St-Georges, Nicolas; Zoubarev, Anton; Lai, Artemis Y; Lee, Mark; Kwok, Cathy; Kwok, Willie; Tseng, Luchia; Pavlidis, Paul

    2013-02-26

    Understanding the genetic basis of diseases is key to the development of better diagnoses and treatments. Unfortunately, only a small fraction of the existing data linking genes to phenotypes is available through online public resources and, when available, it is scattered across multiple access tools. Neurocarta is a knowledgebase that consolidates information on genes and phenotypes across multiple resources and allows tracking and exploring of the associations. The system enables automatic and manual curation of evidence supporting each association, as well as user-enabled entry of their own annotations. Phenotypes are recorded using controlled vocabularies such as the Disease Ontology to facilitate computational inference and linking to external data sources. The gene-to-phenotype associations are filtered by stringent criteria to focus on the annotations most likely to be relevant. Neurocarta is constantly growing and currently holds more than 30,000 lines of evidence linking over 7,000 genes to 2,000 different phenotypes. Neurocarta is a one-stop shop for researchers looking for candidate genes for any disorder of interest. In Neurocarta, they can review the evidence linking genes to phenotypes and filter out the evidence they're not interested in. In addition, researchers can enter their own annotations from their experiments and analyze them in the context of existing public annotations. Neurocarta's in-depth annotation of neurodevelopmental disorders makes it a unique resource for neuroscientists working on brain development.

  13. Transcriptome profile of a bovine respiratory disease pathogen: Mannheimia haemolytica PHL213

    PubMed Central

    2012-01-01

    Background Computational methods for structural gene annotation have propelled gene discovery but face certain drawbacks with regards to prokaryotic genome annotation. Identification of transcriptional start sites, demarcating overlapping gene boundaries, and identifying regulatory elements such as small RNA are not accurate using these approaches. In this study, we re-visit the structural annotation of Mannheimia haemolytica PHL213, a bovine respiratory disease pathogen. M. haemolytica is one of the causative agents of bovine respiratory disease that results in about $3 billion annual losses to the cattle industry. We used RNA-Seq and analyzed the data using freely-available computational methods and resources. The aim was to identify previously unannotated regions of the genome using RNA-Seq based expression profile to complement the existing annotation of this pathogen. Results Using the Illumina Genome Analyzer, we generated 9,055,826 reads (average length ~76 bp) and aligned them to the reference genome using Bowtie. The transcribed regions were analyzed using SAMTOOLS and custom Perl scripts in conjunction with BLAST searches and available gene annotation information. The single nucleotide resolution map enabled the identification of 14 novel protein coding regions as well as 44 potential novel sRNA. The basal transcription profile revealed that 2,506 of the 2,837 annotated regions were expressed in vitro, at 95.25% coverage, representing all broad functional gene categories in the genome. The expression profile also helped identify 518 potential operon structures involving 1,086 co-expressed pairs. We also identified 11 proteins with mutated/alternate start codons. Conclusions The application of RNA-Seq based transcriptome profiling to structural gene annotation helped correct existing annotation errors and identify potential novel protein coding regions and sRNA. 
We used computational tools to predict regulatory elements such as promoters and terminators associated with the novel expressed regions for further characterization of these novel functional elements. Our study complements the existing structural annotation of Mannheimia haemolytica PHL213 based on experimental evidence. Given the role of sRNA in virulence gene regulation and stress response, potential novel sRNA described in this study can form the framework for future studies to determine the role of sRNA, if any, in M. haemolytica pathogenesis. PMID:23046475

  14. A Weighted Multipath Measurement Based on Gene Ontology for Estimating Gene Products Similarity

    PubMed Central

    Liu, Lizhen; Dai, Xuemin; Song, Wei; Lu, Jingli

    2014-01-01

    Abstract Many different methods have been proposed for calculating the semantic similarity of term pairs based on gene ontology (GO). Most existing methods are based on information content (IC), and the methods based on IC are used more commonly than those based on the structure of GO. However, most IC-based methods not only fail to handle identical annotations but also show a strong bias toward well-annotated proteins. We propose a new method called weighted multipath measurement (WMM) for estimating the semantic similarity of gene products based on the structure of the GO. We not only considered the contribution of every path between two GO terms but also took the depth of the lowest common ancestors into account. We assigned different weights for different kinds of edges in GO graph. The similarity values calculated by WMM can be reused because they are only relative to the characteristics of GO terms. Experimental results showed that the similarity values obtained by WMM have a higher accuracy. We compared the performance of WMM with that of other methods using GO data and gene annotation datasets for yeast and humans downloaded from the GO database. We found that WMM is more suited for prediction of gene function than most existing IC-based methods and that it can distinguish proteins with identical annotations (two proteins are annotated with the same terms) from each other. PMID:25229994

  15. Using comparative genome analysis to identify problems in annotated microbial genomes.

    PubMed

    Poptsova, Maria S; Gogarten, J Peter

    2010-07-01

    Genome annotation is a tedious task that is mostly done by automated methods; however, the accuracy of these approaches has been questioned since the beginning of the sequencing era. Genome annotation is a multilevel process, and errors can emerge at different stages: during sequencing, as a result of gene-calling procedures, and in the process of assigning gene functions. Missed or wrongly annotated genes differentially impact different types of analyses. Here we discuss and demonstrate how the methods of comparative genome analysis can refine annotations by locating missing orthologues. We also discuss possible reasons for errors and show that the second-generation annotation systems, which combine multiple gene-calling programs with similarity-based methods, perform much better than the first annotation tools. Since old errors may propagate to the newly sequenced genomes, we emphasize that the problem of continuously updating popular public databases is an urgent and unresolved one. Due to the progress in genome-sequencing technologies, automated annotation techniques will remain the main approach in the future. Researchers need to be aware of the existing errors in the annotation of even well-studied genomes, such as Escherichia coli, and consider additional quality control for their results.

  16. Mining functionally relevant gene sets for analyzing physiologically novel clinical expression data.

    PubMed

    Turcan, Sevin; Vetter, Douglas E; Maron, Jill L; Wei, Xintao; Slonim, Donna K

    2011-01-01

    Gene set analyses have become a standard approach for increasing the sensitivity of transcriptomic studies. However, analytical methods incorporating gene sets require the availability of pre-defined gene sets relevant to the underlying physiology being studied. For novel physiological problems, relevant gene sets may be unavailable or existing gene set databases may bias the results towards only the best-studied of the relevant biological processes. We describe a successful attempt to mine novel functional gene sets for translational projects where the underlying physiology is not necessarily well characterized in existing annotation databases. We choose targeted training data from public expression data repositories and define new criteria for selecting biclusters to serve as candidate gene sets. Many of the discovered gene sets show little or no enrichment for informative Gene Ontology terms or other functional annotation. However, we observe that such gene sets show coherent differential expression in new clinical test data sets, even if derived from different species, tissues, and disease states. We demonstrate the efficacy of this method on a human metabolic data set, where we discover novel, uncharacterized gene sets that are diagnostic of diabetes, and on additional data sets related to neuronal processes and human development. Our results suggest that our approach may be an efficient way to generate a collection of gene sets relevant to the analysis of data for novel clinical applications where existing functional annotation is relatively incomplete.

  17. A Resource of Quantitative Functional Annotation for Homo sapiens Genes.

    PubMed

    Taşan, Murat; Drabkin, Harold J; Beaver, John E; Chua, Hon Nian; Dunham, Julie; Tian, Weidong; Blake, Judith A; Roth, Frederick P

    2012-02-01

    The body of human genomic and proteomic evidence continues to grow at ever-increasing rates, while annotation efforts struggle to keep pace. A surprisingly small fraction of human genes have clear, documented associations with specific functions, and new functions continue to be found for characterized genes. Here we assembled an integrated collection of diverse genomic and proteomic data for 21,341 human genes and make quantitative associations of each to 4333 Gene Ontology terms. We combined guilt-by-profiling and guilt-by-association approaches to exploit features unique to the data types. Performance was evaluated by cross-validation, prospective validation, and by manual evaluation with the biological literature. Functional-linkage networks were also constructed, and their utility was demonstrated by identifying candidate genes related to a glioma FLN using a seed network from genome-wide association studies. Our annotations are presented, alongside existing validated annotations, in a publicly accessible and searchable web interface.

  18. BEACON: automated tool for Bacterial GEnome Annotation ComparisON.

    PubMed

    Kalkatawi, Manal; Alam, Intikhab; Bajic, Vladimir B

    2015-08-18

    Genome annotation is one way of summarizing the existing knowledge about genomic characteristics of an organism. There has been an increased interest during the last several decades in computer-based structural and functional genome annotation. Many methods for this purpose have been developed for eukaryotes and prokaryotes. Our study focuses on comparison of functional annotations of prokaryotic genomes. To the best of our knowledge there is no fully automated system for detailed comparison of functional genome annotations generated by different annotation methods (AMs). The presence of many AMs and development of new ones introduce needs to: a/ compare different annotations for a single genome, and b/ generate annotation by combining individual ones. To address these issues we developed an Automated Tool for Bacterial GEnome Annotation ComparisON (BEACON) that benefits both AM developers and annotation analysers. BEACON provides detailed comparison of gene function annotations of prokaryotic genomes obtained by different AMs and generates extended annotations through combination of individual ones. For the illustration of BEACON's utility, we provide a comparison analysis of multiple different annotations generated for four genomes and show on these examples that the extended annotation can increase the number of genes annotated by putative functions up to 27%, while the number of genes without any function assignment is reduced. We developed BEACON, a fast tool for an automated and a systematic comparison of different annotations of single genomes. The extended annotation assigns putative functions to many genes with unknown functions. BEACON is available under GNU General Public License version 3.0 and is accessible at: http://www.cbrc.kaust.edu.sa/BEACON/ .

  19. Evaluation and integration of functional annotation pipelines for newly sequenced organisms: the potato genome as a test case.

    PubMed

    Amar, David; Frades, Itziar; Danek, Agnieszka; Goldberg, Tatyana; Sharma, Sanjeev K; Hedley, Pete E; Proux-Wera, Estelle; Andreasson, Erik; Shamir, Ron; Tzfadia, Oren; Alexandersson, Erik

    2014-12-05

    For most organisms, even if their genome sequence is available, little functional information about individual genes or proteins exists. Several annotation pipelines have been developed for functional analysis based on sequence, 'omics', and literature data. However, researchers encounter little guidance on how well they perform. Here, we used the recently sequenced potato genome as a case study. The potato genome was selected since its genome is newly sequenced and it is a non-model plant even if there is relatively ample information on individual potato genes, and multiple gene expression profiles are available. We show that the automatic gene annotations of potato have low accuracy when compared to a "gold standard" based on experimentally validated potato genes. Furthermore, we evaluate six state-of-the-art annotation pipelines and show that their predictions are markedly dissimilar (Jaccard similarity coefficient of 0.27 between pipelines on average). To overcome this discrepancy, we introduce a simple GO structure-based algorithm that reconciles the predictions of the different pipelines. We show that the integrated annotation covers more genes, increases by over 50% the number of highly co-expressed GO processes, and obtains much higher agreement with the gold standard. We find that different annotation pipelines produce different results, and show how to integrate them into a unified annotation that is of higher quality than each single pipeline. We offer an improved functional annotation of both PGSC and ITAG potato gene models, as well as tools that can be applied to additional pipelines and improve annotation in other organisms. This will greatly aid future functional analysis of '-omics' datasets from potato and other organisms with newly sequenced genomes. The new potato annotations are available with this paper.
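
    The average Jaccard similarity quoted above measures how much the predictions of different pipelines overlap. A minimal sketch of such a computation is shown below. It assumes each pipeline's output is reduced to a set of annotations (for example, gene and GO term pairs), which may differ from how the authors built their sets. The function names are hypothetical, and treating two empty sets as identical is a convention chosen here.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two sets (1.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def mean_pairwise_jaccard(predictions):
    """Average Jaccard similarity over all pairs of pipelines.

    `predictions` maps a pipeline name to its set of annotations.
    """
    pairs = list(combinations(predictions.values(), 2))
    if not pairs:
        raise ValueError("need at least two pipelines to compare")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```

    A low average, as reported in the abstract, indicates that most annotations are produced by only some of the pipelines, which is the motivation for reconciling them.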

  20. sigReannot: an oligo-set re-annotation pipeline based on similarities with the Ensembl transcripts and Unigene clusters.

    PubMed

    Casel, Pierrot; Moreews, François; Lagarrigue, Sandrine; Klopp, Christophe

    2009-07-16

    Microarrays are a powerful technology for monitoring tens of thousands of genes in a single experiment. Most microarrays now use oligo-sets. The design of the oligo-nucleotides is time consuming and error prone. Genome-wide microarray oligo-sets are designed using as large a set of transcripts as possible in order to monitor as many genes as possible. Depending on the state of genome sequencing and assembly, the knowledge of the existing transcripts can be very different. This knowledge evolves with the different genome builds and gene builds. Once the design is done, the microarrays are often used for several years. The biologists working in EADGENE expressed the need for up-to-date annotation files for the oligo-sets they share, including information about the orthologous genes of model species, the Gene Ontology, the corresponding pathways and the chromosomal location. The results of SigReannot on a chicken microarray used in the EADGENE project, compared to the initial annotations, show that 23% of the oligo-nucleotide gene annotations were not confirmed, 2% were modified and 1% were added. The value of this up-to-date annotation procedure is demonstrated through the analysis of previously published real data. SigReannot uses the oligo-nucleotide design procedure criteria to validate the probe-gene link and the Ensembl transcripts as reference for annotation. It therefore produces a high quality annotation based on reference gene sets.

  1. Bacillus anthracis genome organization in light of whole transcriptome sequencing

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Martin, Jeffrey; Zhu, Wenhan; Passalacqua, Karla D.

    2010-03-22

    Emerging knowledge of whole prokaryotic transcriptomes could validate a number of theoretical concepts introduced in the early days of genomics. What are the rules connecting gene expression levels with sequence determinants such as quantitative scores of promoters and terminators? Are translation efficiency measures, e.g. codon adaptation index and RBS score related to gene expression? We used the whole transcriptome shotgun sequencing of a bacterial pathogen Bacillus anthracis to assess correlation of gene expression level with promoter, terminator and RBS scores, codon adaptation index, as well as with a new measure of gene translational efficiency, average translation speed. We compared computational predictions of operon topologies with the transcript borders inferred from RNA-Seq reads. Transcriptome mapping may also improve existing gene annotation. Upon assessment of accuracy of current annotation of protein-coding genes in the B. anthracis genome we have shown that the transcriptome data indicate existence of more than a hundred genes missing in the annotation though predicted by an ab initio gene finder. Interestingly, we observed that many pseudogenes possess not only a sequence with detectable coding potential but also promoters that maintain transcriptional activity.

  2. Comparative analysis of grapevine whole-genome gene predictions, functional annotation, categorization and integration of the predicted gene sequences

    PubMed Central

    2012-01-01

    Background The first draft assembly and gene prediction of the grapevine genome (8X base coverage) was made available to the scientific community in 2007, and functional annotation was developed on this gene prediction. Since then additional Sanger sequences were added to the 8X sequences pool and a new version of the genomic sequence with superior base coverage (12X) was produced. Results In order to more efficiently annotate the function of the genes predicted in the new assembly, it is important to build on as much of the previous work as possible, by transferring 8X annotation of the genome to the 12X version. The 8X and 12X assemblies and gene predictions of the grapevine genome were compared to answer the question, “Can we uniquely map 8X predicted genes to 12X predicted genes?” The results show that while the assemblies and gene structure predictions are too different to make a complete mapping between them, most genes (18,725) showed a one-to-one relationship between 8X predicted genes and the last version of 12X predicted genes. In addition, reshuffled genomic sequence structures appeared. These highlight regions of the genome where the gene predictions need to be taken with caution. Based on the new grapevine gene functional annotation and in-depth functional categorization, twenty eight new molecular networks have been created for VitisNet while the existing networks were updated. Conclusions The outcomes of this study provide a functional annotation of the 12X genes, an update of VitisNet, the system of the grapevine molecular networks, and a new functional categorization of genes. Data are available at the VitisNet website (http://www.sdstate.edu/ps/research/vitis/pathways.cfm). PMID:22554261
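
    The question of whether 8X predicted genes map uniquely to 12X predicted genes can be sketched as a filter over candidate gene pairs. This is only an illustration, assuming the pairs have already been produced by some sequence comparison; the function name and gene identifiers are hypothetical and do not come from the study.

```python
from collections import Counter


def one_to_one_pairs(pairs):
    """Keep (old_gene, new_gene) pairs in which each gene occurs exactly once."""
    pairs = set(pairs)  # ignore duplicated pairs
    old_counts = Counter(old for old, _ in pairs)
    new_counts = Counter(new for _, new in pairs)
    return sorted(
        (old, new)
        for old, new in pairs
        if old_counts[old] == 1 and new_counts[new] == 1
    )


# Hypothetical candidate mappings between two gene prediction versions.
candidate_pairs = [
    ("old_gene_1", "new_gene_1"),  # unique on both sides
    ("old_gene_2", "new_gene_2"),  # old_gene_2 maps to two new genes (split)
    ("old_gene_2", "new_gene_3"),
    ("old_gene_3", "new_gene_4"),  # new_gene_4 receives two old genes (merge)
    ("old_gene_4", "new_gene_4"),
]

print(one_to_one_pairs(candidate_pairs))
```

    Genes excluded by this filter correspond to split or merged predictions, which are the cases where annotation transfer between versions needs manual attention.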

  3. Supporting community annotation and user collaboration in the integrated microbial genomes (IMG) system.

    PubMed

    Chen, I-Min A; Markowitz, Victor M; Palaniappan, Krishna; Szeto, Ernest; Chu, Ken; Huang, Jinghua; Ratner, Anna; Pillay, Manoj; Hadjithomas, Michalis; Huntemann, Marcel; Mikhailova, Natalia; Ovchinnikova, Galina; Ivanova, Natalia N; Kyrpides, Nikos C

    2016-04-26

    The exponential growth of genomic data from next generation technologies renders traditional manual expert curation effort unsustainable. Many genomic systems have included community annotation tools to address the problem. Most of these systems adopted a "Wiki-based" approach to take advantage of existing wiki technologies, but encountered obstacles in issues such as usability, authorship recognition, information reliability and incentive for community participation. Here, we present a different approach, relying on a tightly integrated method rather than a "Wiki-based" method, to support community annotation and user collaboration in the Integrated Microbial Genomes (IMG) system. The IMG approach allows users to use the existing IMG data warehouse and analysis tools to add gene, pathway and biosynthetic cluster annotations, to analyze/reorganize contigs, genes and functions using workspace datasets, and to share private user annotations and workspace datasets with collaborators. We show that the annotation effort using IMG can be part of the research process to overcome the user incentive and authorship recognition problems, thus fostering collaboration among domain experts. The usability and reliability issues are addressed by the integration of curated information and analysis tools in IMG, together with DOE Joint Genome Institute (JGI) expert review. By incorporating annotation operations into IMG, we provide an integrated environment for users to perform deeper and extended data analysis and annotation in a single system that can lead to publications and community knowledge sharing as shown in the case studies.

  4. Fast gene ontology based clustering for microarray experiments.

    PubMed

    Ovaska, Kristian; Laakso, Marko; Hautaniemi, Sampsa

    2008-11-21

    Analysis of a microarray experiment often results in a list of hundreds of disease-associated genes. In order to suggest common biological processes and functions for these genes, Gene Ontology annotations with statistical testing are widely used. However, these analyses can produce a very large number of significantly altered biological processes. Thus, it is often challenging to interpret GO results and identify novel testable biological hypotheses. We present fast software for advanced gene annotation using semantic similarity for Gene Ontology terms combined with clustering and heat map visualisation. The methodology allows rapid identification of genes sharing the same Gene Ontology cluster. Our R based semantic similarity open-source package has a speed advantage of over 2000-fold compared to existing implementations. From the resulting hierarchical clustering dendrogram genes sharing a GO term can be identified, and their differences in the gene expression patterns can be seen from the heat map. These methods facilitate advanced annotation of genes resulting from data analysis.

  5. Muscle Research and Gene Ontology: New standards for improved data integration

    PubMed Central

    Feltrin, Erika; Campanaro, Stefano; Diehl, Alexander D; Ehler, Elisabeth; Faulkner, Georgine; Fordham, Jennifer; Gardin, Chiara; Harris, Midori; Hill, David; Knoell, Ralph; Laveder, Paolo; Mittempergher, Lorenza; Nori, Alessandra; Reggiani, Carlo; Sorrentino, Vincenzo; Volpe, Pompeo; Zara, Ivano; Valle, Giorgio; Deegan née Clark, Jennifer

    2009-01-01

    Background The Gene Ontology Project provides structured controlled vocabularies for molecular biology that can be used for the functional annotation of genes and gene products. In a collaboration between the Gene Ontology (GO) Consortium and the muscle biology community, we have made large-scale additions to the GO biological process and cellular component ontologies. The main focus of this ontology development work concerns skeletal muscle, with specific consideration given to the processes of muscle contraction, plasticity, development, and regeneration, and to the sarcomere and membrane-delimited compartments. Our aims were to update the existing structure to reflect current knowledge, and to resolve, in an accommodating manner, the ambiguity in the language used by the community. Results The updated muscle terminologies have been incorporated into the GO. There are now 159 new terms covering critical research areas, and 57 existing terms have been improved and reorganized to follow their usage in muscle literature. Conclusion The revised GO structure should improve the interpretation of data from high-throughput (e.g. microarray and proteomic) experiments in the area of muscle science and muscle disease. We actively encourage community feedback on, and gene product annotation with these new terms. Please visit the Muscle Community Annotation Wiki. PMID:19178689

  6. Using GO-WAR for mining cross-ontology weighted association rules.

    PubMed

    Agapito, Giuseppe; Cannataro, Mario; Guzzi, Pietro Hiram; Milano, Marianna

    2015-07-01

    The Gene Ontology (GO) is a structured repository of concepts (GO terms) that are associated with one or more gene products. The process of association is referred to as annotation. The relevance and the specificity of both GO terms and annotations are evaluated by a measure defined as information content (IC). The analysis of annotated data is thus an important challenge for bioinformatics. Different approaches to analysis exist. Among them, association rules (AR) may provide useful knowledge and have been used in some applications, e.g. improving the quality of annotations. Nevertheless, classical association rule algorithms take into account neither the source of an annotation nor its importance, leading to the generation of candidate rules with low IC. This paper presents GO-WAR (Gene Ontology-based Weighted Association Rules), a methodology for extracting weighted association rules. GO-WAR can extract association rules with a high level of IC without loss of support and confidence from a dataset of annotated data. A case study applying GO-WAR to publicly available GO annotation datasets demonstrates that our method outperforms current state-of-the-art approaches. Copyright © 2015 Elsevier Ireland Ltd. All rights reserved.
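
    The quantities named in this abstract (support, confidence and information content) can be sketched for the classical, unweighted case. This is not the GO-WAR weighting scheme, which the abstract does not specify. The sketch assumes each gene product is represented by the set of GO terms annotating it, and it estimates IC from the plain annotation frequency without propagating annotations to ancestor terms; the term identifiers are hypothetical placeholders.

```python
import math


def support(transactions, items):
    """Fraction of transactions (annotation sets) containing all given items."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent."""
    s_antecedent = support(transactions, antecedent)
    if s_antecedent == 0:
        return 0.0
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / s_antecedent


def information_content(transactions, term):
    """IC = -log2(p), with p the fraction of gene products annotated with term."""
    p = support(transactions, {term})
    return -math.log2(p) if p > 0 else float("inf")


# Hypothetical annotation sets, one per gene product.
annotations = [
    {"GO:A", "GO:B"},
    {"GO:A", "GO:B", "GO:C"},
    {"GO:A"},
    {"GO:B", "GO:C"},
]

print(support(annotations, {"GO:A", "GO:B"}))       # rule support
print(confidence(annotations, {"GO:A"}, {"GO:B"}))  # rule confidence
print(information_content(annotations, "GO:C"))     # rarer terms score higher
```

    Frequent, general terms have low IC, which is why rules built only from support and confidence tend to favour uninformative terms.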

  7. Mapping annotations with textual evidence using an scLDA model.

    PubMed

    Jin, Bo; Chen, Vicky; Chen, Lujia; Lu, Xinghua

    2011-01-01

    Most of the knowledge regarding genes and proteins is stored in biomedical literature as free text. Extracting information from complex biomedical texts demands techniques capable of inferring biological concepts from local text regions and mapping them to controlled vocabularies. To this end, we present a sentence-based correspondence latent Dirichlet allocation (scLDA) model which, when trained with a corpus of PubMed documents with known GO annotations, performs the following tasks: 1) learning major biological concepts from the corpus, 2) inferring the biological concepts existing within text regions (sentences), and 3) identifying the text regions in a document that provides evidence for the observed annotations. When applied to new gene-related documents, a trained scLDA model is capable of predicting GO annotations and identifying text regions as textual evidence supporting the predicted annotations. This study uses GO annotation data as a testbed; the approach can be generalized to other annotated data, such as MeSH and MEDLINE documents.

  8. BRILIA: Integrated Tool for High-Throughput Annotation and Lineage Tree Assembly of B-Cell Repertoires.

    PubMed

    Lee, Donald W; Khavrutskii, Ilja V; Wallqvist, Anders; Bavari, Sina; Cooper, Christopher L; Chaudhury, Sidhartha

    2016-01-01

    The somatic diversity of antigen-recognizing B-cell receptors (BCRs) arises from Variable (V), Diversity (D), and Joining (J) (VDJ) recombination and somatic hypermutation (SHM) during B-cell development and affinity maturation. The VDJ junction of the BCR heavy chain forms the highly variable complementarity determining region 3 (CDR3), which plays a critical role in antigen specificity and binding affinity. Tracking the selection and mutation of the CDR3 can be useful in characterizing humoral responses to infection and vaccination. Although tens to hundreds of thousands of unique BCR genes within an expressed B-cell repertoire can now be resolved with high-throughput sequencing, tracking SHMs is still challenging because existing annotation methods are often limited by poor annotation coverage, inconsistent SHM identification across the VDJ junction, or lack of B-cell lineage data. Here, we present B-cell repertoire inductive lineage and immunosequence annotator (BRILIA), an algorithm that leverages repertoire-wide sequencing data to globally improve the VDJ annotation coverage, lineage tree assembly, and SHM identification. On benchmark tests against simulated human and mouse BCR repertoires, BRILIA correctly annotated germline and clonally expanded sequences with 94 and 70% accuracy, respectively, and it has a 90% SHM-positive prediction rate in the CDR3 of heavily mutated sequences; these are substantial improvements over existing methods. We used BRILIA to process BCR sequences obtained from splenic germinal center B cells extracted from C57BL/6 mice. BRILIA returned robust B-cell lineage trees and yielded SHM patterns that are consistent across the VDJ junction and agree with known biological mechanisms of SHM. By contrast, existing BCR annotation tools, which do not account for repertoire-wide clonal relationships, systematically underestimated the size of clonally related B-cell clusters and yielded inconsistent SHM frequencies. We demonstrate BRILIA's utility in B-cell repertoire studies related to VDJ gene usage, mechanisms for adenosine mutations, and SHM hot spot motifs. Furthermore, we show that the complete gene usage annotation and SHM identification across the entire CDR3 are essential for studying the B-cell affinity maturation process through immunosequencing methods.

  9. Muscle Research and Gene Ontology: New standards for improved data integration.

    PubMed

    Feltrin, Erika; Campanaro, Stefano; Diehl, Alexander D; Ehler, Elisabeth; Faulkner, Georgine; Fordham, Jennifer; Gardin, Chiara; Harris, Midori; Hill, David; Knoell, Ralph; Laveder, Paolo; Mittempergher, Lorenza; Nori, Alessandra; Reggiani, Carlo; Sorrentino, Vincenzo; Volpe, Pompeo; Zara, Ivano; Valle, Giorgio; Deegan, Jennifer

    2009-01-29

    The Gene Ontology Project provides structured controlled vocabularies for molecular biology that can be used for the functional annotation of genes and gene products. In a collaboration between the Gene Ontology (GO) Consortium and the muscle biology community, we have made large-scale additions to the GO biological process and cellular component ontologies. The main focus of this ontology development work concerns skeletal muscle, with specific consideration given to the processes of muscle contraction, plasticity, development, and regeneration, and to the sarcomere and membrane-delimited compartments. Our aims were to update the existing structure to reflect current knowledge, and to resolve, in an accommodating manner, the ambiguity in the language used by the community. The updated muscle terminologies have been incorporated into the GO. There are now 159 new terms covering critical research areas, and 57 existing terms have been improved and reorganized to follow their usage in muscle literature. The revised GO structure should improve the interpretation of data from high-throughput (e.g. microarray and proteomic) experiments in the area of muscle science and muscle disease. We actively encourage community feedback on, and gene product annotation with these new terms. Please visit the Muscle Community Annotation Wiki http://wiki.geneontology.org/index.php/Muscle_Biology.

  10. Improved Annotation of 3′ Untranslated Regions and Complex Loci by Combination of Strand-Specific Direct RNA Sequencing, RNA-Seq and ESTs

    PubMed Central

    Song, Junfang; Duc, Céline; Storey, Kate G.; McLean, W. H. Irwin; Brown, Sara J.; Simpson, Gordon G.; Barton, Geoffrey J.

    2014-01-01

    The reference annotations made for a genome sequence provide the framework for all subsequent analyses of the genome. Correct and complete annotation in addition to the underlying genomic sequence is particularly important when interpreting the results of RNA-seq experiments where short sequence reads are mapped against the genome and assigned to genes according to the annotation. Inconsistencies in annotations between the reference and the experimental system can lead to incorrect interpretation of the effect on RNA expression of an experimental treatment or mutation in the system under study. Until recently, the genome-wide annotation of 3′ untranslated regions received less attention than coding regions and the delineation of intron/exon boundaries. In this paper, data produced for samples in Human, Chicken and A. thaliana by the novel single-molecule, strand-specific, Direct RNA Sequencing technology from Helicos Biosciences which locates 3′ polyadenylation sites to within +/− 2 nt, were combined with archival EST and RNA-Seq data. Nine examples are illustrated where this combination of data allowed: (1) gene and 3′ UTR re-annotation (including extension of one 3′ UTR by 5.9 kb); (2) disentangling of gene expression in complex regions; (3) clearer interpretation of small RNA expression and (4) identification of novel genes. While the specific examples displayed here may become obsolete as genome sequences and their annotations are refined, the principles laid out in this paper will be of general use both to those annotating genomes and those seeking to interpret existing publicly available annotations in the context of their own experimental data. PMID:24722185

  11. Solving the Problem: Genome Annotation Standards before the Data Deluge.

    PubMed

    Klimke, William; O'Donovan, Claire; White, Owen; Brister, J Rodney; Clark, Karen; Fedorov, Boris; Mizrachi, Ilene; Pruitt, Kim D; Tatusova, Tatiana

    2011-10-15

    The promise of genome sequencing was that the vast undiscovered country would be mapped out by comparison of the multitude of sequences available and would aid researchers in deciphering the role of each gene in every organism. Researchers recognize that there is a need for high quality data. However, different annotation procedures, numerous databases, and a diminishing percentage of experimentally determined gene functions have resulted in a spectrum of annotation quality. NCBI in collaboration with sequencing centers, archival databases, and researchers, has developed the first international annotation standards, a fundamental step in ensuring that high quality complete prokaryotic genomes are available as gold standard references. Highlights include the development of annotation assessment tools, community acceptance of protein naming standards, comparison of annotation resources to provide consistent annotation, and improved tracking of the evidence used to generate a particular annotation. The development of a set of minimal standards, including the requirement for annotated complete prokaryotic genomes to contain a full set of ribosomal RNAs, transfer RNAs, and proteins encoding core conserved functions, is an historic milestone. The use of these standards in existing genomes and future submissions will increase the quality of databases, enabling researchers to make accurate biological discoveries.

  12. Solving the Problem: Genome Annotation Standards before the Data Deluge

    PubMed Central

    Klimke, William; O'Donovan, Claire; White, Owen; Brister, J. Rodney; Clark, Karen; Fedorov, Boris; Mizrachi, Ilene; Pruitt, Kim D.; Tatusova, Tatiana

    2011-01-01

    The promise of genome sequencing was that the vast undiscovered country would be mapped out by comparison of the multitude of sequences available and would aid researchers in deciphering the role of each gene in every organism. Researchers recognize that there is a need for high quality data. However, different annotation procedures, numerous databases, and a diminishing percentage of experimentally determined gene functions have resulted in a spectrum of annotation quality. NCBI in collaboration with sequencing centers, archival databases, and researchers, has developed the first international annotation standards, a fundamental step in ensuring that high quality complete prokaryotic genomes are available as gold standard references. Highlights include the development of annotation assessment tools, community acceptance of protein naming standards, comparison of annotation resources to provide consistent annotation, and improved tracking of the evidence used to generate a particular annotation. The development of a set of minimal standards, including the requirement for annotated complete prokaryotic genomes to contain a full set of ribosomal RNAs, transfer RNAs, and proteins encoding core conserved functions, is an historic milestone. The use of these standards in existing genomes and future submissions will increase the quality of databases, enabling researchers to make accurate biological discoveries. PMID:22180819

  13. xGDBvm: A Web GUI-Driven Workflow for Annotating Eukaryotic Genomes in the Cloud

    PubMed Central

    Merchant, Nirav

    2016-01-01

    Genome-wide annotation of gene structure requires the integration of numerous computational steps. Currently, annotation is arguably best accomplished through collaboration of bioinformatics and domain experts, with broad community involvement. However, such a collaborative approach is not scalable at today’s pace of sequence generation. To address this problem, we developed the xGDBvm software, which uses an intuitive graphical user interface to access a number of common genome analysis and gene structure tools, preconfigured in a self-contained virtual machine image. Once their virtual machine instance is deployed through iPlant’s Atmosphere cloud services, users access the xGDBvm workflow via a unified Web interface to manage inputs, set program parameters, configure links to high-performance computing (HPC) resources, view and manage output, apply analysis and editing tools, or access contextual help. The xGDBvm workflow will mask the genome, compute spliced alignments from transcript and/or protein inputs (locally or on a remote HPC cluster), predict gene structures and gene structure quality, and display output in a public or private genome browser complete with accessory tools. Problematic gene predictions are flagged and can be reannotated using the integrated yrGATE annotation tool. xGDBvm can also be configured to append or replace existing data or load precomputed data. Multiple genomes can be annotated and displayed, and outputs can be archived for sharing or backup. xGDBvm can be adapted to a variety of use cases including de novo genome annotation, reannotation, comparison of different annotations, and training or teaching. PMID:27020957

  14. xGDBvm: A Web GUI-Driven Workflow for Annotating Eukaryotic Genomes in the Cloud.

    PubMed

    Duvick, Jon; Standage, Daniel S; Merchant, Nirav; Brendel, Volker P

    2016-04-01

    Genome-wide annotation of gene structure requires the integration of numerous computational steps. Currently, annotation is arguably best accomplished through collaboration of bioinformatics and domain experts, with broad community involvement. However, such a collaborative approach is not scalable at today's pace of sequence generation. To address this problem, we developed the xGDBvm software, which uses an intuitive graphical user interface to access a number of common genome analysis and gene structure tools, preconfigured in a self-contained virtual machine image. Once their virtual machine instance is deployed through iPlant's Atmosphere cloud services, users access the xGDBvm workflow via a unified Web interface to manage inputs, set program parameters, configure links to high-performance computing (HPC) resources, view and manage output, apply analysis and editing tools, or access contextual help. The xGDBvm workflow will mask the genome, compute spliced alignments from transcript and/or protein inputs (locally or on a remote HPC cluster), predict gene structures and gene structure quality, and display output in a public or private genome browser complete with accessory tools. Problematic gene predictions are flagged and can be reannotated using the integrated yrGATE annotation tool. xGDBvm can also be configured to append or replace existing data or load precomputed data. Multiple genomes can be annotated and displayed, and outputs can be archived for sharing or backup. xGDBvm can be adapted to a variety of use cases including de novo genome annotation, reannotation, comparison of different annotations, and training or teaching. © 2016 American Society of Plant Biologists. All rights reserved.

  15. GAPP: A Proteogenomic Software for Genome Annotation and Global Profiling of Post-translational Modifications in Prokaryotes.

    PubMed

    Zhang, Jia; Yang, Ming-Kun; Zeng, Honghui; Ge, Feng

    2016-11-01

    Although the number of sequenced prokaryotic genomes is growing rapidly, experimentally verified annotation of prokaryotic genome remains patchy and challenging. To facilitate genome annotation efforts for prokaryotes, we developed an open source software called GAPP for genome annotation and global profiling of post-translational modifications (PTMs) in prokaryotes. With a single command, it provides a standard workflow to validate and refine predicted genetic models and discover diverse PTM events. We demonstrated the utility of GAPP using proteomic data from Helicobacter pylori, one of the major human pathogens that is responsible for many gastric diseases. Our results confirmed 84.9% of the existing predicted H. pylori proteins, identified 20 novel protein coding genes, and corrected four existing gene models with regard to translation initiation sites. In particular, GAPP revealed a large repertoire of PTMs using the same proteomic data and provided a rich resource that can be used to examine the functions of reversible modifications in this human pathogen. This software is a powerful tool for genome annotation and global discovery of PTMs and is applicable to any sequenced prokaryotic organism; we expect that it will become an integral part of ongoing genome annotation efforts for prokaryotes. GAPP is freely available at https://sourceforge.net/projects/gappproteogenomic/. © 2016 by The American Society for Biochemistry and Molecular Biology, Inc.

  16. PanCoreGen - Profiling, detecting, annotating protein-coding genes in microbial genomes.

    PubMed

    Paul, Sandip; Bhardwaj, Archana; Bag, Sumit K; Sokurenko, Evgeni V; Chattopadhyay, Sujay

    2015-12-01

    A large amount of genomic data, especially from multiple isolates of a single species, has opened new vistas for microbial genomics analysis. Analyzing the pan-genome (i.e. the sum of genetic repertoire) of microbial species is crucial in understanding the dynamics of molecular evolution, where virulence evolution is of major interest. Here we present PanCoreGen - a standalone application for pan- and core-genomic profiling of microbial protein-coding genes. PanCoreGen overcomes key limitations of the existing pan-genomic analysis tools, and develops an integrated annotation-structure for a species-specific pan-genomic profile. It provides important new features for annotating draft genomes/contigs and detecting unidentified genes in annotated genomes. It also generates user-defined group-specific datasets within the pan-genome. Interestingly, analyzing an example-set of Salmonella genomes, we detect potential footprints of adaptive convergence of horizontally transferred genes in two human-restricted pathogenic serovars - Typhi and Paratyphi A. Overall, PanCoreGen represents a state-of-the-art tool for microbial phylogenomics and pathogenomics study. Copyright © 2015 Elsevier Inc. All rights reserved.
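
    The distinction between the pan-genome and the core genome can be sketched with set operations. This is a simplified illustration, not PanCoreGen's procedure: it assumes genes have already been grouped into shared gene-family identifiers across isolates, and the isolate and family names are hypothetical.

```python
def pan_and_core(genomes):
    """Return (pan, core): families found in any genome and in every genome."""
    gene_sets = [set(genes) for genes in genomes.values()]
    if not gene_sets:
        return set(), set()
    pan = set().union(*gene_sets)
    core = set.intersection(*gene_sets)
    return pan, core


# Hypothetical gene-family content of three isolates of one species.
isolates = {
    "isolate_1": {"famA", "famB", "famC"},
    "isolate_2": {"famA", "famB", "famD"},
    "isolate_3": {"famA", "famB"},
}

pan, core = pan_and_core(isolates)
print(sorted(pan))         # every family seen in at least one isolate
print(sorted(core))        # families shared by all isolates
print(sorted(pan - core))  # accessory families, present in only some isolates
```

    Group-specific gene sets, such as families found only in selected isolates, follow from the same set arithmetic applied to subsets of the genomes.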

  17. Novel strategies to mine alcoholism-related haplotypes and genes by combining existing knowledge framework.

    PubMed

    Zhang, RuiJie; Li, Xia; Jiang, YongShuai; Liu, GuiYou; Li, ChuanXing; Zhang, Fan; Xiao, Yun; Gong, BinSheng

    2009-02-01

    High-throughput single nucleotide polymorphism (SNP) detection technology and existing knowledge provide strong support for mining disease-related haplotypes and genes. In this study, first, we apply four haplotype block identification methods (Confidence Intervals, Four Gamete Tests, Solid Spine of LD and a fusing method of haplotype blocks) to high-throughput SNP genotype data to identify blocks, then use cluster analysis to verify the effectiveness of the four methods, and select the alcoholism-related SNP haplotypes through risk analysis. Second, we establish a mapping from haplotypes to alcoholism-related genes. Third, we query the NCBI SNP and gene databases to locate the blocks and identify the candidate genes. Finally, we annotate gene function using the KEGG, Biocarta, and GO databases. We find 159 haplotype blocks most likely related to alcoholism on chromosomes 1-22, including 227 haplotypes, of which 102 SNP haplotypes may increase the risk of alcoholism. We obtain 121 alcoholism-related genes and verify their reliability by functional annotation. In short, by combining our novel strategies for mining alcoholism-related haplotypes and genes with the existing knowledge framework, we can not only handle the SNP data easily but also locate the disease-related genes precisely.
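
    Of the block identification methods listed in this abstract, the four-gamete test has the simplest core idea, sketched below. The sketch assumes phased haplotypes with biallelic SNPs coded 0/1 and omits any frequency threshold a production tool may apply; it is an illustration of the principle, not the procedure used in the study, and the data are hypothetical.

```python
def all_four_gametes(haplotypes, i, j):
    """True if SNPs i and j show all four allele combinations (00, 01, 10, 11)."""
    observed = {(h[i], h[j]) for h in haplotypes}
    return len(observed) == 4


def extend_block(haplotypes, start):
    """Grow a block from SNP index `start` until a SNP pair shows four gametes."""
    n_snps = len(haplotypes[0])
    end = start
    while end + 1 < n_snps and not any(
        all_four_gametes(haplotypes, k, end + 1) for k in range(start, end + 1)
    ):
        end += 1
    return start, end


# Hypothetical phased haplotypes (rows) over four SNPs (columns).
haplotypes = [
    (0, 0, 0, 0),
    (0, 0, 1, 1),
    (1, 1, 1, 0),
    (1, 1, 0, 1),
]

print(all_four_gametes(haplotypes, 0, 1))  # SNPs 0 and 1 show only two gametes
print(extend_block(haplotypes, 0))         # block stops before the first violation
```

    Observing all four gametes at a SNP pair suggests historical recombination (or recurrent mutation) between the two sites, so a block is ended before that pair.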

  18. FusionHub: A unified web platform for annotation and visualization of gene fusion events in human cancer.

    PubMed

    Panigrahi, Priyabrata; Jere, Abhay; Anamika, Krishanpal

    2018-01-01

    Gene fusion is a chromosomal rearrangement event which plays a significant role in cancer due to the oncogenic potential of the chimeric protein generated through fusions. At present, many publicly available databases provide detailed information about known gene fusion events and their functional role. Existing gene fusion detection tools, based on analysis of transcriptomics data, usually report a large number of fusion genes as potential candidates, which could be known, novel, or false positives. Manual annotation of these putative genes is time-consuming. We have developed a web platform, FusionHub, which acts as an integrated search engine interfacing various fusion gene databases and simplifies large-scale annotation of fusion genes. In addition, FusionHub provides three ways of visualizing fusion events: circular view, domain architecture view and network view. Design of potential siRNA molecules through an ensemble method is another utility integrated in FusionHub that could aid in siRNA-based targeted therapy. FusionHub is freely available at https://fusionhub.persistent.co.in.

  19. EGASP: the human ENCODE Genome Annotation Assessment Project

    PubMed Central

    Guigó, Roderic; Flicek, Paul; Abril, Josep F; Reymond, Alexandre; Lagarde, Julien; Denoeud, France; Antonarakis, Stylianos; Ashburner, Michael; Bajic, Vladimir B; Birney, Ewan; Castelo, Robert; Eyras, Eduardo; Ucla, Catherine; Gingeras, Thomas R; Harrow, Jennifer; Hubbard, Tim; Lewis, Suzanna E; Reese, Martin G

    2006-01-01

    Background We present the results of EGASP, a community experiment to assess the state-of-the-art in genome annotation within the ENCODE regions, which span 1% of the human genome sequence. The experiment had two major goals: the assessment of the accuracy of computational methods to predict protein coding genes; and the overall assessment of the completeness of the current human genome annotations as represented in the ENCODE regions. For the computational prediction assessment, eighteen groups contributed gene predictions. We evaluated these submissions against each other based on a 'reference set' of annotations generated as part of the GENCODE project. These annotations were not available to the prediction groups prior to the submission deadline, so that their predictions were blind and an external advisory committee could perform a fair assessment. Results The best methods had at least one gene transcript correctly predicted for close to 70% of the annotated genes. Nevertheless, the multiple transcript accuracy, taking into account alternative splicing, reached only approximately 40% to 50% accuracy. At the coding nucleotide level, the best programs reached an accuracy of 90% in both sensitivity and specificity. Programs relying on mRNA and protein sequences were the most accurate in reproducing the manually curated annotations. Experimental validation shows that only a very small percentage (3.2%) of the selected 221 computationally predicted exons outside of the existing annotation could be verified. Conclusion This is the first such experiment in human DNA, and we have followed the standards established in a similar experiment, GASP1, in Drosophila melanogaster. We believe the results presented here contribute to the value of ongoing large-scale annotation projects and should guide further experimental methods when being scaled up to the entire human genome sequence. PMID:16925836

  20. Construction of ontology augmented networks for protein complex prediction.

    PubMed

    Zhang, Yijia; Lin, Hongfei; Yang, Zhihao; Wang, Jian

    2013-01-01

    Protein complexes are of great importance in understanding the principles of cellular organization and function. The increase in available protein-protein interaction data, gene ontology and other resources make it possible to develop computational methods for protein complex prediction. Most existing methods focus mainly on the topological structure of protein-protein interaction networks, and largely ignore the gene ontology annotation information. In this article, we constructed ontology augmented networks with protein-protein interaction data and gene ontology, which effectively unified the topological structure of protein-protein interaction networks and the similarity of gene ontology annotations into unified distance measures. After constructing ontology augmented networks, a novel method (clustering based on ontology augmented networks) was proposed to predict protein complexes, which was capable of taking into account the topological structure of the protein-protein interaction network, as well as the similarity of gene ontology annotations. Our method was applied to two different yeast protein-protein interaction datasets and predicted many well-known complexes. The experimental results showed that (i) ontology augmented networks and the unified distance measure can effectively combine the structure closeness and gene ontology annotation similarity; (ii) our method is valuable in predicting protein complexes and has higher F1 and accuracy compared to other competing methods.

  1. A Transcriptome Map of Actinobacillus pleuropneumoniae at Single-Nucleotide Resolution Using Deep RNA-Seq

    PubMed Central

    Su, Zhipeng; Zhu, Jiawen; Xu, Zhuofei; Xiao, Ran; Zhou, Rui; Li, Lu; Chen, Huanchun

    2016-01-01

    Actinobacillus pleuropneumoniae is the pathogen of porcine contagious pleuropneumonia, a highly contagious respiratory disease of swine. Although the genome of A. pleuropneumoniae was sequenced several years ago, limited information is available on the genome-wide transcriptional analysis to accurately annotate the gene structures and regulatory elements. High-throughput RNA sequencing (RNA-seq) has been applied to study the transcriptional landscape of bacteria, which can efficiently and accurately identify gene expression regions and unknown transcriptional units, especially small non-coding RNAs (sRNAs), UTRs and regulatory regions. The aim of this study is to comprehensively analyze the transcriptome of A. pleuropneumoniae by RNA-seq in order to improve the existing genome annotation and promote our understanding of A. pleuropneumoniae gene structures and RNA-based regulation. In this study, we utilized RNA-seq to construct a single nucleotide resolution transcriptome map of A. pleuropneumoniae. More than 3.8 million high-quality reads (average length ~90 bp) from a cDNA library were generated and aligned to the reference genome. We identified 32 open reading frames encoding novel proteins that were mis-annotated in the previous genome annotations. The start sites for 35 genes based on the current genome annotation were corrected. Furthermore, 51 sRNAs in the A. pleuropneumoniae genome were discovered, of which 40 sRNAs were never reported in previous studies. The transcriptome map also enabled visualization of 5'- and 3'-UTR regions, which contained 11 sRNAs. In addition, 351 operons covering 1230 genes throughout the whole genome were identified. The RNA-Seq based transcriptome map validated annotated genes and corrected annotations of open reading frames in the genome, and led to the identification of many functional elements (e.g. regions encoding novel proteins, non-coding sRNAs and operon structures). The transcriptional units described in this study provide a foundation for future studies concerning the gene functions and the transcriptional regulatory architectures of this pathogen. PMID:27018591

  2. AgBase: supporting functional modeling in agricultural organisms

    PubMed Central

    McCarthy, Fiona M.; Gresham, Cathy R.; Buza, Teresia J.; Chouvarine, Philippe; Pillai, Lakshmi R.; Kumar, Ranjit; Ozkan, Seval; Wang, Hui; Manda, Prashanti; Arick, Tony; Bridges, Susan M.; Burgess, Shane C.

    2011-01-01

    AgBase (http://www.agbase.msstate.edu/) provides resources to facilitate modeling of functional genomics data and structural and functional annotation of agriculturally important animal, plant, microbe and parasite genomes. The website is redesigned to improve accessibility and ease of use, including improved search capabilities. Expanded capabilities include new dedicated pages for horse, cat, dog, cotton, rice and soybean. We currently provide 590 240 Gene Ontology (GO) annotations to 105 454 gene products in 64 different species, including GO annotations linked to transcripts represented on agricultural microarrays. For many of these arrays, this provides the only functional annotation available. GO annotations are available for download and we provide comprehensive, species-specific GO annotation files for 18 different organisms. The tools available at AgBase have been expanded and several existing tools improved based upon user feedback. One of seven new tools available at AgBase, GOModeler, supports hypothesis testing from functional genomics data. We host several associated databases and provide genome browsers for three agricultural pathogens. Moreover, we provide comprehensive training resources (including worked examples and tutorials) via links to Educational Resources at the AgBase website. PMID:21075795

  3. Semantic similarity analysis of protein data: assessment with biological features and issues.

    PubMed

    Guzzi, Pietro H; Mina, Marco; Guerra, Concettina; Cannataro, Mario

    2012-09-01

    The integration of proteomics data with biological knowledge is a recent trend in bioinformatics. A lot of biological information is available, spread across different sources and encoded in different ontologies (e.g. Gene Ontology). Annotating existing protein data with biological information may enable the use (and the development) of algorithms that use biological ontologies as a framework to mine annotated data. Recently, many methodologies and algorithms that use ontologies to extract knowledge from data, as well as to analyse ontologies themselves, have been proposed and applied to other fields. Conversely, the use of such annotations for the analysis of protein data is a relatively novel research area that is becoming more and more central in research. Existing approaches range from the definition of the similarity among genes and proteins on the basis of the annotating terms, to the definition of novel algorithms that use such similarities for mining protein data on a proteome-wide scale. This work, after defining the main concepts of such analysis, presents a systematic discussion and comparison of the main approaches. Finally, remaining challenges, as well as possible future directions of research, are presented.

  4. Improving the measurement of semantic similarity by combining gene ontology and co-functional network: a random walk based approach.

    PubMed

    Peng, Jiajie; Zhang, Xuanshuo; Hui, Weiwei; Lu, Junya; Li, Qianqian; Liu, Shuhui; Shang, Xuequn

    2018-03-19

    Gene Ontology (GO) is one of the most popular bioinformatics resources. In the past decade, Gene Ontology-based gene semantic similarity has been effectively used to model gene-to-gene interactions in multiple research areas. However, most existing semantic similarity approaches rely only on GO annotations and structure, or incorporate only local interactions in the co-functional network. This may lead to inaccurate GO-based similarity resulting from the incomplete GO topology structure and gene annotations. We present NETSIM2, a new network-based method that allows researchers to measure GO-based gene functional similarities by considering the global structure of the co-functional network with a random walk with restart (RWR)-based method, and by selecting the significant term pairs to decrease noise. Evaluation on Enzyme Commission (EC) number-based groups of yeast and Arabidopsis shows that NETSIM2 can enhance the accuracy of Gene Ontology-based gene functional similarity. Using NETSIM2 as an example, we found that the accuracy of semantic similarities can be significantly improved after effectively incorporating the global gene-to-gene interactions in the co-functional network, especially for species whose gene annotations in GO are far from complete.
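
    The random walk with restart mentioned above can be sketched generically as follows. This is not the NETSIM2 implementation: the graph representation, the function name, the restart probability of 0.5 and the toy gene names are all assumptions chosen for illustration.

```python
def random_walk_with_restart(adj, seed, restart=0.5, tol=1e-10, max_iter=1000):
    """Steady-state visiting probabilities of a walker that restarts at `seed`.

    adj:     dict mapping node -> dict of neighbour -> edge weight; every
             neighbour must also appear as a key (assumed input format).
    restart: probability of jumping back to the seed at each step (assumed 0.5).
    """
    nodes = sorted(adj)
    out_weight = {u: sum(adj[u].values()) for u in nodes}
    p0 = {u: 1.0 if u == seed else 0.0 for u in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {u: restart * p0[u] for u in nodes}
        for u in nodes:
            if out_weight[u] == 0:
                # Isolated node: send its probability mass back to the seed.
                nxt[seed] += (1 - restart) * p[u]
                continue
            for v, w in adj[u].items():
                # Move from u to v in proportion to the edge weight.
                nxt[v] += (1 - restart) * p[u] * w / out_weight[u]
        delta = sum(abs(nxt[u] - p[u]) for u in nodes)
        p = nxt
        if delta < tol:
            break
    return p


# Hypothetical co-functional network: a simple chain g1 - g2 - g3 - g4.
network = {
    "g1": {"g2": 1.0},
    "g2": {"g1": 1.0, "g3": 1.0},
    "g3": {"g2": 1.0, "g4": 1.0},
    "g4": {"g3": 1.0},
}
scores = random_walk_with_restart(network, seed="g1")
```

    Because the walker keeps returning to the seed, the resulting scores reflect proximity over the whole network rather than only direct neighbours, which is the property the abstract refers to as the global structure.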

  5. Ontology design patterns to disambiguate relations between genes and gene products in GENIA

    PubMed Central

    2011-01-01

    Motivation Annotated reference corpora play an important role in biomedical information extraction. A semantic annotation of the natural language texts in these reference corpora using formal ontologies is challenging due to the inherent ambiguity of natural language. The provision of formal definitions and axioms for semantic annotations offers the means for ensuring consistency as well as enables the development of verifiable annotation guidelines. Consistent semantic annotations facilitate the automatic discovery of new information through deductive inferences. Results We provide a formal characterization of the relations used in the recent GENIA corpus annotations. For this purpose, we both select existing axiom systems based on the desired properties of the relations within the domain and develop new axioms for several relations. To apply this ontology of relations to the semantic annotation of text corpora, we implement two ontology design patterns. In addition, we provide a software application to convert annotated GENIA abstracts into OWL ontologies by combining both the ontology of relations and the design patterns. As a result, the GENIA abstracts become available as OWL ontologies and are amenable for automated verification, deductive inferences and other knowledge-based applications. Availability Documentation, implementation and examples are available from http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/. PMID:22166341

  6. MetaStorm: A Public Resource for Customizable Metagenomics Annotation

    PubMed Central

    Arango-Argoty, Gustavo; Singh, Gargi; Heath, Lenwood S.; Pruden, Amy; Xiao, Weidong; Zhang, Liqing

    2016-01-01

    Metagenomics is a trending research area, calling for the need to analyze large quantities of data generated from next generation DNA sequencing technologies. The need to store, retrieve, analyze, share, and visualize such data challenges current online computational systems. Interpretation and annotation of specific information is especially a challenge for metagenomic data sets derived from environmental samples, because current annotation systems only offer broad classification of microbial diversity and function. Moreover, existing resources are not configured to readily address common questions relevant to environmental systems. Here we developed a new online user-friendly metagenomic analysis server called MetaStorm (http://bench.cs.vt.edu/MetaStorm/), which facilitates customization of computational analysis for metagenomic data sets. Users can upload their own reference databases to tailor the metagenomics annotation to focus on various taxonomic and functional gene markers of interest. MetaStorm offers two major analysis pipelines: an assembly-based annotation pipeline and the standard read annotation pipeline used by existing web servers. These pipelines can be selected individually or together. Overall, MetaStorm provides enhanced interactive visualization to allow researchers to explore and manipulate taxonomy and functional annotation at various levels of resolution. PMID:27632579

  7. MetaStorm: A Public Resource for Customizable Metagenomics Annotation.

    PubMed

    Arango-Argoty, Gustavo; Singh, Gargi; Heath, Lenwood S; Pruden, Amy; Xiao, Weidong; Zhang, Liqing

    2016-01-01

    Metagenomics is a trending research area, calling for the need to analyze large quantities of data generated from next generation DNA sequencing technologies. The need to store, retrieve, analyze, share, and visualize such data challenges current online computational systems. Interpretation and annotation of specific information is especially a challenge for metagenomic data sets derived from environmental samples, because current annotation systems only offer broad classification of microbial diversity and function. Moreover, existing resources are not configured to readily address common questions relevant to environmental systems. Here we developed a new online user-friendly metagenomic analysis server called MetaStorm (http://bench.cs.vt.edu/MetaStorm/), which facilitates customization of computational analysis for metagenomic data sets. Users can upload their own reference databases to tailor the metagenomics annotation to focus on various taxonomic and functional gene markers of interest. MetaStorm offers two major analysis pipelines: an assembly-based annotation pipeline and the standard read annotation pipeline used by existing web servers. These pipelines can be selected individually or together. Overall, MetaStorm provides enhanced interactive visualization to allow researchers to explore and manipulate taxonomy and functional annotation at various levels of resolution.

  8. PubSearch and PubFetch: a simple management system for semiautomated retrieval and annotation of biological information from the literature.

    PubMed

    Yoo, Danny; Xu, Iris; Berardini, Tanya Z; Rhee, Seung Yon; Narayanasamy, Vijay; Twigger, Simon

    2006-03-01

    For most systems in biology, a large body of literature exists that describes the complexity of the system based on experimental results. Manual review of this literature to extract targeted information into biological databases is difficult and time consuming. To address this problem, we developed PubSearch and PubFetch, which store literature, keyword, and gene information in a relational database, index the literature with keywords and gene names, and provide a Web user interface for annotating the genes from experimental data found in the associated literature. A set of protocols is provided in this unit for installing, populating, running, and using PubSearch and PubFetch. In addition, we provide support protocols for performing controlled vocabulary annotations. Intended users of PubSearch and PubFetch are database curators and biology researchers interested in tracking the literature and capturing information about genes of interest in a more effective way than with conventional spreadsheets and lab notebooks.

  9. The coffee genome hub: a resource for coffee genomes

    PubMed Central

    Dereeper, Alexis; Bocs, Stéphanie; Rouard, Mathieu; Guignon, Valentin; Ravel, Sébastien; Tranchant-Dubreuil, Christine; Poncet, Valérie; Garsmeur, Olivier; Lashermes, Philippe; Droc, Gaëtan

    2015-01-01

    The whole genome sequence of Coffea canephora, the perennial diploid species known as Robusta, has been recently released. In the context of the C. canephora genome sequencing project and to support post-genomics efforts, we developed the Coffee Genome Hub (http://coffee-genome.org/), an integrative genome information system that allows centralized access to genomics and genetics data and analysis tools to facilitate translational and applied research in coffee. We provide the complete genome sequence of C. canephora along with gene structure, gene product information, metabolism, gene families, transcriptomics, syntenic blocks, genetic markers and genetic maps. The hub relies on generic software (e.g. GMOD tools) for easy querying, visualizing and downloading research data. It includes a Genome Browser enhanced by a Community Annotation System, enabling the improvement of automatic gene annotation through an annotation editor. In addition, the hub aims at developing interoperability among other existing South Green tools managing coffee data (phylogenomics resources, SNPs) and/or supporting data analyses with the Galaxy workflow manager. PMID:25392413

  10. Incidence of genome structure, DNA asymmetry, and cell physiology on T-DNA integration in chromosomes of the phytopathogenic fungus Leptosphaeria maculans.

    PubMed

    Bourras, Salim; Meyer, Michel; Grandaubert, Jonathan; Lapalu, Nicolas; Fudal, Isabelle; Linglin, Juliette; Ollivier, Benedicte; Blaise, Françoise; Balesdent, Marie-Hélène; Rouxel, Thierry

    2012-08-01

    The ever-increasing generation of sequence data is accompanied by unsatisfactory functional annotation, and complex genomes, such as those of plants and filamentous fungi, show a large number of genes with no predicted or known function. For functional annotation of unknown or hypothetical genes, the production of collections of mutants using Agrobacterium tumefaciens-mediated transformation (ATMT) associated with genotyping and phenotyping has gained wide acceptance. ATMT is also widely used to identify pathogenicity determinants in pathogenic fungi. A systematic analysis of T-DNA borders was performed in an ATMT-mutagenized collection of the phytopathogenic fungus Leptosphaeria maculans to evaluate the features of T-DNA integration in its particular transposable element-rich compartmentalized genome. A total of 318 T-DNA tags were recovered and analyzed for biases in chromosome and genic compartments, existence of CG/AT skews at the insertion site, and occurrence of microhomologies between the T-DNA left border (LB) and the target sequence. Functional annotation of targeted genes was done using the Gene Ontology annotation. The T-DNA integration mainly targeted gene-rich, transcriptionally active regions, and it favored biological processes consistent with the physiological status of a germinating spore. T-DNA integration was strongly biased toward regulatory regions, and mainly promoters. Consistent with the T-DNA intranuclear-targeting model, the density of T-DNA insertion correlated with CG skew near the transcription initiation site. The existence of microhomologies between promoter sequences and the T-DNA LB flanking sequence was also consistent with T-DNA integration to host DNA mediated by homologous recombination based on the microhomology-mediated end-joining pathway.

  11. Extending bicluster analysis to annotate unclassified ORFs and predict novel functional modules using expression data

    PubMed Central

    Bryan, Kenneth; Cunningham, Pádraig

    2008-01-01

    Background Microarrays have the capacity to measure the expressions of thousands of genes in parallel over many experimental samples. The unsupervised classification technique of bicluster analysis has been employed previously to uncover gene expression correlations over subsets of samples with the aim of providing a more accurate model of the natural gene functional classes. This approach also has the potential to aid functional annotation of unclassified open reading frames (ORFs). Until now this aspect of biclustering has been under-explored. In this work we illustrate how bicluster analysis may be extended into a 'semi-supervised' ORF annotation approach referred to as BALBOA. Results The efficacy of the BALBOA ORF classification technique is first assessed via cross validation and compared to a multi-class k-Nearest Neighbour (kNN) benchmark across three independent gene expression datasets. BALBOA is then used to assign putative functional annotations to unclassified yeast ORFs. These predictions are evaluated using existing experimental and protein sequence information. Lastly, we employ a related semi-supervised method to predict the presence of novel functional modules within yeast. Conclusion In this paper we demonstrate how unsupervised classification methods, such as bicluster analysis, may be extended using available annotations to form semi-supervised approaches within the gene expression analysis domain. We show that such methods have the potential to improve upon supervised approaches and shed new light on the functions of unclassified ORFs and their co-regulation. PMID:18831786

  12. GARNET--gene set analysis with exploration of annotation relations.

    PubMed

    Rho, Kyoohyoung; Kim, Bumjin; Jang, Youngjun; Lee, Sanghyun; Bae, Taejeong; Seo, Jihae; Seo, Chaehwa; Lee, Jihyun; Kang, Hyunjung; Yu, Ungsik; Kim, Sunghoon; Lee, Sanghyuk; Kim, Wan Kyu

    2011-02-15

    Gene set analysis is a powerful method of deducing biological meaning for an a priori defined set of genes. Numerous tools have been developed to test statistical enrichment or depletion in specific pathways or gene ontology (GO) terms. Major difficulties towards biological interpretation are integrating diverse types of annotation categories and exploring the relationships between annotation terms of similar information. GARNET (Gene Annotation Relationship NEtwork Tools) is an integrative platform for gene set analysis with many novel features. It includes tools for retrieval of genes from annotation database, statistical analysis & visualization of annotation relationships, and managing gene sets. In an effort to allow access to a full spectrum of amassed biological knowledge, we have integrated a variety of annotation data that include the GO, domain, disease, drug, chromosomal location, and custom-defined annotations. Diverse types of molecular networks (pathways, transcription and microRNA regulations, protein-protein interaction) are also included. The pair-wise relationship between annotation gene sets was calculated using kappa statistics. GARNET consists of three modules--gene set manager, gene set analysis and gene set retrieval, which are tightly integrated to provide virtually automatic analysis for gene sets. A dedicated viewer for annotation network has been developed to facilitate exploration of the related annotations. GARNET (gene annotation relationship network tools) is an integrative platform for diverse types of gene set analysis, where complex relationships among gene annotations can be easily explored with an intuitive network visualization tool (http://garnet.isysbio.org/ or http://ercsb.ewha.ac.kr/garnet/).
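
    The kappa statistic used for pair-wise relationships between annotation gene sets can be sketched as Cohen's kappa over a shared gene universe. The function name, the gene identifiers and the handling of the degenerate case are assumptions for illustration; GARNET's exact procedure is not described in the abstract.

```python
def gene_set_kappa(set_a, set_b, universe):
    """Cohen's kappa for the agreement of two gene sets over a gene universe.

    Each gene is treated as a binary observation (member / non-member) for
    both sets, which gives a 2x2 contingency table.
    """
    universe = set(universe)
    n = len(universe)
    if n == 0:
        raise ValueError("universe must contain at least one gene")
    a_in = set(set_a) & universe
    b_in = set(set_b) & universe
    both = len(a_in & b_in)
    only_a = len(a_in - b_in)
    only_b = len(b_in - a_in)
    neither = n - both - only_a - only_b
    observed = (both + neither) / n
    # Agreement expected by chance from the marginal totals.
    expected = ((both + only_a) * (both + only_b)
                + (neither + only_b) * (neither + only_a)) / (n * n)
    if expected == 1.0:
        # Kappa is undefined here; this sketch returns 0.0 (assumed convention).
        return 0.0
    return (observed - expected) / (1 - expected)


# Hypothetical universe of ten genes and two overlapping annotation sets.
genes = [f"g{i}" for i in range(10)]
kappa = gene_set_kappa({"g0", "g1", "g2", "g3"}, {"g0", "g1", "g2", "g4"}, genes)
```

    Values near 1 indicate that two annotation terms label almost the same genes, values near 0 indicate overlap no better than chance, and negative values indicate less overlap than chance would give.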

  13. PANDA: pathway and annotation explorer for visualizing and interpreting gene-centric data.

    PubMed

    Hart, Steven N; Moore, Raymond M; Zimmermann, Michael T; Oliver, Gavin R; Egan, Jan B; Bryce, Alan H; Kocher, Jean-Pierre A

    2015-01-01

    Objective. Bringing together genomics, transcriptomics, proteomics, and other -omics technologies is an important step towards developing highly personalized medicine. However, instrumentation has advanced far beyond expectations and now we are able to generate data faster than it can be interpreted. Materials and Methods. We have developed PANDA (Pathway AND Annotation) Explorer, a visualization tool that integrates gene-level annotation in the context of biological pathways to help interpret complex data from disparate sources. PANDA is a web-based application that displays data in the context of well-studied pathways like KEGG, BioCarta, and PharmGKB. PANDA represents data/annotations as icons in the graph while maintaining the other data elements (i.e., other columns for the table of annotations). Custom pathways from underrepresented diseases can be imported when existing data sources are inadequate. PANDA also allows sharing annotations among collaborators. Results. In our first use case, we show how easy it is to view supplemental data from a manuscript in the context of a user's own data. Another use case is provided describing how PANDA was leveraged to design a treatment strategy from the somatic variants found in the tumor of a patient with metastatic sarcomatoid renal cell carcinoma. Conclusion. PANDA facilitates the interpretation of gene-centric annotations by visually integrating this information with the context of biological pathways. The application can be downloaded or used directly from our website: http://bioinformaticstools.mayo.edu/research/panda-viewer/.

  14. Sma3s: a three-step modular annotator for large sequence datasets.

    PubMed

    Muñoz-Mérida, Antonio; Viguera, Enrique; Claros, M Gonzalo; Trelles, Oswaldo; Pérez-Pulido, Antonio J

    2014-08-01

    Automatic sequence annotation is an essential component of modern 'omics' studies, which aim to extract information from large collections of sequence data. Most existing tools use sequence homology to establish evolutionary relationships and assign putative functions to sequences. However, it can be difficult to define a similarity threshold that achieves sufficient coverage without sacrificing annotation quality. Defining the correct configuration is critical and can be challenging for non-specialist users. Thus, the development of robust automatic annotation techniques that generate high-quality annotations without needing expert knowledge would be very valuable for the research community. We present Sma3s, a tool for automatically annotating very large collections of biological sequences from any kind of gene library or genome. Sma3s is composed of three modules that progressively annotate query sequences using either: (i) very similar homologues, (ii) orthologous sequences or (iii) terms enriched in groups of homologous sequences. We trained the system using several random sets of known sequences, demonstrating average sensitivity and specificity values of ~85%. In conclusion, Sma3s is a versatile tool for high-throughput annotation of a wide variety of sequence datasets that outperforms the accuracy of other well-established annotation algorithms, and it can enrich existing database annotations and uncover previously hidden features. Importantly, Sma3s has already been used in the functional annotation of two published transcriptomes. © The Author 2014. Published by Oxford University Press on behalf of Kazusa DNA Research Institute.

  15. RGmatch: matching genomic regions to proximal genes in omics data integration.

    PubMed

    Furió-Tarí, Pedro; Conesa, Ana; Tarazona, Sonia

    2016-11-22

    The integrative analysis of multiple genomics data often requires that genome coordinates-based signals have to be associated with proximal genes. The relative location of a genomic region with respect to the gene (gene area) is important for functional data interpretation; hence algorithms that match regions to genes should be able to deliver insight into this information. In this work we review the tools that are publicly available for making region-to-gene associations. We also present a novel method, RGmatch, a flexible and easy-to-use Python tool that computes associations either at the gene, transcript, or exon level, applying a set of rules to annotate each region-gene association with the region location within the gene. RGmatch can be applied to any organism as long as genome annotation is available. Furthermore, we qualitatively and quantitatively compare RGmatch to other tools. RGmatch simplifies the association of a genomic region with its closest gene. At the same time, it is a powerful tool because the rules used to annotate these associations are very easy to modify according to the researcher's specific interests. Some important differences between RGmatch and other similar tools already in existence are RGmatch's flexibility, its wide range of user options, compatibility with any annotatable organism, and its comprehensive and user-friendly output.
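
    The idea of associating a genomic region with its closest gene and labelling where the region falls relative to that gene can be sketched as below. This is not RGmatch's rule set or interface: the `Gene` type, the `annotate_region` function, the use of the region midpoint and the three area labels are simplifying assumptions, and all coordinates are assumed to lie on a single chromosome.

```python
from typing import NamedTuple, Optional, Sequence, Tuple


class Gene(NamedTuple):
    name: str
    start: int   # 1-based, inclusive
    end: int     # inclusive
    strand: str  # "+" or "-"


def annotate_region(region_start: int, region_end: int,
                    genes: Sequence[Gene]) -> Optional[Tuple[str, int, str]]:
    """Return (gene name, distance, area) for the gene closest to the region midpoint."""
    mid = (region_start + region_end) // 2
    best = None
    for g in genes:
        if g.start <= mid <= g.end:
            dist, area = 0, "gene_body"
        elif mid < g.start:
            dist = g.start - mid
            # Left of the gene is upstream only for forward-strand genes.
            area = "upstream" if g.strand == "+" else "downstream"
        else:
            dist = mid - g.end
            area = "downstream" if g.strand == "+" else "upstream"
        if best is None or dist < best[1]:
            best = (g.name, dist, area)
    return best


# Hypothetical annotation with one forward-strand and one reverse-strand gene.
annotation = [Gene("geneA", 100, 200, "+"), Gene("geneB", 500, 600, "-")]
hit = annotate_region(640, 660, annotation)
```

    Taking the strand into account matters here: the region at 640 to 660 lies to the right of geneB, but because geneB is on the reverse strand it is reported as upstream of that gene.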

  16. Representing virus-host interactions and other multi-organism processes in the Gene Ontology.

    PubMed

    Foulger, R E; Osumi-Sutherland, D; McIntosh, B K; Hulo, C; Masson, P; Poux, S; Le Mercier, P; Lomax, J

    2015-07-28

    The Gene Ontology project is a collaborative effort to provide descriptions of gene products in a consistent and computable language, and in a species-independent manner. The Gene Ontology is designed to be applicable to all organisms but up to now has been largely under-utilized for prokaryotes and viruses, in part because of a lack of appropriate ontology terms. To address this issue, we have developed a set of Gene Ontology classes that are applicable to microbes and their hosts, improving both coverage and quality in this area of the Gene Ontology. Describing microbial and viral gene products brings with it the additional challenge of capturing both the host and the microbe. Recognising this, we have worked closely with annotation groups to test and optimize the GO classes, and we describe here a set of annotation guidelines that allow the controlled description of two interacting organisms. Building on the microbial resources already in existence such as ViralZone, UniProtKB keywords and MeGO, this project provides an integrated ontology to describe interactions between microbial species and their hosts, with mappings to the external resources above. Housing this information within the freely-accessible Gene Ontology project allows the classes and annotation structure to be utilized by a large community of biologists and users.

  17. ezTag: tagging biomedical concepts via interactive learning.

    PubMed

    Kwon, Dongseop; Kim, Sun; Wei, Chih-Hsuan; Leaman, Robert; Lu, Zhiyong

    2018-05-18

    Recently, advanced text-mining techniques have been shown to speed up manual data curation by providing human annotators with automated pre-annotations generated by rules or machine learning models. Due to the limited training data available, however, current annotation systems primarily focus only on common concept types such as genes or diseases. To support annotating a wide variety of biological concepts with or without pre-existing training data, we developed ezTag, a web-based annotation tool that allows curators to perform annotation and provide training data with humans in the loop. ezTag supports both abstracts in PubMed and full-text articles in PubMed Central. It also provides lexicon-based concept tagging as well as the state-of-the-art pre-trained taggers such as TaggerOne, GNormPlus and tmVar. ezTag is freely available at http://eztag.bioqrator.org.

  18. GeneFarm, structural and functional annotation of Arabidopsis gene and protein families by a network of experts

    PubMed Central

    Aubourg, Sébastien; Brunaud, Véronique; Bruyère, Clémence; Cock, Mark; Cooke, Richard; Cottet, Annick; Couloux, Arnaud; Déhais, Patrice; Deléage, Gilbert; Duclert, Aymeric; Echeverria, Manuel; Eschbach, Aimée; Falconet, Denis; Filippi, Ghislain; Gaspin, Christine; Geourjon, Christophe; Grienenberger, Jean-Michel; Houlné, Guy; Jamet, Elisabeth; Lechauve, Frédéric; Leleu, Olivier; Leroy, Philippe; Mache, Régis; Meyer, Christian; Nedjari, Hafed; Negrutiu, Ioan; Orsini, Valérie; Peyretaillade, Eric; Pommier, Cyril; Raes, Jeroen; Risler, Jean-Loup; Rivière, Stéphane; Rombauts, Stéphane; Rouzé, Pierre; Schneider, Michel; Schwob, Philippe; Small, Ian; Soumayet-Kampetenga, Ghislain; Stankovski, Darko; Toffano, Claire; Tognolli, Michael; Caboche, Michel; Lecharny, Alain

    2005-01-01

    Genomic projects heavily depend on genome annotations and are limited by the current deficiencies in the published predictions of gene structure and function. It follows that improved annotation will allow better data mining of genomes, and more secure planning and design of experiments. The purpose of the GeneFarm project is to obtain homogeneous, reliable, documented and traceable annotations for Arabidopsis nuclear genes and gene products, and to enter them into an added-value database. This re-annotation project is being performed exhaustively on every member of each gene family. Performing a family-wide annotation makes the task easier and more efficient than a gene-by-gene approach since many features obtained for one gene can be extrapolated to some or all the other genes of a family. A complete annotation procedure based on the most efficient prediction tools available is being used by 16 partner laboratories, each contributing annotated families from its field of expertise. A database, named GeneFarm, and an associated user-friendly interface to query the annotations have been developed. More than 3000 genes distributed over 300 families have been annotated and are available at http://genoplante-info.infobiogen.fr/Genefarm/. Furthermore, collaboration with the Swiss Institute of Bioinformatics is underway to integrate the GeneFarm data into the protein knowledgebase Swiss-Prot. PMID:15608279

  19. Genome-wide Annotation, Identification, and Global Transcriptomic Analysis of Regulatory or Small RNA Gene Expression in Staphylococcus aureus.

    PubMed

    Carroll, Ronan K; Weiss, Andy; Broach, William H; Wiemels, Richard E; Mogen, Austin B; Rice, Kelly C; Shaw, Lindsey N

    2016-02-09

    In Staphylococcus aureus, hundreds of small regulatory or small RNAs (sRNAs) have been identified, yet this class of molecule remains poorly understood and severely understudied. sRNA genes are typically absent from genome annotation files, and as a consequence, their existence is often overlooked, particularly in global transcriptomic studies. To facilitate improved detection and analysis of sRNAs in S. aureus, we generated updated GenBank files for three commonly used S. aureus strains (MRSA252, NCTC 8325, and USA300), in which we added annotations for >260 previously identified sRNAs. These files, the first to include genome-wide annotation of sRNAs in S. aureus, were then used as a foundation to identify novel sRNAs in the community-associated methicillin-resistant strain USA300. This analysis led to the discovery of 39 previously unidentified sRNAs. Investigating the genomic loci of the newly identified sRNAs revealed a surprising degree of inconsistency in genome annotation in S. aureus, which may be hindering the analysis and functional exploration of these elements. Finally, using our newly created annotation files as a reference, we perform a global analysis of sRNA gene expression in S. aureus and demonstrate that the newly identified tsr25 is the most highly upregulated sRNA in human serum. This study provides an invaluable resource to the S. aureus research community in the form of our newly generated annotation files, while at the same time presenting the first examination of differential sRNA expression in pathophysiologically relevant conditions. Despite a large number of studies identifying regulatory or small RNA (sRNA) genes in Staphylococcus aureus, their annotation is notably lacking in available genome files. In addition to this, there has been a considerable lack of cross-referencing in the wealth of studies identifying these elements, often leading to the same sRNA being identified multiple times and bearing multiple names. 
In this work, we have consolidated and curated known sRNA genes from the literature and mapped them to their position on the S. aureus genome, creating new genome annotation files. These files can now be used by the scientific community at large in experiments to search for previously undiscovered sRNA genes and to monitor sRNA gene expression by transcriptome sequencing (RNA-seq). We demonstrate this application, identifying 39 new sRNAs and studying their expression during S. aureus growth in human serum. Copyright © 2016 Carroll et al.

  20. NoGOA: predicting noisy GO annotations using evidences and sparse representation.

    PubMed

    Yu, Guoxian; Lu, Chang; Wang, Jun

    2017-07-21

    Gene Ontology (GO) is a community effort to represent functional features of gene products. GO annotations (GOA) provide functional associations between GO terms and gene products. Due to resource limitations, only a small portion of annotations are manually checked by curators, and the others are electronically inferred. Although quality control techniques have been applied to ensure the quality of annotations, the community consistently reports that there are still considerable noisy (or incorrect) annotations. Given the wide application of annotations, however, how to identify noisy annotations is an important but seldom studied open problem. We introduce a novel approach called NoGOA to predict noisy annotations. First, NoGOA applies sparse representation on the gene-term association matrix to reduce the impact of noisy annotations, and takes advantage of sparse representation coefficients to measure the semantic similarity between genes. Secondly, it preliminarily predicts noisy annotations of a gene based on aggregated votes from semantic neighborhood genes of that gene. Next, NoGOA estimates the ratio of noisy annotations for each evidence code based on direct annotations in GOA files archived on different periods, and then weights entries of the association matrix via estimated ratios and propagates weights to ancestors of direct annotations using GO hierarchy. Finally, it integrates evidence-weighted association matrix and aggregated votes to predict noisy annotations. Experiments on archived GOA files of six model species (H. sapiens, A. thaliana, S. cerevisiae, G. gallus, B. taurus and M. musculus) demonstrate that NoGOA achieves significantly better results than other related methods and removing noisy annotations improves the performance of gene function prediction. The comparative study justifies the effectiveness of integrating evidence codes with sparse representation for predicting noisy GO annotations. Codes and datasets are available at http://mlda.swu.edu.cn/codes.php?name=NoGOA.
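
    The step in which weights are propagated "to ancestors of direct annotations using GO hierarchy" can be illustrated with a small sketch. This is not the NoGOA code: the abstract does not say how weights are combined when several descendants reach the same ancestor, so the "keep the largest weight" rule below is an assumption, and the term names and weights are hypothetical toy values.

```python
def propagate_to_ancestors(direct_weights, parents):
    """Push each directly annotated term's weight up a GO-like DAG.

    direct_weights: dict mapping term -> weight of the direct annotation.
    parents: dict mapping term -> list of parent terms.
    Assumption (not stated in the abstract): an ancestor keeps the
    largest weight that reaches it from any descendant.
    """
    propagated = dict(direct_weights)
    stack = list(direct_weights.items())
    while stack:
        term, weight = stack.pop()
        for parent in parents.get(term, ()):
            # Only revisit a parent when this path improves its weight.
            if weight > propagated.get(parent, 0.0):
                propagated[parent] = weight
                stack.append((parent, weight))
    return propagated


# Hypothetical toy hierarchy: term_c -> term_b -> term_a, term_d -> term_a.
TOY_PARENTS = {
    "term_c": ["term_b"],
    "term_b": ["term_a"],
    "term_d": ["term_a"],
}
# Hypothetical evidence-derived weights for two direct annotations.
TOY_DIRECT = {"term_c": 0.9, "term_d": 0.4}

if __name__ == "__main__":
    for term, weight in sorted(propagate_to_ancestors(TOY_DIRECT, TOY_PARENTS).items()):
        print(term, weight)
```

    With the toy inputs, both ancestors of term_c inherit its weight, and the shared root keeps the larger of the two weights that reach it.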

  1. Proteins of Unknown Biochemical Function: A Persistent Problem and a Roadmap to Help Overcome It.

    PubMed

    Niehaus, Thomas D; Thamm, Antje M K; de Crécy-Lagard, Valérie; Hanson, Andrew D

    2015-11-01

    The number of sequenced genomes is rapidly increasing, but functional annotation of the genes in these genomes lags far behind. Even in Arabidopsis (Arabidopsis thaliana), only approximately 40% of enzyme- and transporter-encoding genes have credible functional annotations, and this number is even lower in nonmodel plants. Functional characterization of unknown genes is a challenge, but various databases (e.g. for protein localization and coexpression) can be mined to provide clues. If homologous microbial genes exist-and about one-half the genes encoding unknown enzymes and transporters in Arabidopsis have microbial homologs-cross-kingdom comparative genomics can powerfully complement plant-based data. Multiple lines of evidence can strengthen predictions and warrant experimental characterization. In some cases, relatively quick tests in genetically tractable microbes can determine whether a prediction merits biochemical validation, which is costly and demands specialized skills. © 2015 American Society of Plant Biologists. All Rights Reserved.

  2. Detecting long tandem duplications in genomic sequences.

    PubMed

    Audemard, Eric; Schiex, Thomas; Faraut, Thomas

    2012-05-08

    Detecting duplication segments within completely sequenced genomes provides valuable information to address genome evolution and in particular the important question of the emergence of novel functions. The usual approach to gene duplication detection, based on all-pairs protein gene comparisons, provides only a restricted view of duplication. In this paper, we introduce ReD Tandem, software that uses a flow-based chaining algorithm targeted at detecting tandem duplication arrays of moderate to longer length regions, with possibly locally weak similarities, directly at the DNA level. On the A. thaliana genome, using a reference set of tandem duplicated genes built using TAIR, we show that ReD Tandem is able to predict a large fraction of recently duplicated genes (dS < 1) and that it is also able to predict tandem duplications involving non-coding elements such as pseudogenes or RNA genes. ReD Tandem makes it possible to identify large tandem duplications without any annotation, leading to agnostic identification of tandem duplications. This approach nicely complements the usual protein-gene-based approach, which ignores duplications involving non-coding regions. It is however inherently restricted to relatively recent duplications. By recovering otherwise ignored events, ReD Tandem gives a more comprehensive view of existing evolutionary processes and may also help improve existing annotations.

  3. Large-scale inference of gene function through phylogenetic annotation of Gene Ontology terms: case study of the apoptosis and autophagy cellular processes.

    PubMed

    Feuermann, Marc; Gaudet, Pascale; Mi, Huaiyu; Lewis, Suzanna E; Thomas, Paul D

    2016-01-01

    We previously reported a paradigm for large-scale phylogenomic analysis of gene families that takes advantage of the large corpus of experimentally supported Gene Ontology (GO) annotations. This 'GO Phylogenetic Annotation' approach integrates GO annotations from evolutionarily related genes across ∼100 different organisms in the context of a gene family tree, in which curators build an explicit model of the evolution of gene functions. GO Phylogenetic Annotation models the gain and loss of functions in a gene family tree, which is used to infer the functions of uncharacterized (or incompletely characterized) gene products, even for human proteins that are relatively well studied. Here, we report our results from applying this paradigm to two well-characterized cellular processes, apoptosis and autophagy. This revealed several important observations with respect to GO annotations and how they can be used for function inference. Notably, we applied only a small fraction of the experimentally supported GO annotations to infer function in other family members. The majority of other annotations describe indirect effects, phenotypes or results from high throughput experiments. In addition, we show here how feedback from phylogenetic annotation leads to significant improvements in the PANTHER trees, the GO annotations and GO itself. Thus GO phylogenetic annotation both increases the quantity and improves the accuracy of the GO annotations provided to the research community. We expect these phylogenetically based annotations to be of broad use in gene enrichment analysis as well as other applications of GO annotations. Database URL: http://amigo.geneontology.org/amigo. © The Author(s) 2016. Published by Oxford University Press.

  4. Supporting community annotation and user collaboration in the integrated microbial genomes (IMG) system

    DOE PAGES

    Chen, I-Min A.; Markowitz, Victor M.; Palaniappan, Krishna; ...

    2016-04-26

    Background: The exponential growth of genomic data from next generation technologies renders traditional manual expert curation effort unsustainable. Many genomic systems have included community annotation tools to address the problem. Most of these systems adopted a "Wiki-based" approach to take advantage of existing wiki technologies, but encountered obstacles in issues such as usability, authorship recognition, information reliability and incentive for community participation. Results: Here, we present a different approach, relying on a tightly integrated method rather than a "Wiki-based" method, to support community annotation and user collaboration in the Integrated Microbial Genomes (IMG) system. The IMG approach allows users to use the existing IMG data warehouse and analysis tools to add gene, pathway and biosynthetic cluster annotations, to analyze/reorganize contigs, genes and functions using workspace datasets, and to share private user annotations and workspace datasets with collaborators. We show that the annotation effort using IMG can be part of the research process to overcome the user incentive and authorship recognition problems, thus fostering collaboration among domain experts. The usability and reliability issues are addressed by the integration of curated information and analysis tools in IMG, together with DOE Joint Genome Institute (JGI) expert review. Conclusion: By incorporating annotation operations into IMG, we provide an integrated environment for users to perform deeper and extended data analysis and annotation in a single system that can lead to publications and community knowledge sharing as shown in the case studies.

  5. Supporting community annotation and user collaboration in the integrated microbial genomes (IMG) system

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Chen, I-Min A.; Markowitz, Victor M.; Palaniappan, Krishna

    Background: The exponential growth of genomic data from next generation technologies renders traditional manual expert curation effort unsustainable. Many genomic systems have included community annotation tools to address the problem. Most of these systems adopted a "Wiki-based" approach to take advantage of existing wiki technologies, but encountered obstacles in issues such as usability, authorship recognition, information reliability and incentive for community participation. Results: Here, we present a different approach, relying on a tightly integrated method rather than a "Wiki-based" method, to support community annotation and user collaboration in the Integrated Microbial Genomes (IMG) system. The IMG approach allows users to use the existing IMG data warehouse and analysis tools to add gene, pathway and biosynthetic cluster annotations, to analyze/reorganize contigs, genes and functions using workspace datasets, and to share private user annotations and workspace datasets with collaborators. We show that the annotation effort using IMG can be part of the research process to overcome the user incentive and authorship recognition problems, thus fostering collaboration among domain experts. The usability and reliability issues are addressed by the integration of curated information and analysis tools in IMG, together with DOE Joint Genome Institute (JGI) expert review. Conclusion: By incorporating annotation operations into IMG, we provide an integrated environment for users to perform deeper and extended data analysis and annotation in a single system that can lead to publications and community knowledge sharing as shown in the case studies.

  6. Cross-organism learning method to discover new gene functionalities.

    PubMed

    Domeniconi, Giacomo; Masseroli, Marco; Moro, Gianluca; Pinoli, Pietro

    2016-04-01

    Knowledge of gene and protein functions is paramount for the understanding of physiological and pathological biological processes, as well as in the development of new drugs and therapies. Analyses for biomedical knowledge discovery greatly benefit from the availability of gene and protein functional feature descriptions expressed through controlled terminologies and ontologies, i.e., of gene and protein biomedical controlled annotations. In the last years, several databases of such annotations have become available; yet, these valuable annotations are incomplete, include errors and only some of them represent highly reliable human curated information. Computational techniques able to reliably predict new gene or protein annotations with an associated likelihood value are thus paramount. Here, we propose a novel cross-organisms learning approach to reliably predict new functionalities for the genes of an organism based on the known controlled annotations of the genes of another, evolutionarily related and better studied, organism. We leverage a new representation of the annotation discovery problem and a random perturbation of the available controlled annotations to allow the application of supervised algorithms to predict with good accuracy unknown gene annotations. Taking advantage of the numerous gene annotations available for a well-studied organism, our cross-organisms learning method creates and trains better prediction models, which can then be applied to predict new gene annotations of a target organism. We tested and compared our method with the equivalent single organism approach on different gene annotation datasets of five evolutionarily related organisms (Homo sapiens, Mus musculus, Bos taurus, Gallus gallus and Dictyostelium discoideum). 
Results show both the usefulness of the perturbation method of available annotations for better prediction model training and a great improvement of the cross-organism models with respect to the single-organism ones, without influence of the evolutionary distance between the considered organisms. The generated ranked lists of reliably predicted annotations, which describe novel gene functionalities and have an associated likelihood value, are very valuable both to complement available annotations, for better coverage in biomedical knowledge discovery analyses, and to quicken the annotation curation process, by focusing it on the prioritized novel annotations predicted. Copyright © 2015 Elsevier Ireland Ltd. All rights reserved.

  7. Functional Annotation of the Arabidopsis Genome Using Controlled Vocabularies

    PubMed Central

    Berardini, Tanya Z.; Mundodi, Suparna; Reiser, Leonore; Huala, Eva; Garcia-Hernandez, Margarita; Zhang, Peifen; Mueller, Lukas A.; Yoon, Jungwoon; Doyle, Aisling; Lander, Gabriel; Moseyko, Nick; Yoo, Danny; Xu, Iris; Zoeckler, Brandon; Montoya, Mary; Miller, Neil; Weems, Dan; Rhee, Seung Y.

    2004-01-01

    Controlled vocabularies are increasingly used by databases to describe genes and gene products because they facilitate identification of similar genes within an organism or among different organisms. One of The Arabidopsis Information Resource's goals is to associate all Arabidopsis genes with terms developed by the Gene Ontology Consortium that describe the molecular function, biological process, and subcellular location of a gene product. We have also developed terms describing Arabidopsis anatomy and developmental stages and use these to annotate published gene expression data. As of March 2004, we used computational and manual annotation methods to make 85,666 annotations representing 26,624 unique loci. We focus on associating genes to controlled vocabulary terms based on experimental data from the literature and use The Arabidopsis Information Resource-developed PubSearch software to facilitate this process. Each annotation is tagged with a combination of evidence codes, evidence descriptions, and references that provide a robust means to assess data quality. Annotation of all Arabidopsis genes will allow quantitative comparisons between sets of genes derived from sources such as microarray experiments. The Arabidopsis annotation data will also facilitate annotation of newly sequenced plant genomes by using sequence similarity to transfer annotations to homologous genes. In addition, complete and up-to-date annotations will make unknown genes easy to identify and target for experimentation. Here, we describe the process of Arabidopsis functional annotation using a variety of data sources and illustrate several ways in which this information can be accessed and used to infer knowledge about Arabidopsis and other plant species. PMID:15173566

  8. Genome-wide Annotation, Identification, and Global Transcriptomic Analysis of Regulatory or Small RNA Gene Expression in Staphylococcus aureus

    PubMed Central

    Weiss, Andy; Broach, William H.; Wiemels, Richard E.; Mogen, Austin B.; Rice, Kelly C.

    2016-01-01

    ABSTRACT In Staphylococcus aureus, hundreds of small regulatory or small RNAs (sRNAs) have been identified, yet this class of molecule remains poorly understood and severely understudied. sRNA genes are typically absent from genome annotation files, and as a consequence, their existence is often overlooked, particularly in global transcriptomic studies. To facilitate improved detection and analysis of sRNAs in S. aureus, we generated updated GenBank files for three commonly used S. aureus strains (MRSA252, NCTC 8325, and USA300), in which we added annotations for >260 previously identified sRNAs. These files, the first to include genome-wide annotation of sRNAs in S. aureus, were then used as a foundation to identify novel sRNAs in the community-associated methicillin-resistant strain USA300. This analysis led to the discovery of 39 previously unidentified sRNAs. Investigating the genomic loci of the newly identified sRNAs revealed a surprising degree of inconsistency in genome annotation in S. aureus, which may be hindering the analysis and functional exploration of these elements. Finally, using our newly created annotation files as a reference, we perform a global analysis of sRNA gene expression in S. aureus and demonstrate that the newly identified tsr25 is the most highly upregulated sRNA in human serum. This study provides an invaluable resource to the S. aureus research community in the form of our newly generated annotation files, while at the same time presenting the first examination of differential sRNA expression in pathophysiologically relevant conditions. PMID:26861020

  9. Linking microarray reporters with protein functions.

    PubMed

    Gaj, Stan; van Erk, Arie; van Haaften, Rachel I M; Evelo, Chris T A

    2007-09-26

    The analysis of microarray experiments requires accurate and up-to-date functional annotation of the microarray reporters to optimize the interpretation of the biological processes involved. Pathway visualization tools are used to connect gene expression data with existing biological pathways by using specific database identifiers that link reporters with elements in the pathways. This paper proposes a novel method that aims to improve microarray reporter annotation by BLASTing the original reporter sequences against a species-specific EMBL subset, that was derived from and crosslinked back to the highly curated UniProt database. The resulting alignments were filtered using high quality alignment criteria and further compared with the outcome of a more traditional approach, where reporter sequences were BLASTed against EnsEMBL followed by locating the corresponding protein (UniProt) entry for the high quality hits. Combining the results of both methods resulted in successful annotation of > 58% of all reporter sequences with UniProt IDs on two commercial array platforms, increasing the amount of Incyte reporters that could be coupled to Gene Ontology terms from 32.7% to 58.3% and to a local GenMAPP pathway from 9.6% to 16.7%. For Agilent, 35.3% of the total reporters are now linked towards GO nodes and 7.1% on local pathways. Our methods increased the annotation quality of microarray reporter sequences and allowed us to visualize more reporters using pathway visualization tools. Even in cases where the original reporter annotation showed the correct description the new identifiers often allowed improved pathway and Gene Ontology linking. These methods are freely available at http://www.bigcat.unimaas.nl/public/publications/Gaj_Annotation/.
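
    The idea of combining the results of the two BLAST-based routes and then reporting the share of annotated reporters can be sketched as follows. This is only an illustration, not the authors' pipeline: the precedence rule (the EMBL-subset result wins when both routes annotate a reporter) is an assumption, and all reporter names and identifiers are hypothetical placeholders.

```python
def combine_annotations(embl_hits, ensembl_hits):
    """Merge two reporter -> UniProt ID mappings into one.

    Assumption (not stated in the abstract): when both routes annotate
    the same reporter, the EMBL-subset result takes precedence.
    Entries without an ID (None or empty string) are ignored.
    """
    combined = {}
    for source in (ensembl_hits, embl_hits):  # the later source overrides
        for reporter, uniprot_id in source.items():
            if uniprot_id:
                combined[reporter] = uniprot_id
    return combined


def coverage_percent(annotated, all_reporters):
    """Percentage of reporters that received an annotation."""
    if not all_reporters:
        raise ValueError("all_reporters must not be empty")
    hits = sum(1 for reporter in all_reporters if reporter in annotated)
    return 100.0 * hits / len(all_reporters)


# Hypothetical toy data: four reporters, two partially overlapping routes.
REPORTERS = ["rep1", "rep2", "rep3", "rep4"]
EMBL_HITS = {"rep1": "ID_A", "rep2": None}
ENSEMBL_HITS = {"rep1": "ID_X", "rep2": "ID_B", "rep3": "ID_C"}

if __name__ == "__main__":
    merged = combine_annotations(EMBL_HITS, ENSEMBL_HITS)
    print(merged)
    print(coverage_percent(merged, REPORTERS))
```

    In the toy data each route alone annotates fewer reporters than the merged mapping, which is the effect the abstract reports for the two array platforms.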

  10. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Kolker, Eugene

    Our project focused primarily on analysis of different types of data produced by global high-throughput technologies, data integration of gene annotation, and gene and protein expression information, as well as on getting a better functional annotation of Shewanella genes. Specifically, four of our numerous major activities and achievements include the development of: statistical models for identification and expression proteomics, superior to currently available approaches (including our own earlier ones); approaches to improve gene annotations on the whole-organism scale; standards for annotation, transcriptomics and proteomics approaches; and generalized approaches for data integration of gene annotation, gene and protein expression information.

  11. ExAtlas: An interactive online tool for meta-analysis of gene expression data.

    PubMed

    Sharov, Alexei A; Schlessinger, David; Ko, Minoru S H

    2015-12-01

    We have developed ExAtlas, an on-line software tool for meta-analysis and visualization of gene expression data. In contrast to existing software tools, ExAtlas compares multi-component data sets and generates results for all combinations (e.g. all gene expression profiles versus all Gene Ontology annotations). ExAtlas handles both users' own data and data extracted semi-automatically from the public repository (GEO/NCBI database). ExAtlas provides a variety of tools for meta-analyses: (1) standard meta-analysis (fixed effects, random effects, z-score, and Fisher's methods); (2) analyses of global correlations between gene expression data sets; (3) gene set enrichment; (4) gene set overlap; (5) gene association by expression profile; (6) gene specificity; and (7) statistical analysis (ANOVA, pairwise comparison, and PCA). ExAtlas produces graphical outputs, including heatmaps, scatter-plots, bar-charts, and three-dimensional images. Some of the most widely used public data sets (e.g. GNF/BioGPS, Gene Ontology, KEGG, GAD phenotypes, BrainScan, ENCODE ChIP-seq, and protein-protein interaction) are pre-loaded and can be used for functional annotations.
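
    Of the standard meta-analysis options listed in item (1), Fisher's method is the simplest to show in a few lines: k independent p-values are combined into the statistic X = -2 * sum(ln p_i), which follows a chi-square distribution with 2k degrees of freedom under the null hypothesis. The sketch below is a generic textbook version, not ExAtlas code; the function names are illustrative, and it uses the closed-form chi-square tail for even degrees of freedom so that only the standard library is needed.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable with even df.

    For df = 2k the tail is exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!.
    """
    if df <= 0 or df % 2:
        raise ValueError("df must be a positive even integer")
    half = x / 2.0
    term = 1.0   # the i = 0 term of the sum
    total = 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


def fisher_combined_p(p_values):
    """Combine independent p-values with Fisher's method."""
    if not p_values or any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    statistic = -2.0 * sum(math.log(p) for p in p_values)
    return chi2_sf_even_df(statistic, 2 * len(p_values))


if __name__ == "__main__":
    # Two hypothetical studies that each report p = 0.05.
    print(fisher_combined_p([0.05, 0.05]))
```

    A single p-value is returned unchanged, which is a quick sanity check; the method assumes the combined tests are independent.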

  12. Automated Update, Revision, and Quality Control of the Maize Genome Annotations Using MAKER-P Improves the B73 RefGen_v3 Gene Models and Identifies New Genes

    PubMed Central

    Law, MeiYee; Childs, Kevin L.; Campbell, Michael S.; Stein, Joshua C.; Olson, Andrew J.; Holt, Carson; Panchy, Nicholas; Lei, Jikai; Jiao, Dian; Andorf, Carson M.; Lawrence, Carolyn J.; Ware, Doreen; Shiu, Shin-Han; Sun, Yanni; Jiang, Ning; Yandell, Mark

    2015-01-01

    The large size and relative complexity of many plant genomes make creation, quality control, and dissemination of high-quality gene structure annotations challenging. In response, we have developed MAKER-P, a fast and easy-to-use genome annotation engine for plants. Here, we report the use of MAKER-P to update and revise the maize (Zea mays) B73 RefGen_v3 annotation build (5b+) in less than 3 h using the iPlant Cyberinfrastructure. MAKER-P identified and annotated 4,466 additional, well-supported protein-coding genes not present in the 5b+ annotation build, added additional untranslated regions to 1,393 5b+ gene models, identified 2,647 5b+ gene models that lack any supporting evidence (despite the use of large and diverse evidence data sets), identified 104,215 pseudogene fragments, and created an additional 2,522 noncoding gene annotations. We also describe a method for de novo training of MAKER-P for the annotation of newly sequenced grass genomes. Collectively, these results lead to the 6a maize genome annotation and demonstrate the utility of MAKER-P for rapid annotation, management, and quality control of grasses and other difficult-to-annotate plant genomes. PMID:25384563

  13. Structural and functional annotation of the porcine immunome

    PubMed Central

    2013-01-01

    Background: The domestic pig is known as an excellent model for human immunology and the two species share many pathogens. Susceptibility to infectious disease is one of the major constraints on swine performance, yet the structure and function of genes comprising the pig immunome are not well-characterized. The completion of the pig genome provides the opportunity to annotate the pig immunome, and compare and contrast pig and human immune systems. Results: The Immune Response Annotation Group (IRAG) used computational curation and manual annotation of the swine genome assembly 10.2 (Sscrofa10.2) to refine the currently available automated annotation of 1,369 immunity-related genes through sequence-based comparison to genes in other species. Within these genes, we annotated 3,472 transcripts. Annotation provided evidence for gene expansions in several immune response families, and identified artiodactyl-specific expansions in the cathelicidin and type 1 Interferon families. We found gene duplications for 18 genes, including 13 immune response genes and five non-immune response genes discovered in the annotation process. Manual annotation provided evidence for many new alternative splice variants and 8 gene duplications. Over 1,100 transcripts without porcine sequence evidence were detected using cross-species annotation. We used a functional approach to discover and accurately annotate porcine immune response genes. A co-expression clustering analysis of transcriptomic data from selected experimental infections or immune stimulations of blood, macrophages or lymph nodes identified a large cluster of genes that exhibited a correlated positive response upon infection across multiple pathogens or immune stimuli. Interestingly, this gene cluster (cluster 4) is enriched for known general human immune response genes, yet contains many un-annotated porcine genes. A phylogenetic analysis of the encoded proteins of cluster 4 genes showed that 15% exhibited an accelerated evolution as compared to 4.1% across the entire genome. Conclusions: This extensive annotation dramatically extends the genome-based knowledge of the molecular genetics and structure of a major portion of the porcine immunome. Our complementary functional approach using co-expression during immune response has provided new putative immune response annotation for over 500 porcine genes. Our phylogenetic analysis of this core immunome cluster confirms rapid evolutionary change in this set of genes, and that, as in other species, such genes are important components of the pig’s adaptation to pathogen challenge over evolutionary time. These comprehensive and integrated analyses increase the value of the porcine genome sequence and provide important tools for global analyses and data-mining of the porcine immune response. PMID:23676093

  14. Using phylogenetically-informed annotation (PIA) to search for light-interacting genes in transcriptomes from non-model organisms.

    PubMed

    Speiser, Daniel I; Pankey, M Sabrina; Zaharoff, Alexander K; Battelle, Barbara A; Bracken-Grissom, Heather D; Breinholt, Jesse W; Bybee, Seth M; Cronin, Thomas W; Garm, Anders; Lindgren, Annie R; Patel, Nipam H; Porter, Megan L; Protas, Meredith E; Rivera, Ajna S; Serb, Jeanne M; Zigler, Kirk S; Crandall, Keith A; Oakley, Todd H

    2014-11-19

    Tools for high throughput sequencing and de novo assembly make the analysis of transcriptomes (i.e. the suite of genes expressed in a tissue) feasible for almost any organism. Yet a challenge for biologists is that it can be difficult to assign identities to gene sequences, especially from non-model organisms. Phylogenetic analyses are one useful method for assigning identities to these sequences, but such methods tend to be time-consuming because of the need to re-calculate trees for every gene of interest and each time a new data set is analyzed. In response, we employed existing tools for phylogenetic analysis to produce a computationally efficient, tree-based approach for annotating transcriptomes or new genomes that we term Phylogenetically-Informed Annotation (PIA), which places uncharacterized genes into pre-calculated phylogenies of gene families. We generated maximum likelihood trees for 109 genes from a Light Interaction Toolkit (LIT), a collection of genes that underlie the function or development of light-interacting structures in metazoans. To do so, we searched protein sequences predicted from 29 fully-sequenced genomes and built trees using tools for phylogenetic analysis in the Osiris package of Galaxy (an open-source workflow management system). Next, to rapidly annotate transcriptomes from organisms that lack sequenced genomes, we repurposed a maximum likelihood-based Evolutionary Placement Algorithm (implemented in RAxML) to place sequences of potential LIT genes on to our pre-calculated gene trees. Finally, we implemented PIA in Galaxy and used it to search for LIT genes in 28 newly-sequenced transcriptomes from the light-interacting tissues of a range of cephalopod mollusks, arthropods, and cubozoan cnidarians. Our new trees for LIT genes are available on the Bitbucket public repository ( http://bitbucket.org/osiris_phylogenetics/pia/ ) and we demonstrate PIA on a publicly-accessible web server ( http://galaxy-dev.cnsi.ucsb.edu/pia/ ). 
Our new trees for LIT genes will be a valuable resource for researchers studying the evolution of eyes or other light-interacting structures. We also introduce PIA, a high throughput method for using phylogenetic relationships to identify LIT genes in transcriptomes from non-model organisms. With simple modifications, our methods may be used to search for different sets of genes or to annotate data sets from taxa outside of Metazoa.

  15. snpGeneSets: An R Package for Genome-Wide Study Annotation

    PubMed Central

    Mei, Hao; Li, Lianna; Jiang, Fan; Simino, Jeannette; Griswold, Michael; Mosley, Thomas; Liu, Shijian

    2016-01-01

    Genome-wide studies (GWS) of SNP associations and differential gene expressions have generated abundant results; next-generation sequencing technology has further boosted the number of variants and genes identified. Effective interpretation requires massive annotation and downstream analysis of these genome-wide results, a computationally challenging task. We developed the snpGeneSets package to simplify annotation and analysis of GWS results. Our package integrates local copies of knowledge bases for SNPs, genes, and gene sets, and implements wrapper functions in the R language to enable transparent access to low-level databases for efficient annotation of large genomic data. The package contains functions that execute three types of annotations: (1) genomic mapping annotation for SNPs and genes and functional annotation for gene sets; (2) bidirectional mapping between SNPs and genes, and genes and gene sets; and (3) calculation of gene effect measures from SNP associations and performance of gene set enrichment analyses to identify functional pathways. We applied snpGeneSets to type 2 diabetes (T2D) results from the NHGRI genome-wide association study (GWAS) catalog, a Finnish GWAS, and a genome-wide expression study (GWES). These studies demonstrate the usefulness of snpGeneSets for annotating and performing enrichment analysis of GWS results. The package is open-source, free, and can be downloaded at: https://www.umc.edu/biostats_software/. PMID:27807048
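
    The bidirectional mapping between SNPs and genes mentioned above can be illustrated with a minimal Python sketch. This is not the snpGeneSets API (the package itself is written in R); the function names, the toy coordinates, the identifiers, and the flank parameter are all hypothetical and serve only to show the idea of mapping by genomic position and then inverting the map.

```python
from collections import defaultdict

# Hypothetical toy data, not taken from any real knowledge base.
# gene -> (chromosome, start, end); SNP -> (chromosome, position)
GENES = {
    "GENE_A": ("1", 1000, 5000),
    "GENE_B": ("1", 4500, 9000),
    "GENE_C": ("2", 200, 800),
}
SNPS = {
    "rs_x1": ("1", 4800),
    "rs_x2": ("1", 9500),
    "rs_x3": ("2", 100),
}


def map_snps_to_genes(snps, genes, flank=0):
    """Map each SNP to the genes whose region, extended by `flank` bp, contains it."""
    snp_to_genes = {}
    for snp_id, (chrom, pos) in snps.items():
        hits = [
            gene
            for gene, (g_chrom, start, end) in genes.items()
            if g_chrom == chrom and start - flank <= pos <= end + flank
        ]
        snp_to_genes[snp_id] = sorted(hits)
    return snp_to_genes


def invert_mapping(snp_to_genes):
    """Turn a SNP -> genes map into a gene -> SNPs map (the reverse direction)."""
    inverse = defaultdict(list)
    for snp_id, gene_list in snp_to_genes.items():
        for gene in gene_list:
            inverse[gene].append(snp_id)
    return {gene: sorted(ids) for gene, ids in inverse.items()}


forward = map_snps_to_genes(SNPS, GENES, flank=500)
reverse = invert_mapping(forward)
print(forward)
print(reverse)
```

    A linear scan over all genes is used here for clarity; for genome-wide data an interval index would be the more realistic choice.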

  16. A guide to best practices for Gene Ontology (GO) manual annotation

    PubMed Central

    Balakrishnan, Rama; Harris, Midori A.; Huntley, Rachael; Van Auken, Kimberly; Cherry, J. Michael

    2013-01-01

    The Gene Ontology Consortium (GOC) is a community-based bioinformatics project that classifies gene product function through the use of structured controlled vocabularies. A fundamental application of the Gene Ontology (GO) is in the creation of gene product annotations, evidence-based associations between GO definitions and experimental or sequence-based analysis. Currently, the GOC disseminates 126 million annotations covering >374 000 species including all the kingdoms of life. This number includes two classes of GO annotations: those created manually by experienced biocurators reviewing the literature or by examination of biological data (1.1 million annotations covering 2226 species) and those generated computationally via automated methods. As manual annotations are often used to propagate functional predictions between related proteins within and between genomes, it is critical to provide accurate consistent manual annotations. Toward this goal, we present here the conventions defined by the GOC for the creation of manual annotation. This guide represents the best practices for manual annotation as established by the GOC project over the past 12 years. We hope this guide will encourage research communities to annotate gene products of their interest to enhance the corpus of GO annotations available to all. Database URL: http://www.geneontology.org PMID:23842463

  17. Gene Ontology annotations at SGD: new data sources and annotation methods

    PubMed Central

    Hong, Eurie L.; Balakrishnan, Rama; Dong, Qing; Christie, Karen R.; Park, Julie; Binkley, Gail; Costanzo, Maria C.; Dwight, Selina S.; Engel, Stacia R.; Fisk, Dianna G.; Hirschman, Jodi E.; Hitz, Benjamin C.; Krieger, Cynthia J.; Livstone, Michael S.; Miyasato, Stuart R.; Nash, Robert S.; Oughtred, Rose; Skrzypek, Marek S.; Weng, Shuai; Wong, Edith D.; Zhu, Kathy K.; Dolinski, Kara; Botstein, David; Cherry, J. Michael

    2008-01-01

    The Saccharomyces Genome Database (SGD; http://www.yeastgenome.org/) collects and organizes biological information about the chromosomal features and gene products of the budding yeast Saccharomyces cerevisiae. Although published data from traditional experimental methods are the primary sources of evidence supporting Gene Ontology (GO) annotations for a gene product, high-throughput experiments and computational predictions can also provide valuable insights in the absence of an extensive body of literature. Therefore, GO annotations available at SGD now include high-throughput data as well as computational predictions provided by the GO Annotation Project (GOA UniProt; http://www.ebi.ac.uk/GOA/). Because the annotation method used to assign GO annotations varies by data source, GO resources at SGD have been modified to distinguish data sources and annotation methods. In addition to providing information for genes that have not been experimentally characterized, GO annotations from independent sources can be compared to those made by SGD to help keep the literature-based GO annotations current. PMID:17982175

  18. ChIPpeakAnno: a Bioconductor package to annotate ChIP-seq and ChIP-chip data

    PubMed Central

    2010-01-01

    Background Chromatin immunoprecipitation (ChIP) followed by high-throughput sequencing (ChIP-seq) or ChIP followed by genome tiling array analysis (ChIP-chip) have become standard technologies for genome-wide identification of DNA-binding protein target sites. A number of algorithms have been developed in parallel that allow identification of binding sites from ChIP-seq or ChIP-chip datasets and subsequent visualization in the University of California Santa Cruz (UCSC) Genome Browser as custom annotation tracks. However, summarizing these tracks can be a daunting task, particularly if there are a large number of binding sites or the binding sites are distributed widely across the genome. Results We have developed ChIPpeakAnno as a Bioconductor package within the statistical programming environment R to facilitate batch annotation of enriched peaks identified from ChIP-seq, ChIP-chip, cap analysis of gene expression (CAGE) or any experiments resulting in a large number of enriched genomic regions. The binding sites annotated with ChIPpeakAnno can be viewed easily as a table, a pie chart or plotted in histogram form, i.e., the distribution of distances to the nearest genes for each set of peaks. In addition, we have implemented functionalities for determining the significance of overlap between replicates or binding sites among transcription factors within a complex, and for drawing Venn diagrams to visualize the extent of the overlap between replicates. Furthermore, the package includes functionalities to retrieve sequences flanking putative binding sites for PCR amplification, cloning, or motif discovery, and to identify Gene Ontology (GO) terms associated with adjacent genes. Conclusions ChIPpeakAnno enables batch annotation of the binding sites identified from ChIP-seq, ChIP-chip, CAGE or any technology that results in a large number of enriched genomic regions within the statistical programming environment R. 
Allowing users to pass their own annotation data such as a different Chromatin immunoprecipitation (ChIP) preparation and a dataset from literature, or existing annotation packages, such as GenomicFeatures and BSgenome, provides flexibility. Tight integration to the biomaRt package enables up-to-date annotation retrieval from the BioMart database. PMID:20459804
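
    The idea of annotating each peak with its nearest gene and the distance to it can be sketched in a few lines of Python. This is an illustration only and does not reproduce the ChIPpeakAnno interface (which is an R/Bioconductor package); the function name, the single-chromosome toy coordinates, and the choice to ignore strand are assumptions made for brevity.

```python
import bisect


def nearest_tss(peak_mid, tss_positions, tss_names):
    """Return (gene_name, signed_distance) for the TSS closest to peak_mid.

    tss_positions must be a non-empty list sorted in ascending order, with
    tss_names aligned to it. The signed distance is peak_mid - tss, so a
    negative value means the peak lies before the TSS in coordinate order
    (strand is ignored in this sketch).
    """
    i = bisect.bisect_left(tss_positions, peak_mid)
    candidates = []
    if i < len(tss_positions):
        candidates.append(i)      # first TSS at or after the peak
    if i > 0:
        candidates.append(i - 1)  # last TSS before the peak
    best = min(candidates, key=lambda j: abs(peak_mid - tss_positions[j]))
    return tss_names[best], peak_mid - tss_positions[best]


# Hypothetical single-chromosome example.
TSS_POSITIONS = [1_000, 12_000, 30_000]
TSS_NAMES = ["geneA", "geneB", "geneC"]
PEAKS = {"peak1": (900, 1_100), "peak2": (19_000, 19_400), "peak3": (40_000, 40_200)}

annotation = {
    name: nearest_tss((start + end) // 2, TSS_POSITIONS, TSS_NAMES)
    for name, (start, end) in PEAKS.items()
}
print(annotation)
```

    The list of signed distances collected this way is the kind of data that a distance histogram for a set of peaks would be drawn from.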

  19. A curated catalog of canine and equine keratin genes

    PubMed Central

    Pujar, Shashikant; McGarvey, Kelly M.; Welle, Monika; Galichet, Arnaud; Müller, Eliane J.; Pruitt, Kim D.; Leeb, Tosso

    2017-01-01

    Keratins represent a large protein family with essential structural and functional roles in epithelial cells of skin, hair follicles, and other organs. During evolution the genes encoding keratins have undergone multiple rounds of duplication and humans have two clusters with a total of 55 functional keratin genes in their genomes. Due to the high similarity between different keratin paralogs and species-specific differences in gene content, the currently available keratin gene annotation in species with draft genome assemblies such as dog and horse is still imperfect. We compared the National Center for Biotechnology Information (NCBI) (dog annotation release 103, horse annotation release 101) and Ensembl (release 87) gene predictions for the canine and equine keratin gene clusters to RNA-seq data that were generated from adult skin of five dogs and two horses and from adult hair follicle tissue of one dog. Taking into consideration the knowledge on the conserved exon/intron structure of keratin genes, we annotated 61 putatively functional keratin genes in both the dog and horse, respectively. Subsequently, curators in the RefSeq group at NCBI reviewed their annotation of keratin genes in the dog and horse genomes (Annotation Release 104 and Annotation Release 102, respectively) and updated annotation and gene nomenclature of several keratin genes. The updates are now available in the NCBI Gene database (https://www.ncbi.nlm.nih.gov/gene). PMID:28846680

  20. Combining evidence, biomedical literature and statistical dependence: new insights for functional annotation of gene sets

    PubMed Central

    Aubry, Marc; Monnier, Annabelle; Chicault, Celine; de Tayrac, Marie; Galibert, Marie-Dominique; Burgun, Anita; Mosser, Jean

    2006-01-01

    Background Large-scale genomic studies based on transcriptome technologies provide clusters of genes that need to be functionally annotated. The Gene Ontology (GO) implements a controlled vocabulary organised into three hierarchies: cellular components, molecular functions and biological processes. This terminology allows a coherent and consistent description of the knowledge about gene functions. The GO terms related to genes come primarily from semi-automatic annotations made by trained biologists (annotation based on evidence) or text-mining of the published scientific literature (literature profiling). Results We report an original functional annotation method based on a combination of evidence and literature that overcomes the weaknesses and the limitations of each approach. It relies on the Gene Ontology Annotation database (GOA Human) and the PubGene biomedical literature index. We support these annotations with statistically associated GO terms and retrieve associative relations across the three GO hierarchies to emphasise the major pathways involved by a gene cluster. Both annotation methods and associative relations were quantitatively evaluated with a reference set of 7397 genes and a multi-cluster study of 14 clusters. We also validated the biological appropriateness of our hybrid method with the annotation of a single gene (cdc2) and that of a down-regulated cluster of 37 genes identified by a transcriptome study of an in vitro enterocyte differentiation model (CaCo-2 cells). Conclusion The combination of both approaches is more informative than either separate approach: literature mining can enrich an annotation based only on evidence. Text-mining of the literature can also find valuable associated MEDLINE references that confirm the relevance of the annotation. Eventually, GO terms networks can be built with associative relations in order to highlight cooperative and competitive pathways and their connected molecular functions. PMID:16674810

  1. Automated update, revision, and quality control of the maize genome annotations using MAKER-P improves the B73 RefGen_v3 gene models and identifies new genes.

    PubMed

    Law, MeiYee; Childs, Kevin L; Campbell, Michael S; Stein, Joshua C; Olson, Andrew J; Holt, Carson; Panchy, Nicholas; Lei, Jikai; Jiao, Dian; Andorf, Carson M; Lawrence, Carolyn J; Ware, Doreen; Shiu, Shin-Han; Sun, Yanni; Jiang, Ning; Yandell, Mark

    2015-01-01

    The large size and relative complexity of many plant genomes make creation, quality control, and dissemination of high-quality gene structure annotations challenging. In response, we have developed MAKER-P, a fast and easy-to-use genome annotation engine for plants. Here, we report the use of MAKER-P to update and revise the maize (Zea mays) B73 RefGen_v3 annotation build (5b+) in less than 3 h using the iPlant Cyberinfrastructure. MAKER-P identified and annotated 4,466 additional, well-supported protein-coding genes not present in the 5b+ annotation build, added additional untranslated regions to 1,393 5b+ gene models, identified 2,647 5b+ gene models that lack any supporting evidence (despite the use of large and diverse evidence data sets), identified 104,215 pseudogene fragments, and created an additional 2,522 noncoding gene annotations. We also describe a method for de novo training of MAKER-P for the annotation of newly sequenced grass genomes. Collectively, these results lead to the 6a maize genome annotation and demonstrate the utility of MAKER-P for rapid annotation, management, and quality control of grasses and other difficult-to-annotate plant genomes. © 2015 American Society of Plant Biologists. All Rights Reserved.

  2. EuCAP, a Eukaryotic Community Annotation Package, and its application to the rice genome

    PubMed Central

    Thibaud-Nissen, Françoise; Campbell, Matthew; Hamilton, John P; Zhu, Wei; Buell, C Robin

    2007-01-01

    Background Despite the improvements of tools for automated annotation of genome sequences, manual curation at the structural and functional level can provide an increased level of refinement to genome annotation. The Institute for Genomic Research Rice Genome Annotation (hereafter named the Osa1 Genome Annotation) is the product of an automated pipeline and, for this reason, will benefit from the input of biologists with expertise in rice and/or particular gene families. Leveraging knowledge from a dispersed community of scientists is a demonstrated way of improving a genome annotation. This requires tools that facilitate 1) the submission of gene annotation to an annotation project, 2) the review of the submitted models by project annotators, and 3) the incorporation of the submitted models in the ongoing annotation effort. Results We have developed the Eukaryotic Community Annotation Package (EuCAP), an annotation tool, and have applied it to the rice genome. The primary level of curation by community annotators (CA) has been the annotation of gene families. Annotation can be submitted by email or through the EuCAP Web Tool. The CA models are aligned to the rice pseudomolecules and the coordinates of these alignments, along with functional annotation, are stored in the MySQL EuCAP Gene Model database. Web pages displaying the alignments of the CA models to the Osa1 Genome models are automatically generated from the EuCAP Gene Model database. The alignments are reviewed by the project annotators (PAs) in the context of experimental evidence. Upon approval by the PAs, the CA models, along with the corresponding functional annotations, are integrated into the Osa1 Genome Annotation. The CA annotations, grouped by family, are displayed on the Community Annotation pages of the project website, as well as in the Community Annotation track of the Genome Browser. Conclusion We have applied EuCAP to rice.
As of July 2007, the structural and/or functional annotation of 1,094 genes representing 57 families have been deposited and integrated into the current gene set. All of the EuCAP components are open-source, thereby allowing the implementation of EuCAP for the annotation of other genomes. EuCAP is available at . PMID:17961238

  3. GeneTools--application for functional annotation and statistical hypothesis testing.

    PubMed

    Beisvag, Vidar; Jünge, Frode K R; Bergum, Hallgeir; Jølsum, Lars; Lydersen, Stian; Günther, Clara-Cecilie; Ramampiaro, Heri; Langaas, Mette; Sandvik, Arne K; Laegreid, Astrid

    2006-10-24

    Modern biology has shifted from "one gene" approaches to methods for genomic-scale analysis like microarray technology, which allow simultaneous measurement of thousands of genes. This has created a need for tools facilitating interpretation of biological data in "batch" mode. However, such tools often leave the investigator with large volumes of apparently unorganized information. To meet this interpretation challenge, gene-set, or cluster, testing has become a popular analytical tool. Many gene-set testing methods and software packages are now available, most of which use a variety of statistical tests to assess the genes in a set for biological information. However, the field is still evolving, and there is a great need for "integrated" solutions. GeneTools is a web-service providing access to a database that brings together information from a broad range of resources. The annotation data are updated weekly, guaranteeing that users get the most recently available data. Data submitted by the user are stored in the database, where they can easily be updated, shared between users and exported in various formats. GeneTools provides three different tools: i) NMC Annotation Tool, which offers annotations from several databases like UniGene, Entrez Gene, SwissProt and GeneOntology, in both single- and batch search mode. ii) GO Annotator Tool, where users can add new gene ontology (GO) annotations to genes of interest. These user defined GO annotations can be used in further analysis or exported for public distribution. iii) eGOn, a tool for visualization and statistical hypothesis testing of GO category representation. As the first GO tool, eGOn supports hypothesis testing for three different situations (master-target situation, mutually exclusive target-target situation and intersecting target-target situation). An important additional function is an evidence-code filter that allows users to select the GO annotations for the analysis.
GeneTools is the first "all in one" annotation tool, providing users with a rapid extraction of highly relevant gene annotation data for e.g. thousands of genes or clones at once. It allows a user to define and archive new GO annotations and it supports hypothesis testing related to GO category representations. GeneTools is freely available through www.genetools.no
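
    To make the notion of testing GO category representation concrete, the following Python sketch computes a one-sided hypergeometric p-value for a target list drawn from a master list. The abstract does not state which tests eGOn applies in each of its three situations, so this is only an assumed, commonly used choice for the master-target case; the function name and the example counts are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(k, K, n, N):
    """P(X >= k) for X ~ Hypergeometric.

    N: genes in the master list
    K: master-list genes annotated to the GO category
    n: genes in the target list (a subset of the master list)
    k: target-list genes annotated to the GO category
    """
    upper = min(K, n)
    total = comb(N, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / total


# Hypothetical counts: 20 genes in the master list, 5 annotated to the category,
# 4 genes in the target list, 2 of which carry the annotation.
p_value = hypergeom_upper_tail(k=2, K=5, n=4, N=20)
print(round(p_value, 4))
```

    A small p-value would suggest that the category is over-represented in the target list relative to the master list; in practice a correction for testing many GO categories would also be needed.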

  4. AGORA : Organellar genome annotation from the amino acid and nucleotide references.

    PubMed

    Jung, Jaehee; Kim, Jong Im; Jeong, Young-Sik; Yi, Gangman

    2018-03-29

    Next-generation sequencing (NGS) technologies have led to the accumulation of high-throughput sequence data from various organisms. Annotating the organellar genomes of such diverse organisms requires better-optimized tools for functional gene annotation. Almost all gene annotation tools focus mainly on the chloroplast genome of land plants or the mitochondrial genome of animals. We have developed a web application, AGORA, for fast, user-friendly, and improved annotation of organellar genomes. AGORA annotates genes based on a BLAST-based homology search and clustering with selected reference sequences from the NCBI database or user-defined uploaded data. AGORA can annotate the functional genes in almost all mitochondrion and plastid genomes of eukaryotes. Annotation is also available for genomes with an exon-intron structure within a gene or with an inverted repeat region. AGORA provides the start and end positions of each gene, BLAST results compared with the reference sequence, and a visualization of the gene map by OGDRAW. Users can freely use the software at https://bigdata.dongguk.edu/gene_project/AGORA/. The main module of the tool is implemented in Python and PHP, and the web page is built with HTML and CSS to support all browsers.

  5. Methodology for the inference of gene function from phenotype data.

    PubMed

    Ascensao, Joao A; Dolan, Mary E; Hill, David P; Blake, Judith A

    2014-12-12

    Biomedical ontologies are increasingly instrumental in the advancement of biological research primarily through their use to efficiently consolidate large amounts of data into structured, accessible sets. However, ontology development and usage can be hampered by the segregation of knowledge by domain that occurs due to independent development and use of the ontologies. The ability to infer data associated with one ontology to data associated with another ontology would prove useful in expanding information content and scope. We here focus on relating two ontologies: the Gene Ontology (GO), which encodes canonical gene function, and the Mammalian Phenotype Ontology (MP), which describes non-canonical phenotypes, using statistical methods to suggest GO functional annotations from existing MP phenotype annotations. This work is in contrast to previous studies that have focused on inferring gene function from phenotype primarily through lexical or semantic similarity measures. We have designed and tested a set of algorithms that represents a novel methodology to define rules for predicting gene function by examining the emergent structure and relationships between the gene functions and phenotypes rather than inspecting the terms semantically. The algorithms inspect relationships among multiple phenotype terms to deduce if there are cases where they all arise from a single gene function. We apply this methodology to data about genes in the laboratory mouse that are formally represented in the Mouse Genome Informatics (MGI) resource. From the data, 7444 rule instances were generated from five generalized rules, resulting in 4818 unique GO functional predictions for 1796 genes. We show that our method is capable of inferring high-quality functional annotations from curated phenotype data. 
As well as creating inferred annotations, our method has the potential to allow for the elucidation of unforeseen, biologically significant associations between gene function and phenotypes that would be overlooked by a semantics-based approach. Future work will include the implementation of the described algorithms for a variety of other model organism databases, taking full advantage of the abundance of available high quality curated data.

  6. Pathway enrichment analysis approach based on topological structure and updated annotation of pathway.

    PubMed

    Yang, Qian; Wang, Shuyuan; Dai, Enyu; Zhou, Shunheng; Liu, Dianming; Liu, Haizhou; Meng, Qianqian; Jiang, Bin; Jiang, Wei

    2017-08-16

    Pathway enrichment analysis has been widely used to identify cancer risk pathways, and contributes to elucidating the mechanism of tumorigenesis. However, most existing approaches use outdated pathway information and neglect the complex gene interactions within pathways. Here, we first briefly reviewed the widely used pathway enrichment analysis approaches, and then proposed a novel topology-based pathway enrichment analysis (TPEA) method, which integrates topological properties and global upstream/downstream positions of genes in pathways. We compared TPEA with four widely used pathway enrichment analysis tools, namely the Database for Annotation, Visualization and Integrated Discovery (DAVID), gene set enrichment analysis (GSEA), centrality-based pathway enrichment (CePa) and signaling pathway impact analysis (SPIA), by analyzing six gene expression profiles of three tumor types (colorectal cancer, thyroid cancer and endometrial cancer). As a result, we identified several well-known cancer risk pathways that could not be obtained by the existing tools, and the results of TPEA were more stable than those of the other tools when analyzing different data sets of the same cancer. Ultimately, we developed an R package to implement TPEA, which can update KEGG pathway information online and is available at the Comprehensive R Archive Network (CRAN): https://cran.r-project.org/web/packages/TPEA/. © The Author 2017. Published by Oxford University Press. All rights reserved.

  7. Algorithms for Hidden Markov Models Restricted to Occurrences of Regular Expressions

    PubMed Central

    Tataru, Paula; Sand, Andreas; Hobolth, Asger; Mailund, Thomas; Pedersen, Christian N. S.

    2013-01-01

    Hidden Markov Models (HMMs) are widely used probabilistic models, particularly for annotating sequential data with an underlying hidden structure. Patterns in the annotation are often more relevant to study than the hidden structure itself. A typical HMM analysis consists of annotating the observed data using a decoding algorithm and analyzing the annotation to study patterns of interest. For example, given an HMM modeling genes in DNA sequences, the focus is on occurrences of genes in the annotation. In this paper, we define a pattern through a regular expression and present a restriction of three classical algorithms to take the number of occurrences of the pattern in the hidden sequence into account. We present a new algorithm to compute the distribution of the number of pattern occurrences, and we extend the two most widely used existing decoding algorithms to employ information from this distribution. We show experimentally that the expectation of the distribution of the number of pattern occurrences gives a highly accurate estimate, while the typical procedure can be biased in the sense that the identified number of pattern occurrences does not correspond to the true number. We furthermore show that using this distribution in the decoding algorithms improves the predictive power of the model. PMID:24833225
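
    The "typical procedure" that the abstract contrasts with its new algorithms, namely decoding a hidden path and then counting matches of a regular expression in it, can be sketched as follows. This Python example is not the authors' restricted algorithm for the distribution of pattern occurrences; the two-state model, its probabilities, and the pattern C+ (one maximal run of coding states counted as one gene) are hypothetical choices made for illustration.

```python
import math
import re

# Hypothetical two-state gene model: N = non-coding, C = coding.
STATES = "NC"
START = {"N": 0.5, "C": 0.5}
TRANS = {"N": {"N": 0.8, "C": 0.2}, "C": {"N": 0.2, "C": 0.8}}
EMIT = {"N": {"a": 0.9, "b": 0.1}, "C": {"a": 0.1, "b": 0.9}}


def viterbi(observations):
    """Return the most probable hidden-state path as a string (log-space Viterbi)."""
    scores = {
        s: math.log(START[s]) + math.log(EMIT[s][observations[0]]) for s in STATES
    }
    backpointers = []
    for symbol in observations[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            # Best predecessor state for arriving in state s at this position.
            prev = max(STATES, key=lambda p: scores[p] + math.log(TRANS[p][s]))
            new_scores[s] = (
                scores[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][symbol])
            )
            pointers[s] = prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


def count_pattern(path, pattern):
    """Count non-overlapping occurrences of a regular expression in the decoded path."""
    return len(re.findall(pattern, path))


path = viterbi("aabbbaabb")
gene_count = count_pattern(path, r"C+")
print(path, gene_count)
```

    Counting on a single decoded path in this way is the estimate the abstract describes as potentially biased, which is the motivation for working with the full distribution of the number of occurrences instead.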

  8. Cloning, analysis and functional annotation of expressed sequence tags from the Earthworm Eisenia fetida

    PubMed Central

    Pirooznia, Mehdi; Gong, Ping; Guan, Xin; Inouye, Laura S; Yang, Kuan; Perkins, Edward J; Deng, Youping

    2007-01-01

    Background Eisenia fetida, commonly known as red wiggler or compost worm, belongs to the Lumbricidae family of the Annelida phylum. Little is known about its genome sequence although it has been extensively used as a test organism in terrestrial ecotoxicology. In order to understand its gene expression response to environmental contaminants, we cloned 4032 cDNAs or expressed sequence tags (ESTs) from two E. fetida libraries enriched with genes responsive to ten ordnance related compounds using suppressive subtractive hybridization-PCR. Results A total of 3144 good quality ESTs (GenBank dbEST accession number EH669363–EH672369 and EL515444–EL515580) were obtained from the raw clone sequences after cleaning. Clustering analysis yielded 2231 unique sequences including 448 contigs (from 1361 ESTs) and 1783 singletons. Comparative genomic analysis showed that 743 or 33% of the unique sequences shared high similarity with existing genes in the GenBank nr database. Provisional function annotation assigned 830 Gene Ontology terms to 517 unique sequences based on their homology with the annotated genomes of four model organisms Drosophila melanogaster, Mus musculus, Saccharomyces cerevisiae, and Caenorhabditis elegans. Seven percent of the unique sequences were further mapped to 99 Kyoto Encyclopedia of Genes and Genomes pathways based on their matching Enzyme Commission numbers. All the information is stored and retrievable at a highly performed, web-based and user-friendly relational database called EST model database or ESTMD version 2. Conclusion The ESTMD containing the sequence and annotation information of 4032 E. fetida ESTs is publicly accessible at . PMID:18047730

  9. tRNAscan-SE On-line: integrating search and context for analysis of transfer RNA genes.

    PubMed

    Lowe, Todd M; Chan, Patricia P

    2016-07-08

    High-throughput genome sequencing continues to grow the need for rapid, accurate genome annotation and tRNA genes constitute the largest family of essential, ever-present non-coding RNA genes. Newly developed tRNAscan-SE 2.0 has advanced the state-of-the-art methodology in tRNA gene detection and functional prediction, captured by rich new content of the companion Genomic tRNA Database. Previously, web-server tRNA detection was isolated from knowledge of existing tRNAs and their annotation. In this update of the tRNAscan-SE On-line resource, we tie together improvements in tRNA classification with greatly enhanced biological context via dynamically generated links between web server search results, the most relevant genes in the GtRNAdb and interactive, rich genome context provided by UCSC genome browsers. The tRNAscan-SE On-line web server can be accessed at http://trna.ucsc.edu/tRNAscan-SE/. © The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research.

  10. Linking microarray reporters with protein functions

    PubMed Central

    Gaj, Stan; van Erk, Arie; van Haaften, Rachel IM; Evelo, Chris TA

    2007-01-01

    Background The analysis of microarray experiments requires accurate and up-to-date functional annotation of the microarray reporters to optimize the interpretation of the biological processes involved. Pathway visualization tools are used to connect gene expression data with existing biological pathways by using specific database identifiers that link reporters with elements in the pathways. Results This paper proposes a novel method that aims to improve microarray reporter annotation by BLASTing the original reporter sequences against a species-specific EMBL subset, that was derived from and crosslinked back to the highly curated UniProt database. The resulting alignments were filtered using high quality alignment criteria and further compared with the outcome of a more traditional approach, where reporter sequences were BLASTed against EnsEMBL followed by locating the corresponding protein (UniProt) entry for the high quality hits. Combining the results of both methods resulted in successful annotation of > 58% of all reporter sequences with UniProt IDs on two commercial array platforms, increasing the amount of Incyte reporters that could be coupled to Gene Ontology terms from 32.7% to 58.3% and to a local GenMAPP pathway from 9.6% to 16.7%. For Agilent, 35.3% of the total reporters are now linked towards GO nodes and 7.1% on local pathways. Conclusion Our methods increased the annotation quality of microarray reporter sequences and allowed us to visualize more reporters using pathway visualization tools. Even in cases where the original reporter annotation showed the correct description the new identifiers often allowed improved pathway and Gene Ontology linking. These methods are freely available at http://www.bigcat.unimaas.nl/public/publications/Gaj_Annotation/. PMID:17897448

  11. Considerations to improve functional annotations in biological databases.

    PubMed

    Benítez-Páez, Alfonso

    2009-12-01

    Despite the great effort to design efficient systems allowing the electronic indexation of information concerning genes, proteins, structures, and interactions published daily in scientific journals, some problems are still observed in specific tasks such as functional annotation. The annotation of function is a critical issue for bioinformatic routines, such as functional genomics and the further prediction of unknown protein function, which are highly dependent on the quality of existing annotations. Some information management systems evolve to efficiently incorporate information from large-scale projects, but often, annotation of single records from the literature is difficult and slow. In this short report, functional characterizations of a representative sample of the entire set of uncharacterized proteins from Escherichia coli K12 were compiled from Swiss-Prot, PubMed, and EcoCyc; they demonstrate a functional annotation deficit in biological databases. Some issues are postulated as causes of the lack of annotation, and different solutions are evaluated and proposed to avoid them. The hope is that as a consequence of these observations, there will be new impetus to improve the speed and quality of functional annotation and ultimately provide updated, reliable information to the scientific community.

  12. The Pathway Coexpression Network: Revealing pathway relationships

    PubMed Central

    Tanzi, Rudolph E.

    2018-01-01

    A goal of genomics is to understand the relationships between biological processes. Pathways contribute to functional interplay within biological processes through complex but poorly understood interactions. However, limited functional references for global pathway relationships exist. Pathways from databases such as KEGG and Reactome provide discrete annotations of biological processes. Their relationships are currently either inferred from gene set enrichment within specific experiments, or by simple overlap, linking pathway annotations that have genes in common. Here, we provide a unifying interpretation of functional interaction between pathways by systematically quantifying coexpression between 1,330 canonical pathways from the Molecular Signatures Database (MSigDB) to establish the Pathway Coexpression Network (PCxN). We estimated the correlation between canonical pathways valid in a broad context using a curated collection of 3,207 microarrays from 72 normal human tissues. PCxN accounts for shared genes between annotations to estimate significant correlations between pathways with related functions rather than with similar annotations. We demonstrate that PCxN provides novel insight into mechanisms of complex diseases using an Alzheimer’s Disease (AD) case study. PCxN retrieved pathways significantly correlated with an expert curated AD gene list. These pathways have known associations with AD and were significantly enriched for genes independently associated with AD. As a further step, we show how PCxN complements the results of gene set enrichment methods by revealing relationships between enriched pathways, and by identifying additional highly correlated pathways. PCxN revealed that correlated pathways from an AD expression profiling study include functional clusters involved in cell adhesion and oxidative stress. PCxN provides expanded connections to pathways from the extracellular matrix. 
PCxN provides a powerful new framework for interrogation of global pathway relationships. Comprehensive exploration of PCxN can be performed at http://pcxn.org/. PMID:29554099

  13. IsoSCM: improved and alternative 3′ UTR annotation using multiple change-point inference

    PubMed Central

    Shenker, Sol; Miura, Pedro; Sanfilippo, Piero

    2015-01-01

    Major applications of RNA-seq data include studies of how the transcriptome is modulated at the levels of gene expression and RNA processing, and how these events are related to cellular identity, environmental condition, and/or disease status. While many excellent tools have been developed to analyze RNA-seq data, these generally have limited efficacy for annotating 3′ UTRs. Existing assembly strategies often fragment long 3′ UTRs, and importantly, none of the algorithms in popular use can apportion data into tandem 3′ UTR isoforms, which are frequently generated by alternative cleavage and polyadenylation (APA). Consequently, it is often not possible to identify patterns of differential APA using existing assembly tools. To address these limitations, we present a new method for transcript assembly, Isoform Structural Change Model (IsoSCM) that incorporates change-point analysis to improve the 3′ UTR annotation process. Through evaluation on simulated and genuine data sets, we demonstrate that IsoSCM annotates 3′ termini with higher sensitivity and specificity than can be achieved with existing methods. We highlight the utility of IsoSCM by demonstrating its ability to recover known patterns of tissue-regulated APA. IsoSCM will facilitate future efforts for 3′ UTR annotation and genome-wide studies of the breadth, regulation, and roles of APA leveraging RNA-seq data. The IsoSCM software and source code are available from our website https://github.com/shenkers/isoscm. PMID:25406361
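
    The idea of locating a 3′ terminus as a change point in read coverage can be illustrated with a deliberately simplified sketch. The abstract does not describe IsoSCM's multiple change-point inference, so the code below is not that method: it assumes a single change point and picks the split that minimizes the within-segment squared error. The function names and the coverage values are hypothetical.

```python
def _sse(values):
    """Sum of squared deviations of values from their mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_change_point(coverage):
    """Return the index k (1 <= k < n) that best splits coverage in two.

    The split coverage[:k] / coverage[k:] is chosen to minimize the total
    squared error around each segment's mean. This is a single change-point,
    least-squares illustration only (an assumption, not IsoSCM's model).
    """
    n = len(coverage)
    if n < 2:
        raise ValueError("need at least two positions to place a change point")
    best_k, best_cost = 1, float("inf")
    for k in range(1, n):
        cost = _sse(coverage[:k]) + _sse(coverage[k:])
        if cost < best_cost:
            best_k, best_cost = k, cost
    return best_k


# Hypothetical per-base coverage along a 3' UTR: coverage drops after position 3.
example_coverage = [20, 21, 19, 20, 5, 4, 6, 5]
change_at = best_change_point(example_coverage)  # index of the first low-coverage base
```

    In this toy input the drop from about 20 reads to about 5 reads is placed at index 4, which would correspond to a candidate upstream polyadenylation site in a tandem 3′ UTR.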

  14. JGI Plant Genomics Gene Annotation Pipeline

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Shu, Shengqiang; Rokhsar, Dan; Goodstein, David

    2014-07-14

    Plant genomes vary in size and are highly complex, with a high proportion of repeats, genome duplication and tandem duplication. Genes encode a wealth of information useful in studying organisms, and it is critical to have high-quality, stable gene annotation. Thanks to advances in sequencing technology, the genomes of many plant species have been sequenced, along with their transcriptomes. To use these vast amounts of sequence data to make gene annotation or re-annotation in a timely fashion, an automatic pipeline is needed. The JGI plant genomics gene annotation pipeline, called integrated gene call (IGC), is our effort toward this aim, with the aid of an RNA-seq transcriptome assembly pipeline. It utilizes several gene predictors based on homolog peptides and transcript ORFs. See Methods for detail. Here we present genome annotation of JGI flagship green plants produced by this pipeline, plus Arabidopsis and rice, except for chlamy, which is done by a third party. The genome annotations of these species and others are used in our gene family build pipeline and are accessible via the JGI Phytozome portal, whose URL and front page snapshot are shown below.

  15. RNA Sequencing-Based Genome Reannotation of the Dermatophyte Arthroderma benhamiae and Characterization of Its Secretome and Whole Gene Expression Profile during Infection

    PubMed Central

    De Coi, Niccolò; Feuermann, Marc; Schmid-Siegert, Emanuel; Băguţ, Elena-Tatiana; Mignon, Bernard; Waridel, Patrice; Peter, Corinne; Pradervand, Sylvain

    2016-01-01

    ABSTRACT Dermatophytes are the most common agents of superficial mycoses in humans and animals. The aim of the present investigation was to systematically identify the extracellular, possibly secreted, proteins that are putative virulence factors and antigenic molecules of dermatophytes. A complete gene expression profile of Arthroderma benhamiae was obtained during infection of its natural host (guinea pig) using RNA sequencing (RNA-seq) technology. This profile was completed with those of the fungus cultivated in vitro in two media containing either keratin or soy meal protein as the sole source of nitrogen and in Sabouraud medium. More than 60% of transcripts deduced from RNA-seq data differ from those previously deposited for A. benhamiae. Using these RNA-seq data along with an automatic gene annotation procedure, followed by manual curation, we produced a new annotation of the A. benhamiae genome. This annotation comprised 7,405 coding sequences (CDSs), among which only 2,662 were identical to the currently available annotation, 383 were newly identified, and 15 secreted proteins were manually corrected. The expression profile of genes encoding proteins with a signal peptide in infected guinea pigs was found to be very different from that during in vitro growth when using keratin as the substrate. Especially, the sets of the 12 most highly expressed genes encoding proteases with a signal sequence had only the putative vacuolar aspartic protease gene PEP2 in common, during infection and in keratin medium. The most upregulated gene encoding a secreted protease during infection was that encoding subtilisin SUB6, which is a known major allergen in the related dermatophyte Trichophyton rubrum. IMPORTANCE Dermatophytoses (ringworm, jock itch, athlete’s foot, and nail infections) are the most common fungal infections, but their virulence mechanisms are poorly understood. 
Combining transcriptomic data obtained from growth under various culture conditions with data obtained during infection led to a significantly improved genome annotation. About 65% of the protein-encoding genes predicted with our protocol did not match the existing annotation for A. benhamiae. Comparing gene expression during infection on guinea pigs with keratin degradation in vitro, which is supposed to mimic the host environment, revealed the critical importance of using real in vivo conditions for investigating virulence mechanisms. The analysis of genes expressed in vivo, encoding cell surface and secreted proteins, particularly proteases, led to the identification of new allergen and virulence factor candidates. PMID:27822542

  16. RNA Sequencing-Based Genome Reannotation of the Dermatophyte Arthroderma benhamiae and Characterization of Its Secretome and Whole Gene Expression Profile during Infection.

    PubMed

    Tran, Van Du T; De Coi, Niccolò; Feuermann, Marc; Schmid-Siegert, Emanuel; Băguţ, Elena-Tatiana; Mignon, Bernard; Waridel, Patrice; Peter, Corinne; Pradervand, Sylvain; Pagni, Marco; Monod, Michel

    2016-01-01

    Dermatophytes are the most common agents of superficial mycoses in humans and animals. The aim of the present investigation was to systematically identify the extracellular, possibly secreted, proteins that are putative virulence factors and antigenic molecules of dermatophytes. A complete gene expression profile of Arthroderma benhamiae was obtained during infection of its natural host (guinea pig) using RNA sequencing (RNA-seq) technology. This profile was completed with those of the fungus cultivated in vitro in two media containing either keratin or soy meal protein as the sole source of nitrogen and in Sabouraud medium. More than 60% of transcripts deduced from RNA-seq data differ from those previously deposited for A. benhamiae. Using these RNA-seq data along with an automatic gene annotation procedure, followed by manual curation, we produced a new annotation of the A. benhamiae genome. This annotation comprised 7,405 coding sequences (CDSs), among which only 2,662 were identical to the currently available annotation, 383 were newly identified, and 15 secreted proteins were manually corrected. The expression profile of genes encoding proteins with a signal peptide in infected guinea pigs was found to be very different from that during in vitro growth when using keratin as the substrate. Especially, the sets of the 12 most highly expressed genes encoding proteases with a signal sequence had only the putative vacuolar aspartic protease gene PEP2 in common, during infection and in keratin medium. The most upregulated gene encoding a secreted protease during infection was that encoding subtilisin SUB6, which is a known major allergen in the related dermatophyte Trichophyton rubrum. IMPORTANCE Dermatophytoses (ringworm, jock itch, athlete's foot, and nail infections) are the most common fungal infections, but their virulence mechanisms are poorly understood.
Combining transcriptomic data obtained from growth under various culture conditions with data obtained during infection led to a significantly improved genome annotation. About 65% of the protein-encoding genes predicted with our protocol did not match the existing annotation for A. benhamiae. Comparing gene expression during infection on guinea pigs with keratin degradation in vitro, which is supposed to mimic the host environment, revealed the critical importance of using real in vivo conditions for investigating virulence mechanisms. The analysis of genes expressed in vivo, encoding cell surface and secreted proteins, particularly proteases, led to the identification of new allergen and virulence factor candidates.

  17. dbWFA: a web-based database for functional annotation of Triticum aestivum transcripts

    PubMed Central

    Vincent, Jonathan; Dai, Zhanwu; Ravel, Catherine; Choulet, Frédéric; Mouzeyar, Said; Bouzidi, M. Fouad; Agier, Marie; Martre, Pierre

    2013-01-01

    The functional annotation of genes based on sequence homology with genes from model species genomes is time-consuming because it is necessary to mine several unrelated databases. The aim of the present work was to develop a functional annotation database for common wheat Triticum aestivum (L.). The database, named dbWFA, is based on the reference NCBI UniGene set, an expressed gene catalogue built by expressed sequence tag clustering, and on full-length coding sequences retrieved from the TriFLDB database. Information from good-quality heterogeneous sources, including annotations for model plant species Arabidopsis thaliana (L.) Heynh. and Oryza sativa L., was gathered and linked to T. aestivum sequences through BLAST-based homology searches. Even though the complexity of the transcriptome cannot yet be fully appreciated, we developed a tool to easily and promptly obtain information from multiple functional annotation systems (Gene Ontology, MapMan bin codes, MIPS Functional Categories, PlantCyc pathway reactions and TAIR gene families). The use of dbWFA is illustrated here with several query examples. We were able to assign a putative function to 45% of the UniGenes and 81% of the full-length coding sequences from TriFLDB. Moreover, comparison of the annotation of the whole T. aestivum UniGene set along with curated annotations of the two model species assessed the accuracy of the annotation provided by dbWFA. To further illustrate the use of dbWFA, genes specifically expressed during the early cell division or late storage polymer accumulation phases of T. aestivum grain development were identified using a clustering analysis and then annotated using dbWFA. The annotation of these two sets of genes was consistent with previous analyses of T. aestivum grain transcriptomes and proteomes. Database URL: urgi.versailles.inra.fr/dbWFA/ PMID:23660284

  18. PIGD: a database for intronless genes in the Poaceae.

    PubMed

    Yan, Hanwei; Jiang, Cuiping; Li, Xiaoyu; Sheng, Lei; Dong, Qing; Peng, Xiaojian; Li, Qian; Zhao, Yang; Jiang, Haiyang; Cheng, Beijiu

    2014-10-01

    Intronless genes are a feature of prokaryotes; however, they are widespread and unequally distributed among eukaryotes and represent an important resource to study the evolution of gene architecture. Although many databases on exons and introns exist, there is currently no cohesive database that collects intronless genes in plants into a single database. In this study, we present the Poaceae Intronless Genes Database (PIGD), a user-friendly web interface to explore information on intronless genes from different plants. Five Poaceae species, Sorghum bicolor, Zea mays, Setaria italica, Panicum virgatum and Brachypodium distachyon, are included in the current release of PIGD. Gene annotations and sequence data were collected and integrated from different databases. The primary focus of this study was to provide gene descriptions and gene product records. In addition, functional annotations, subcellular localization prediction and taxonomic distribution are reported. PIGD allows users to readily browse, search and download data. BLAST and comparative analyses are also provided through this online database, which is available at http://pigd.ahau.edu.cn/. PIGD provides a solid platform for the collection, integration and analysis of intronless genes in the Poaceae. As such, this database will be useful for subsequent bio-computational analysis in comparative genomics and evolutionary studies.

  19. Recognition of Protein-coding Genes Based on Z-curve Algorithms

    PubMed Central

    Guo, Feng-Biao; Lin, Yan; Chen, Ling-Ling

    2014-01-01

    Recognition of protein-coding genes, a classical bioinformatics problem, is an essential step in annotating newly sequenced genomes. The Z-curve algorithm, one of the most effective methods for this problem, has been successfully applied in annotating or re-annotating many genomes, including those of bacteria, archaea and viruses. Two Z-curve based ab initio gene-finding programs have been developed: ZCURVE (for bacteria and archaea) and ZCURVE_V (for viruses and phages). ZCURVE_C (for 57 bacteria) and Zfisher (for any bacterium) are web servers for re-annotation of bacterial and archaeal genomes. The above four tools can be used for genome annotation or re-annotation, either independently or combined with other gene-finding programs. In addition to recognizing protein-coding genes and exons, Z-curve algorithms are also effective in recognizing promoters and translation start sites. Here, we summarize the applications of Z-curve algorithms in gene finding and genome annotation. PMID:24822027
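
    The abstract names the Z-curve but does not define it. As a minimal sketch, assuming the standard Z-curve definition from the wider literature (three cumulative coordinates contrasting purine/pyrimidine, amino/keto, and weak/strong hydrogen-bond bases), the basic transform can be written as follows. The function name is hypothetical, and this is only the raw curve, not the feature extraction or classification used by ZCURVE, ZCURVE_V, ZCURVE_C or Zfisher.

```python
def z_curve(seq):
    """Return the Z-curve points (x_n, y_n, z_n) for a DNA sequence.

    With A_n, C_n, G_n, T_n the cumulative base counts up to position n:
      x_n = (A_n + G_n) - (C_n + T_n)   purine vs. pyrimidine
      y_n = (A_n + C_n) - (G_n + T_n)   amino vs. keto
      z_n = (A_n + T_n) - (G_n + C_n)   weak vs. strong hydrogen bonding
    """
    counts = {"A": 0, "C": 0, "G": 0, "T": 0}
    points = []
    for base in seq.upper():
        if base not in counts:
            raise ValueError(f"unsupported base: {base!r}")
        counts[base] += 1
        a, c, g, t = counts["A"], counts["C"], counts["G"], counts["T"]
        points.append((a + g - c - t, a + c - g - t, a + t - g - c))
    return points


# A four-base toy sequence; each base occurs once, so the curve returns to the origin.
toy_points = z_curve("ATGC")  # [(1, 1, 1), (0, 0, 2), (1, -1, 1), (0, 0, 0)]
```

    Because the three coordinates together determine the sequence uniquely, compositional signals such as codon-position bias can be read off the curve; how the published tools turn those signals into gene calls is beyond what the abstract states.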

  20. Automated eukaryotic gene structure annotation using EVidenceModeler and the Program to Assemble Spliced Alignments

    PubMed Central

    Haas, Brian J; Salzberg, Steven L; Zhu, Wei; Pertea, Mihaela; Allen, Jonathan E; Orvis, Joshua; White, Owen; Buell, C Robin; Wortman, Jennifer R

    2008-01-01

    EVidenceModeler (EVM) is presented as an automated eukaryotic gene structure annotation tool that reports eukaryotic gene structures as a weighted consensus of all available evidence. EVM, when combined with the Program to Assemble Spliced Alignments (PASA), yields a comprehensive, configurable annotation system that predicts protein-coding genes and alternatively spliced isoforms. Our experiments on both rice and human genome sequences demonstrate that EVM produces automated gene structure annotation approaching the quality of manual curation. PMID:18190707
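
    The phrase "weighted consensus of all available evidence" can be made concrete with a toy per-base vote. This is a hedged illustration only: the abstract does not describe EVM's scoring or how it assembles gene structures, so the function, the weights and the threshold below are all assumptions, and real gene structure prediction involves far more than a per-base vote.

```python
def weighted_consensus(length, evidence, threshold=0.5):
    """Call each position coding if enough weighted evidence supports it.

    evidence is a list of (weight, intervals) pairs, where intervals is a
    list of half-open (start, end) ranges that one evidence source marks as
    coding. Intervals within one source are assumed not to overlap.
    A position is called coding when its share of the total weight exceeds
    threshold. Hypothetical toy scheme, not EVM's algorithm.
    """
    total = sum(weight for weight, _ in evidence)
    if total <= 0:
        raise ValueError("total evidence weight must be positive")
    support = [0.0] * length
    for weight, intervals in evidence:
        for start, end in intervals:
            for i in range(max(start, 0), min(end, length)):
                support[i] += weight
    return [s / total > threshold for s in support]


# Hypothetical evidence: one heavily weighted transcript alignment and two
# lightly weighted ab initio predictions over a 6-base region.
calls = weighted_consensus(
    6,
    [(5, [(0, 4)]), (1, [(2, 6)]), (1, [(3, 6)])],
)
# calls == [True, True, True, True, False, False]
```

    Giving the transcript alignment a higher weight means its boundary at position 4 wins over the two predictors that extend further, which is the intuition behind weighting evidence types differently.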

  1. Bovine Genome Database: supporting community annotation and analysis of the Bos taurus genome

    PubMed Central

    2010-01-01

    Background A goal of the Bovine Genome Database (BGD; http://BovineGenome.org) has been to support the Bovine Genome Sequencing and Analysis Consortium (BGSAC) in the annotation and analysis of the bovine genome. We were faced with several challenges, including the need to maintain consistent quality despite diversity in annotation expertise in the research community, the need to maintain consistent data formats, and the need to minimize the potential duplication of annotation effort. With new sequencing technologies allowing many more eukaryotic genomes to be sequenced, the demand for collaborative annotation is likely to increase. Here we present our approach, challenges and solutions facilitating a large distributed annotation project. Results and Discussion BGD has provided annotation tools that supported 147 members of the BGSAC in contributing 3,871 gene models over a fifteen-week period, and these annotations have been integrated into the bovine Official Gene Set. Our approach has been to provide an annotation system, which includes a BLAST site, multiple genome browsers, an annotation portal, and the Apollo Annotation Editor configured to connect directly to our Chado database. In addition to implementing and integrating components of the annotation system, we have performed computational analyses to create gene evidence tracks and a consensus gene set, which can be viewed on individual gene pages at BGD. Conclusions We have provided annotation tools that alleviate challenges associated with distributed annotation. Our system provides a consistent set of data to all annotators and eliminates the need for annotators to format data. Involving the bovine research community in genome annotation has allowed us to leverage expertise in various areas of bovine biology to provide biological insight into the genome sequence. PMID:21092105

  2. Redefining the genetics of Murine Gammaherpesvirus 68 via transcriptome-based annotation

    PubMed Central

    Johnson, L. Steven; Willert, Erin K.; Virgin, Herbert W.

    2010-01-01

    Summary Viral genetic studies often focus on large open reading frames (ORFs) identified during genome annotation (ORF-based annotation). Here we provide a tool and software set for defining gene expression by murine gammaherpesvirus 68 (γHV68) nucleotide-by-nucleotide across the 119,450 basepair (bp) genome. These tools allowed us to determine that viral RNA expression was significantly more complex than predicted from ORF-based annotation, including over 73,000 nucleotides of unexpected transcription within 30 expressed genomic regions (EGRs). Approximately 90% of this RNA expression was antisense to genomic regions containing known large ORFs. We verified the existence of novel transcripts in three EGRs using standard methods to validate the approach and determined which parts of the transcriptome depend on protein or viral DNA synthesis. This redefines the genetic map of γHV68, indicates that herpesviruses contain significantly more genetic complexity than predicted from ORF-based genome annotations, and provides new tools and approaches for viral genetic studies. PMID:20542255

  3. Comparative Life Cycle Transcriptomics Revises Leishmania mexicana Genome Annotation and Links a Chromosome Duplication with Parasitism of Vertebrates

    PubMed Central

    Fiebig, Michael; Kelly, Steven; Gluenz, Eva

    2015-01-01

    Leishmania spp. are protozoan parasites that have two principal life cycle stages: the motile promastigote forms that live in the alimentary tract of the sandfly and the amastigote forms, which are adapted to survive and replicate in the harsh conditions of the phagolysosome of mammalian macrophages. Here, we used Illumina sequencing of poly-A selected RNA to characterise and compare the transcriptomes of L. mexicana promastigotes, axenic amastigotes and intracellular amastigotes. These data allowed the production of the first transcriptome evidence-based annotation of gene models for this species, including genome-wide mapping of trans-splice sites and poly-A addition sites. The revised genome annotation encompassed 9,169 protein-coding genes including 936 novel genes as well as modifications to previously existing gene models. Comparative analysis of gene expression across promastigote and amastigote forms revealed that 3,832 genes are differentially expressed between promastigotes and intracellular amastigotes. A large proportion of genes that were downregulated during differentiation to amastigotes were associated with the function of the motile flagellum. In contrast, those genes that were upregulated included cell surface proteins, transporters, peptidases and many uncharacterized genes, including 293 of the 936 novel genes. Genome-wide distribution analysis of the differentially expressed genes revealed that the tetraploid chromosome 30 is highly enriched for genes that were upregulated in amastigotes, providing the first evidence of a link between this whole chromosome duplication event and adaptation to the vertebrate host in this group. Peptide evidence for 42 proteins encoded by novel transcripts supports the idea of an as yet uncharacterised set of small proteins in Leishmania spp. with possible implications for host-pathogen interactions. PMID:26452044

  4. GeneXplorer: an interactive web application for microarray data visualization and analysis.

    PubMed

    Rees, Christian A; Demeter, Janos; Matese, John C; Botstein, David; Sherlock, Gavin

    2004-10-01

    When publishing large-scale microarray datasets, it is of great value to create supplemental websites where either the full data, or selected subsets corresponding to figures within the paper, can be browsed. We set out to create a CGI application containing many of the features of some of the existing standalone software for the visualization of clustered microarray data. We present GeneXplorer, a web application for interactive microarray data visualization and analysis in a web environment. GeneXplorer allows users to browse a microarray dataset in an intuitive fashion. It provides simple access to microarray data over the Internet and uses only HTML and JavaScript to display graphic and annotation information. It provides radar and zoom views of the data, allows display of the nearest neighbors to a gene expression vector based on their Pearson correlations and provides the ability to search gene annotation fields. The software is released under the permissive MIT Open Source license, and the complete documentation and the entire source code are freely available for download from CPAN http://search.cpan.org/dist/Microarray-GeneXplorer/.
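
    The nearest-neighbor view mentioned above ranks genes by the Pearson correlation of their expression vectors with a query gene. A minimal sketch of that ranking is given below, assuming NumPy is available; the function name and the toy matrix are hypothetical, and this is not GeneXplorer's own code (which the abstract describes only as a CGI application distributed via CPAN).

```python
import numpy as np


def nearest_neighbors(expr, query, k=3):
    """Return the k rows of expr most Pearson-correlated with row `query`.

    expr is a genes x samples matrix. Each row is mean-centered and scaled
    to unit length, so a dot product between two rows equals their Pearson
    correlation. Constant rows (zero variance) get correlation 0.
    Returns a list of (row_index, correlation) pairs, best first.
    """
    expr = np.asarray(expr, dtype=float)
    centered = expr - expr.mean(axis=1, keepdims=True)
    norms = np.linalg.norm(centered, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for constant rows
    unit = centered / norms
    r = unit @ unit[query]
    order = [int(i) for i in np.argsort(-r) if i != query][:k]
    return [(i, float(r[i])) for i in order]


# Toy expression matrix: 4 genes measured in 4 samples (hypothetical values).
toy_expr = [
    [1, 2, 3, 4],  # gene 0 (query)
    [2, 4, 6, 8],  # gene 1: perfectly correlated with gene 0
    [4, 3, 2, 1],  # gene 2: perfectly anti-correlated
    [1, 3, 2, 4],  # gene 3: correlation 0.8 with gene 0
]
neighbors = [(i, round(r, 3)) for i, r in nearest_neighbors(toy_expr, 0, k=2)]
# neighbors == [(1, 1.0), (3, 0.8)]
```

    Normalizing the rows once makes the whole ranking a single matrix-vector product, which keeps interactive queries cheap even for large datasets.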

  5. The aquatic animals' transcriptome resource for comparative functional analysis.

    PubMed

    Chou, Chih-Hung; Huang, Hsi-Yuan; Huang, Wei-Chih; Hsu, Sheng-Da; Hsiao, Chung-Der; Liu, Chia-Yu; Chen, Yu-Hung; Liu, Yu-Chen; Huang, Wei-Yun; Lee, Meng-Lin; Chen, Yi-Chang; Huang, Hsien-Da

    2018-05-09

    Aquatic animals have great economic and ecological importance. Among them, non-model organisms have been studied regarding eco-toxicity, stress biology, and environmental adaptation. Due to recent advances in next-generation sequencing techniques, large amounts of RNA-seq data for aquatic animals are publicly available. However, there is currently no comprehensive resource for the analysis, unification, and integration of these datasets. This study utilizes computational approaches to build a new resource of transcriptomic maps for aquatic animals. This aquatic animal transcriptome map database, dbATM, provides de novo transcriptome assembly, gene annotation and comparative analysis of more than twenty aquatic organisms without a draft genome. To improve the assembly quality, three computational tools (Trinity, Oases and SOAPdenovo-Trans) were employed to enhance individual transcriptome assembly, and CAP3 and CD-HIT-EST software were then used to merge these three assembled transcriptomes. In addition, functional annotation analysis provides valuable clues to gene characteristics, including full-length transcript coding regions, conserved domains, gene ontology and KEGG pathways. Furthermore, all aquatic animal genes are essential for comparative genomics tasks such as constructing homologous gene groups and BLAST databases and phylogenetic analysis. In conclusion, we establish a resource for non-model aquatic animals, which are of great economic and ecological importance, and provide transcriptomic information including functional annotation and comparative transcriptome analysis. The database is now publicly accessible through the URL http://dbATM.mbc.nctu.edu.tw/.

  6. Integrative Functional Genomics for Systems Genetics in GeneWeaver.org.

    PubMed

    Bubier, Jason A; Langston, Michael A; Baker, Erich J; Chesler, Elissa J

    2017-01-01

    The abundance of existing functional genomics studies permits an integrative approach to interpreting and resolving the results of diverse systems genetics studies. However, a major challenge lies in assembling and harmonizing heterogeneous data sets across species for facile comparison to the positional candidate genes and coexpression networks that come from systems genetic studies. GeneWeaver is an online database and suite of tools at www.geneweaver.org that allows for fast aggregation and analysis of gene set-centric data. GeneWeaver contains curated experimental data together with resource-level data such as GO annotations, MP annotations, and KEGG pathways, along with persistent stores of user-entered data sets. These can be entered directly into GeneWeaver or transferred from widely used resources such as GeneNetwork.org. Data are analyzed using statistical tools and advanced graph algorithms to discover new relations, prioritize candidate genes, and generate function hypotheses. Here we use GeneWeaver to find genes common to multiple gene sets, prioritize candidate genes from a quantitative trait locus, and characterize a set of differentially expressed genes. Coupling a large multispecies repository of curated and empirical functional genomics data to fast computational tools allows for the rapid integrative analysis of heterogeneous data for interpreting and extrapolating systems genetics results.
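
    Finding genes common to multiple gene sets, one of the tasks listed above, reduces at its simplest to counting set membership. The sketch below shows that counting step only; it does not use or imitate the GeneWeaver interface, and the function name, set labels and gene identifiers are hypothetical placeholders.

```python
from collections import Counter


def genes_shared_by(gene_sets, min_sets=2):
    """Return the genes that occur in at least min_sets of the given sets.

    gene_sets maps a set label to an iterable of gene identifiers.
    Each gene is counted once per set, and the result is sorted by name.
    """
    counts = Counter(gene for members in gene_sets.values() for gene in set(members))
    return sorted(gene for gene, n in counts.items() if n >= min_sets)


# Hypothetical gene sets, e.g. QTL candidates, differentially expressed
# genes, and genes carrying one annotation term.
toy_sets = {
    "qtl_candidates": {"Gene1", "Gene2", "Gene3"},
    "diff_expressed": {"Gene2", "Gene3", "Gene4"},
    "annotated_term": {"Gene3", "Gene5"},
}
in_two_or_more = genes_shared_by(toy_sets, min_sets=2)  # ['Gene2', 'Gene3']
in_all_three = genes_shared_by(toy_sets, min_sets=3)    # ['Gene3']
```

    Ranking genes by the number of independent sets that contain them is one simple way to prioritize candidates; cross-species comparison would additionally require mapping identifiers through homology, which this sketch leaves out.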

  7. A Genome-Wide Identification of Basic Helix-Loop-Helix Motifs in Pediculus humanus corporis (Phthiraptera: Pediculidae)

    PubMed Central

    Wang, Xu-Hua; Wang, Yong; Zhang, De-Bao; Liu, A-Ke; Yao, Qin; Chen, Ke-Ping

    2014-01-01

    Abstract Basic helix-loop-helix (bHLH) proteins comprise a large superfamily of transcription factors, which are involved in the regulation of various developmental processes. bHLH family members are widely distributed in various eukaryotes including yeast, fruit fly, zebrafish, mouse, and human. In this study, we identified 55 bHLH motifs encoded in the genome sequence of the human body louse, Pediculus humanus corporis (Phthiraptera: Pediculidae). Phylogenetic analyses of the identified P. humanus corporis bHLH (PhcbHLH) motifs revealed that there are 23, 11, 9, 1, 10, and 1 member(s) in groups A, B, C, D, E, and F, respectively. Examination of GenBank annotations of the 55 PhcbHLH members indicated that 29 PhcbHLH proteins were annotated consistently with our analytical result, 8 were annotated differently from our analytical result, 12 were merely annotated as hypothetical proteins, and the remaining 6 were not deposited in GenBank. A comparison of insect bHLH gene composition revealed that the human body louse possibly has more hairy and E(spl) genes than other insect species. Because hairy and E(spl) genes have been found to negatively regulate the differentiation of insect preneural cells, it is suggested that the existence of additional hairy and E(spl) genes in the human body louse is probably the consequence of its long-term adaptation to a relatively dark and stable environment. These data provide good references for further studies on regulatory functions of bHLH proteins in the growth and development of the human body louse. PMID:25434030

  8. Measuring semantic similarities by combining gene ontology annotations and gene co-function networks

    DOE PAGES

    Peng, Jiajie; Uygun, Sahra; Kim, Taehyong; ...

    2015-02-14

    Background: Gene Ontology (GO) has been used widely to study functional relationships between genes. The current semantic similarity measures rely only on GO annotations and GO structure. This limits the power of GO-based similarity because of the limited proportion of genes that are annotated to GO in most organisms. Results: We introduce a novel approach called NETSIM (network-based similarity measure) that incorporates information from gene co-function networks in addition to using the GO structure and annotations. Using metabolic reaction maps of yeast, Arabidopsis, and human, we demonstrate that NETSIM can improve the accuracy of GO term similarities. We also demonstrate that NETSIM works well even for genomes with sparser gene annotation data. We applied NETSIM on large Arabidopsis gene families such as cytochrome P450 monooxygenases to group the members functionally and show that this grouping could facilitate functional characterization of genes in these families. Conclusions: Using NETSIM as an example, we demonstrated that the performance of a semantic similarity measure could be significantly improved after incorporating genome-specific information. NETSIM incorporates both GO annotations and gene co-function network data as a priori knowledge in the model. Therefore, functional similarities of GO terms that are not explicitly encoded in GO but are relevant in a taxon-specific manner become measurable when GO annotations are limited.

  9. GONUTS: the Gene Ontology Normal Usage Tracking System

    PubMed Central

    Renfro, Daniel P.; McIntosh, Brenley K.; Venkatraman, Anand; Siegele, Deborah A.; Hu, James C.

    2012-01-01

    The Gene Ontology Normal Usage Tracking System (GONUTS) is a community-based browser and usage guide for Gene Ontology (GO) terms and a community system for general GO annotation of proteins. GONUTS uses wiki technology to allow registered users to share and edit notes on the use of each term in GO, and to contribute annotations for specific genes of interest. By providing a site for generation of third-party documentation at the granularity of individual terms, GONUTS complements the official documentation of the Gene Ontology Consortium. To provide examples for community users, GONUTS displays the complete GO annotations from seven model organisms: Saccharomyces cerevisiae, Dictyostelium discoideum, Caenorhabditis elegans, Drosophila melanogaster, Danio rerio, Mus musculus and Arabidopsis thaliana. To support community annotation, GONUTS allows automated creation of gene pages for gene products in UniProt. GONUTS will improve the consistency of annotation efforts across genome projects, and should be useful in training new annotators and consumers in the production of GO annotations and the use of GO terms. GONUTS can be accessed at http://gowiki.tamu.edu. The source code for generating the content of GONUTS is available upon request. PMID:22110029

  10. CodingQuarry: highly accurate hidden Markov model gene prediction in fungal genomes using RNA-seq transcripts.

    PubMed

    Testa, Alison C; Hane, James K; Ellwood, Simon R; Oliver, Richard P

    2015-03-11

    The impact of gene annotation quality on functional and comparative genomics makes gene prediction an important process, particularly in non-model species, including many fungi. Sets of homologous protein sequences are rarely complete with respect to the fungal species of interest and are often small or unreliable, especially when closely related species have not been sequenced or annotated in detail. In these cases, protein homology-based evidence fails to correctly annotate many genes, or significantly improve ab initio predictions. Generalised hidden Markov models (GHMM) have proven to be invaluable tools in gene annotation and, recently, RNA-seq has emerged as a cost-effective means to significantly improve the quality of automated gene annotation. As these methods do not require sets of homologous proteins, improving gene prediction from these resources is of benefit to fungal researchers. While many pipelines now incorporate RNA-seq data in training GHMMs, there has been relatively little investigation into additionally combining RNA-seq data at the point of prediction, and room for improvement in this area motivates this study. CodingQuarry is a highly accurate, self-training GHMM fungal gene predictor designed to work with assembled, aligned RNA-seq transcripts. RNA-seq data informs annotations both during gene-model training and in prediction. Our approach capitalises on the high quality of fungal transcript assemblies by incorporating predictions made directly from transcript sequences. Correct predictions are made despite transcript assembly problems, including those caused by overlap between the transcripts of adjacent gene loci. Stringent benchmarking against high-confidence annotation subsets showed CodingQuarry predicted 91.3% of Schizosaccharomyces pombe genes and 90.4% of Saccharomyces cerevisiae genes perfectly. These results are 4-5% better than those of AUGUSTUS, the next best performing RNA-seq driven gene predictor tested. Comparisons against whole genome Sc. pombe and S. cerevisiae annotations further substantiate a 4-5% improvement in the number of correctly predicted genes. We demonstrate the success of a novel method of incorporating RNA-seq data into GHMM fungal gene prediction. This shows that a high quality annotation can be achieved without relying on protein homology or a training set of genes. CodingQuarry is freely available ( https://sourceforge.net/projects/codingquarry/ ), and suitable for incorporation into genome annotation pipelines.

  11. Mycobacteriophage genome database.

    PubMed

    Joseph, Jerrine; Rajendran, Vasanthi; Hassan, Sameer; Kumar, Vanaja

    2011-01-01

    Mycobacteriophage genome database (MGDB) is an exclusive repository of the 64 completely sequenced mycobacteriophages with annotated information. It is a comprehensive compilation of the various gene parameters captured from several databases pooled together to empower mycobacteriophage researchers. The MGDB (Version No.1.0) comprises of 6086 genes from 64 mycobacteriophages classified into 72 families based on ACLAME database. Manual curation was aided by information available from public databases which was enriched further by analysis. Its web interface allows browsing as well as querying the classification. The main objective is to collect and organize the complexity inherent to mycobacteriophage protein classification in a rational way. The other objective is to browse the existing and new genomes and describe their functional annotation. The database is available for free at http://mpgdb.ibioinformatics.org/mpgdb.php.

  12. TriAnnot: A Versatile and High Performance Pipeline for the Automated Annotation of Plant Genomes

    PubMed Central

    Leroy, Philippe; Guilhot, Nicolas; Sakai, Hiroaki; Bernard, Aurélien; Choulet, Frédéric; Theil, Sébastien; Reboux, Sébastien; Amano, Naoki; Flutre, Timothée; Pelegrin, Céline; Ohyanagi, Hajime; Seidel, Michael; Giacomoni, Franck; Reichstadt, Mathieu; Alaux, Michael; Gicquello, Emmanuelle; Legeai, Fabrice; Cerutti, Lorenzo; Numa, Hisataka; Tanaka, Tsuyoshi; Mayer, Klaus; Itoh, Takeshi; Quesneville, Hadi; Feuillet, Catherine

    2012-01-01

    In support of the international effort to obtain a reference sequence of the bread wheat genome and to provide plant communities dealing with large and complex genomes with a versatile, easy-to-use online automated tool for annotation, we have developed the TriAnnot pipeline. Its modular architecture allows for the annotation and masking of transposable elements, the structural, and functional annotation of protein-coding genes with an evidence-based quality indexing, and the identification of conserved non-coding sequences and molecular markers. The TriAnnot pipeline is parallelized on a 712 CPU computing cluster that can run a 1-Gb sequence annotation in less than 5 days. It is accessible through a web interface for small scale analyses or through a server for large scale annotations. The performance of TriAnnot was evaluated in terms of sensitivity, specificity, and general fitness using curated reference sequence sets from rice and wheat. In less than 8 h, TriAnnot was able to predict more than 83% of the 3,748 CDS from rice chromosome 1 with a fitness of 67.4%. On a set of 12 reference Mb-sized contigs from wheat chromosome 3B, TriAnnot predicted and annotated 93.3% of the genes among which 54% were perfectly identified in accordance with the reference annotation. It also allowed the curation of 12 genes based on new biological evidences, increasing the percentage of perfect gene prediction to 63%. TriAnnot systematically showed a higher fitness than other annotation pipelines that are not improved for wheat. As it is easily adaptable to the annotation of other plant genomes, TriAnnot should become a useful resource for the annotation of large and complex genomes in the future. PMID:22645565

  13. Quality of Computationally Inferred Gene Ontology Annotations

    PubMed Central

    Škunca, Nives; Altenhoff, Adrian; Dessimoz, Christophe

    2012-01-01

    Gene Ontology (GO) has established itself as the undisputed standard for protein function annotation. Most annotations are inferred electronically, i.e. without individual curator supervision, but they are widely considered unreliable. At the same time, we crucially depend on those automated annotations, as most newly sequenced genomes are non-model organisms. Here, we introduce a methodology to systematically and quantitatively evaluate electronic annotations. By exploiting changes in successive releases of the UniProt Gene Ontology Annotation database, we assessed the quality of electronic annotations in terms of specificity, reliability, and coverage. Overall, we not only found that electronic annotations have significantly improved in recent years, but also that their reliability now rivals that of annotations inferred by curators when they use evidence other than experiments from primary literature. This work provides the means to identify the subset of electronic annotations that can be relied upon—an important outcome given that >98% of all annotations are inferred without direct curation. PMID:22693439

  14. Metagenomic gene annotation by a homology-independent approach

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Froula, Jeff; Zhang, Tao; Salmeen, Annette

    2011-06-02

    Fully understanding the genetic potential of a microbial community requires functional annotation of all the genes it encodes. The recently developed deep metagenome sequencing approach has enabled rapid identification of millions of genes from a complex microbial community without cultivation. Current homology-based gene annotation fails to detect distantly-related or structural homologs. Furthermore, homology searches with millions of genes are very computationally intensive. To overcome these limitations, we developed rhModeller, a homology-independent software pipeline to efficiently annotate genes from metagenomic sequencing projects. Using cellulases and carbonic anhydrases as two independent test cases, we demonstrated that rhModeller is much faster than HMMER but with comparable accuracy, at 94.5% and 99.9% accuracy, respectively. More importantly, rhModeller has the ability to detect novel proteins that do not share significant homology to any known protein families. As ~50% of the 2 million genes derived from the cow rumen metagenome failed to be annotated based on sequence homology, we tested whether rhModeller could be used to annotate these genes. Preliminary results suggest that rhModeller is robust in the presence of missense and frameshift mutations, two common errors in metagenomic genes. Applying the pipeline to the cow rumen genes identified 4,990 novel cellulase candidates and 8,196 novel carbonic anhydrase candidates. In summary, we expect rhModeller to dramatically increase the speed and quality of metagenomic gene annotation.

  15. Discovery of numerous novel small genes in the intergenic regions of the Escherichia coli O157:H7 Sakai genome

    PubMed Central

    Hücker, Sarah M.; Ardern, Zachary; Goldberg, Tatyana; Schafferhans, Andrea; Bernhofer, Michael; Vestergaard, Gisle; Nelson, Chase W.; Schloter, Michael; Rost, Burkhard; Scherer, Siegfried

    2017-01-01

    In the past, short protein-coding genes were often disregarded by genome annotation pipelines. Transcriptome sequencing (RNAseq) signals outside of annotated genes have usually been interpreted to indicate either ncRNA or pervasive transcription. Therefore, in addition to the transcriptome, the translatome (RIBOseq) of the enteric pathogen Escherichia coli O157:H7 strain Sakai was determined at two optimal growth conditions and a severe stress condition combining low temperature and high osmotic pressure. All intergenic open reading frames potentially encoding a protein of ≥ 30 amino acids were investigated with regard to coverage by transcription and translation signals and their translatability expressed by the ribosomal coverage value. This led to discovery of 465 unique, putative novel genes not yet annotated in this E. coli strain, which are evenly distributed over both DNA strands of the genome. For 255 of the novel genes, annotated homologs in other bacteria were found, and a machine-learning algorithm, trained on small protein-coding E. coli genes, predicted that 89% of these translated open reading frames represent bona fide genes. The remaining 210 putative novel genes without annotated homologs were compared to the 255 novel genes with homologs and to 250 short annotated genes of this E. coli strain. All three groups turned out to be similar with respect to their translatability distribution, fractions of differentially regulated genes, secondary structure composition, and the distribution of evolutionary constraint, suggesting that both novel groups represent legitimate genes. However, the machine-learning algorithm only recognized a small fraction of the 210 genes without annotated homologs. It is possible that these genes represent a novel group of genes, which have unusual features dissimilar to the genes of the machine-learning algorithm training set. PMID:28902868
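
    The ORF screen described in the abstract above (open reading frames potentially encoding a protein of ≥ 30 amino acids) can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names are hypothetical, only ATG is treated as a start codon (bacteria also use alternative starts), and the restriction to intergenic regions and the RNAseq/RIBOseq coverage filters are not modeled.

```python
# Hypothetical illustration: scan both strands of a DNA string for ORFs
# (ATG ... stop codon) that encode at least `min_aa` amino acids.
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOPS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Return the reverse complement of an uppercase ACGT string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_aa=30):
    """Return (strand, start, end, n_aa) tuples.

    Coordinates are 0-based, end-exclusive, and relative to the strand
    being scanned; n_aa counts the start codon but not the stop codon.
    """
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None  # position of the first ATG since the last stop
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOPS:
                    n_aa = (i - start) // 3
                    if n_aa >= min_aa:
                        orfs.append((strand, start, i + 3, n_aa))
                    start = None
    return orfs
```

    Taking the first ATG after a stop yields the longest ORF per stop codon, which is one common convention; a real screen would then intersect these candidates with existing annotations and expression evidence.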

  16. Mutant phenotypes for thousands of bacterial genes of unknown function

    DOE PAGES

    Price, Morgan N.; Wetmore, Kelly M.; Waters, R. Jordan; ...

    2018-05-16

    One-third of all protein-coding genes from bacterial genomes cannot be annotated with a function. Here, to investigate the functions of these genes, we present genome-wide mutant fitness data from 32 diverse bacteria across dozens of growth conditions. We identified mutant phenotypes for 11,779 protein-coding genes that had not been annotated with a specific function. Many genes could be associated with a specific condition because the gene affected fitness only in that condition, or with another gene in the same bacterium because they had similar mutant phenotypes. Of the poorly annotated genes, 2,316 had associations that have high confidence because they are conserved in other bacteria. By combining these conserved associations with comparative genomics, we identified putative DNA repair proteins; in addition, we propose specific functions for poorly annotated enzymes and transporters and for uncharacterized protein families. Lastly, our study demonstrates the scalability of microbial genetics and its utility for improving gene annotations.

  17. Mutant phenotypes for thousands of bacterial genes of unknown function

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Price, Morgan N.; Wetmore, Kelly M.; Waters, R. Jordan

    One-third of all protein-coding genes from bacterial genomes cannot be annotated with a function. Here, to investigate the functions of these genes, we present genome-wide mutant fitness data from 32 diverse bacteria across dozens of growth conditions. We identified mutant phenotypes for 11,779 protein-coding genes that had not been annotated with a specific function. Many genes could be associated with a specific condition because the gene affected fitness only in that condition, or with another gene in the same bacterium because they had similar mutant phenotypes. Of the poorly annotated genes, 2,316 had associations that have high confidence because they are conserved in other bacteria. By combining these conserved associations with comparative genomics, we identified putative DNA repair proteins; in addition, we propose specific functions for poorly annotated enzymes and transporters and for uncharacterized protein families. Lastly, our study demonstrates the scalability of microbial genetics and its utility for improving gene annotations.

  18. Integrative structural annotation of de novo RNA-Seq provides an accurate reference gene set of the enormous genome of the onion (Allium cepa L.)

    PubMed Central

    Kim, Seungill; Kim, Myung-Shin; Kim, Yong-Min; Yeom, Seon-In; Cheong, Kyeongchae; Kim, Ki-Tae; Jeon, Jongbum; Kim, Sunggil; Kim, Do-Sun; Sohn, Seong-Han; Lee, Yong-Hwan; Choi, Doil

    2015-01-01

    The onion (Allium cepa L.) is one of the most widely cultivated and consumed vegetable crops in the world. Although a considerable amount of onion transcriptome data has been deposited into public databases, the sequences of the protein-coding genes are not accurate enough to be used, owing to non-coding sequences intermixed with the coding sequences. We generated a high-quality, annotated onion transcriptome from de novo sequence assembly and intensive structural annotation using the integrated structural gene annotation pipeline (ISGAP), which identified 54,165 protein-coding genes among 165,179 assembled transcripts totalling 203.0 Mb by eliminating the intron sequences. ISGAP performed reliable annotation, recognizing accurate gene structures based on reference proteins, and ab initio gene models of the assembled transcripts. Integrative functional annotation and gene-based SNP analysis revealed a whole biological repertoire of genes and transcriptomic variation in the onion. The method developed in this study provides a powerful tool for the construction of reference gene sets for organisms based solely on de novo transcriptome data. Furthermore, the reference genes and their variation described here for the onion represent essential tools for molecular breeding and gene cloning in Allium spp. PMID:25362073

  19. Analysis of disease-associated objects at the Rat Genome Database

    PubMed Central

    Wang, Shur-Jen; Laulederkind, Stanley J. F.; Hayman, G. T.; Smith, Jennifer R.; Petri, Victoria; Lowry, Timothy F.; Nigam, Rajni; Dwinell, Melinda R.; Worthey, Elizabeth A.; Munzenmaier, Diane H.; Shimoyama, Mary; Jacob, Howard J.

    2013-01-01

    The Rat Genome Database (RGD) is the premier resource for genetic, genomic and phenotype data for the laboratory rat, Rattus norvegicus. In addition to organizing biological data from rats, the RGD team focuses on manual curation of gene–disease associations for rat, human and mouse. In this work, we have analyzed disease-associated strains, quantitative trait loci (QTL) and genes from rats. These disease objects form the basis for seven disease portals. Among disease portals, the cardiovascular disease and obesity/metabolic syndrome portals have the highest number of rat strains and QTL. These two portals share 398 rat QTL, and these shared QTL are highly concentrated on rat chromosomes 1 and 2. For disease-associated genes, we performed gene ontology (GO) enrichment analysis across portals using RatMine enrichment widgets. Fifteen GO terms, five from each GO aspect, were selected to profile enrichment patterns of each portal. Of the selected biological process (BP) terms, ‘regulation of programmed cell death’ was the top enriched term across all disease portals except in the obesity/metabolic syndrome portal where ‘lipid metabolic process’ was the most enriched term. ‘Cytosol’ and ‘nucleus’ were common cellular component (CC) annotations for disease genes, but only the cancer portal genes were highly enriched with ‘nucleus’ annotations. Similar enrichment patterns were observed in a parallel analysis using the DAVID functional annotation tool. The relationship between the preselected 15 GO terms and disease terms was examined reciprocally by retrieving rat genes annotated with these preselected terms. The individual GO term–annotated gene list showed enrichment in physiologically related diseases. For example, the ‘regulation of blood pressure’ genes were enriched with cardiovascular disease annotations, and the ‘lipid metabolic process’ genes with obesity annotations. Furthermore, we were able to enhance enrichment of neurological diseases by combining ‘G-protein coupled receptor binding’ annotated genes with ‘protein kinase binding’ annotated genes. Database URL: http://rgd.mcw.edu PMID:23794737
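
    As a rough illustration of what a GO enrichment analysis such as the one in the abstract above computes, the sketch below uses a one-sided hypergeometric test, a common choice for over-representation analysis. The abstract does not state which statistic the RatMine widgets or DAVID apply, so the test itself, the function name and the parameter names are assumptions, and no multiple-testing correction is included.

```python
from math import comb


def hypergeom_enrichment_p(population, annotated, sample, hits):
    """One-sided p-value P(X >= hits) for a GO term.

    population: number of genes in the background set
    annotated:  background genes carrying the GO term
    sample:     number of genes in the list under study (e.g. one portal)
    hits:       genes in that list carrying the GO term
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    # Sum the probabilities of drawing `hits` or more annotated genes
    # when `sample` genes are drawn without replacement.
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total
```

    A small p-value suggests the term is over-represented in the gene list relative to the background; in practice the p-values for all tested terms would be corrected for multiple testing before interpretation.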

  20. Unique core genomes of the bacterial family vibrionaceae: insights into niche adaptation and speciation.

    PubMed

    Kahlke, Tim; Goesmann, Alexander; Hjerde, Erik; Willassen, Nils Peder; Haugen, Peik

    2012-05-10

    The criteria for defining bacterial species and even the concept of bacterial species itself are under debate, and the discussion is apparently intensifying as more genome sequence data is becoming available. However, it is still unclear how the new advances in genomics should be used most efficiently to address this question. In this study we identify genes that are common to any group of genomes in our dataset, to determine whether genes specific to a particular taxon exist and to investigate their potential role in adaptation of bacteria to their specific niche. These genes were named unique core genes. Additionally, we investigate the existence and importance of unique core genes that are found in isolates of phylogenetically non-coherent groups. These groups of isolates, that share a genetic feature without sharing a closest common ancestor, are termed genophyletic groups. The bacterial family Vibrionaceae was used as the model, and we compiled and compared genome sequences of 64 different isolates. Using the software orthoMCL we determined clusters of homologous genes among the investigated genome sequences. We used multilocus sequence analysis to build a host phylogeny and mapped the numbers of unique core genes of all distinct groups of isolates onto the tree. The results show that unique core genes are more likely to be found in monophyletic groups of isolates. Genophyletic groups of isolates, in contrast, are less common, especially for large groups of isolates. The subsequent annotation of unique core genes that are present in genophyletic groups indicates a high degree of horizontally transferred genes. Finally, the annotation of the unique core genes of Vibrio cholerae revealed genes involved in aerotaxis and biosynthesis of the iron-chelator vibriobactin. The presented work indicates that genes specific for any taxon inside the bacterial family Vibrionaceae exist. These unique core genes encode conserved metabolic functions that can shed light on the adaptation of a species to its ecological niche. Additionally, our study suggests that unique core genes can be used to aid classification of bacteria and contribute to a bacterial species definition on a genomic level. Furthermore, these genes may be of importance in clinical diagnostics and drug development.

  1. DynGO: a tool for visualizing and mining of Gene Ontology and its associations

    PubMed Central

    Liu, Hongfang; Hu, Zhang-Zhi; Wu, Cathy H

    2005-01-01

    Background: A large volume of data and information about genes and gene products has been stored in various molecular biology databases. A major challenge for knowledge discovery using these databases is to identify related genes and gene products in disparate databases. The development of Gene Ontology (GO) as a common vocabulary for annotation allows integrated queries across multiple databases and identification of semantically related genes and gene products (i.e., genes and gene products that have similar GO annotations). Meanwhile, dozens of tools have been developed for browsing, mining or editing GO terms, their hierarchical relationships, or their "associated" genes and gene products (i.e., genes and gene products annotated with GO terms). Tools that allow users to directly search and inspect relations among all GO terms and their associated genes and gene products from multiple databases are needed. Results: We present a standalone package called DynGO, which provides several advanced functionalities in addition to the standard browsing capability of the official GO browsing tool (AmiGO). DynGO allows users to conduct batch retrieval of GO annotations for a list of genes and gene products, and semantic retrieval of genes and gene products sharing similar GO annotations. The results are shown in an association tree organized according to GO hierarchies and supported with many dynamic display options such as sorting tree nodes or changing orientation of the tree. For GO curators and frequent GO users, DynGO provides fast and convenient access to GO annotation data. DynGO is generally applicable to any data set where the records are annotated with GO terms, as illustrated by two examples. Conclusion: We have presented a standalone package DynGO that provides functionalities to search and browse GO and its association databases as well as several additional functions such as batch retrieval and semantic retrieval. The complete documentation and software are freely available for download from the website. PMID:16091147

  2. WikiGenomes: an open web application for community consumption and curation of gene annotation data in Wikidata

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Putman, Tim E.; Lelong, Sebastien; Burgstaller-Muehlbacher, Sebastian

    With the advancement of genome-sequencing technologies, new genomes are being sequenced daily. Although these sequences are deposited in publicly available data warehouses, their functional and genomic annotations (beyond genes which are predicted automatically) mostly reside in the text of primary publications. Professional curators are hard at work extracting those annotations from the literature for the most studied organisms and depositing them in structured databases. However, the resources don’t exist to fund the comprehensive curation of the thousands of newly sequenced organisms in this manner. Here, we describe WikiGenomes (wikigenomes.org), a web application that facilitates the consumption and curation of genomic data by the entire scientific community. WikiGenomes is based on Wikidata, an openly editable knowledge graph with the goal of aggregating published knowledge into a free and open database. WikiGenomes empowers the individual genomic researcher to contribute their expertise to the curation effort and integrates the knowledge into Wikidata, enabling it to be accessed by anyone without restriction.

  3. HopBase: a unified resource for Humulus genomics

    PubMed Central

    Hill, Steven T.; Sudarsanam, Ramcharan

    2017-01-01

    Hop (Humulus lupulus L. var lupulus) is a dioecious plant of worldwide significance, used primarily for bittering and flavoring in brewing beer. Studies on the medicinal properties of several unique compounds produced by hop have led to additional interest from pharmacy and healthcare industries as well as livestock production as a natural antibiotic. Genomic research in hop has resulted in a published draft genome and transcriptome assemblies. As research into the genomics of hop has gained interest, there is a critical need for centralized online genomic resources. To support the growing research community, we report the development of an online resource "HopBase.org." In addition to providing a gene annotation to the existing Shinsuwase draft genome, HopBase makes available genome assemblies and annotations for both the cultivar “Teamaker” and male hop accession number USDA 21422M. These genome assemblies, gene annotations, along with other common data, coupled with a genome browser and BLAST database enable the hop community to enter the genomic age. The HopBase genomic resource is accessible at http://hopbase.org and http://hopbase.cgrb.oregonstate.edu. PMID:28415075

  4. WikiGenomes: an open web application for community consumption and curation of gene annotation data in Wikidata

    DOE PAGES

    Putman, Tim E.; Lelong, Sebastien; Burgstaller-Muehlbacher, Sebastian; ...

    2017-03-06

    With the advancement of genome-sequencing technologies, new genomes are being sequenced daily. Although these sequences are deposited in publicly available data warehouses, their functional and genomic annotations (beyond genes which are predicted automatically) mostly reside in the text of primary publications. Professional curators are hard at work extracting those annotations from the literature for the most studied organisms and depositing them in structured databases. However, the resources don’t exist to fund the comprehensive curation of the thousands of newly sequenced organisms in this manner. Here, we describe WikiGenomes (wikigenomes.org), a web application that facilitates the consumption and curation of genomic data by the entire scientific community. WikiGenomes is based on Wikidata, an openly editable knowledge graph with the goal of aggregating published knowledge into a free and open database. WikiGenomes empowers the individual genomic researcher to contribute their expertise to the curation effort and integrates the knowledge into Wikidata, enabling it to be accessed by anyone without restriction.

  5. Integrative structural annotation of de novo RNA-Seq provides an accurate reference gene set of the enormous genome of the onion (Allium cepa L.).

    PubMed

    Kim, Seungill; Kim, Myung-Shin; Kim, Yong-Min; Yeom, Seon-In; Cheong, Kyeongchae; Kim, Ki-Tae; Jeon, Jongbum; Kim, Sunggil; Kim, Do-Sun; Sohn, Seong-Han; Lee, Yong-Hwan; Choi, Doil

    2015-02-01

    The onion (Allium cepa L.) is one of the most widely cultivated and consumed vegetable crops in the world. Although a considerable amount of onion transcriptome data has been deposited into public databases, the sequences of the protein-coding genes are not accurate enough to be used, owing to non-coding sequences intermixed with the coding sequences. We generated a high-quality, annotated onion transcriptome from de novo sequence assembly and intensive structural annotation using the integrated structural gene annotation pipeline (ISGAP), which identified 54,165 protein-coding genes among 165,179 assembled transcripts totalling 203.0 Mb by eliminating the intron sequences. ISGAP performed reliable annotation, recognizing accurate gene structures based on reference proteins, and ab initio gene models of the assembled transcripts. Integrative functional annotation and gene-based SNP analysis revealed a whole biological repertoire of genes and transcriptomic variation in the onion. The method developed in this study provides a powerful tool for the construction of reference gene sets for organisms based solely on de novo transcriptome data. Furthermore, the reference genes and their variation described here for the onion represent essential tools for molecular breeding and gene cloning in Allium spp. © The Author 2014. Published by Oxford University Press on behalf of Kazusa DNA Research Institute.

  6. A genome-wide identification of basic helix-loop-helix motifs in Pediculus humanus corporis (Phthiraptera: Pediculidae).

    PubMed

    Wang, Xu-Hua; Wang, Yong; Zhang, De-Bao; Liu, A-Ke; Yao, Qin; Chen, Ke-Ping

    2014-01-01

    Basic helix-loop-helix (bHLH) proteins comprise a large superfamily of transcription factors, which are involved in the regulation of various developmental processes. bHLH family members are widely distributed in various eukaryotes including yeast, fruit fly, zebrafish, mouse, and human. In this study, we identified 55 bHLH motifs encoded in genome sequence of the human body louse, Pediculus humanus corporis (Phthiraptera: Pediculidae). Phylogenetic analyses of the identified P. humanus corporis bHLH (PhcbHLH) motifs revealed that there are 23, 11, 9, 1, 10, and 1 member(s) in groups A, B, C, D, E, and F, respectively. Examination to GenBank annotations of the 55 PhcbHLH members indicated that 29 PhcbHLH proteins were annotated in consistence with our analytical result, 8 were annotated different with our analytical result, 12 were merely annotated as hypothetical protein, and the rest 6 were not deposited in GenBank. A comparison on insect bHLH gene composition revealed that human body louse possibly has more hairy and E(spl) genes than other insect species. Because hairy and E(spl) genes have been found to negatively regulate the differentiation of insect preneural cells, it is suggested that the existence of additional hairy and E(spl) genes in human body louse is probably the consequence of its long period adaptation to the relatively dark and stable environment. These data provide good references for further studies on regulatory functions of bHLH proteins in the growth and development of human body louse. © The Author 2014. Published by Oxford University Press on behalf of the Entomological Society of America.

  7. Discovery of rare protein-coding genes in model methylotroph Methylobacterium extorquens AM1.

    PubMed

    Kumar, Dhirendra; Mondal, Anupam Kumar; Yadav, Amit Kumar; Dash, Debasis

    2014-12-01

    Proteogenomics involves the use of MS to refine annotation of protein-coding genes and discover genes in a genome. We carried out a comprehensive proteogenomic analysis of Methylobacterium extorquens AM1 (ME-AM1) from publicly available proteomics data with the aim of improving annotation for methylotrophs, organisms capable of surviving on reduced carbon compounds such as methanol. Besides identifying 2482 (50%) proteins, 29 new genes were discovered and 66 annotated gene models were revised in the ME-AM1 genome. One such novel gene is identified with 75 peptides and lacks a homolog in other methylobacteria, but has glycosyl transferase and lipopolysaccharide biosynthesis protein domains, indicating its potential role in outer membrane synthesis. Many novel genes are present only in ME-AM1 among methylobacteria. Distant homologs of these genes in unrelated taxonomic classes and the low GC content of a few genes suggest lateral gene transfer as a potential mode of their origin. Annotations of methylotrophy-related genes were also improved by the discovery of a short gene in the methylotrophy gene island and by redefining a gene important for pyrroquinoline quinone synthesis, which is essential for methylotrophy. The combined use of proteogenomics and rigorous bioinformatics analysis greatly enhanced the annotation of protein-coding genes in the genome of the model methylotroph ME-AM1. © 2014 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.

  8. Analyzing gene expression data in mice with the Neuro Behavior Ontology.

    PubMed

    Hoehndorf, Robert; Hancock, John M; Hardy, Nigel W; Mallon, Ann-Marie; Schofield, Paul N; Gkoutos, Georgios V

    2014-02-01

    We have applied the Neuro Behavior Ontology (NBO), an ontology for the annotation of behavioral gene functions and behavioral phenotypes, to the annotation of more than 1,000 genes in the mouse that are known to play a role in behavior. These annotations can be explored by researchers interested in genes involved in particular behaviors and used computationally to provide insights into the behavioral phenotypes resulting from differences in gene expression. We developed the OntoFUNC tool and have applied it to enrichment analyses over the NBO to provide high-level behavioral interpretations of gene expression datasets. The resulting increase in the number of gene annotations facilitates the identification of behavioral or neurologic processes by assisting the formulation of hypotheses about the relationships between genes, processes, and phenotypic manifestations resulting from behavioral observations.

  9. Increased alignment sensitivity improves the usage of genome alignments for comparative gene annotation.

    PubMed

    Sharma, Virag; Hiller, Michael

    2017-08-21

    Genome alignments provide a powerful basis to transfer gene annotations from a well-annotated reference genome to many other aligned genomes. The completeness of these annotations crucially depends on the sensitivity of the underlying genome alignment. Here, we investigated the impact of the genome alignment parameters and found that parameters with a higher sensitivity allow the detection of thousands of novel alignments between orthologous exons that have been missed before. In particular, comparisons between species separated by an evolutionary distance of >0.75 substitutions per neutral site, like human and other non-placental vertebrates, benefit from increased sensitivity. To systematically test if increased sensitivity improves comparative gene annotations, we built a multiple alignment of 144 vertebrate genomes and used this alignment to map human genes to the other 143 vertebrates with CESAR. We found that higher alignment sensitivity substantially improves the completeness of comparative gene annotations by adding on average 2382 and 7440 novel exons and 117 and 317 novel genes for mammalian and non-mammalian species, respectively. Our results suggest a more sensitive alignment strategy that should generally be used for genome alignments between distantly-related species. Our 144-vertebrate genome alignment and the comparative gene annotations (https://bds.mpi-cbg.de/hillerlab/144VertebrateAlignment_CESAR/) are a valuable resource for comparative genomics. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.

  10. IMG ER: a system for microbial genome annotation expert review and curation.

    PubMed

    Markowitz, Victor M; Mavromatis, Konstantinos; Ivanova, Natalia N; Chen, I-Min A; Chu, Ken; Kyrpides, Nikos C

    2009-09-01

    A rapidly increasing number of microbial genomes are sequenced by organizations worldwide and are eventually included into various public genome data resources. The quality of the annotations depends largely on the original dataset providers, with erroneous or incomplete annotations often carried over into the public resources and difficult to correct. We have developed an Expert Review (ER) version of the Integrated Microbial Genomes (IMG) system, with the goal of supporting systematic and efficient revision of microbial genome annotations. IMG ER provides tools for the review and curation of annotations of both new and publicly available microbial genomes within IMG's rich integrated genome framework. New genome datasets are included into IMG ER prior to their public release either with their native annotations or with annotations generated by IMG ER's annotation pipeline. IMG ER tools allow addressing annotation problems detected with IMG's comparative analysis tools, such as genes missed by gene prediction pipelines or genes without an associated function. Over the past year, IMG ER was used for improving the annotations of about 150 microbial genomes.

  11. High-performance web services for querying gene and variant annotation.

    PubMed

    Xin, Jiwen; Mark, Adam; Afrasiabi, Cyrus; Tsueng, Ginger; Juchler, Moritz; Gopal, Nikhil; Stupp, Gregory S; Putman, Timothy E; Ainscough, Benjamin J; Griffith, Obi L; Torkamani, Ali; Whetzel, Patricia L; Mungall, Christopher J; Mooney, Sean D; Su, Andrew I; Wu, Chunlei

    2016-05-06

    Efficient tools for data management and integration are essential for many aspects of high-throughput biology. In particular, annotations of genes and human genetic variants are commonly used but highly fragmented across many resources. Here, we describe MyGene.info and MyVariant.info, high-performance web services for querying gene and variant annotation information. These web services are currently accessed more than three million times per month. They also demonstrate a generalizable cloud-based model for organizing and querying biological annotation information. MyGene.info and MyVariant.info are provided as high-performance web services, accessible at http://mygene.info and http://myvariant.info. Both are offered free of charge to the research community.
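As a rough illustration of what querying such a web service looks like, the sketch below only builds a request URL and does not contact the server. The `/v3/query` path and the `q`, `species`, and `fields` parameters are assumptions about the service's public interface that are not stated in the abstract, and `build_gene_query_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Assumption: the service exposes a versioned "query" endpoint at this path.
BASE_URL = "http://mygene.info/v3/query"


def build_gene_query_url(symbol, species="human",
                         fields=("symbol", "name", "entrezgene")):
    """Build (but do not send) a gene-annotation query URL.

    The parameter names q, species, and fields are assumptions made
    for illustration; check the service documentation before use.
    """
    params = {
        "q": f"symbol:{symbol}",      # restrict the search to the gene symbol
        "species": species,           # limit hits to one species
        "fields": ",".join(fields),   # ask only for the fields we need
    }
    return f"{BASE_URL}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_gene_query_url("CDK2"))
```

The returned string can be passed to any HTTP client. Keeping URL construction separate from the network call makes the query logic easy to test offline.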

  12. Fuzzy measures on the Gene Ontology for gene product similarity.

    PubMed

    Popescu, Mihail; Keller, James M; Mitchell, Joyce A

    2006-01-01

    One of the most important objects in bioinformatics is a gene product (protein or RNA). For many gene products, functional information is summarized in a set of Gene Ontology (GO) annotations. For these genes, it is reasonable to include similarity measures based on the terms found in the GO or other taxonomy. In this paper, we introduce several novel measures for computing the similarity of two gene products annotated with GO terms. The fuzzy measure similarity (FMS) has the advantage that it takes into consideration the context of both complete sets of annotation terms when computing the similarity between two gene products. When the two gene products are not annotated by common taxonomy terms, we propose a method that avoids a zero similarity result. To account for the variations in the annotation reliability, we propose a similarity measure based on the Choquet integral. These similarity measures provide extra tools for the biologist in search of functional information for gene products. The initial testing on a group of 194 sequences representing three protein families shows a higher correlation of the FMS and Choquet similarities to the BLAST sequence similarities than the traditional similarity measures such as pairwise average or pairwise maximum.
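The two traditional baselines named in the abstract, pairwise average and pairwise maximum, can be sketched in a few lines. This is not the FMS or Choquet measure, whose definitions the abstract does not give. The term identifiers `T1` to `T3` and the similarity values in `TERM_SIM` are hypothetical placeholders, and any term-to-term similarity could be substituted.

```python
from itertools import product

# Hypothetical term-to-term similarities in [0, 1]; illustrative values only.
TERM_SIM = {
    frozenset(["T1", "T2"]): 0.5,
    frozenset(["T1", "T3"]): 0.25,
    frozenset(["T2", "T3"]): 0.75,
}


def term_similarity(t1, t2):
    """Similarity of two annotation terms; identical terms score 1.0."""
    if t1 == t2:
        return 1.0
    return TERM_SIM.get(frozenset([t1, t2]), 0.0)


def pairwise_scores(terms_1, terms_2):
    """Similarity of every term of one gene product against every term of the other."""
    if not terms_1 or not terms_2:
        raise ValueError("both gene products need at least one annotation term")
    return [term_similarity(a, b) for a, b in product(terms_1, terms_2)]


def pairwise_average(terms_1, terms_2):
    """Mean of all term-pair similarities."""
    scores = pairwise_scores(terms_1, terms_2)
    return sum(scores) / len(scores)


def pairwise_maximum(terms_1, terms_2):
    """Best single term-pair similarity."""
    return max(pairwise_scores(terms_1, terms_2))


if __name__ == "__main__":
    gene_x = ["T1", "T2"]
    gene_y = ["T2", "T3"]
    print(pairwise_average(gene_x, gene_y))  # 0.625
    print(pairwise_maximum(gene_x, gene_y))  # 1.0
```

Both baselines reduce the two annotation sets to independent term pairs. The maximum is driven by a single shared or closely related term, and the average is diluted by unrelated pairs, which is the kind of context-insensitivity the abstract says FMS is meant to address.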

  13. Experimental annotation of post-translational features and translated coding regions in the pathogen Salmonella Typhimurium

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Ansong, Charles; Tolic, Nikola; Purvine, Samuel O.

    Complete and accurate genome annotation is crucial for comprehensive and systematic studies of biological systems. For example, systems biology-oriented genome scale modeling efforts greatly benefit from accurate annotation of protein-coding genes to develop proper functioning models. However, determining protein-coding genes for most new genomes is almost completely performed by inference, using computational predictions with significant documented error rates (> 15%). Furthermore, gene prediction programs provide no information on biologically important post-translational processing events critical for protein function. With the ability to directly measure peptides arising from expressed proteins, mass spectrometry-based proteomics approaches can be used to augment and verify coding regions of a genomic sequence and importantly detect post-translational processing events. In this study we utilized “shotgun” proteomics to guide accurate primary genome annotation of the bacterial pathogen Salmonella Typhimurium 14028 to facilitate a systems-level understanding of Salmonella biology. The data provides protein-level experimental confirmation for 44% of predicted protein-coding genes, suggests revisions to 48 genes assigned incorrect translational start sites, and uncovers 13 non-annotated genes missed by gene prediction programs. We also present a comprehensive analysis of post-translational processing events in Salmonella, revealing a wide range of complex chemical modifications (70 distinct modifications) and confirming more than 130 signal peptide and N-terminal methionine cleavage events in Salmonella. This study highlights several ways in which proteomics data applied during the primary stages of annotation can improve the quality of genome annotations, especially with regards to the annotation of mature protein products.

  14. High-throughput comparison, functional annotation, and metabolic modeling of plant genomes using the PlantSEED resource

    PubMed Central

    Seaver, Samuel M. D.; Gerdes, Svetlana; Frelin, Océane; Lerma-Ortiz, Claudia; Bradbury, Louis M. T.; Zallot, Rémi; Hasnain, Ghulam; Niehaus, Thomas D.; El Yacoubi, Basma; Pasternak, Shiran; Olson, Robert; Pusch, Gordon; Overbeek, Ross; Stevens, Rick; de Crécy-Lagard, Valérie; Ware, Doreen; Hanson, Andrew D.; Henry, Christopher S.

    2014-01-01

    The increasing number of sequenced plant genomes is placing new demands on the methods applied to analyze, annotate, and model these genomes. Today’s annotation pipelines result in inconsistent gene assignments that complicate comparative analyses and prevent efficient construction of metabolic models. To overcome these problems, we have developed the PlantSEED, an integrated, metabolism-centric database to support subsystems-based annotation and metabolic model reconstruction for plant genomes. PlantSEED combines SEED subsystems technology, first developed for microbial genomes, with refined protein families and biochemical data to assign fully consistent functional annotations to orthologous genes, particularly those encoding primary metabolic pathways. Seamless integration with its parent, the prokaryotic SEED database, makes PlantSEED a unique environment for cross-kingdom comparative analysis of plant and bacterial genomes. The consistent annotations imposed by PlantSEED permit rapid reconstruction and modeling of primary metabolism for all plant genomes in the database. This feature opens the unique possibility of model-based assessment of the completeness and accuracy of gene annotation and thus allows computational identification of genes and pathways that are restricted to certain genomes or need better curation. We demonstrate the PlantSEED system by producing consistent annotations for 10 reference genomes. We also produce a functioning metabolic model for each genome, gapfilling to identify missing annotations and proposing gene candidates for missing annotations. Models are built around an extended biomass composition representing the most comprehensive published to date. To our knowledge, our models are the first to be published for seven of the genomes analyzed. PMID:24927599

  15. High-throughput comparison, functional annotation, and metabolic modeling of plant genomes using the PlantSEED resource.

    PubMed

    Seaver, Samuel M D; Gerdes, Svetlana; Frelin, Océane; Lerma-Ortiz, Claudia; Bradbury, Louis M T; Zallot, Rémi; Hasnain, Ghulam; Niehaus, Thomas D; El Yacoubi, Basma; Pasternak, Shiran; Olson, Robert; Pusch, Gordon; Overbeek, Ross; Stevens, Rick; de Crécy-Lagard, Valérie; Ware, Doreen; Hanson, Andrew D; Henry, Christopher S

    2014-07-01

    The increasing number of sequenced plant genomes is placing new demands on the methods applied to analyze, annotate, and model these genomes. Today's annotation pipelines result in inconsistent gene assignments that complicate comparative analyses and prevent efficient construction of metabolic models. To overcome these problems, we have developed the PlantSEED, an integrated, metabolism-centric database to support subsystems-based annotation and metabolic model reconstruction for plant genomes. PlantSEED combines SEED subsystems technology, first developed for microbial genomes, with refined protein families and biochemical data to assign fully consistent functional annotations to orthologous genes, particularly those encoding primary metabolic pathways. Seamless integration with its parent, the prokaryotic SEED database, makes PlantSEED a unique environment for cross-kingdom comparative analysis of plant and bacterial genomes. The consistent annotations imposed by PlantSEED permit rapid reconstruction and modeling of primary metabolism for all plant genomes in the database. This feature opens the unique possibility of model-based assessment of the completeness and accuracy of gene annotation and thus allows computational identification of genes and pathways that are restricted to certain genomes or need better curation. We demonstrate the PlantSEED system by producing consistent annotations for 10 reference genomes. We also produce a functioning metabolic model for each genome, gapfilling to identify missing annotations and proposing gene candidates for missing annotations. Models are built around an extended biomass composition representing the most comprehensive published to date. To our knowledge, our models are the first to be published for seven of the genomes analyzed.

  16. Joint annotation of chromatin state and chromatin conformation reveals relationships among domain types and identifies domains of cell-type-specific expression

    PubMed Central

    Libbrecht, Maxwell W.; Ay, Ferhat; Hoffman, Michael M.; Gilbert, David M.; Bilmes, Jeffrey A.; Noble, William Stafford

    2015-01-01

    The genomic neighborhood of a gene influences its activity, a behavior that is attributable in part to domain-scale regulation. Previous genomic studies have identified many types of regulatory domains. However, due to the difficulty of integrating genomics data sets, the relationships among these domain types are poorly understood. Semi-automated genome annotation (SAGA) algorithms facilitate human interpretation of heterogeneous collections of genomics data by simultaneously partitioning the human genome and assigning labels to the resulting genomic segments. However, existing SAGA methods cannot integrate inherently pairwise chromatin conformation data. We developed a new computational method, called graph-based regularization (GBR), for expressing a pairwise prior that encourages certain pairs of genomic loci to receive the same label in a genome annotation. We used GBR to exploit chromatin conformation information during genome annotation by encouraging positions that are close in 3D to occupy the same type of domain. Using this approach, we produced a model of chromatin domains in eight human cell types, thereby revealing the relationships among known domain types. Through this model, we identified clusters of tightly regulated genes expressed in only a small number of cell types, which we term “specific expression domains.” We found that domain boundaries marked by promoters and CTCF motifs are consistent between cell types even when domain activity changes. Finally, we showed that GBR can be used to transfer information from well-studied cell types to less well-characterized cell types during genome annotation, making it possible to produce high-quality annotations of the hundreds of cell types with limited available data. PMID:25677182

  17. Joint annotation of chromatin state and chromatin conformation reveals relationships among domain types and identifies domains of cell-type-specific expression.

    PubMed

    Libbrecht, Maxwell W; Ay, Ferhat; Hoffman, Michael M; Gilbert, David M; Bilmes, Jeffrey A; Noble, William Stafford

    2015-04-01

    The genomic neighborhood of a gene influences its activity, a behavior that is attributable in part to domain-scale regulation. Previous genomic studies have identified many types of regulatory domains. However, due to the difficulty of integrating genomics data sets, the relationships among these domain types are poorly understood. Semi-automated genome annotation (SAGA) algorithms facilitate human interpretation of heterogeneous collections of genomics data by simultaneously partitioning the human genome and assigning labels to the resulting genomic segments. However, existing SAGA methods cannot integrate inherently pairwise chromatin conformation data. We developed a new computational method, called graph-based regularization (GBR), for expressing a pairwise prior that encourages certain pairs of genomic loci to receive the same label in a genome annotation. We used GBR to exploit chromatin conformation information during genome annotation by encouraging positions that are close in 3D to occupy the same type of domain. Using this approach, we produced a model of chromatin domains in eight human cell types, thereby revealing the relationships among known domain types. Through this model, we identified clusters of tightly regulated genes expressed in only a small number of cell types, which we term "specific expression domains." We found that domain boundaries marked by promoters and CTCF motifs are consistent between cell types even when domain activity changes. Finally, we showed that GBR can be used to transfer information from well-studied cell types to less well-characterized cell types during genome annotation, making it possible to produce high-quality annotations of the hundreds of cell types with limited available data. © 2015 Libbrecht et al.; Published by Cold Spring Harbor Laboratory Press.

  18. The MGED Ontology: a resource for semantics-based description of microarray experiments.

    PubMed

    Whetzel, Patricia L; Parkinson, Helen; Causton, Helen C; Fan, Liju; Fostel, Jennifer; Fragoso, Gilberto; Game, Laurence; Heiskanen, Mervi; Morrison, Norman; Rocca-Serra, Philippe; Sansone, Susanna-Assunta; Taylor, Chris; White, Joseph; Stoeckert, Christian J

    2006-04-01

    The generation of large amounts of microarray data and the need to share these data bring challenges for both data management and annotation and highlight the need for standards. MIAME specifies the minimum information needed to describe a microarray experiment, and the Microarray Gene Expression Object Model (MAGE-OM) and resulting MAGE-ML provide a mechanism to standardize data representation for data exchange; however, a common terminology for data annotation is needed to support these standards. Here we describe the MGED Ontology (MO) developed by the Ontology Working Group of the Microarray Gene Expression Data (MGED) Society. The MO provides terms for annotating all aspects of a microarray experiment from the design of the experiment and array layout, through to the preparation of the biological sample and the protocols used to hybridize the RNA and analyze the data. The MO was developed to provide terms for annotating experiments in line with the MIAME guidelines, i.e. to provide the semantics to describe a microarray experiment according to the concepts specified in MIAME. The MO does not attempt to incorporate terms from existing ontologies, e.g. those that deal with anatomical parts or developmental stage terms, but provides a framework to reference terms in other ontologies and therefore facilitates the use of ontologies in microarray data annotation. The MGED Ontology version 1.2.0 is available as a file in both DAML and OWL formats at http://mged.sourceforge.net/ontologies/index.php. Release notes and annotation examples are provided. The MO is also provided via the NCICB's Enterprise Vocabulary System (http://nciterms.nci.nih.gov/NCIBrowser/Dictionary.do). Contact: Stoeckrt@pcbi.upenn.edu. Supplementary data are available at Bioinformatics online.

  19. Saccharomyces cerevisiae: gene annotation and genome variability, state of the art through comparative genomics.

    PubMed

    Louis, Ed

    2011-01-01

    In the early days of the yeast genome sequencing project, gene annotation was in its infancy and suffered the problem of many false positive annotations as well as missed genes. The lack of other sequences for comparison also prevented the annotation of conserved, functional sequences that were not coding. We are now in an era of comparative genomics where many closely related as well as more distantly related genomes are available for direct sequence and synteny comparisons allowing for more probable predictions of genes and other functional sequences due to conservation. We also have a plethora of functional genomics data which helps inform gene annotation for previously uncharacterised open reading frames (ORFs)/genes. For Saccharomyces cerevisiae this has resulted in a continuous updating of the gene and functional sequence annotations in the reference genome helping it retain its position as the best characterized eukaryotic organism's genome. A single reference genome for a species does not accurately describe the species and this is quite clear in the case of S. cerevisiae where the reference strain is not ideal for brewing or baking due to missing genes. Recent surveys of numerous isolates, from a variety of sources, using a variety of technologies have revealed a great deal of variation amongst isolates with genome sequence surveys providing information on novel genes, undetectable by other means. We now have a better understanding of the extant variation in S. cerevisiae as a species as well as some idea of how much we are missing from this understanding. As with gene annotation, comparative genomics enhances the discovery and description of genome variation and is providing us with the tools for understanding genome evolution, adaptation and selection, and underlying genetics of complex traits.

  20. Dizeez: An Online Game for Human Gene-Disease Annotation

    PubMed Central

    Loguercio, Salvatore; Good, Benjamin M.; Su, Andrew I.

    2013-01-01

    Structured gene annotations are a foundation upon which many bioinformatics and statistical analyses are built. However, the structured annotations available in public databases are a sparse representation of biological knowledge as a whole. The rate of biomedical data generation is such that centralized biocuration efforts struggle to keep up. New models for gene annotation need to be explored that increase the pace at which we are able to structure biomedical knowledge. Recently, online games have emerged as an effective way to recruit, engage and organize large numbers of volunteers to help address difficult biological challenges. For example, games have been successfully developed for protein folding (Foldit), multiple sequence alignment (Phylo) and RNA structure design (EteRNA). Here we present Dizeez, a simple online game built with the purpose of structuring knowledge of gene-disease associations. Preliminary results from game play online and at scientific conferences suggest that Dizeez is producing valid gene-disease annotations not yet present in any public database. These early results provide a basic proof of principle that online games can be successfully applied to the challenge of gene annotation. Dizeez is available at http://genegames.org. PMID:23951102

  1. Genome-wide analysis of tandem repeats in plants and green algae

    Treesearch

    Zhixin Zhao; Cheng Guo; Sreeskandarajan Sutharzan; Pei Li; Craig Echt; Jie Zhang; Chun Liang

    2014-01-01

    Tandem repeats (TRs) extensively exist in the genomes of prokaryotes and eukaryotes. Based on the sequenced genomes and gene annotations of 31 plant and algal species in Phytozome version 8.0 (http://www.phytozome.net/), we examined TRs in a genome-wide scale, characterized their distributions and motif features, and explored their putative biological functions. Among...

  2. SGP-1: Prediction and Validation of Homologous Genes Based on Sequence Alignments

    PubMed Central

    Wiehe, Thomas; Gebauer-Jung, Steffi; Mitchell-Olds, Thomas; Guigó, Roderic

    2001-01-01

    Conventional methods of gene prediction rely on the recognition of DNA-sequence signals, the coding potential or the comparison of a genomic sequence with a cDNA, EST, or protein database. Reasons for limited accuracy in many circumstances are species-specific training and the incompleteness of reference databases. Lately, comparative genome analysis has attracted increasing attention. Several analysis tools that are based on human/mouse comparisons are already available. Here, we present a program for the prediction of protein-coding genes, termed SGP-1 (Syntenic Gene Prediction), which is based on the similarity of homologous genomic sequences. In contrast to most existing tools, the accuracy of SGP-1 depends little on species-specific properties such as codon usage or the nucleotide distribution. SGP-1 may therefore be applied to nonstandard model organisms in vertebrates as well as in plants, without the need for extensive parameter training. In addition to predicting genes in large-scale genomic sequences, the program may be useful to validate gene structure annotations from databases. To this end, SGP-1 output also contains comparisons between predicted and annotated gene structures in HTML format. The program can be accessed via a Web server at http://soft.ice.mpg.de/sgp-1. The source code, written in ANSI C, is available on request from the authors. PMID:11544202

  3. Comprehensive coverage of cardiovascular disease data in the disease portals at the Rat Genome Database.

    PubMed

    Wang, Shur-Jen; Laulederkind, Stanley J F; Hayman, G Thomas; Petri, Victoria; Smith, Jennifer R; Tutaj, Marek; Nigam, Rajni; Dwinell, Melinda R; Shimoyama, Mary

    2016-08-01

    Cardiovascular diseases are complex diseases caused by a combination of genetic and environmental factors. To facilitate progress in complex disease research, the Rat Genome Database (RGD) provides the community with a disease portal where genome objects and biological data related to cardiovascular diseases are systematically organized. The purpose of this study is to present biocuration at RGD, including disease, genetic, and pathway data. The RGD curation team uses controlled vocabularies/ontologies to organize data curated from the published literature or imported from disease and pathway databases. These organized annotations are associated with genes, strains, and quantitative trait loci (QTLs), thus linking functional annotations to genome objects. Screen shots from the web pages are used to demonstrate the organization of annotations at RGD. The human cardiovascular disease genes identified by annotations were grouped according to data sources and their annotation profiles were compared by in-house tools and other enrichment tools available to the public. The analysis results show that the imported cardiovascular disease genes from ClinVar and OMIM are functionally different from the RGD manually curated genes in terms of pathway and Gene Ontology annotations. The inclusion of disease genes from other databases enriches the collection of disease genes not only in quantity but also in quality. Copyright © 2016 the American Physiological Society.

  4. An efficient annotation and gene-expression derivation tool for Illumina Solexa datasets.

    PubMed

    Hosseini, Parsa; Tremblay, Arianne; Matthews, Benjamin F; Alkharouf, Nadim W

    2010-07-02

    An Illumina flow cell with all eight lanes occupied produces well over a terabyte of images, with gigabytes of reads following sequence alignment. The ability to translate such reads into meaningful annotation is therefore of great concern and importance. One can very easily be flooded with a great volume of textual, unannotated data irrespective of read quality or size. CASAVA, an optional analysis tool for Illumina sequencing experiments, provides INDEL detection, SNP information, and allele calling. Extracting from such analysis a measure of gene expression in the form of tag-counts, and furthermore annotating such reads, is therefore of significant value. We developed TASE (Tag counting and Analysis of Solexa Experiments), a rapid tag-counting and annotation software tool specifically designed for Illumina CASAVA sequencing datasets. Developed in Java and deployed using the jTDS JDBC driver and a SQL Server backend, TASE provides an extremely fast means of calculating gene expression through tag-counts while annotating sequenced reads with the gene's presumed function, from any given CASAVA-build. Such a build is generated for both DNA and RNA sequencing. Analysis is broken into two distinct components: DNA sequence or read concatenation, followed by tag-counting and annotation. The end result is output containing the homology-based functional annotation and a respective gene expression measure signifying how many times sequenced reads were found within the genomic ranges of functional annotations. TASE is a powerful tool to facilitate the process of annotating a given Illumina Solexa sequencing dataset. Our results indicate that both homology-based annotation and tag-count analysis are achieved in very efficient times, allowing researchers to delve deep into a given CASAVA-build and maximize information extraction from a sequencing dataset. TASE is specially designed to translate sequence data in a CASAVA-build into functional annotations while producing corresponding gene expression measurements. Such analysis is executed in an ultrafast and highly efficient manner, whether the analysis is a single-read or paired-end sequencing experiment. TASE is a user-friendly and freely available application, allowing rapid analysis and annotation of any given Illumina Solexa sequencing dataset with ease.
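The tag-counting idea described above, counting how many aligned reads fall within the genomic range of each annotation, can be sketched as follows. This is a minimal Python illustration and not TASE itself, which the abstract describes as a Java application with a SQL Server backend. The record layout, the gene names, and the `count_tags` helper are all hypothetical.

```python
def count_tags(reads, annotations):
    """Count reads whose aligned position lies inside each annotated range.

    reads:       iterable of (chromosome, position) tuples
    annotations: list of dicts with keys name, chrom, start, end
                 (start and end are inclusive coordinates)
    Returns a dict mapping annotation name to its tag count.
    """
    counts = {a["name"]: 0 for a in annotations}
    for chrom, pos in reads:
        for a in annotations:
            # A read contributes to every annotation whose range contains it,
            # so overlapping annotations can both count the same read.
            if a["chrom"] == chrom and a["start"] <= pos <= a["end"]:
                counts[a["name"]] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical annotations and read positions, for illustration only.
    annotations = [
        {"name": "geneA", "chrom": "chr1", "start": 100, "end": 200},
        {"name": "geneB", "chrom": "chr1", "start": 150, "end": 300},
        {"name": "geneC", "chrom": "chr2", "start": 50, "end": 80},
    ]
    reads = [("chr1", 120), ("chr1", 160), ("chr1", 250),
             ("chr2", 60), ("chr2", 500)]
    print(count_tags(reads, annotations))
    # {'geneA': 2, 'geneB': 2, 'geneC': 1}
```

The nested loop is quadratic and only suitable for small inputs. At the data volumes the abstract mentions, an indexed lookup (a database range query or an interval tree) would be the more realistic choice.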

  5. A computational genomics pipeline for prokaryotic sequencing projects.

    PubMed

    Kislyuk, Andrey O; Katz, Lee S; Agrawal, Sonia; Hagen, Matthew S; Conley, Andrew B; Jayaraman, Pushkala; Nelakuditi, Viswateja; Humphrey, Jay C; Sammons, Scott A; Govil, Dhwani; Mair, Raydel D; Tatti, Kathleen M; Tondella, Maria L; Harcourt, Brian H; Mayer, Leonard W; Jordan, I King

    2010-08-01

    New sequencing technologies have accelerated research on prokaryotic genomes and have made genome sequencing operations outside major genome sequencing centers routine. However, no off-the-shelf solution exists for the combined assembly, gene prediction, genome annotation and data presentation necessary to interpret sequencing data. The resulting requirement to invest significant resources into custom informatics support for genome sequencing projects remains a major impediment to the accessibility of high-throughput sequence data. We present a self-contained, automated high-throughput open source genome sequencing and computational genomics pipeline suitable for prokaryotic sequencing projects. The pipeline has been used at the Georgia Institute of Technology and the Centers for Disease Control and Prevention for the analysis of Neisseria meningitidis and Bordetella bronchiseptica genomes. The pipeline is capable of enhanced or manually assisted reference-based assembly using multiple assemblers and modes; gene predictor combining; and functional annotation of genes and gene products. Because every component of the pipeline is executed on a local machine with no need to access resources over the Internet, the pipeline is suitable for projects of a sensitive nature. Annotation of virulence-related features makes the pipeline particularly useful for projects working with pathogenic prokaryotes. The pipeline is licensed under the open-source GNU General Public License and available at the Georgia Tech Neisseria Base (http://nbase.biology.gatech.edu/). The pipeline is implemented with a combination of Perl, Bourne Shell and MySQL and is compatible with Linux and other Unix systems.

  6. A comprehensive transcript index of the human genome generated using microarrays and computational approaches

    PubMed Central

    Schadt, Eric E; Edwards, Stephen W; GuhaThakurta, Debraj; Holder, Dan; Ying, Lisa; Svetnik, Vladimir; Leonardson, Amy; Hart, Kyle W; Russell, Archie; Li, Guoya; Cavet, Guy; Castle, John; McDonagh, Paul; Kan, Zhengyan; Chen, Ronghua; Kasarskis, Andrew; Margarint, Mihai; Caceres, Ramon M; Johnson, Jason M; Armour, Christopher D; Garrett-Engele, Philip W; Tsinoremas, Nicholas F; Shoemaker, Daniel D

    2004-01-01

    Background: Computational and microarray-based experimental approaches were used to generate a comprehensive transcript index for the human genome. Oligonucleotide probes designed from approximately 50,000 known and predicted transcript sequences from the human genome were used to survey transcription from a diverse set of 60 tissues and cell lines using ink-jet microarrays. Further, expression activity over at least six conditions was more generally assessed using genomic tiling arrays consisting of probes tiled through a repeat-masked version of the genomic sequence making up chromosomes 20 and 22. Results: The combination of microarray data with extensive genome annotations resulted in a set of 28,456 experimentally supported transcripts. This set of high-confidence transcripts represents the first experimentally driven annotation of the human genome. In addition, the results from genomic tiling suggest that a large amount of transcription exists outside of annotated regions of the genome and serves as an example of how this activity could be measured on a genome-wide scale. Conclusions: These data represent one of the most comprehensive assessments of transcriptional activity in the human genome and provide an atlas of human gene expression over a unique set of gene predictions. Before the annotation of the human genome is considered complete, however, the previously unannotated transcriptional activity throughout the genome must be fully characterized. PMID:15461792

  7. Comprehensive annotation of secondary metabolite biosynthetic genes and gene clusters of Aspergillus nidulans, A. fumigatus, A. niger and A. oryzae

    PubMed Central

    2013-01-01

    Background: Secondary metabolite production, a hallmark of filamentous fungi, is an expanding area of research for the Aspergilli. These compounds are potent chemicals, ranging from deadly toxins to therapeutic antibiotics to potential anti-cancer drugs. The genome sequences for multiple Aspergilli have been determined, and provide a wealth of predictive information about secondary metabolite production. Sequence analysis and gene overexpression strategies have enabled the discovery of novel secondary metabolites and the genes involved in their biosynthesis. The Aspergillus Genome Database (AspGD) provides a central repository for gene annotation and protein information for Aspergillus species. These annotations include Gene Ontology (GO) terms, phenotype data, gene names and descriptions, and they are crucial for interpreting both small- and large-scale data and for aiding in the design of new experiments that further Aspergillus research. Results: We have manually curated Biological Process GO annotations for all genes in AspGD with recorded functions in secondary metabolite production, adding new GO terms that specifically describe each secondary metabolite. We then leveraged these new annotations to predict roles in secondary metabolism for genes lacking experimental characterization. As a starting point for manually annotating Aspergillus secondary metabolite gene clusters, we used antiSMASH (antibiotics and Secondary Metabolite Analysis SHell) and SMURF (Secondary Metabolite Unknown Regions Finder) algorithms to identify potential clusters in A. nidulans, A. fumigatus, A. niger and A. oryzae, which we subsequently refined through manual curation. Conclusions: This set of 266 manually curated secondary metabolite gene clusters will facilitate the investigation of novel Aspergillus secondary metabolites. PMID:23617571

  8. Accurate and reproducible functional maps in 127 human cell types via 2D genome segmentation

    PubMed Central

    Hardison, Ross C.

    2017-01-01

    The Roadmap Epigenomics Consortium has published whole-genome functional annotation maps in 127 human cell types by integrating data from studies of multiple epigenetic marks. These maps have been widely used for studying gene regulation in cell type-specific contexts and predicting the functional impact of DNA mutations on disease. Here, we present a new map of functional elements produced by applying a method called IDEAS on the same data. The method has several unique advantages and outperforms existing methods, including that used by the Roadmap Epigenomics Consortium. Using five categories of independent experimental datasets, we compared the IDEAS and Roadmap Epigenomics maps. While the overall concordance between the two maps is high, the maps differ substantially in the prediction details and in their consistency of annotation of a given genomic position across cell types. The annotation from IDEAS is uniformly more accurate than the Roadmap Epigenomics annotation and the improvement is substantial based on several criteria. We further introduce a pipeline that improves the reproducibility of functional annotation maps. Thus, we provide a high-quality map of candidate functional regions across 127 human cell types and compare the quality of different annotation methods in order to facilitate biomedical research in epigenomics. PMID:28973456

  9. Proteogenomics produces comprehensive and highly accurate protein-coding gene annotation in a complete genome assembly of Malassezia sympodialis

    PubMed Central

    Tellgren-Roth, Christian; Baudo, Charles D.; Kennell, John C.; Sun, Sheng; Billmyre, R. Blake; Schröder, Markus S.; Andersson, Anna; Holm, Tina; Sigurgeirsson, Benjamin; Wu, Guangxi; Sankaranarayanan, Sundar Ram; Siddharthan, Rahul; Sanyal, Kaustuv; Lundeberg, Joakim; Nystedt, Björn; Boekhout, Teun; Dawson, Thomas L.; Heitman, Joseph

    2017-01-01

    Complete and accurate genome assembly and annotation is a crucial foundation for comparative and functional genomics. Despite this, few complete eukaryotic genomes are available, and genome annotation remains a major challenge. Here, we present a complete genome assembly of the skin commensal yeast Malassezia sympodialis and demonstrate how proteogenomics can substantially improve gene annotation. Through long-read DNA sequencing, we obtained a gap-free genome assembly for M. sympodialis (ATCC 42132), comprising eight nuclear and one mitochondrial chromosome. We also sequenced and assembled four M. sympodialis clinical isolates, and showed their value for understanding Malassezia reproduction by confirming four alternative allele combinations at the two mating-type loci. Importantly, we demonstrated how proteomics data could be readily integrated with transcriptomics data in standard annotation tools. This increased the number of annotated protein-coding genes by 14% (from 3612 to 4113), compared to using transcriptomics evidence alone. Manual curation further increased the number of protein-coding genes by 9% (to 4493). All of these genes have RNA-seq evidence and 87% were confirmed by proteomics. The M. sympodialis genome assembly and annotation presented here is at a quality yet achieved only for a few eukaryotic organisms, and constitutes an important reference for future host-microbe interaction studies. PMID:28100699

  10. Mining featured biomarkers associated with prostatic carcinoma based on bioinformatics.

    PubMed

    Piao, Guanying; Wu, Jiarui

    2013-11-01

    The aim was to analyze differentially expressed genes and identify featured biomarkers of prostatic carcinoma. The software "Significance Analysis of Microarray" (SAM) was used to identify the differentially coexpressed genes (DCGs). The DCGs present in two datasets were analyzed by GO (Gene Ontology) functional annotation. A total of 389 DCGs were obtained. By GO analysis, we found that these DCGs were closely related to acinus development and to the TGF-β receptor and signal transduction pathways. Furthermore, five featured biomarkers were discovered by interaction analysis. These important signaling pathways and oncogenes may provide potential therapeutic targets for prostatic carcinoma.

  11. PedAM: a database for Pediatric Disease Annotation and Medicine.

    PubMed

    Jia, Jinmeng; An, Zhongxin; Ming, Yue; Guo, Yongli; Li, Wei; Li, Xin; Liang, Yunxiang; Guo, Dongming; Tai, Jun; Chen, Geng; Jin, Yaqiong; Liu, Zhimei; Ni, Xin; Shi, Tieliu

    2018-01-04

    A significant number of children around the world suffer from the consequences of misdiagnosis and ineffective treatment of various diseases. To facilitate precision medicine in pediatrics, a database named Pediatric Disease Annotations & Medicines (PedAM) has been built to standardize and classify pediatric diseases. PedAM integrates both biomedical resources and clinical data from Electronic Medical Records to support the development of computational tools, which enables robust data analysis and integration. It also uses disease-manifestation (D-M) integrated from existing biomedical ontologies as prior knowledge to automatically recognize text-mined, D-M-specific syntactic patterns from 774 514 full-text articles and 8 848 796 abstracts in MEDLINE. Additionally, disease connections based on phenotypes or genes can be visualized on the web page of PedAM. Currently, PedAM contains 8528 standardized pediatric disease terms (4542 unique disease concepts and 3986 synonyms) with eight annotation fields for each disease, including definition synonyms, gene, symptom, cross-reference (Xref), human phenotypes and their corresponding phenotypes in the mouse. The database PedAM is freely accessible at http://www.unimd.org/pedam/. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.

  12. Ensembl 2002: accommodating comparative genomics.

    PubMed

    Clamp, M; Andrews, D; Barker, D; Bevan, P; Cameron, G; Chen, Y; Clark, L; Cox, T; Cuff, J; Curwen, V; Down, T; Durbin, R; Eyras, E; Gilbert, J; Hammond, M; Hubbard, T; Kasprzyk, A; Keefe, D; Lehvaslaiho, H; Iyer, V; Melsopp, C; Mongin, E; Pettett, R; Potter, S; Rust, A; Schmidt, E; Searle, S; Slater, G; Smith, J; Spooner, W; Stabenau, A; Stalker, J; Stupka, E; Ureta-Vidal, A; Vastrik, I; Birney, E

    2003-01-01

    The Ensembl (http://www.ensembl.org/) database project provides a bioinformatics framework to organise biology around the sequences of large genomes. It is a comprehensive source of stable automatic annotation of human, mouse and other genome sequences, available as either an interactive web site or as flat files. Ensembl also integrates manually annotated gene structures from external sources where available. As well as being one of the leading sources of genome annotation, Ensembl is an open source software engineering project to develop a portable system able to handle very large genomes and associated requirements. These range from sequence analysis to data storage and visualisation, and installations exist around the world at both companies and academic sites. With both human and mouse genome sequences available and more vertebrate sequences to follow, many of the recent developments in Ensembl have focused on developing automatic comparative genome analysis and visualisation.

  13. ADGO: analysis of differentially expressed gene sets using composite GO annotation.

    PubMed

    Nam, Dougu; Kim, Sang-Bae; Kim, Seon-Kyu; Yang, Sungjin; Kim, Seon-Young; Chu, In-Sun

    2006-09-15

    Genes are typically expressed in a modular manner in biological processes. Recent studies reflect such features in analyzing gene expression patterns by directly scoring gene sets. Gene annotations have been used to define the gene sets, which have served to reveal specific biological themes from expression data. However, current annotations have limited analytical power, because they are classified by single categories providing only unary information for the gene sets. Here we propose a method for discovering composite biological themes from expression data. We intersected two annotated gene sets from different categories of Gene Ontology (GO). We then scored the expression changes of all the single and intersected sets. In this way, we were able to uncover, for example, a gene set with the molecular function F and the cellular component C that showed significant expression change, while the changes in individual gene sets were not significant. We provided an exemplary analysis for HIV-1 immune response. In addition, we tested the method on 20 public datasets, where we found many 'filtered' composite terms, the number of which reached approximately 34% (a strong criterion, 5% significance) of the number of significant unary terms on average. By using composite annotation, we can derive new and improved information about disease and biological processes from expression data. We provide a web application (ADGO: http://array.kobic.re.kr/ADGO) for the analysis of differentially expressed gene sets with composite GO annotations. The user can analyze Affymetrix and dual channel array (spotted cDNA and spotted oligo microarray) data for four species: human, mouse, rat and yeast. Contact: chu@kribb.re.kr (http://array.kobic.re.kr/ADGO).
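
    The set-intersection step described in this abstract (combining a gene set from one GO category with a gene set from another) might be sketched as follows. This is only an illustration, not the ADGO code: the function name `composite_gene_sets`, the dictionary layout, and the `min_size` filter are assumptions, and the scoring of expression changes that the abstract mentions is deliberately left out.

```python
from itertools import product


def composite_gene_sets(sets_a, sets_b, min_size=2):
    """Intersect every gene set of one GO category with every set of another.

    sets_a, sets_b: dict mapping GO term -> set of gene identifiers,
                    e.g. molecular function terms and cellular component terms.
    min_size:       hypothetical cutoff; intersections smaller than this are dropped.
    Returns a dict mapping (term_a, term_b) -> set of shared genes.
    """
    composites = {}
    for (term_a, genes_a), (term_b, genes_b) in product(sets_a.items(), sets_b.items()):
        shared = set(genes_a) & set(genes_b)
        if len(shared) >= min_size:
            composites[(term_a, term_b)] = shared
    return composites


if __name__ == "__main__":
    molecular_function = {"F": {"g1", "g2", "g3"}}
    cellular_component = {"C": {"g2", "g3", "g4"}, "D": {"g1"}}
    print(composite_gene_sets(molecular_function, cellular_component))
```

    Each composite set could then be scored with whatever gene-set statistic is applied to the single (unary) sets, which is how a composite term can be significant when neither parent set is.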

  14. Automated update, revision, and quality control of the maize genome annotations using MAKER-P improves the B73 RefGen_v3 gene models and identifies new genes

    USDA-ARS's Scientific Manuscript database

    The large size and relative complexity of many plant genomes make creation, quality control, and dissemination of high-quality gene structure annotations challenging. In response, we have developed MAKER-P, a fast and easy-to-use genome annotation engine for plants. Here, we report the use of MAKER-...

  15. Using Gene Ontology to describe the role of the neurexin-neuroligin-SHANK complex in human, mouse and rat and its relevance to autism.

    PubMed

    Patel, Sejal; Roncaglia, Paola; Lovering, Ruth C

    2015-06-06

    People with an autistic spectrum disorder (ASD) display a variety of characteristic behavioral traits, including impaired social interaction, communication difficulties and repetitive behavior. This complex neurodevelopmental disorder is known to be associated with a combination of genetic and environmental factors. Neurexins and neuroligins play a key role in synaptogenesis, and neurexin-neuroligin adhesion is one of several processes that have been implicated in autism spectrum disorders. In this report we describe the manual annotation of a selection of gene products known to be associated with autism and/or the neurexin-neuroligin-SHANK complex and demonstrate how a focused annotation approach leads to the creation of more descriptive Gene Ontology (GO) terms, as well as an increase in both the number of gene product annotations and their granularity, thus improving the data available in the GO database. The manual annotations we describe will impact on the functional analysis of a variety of future autism-relevant datasets. Comprehensive gene annotation is an essential aspect of genomic and proteomic studies, as the quality of gene annotations incorporated into statistical analysis tools affects the effective interpretation of data obtained through genome wide association studies, next generation sequencing, proteomic and transcriptomic datasets.

  16. [Transcriptome analysis of Dunaliella viridis].

    PubMed

    Zhu, Shuai-qi; Gong, Yi-fu; Hang, Yu-qing; Liu, Hao; Wang, He-yu

    2015-08-01

    In order to understand the gene information, function, haloduric pathway (glycerolipid metabolism) and related key genes of Dunaliella viridis, we used Illumina HiSeq™ 2000 high-throughput sequencing technology to sequence its transcriptome. Trinity software was used to assemble the data into transcripts. Based on the Clusters of Orthologous Groups (COG), Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) databases, we carried out functional annotation and classification, pathway annotation, and open reading frame (ORF) sequence prediction of the transcripts. The key genes in glycerolipid metabolism were analyzed. In total, 81,593 transcripts were found, and 77,117 ORF sequences were predicted, accounting for 94.50% of all transcripts. COG classification results showed that 16,569 transcripts were assigned to 24 categories. GO classification annotated 76,436 transcripts. The number of transcripts for biological processes was 30,678, accounting for 40.14% of all transcripts. KEGG pathway analysis showed that 26,428 transcripts were annotated to 317 pathways, and 131 pathways were related to metabolism, accounting for 41.32% of all annotated pathways. Only one transcript was annotated as coding the key enzyme dihydroxyacetone kinase involved in the glycerolipid pathway. This enzyme could be related to glycerol biosynthesis under salt stress. This study further improved the gene information and laid the foundation for metabolic pathway research in Dunaliella viridis.

  17. Genome re-annotation of the wild strawberry Fragaria vesca using extensive Illumina- and SMRT-based RNA-seq datasets

    PubMed Central

    Li, Yongping; Wei, Wei; Feng, Jia; Luo, Huifeng; Pi, Mengting; Liu, Zhongchi; Kang, Chunying

    2018-01-01

    The genome of the wild diploid strawberry species Fragaria vesca, an ideal model system for cultivated strawberry (Fragaria × ananassa, octoploid) and other Rosaceae family crops, was first published in 2011 and followed by a new assembly (Fvb). However, the annotation for Fvb mainly relied on ab initio predictions and included only predicted coding sequences; an improved annotation is therefore highly desirable. Here, a new annotation version named v2.0.a2 was created for the Fvb genome by a pipeline utilizing one PacBio library, 90 Illumina RNA-seq libraries, and 9 small RNA-seq libraries. Altogether, 18,641 genes (55.6% out of 33,538 genes) were augmented with information on the 5′ and/or 3′ UTRs, 13,168 (39.3%) protein-coding genes were modified or newly identified, and 7,370 genes were found to possess alternative isoforms. In addition, 1,938 long non-coding RNAs, 171 miRNAs, and 51,714 small RNA clusters were integrated into the annotation. This new annotation of F. vesca is substantially improved in both accuracy and integrity of gene predictions, which benefits gene functional studies in strawberry and comparative genomic analysis of other horticultural crops in the Rosaceae family. PMID:29036429

  18. Re-annotation, improved large-scale assembly and establishment of a catalogue of noncoding loci for the genome of the model brown alga Ectocarpus.

    PubMed

    Cormier, Alexandre; Avia, Komlan; Sterck, Lieven; Derrien, Thomas; Wucher, Valentin; Andres, Gwendoline; Monsoor, Misharl; Godfroy, Olivier; Lipinska, Agnieszka; Perrineau, Marie-Mathilde; Van De Peer, Yves; Hitte, Christophe; Corre, Erwan; Coelho, Susana M; Cock, J Mark

    2017-04-01

    The genome of the filamentous brown alga Ectocarpus was the first to be completely sequenced from within the brown algal group and has served as a key reference genome both for this lineage and for the stramenopiles. We present a complete structural and functional reannotation of the Ectocarpus genome. The large-scale assembly of the Ectocarpus genome was significantly improved and genome-wide gene re-annotation using extensive RNA-seq data improved the structure of 11 108 existing protein-coding genes and added 2030 new loci. A genome-wide analysis of splicing isoforms identified an average of 1.6 transcripts per locus. A large number of previously undescribed noncoding genes were identified and annotated, including 717 loci that produce long noncoding RNAs. Conservation of lncRNAs between Ectocarpus and another brown alga, the kelp Saccharina japonica, suggests that at least a proportion of these loci serve a function. Finally, a large collection of single nucleotide polymorphism-based markers was developed for genetic analyses. These resources are available through an updated and improved genome database. This study significantly improves the utility of the Ectocarpus genome as a high-quality reference for the study of many important aspects of brown algal biology and as a reference for genomic analyses across the stramenopiles. © 2016 The Authors. New Phytologist © 2016 New Phytologist Trust.

  19. Extracting Cross-Ontology Weighted Association Rules from Gene Ontology Annotations.

    PubMed

    Agapito, Giuseppe; Milano, Marianna; Guzzi, Pietro Hiram; Cannataro, Mario

    2016-01-01

    Gene Ontology (GO) is a structured repository of concepts (GO Terms) that are associated with one or more gene products through a process referred to as annotation. The analysis of annotated data is an important opportunity for bioinformatics. Among the different approaches to analysis, the use of association rules (AR) provides useful knowledge by discovering biologically relevant associations between GO terms that were not previously known. In a previous work, we introduced GO-WAR (Gene Ontology-based Weighted Association Rules), a methodology for extracting weighted association rules from ontology-based annotated datasets. We here adapt the GO-WAR algorithm to mine cross-ontology association rules, i.e., rules that involve GO terms present in the three sub-ontologies of GO. We conduct a deep performance evaluation of GO-WAR by mining publicly available GO annotated datasets, showing how GO-WAR outperforms current state of the art approaches.

  20. Escherichia coli K-12: a cooperatively developed annotation snapshot—2005

    PubMed Central

    Riley, Monica; Abe, Takashi; Arnaud, Martha B.; Berlyn, Mary K.B.; Blattner, Frederick R.; Chaudhuri, Roy R.; Glasner, Jeremy D.; Horiuchi, Takashi; Keseler, Ingrid M.; Kosuge, Takehide; Mori, Hirotada; Perna, Nicole T.; Plunkett, Guy; Rudd, Kenneth E.; Serres, Margrethe H.; Thomas, Gavin H.; Thomson, Nicholas R.; Wishart, David; Wanner, Barry L.

    2006-01-01

    The goal of this group project has been to coordinate and bring up-to-date information on all genes of Escherichia coli K-12. Annotation of the genome of an organism entails identification of genes, the boundaries of genes in terms of precise start and end sites, and description of the gene products. Known and predicted functions were assigned to each gene product on the basis of experimental evidence or sequence analysis. Since both kinds of evidence are constantly expanding, no annotation is complete at any moment in time. This is a snapshot analysis based on the most recent genome sequences of two E. coli K-12 bacteria. An accurate and up-to-date description of E. coli K-12 genes is of particular importance to the scientific community because experimentally determined properties of its gene products provide fundamental information for annotation of innumerable genes of other organisms. Availability of the complete genome sequence of two K-12 strains allows comparison of their genotypes and mutant status of alleles. PMID:16397293

  1. Using Comparative Genomics for Inquiry-Based Learning to Dissect Virulence of "Escherichia coli" O157:H7 and "Yersinia pestis"

    ERIC Educational Resources Information Center

    Baumler, David J.; Banta, Lois M.; Hung, Kai F.; Schwarz, Jodi A.; Cabot, Eric L.; Glasner, Jeremy D.; Perna, Nicole T.

    2012-01-01

    Genomics and bioinformatics are topics of increasing interest in undergraduate biological science curricula. Many existing exercises focus on gene annotation and analysis of a single genome. In this paper, we present two educational modules designed to enable students to learn and apply fundamental concepts in comparative genomics using examples…

  2. Gene function prediction based on the Gene Ontology hierarchical structure.

    PubMed

    Cheng, Liangxi; Lin, Hongfei; Hu, Yuncui; Wang, Jian; Yang, Zhihao

    2014-01-01

    The information in Gene Ontology annotation is helpful in the explanation of life science phenomena, and can provide great support for research in the biomedical field. The use of the Gene Ontology is gradually affecting the way people store and understand bioinformatic data. To facilitate the prediction of gene functions with the aid of text mining methods and existing resources, we transform it into a multi-label top-down classification problem and develop a method that uses the hierarchical relationships in the Gene Ontology structure to relieve the quantitative imbalance of positive and negative training samples. Meanwhile, the method enhances the discriminating ability of classifiers by retaining and highlighting the key training samples. Additionally, the top-down classifier based on a tree structure takes the relationship of target classes into consideration and thus solves the incompatibility between the classification results and the Gene Ontology structure. Our experiment on the Gene Ontology annotation corpus achieves an F-value performance of 50.7% (precision: 52.7%, recall: 48.9%). The experimental results demonstrate that when the size of the training set is small, it can be expanded via topological propagation of associated documents between the parent and child nodes in the tree structure. The top-down classification model applies to the set of texts in an ontology structure or with a hierarchical relationship.
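
    The reported F-value can be checked against the stated precision and recall. The short Python sketch below assumes that "F-value" here means the balanced F-measure (the harmonic mean of precision and recall); the abstract does not say so explicitly, and the function name `f_measure` is only illustrative.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall.

    Assumes equal weighting of precision and recall (F1); this is an
    interpretation of the abstract's "F-value", not something it states.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Precision and recall as percentages, taken from the abstract.
    print(round(f_measure(52.7, 48.9), 1))
```

    Under that assumption the computed value rounds to 50.7, which agrees with the figure given in the abstract.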

  3. Improved methods and resources for paramecium genomics: transcription units, gene annotation and gene expression.

    PubMed

    Arnaiz, Olivier; Van Dijk, Erwin; Bétermier, Mireille; Lhuillier-Akakpo, Maoussi; de Vanssay, Augustin; Duharcourt, Sandra; Sallet, Erika; Gouzy, Jérôme; Sperling, Linda

    2017-06-26

    The 15 sibling species of the Paramecium aurelia cryptic species complex emerged after a whole genome duplication that occurred tens of millions of years ago. Given extensive knowledge of the genetics and epigenetics of Paramecium acquired over the last century, this species complex offers a uniquely powerful system to investigate the consequences of whole genome duplication in a unicellular eukaryote as well as the genetic and epigenetic mechanisms that drive speciation. High quality Paramecium gene models are important for research using this system. The major aim of the work reported here was to build an improved gene annotation pipeline for the Paramecium lineage. We generated oriented RNA-Seq transcriptome data across the sexual process of autogamy for the model species Paramecium tetraurelia. We determined, for the first time in a ciliate, candidate P. tetraurelia transcription start sites using an adapted Cap-Seq protocol. We developed TrUC, multi-threaded Perl software that, in conjunction with TopHat mapping of RNA-Seq data to a reference genome, predicts transcription units for the annotation pipeline. We used EuGene software to combine annotation evidence. The high quality gene structural annotations obtained for P. tetraurelia were used as evidence to improve published annotations for 3 other Paramecium species. The RNA-Seq data were also used for differential gene expression analysis, providing a gene expression atlas that is more sensitive than the previously established microarray resource. We have developed a gene annotation pipeline tailored for the compact genomes and tiny introns of Paramecium species. A novel component of this pipeline, TrUC, predicts transcription units using Cap-Seq and oriented RNA-Seq data. TrUC could prove useful beyond Paramecium, especially in the case of high gene density. Accurate predictions of 3' and 5' UTR will be particularly valuable for studies of gene expression (e.g. nucleosome positioning, identification of cis regulatory motifs). The P. tetraurelia improved transcriptome resource, gene annotations for P. tetraurelia, P. biaurelia, P. sexaurelia and P. caudatum, and Paramecium-trained EuGene configuration are available through ParameciumDB (http://paramecium.i2bc.paris-saclay.fr). TrUC software is freely distributed under a GNU GPL v3 licence (https://github.com/oarnaiz/TrUC).

  4. Approaches to Fungal Genome Annotation

    PubMed Central

    Haas, Brian J.; Zeng, Qiandong; Pearson, Matthew D.; Cuomo, Christina A.; Wortman, Jennifer R.

    2011-01-01

    Fungal genome annotation is the starting point for analysis of genome content. This generally involves the application of diverse methods to identify features on a genome assembly such as protein-coding and non-coding genes, repeats and transposable elements, and pseudogenes. Here we describe tools and methods leveraged for eukaryotic genome annotation with a focus on the annotation of fungal nuclear and mitochondrial genomes. We highlight the application of the latest technologies and tools to improve the quality of predicted gene sets. The Broad Institute eukaryotic genome annotation pipeline is described as one example of how such methods and tools are integrated into a sequencing center’s production genome annotation environment. PMID:22059117

  5. Assembly and comparison of two closely related Brassica napus genomes.

    PubMed

    Bayer, Philipp E; Hurgobin, Bhavna; Golicz, Agnieszka A; Chan, Chon-Kit Kenneth; Yuan, Yuxuan; Lee, HueyTyng; Renton, Michael; Meng, Jinling; Li, Ruiyuan; Long, Yan; Zou, Jun; Bancroft, Ian; Chalhoub, Boulos; King, Graham J; Batley, Jacqueline; Edwards, David

    2017-12-01

    As an increasing number of plant genome sequences become available, it is clear that gene content varies between individuals, and the challenge arises to predict the gene content of a species. However, genome comparison is often confounded by variation in assembly and annotation. Differentiating between true gene absence and variation in assembly or annotation is essential for the accurate identification of conserved and variable genes in a species. Here, we present the de novo assembly of the B. napus cultivar Tapidor and comparison with an improved assembly of the Brassica napus cultivar Darmor-bzh. Both cultivars were annotated using the same method to allow comparison of gene content. We identified genes unique to each cultivar and differentiated these from artefacts due to variation in the assembly and annotation. We demonstrate that using a common annotation pipeline can result in different gene predictions, even for closely related cultivars, and that repeat regions which collapse during assembly impact whole genome comparison. After accounting for differences in assembly and annotation, we demonstrate that the genome of Darmor-bzh contains a greater number of genes than the genome of Tapidor. Our results are the first step towards comparison of the true differences between B. napus genomes and highlight the potential sources of error in future production of a B. napus pangenome. © 2017 The Authors. Plant Biotechnology Journal published by Society for Experimental Biology and The Association of Applied Biologists and John Wiley & Sons Ltd.

  6. The Co-regulation Data Harvester: Automating gene annotation starting from a transcriptome database

    NASA Astrophysics Data System (ADS)

    Tsypin, Lev M.; Turkewitz, Aaron P.

    Identifying co-regulated genes provides a useful approach for defining pathway-specific machinery in an organism. To be efficient, this approach relies on thorough genome annotation, a process much slower than genome sequencing per se. Tetrahymena thermophila, a unicellular eukaryote, has been a useful model organism and has a fully sequenced but sparsely annotated genome. One important resource for studying this organism has been an online transcriptomic database. We have developed an automated approach to gene annotation in the context of transcriptome data in T. thermophila, called the Co-regulation Data Harvester (CDH). Beginning with a gene of interest, the CDH identifies co-regulated genes by accessing the Tetrahymena transcriptome database. It then identifies their closely related genes (orthologs) in other organisms by using reciprocal BLAST searches. Finally, it collates the annotations of those orthologs' functions, which provides the user with information to help predict the cellular role of the initial query. The CDH, which is freely available, represents a powerful new tool for analyzing cell biological pathways in Tetrahymena. Moreover, to the extent that genes and pathways are conserved between organisms, the inferences obtained via the CDH should be relevant, and can be explored, in many other systems.
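
    The ortholog step described above can be illustrated with a minimal sketch of reciprocal best-hit matching. This is not the CDH implementation: the abstract only says "reciprocal BLAST searches", so the best-hit criterion, the bit-score tie-breaking, and all gene identifiers below are assumptions made for illustration. Hits are assumed to have already been parsed into (query, subject, score) tuples.

```python
from typing import Dict, List, Tuple

# (query_id, subject_id, bit_score); a hypothetical, already-parsed hit format.
Hit = Tuple[str, str, float]


def best_hits(hits: List[Hit]) -> Dict[str, str]:
    """Map each query to its single highest-scoring subject."""
    best: Dict[str, Tuple[str, float]] = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(forward: List[Hit], reverse: List[Hit]) -> List[Tuple[str, str]]:
    """Keep pairs (a, b) where b is a's best hit and a is b's best hit."""
    fwd = best_hits(forward)
    rev = best_hits(reverse)
    return sorted((a, b) for a, b in fwd.items() if rev.get(b) == a)


# Hypothetical identifiers and scores, for illustration only.
forward = [
    ("TTHERM_001", "YFG1", 250.0),
    ("TTHERM_001", "YFG2", 90.0),
    ("TTHERM_002", "YFG2", 120.0),
]
reverse = [
    ("YFG1", "TTHERM_001", 245.0),
    ("YFG2", "TTHERM_001", 95.0),
    ("YFG2", "TTHERM_002", 80.0),
]
pairs = reciprocal_best_hits(forward, reverse)
print(pairs)
```

    In this toy input, TTHERM_002 finds YFG2 as its best hit, but YFG2 prefers TTHERM_001, so only one pair survives the reciprocity check.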

  7. Guidelines for the functional annotation of microRNAs using the Gene Ontology

    PubMed Central

    D'Eustachio, Peter; Smith, Jennifer R.; Zampetaki, Anna

    2016-01-01

    MicroRNA regulation of developmental and cellular processes is a relatively new field of study, and the available research data have not been organized to enable its inclusion in pathway and network analysis tools. The association of gene products with terms from the Gene Ontology is an effective method to analyze functional data, but until recently there has been no substantial effort dedicated to applying Gene Ontology terms to microRNAs. Consequently, when performing functional analysis of microRNA data sets, researchers have had to rely instead on the functional annotations associated with the genes encoding microRNA targets. In consultation with experts in the field of microRNA research, we have created comprehensive recommendations for the Gene Ontology curation of microRNAs. This curation manual will enable provision of a high-quality, reliable set of functional annotations for the advancement of microRNA research. Here we describe the key aspects of the work, including development of the Gene Ontology to represent this data, standards for describing the data, and guidelines to support curators making these annotations. The full microRNA curation guidelines are available on the GO Consortium wiki (http://wiki.geneontology.org/index.php/MicroRNA_GO_annotation_manual). PMID:26917558

  8. Rice Annotation Project Database (RAP-DB): an integrative and interactive database for rice genomics.

    PubMed

    Sakai, Hiroaki; Lee, Sung Shin; Tanaka, Tsuyoshi; Numa, Hisataka; Kim, Jungsok; Kawahara, Yoshihiro; Wakimoto, Hironobu; Yang, Ching-chia; Iwamoto, Masao; Abe, Takashi; Yamada, Yuko; Muto, Akira; Inokuchi, Hachiro; Ikemura, Toshimichi; Matsumoto, Takashi; Sasaki, Takuji; Itoh, Takeshi

    2013-02-01

    The Rice Annotation Project Database (RAP-DB, http://rapdb.dna.affrc.go.jp/) has been providing a comprehensive set of gene annotations for the genome sequence of rice, Oryza sativa (japonica group) cv. Nipponbare. Since the first release in 2005, RAP-DB has been updated several times along with the genome assembly updates. Here, we present our newest RAP-DB based on the latest genome assembly, Os-Nipponbare-Reference-IRGSP-1.0 (IRGSP-1.0), which was released in 2011. We detected 37,869 loci by mapping transcript and protein sequences of 150 monocot species. To provide plant researchers with highly reliable and up to date rice gene annotations, we have been incorporating literature-based manually curated data, and 1,626 loci currently incorporate literature-based annotation data, including commonly used gene names or gene symbols. Transcriptional activities are shown at the nucleotide level by mapping RNA-Seq reads derived from 27 samples. We also mapped the Illumina reads of a Japanese leading japonica cultivar, Koshihikari, and a Chinese indica cultivar, Guangluai-4, to the genome and show alignments together with the single nucleotide polymorphisms (SNPs) and gene functional annotations through a newly developed browser, Short-Read Assembly Browser (S-RAB). We have developed two satellite databases, Plant Gene Family Database (PGFD) and Integrative Database of Cereal Gene Phylogeny (IDCGP), which display gene family and homologous gene relationships among diverse plant species. RAP-DB and the satellite databases offer simple and user-friendly web interfaces, enabling plant and genome researchers to access the data easily and facilitating a broad range of plant research topics.

  9. Proteogenomics produces comprehensive and highly accurate protein-coding gene annotation in a complete genome assembly of Malassezia sympodialis.

    PubMed

    Zhu, Yafeng; Engström, Pär G; Tellgren-Roth, Christian; Baudo, Charles D; Kennell, John C; Sun, Sheng; Billmyre, R Blake; Schröder, Markus S; Andersson, Anna; Holm, Tina; Sigurgeirsson, Benjamin; Wu, Guangxi; Sankaranarayanan, Sundar Ram; Siddharthan, Rahul; Sanyal, Kaustuv; Lundeberg, Joakim; Nystedt, Björn; Boekhout, Teun; Dawson, Thomas L; Heitman, Joseph; Scheynius, Annika; Lehtiö, Janne

    2017-03-17

    Complete and accurate genome assembly and annotation is a crucial foundation for comparative and functional genomics. Despite this, few complete eukaryotic genomes are available, and genome annotation remains a major challenge. Here, we present a complete genome assembly of the skin commensal yeast Malassezia sympodialis and demonstrate how proteogenomics can substantially improve gene annotation. Through long-read DNA sequencing, we obtained a gap-free genome assembly for M. sympodialis (ATCC 42132), comprising eight nuclear and one mitochondrial chromosome. We also sequenced and assembled four M. sympodialis clinical isolates, and showed their value for understanding Malassezia reproduction by confirming four alternative allele combinations at the two mating-type loci. Importantly, we demonstrated how proteomics data could be readily integrated with transcriptomics data in standard annotation tools. This increased the number of annotated protein-coding genes by 14% (from 3612 to 4113), compared to using transcriptomics evidence alone. Manual curation further increased the number of protein-coding genes by 9% (to 4493). All of these genes have RNA-seq evidence and 87% were confirmed by proteomics. The M. sympodialis genome assembly and annotation presented here is at a quality yet achieved only for a few eukaryotic organisms, and constitutes an important reference for future host-microbe interaction studies. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.

  10. Semantic Similarity in Biomedical Ontologies

    PubMed Central

    Pesquita, Catia; Faria, Daniel; Falcão, André O.; Lord, Phillip; Couto, Francisco M.

    2009-01-01

    In recent years, ontologies have become a mainstream topic in biomedical research. When biological entities are described using a common schema, such as an ontology, they can be compared by means of their annotations. This type of comparison is called semantic similarity, since it assesses the degree of relatedness between two entities by the similarity in meaning of their annotations. The application of semantic similarity to biomedical ontologies is recent; nevertheless, several studies have been published in the last few years describing and evaluating diverse approaches. Semantic similarity has become a valuable tool for validating the results drawn from biomedical studies such as gene clustering, gene expression data analysis, prediction and validation of molecular interactions, and disease gene prioritization. We review semantic similarity measures applied to biomedical ontologies and propose their classification according to the strategies they employ: node-based versus edge-based and pairwise versus groupwise. We also present comparative assessment studies and discuss the implications of their results. We survey the existing implementations of semantic similarity measures, and we describe examples of applications to biomedical research. This will clarify how biomedical researchers can benefit from semantic similarity measures and help them choose the approach most suitable for their studies. Biomedical ontologies are evolving toward increased coverage, formality, and integration, and their use for annotation is increasingly becoming a focus of both effort by biomedical experts and application of automated annotation procedures to create corpora of higher quality and completeness than are currently available. Given that semantic similarity measures are directly dependent on these evolutions, we can expect to see them gaining more relevance and even becoming as essential as sequence similarity is today in biomedical research. PMID:19649320
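
    The node-based/pairwise and groupwise strategies named above can be made concrete with a small sketch. The abstract does not specify any formula, so the two measures below are assumptions chosen as common representatives: an information-content score of the most informative common ancestor (in the spirit of Resnik's measure) and a Jaccard-style overlap of ancestor sets. The toy ontology, term names, and annotation counts are hypothetical.

```python
import math
from typing import Dict, Iterable, Set

# Toy ontology (hypothetical terms): child -> set of direct parents.
PARENTS: Dict[str, Set[str]] = {
    "root": set(),
    "binding": {"root"},
    "catalysis": {"root"},
    "dna_binding": {"binding"},
    "rna_binding": {"binding"},
}

# Hypothetical annotation counts; each count includes annotations to descendants.
ANNOTATION_COUNTS = {
    "root": 100,
    "binding": 40,
    "catalysis": 60,
    "dna_binding": 10,
    "rna_binding": 20,
}


def ancestors(term: str) -> Set[str]:
    """Return the term itself plus all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content(term: str) -> float:
    """Node-based specificity: rarer terms carry more information."""
    return -math.log(ANNOTATION_COUNTS[term] / ANNOTATION_COUNTS["root"])


def pairwise_node_based(a: str, b: str) -> float:
    """Information content of the most informative common ancestor."""
    return max(information_content(t) for t in ancestors(a) & ancestors(b))


def groupwise_overlap(terms_a: Iterable[str], terms_b: Iterable[str]) -> float:
    """Jaccard index of the two ancestor-extended annotation sets."""
    ext_a = set().union(*(ancestors(t) for t in terms_a))
    ext_b = set().union(*(ancestors(t) for t in terms_b))
    return len(ext_a & ext_b) / len(ext_a | ext_b)


print(pairwise_node_based("dna_binding", "rna_binding"))
print(groupwise_overlap({"dna_binding"}, {"rna_binding"}))
```

    The pairwise function compares two single terms, whereas the groupwise function compares whole annotation sets at once; an edge-based measure would instead count the edges on the path between terms.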

  11. Automatic annotation of protein motif function with Gene Ontology terms.

    PubMed

    Lu, Xinghua; Zhai, Chengxiang; Gopalakrishnan, Vanathi; Buchanan, Bruce G

    2004-09-02

    Conserved protein sequence motifs are short stretches of amino acid sequence patterns that potentially encode the function of proteins. Several sequence pattern searching algorithms and programs exist for identifying candidate protein motifs at the whole genome level. However, a much-needed and important task is to determine the functions of the newly identified protein motifs. The Gene Ontology (GO) project is an endeavor to annotate the function of genes or protein sequences with terms from a dynamic, controlled vocabulary, and these annotations serve well as a knowledge base. This paper presents methods to mine the GO knowledge base and use the association between the GO terms assigned to a sequence and the motifs matched by the same sequence as evidence for predicting the functions of novel protein motifs automatically. The task of assigning GO terms to protein motifs is viewed as both a binary classification and an information retrieval problem, where PROSITE motifs are used as samples for model training and functional prediction. The mutual information of a motif and a GO term association is found to be a very useful feature. We take advantage of the known motifs to train a logistic regression classifier, which allows us to combine mutual information with other frequency-based features and obtain a probability of correct association. The trained logistic regression model has intuitively meaningful and logically plausible parameter values, and performs very well empirically according to our evaluation criteria. In this research, different methods for automatic annotation of protein motifs have been investigated. Empirical results demonstrated that the methods have great potential for detecting and augmenting information about the functions of newly discovered candidate protein motifs.
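
    To illustrate the mutual-information feature mentioned above, the following sketch computes mutual information between two binary events (a sequence matches a motif; a sequence is annotated with a GO term) from a 2x2 table of sequence counts. The abstract does not give the authors' exact formulation, so the standard binary-event definition is assumed here, and the counts are invented for illustration.

```python
import math


def mutual_information(n11: int, n10: int, n01: int, n00: int) -> float:
    """Mutual information (in bits) of two binary events from 2x2 counts.

    n11: sequences with the motif and the GO term
    n10: sequences with the motif but not the GO term
    n01: sequences with the GO term but not the motif
    n00: sequences with neither
    """
    total = n11 + n10 + n01 + n00
    table = [[n11, n10], [n01, n00]]
    row_sums = [n11 + n10, n01 + n00]  # motif present / absent
    col_sums = [n11 + n01, n10 + n00]  # GO term present / absent
    mi = 0.0
    for i in range(2):
        for j in range(2):
            n = table[i][j]
            if n == 0:
                continue  # an empty cell contributes nothing
            mi += (n / total) * math.log2(n * total / (row_sums[i] * col_sums[j]))
    return mi


# Hypothetical counts: motif and term always co-occur vs. occur independently.
print(mutual_information(50, 0, 0, 50))
print(mutual_information(25, 25, 25, 25))
```

    A score near zero means that knowing whether the motif matches says little about the GO term, which is why such a value can serve as one input feature to a classifier alongside frequency-based features.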

  12. An efficient annotation and gene-expression derivation tool for Illumina Solexa datasets

    PubMed Central

    2010-01-01

    Background An Illumina flow cell with all eight lanes occupied produces well over a terabyte of images, with gigabytes of reads following sequence alignment. The ability to translate such reads into meaningful annotation is therefore of great concern and importance. One can very easily be flooded with a great volume of textual, unannotated data, irrespective of read quality or size. CASAVA, an optional analysis tool for Illumina sequencing experiments, provides INDEL detection, SNP information, and allele calling. Extracting from such analysis a measure of gene expression in the form of tag-counts, and furthermore annotating the reads, is therefore of significant value. Findings We developed TASE (Tag counting and Analysis of Solexa Experiments), a rapid tag-counting and annotation software tool specifically designed for Illumina CASAVA sequencing datasets. Developed in Java and deployed using the jTDS JDBC driver and a SQL Server backend, TASE provides an extremely fast means of calculating gene expression through tag-counts while annotating sequenced reads with the gene's presumed function, from any given CASAVA-build. Such a build is generated for both DNA and RNA sequencing. Analysis is broken into two distinct components: DNA sequence or read concatenation, followed by tag-counting and annotation. The end result is output containing the homology-based functional annotation and the respective gene expression measure, signifying how many times sequenced reads were found within the genomic ranges of functional annotations. Conclusions TASE is a powerful tool to facilitate the process of annotating a given Illumina Solexa sequencing dataset. Our results indicate that both homology-based annotation and tag-count analysis are achieved in very efficient times, allowing researchers to delve deep into a given CASAVA-build and maximize information extraction from a sequencing dataset. TASE is specially designed to translate sequence data in a CASAVA-build into functional annotations while producing corresponding gene expression measurements. The analysis runs in an ultrafast and highly efficient manner, whether for a single-read or a paired-end sequencing experiment. TASE is a user-friendly and freely available application, allowing rapid analysis and annotation of any given Illumina Solexa sequencing dataset with ease. PMID:20598141
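
    The tag-counting idea, counting how many sequenced reads fall within the genomic range of each functional annotation, can be sketched as follows. This is not TASE itself, which the abstract describes as a Java application with a SQL Server backend; the Python data structures, coordinates, and gene labels here are hypothetical, and a read is simplified to a single alignment start position.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# (chromosome, start, end, gene label); coordinates are 1-based and inclusive.
Annotation = Tuple[str, int, int, str]
# (chromosome, alignment start position); a simplified stand-in for an aligned read.
Read = Tuple[str, int]


def count_tags(reads: Iterable[Read], annotations: List[Annotation]) -> Counter:
    """Count reads whose start position lies inside each annotated range."""
    counts: Counter = Counter()
    for chrom, pos in reads:
        for a_chrom, start, end, gene in annotations:
            if chrom == a_chrom and start <= pos <= end:
                counts[gene] += 1
    return counts


# Hypothetical annotations and reads, for illustration only.
annotations = [
    ("chr1", 100, 200, "geneA"),
    ("chr1", 500, 650, "geneB"),
]
reads = [("chr1", 150), ("chr1", 199), ("chr1", 300), ("chr1", 500), ("chr2", 150)]
tag_counts = count_tags(reads, annotations)
print(dict(tag_counts))
```

    The nested loop is kept for clarity; for datasets of the size the abstract describes, a sorted or indexed lookup (or a database query) would be needed in place of the linear scan.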

  13. An integrated and comparative approach towards identification, characterization and functional annotation of candidate genes for drought tolerance in sorghum (Sorghum bicolor (L.) Moench).

    PubMed

    Woldesemayat, Adugna Abdi; Van Heusden, Peter; Ndimba, Bongani K; Christoffels, Alan

    2017-12-22

    Drought is the most disastrous abiotic stress that severely affects agricultural productivity worldwide. Understanding the biological basis of drought-regulated traits requires identification and in-depth characterization of genetic determinants using model organisms and high-throughput technologies. However, studies on drought tolerance have generally been limited to the traditional candidate gene approach, which targets only a single gene in a pathway that is related to a trait. In this study, we used sorghum, one of the model crops that is well adapted to arid regions, to mine genes and define determinants for drought tolerance using drought expression libraries and RNA-seq data. We provide an integrated and comparative in silico candidate gene identification, characterization and annotation approach, with an emphasis on genes playing a prominent role in conferring drought tolerance in sorghum. A total of 470 non-redundant functionally annotated drought responsive genes (DRGs) were identified using experimental data from drought responses by employing pairwise sequence similarity searches, pathway and InterPro domain analysis, expression profiling and orthology relation. Comparison of the genomic locations between these genes and sorghum quantitative trait loci (QTLs) showed that 40% of these genes were co-localized with QTLs known for drought tolerance. The genome reannotation conducted using the Program to Assemble Spliced Alignments (PASA) resulted in 9.6% of existing single gene models being updated. In addition, 210 putative novel genes were identified using AUGUSTUS- and PASA-based analysis of the expression dataset. Among these, 50% were single exonic, 69.5% were drought responsive and 5.7% were complete gene structure models. Analysis of biochemical metabolism revealed 14 metabolic pathways that are related to drought tolerance and also had a strong biological network among the categories of genes involved. Identification of these pathways signifies the interplay of biochemical reactions that make up the metabolic network, constituting a fundamental interface for the sorghum defence mechanism against drought stress. This study suggests untapped natural variability in sorghum that could be used for developing drought tolerance. The data presented here may be regarded as an initial reference point in functional and comparative genomics in the Gramineae family.

  14. De novo leaf and root transcriptome analysis to identify putative genes involved in triterpenoid saponins biosynthesis in Hedera helix L.

    PubMed Central

    Li, Fang; Xu, Zijian; Sun, Mengli; Cong, Hanqing; Qiao, Fei; Zhong, Xiaohong

    2017-01-01

    Hedera helix L. is an important traditional medicinal plant in Europe. The main active components are triterpenoid saponins, but none of the potential enzymes involved in triterpenoid saponins biosynthesis have been discovered and annotated. Here is reported the first study of global transcriptome analyses using the Illumina HiSeq™ 2500 platform for H. helix. In total, over 24 million clean reads were produced and 96,333 unigenes were assembled, with an average length of 1385 nt; more than 79,085 unigenes had at least one significant match to an existing gene model. Differentially Expressed Gene analysis identified 6,222 and 7,012 unigenes which were expressed either higher or lower in leaf samples when compared with roots. After functional annotation and classification, two pathways and 410 unigenes related to triterpenoid saponins biosynthesis were discovered. The accuracy of these de novo sequences was validated by RT-qPCR analysis and a RACE clone. These data will enrich our knowledge of triterpenoid saponin biosynthesis and provide a theoretical foundation for molecular research on H. helix. PMID:28771546

  15. Comparison of three microarray probe annotation pipelines: differences in strategies and their effect on downstream analysis

    PubMed Central

    Neerincx, Pieter BT; Casel, Pierrot; Prickett, Dennis; Nie, Haisheng; Watson, Michael; Leunissen, Jack AM; Groenen, Martien AM; Klopp, Christophe

    2009-01-01

    Background Reliable annotation linking oligonucleotide probes to target genes is essential for functional biological analysis of microarray experiments. We used the IMAD, OligoRAP and sigReannot pipelines to update the annotation for the ARK-Genomics Chicken 20 K array as part of a joined EADGENE/SABRE workshop. In this manuscript we compare their annotation strategies and results. Furthermore, we analyse the effect of differences in updated annotation on functional analysis for an experiment involving Eimeria infected chickens and finally we propose guidelines for optimal annotation strategies. Results IMAD, OligoRAP and sigReannot update both annotation and estimated target specificity. The 3 pipelines can assign oligos to target specificity categories although with varying degrees of resolution. Target specificity is judged based on the amount and type of oligo versus target-gene alignments (hits), which are determined by filter thresholds that users can adjust based on their experimental conditions. Linking oligos to annotation on the other hand is based on rigid rules, which differ between pipelines. For 52.7% of the oligos from a subset selected for in depth comparison all pipelines linked to one or more Ensembl genes with consensus on 44.0%. In 31.0% of the cases none of the pipelines could assign an Ensembl gene to an oligo and for the remaining 16.3% the coverage differed between pipelines. Differences in updated annotation were mainly due to different thresholds for hybridisation potential filtering of oligo versus target-gene alignments and different policies for expanding annotation using indirect links. The differences in updated annotation packages had a significant effect on GO term enrichment analysis with consensus on only 67.2% of the enriched terms. 
Conclusion In addition to flexible thresholds to determine target specificity, annotation tools should provide metadata describing the relationships between oligos and the annotation assigned to them. These relationships can then be used to judge the varying degrees of reliability allowing users to fine-tune the balance between reliability and coverage. This is important as it can have a significant effect on functional microarray analysis as exemplified by the lack of consensus on almost one third of the terms found with GO term enrichment analysis based on updated IMAD, OligoRAP or sigReannot annotation. PMID:19615109

  16. Comparison of three microarray probe annotation pipelines: differences in strategies and their effect on downstream analysis.

    PubMed

    Neerincx, Pieter Bt; Casel, Pierrot; Prickett, Dennis; Nie, Haisheng; Watson, Michael; Leunissen, Jack Am; Groenen, Martien Am; Klopp, Christophe

    2009-07-16

    Reliable annotation linking oligonucleotide probes to target genes is essential for functional biological analysis of microarray experiments. We used the IMAD, OligoRAP and sigReannot pipelines to update the annotation for the ARK-Genomics Chicken 20 K array as part of a joined EADGENE/SABRE workshop. In this manuscript we compare their annotation strategies and results. Furthermore, we analyse the effect of differences in updated annotation on functional analysis for an experiment involving Eimeria infected chickens and finally we propose guidelines for optimal annotation strategies. IMAD, OligoRAP and sigReannot update both annotation and estimated target specificity. The 3 pipelines can assign oligos to target specificity categories although with varying degrees of resolution. Target specificity is judged based on the amount and type of oligo versus target-gene alignments (hits), which are determined by filter thresholds that users can adjust based on their experimental conditions. Linking oligos to annotation on the other hand is based on rigid rules, which differ between pipelines. For 52.7% of the oligos from a subset selected for in depth comparison all pipelines linked to one or more Ensembl genes with consensus on 44.0%. In 31.0% of the cases none of the pipelines could assign an Ensembl gene to an oligo and for the remaining 16.3% the coverage differed between pipelines. Differences in updated annotation were mainly due to different thresholds for hybridisation potential filtering of oligo versus target-gene alignments and different policies for expanding annotation using indirect links. The differences in updated annotation packages had a significant effect on GO term enrichment analysis with consensus on only 67.2% of the enriched terms. In addition to flexible thresholds to determine target specificity, annotation tools should provide metadata describing the relationships between oligos and the annotation assigned to them. 
These relationships can then be used to judge the varying degrees of reliability allowing users to fine-tune the balance between reliability and coverage. This is important as it can have a significant effect on functional microarray analysis as exemplified by the lack of consensus on almost one third of the terms found with GO term enrichment analysis based on updated IMAD, OligoRAP or sigReannot annotation.

  17. Towards a complete map of the human long non-coding RNA transcriptome.

    PubMed

    Uszczynska-Ratajczak, Barbara; Lagarde, Julien; Frankish, Adam; Guigó, Roderic; Johnson, Rory

    2018-05-23

    Gene maps, or annotations, enable us to navigate the functional landscape of our genome. They are a resource upon which virtually all studies depend, from single-gene to genome-wide scales and from basic molecular biology to medical genetics. Yet present-day annotations suffer from trade-offs between quality and size, with serious but often unappreciated consequences for downstream studies. This is particularly true for long non-coding RNAs (lncRNAs), which are poorly characterized compared to protein-coding genes. Long-read sequencing technologies promise to improve current annotations, paving the way towards a complete annotation of lncRNAs expressed throughout a human lifetime.

  18. Annotation of Ehux ESTs

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Kuo, Alan; Grigoriev, Igor

    2009-06-12

    22 percent of ESTs do not align with scaffolds. The EST Pipeline assembles 17126 consensi from the nonaligned ESTs. The Annotation Pipeline predicts 8564 ORFs on the consensi. Domain analysis of ORFs reveals missing genes. Cluster analysis reveals missing genes. Expression analysis reveals potential strain-specific genes.

  19. Introducing meta-services for biomedical information extraction

    PubMed Central

    Leitner, Florian; Krallinger, Martin; Rodriguez-Penagos, Carlos; Hakenberg, Jörg; Plake, Conrad; Kuo, Cheng-Ju; Hsu, Chun-Nan; Tsai, Richard Tzong-Han; Hung, Hsi-Chuan; Lau, William W; Johnson, Calvin A; Sætre, Rune; Yoshida, Kazuhiro; Chen, Yan Hua; Kim, Sun; Shin, Soo-Yong; Zhang, Byoung-Tak; Baumgartner, William A; Hunter, Lawrence; Haddow, Barry; Matthews, Michael; Wang, Xinglong; Ruch, Patrick; Ehrler, Frédéric; Özgür, Arzucan; Erkan, Güneş; Radev, Dragomir R; Krauthammer, Michael; Luong, ThaiBinh; Hoffmann, Robert; Sander, Chris; Valencia, Alfonso

    2008-01-01

    We introduce the first meta-service for information extraction in molecular biology, the BioCreative MetaServer (BCMS). This prototype platform is a joint effort of 13 research groups and provides automatically generated annotations for PubMed/Medline abstracts. Annotation types cover gene names, gene IDs, species, and protein-protein interactions. The annotations are distributed by the meta-server in both human and machine readable formats (HTML/XML). This service is intended to be used by biomedical researchers and database annotators, and in biomedical language processing. The platform allows direct comparison, unified access, and result aggregation of the annotations. PMID:18834497

  20. Prediction of gene expression in embryonic structures of Drosophila melanogaster.

    PubMed

    Samsonova, Anastasia A; Niranjan, Mahesan; Russell, Steven; Brazma, Alvis

    2007-07-01

    Understanding how sets of genes are coordinately regulated in space and time to generate the diversity of cell types that characterise complex metazoans is a major challenge in modern biology. The use of high-throughput approaches, such as large-scale in situ hybridisation and genome-wide expression profiling via DNA microarrays, is beginning to provide insights into the complexities of development. However, in many organisms the collection and annotation of comprehensive in situ localisation data is a difficult and time-consuming task. Here, we present a widely applicable computational approach, integrating developmental time-course microarray data with annotated in situ hybridisation studies, that facilitates the de novo prediction of tissue-specific expression for genes that have no in vivo gene expression localisation data available. Using a classification approach, trained with data from microarray and in situ hybridisation studies of gene expression during Drosophila embryonic development, we made a set of predictions on the tissue-specific expression of Drosophila genes that have not been systematically characterised by in situ hybridisation experiments. The reliability of our predictions is confirmed by literature-derived annotations in FlyBase, by overrepresentation of Gene Ontology biological process annotations, and, in a selected set, by detailed gene-specific studies from the literature. Our novel organism-independent method will be of considerable utility in enriching the annotation of gene function and expression in complex multicellular organisms.

  1. Prediction of Gene Expression in Embryonic Structures of Drosophila melanogaster

    PubMed Central

    Samsonova, Anastasia A; Niranjan, Mahesan; Russell, Steven; Brazma, Alvis

    2007-01-01

    Understanding how sets of genes are coordinately regulated in space and time to generate the diversity of cell types that characterise complex metazoans is a major challenge in modern biology. The use of high-throughput approaches, such as large-scale in situ hybridisation and genome-wide expression profiling via DNA microarrays, is beginning to provide insights into the complexities of development. However, in many organisms the collection and annotation of comprehensive in situ localisation data is a difficult and time-consuming task. Here, we present a widely applicable computational approach, integrating developmental time-course microarray data with annotated in situ hybridisation studies, that facilitates the de novo prediction of tissue-specific expression for genes that have no in vivo gene expression localisation data available. Using a classification approach, trained with data from microarray and in situ hybridisation studies of gene expression during Drosophila embryonic development, we made a set of predictions on the tissue-specific expression of Drosophila genes that have not been systematically characterised by in situ hybridisation experiments. The reliability of our predictions is confirmed by literature-derived annotations in FlyBase, by overrepresentation of Gene Ontology biological process annotations, and, in a selected set, by detailed gene-specific studies from the literature. Our novel organism-independent method will be of considerable utility in enriching the annotation of gene function and expression in complex multicellular organisms. PMID:17658945

  2. Simultaneous gene finding in multiple genomes.

    PubMed

    König, Stefanie; Romoth, Lars W; Gerischer, Lizzy; Stanke, Mario

    2016-11-15

    As the tree of life is populated with sequenced genomes ever more densely, the new challenge is the accurate and consistent annotation of entire clades of genomes. We address this problem with a new approach to comparative gene finding that takes a multiple genome alignment of closely related species and simultaneously predicts the location and structure of protein-coding genes in all input genomes, thereby exploiting negative selection and sequence conservation. The model prefers potential gene structures in the different genomes that are in agreement with each other, or, if not, where the exon gains and losses are plausible given the species tree. We formulate the multi-species gene finding problem as a binary labeling problem on a graph. The resulting optimization problem is NP-hard, but can be efficiently approximated using a subgradient-based dual decomposition approach. The proposed method was tested on whole-genome alignments of 12 vertebrate and 12 Drosophila species. The accuracy was evaluated for human, mouse and Drosophila melanogaster and compared to competing methods. Results suggest that our method is well-suited for annotation of (a large number of) genomes of closely related species within a clade, in particular, when RNA-Seq data are available for many of the genomes. The transfer of existing annotations from one genome to another via the genome alignment is more accurate than previous approaches that are based on protein-spliced alignments, when the genomes are at close to medium distances. The method is implemented in C++ as part of Augustus and available open source at http://bioinf.uni-greifswald.de/augustus/. Contact: stefaniekoenig@ymail.com or mario.stanke@uni-greifswald.de. Supplementary information: Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.

  3. Sequencing and comparative genomic analysis of 1227 Felis catus cDNA sequences enriched for developmental, clinical and nutritional phenotypes

    PubMed Central

    2012-01-01

    Background: The feline genome is valuable to the veterinary and model organism genomics communities because the cat is an obligate carnivore and a model for endangered felids. The initial public release of the Felis catus genome assembly provided a framework for investigating the genomic basis of feline biology. However, the entire set of protein coding genes has not been elucidated. Results: We identified and characterized 1227 protein coding feline sequences, of which 913 map to public sequences and 314 are novel. These sequences have been deposited into NCBI's GenBank database and complement public genomic resources by providing additional protein coding sequences that fill in some of the gaps in the feline genome assembly. Through functional and comparative genomic analyses, we gained an understanding of the role of these sequences in feline development, nutrition and health. Specifically, we identified 104 orthologs of human genes associated with Mendelian disorders. We detected negative selection within sequences with gene ontology annotations associated with intracellular trafficking, cytoskeleton and muscle functions. We detected relatively less negative selection on protein sequences encoding extracellular networks, apoptotic pathways and mitochondrial gene ontology annotations. Additionally, we characterized feline cDNA sequences that have mouse orthologs associated with clinical, nutritional and developmental phenotypes. Together, this analysis provides an overview of the value of our cDNA sequences and enhances our understanding of how the feline genome is similar to, and different from, other mammalian genomes. Conclusions: The cDNA sequences reported here expand existing feline genomic resources by providing high-quality sequences annotated with comparative genomic information providing functional, clinical, nutritional and orthologous gene information. PMID:22257742

  4. Functional Annotations of Paralogs: A Blessing and a Curse

    PubMed Central

    Zallot, Rémi; Harrison, Katherine J.; Kolaczkowski, Bryan; de Crécy-Lagard, Valérie

    2016-01-01

    Gene duplication followed by mutation is a classic mechanism of neofunctionalization, producing gene families with functional diversity. In some cases, a single point mutation is sufficient to change the substrate specificity and/or the chemistry performed by an enzyme, making it difficult to accurately separate enzymes with identical functions from homologs with different functions. Because sequence similarity is often used as a basis for assigning functional annotations to genes, non-isofunctional gene families pose a great challenge for genome annotation pipelines. Here we describe how integrating evolutionary and functional information such as genome context, phylogeny, metabolic reconstruction and signature motifs may be required to correctly annotate multifunctional families. These integrative analyses can also lead to the discovery of novel gene functions, as hints from specific subgroups can guide the functional characterization of other members of the family. We demonstrate how careful manual curation processes using comparative genomics can disambiguate subgroups within large multifunctional families and discover their functions. We present the COG0720 protein family as a case study. We also discuss strategies to automate this process to improve the accuracy of genome functional annotation pipelines. PMID:27618105

  5. GoGene: gene annotation in the fast lane.

    PubMed

    Plake, Conrad; Royer, Loic; Winnenburg, Rainer; Hakenberg, Jörg; Schroeder, Michael

    2009-07-01

    High-throughput screens such as microarrays and RNAi screens produce huge amounts of data. They typically result in hundreds of genes, which are often further explored and clustered via enriched GeneOntology terms. The strength of such analyses is that they build on high-quality manual annotations provided with the GeneOntology. However, the weakness is that annotations are restricted to process, function and location and that they do not cover all known genes in model organisms. GoGene addresses this weakness by complementing high-quality manual annotation with high-throughput text mining extracting co-occurrences of genes and ontology terms from literature. GoGene contains over 4,000,000 associations between genes and gene-related terms for 10 model organisms extracted from more than 18,000,000 PubMed entries. It does not cover only process, function and location of genes, but also biomedical categories such as diseases, compounds, techniques and mutations. By bringing it all together, GoGene provides the most recent and most complete facts about genes and can rank them according to novelty and importance. GoGene accepts keywords, gene lists, gene sequences and protein sequences as input and supports search for genes in PubMed, EntrezGene and via BLAST. Since all associations of genes to terms are supported by evidence in the literature, the results are transparent and can be verified by the user. GoGene is available at http://gopubmed.org/gogene.
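
The co-occurrence idea mentioned in the abstract above can be illustrated with a minimal sketch. This is not GoGene's implementation: the function name `cooccurrence_counts`, the whitespace tokenisation, and the toy abstracts are all assumptions made for illustration, and a real system would need gene-name normalisation and multi-word term matching.

```python
from collections import Counter
from itertools import product


def cooccurrence_counts(abstracts, genes, terms):
    """Count the abstracts in which a gene name and a term both appear.

    Hypothetical helper: matching is a naive, case-insensitive comparison
    of whitespace-separated tokens, so only single-word names are found.
    """
    counts = Counter()
    for text in abstracts:
        tokens = set(text.lower().split())
        found_genes = [g for g in genes if g.lower() in tokens]
        found_terms = [t for t in terms if t.lower() in tokens]
        # Every gene/term pair seen in the same abstract counts once.
        for pair in product(found_genes, found_terms):
            counts[pair] += 1
    return counts


# Toy input, invented for this example.
abstracts = [
    "TP53 regulates apoptosis in tumour cells",
    "BRCA1 is involved in DNA repair and apoptosis",
    "TP53 loss impairs apoptosis",
]
counts = cooccurrence_counts(abstracts, ["TP53", "BRCA1"], ["apoptosis", "repair"])
for (gene, term), n in sorted(counts.items()):
    print(gene, term, n)
```

Because each association is tied to the abstracts that produced it, a count like this can be traced back to its supporting literature, which is the transparency property the abstract emphasises.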

  6. A computational genomics pipeline for prokaryotic sequencing projects

    PubMed Central

    Kislyuk, Andrey O.; Katz, Lee S.; Agrawal, Sonia; Hagen, Matthew S.; Conley, Andrew B.; Jayaraman, Pushkala; Nelakuditi, Viswateja; Humphrey, Jay C.; Sammons, Scott A.; Govil, Dhwani; Mair, Raydel D.; Tatti, Kathleen M.; Tondella, Maria L.; Harcourt, Brian H.; Mayer, Leonard W.; Jordan, I. King

    2010-01-01

    Motivation: New sequencing technologies have accelerated research on prokaryotic genomes and have made genome sequencing operations outside major genome sequencing centers routine. However, no off-the-shelf solution exists for the combined assembly, gene prediction, genome annotation and data presentation necessary to interpret sequencing data. The resulting requirement to invest significant resources into custom informatics support for genome sequencing projects remains a major impediment to the accessibility of high-throughput sequence data. Results: We present a self-contained, automated high-throughput open source genome sequencing and computational genomics pipeline suitable for prokaryotic sequencing projects. The pipeline has been used at the Georgia Institute of Technology and the Centers for Disease Control and Prevention for the analysis of Neisseria meningitidis and Bordetella bronchiseptica genomes. The pipeline is capable of enhanced or manually assisted reference-based assembly using multiple assemblers and modes; gene predictor combining; and functional annotation of genes and gene products. Because every component of the pipeline is executed on a local machine with no need to access resources over the Internet, the pipeline is suitable for projects of a sensitive nature. Annotation of virulence-related features makes the pipeline particularly useful for projects working with pathogenic prokaryotes. Availability and implementation: The pipeline is licensed under the open-source GNU General Public License and available at the Georgia Tech Neisseria Base (http://nbase.biology.gatech.edu/). The pipeline is implemented with a combination of Perl, Bourne Shell and MySQL and is compatible with Linux and other Unix systems. Contact: king.jordan@biology.gatech.edu Supplementary information: Supplementary data are available at Bioinformatics online. PMID:20519285

  7. Enabling a Community to Dissect an Organism: Overview of the Neurospora Functional Genomics Project

    PubMed Central

    Dunlap, Jay C.; Borkovich, Katherine A.; Henn, Matthew R.; Turner, Gloria E.; Sachs, Matthew S.; Glass, N. Louise; McCluskey, Kevin; Plamann, Michael; Galagan, James E.; Birren, Bruce W.; Weiss, Richard L.; Townsend, Jeffrey P.; Loros, Jennifer J.; Nelson, Mary Anne; Lambreghts, Randy; Colot, Hildur V.; Park, Gyungsoon; Collopy, Patrick; Ringelberg, Carol; Crew, Christopher; Litvinkova, Liubov; DeCaprio, Dave; Hood, Heather M.; Curilla, Susan; Shi, Mi; Crawford, Matthew; Koerhsen, Michael; Montgomery, Phil; Larson, Lisa; Pearson, Matthew; Kasuga, Takao; Tian, Chaoguang; Baştürkmen, Meray; Altamirano, Lorena; Xu, Junhuan

    2013-01-01

    A consortium of investigators is engaged in a functional genomics project centered on the filamentous fungus Neurospora, with an eye to opening up the functional genomic analysis of all the filamentous fungi. The overall goal of the four interdependent projects in this effort is to accomplish functional genomics, annotation, and expression analyses of Neurospora crassa, a filamentous fungus that is an established model for the assemblage of over 250,000 species of nonyeast fungi. Building from the completely sequenced 43-Mb Neurospora genome, Project 1 is pursuing the systematic disruption of genes through targeted gene replacements, phenotypic analysis of mutant strains, and their distribution to the scientific community at large. Project 2, through a primary focus in Annotation and Bioinformatics, has developed a platform for electronically capturing community feedback and data about the existing annotation, while building and maintaining a database to capture and display information about phenotypes. Oligonucleotide-based microarrays created in Project 3 are being used to collect baseline expression data for the nearly 11,000 distinguishable transcripts in Neurospora under various conditions of growth and development, and eventually to begin to analyze the global effects of loss of novel genes in strains created by Project 1. cDNA libraries generated in Project 4 document the overall complexity of expressed sequences in Neurospora, including alternative splicing, alternative promoters and antisense transcripts. In addition, these studies have driven the assembly of an SNP map presently populated by nearly 300 markers that will greatly accelerate the positional cloning of genes. PMID:17352902

  8. Text Mining in Cancer Gene and Pathway Prioritization

    PubMed Central

    Luo, Yuan; Riedlinger, Gregory; Szolovits, Peter

    2014-01-01

    Prioritization of cancer implicated genes has received growing attention as an effective way to reduce wet lab cost by computational analysis that ranks candidate genes according to the likelihood that experimental verifications will succeed. A multitude of gene prioritization tools have been developed, each integrating different data sources covering gene sequences, differential expressions, function annotations, gene regulations, protein domains, protein interactions, and pathways. This review places existing gene prioritization tools against the backdrop of an integrative Omic hierarchy view toward cancer and focuses on the analysis of their text mining components. We explain the relatively slow progress of text mining in gene prioritization, identify several challenges to current text mining methods, and highlight a few directions where more effective text mining algorithms may improve the overall prioritization task and where prioritizing the pathways may be more desirable than prioritizing only genes. PMID:25392685

  9. Text mining in cancer gene and pathway prioritization.

    PubMed

    Luo, Yuan; Riedlinger, Gregory; Szolovits, Peter

    2014-01-01

    Prioritization of cancer implicated genes has received growing attention as an effective way to reduce wet lab cost by computational analysis that ranks candidate genes according to the likelihood that experimental verifications will succeed. A multitude of gene prioritization tools have been developed, each integrating different data sources covering gene sequences, differential expressions, function annotations, gene regulations, protein domains, protein interactions, and pathways. This review places existing gene prioritization tools against the backdrop of an integrative Omic hierarchy view toward cancer and focuses on the analysis of their text mining components. We explain the relatively slow progress of text mining in gene prioritization, identify several challenges to current text mining methods, and highlight a few directions where more effective text mining algorithms may improve the overall prioritization task and where prioritizing the pathways may be more desirable than prioritizing only genes.

  10. GENCODE: the reference human genome annotation for The ENCODE Project.

    PubMed

    Harrow, Jennifer; Frankish, Adam; Gonzalez, Jose M; Tapanari, Electra; Diekhans, Mark; Kokocinski, Felix; Aken, Bronwen L; Barrell, Daniel; Zadissa, Amonida; Searle, Stephen; Barnes, If; Bignell, Alexandra; Boychenko, Veronika; Hunt, Toby; Kay, Mike; Mukherjee, Gaurab; Rajan, Jeena; Despacio-Reyes, Gloria; Saunders, Gary; Steward, Charles; Harte, Rachel; Lin, Michael; Howald, Cédric; Tanzer, Andrea; Derrien, Thomas; Chrast, Jacqueline; Walters, Nathalie; Balasubramanian, Suganthi; Pei, Baikang; Tress, Michael; Rodriguez, Jose Manuel; Ezkurdia, Iakes; van Baren, Jeltje; Brent, Michael; Haussler, David; Kellis, Manolis; Valencia, Alfonso; Reymond, Alexandre; Gerstein, Mark; Guigó, Roderic; Hubbard, Tim J

    2012-09-01

    The GENCODE Consortium aims to identify all gene features in the human genome using a combination of computational analysis, manual annotation, and experimental validation. Since the first public release of this annotation data set, few new protein-coding loci have been added, yet the number of alternative splicing transcripts annotated has steadily increased. The GENCODE 7 release contains 20,687 protein-coding and 9640 long noncoding RNA loci and has 33,977 coding transcripts not represented in UCSC genes and RefSeq. It also has the most comprehensive annotation of long noncoding RNA (lncRNA) loci publicly available with the predominant transcript form consisting of two exons. We have examined the completeness of the transcript annotation and found that 35% of transcriptional start sites are supported by CAGE clusters and 62% of protein-coding genes have annotated polyA sites. Over one-third of GENCODE protein-coding genes are supported by peptide hits derived from mass spectrometry spectra submitted to Peptide Atlas. New models derived from the Illumina Body Map 2.0 RNA-seq data identify 3689 new loci not currently in GENCODE, of which 3127 consist of two exon models indicating that they are possibly unannotated long noncoding loci. GENCODE 7 is publicly available from gencodegenes.org and via the Ensembl and UCSC Genome Browsers.

  11. Translatomics combined with transcriptomics and proteomics reveals novel functional, recently evolved orphan genes in Escherichia coli O157:H7 (EHEC).

    PubMed

    Neuhaus, Klaus; Landstorfer, Richard; Fellner, Lea; Simon, Svenja; Schafferhans, Andrea; Goldberg, Tatyana; Marx, Harald; Ozoline, Olga N; Rost, Burkhard; Kuster, Bernhard; Keim, Daniel A; Scherer, Siegfried

    2016-02-24

    Genomes of E. coli, including that of the human pathogen Escherichia coli O157:H7 (EHEC) EDL933, still harbor undetected protein-coding genes which, apparently, have escaped annotation due to their small size and non-essential function. To find such genes, global gene expression of EHEC EDL933 was examined, using strand-specific RNAseq (transcriptome), ribosomal footprinting (translatome) and mass spectrometry (proteome). Using the above methods, 72 short, non-annotated protein-coding genes were detected. All of these showed signals in the ribosomal footprinting assay indicating mRNA translation. Seven were verified by mass spectrometry. Fifty-seven genes are annotated in other enterobacteriaceae, mainly as hypothetical genes; the remaining 15 genes constitute novel discoveries. In addition, protein structure and function were predicted computationally and compared between EHEC-encoded proteins and 100-times randomly shuffled proteins. Based on this comparison, 61 of the 72 novel proteins exhibit predicted structural and functional features similar to those of annotated proteins. Many of the novel genes show differential transcription when grown under eleven diverse growth conditions suggesting environmental regulation. Three genes were found to confer a phenotype in previous studies, e.g., decreased cattle colonization. These findings demonstrate that ribosomal footprinting can be used to detect novel protein coding genes, contributing to the growing body of evidence that hypothetical genes are not annotation artifacts and opening an additional way to study their functionality. All 72 genes are taxonomically restricted and, therefore, appear to have evolved relatively recently de novo.

  12. Rapid quantification of mutant fitness in diverse bacteria by sequencing randomly bar-coded transposons

    DOE PAGES

    Wetmore, Kelly M.; Price, Morgan N.; Waters, Robert J.; ...

    2015-05-12

    Transposon mutagenesis with next-generation sequencing (TnSeq) is a powerful approach to annotate gene function in bacteria, but existing protocols for TnSeq require laborious preparation of every sample before sequencing. Thus, the existing protocols are not amenable to the throughput necessary to identify phenotypes and functions for the majority of genes in diverse bacteria. Here, we present a method, random bar code transposon-site sequencing (RB-TnSeq), which increases the throughput of mutant fitness profiling by incorporating random DNA bar codes into Tn5 and mariner transposons and by using bar code sequencing (BarSeq) to assay mutant fitness. RB-TnSeq can be used with any transposon, and TnSeq is performed once per organism instead of once per sample. Each BarSeq assay requires only a simple PCR, and 48 to 96 samples can be sequenced on one lane of an Illumina HiSeq system. We demonstrate the reproducibility and biological significance of RB-TnSeq with Escherichia coli, Phaeobacter inhibens, Pseudomonas stutzeri, Shewanella amazonensis, and Shewanella oneidensis. To demonstrate the increased throughput of RB-TnSeq, we performed 387 successful genome-wide mutant fitness assays representing 130 different bacterium-carbon source combinations and identified 5,196 genes with significant phenotypes across the five bacteria. In P. inhibens, we used our mutant fitness data to identify genes important for the utilization of diverse carbon substrates, including a putative D-mannose isomerase that is required for mannitol catabolism. RB-TnSeq will enable the cost-effective functional annotation of diverse bacteria using mutant fitness profiling. A large challenge in microbiology is the functional assessment of the millions of uncharacterized genes identified by genome sequencing. Transposon mutagenesis coupled to next-generation sequencing (TnSeq) is a powerful approach to assign phenotypes and functions to genes. However, the current strategies for TnSeq are too laborious to be applied to hundreds of experimental conditions across multiple bacteria. Here, we describe an approach, random bar code transposon-site sequencing (RB-TnSeq), which greatly simplifies the measurement of gene fitness by using bar code sequencing (BarSeq) to monitor the abundance of mutants. We performed 387 genome-wide fitness assays across five bacteria and identified phenotypes for over 5,000 genes. RB-TnSeq can be applied to diverse bacteria and is a powerful tool to annotate uncharacterized genes using phenotype data.
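
The abstract above says that mutant fitness is assayed by monitoring bar code abundance, but it does not give a formula. One common way to summarise such data, assumed here purely for illustration, is the log2 ratio of a bar code's relative abundance after growth to its relative abundance before growth. The function name `barcode_log2_change`, the pseudocount, and the read counts below are hypothetical and are not taken from the RB-TnSeq paper.

```python
import math


def barcode_log2_change(before, after, pseudocount=1.0):
    """Return log2(after fraction / before fraction) for each bar code.

    `before` and `after` map bar code -> read count. The pseudocount is an
    assumption that keeps the ratio finite when a bar code drops to zero.
    """
    total_before = sum(before.values())
    total_after = sum(after.values())
    if total_before == 0 or total_after == 0:
        raise ValueError("each sample needs at least one read")
    change = {}
    for barcode, n_before in before.items():
        n_after = after.get(barcode, 0)
        frac_before = (n_before + pseudocount) / total_before
        frac_after = (n_after + pseudocount) / total_after
        change[barcode] = math.log2(frac_after / frac_before)
    return change


# Invented read counts for two bar-coded mutants.
before = {"bc_A": 100, "bc_B": 100}
after = {"bc_A": 150, "bc_B": 50}
result = barcode_log2_change(before, after)
exact = barcode_log2_change(before, after, pseudocount=0.0)
print(result)
```

A negative value suggests that the mutant carrying that bar code was depleted under the tested condition; a real analysis would also aggregate bar codes per gene and assess significance, which this sketch omits.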

  13. Rapid quantification of mutant fitness in diverse bacteria by sequencing randomly bar-coded transposons

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Wetmore, Kelly M.; Price, Morgan N.; Waters, Robert J.

    Transposon mutagenesis with next-generation sequencing (TnSeq) is a powerful approach to annotate gene function in bacteria, but existing protocols for TnSeq require laborious preparation of every sample before sequencing. Thus, the existing protocols are not amenable to the throughput necessary to identify phenotypes and functions for the majority of genes in diverse bacteria. Here, we present a method, random bar code transposon-site sequencing (RB-TnSeq), which increases the throughput of mutant fitness profiling by incorporating random DNA bar codes into Tn5 and mariner transposons and by using bar code sequencing (BarSeq) to assay mutant fitness. RB-TnSeq can be used with any transposon, and TnSeq is performed once per organism instead of once per sample. Each BarSeq assay requires only a simple PCR, and 48 to 96 samples can be sequenced on one lane of an Illumina HiSeq system. We demonstrate the reproducibility and biological significance of RB-TnSeq with Escherichia coli, Phaeobacter inhibens, Pseudomonas stutzeri, Shewanella amazonensis, and Shewanella oneidensis. To demonstrate the increased throughput of RB-TnSeq, we performed 387 successful genome-wide mutant fitness assays representing 130 different bacterium-carbon source combinations and identified 5,196 genes with significant phenotypes across the five bacteria. In P. inhibens, we used our mutant fitness data to identify genes important for the utilization of diverse carbon substrates, including a putative D-mannose isomerase that is required for mannitol catabolism. RB-TnSeq will enable the cost-effective functional annotation of diverse bacteria using mutant fitness profiling. A large challenge in microbiology is the functional assessment of the millions of uncharacterized genes identified by genome sequencing. Transposon mutagenesis coupled to next-generation sequencing (TnSeq) is a powerful approach to assign phenotypes and functions to genes. However, the current strategies for TnSeq are too laborious to be applied to hundreds of experimental conditions across multiple bacteria. Here, we describe an approach, random bar code transposon-site sequencing (RB-TnSeq), which greatly simplifies the measurement of gene fitness by using bar code sequencing (BarSeq) to monitor the abundance of mutants. We performed 387 genome-wide fitness assays across five bacteria and identified phenotypes for over 5,000 genes. RB-TnSeq can be applied to diverse bacteria and is a powerful tool to annotate uncharacterized genes using phenotype data.

  14. nGASP--the nematode genome annotation assessment project.

    PubMed

    Coghlan, Avril; Fiedler, Tristan J; McKay, Sheldon J; Flicek, Paul; Harris, Todd W; Blasiar, Darin; Stein, Lincoln D

    2008-12-19

    While the C. elegans genome is extensively annotated, relatively little information is available for other Caenorhabditis species. The nematode genome annotation assessment project (nGASP) was launched to objectively assess the accuracy of protein-coding gene prediction software in C. elegans, and to apply this knowledge to the annotation of the genomes of four additional Caenorhabditis species and other nematodes. Seventeen groups worldwide participated in nGASP, and submitted 47 prediction sets across 10 Mb of the C. elegans genome. Predictions were compared to reference gene sets consisting of confirmed or manually curated gene models from WormBase. The most accurate gene-finders were 'combiner' algorithms, which made use of transcript- and protein-alignments and multi-genome alignments, as well as gene predictions from other gene-finders. Gene-finders that used alignments of ESTs, mRNAs and proteins came in second. There was a tie for third place between gene-finders that used multi-genome alignments and ab initio gene-finders. The median gene level sensitivity of combiners was 78% and their specificity was 42%, which is nearly the same accuracy reported for combiners in the human genome. C. elegans genes with exons of unusual hexamer content, as well as those with unusually many exons, short exons, long introns, a weak translation start signal, weak splice sites, or poorly conserved orthologs posed the greatest difficulty for gene-finders. This experiment establishes a baseline of gene prediction accuracy in Caenorhabditis genomes, and has guided the choice of gene-finders for the annotation of newly sequenced genomes of Caenorhabditis and other nematode species. We have created new gene sets for C. briggsae, C. remanei, C. brenneri, C. japonica, and Brugia malayi using some of the best-performing gene-finders.
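
The gene-level sensitivity and specificity figures quoted above can be made concrete with a small sketch. It assumes the convention that is common in the gene-finding literature, where "specificity" means the fraction of predicted genes that are correct (called precision elsewhere) and a gene counts as correct only if its whole structure matches a reference gene exactly. Whether nGASP scored genes in exactly this way is an assumption here, and the function name and coordinates are invented.

```python
def gene_level_accuracy(predicted, reference):
    """Return (sensitivity, specificity) for exact gene-structure matches.

    Each gene is a hashable description of its structure, here a tuple of
    (sequence id, strand, exon coordinate pairs). Specificity follows the
    gene-prediction convention TP / (TP + FP), i.e. precision.
    """
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    sensitivity = true_positives / len(reference) if reference else 0.0
    specificity = true_positives / len(predicted) if predicted else 0.0
    return sensitivity, specificity


# Invented gene models: four reference genes and five predictions,
# three of which reproduce a reference structure exactly.
reference = [
    ("chrI", "+", ((100, 200), (300, 400))),
    ("chrI", "-", ((1000, 1200),)),
    ("chrII", "+", ((50, 150), (250, 350), (450, 500))),
    ("chrII", "-", ((900, 990),)),
]
predicted = [
    ("chrI", "+", ((100, 200), (300, 400))),
    ("chrI", "-", ((1000, 1200),)),
    ("chrII", "+", ((50, 150), (250, 350), (450, 500))),
    ("chrII", "-", ((900, 995),)),   # one exon boundary is off, so no match
    ("chrIII", "+", ((10, 90),)),    # no corresponding reference gene
]
sensitivity, specificity = gene_level_accuracy(predicted, reference)
print(sensitivity, specificity)
```

Under this convention a gene-finder that over-predicts keeps its sensitivity but loses specificity, which is one way a median sensitivity of 78% can coexist with a specificity of 42%.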

  15. Sequence- and Structure-Based Functional Annotation and Assessment of Metabolic Transporters in Aspergillus oryzae: A Representative Case Study

    PubMed Central

    Raethong, Nachon; Wong-ekkabut, Jirasak; Laoteng, Kobkul; Vongsangnak, Wanwipa

    2016-01-01

    Aspergillus oryzae is widely used for the industrial production of enzymes. In A. oryzae metabolism, transporters appear to play crucial roles in controlling the flux of molecules for energy generation, nutrients delivery, and waste elimination in the cell. While the A. oryzae genome sequence is available, transporter annotation remains limited and thus the connectivity of metabolic networks is incomplete. In this study, we developed a metabolic annotation strategy to understand the relationship between the sequence, structure, and function for annotation of A. oryzae metabolic transporters. Sequence-based analysis with manual curation showed that 58 genes of 12,096 total genes in the A. oryzae genome encoded metabolic transporters. Under consensus integrative databases, 55 unambiguous metabolic transporter genes were distributed into channels and pores (7 genes), electrochemical potential-driven transporters (33 genes), and primary active transporters (15 genes). To reveal the transporter functional role, a combination of homology modeling and molecular dynamics simulation was implemented to assess the relationship between sequence to structure and structure to function. As in the energy metabolism of A. oryzae, the H+-ATPase encoded by the AO090005000842 gene was selected as a representative case study of multilevel linkage annotation. Our developed strategy can be used for enhancing metabolic network reconstruction. PMID:27274991

  16. Sequence- and Structure-Based Functional Annotation and Assessment of Metabolic Transporters in Aspergillus oryzae: A Representative Case Study.

    PubMed

    Raethong, Nachon; Wong-Ekkabut, Jirasak; Laoteng, Kobkul; Vongsangnak, Wanwipa

    2016-01-01

    Aspergillus oryzae is widely used for the industrial production of enzymes. In A. oryzae metabolism, transporters appear to play crucial roles in controlling the flux of molecules for energy generation, nutrients delivery, and waste elimination in the cell. While the A. oryzae genome sequence is available, transporter annotation remains limited and thus the connectivity of metabolic networks is incomplete. In this study, we developed a metabolic annotation strategy to understand the relationship between the sequence, structure, and function for annotation of A. oryzae metabolic transporters. Sequence-based analysis with manual curation showed that 58 genes of 12,096 total genes in the A. oryzae genome encoded metabolic transporters. Under consensus integrative databases, 55 unambiguous metabolic transporter genes were distributed into channels and pores (7 genes), electrochemical potential-driven transporters (33 genes), and primary active transporters (15 genes). To reveal the transporter functional role, a combination of homology modeling and molecular dynamics simulation was implemented to assess the relationship between sequence to structure and structure to function. As in the energy metabolism of A. oryzae, the H(+)-ATPase encoded by the AO090005000842 gene was selected as a representative case study of multilevel linkage annotation. Our developed strategy can be used for enhancing metabolic network reconstruction.

  17. A genome scale metabolic network for rice and accompanying analysis of tryptophan, auxin and serotonin biosynthesis regulation under biotic stress

    USDA-ARS's Scientific Manuscript database

    Functional annotations of large plant genome projects mostly provide information on gene function and gene families based on the presence of protein domains and gene homology, but not necessarily in association with gene expression or metabolic and regulatory networks. These additional annotations a...

  18. GOexpress: an R/Bioconductor package for the identification and visualisation of robust gene ontology signatures through supervised learning of gene expression data.

    PubMed

    Rue-Albrecht, Kévin; McGettigan, Paul A; Hernández, Belinda; Nalpas, Nicolas C; Magee, David A; Parnell, Andrew C; Gordon, Stephen V; MacHugh, David E

    2016-03-11

    Identification of gene expression profiles that differentiate experimental groups is critical for discovery and analysis of key molecular pathways and also for selection of robust diagnostic or prognostic biomarkers. While integration of differential expression statistics has been used to refine gene set enrichment analyses, such approaches are typically limited to single gene lists resulting from simple two-group comparisons or time-series analyses. In contrast, functional class scoring and machine learning approaches provide powerful alternative methods to leverage molecular measurements for pathway analyses, and to compare continuous and multi-level categorical factors. We introduce GOexpress, a software package for scoring and summarising the capacity of gene ontology features to simultaneously classify samples from multiple experimental groups. GOexpress integrates normalised gene expression data (e.g., from microarray and RNA-seq experiments) and phenotypic information of individual samples with gene ontology annotations to derive a ranking of genes and gene ontology terms using a supervised learning approach. The default random forest algorithm allows interactions between all experimental factors, and competitive scoring of expressed genes to evaluate their relative importance in classifying predefined groups of samples. GOexpress enables rapid identification and visualisation of ontology-related gene panels that robustly classify groups of samples and supports both categorical (e.g., infection status, treatment) and continuous (e.g., time-series, drug concentrations) experimental factors. The use of standard Bioconductor extension packages and publicly available gene ontology annotations facilitates straightforward integration of GOexpress within existing computational biology pipelines.

  19. Genomic resources for songbird research and their use in characterizing gene expression during brain development

    PubMed Central

    Li, XiaoChing; Wang, Xiu-Jie; Tannenhauser, Jonathan; Podell, Sheila; Mukherjee, Piali; Hertel, Moritz; Biane, Jeremy; Masuda, Shoko; Nottebohm, Fernando; Gaasterland, Terry

    2007-01-01

    Vocal learning and neuronal replacement have been studied extensively in songbirds, but until recently, few molecular and genomic tools for songbird research existed. Here we describe new molecular/genomic resources developed in our laboratory. We made cDNA libraries from zebra finch (Taeniopygia guttata) brains at different developmental stages. A total of 11,000 cDNA clones from these libraries, representing 5,866 unique gene transcripts, were randomly picked and sequenced from the 3′ ends. A web-based database was established for clone tracking, sequence analysis, and functional annotations. Our cDNA libraries were not normalized. Sequencing ESTs without normalization produced many developmental stage-specific sequences, yielding insights into patterns of gene expression at different stages of brain development. In particular, the cDNA library made from brains at posthatching day 30–50, corresponding to the period of rapid song system development and song learning, has the most diverse and richest set of genes expressed. We also identified five microRNAs whose sequences are highly conserved between zebra finch and other species. We printed cDNA microarrays and profiled gene expression in the high vocal center of both adult male zebra finches and canaries (Serinus canaria). Genes differentially expressed in the high vocal center were identified from the microarray hybridization results. Selected genes were validated by in situ hybridization. Networks among the regulated genes were also identified. These resources provide songbird biologists with tools for genome annotation, comparative genomics, and microarray gene expression analysis. PMID:17426146

  20. Phylogeny of C4-photosynthesis enzymes based on algal transcriptomic and genomic data supports an archaeal/proteobacterial origin and multiple duplication for most C4-related genes.

    PubMed

    Chi, Shan; Wu, Shuangxiu; Yu, Jun; Wang, Xumin; Tang, Xuexi; Liu, Tao

    2014-01-01

    Both the Calvin-Benson-Bassham (C3) and Hatch-Slack (C4) cycles are the most important autotrophic CO2 fixation pathways on today's Earth. The C3 cycle is believed to have originated from cyanobacterial endosymbiosis. However, studies on the evolution of different biochemical variants of C4 photosynthesis are limited to tracheophytes, and the origins of C4-cycle genes remain unclear. Our comprehensive bioinformatic and phylogenetic analyses of novel transcriptomic sequencing data from 21 rhodophytes and 19 Phaeophyceae marine species, together with public genomic data from more algae, tracheophytes, cyanobacteria, proteobacteria and archaea, revealed the origin and evolution of C4 cycle-related genes. Almost all C4-related genes were annotated in extensive algal lineages with proteobacterial or archaeal origins, except for phosphoenolpyruvate carboxykinase (PCK) and aspartate aminotransferase (AST), which have both cyanobacterial and archaeal/proteobacterial origins. Notably, cyanobacteria may not possess a complete C4 pathway because of the flawed annotation of pyruvate orthophosphate dikinase (PPDK) genes in public data. Most C4 cycle-related genes underwent duplication, giving rise to functional differentiation and adaptation in different algal lineages. C4-related genes of the NAD-ME (NAD-malic enzyme) and PCK subtypes exist in most algae and may be primitive, while NADP-ME (NADP-malic enzyme) subtype genes might have evolved from the NAD-ME subtype by gene duplication in chlorophytes and tracheophytes.

  1. Evaluating Functional Annotations of Enzymes Using the Gene Ontology.

    PubMed

    Holliday, Gemma L; Davidson, Rebecca; Akiva, Eyal; Babbitt, Patricia C

    2017-01-01

    The Gene Ontology (GO) (Ashburner et al., Nat Genet 25(1):25-29, 2000) is a powerful tool in the informatics arsenal of methods for evaluating annotations in a protein dataset. Everything from identifying the nearest well-annotated homologue of a protein of interest, to predicting where misannotation has occurred, to knowing how confident you can be in the annotations assigned to those proteins is critical. In this chapter we explore what makes an enzyme unique and how we can use GO to infer aspects of protein function based on sequence similarity. These can range from identification of misannotation or other errors in a predicted function to accurate function prediction for an enzyme of entirely unknown function. Although GO annotation applies to any gene products, we focus here on describing our approach for hierarchical classification of enzymes in the Structure-Function Linkage Database (SFLD) (Akiva et al., Nucleic Acids Res 42(Database issue):D521-530, 2014) as a guide for informed utilisation of annotation transfer based on GO terms.

  2. APPRIS: annotation of principal and alternative splice isoforms

    PubMed Central

    Rodriguez, Jose Manuel; Maietta, Paolo; Ezkurdia, Iakes; Pietrelli, Alessandro; Wesselink, Jan-Jaap; Lopez, Gonzalo; Valencia, Alfonso; Tress, Michael L.

    2013-01-01

    Here, we present APPRIS (http://appris.bioinfo.cnio.es), a database that houses annotations of human splice isoforms. APPRIS has been designed to provide value to manual annotations of the human genome by adding reliable protein structural and functional data and information from cross-species conservation. The visual representation of the annotations provided by APPRIS for each gene allows annotators and researchers alike to easily identify functional changes brought about by splicing events. In addition to collecting, integrating and analyzing reliable predictions of the effect of splicing events, APPRIS also selects a single reference sequence for each gene, here termed the principal isoform, based on the annotations of structure, function and conservation for each transcript. APPRIS identifies a principal isoform for 85% of the protein-coding genes in the GENCODE 7 release for ENSEMBL. Analysis of the APPRIS data shows that at least 70% of the alternative (non-principal) variants would lose important functional or structural information relative to the principal isoform. PMID:23161672

  3. Gene Ontology annotation of the rice blast fungus, Magnaporthe oryzae

    PubMed Central

    Meng, Shaowu; Brown, Douglas E; Ebbole, Daniel J; Torto-Alalibo, Trudy; Oh, Yeon Yee; Deng, Jixin; Mitchell, Thomas K; Dean, Ralph A

    2009-01-01

    Background Magnaporthe oryzae, the causal agent of blast disease of rice, is the most destructive disease of rice worldwide. The genome of this fungal pathogen has been sequenced and an automated annotation has recently been updated to Version 6. However, a comprehensive manual curation remains to be performed. Gene Ontology (GO) annotation is a valuable means of assigning functional information using standardized vocabulary. We report an overview of the GO annotation for Version 5 of M. oryzae genome assembly. Methods A similarity-based (i.e., computational) GO annotation with manual review was conducted, which was then integrated with a literature-based GO annotation with computational assistance. For similarity-based GO annotation a stringent reciprocal best hits method was used to identify similarity between predicted proteins of M. oryzae and GO proteins from multiple organisms with published associations to GO terms. Significant alignment pairs were manually reviewed. Functional assignments were further cross-validated with manually reviewed data, conserved domains, or data determined by wet lab experiments. Additionally, biological appropriateness of the functional assignments was manually checked. Results In total, 6,286 proteins received GO term assignment via the homology-based annotation, including 2,870 hypothetical proteins. Literature-based experimental evidence, such as microarray, MPSS, T-DNA insertion mutation, or gene knockout mutation, resulted in 2,810 proteins being annotated with GO terms. Of these, 1,673 proteins were annotated with new terms developed for Plant-Associated Microbe Gene Ontology (PAMGO). In addition, 67 experiment-determined secreted proteins were annotated with PAMGO terms. Integration of the two data sets resulted in 7,412 proteins (57%) being annotated with 1,957 distinct and specific GO terms. Unannotated proteins were assigned to the 3 root terms. The Version 5 GO annotation is publicly queryable via the GO site.
Additionally, the genome of M. oryzae is constantly being refined and updated as new information is incorporated. For the latest GO annotation of Version 6 genome, please visit our website. The preliminary GO annotation of Version 6 genome is placed at a local MySQL database that is publicly queryable via a user-friendly interface, the Adhoc Query System. Conclusion Our analysis provides comprehensive and robust GO annotations of the M. oryzae genome assemblies that will be solid foundations for further functional interrogation of M. oryzae. PMID:19278556
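
The reciprocal best hits step named in the Methods above can be illustrated with a minimal sketch. It is not the authors' pipeline: the input layout (one best hit and score per query), the function name `reciprocal_best_hits`, the `min_score` cutoff, and the protein identifiers are all hypothetical, chosen only to show the idea that a pair is kept when each protein is the other's top-scoring match.

```python
from typing import Dict, List, Tuple

# Hypothetical input layout: query protein -> (best-hit subject, alignment score).
Hits = Dict[str, Tuple[str, float]]


def reciprocal_best_hits(a_to_b: Hits, b_to_a: Hits,
                         min_score: float = 0.0) -> List[Tuple[str, str]]:
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a.

    min_score is an illustrative stringency cutoff applied to both directions.
    """
    pairs = []
    for a, (b, score_ab) in a_to_b.items():
        back = b_to_a.get(b)
        if back is None:
            continue  # b has no recorded hit in the other proteome
        a_back, score_ba = back
        if a_back == a and min(score_ab, score_ba) >= min_score:
            pairs.append((a, b))
    return sorted(pairs)


if __name__ == "__main__":
    # Made-up identifiers and scores, for illustration only.
    query_to_ref = {"q1": ("refA", 250.0), "q2": ("refB", 90.0), "q3": ("refA", 120.0)}
    ref_to_query = {"refA": ("q1", 245.0), "refB": ("q9", 300.0)}
    print(reciprocal_best_hits(query_to_ref, ref_to_query, min_score=100.0))
```

In this toy input only `q1` and `refA` choose each other, so only that pair would be passed on to manual review.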

  4. On the Use of Gene Ontology Annotations to Assess Functional Similarity among Orthologs and Paralogs: A Short Report

    PubMed Central

    Thomas, Paul D.; Wood, Valerie; Mungall, Christopher J.; Lewis, Suzanna E.; Blake, Judith A.

    2012-01-01

    A recent paper (Nehrt et al., PLoS Comput. Biol. 7:e1002073, 2011) has proposed a metric for the “functional similarity” between two genes that uses only the Gene Ontology (GO) annotations directly derived from published experimental results. Applying this metric, the authors concluded that paralogous genes within the mouse genome or the human genome are more functionally similar on average than orthologous genes between these genomes, an unexpected result with broad implications if true. We suggest, based on both theoretical and empirical considerations, that this proposed metric should not be interpreted as a functional similarity, and therefore cannot be used to support any conclusions about the “ortholog conjecture” (or, more properly, the “ortholog functional conservation hypothesis”). First, we reexamine the case studies presented by Nehrt et al. as examples of orthologs with divergent functions, and come to a very different conclusion: they actually exemplify how GO annotations for orthologous genes provide complementary information about conserved biological functions. We then show that there is a global ascertainment bias in the experiment-based GO annotations for human and mouse genes: particular types of experiments tend to be performed in different model organisms. We conclude that the reported statistical differences in annotations between pairs of orthologous genes do not reflect differences in biological function, but rather complementarity in experimental approaches. 
Our results underscore two general considerations for researchers proposing novel types of analysis based on the GO: 1) that GO annotations are often incomplete, potentially in a biased manner, and subject to an “open world assumption” (absence of an annotation does not imply absence of a function), and 2) that conclusions drawn from a novel, large-scale GO analysis should whenever possible be supported by careful, in-depth examination of examples, to help ensure the conclusions have a justifiable biological basis. PMID:22359495

  5. Rapid Quantification of Mutant Fitness in Diverse Bacteria by Sequencing Randomly Bar-Coded Transposons

    PubMed Central

    Wetmore, Kelly M.; Price, Morgan N.; Waters, Robert J.; Lamson, Jacob S.; He, Jennifer; Hoover, Cindi A.; Blow, Matthew J.; Bristow, James; Butland, Gareth

    2015-01-01

    Transposon mutagenesis with next-generation sequencing (TnSeq) is a powerful approach to annotate gene function in bacteria, but existing protocols for TnSeq require laborious preparation of every sample before sequencing. Thus, the existing protocols are not amenable to the throughput necessary to identify phenotypes and functions for the majority of genes in diverse bacteria. Here, we present a method, random bar code transposon-site sequencing (RB-TnSeq), which increases the throughput of mutant fitness profiling by incorporating random DNA bar codes into Tn5 and mariner transposons and by using bar code sequencing (BarSeq) to assay mutant fitness. RB-TnSeq can be used with any transposon, and TnSeq is performed once per organism instead of once per sample. Each BarSeq assay requires only a simple PCR, and 48 to 96 samples can be sequenced on one lane of an Illumina HiSeq system. We demonstrate the reproducibility and biological significance of RB-TnSeq with Escherichia coli, Phaeobacter inhibens, Pseudomonas stutzeri, Shewanella amazonensis, and Shewanella oneidensis. To demonstrate the increased throughput of RB-TnSeq, we performed 387 successful genome-wide mutant fitness assays representing 130 different bacterium-carbon source combinations and identified 5,196 genes with significant phenotypes across the five bacteria. In P. inhibens, we used our mutant fitness data to identify genes important for the utilization of diverse carbon substrates, including a putative d-mannose isomerase that is required for mannitol catabolism. RB-TnSeq will enable the cost-effective functional annotation of diverse bacteria using mutant fitness profiling. PMID:25968644

  6. The De Novo Transcriptome and Its Functional Annotation in the Seed Beetle Callosobruchus maculatus.

    PubMed

    Sayadi, Ahmed; Immonen, Elina; Bayram, Helen; Arnqvist, Göran

    2016-01-01

    Despite their unparalleled biodiversity, the genomic resources available for beetles (Coleoptera) remain relatively scarce. We present an integrative and high quality annotated transcriptome of the beetle Callosobruchus maculatus, an important and cosmopolitan agricultural pest as well as an emerging model species in ecology and evolutionary biology. Using Illumina sequencing technology, we sequenced 492 million read pairs generated from 51 samples of different developmental stages (larvae, pupae and adults) of C. maculatus. Reads were de novo assembled using the Trinity software, into a single combined assembly as well as into three separate assemblies based on data from the different developmental stages. The combined assembly generated 218,192 transcripts and 145,883 putative genes. Putative genes were annotated with the Blast2GO software and the Trinotate pipeline. In total, 33,216 putative genes were successfully annotated using Blastx against the Nr (non-redundant) database and 13,382 were assigned to 34,100 Gene Ontology (GO) terms. We classified 5,475 putative genes into Clusters of Orthologous Groups (COG) and 116 metabolic pathway maps were predicted based on the annotation. Our analyses suggested that the transcriptional specificity increases with ontogeny. For example, out of 33,216 annotated putative genes, 51 were only expressed in larvae, 63 only in pupae and 171 only in adults. Our study illustrates the importance of including samples from several developmental stages when the aim is to provide an integrative and high quality annotated transcriptome. Our results will represent an invaluable resource for those working with the ecology, evolution and pest control of C. maculatus, as well as for comparative studies of the transcriptomics and genomics of beetles more generally.

  7. The De Novo Transcriptome and Its Functional Annotation in the Seed Beetle Callosobruchus maculatus

    PubMed Central

    Sayadi, Ahmed; Immonen, Elina; Bayram, Helen; Arnqvist, Göran

    2016-01-01

    Despite their unparalleled biodiversity, the genomic resources available for beetles (Coleoptera) remain relatively scarce. We present an integrative and high quality annotated transcriptome of the beetle Callosobruchus maculatus, an important and cosmopolitan agricultural pest as well as an emerging model species in ecology and evolutionary biology. Using Illumina sequencing technology, we sequenced 492 million read pairs generated from 51 samples of different developmental stages (larvae, pupae and adults) of C. maculatus. Reads were de novo assembled using the Trinity software, into a single combined assembly as well as into three separate assemblies based on data from the different developmental stages. The combined assembly generated 218,192 transcripts and 145,883 putative genes. Putative genes were annotated with the Blast2GO software and the Trinotate pipeline. In total, 33,216 putative genes were successfully annotated using Blastx against the Nr (non-redundant) database and 13,382 were assigned to 34,100 Gene Ontology (GO) terms. We classified 5,475 putative genes into Clusters of Orthologous Groups (COG) and 116 metabolic pathway maps were predicted based on the annotation. Our analyses suggested that the transcriptional specificity increases with ontogeny. For example, out of 33,216 annotated putative genes, 51 were only expressed in larvae, 63 only in pupae and 171 only in adults. Our study illustrates the importance of including samples from several developmental stages when the aim is to provide an integrative and high quality annotated transcriptome. Our results will represent an invaluable resource for those working with the ecology, evolution and pest control of C. maculatus, as well as for comparative studies of the transcriptomics and genomics of beetles more generally. PMID:27442123

  8. Functional sequencing read annotation for high precision microbiome analysis

    PubMed Central

    Zhu, Chengsheng; Miller, Maximilian; Marpaka, Srinayani; Vaysberg, Pavel; Rühlemann, Malte C; Wu, Guojun; Heinsen, Femke-Anouska; Tempel, Marie; Zhao, Liping; Lieb, Wolfgang; Franke, Andre; Bromberg, Yana

    2018-01-01

    Abstract The vast majority of microorganisms on Earth reside in often-inseparable environment-specific communities—microbiomes. Meta-genomic/-transcriptomic sequencing could reveal the otherwise inaccessible functionality of microbiomes. However, existing analytical approaches focus on attributing sequencing reads to known genes/genomes, often failing to make maximal use of available data. We created faser (functional annotation of sequencing reads), an algorithm that is optimized to map reads to molecular functions encoded by the read-correspondent genes. The mi-faser microbiome analysis pipeline, combining faser with our manually curated reference database of protein functions, accurately annotates microbiome molecular functionality. mi-faser’s minutes-per-microbiome processing speed is significantly faster than that of other methods, allowing for large scale comparisons. Microbiome function vectors can be compared between different conditions to highlight environment-specific and/or time-dependent changes in functionality. Here, we identified previously unseen oil degradation-specific functions in BP oil-spill data, as well as functional signatures of individual-specific gut microbiome responses to a dietary intervention in children with Prader–Willi syndrome. Our method also revealed variability in Crohn's Disease patient microbiomes and clearly distinguished them from those of related healthy individuals. Our analysis highlighted the microbiome role in CD pathogenicity, demonstrating enrichment of patient microbiomes in functions that promote inflammation and that help bacteria survive it. PMID:29194524

  9. Multi-scale chromatin state annotation using a hierarchical hidden Markov model

    NASA Astrophysics Data System (ADS)

    Marco, Eugenio; Meuleman, Wouter; Huang, Jialiang; Glass, Kimberly; Pinello, Luca; Wang, Jianrong; Kellis, Manolis; Yuan, Guo-Cheng

    2017-04-01

    Chromatin-state analysis is widely applied in the studies of development and diseases. However, existing methods operate at a single length scale, and therefore cannot distinguish large domains from isolated elements of the same type. To overcome this limitation, we present a hierarchical hidden Markov model, diHMM, to systematically annotate chromatin states at multiple length scales. We apply diHMM to analyse a public ChIP-seq data set. diHMM not only accurately captures nucleosome-level information, but identifies domain-level states that vary in nucleosome-level state composition, spatial distribution and functionality. The domain-level states recapitulate known patterns such as super-enhancers, bivalent promoters and Polycomb repressed regions, and identify additional patterns whose biological functions are not yet characterized. By integrating chromatin-state information with gene expression and Hi-C data, we identify context-dependent functions of nucleosome-level states. Thus, diHMM provides a powerful tool for investigating the role of higher-order chromatin structure in gene regulation.

  10. A Comprehensive Curation Shows the Dynamic Evolutionary Patterns of Prokaryotic CRISPRs.

    PubMed

    Mai, Guoqin; Ge, Ruiquan; Sun, Guoquan; Meng, Qinghan; Zhou, Fengfeng

    2016-01-01

    Motivation. Clustered regularly interspaced short palindromic repeat (CRISPR) is a genetic element with active regulation roles for foreign invasive genes in the prokaryotic genomes and has been engineered to work with the CRISPR-associated sequence (Cas) gene Cas9 as one of the modern genome editing technologies. Due to inconsistent definitions, the existing CRISPR detection programs seem to have missed some weak CRISPR signals. Results. This study manually curates all the currently annotated CRISPR elements in the prokaryotic genomes and proposes 95 updates to the annotations. A new definition is proposed to cover all the CRISPRs. The comprehensive comparison of CRISPR numbers on the taxonomic levels of both domains and genus shows high variations for closely related species even in the same genus. The detailed investigation of how CRISPRs are evolutionarily manipulated in the 8 completely sequenced species in the genus Thermoanaerobacter demonstrates that transposons act as a frequent tool for splitting long CRISPRs into shorter ones along a long evolutionary history.

  11. Mapping biomedical concepts onto the human genome by mining literature on chromosomal aberrations

    PubMed Central

    Van Vooren, Steven; Thienpont, Bernard; Menten, Björn; Speleman, Frank; Moor, Bart De; Vermeesch, Joris; Moreau, Yves

    2007-01-01

    Biomedical literature provides a rich but unstructured source of associations between chromosomal regions and biomedical concepts. By mining MEDLINE abstracts, we annotate the human genome at the level of cytogenetic bands. Our method creates a set of chromosomal aberration maps that associate cytogenetic bands to biomedical concepts from a variety of controlled vocabularies, including disease, dysmorphology, anatomy, development and Gene Ontology branches. The association between a band (e.g. 4p16.3) and a concept (e.g. microcephaly) is assessed by the statistical overrepresentation of this concept in the abstracts relating to this band. Our method is validated using existing genome annotation resources and known chromosomal aberration maps and is further illustrated through a case study on heart disease. Our chromosomal aberration maps provide diagnostics support to clinical geneticists, aid cytogeneticists to interpret and report cytogenetic findings and support researchers interested in human gene function. The method is available as a web application, aBandApart, at http://www.esat.kuleuven.be/abandapart/. PMID:17403693
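
The "statistical overrepresentation" of a concept among the abstracts linked to a band can be illustrated with a one-sided hypergeometric test. This is a sketch under an assumption: the abstract does not say which statistic aBandApart uses, so the hypergeometric model, the function name `overrepresentation_p`, and its parameter names are illustrative only.

```python
from math import comb


def overrepresentation_p(total: int, with_concept: int,
                         band_abstracts: int, overlap: int) -> float:
    """One-sided hypergeometric tail probability P(X >= overlap).

    total          -- number of abstracts in the corpus
    with_concept   -- abstracts mentioning the concept (e.g. microcephaly)
    band_abstracts -- abstracts linked to the band (e.g. 4p16.3)
    overlap        -- abstracts that are both band-linked and mention the concept
    """
    upper = min(with_concept, band_abstracts)
    start = max(overlap, 0)
    denom = comb(total, band_abstracts)
    numer = sum(
        comb(with_concept, k) * comb(total - with_concept, band_abstracts - k)
        for k in range(start, upper + 1)
    )
    return numer / denom


if __name__ == "__main__":
    # Toy counts, not taken from the paper.
    print(overrepresentation_p(total=10_000, with_concept=50,
                               band_abstracts=200, overlap=8))
```

A small tail probability means the concept appears in the band's abstracts more often than random sampling from the corpus would suggest; in practice a multiple-testing correction across bands and concepts would also be needed.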

  12. Leveraging Comparative Genomics to Identify and Functionally Characterize Genes Associated with Sperm Phenotypes in Python bivittatus (Burmese Python)

    PubMed Central

    Irizarry, Kristopher J. L.; Rutllant, Josep

    2016-01-01

    Comparative genomics approaches provide a means of leveraging functional genomics information from a highly annotated model organism's genome (such as the mouse genome) in order to make physiological inferences about the role of genes and proteins in a less characterized organism's genome (such as the Burmese python). We employed a comparative genomics approach to produce the functional annotation of Python bivittatus genes encoding proteins associated with sperm phenotypes. We identify 129 gene-phenotype relationships in the python which are implicated in 10 specific sperm phenotypes. Results obtained through our systematic analysis identified subsets of python genes exhibiting associations with gene ontology annotation terms. Functional annotation data was represented in a semantic scatter plot. Together, these newly annotated Python bivittatus genome resources provide a high resolution framework from which the biology relating to reptile spermatogenesis, fertility, and reproduction can be further investigated. Applications of our research include (1) production of genetic diagnostics for assessing fertility in domestic and wild reptiles; (2) enhanced assisted reproduction technology for endangered and captive reptiles; and (3) novel molecular targets for biotechnology-based approaches aimed at reducing fertility and reproduction of invasive reptiles. Additional enhancements to reptile genomic resources will further enhance their value. PMID:27200191

  13. Leveraging Comparative Genomics to Identify and Functionally Characterize Genes Associated with Sperm Phenotypes in Python bivittatus (Burmese Python).

    PubMed

    Irizarry, Kristopher J L; Rutllant, Josep

    2016-01-01

    Comparative genomics approaches provide a means of leveraging functional genomics information from a highly annotated model organism's genome (such as the mouse genome) in order to make physiological inferences about the role of genes and proteins in a less characterized organism's genome (such as the Burmese python). We employed a comparative genomics approach to produce the functional annotation of Python bivittatus genes encoding proteins associated with sperm phenotypes. We identify 129 gene-phenotype relationships in the python which are implicated in 10 specific sperm phenotypes. Results obtained through our systematic analysis identified subsets of python genes exhibiting associations with gene ontology annotation terms. Functional annotation data was represented in a semantic scatter plot. Together, these newly annotated Python bivittatus genome resources provide a high resolution framework from which the biology relating to reptile spermatogenesis, fertility, and reproduction can be further investigated. Applications of our research include (1) production of genetic diagnostics for assessing fertility in domestic and wild reptiles; (2) enhanced assisted reproduction technology for endangered and captive reptiles; and (3) novel molecular targets for biotechnology-based approaches aimed at reducing fertility and reproduction of invasive reptiles. Additional enhancements to reptile genomic resources will further enhance their value.

  14. OrthoMCL: Identification of Ortholog Groups for Eukaryotic Genomes

    PubMed Central

    Li, Li; Stoeckert, Christian J.; Roos, David S.

    2003-01-01

    The identification of orthologous groups is useful for genome annotation, studies on gene/protein evolution, comparative genomics, and the identification of taxonomically restricted sequences. Methods successfully exploited for prokaryotic genome analysis have proved difficult to apply to eukaryotes, however, as larger genomes may contain multiple paralogous genes, and sequence information is often incomplete. OrthoMCL provides a scalable method for constructing orthologous groups across multiple eukaryotic taxa, using a Markov Cluster algorithm to group (putative) orthologs and paralogs. This method performs similarly to the INPARANOID algorithm when applied to two genomes, but can be extended to cluster orthologs from multiple species. OrthoMCL clusters are coherent with groups identified by EGO, but improved recognition of “recent” paralogs permits overlapping EGO groups representing the same gene to be merged. Comparison with previously assigned EC annotations suggests a high degree of reliability, implying utility for automated eukaryotic genome annotation. OrthoMCL has been applied to the proteome data set from seven publicly available genomes (human, fly, worm, yeast, Arabidopsis, the malaria parasite Plasmodium falciparum, and Escherichia coli). A Web interface allows queries based on individual genes or user-defined phylogenetic patterns (http://www.cbil.upenn.edu/gene-family). Analysis of clusters incorporating P. falciparum genes identifies numerous enzymes that were incompletely annotated in first-pass annotation of the parasite genome. PMID:12952885

  15. Creating reference gene annotation for the mouse C57BL6/J genome assembly.

    PubMed

    Mudge, Jonathan M; Harrow, Jennifer

    2015-10-01

    Annotation on the reference genome of the C57BL6/J mouse has been an ongoing project ever since the draft genome was first published. Initially, the principal focus was on the identification of all protein-coding genes, although today the importance of describing long non-coding RNAs, small RNAs, and pseudogenes is recognized. Here, we describe the progress of the GENCODE mouse annotation project, which combines manual annotation from the HAVANA group with Ensembl computational annotation, alongside experimental and in silico validation pipelines from other members of the consortium. We discuss the more recent incorporation of next-generation sequencing datasets into this workflow, including the usage of mass-spectrometry data to potentially identify novel protein-coding genes. Finally, we will outline how the C57BL6/J genebuild can be used to gain insights into the variant sites that distinguish different mouse strains and species.

  16. Enhanced annotations and features for comparing thousands of Pseudomonas genomes in the Pseudomonas genome database.

    PubMed

    Winsor, Geoffrey L; Griffiths, Emma J; Lo, Raymond; Dhillon, Bhavjinder K; Shay, Julie A; Brinkman, Fiona S L

    2016-01-04

    The Pseudomonas Genome Database (http://www.pseudomonas.com) is well known for the application of community-based annotation approaches for producing a high-quality Pseudomonas aeruginosa PAO1 genome annotation, and facilitating whole-genome comparative analyses with other Pseudomonas strains. To aid analysis of potentially thousands of complete and draft genome assemblies, this database and analysis platform was upgraded to integrate curated genome annotations and isolate metadata with enhanced tools for larger scale comparative analysis and visualization. Manually curated gene annotations are supplemented with improved computational analyses that help identify putative drug targets and vaccine candidates or assist with evolutionary studies by identifying orthologs, pathogen-associated genes and genomic islands. The database schema has been updated to integrate isolate metadata that will facilitate more powerful analysis of genomes across datasets in the future. We continue to place an emphasis on providing high-quality updates to gene annotations through regular review of the scientific literature and using community-based approaches including a major new Pseudomonas community initiative for the assignment of high-quality gene ontology terms to genes. As we further expand from thousands of genomes, we plan to provide enhancements that will aid data visualization and analysis arising from whole-genome comparative studies including more pan-genome and population-based approaches. © The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research.

  17. Computational prediction of over-annotated protein-coding genes in the genome of Agrobacterium tumefaciens strain C58

    NASA Astrophysics Data System (ADS)

    Yu, Jia-Feng; Sui, Tian-Xiang; Wang, Hong-Mei; Wang, Chun-Ling; Jing, Li; Wang, Ji-Hua

    2015-12-01

    Agrobacterium tumefaciens strain C58 is a type of pathogen that can cause tumors in some dicotyledonous plants. Ever since the genome of A. tumefaciens strain C58 was sequenced, the quality of annotation of its protein-coding genes has been queried continually, because the annotation varies greatly among different databases. In this paper, the questionable hypothetical genes were re-predicted by integrating the TN curve and Z curve methods. As a result, 30 genes originally annotated as “hypothetical” were discriminated as being non-coding sequences. By testing the re-prediction program 10 times on data sets composed of the function-known genes, the mean accuracy of 99.99% and mean Matthews correlation coefficient value of 0.9999 were obtained. Further sequence analysis and COG analysis showed that the re-annotation results were very reliable. This work can provide an efficient tool and data resources for future studies of A. tumefaciens strain C58. Project supported by the National Natural Science Foundation of China (Grant Nos. 61302186 and 61271378) and the Funding from the State Key Laboratory of Bioelectronics of Southeast University.
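
For readers unfamiliar with the Matthews correlation coefficient reported above, the following sketch computes it from confusion-matrix counts using the standard definition. The function name and the example counts are illustrative and are not taken from the paper.

```python
from math import sqrt


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts.

    MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)).
    Returns 0.0 when any marginal is empty (a common convention).
    """
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    # Illustrative counts: coding vs. non-coding calls on a labelled test set.
    print(matthews_corrcoef(tp=4990, tn=4995, fp=5, fn=10))
```

The coefficient ranges from -1 (total disagreement) through 0 (no better than chance) to 1 (perfect prediction), which is why a value near 1 is reported alongside accuracy.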

  18. Improved annotation through genome-scale metabolic modeling of Aspergillus oryzae

    PubMed Central

    Vongsangnak, Wanwipa; Olsen, Peter; Hansen, Kim; Krogsgaard, Steen; Nielsen, Jens

    2008-01-01

    Background Since ancient times the filamentous fungus Aspergillus oryzae has been used in the fermentation industry for the production of fermented sauces and the production of industrial enzymes. Recently, the genome sequence of A. oryzae with 12,074 annotated genes was released but the number of hypothetical proteins accounted for more than 50% of the annotated genes. Considering the industrial importance of this fungus, it is therefore valuable to improve the annotation and further integrate genomic information with biochemical and physiological information available for this microorganism and other related fungi. Here we improved gene prediction by constructing, sequencing and assembling an A. oryzae Expressed Sequence Tag (EST) library. We enhanced the function assignment by our developed annotation strategy. The resulting better annotation was used to reconstruct the metabolic network leading to a genome scale metabolic model of A. oryzae. Results From our assembled EST sequences we identified 1,046 newly predicted genes in the A. oryzae genome. Furthermore, it was possible to assign putative protein functions to 398 of the newly predicted genes. Notably, our annotation strategy resulted in assignment of new putative functions to 1,469 hypothetical proteins already present in the A. oryzae genome database. Using the substantially improved annotated genome we reconstructed the metabolic network of A. oryzae. This network contains 729 enzymes, 1,314 enzyme-encoding genes, 1,073 metabolites and 1,846 (1,053 unique) biochemical reactions. The metabolic reactions are compartmentalized into the cytosol, the mitochondria, the peroxisome and the extracellular space. Transport steps between the compartments and the extracellular space represent 281 reactions, of which 161 are unique. The metabolic model was validated and shown to correctly describe the phenotypic behavior of A. oryzae grown on different carbon sources.
Conclusion A much enhanced annotation of the A. oryzae genome was performed and a genome-scale metabolic model of A. oryzae was reconstructed. The model accurately predicted the growth and biomass yield on different carbon sources. The model serves as an important resource for gaining further insight into our understanding of A. oryzae physiology. PMID:18500999

  19. MEETING: Chlamydomonas Annotation Jamboree - October 2003

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Grossman, Arthur R

    2007-04-13

    Shotgun sequencing of the nuclear genome of Chlamydomonas reinhardtii (Chlamydomonas throughout) was performed at an approximate 10X coverage by JGI. Roughly half of the genome is now contained on 26 scaffolds, all of which are at least 1.6 Mb, and the coverage of the genome is ~95%. There are now over 200,000 cDNA sequence reads that we have generated as part of the Chlamydomonas genome project (Grossman, 2003; Shrager et al., 2003; Grossman et al., 2007; Merchant et al., 2007); other sequences have also been generated by the Kazusa sequence group (Asamizu et al., 1999; Asamizu et al., 2000) or individual laboratories that have focused on specific genes. Shrager et al. (2003) placed the reads into distinct contigs (an assemblage of reads with overlapping nucleotide sequences), and contigs that group together as part of the same genes have been designated ACEs (assembly of contigs generated from EST information). All of the reads have also been mapped to the Chlamydomonas nuclear genome and the cDNAs and their corresponding genomic sequences have been reassembled, and the resulting assemblage is called an ACEG (an Assembly of contiguous EST sequences supported by genomic sequence) (Jain et al., 2007). Most of the unique genes or ACEGs are also represented by gene models that have been generated by the Joint Genome Institute (JGI, Walnut Creek, CA). These gene models have been placed onto the DNA scaffolds and are presented as a track on the Chlamydomonas genome browser associated with the genome portal (http://genome.jgi-psf.org/Chlre3/Chlre3.home.html). Ultimately, the meeting grant awarded by DOE has helped enormously in the development of an annotation pipeline (a set of guidelines used in the annotation of genes) and resulted in high quality annotation of over 4,000 genes; the annotators were from both Europe and the USA.
Some of the people who led the annotation initiative were Arthur Grossman, Olivier Vallon, and Sabeeha Merchant (with many individual annotators from Europe and the USA). Olivier Vallon has been most active in continued input of annotation information.

  20. Gene annotation from scientific literature using mappings between keyword systems.

    PubMed

    Pérez, Antonio J; Perez-Iratxeta, Carolina; Bork, Peer; Thode, Guillermo; Andrade, Miguel A

    2004-09-01

    The description of genes in databases by keywords helps the non-specialist to quickly grasp the properties of a gene and increases the efficiency of computational tools that are applied to gene data (e.g. searching a gene database for sequences related to a particular biological process). However, the association of keywords to genes or protein sequences is a difficult process that ultimately implies examination of the literature related to a gene. To support this task, we present a procedure to derive keywords from the set of scientific abstracts related to a gene. Our system is based on the automated extraction of mappings between related terms from different databases using a model of fuzzy associations that can be applied with all generality to any pair of linked databases. We tested the system by annotating genes of the SWISS-PROT database with keywords derived from the abstracts linked to their entries (stored in the MEDLINE database of scientific references). The performance of the annotation procedure was much better for SWISS-PROT keywords (recall of 47%, precision of 68%) than for Gene Ontology terms (recall of 8%, precision of 67%). The algorithm can be publicly accessed and used for the annotation of sequences through a web server at http://www.bork.embl.de/kat

  1. KinFin: Software for Taxon-Aware Analysis of Clustered Protein Sequences.

    PubMed

    Laetsch, Dominik R; Blaxter, Mark L

    2017-10-05

    The field of comparative genomics is concerned with the study of similarities and differences between the information encoded in the genomes of organisms. A common approach is to define gene families by clustering protein sequences based on sequence similarity, and analyze protein cluster presence and absence in different species groups as a guide to biology. Due to the high dimensionality of these data, downstream analysis of protein clusters inferred from large numbers of species, or species with many genes, is nontrivial, and few solutions exist for transparent, reproducible, and customizable analyses. We present KinFin, a streamlined software solution capable of integrating data from common file formats and delivering aggregative annotation of protein clusters. KinFin delivers analyses based on systematic taxonomy of the species analyzed, or on user-defined, groupings of taxa, for example, sets based on attributes such as life history traits, organismal phenotypes, or competing phylogenetic hypotheses. Results are reported through graphical and detailed text output files. We illustrate the utility of the KinFin pipeline by addressing questions regarding the biology of filarial nematodes, which include parasites of veterinary and medical importance. We resolve the phylogenetic relationships between the species and explore functional annotation of proteins in clusters in key lineages and between custom taxon sets, identifying gene families of interest. KinFin can easily be integrated into existing comparative genomic workflows, and promotes transparent and reproducible analysis of clustered protein data. Copyright © 2017 Laetsch and Blaxter.

  2. Large-scale annotation of small-molecule libraries using public databases.

    PubMed

    Zhou, Yingyao; Zhou, Bin; Chen, Kaisheng; Yan, S Frank; King, Frederick J; Jiang, Shumei; Winzeler, Elizabeth A

    2007-01-01

    While many large publicly accessible databases provide excellent annotation for biological macromolecules, the same is not true for small chemical compounds. Commercial data sources also fail to encompass an annotation interface for large numbers of compounds and tend to be too costly to be widely available to biomedical researchers. Therefore, using annotation information for the selection of lead compounds from a modern-day high-throughput screening (HTS) campaign presently occurs only on a very limited scale. The recent rapid expansion of the NIH PubChem database provides an opportunity to link existing biological databases with compound catalogs and provides relevant information that potentially could improve the information garnered from large-scale screening efforts. Using the 2.5 million compound collection at the Genomics Institute of the Novartis Research Foundation (GNF) as a model, we determined that approximately 4% of the library contained compounds with potential annotation in such databases as PubChem and the World Drug Index (WDI) as well as related databases such as the Kyoto Encyclopedia of Genes and Genomes (KEGG) and ChemIDplus. Furthermore, the exact structure match analysis showed 32% of GNF compounds can be linked to third party databases via PubChem. We also showed annotations such as MeSH (medical subject headings) terms can be applied to in-house HTS databases in identifying signature biological inhibition profiles of interest as well as expediting the assay validation process. The automated annotation of thousands of screening hits in batch is becoming feasible and has the potential to play an essential role in the hit-to-lead decision making process.

  3. Genic insights from integrated human proteomics in GeneCards.

    PubMed

    Fishilevich, Simon; Zimmerman, Shahar; Kohn, Asher; Iny Stein, Tsippi; Olender, Tsviya; Kolker, Eugene; Safran, Marilyn; Lancet, Doron

    2016-01-01

    GeneCards is a one-stop shop for searchable human gene annotations (http://www.genecards.org/). Data are automatically mined from ∼120 sources and presented in an integrated web card for every human gene. We report the application of recent advances in proteomics to enhance gene annotation and classification in GeneCards. First, we constructed the Human Integrated Protein Expression Database (HIPED), a unified database of protein abundance in human tissues, based on the publicly available mass spectrometry (MS)-based proteomics sources ProteomicsDB, Multi-Omics Profiling Expression Database, Protein Abundance Across Organisms and The MaxQuant DataBase. The integrated database, residing within GeneCards, compares favourably with its individual sources, covering nearly 90% of human protein-coding genes. For gene annotation and comparisons, we first defined a protein expression vector for each gene, based on normalized abundances in 69 normal human tissues. This vector is portrayed in the GeneCards expression section as a bar graph, allowing visual inspection and comparison. These data are juxtaposed with transcriptome bar graphs. Using the protein expression vectors, we further defined a pairwise metric that helps assess expression-based pairwise proximity. This new metric for finding functional partners complements eight others, including sharing of pathways, gene ontology (GO) terms and domains, implemented in the GeneCards Suite. In parallel, we calculated proteome-based differential expression, highlighting a subset of tissues that overexpress a gene and subserving gene classification. This textual annotation allows users of VarElect, the suite's next-generation phenotyper, to more effectively discover causative disease variants. Finally, we define the protein-RNA expression ratio and correlation as yet another attribute of every gene in each tissue, adding further annotative information.
The results constitute a significant enhancement of several GeneCards sections and help promote and organize the genome-wide structural and functional knowledge of the human proteome. Database URL: http://www.genecards.org/. © The Author(s) 2016. Published by Oxford University Press.

  4. Evidence-Based Annotation of Gene Function in Shewanella oneidensis MR-1 Using Genome-Wide Fitness Profiling across 121 Conditions

    PubMed Central

    Deutschbauer, Adam; Price, Morgan N.; Wetmore, Kelly M.; Shao, Wenjun; Baumohl, Jason K.; Xu, Zhuchen; Nguyen, Michelle; Tamse, Raquel; Davis, Ronald W.; Arkin, Adam P.

    2011-01-01

    Most genes in bacteria are experimentally uncharacterized and cannot be annotated with a specific function. Given the great diversity of bacteria and the ease of genome sequencing, high-throughput approaches to identify gene function experimentally are needed. Here, we use pools of tagged transposon mutants in the metal-reducing bacterium Shewanella oneidensis MR-1 to probe the mutant fitness of 3,355 genes in 121 diverse conditions including different growth substrates, alternative electron acceptors, stresses, and motility. We find that 2,350 genes have a pattern of fitness that is significantly different from random and 1,230 of these genes (37% of our total assayed genes) have enough signal to show strong biological correlations. We find that genes in all functional categories have phenotypes, including hundreds of hypotheticals, and that potentially redundant genes (over 50% amino acid identity to another gene in the genome) are also likely to have distinct phenotypes. Using fitness patterns, we were able to propose specific molecular functions for 40 genes or operons that lacked specific annotations or had incomplete annotations. In one example, we demonstrate that the previously hypothetical gene SO_3749 encodes a functional acetylornithine deacetylase, thus filling a missing step in S. oneidensis metabolism. Additionally, we demonstrate that the orphan histidine kinase SO_2742 and orphan response regulator SO_2648 form a signal transduction pathway that activates expression of acetyl-CoA synthase and is required for S. oneidensis to grow on acetate as a carbon source. Lastly, we demonstrate that gene expression and mutant fitness are poorly correlated and that mutant fitness generates more confident predictions of gene function than does gene expression. The approach described here can be applied generally to create large-scale gene-phenotype maps for evidence-based annotation of gene function in prokaryotes. PMID:22125499

  5. A GWAS meta-analysis from 5 population-based cohorts implicates ion channel genes in the pathogenesis of irritable bowel syndrome.

    PubMed

    Bonfiglio, F; Henström, M; Nag, A; Hadizadeh, F; Zheng, T; Cenit, M C; Tigchelaar, E; Williams, F; Reznichenko, A; Ek, W E; Rivera, N V; Homuth, G; Aghdassi, A A; Kacprowski, T; Männikkö, M; Karhunen, V; Bujanda, L; Rafter, J; Wijmenga, C; Ronkainen, J; Hysi, P; Zhernakova, A; D'Amato, M

    2018-04-19

    Irritable bowel syndrome (IBS) shows genetic predisposition; however, large-scale, powered gene mapping studies are lacking. We sought to exploit existing genetic (genotype) and epidemiological (questionnaire) data from a series of population-based cohorts for IBS genome-wide association studies (GWAS) and their meta-analysis. Based on questionnaire data compatible with Rome III Criteria, we identified a total of 1335 IBS cases and 9768 asymptomatic individuals from 5 independent European genotyped cohorts. Individual GWAS were carried out with sex-adjusted logistic regression under an additive model, followed by meta-analysis using the inverse variance method. Functional annotation of significant results was obtained via a computational pipeline exploiting ontology and interaction networks, and tissue-specific and gene set enrichment analyses. Suggestive GWAS signals (P ≤ 5.0 × 10⁻⁶) were detected for 7 genomic regions, harboring 64 gene candidates to affect IBS risk via functional or expression changes. Functional annotation of this gene set convincingly (best FDR-corrected P = 3.1 × 10⁻¹⁰) highlighted regulation of ion channel activity as the most plausible pathway affecting IBS risk. Our results confirm the feasibility of population-based studies for gene-discovery efforts in IBS, identify risk genes and loci to be prioritized in independent follow-ups, and pinpoint ion channels as important players and potential therapeutic targets warranting further investigation. © 2018 John Wiley & Sons Ltd.
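
    The inverse variance method named in this abstract can be illustrated with a minimal sketch. It assumes the standard fixed-effect form, in which each cohort's effect estimate (for example a log odds ratio) is weighted by the reciprocal of its squared standard error; the function name and the toy inputs are hypothetical, and the abstract does not state which software or model variant the authors used.

```python
import math

def inverse_variance_meta(betas, ses):
    """Fixed-effect inverse-variance pooling of per-cohort effect estimates.

    betas: per-cohort effect sizes (e.g. log odds ratios for one variant)
    ses:   matching standard errors
    Returns (pooled_beta, pooled_se, z, two_sided_p).
    """
    if not betas or len(betas) != len(ses):
        raise ValueError("betas and ses must be non-empty and of equal length")
    # Each cohort is weighted by the precision of its estimate.
    weights = [1.0 / (se ** 2) for se in ses]
    total_weight = sum(weights)
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    z = pooled_beta / pooled_se
    # Two-sided p-value under a standard normal null.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return pooled_beta, pooled_se, z, p
```

    With two cohorts reporting the same effect and the same standard error, the pooled effect is unchanged and the pooled standard error shrinks by a factor of √2.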

  6. Metabolic network prediction through pairwise rational kernels.

    PubMed

    Roche-Lima, Abiel; Domaratzki, Michael; Fristensky, Brian

    2014-09-26

    Metabolic networks are represented by the set of metabolic pathways. Metabolic pathways are a series of biochemical reactions, in which the product (output) from one reaction serves as the substrate (input) to another reaction. Many pathways remain incompletely characterized. One of the major challenges of computational biology is to obtain better models of metabolic pathways. Existing models are dependent on the annotation of the genes. This propagates error accumulation when the pathways are predicted by incorrectly annotated genes. Pairwise classification methods are supervised learning methods used to classify new pairs of entities. Some of these classification methods, e.g., Pairwise Support Vector Machines (SVMs), use pairwise kernels. Pairwise kernels describe similarity measures between two pairs of entities. Using pairwise kernels to handle sequence data requires long processing times and large storage. Rational kernels are kernels based on weighted finite-state transducers that represent similarity measures between sequences or automata. They have been effectively used in problems that handle large amounts of sequence information such as protein essentiality, natural language processing and machine translations. We create a new family of pairwise kernels using weighted finite-state transducers (called Pairwise Rational Kernel (PRK)) to predict metabolic pathways from a variety of biological data. PRKs take advantage of the simpler representations and faster algorithms of transducers. Because raw sequence data can be used, the predictor model avoids the errors introduced by incorrect gene annotations. We then developed several experiments with PRKs and Pairwise SVM to validate our methods using the metabolic network of Saccharomyces cerevisiae. As a result, when PRKs are used, our method executes faster in comparison with other pairwise kernels.
Also, when we use PRKs combined with other simple kernels that include evolutionary information, the accuracy values have been improved, while maintaining lower construction and execution times. The power of using kernels is that almost any sort of data can be represented using kernels. Therefore, completely disparate types of data can be combined to add power to kernel-based machine learning methods. When we compared our proposal using PRKs with other similar kernels, the execution times were decreased, with no compromise of accuracy. We also showed that by combining PRKs with other kernels that include evolutionary information, the accuracy can also be improved. As our proposal can use any type of sequence data, genes do not need to be properly annotated, avoiding accumulation errors because of incorrect previous annotations.
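
    To make the idea of a pairwise kernel concrete, the sketch below builds one from a sequence kernel. It is not the authors' transducer-based PRK: as stated assumptions, the base kernel is a plain k-mer spectrum kernel standing in for a rational kernel, and the pairwise combination is the commonly used tensor-product form K((a,b),(c,d)) = k(a,c)k(b,d) + k(a,d)k(b,c). All function names are hypothetical.

```python
from collections import Counter

def spectrum_kernel(s, t, k=3):
    """Base sequence kernel: inner product of k-mer count vectors.

    Assumption: used here only as a simple stand-in for a rational kernel.
    """
    counts_s = Counter(s[i:i + k] for i in range(len(s) - k + 1))
    counts_t = Counter(t[i:i + k] for i in range(len(t) - k + 1))
    # Counter returns 0 for k-mers absent from the other sequence.
    return sum(counts_s[kmer] * counts_t[kmer] for kmer in counts_s)

def pairwise_kernel(pair1, pair2, base=spectrum_kernel):
    """Tensor-product pairwise kernel between two pairs of sequences.

    The two cross terms make the value independent of the order of the
    elements inside each pair.
    """
    a, b = pair1
    c, d = pair2
    return base(a, c) * base(b, d) + base(a, d) * base(b, c)
```

    Because the kernel works directly on raw sequences, no gene annotation is needed to compute it, which is the property the abstract emphasizes.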

  7. Towards an informative mutant phenotype for every bacterial gene

    DOE PAGES

    Deutschbauer, Adam; Price, Morgan N.; Wetmore, Kelly M.; ...

    2014-08-11

    Mutant phenotypes provide strong clues to the functions of the underlying genes and could allow annotation of the millions of sequenced yet uncharacterized bacterial genes. However, it is not known how many genes have a phenotype under laboratory conditions, how many phenotypes are biologically interpretable for predicting gene function, and what experimental conditions are optimal to maximize the number of genes with a phenotype. To address these issues, we measured the mutant fitness of 1,586 genes of the ethanol-producing bacterium Zymomonas mobilis ZM4 across 492 diverse experiments and found statistically significant phenotypes for 89% of all assayed genes. Thus, in Z. mobilis, most genes have a functional consequence under laboratory conditions. We demonstrate that 41% of Z. mobilis genes have both a strong phenotype and a similar fitness pattern (cofitness) to another gene, and are therefore good candidates for functional annotation using mutant fitness. Among 502 poorly characterized Z. mobilis genes, we identified a significant cofitness relationship for 174. For 57 of these genes without a specific functional annotation, we found additional evidence to support the biological significance of these gene-gene associations, and in 33 instances, we were able to predict specific physiological or biochemical roles for the poorly characterized genes. Last, we identified a set of 79 diverse mutant fitness experiments in Z. mobilis that are nearly as biologically informative as the entire set of 492 experiments. Therefore, our work provides a blueprint for the functional annotation of diverse bacteria using mutant fitness.
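
    The notion of cofitness (a similar fitness pattern shared by two genes) can be sketched as follows. The abstract does not give the formula, so this assumes cofitness is measured as the Pearson correlation between two genes' fitness values across the same experiments; the function names and the toy data are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length fitness profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have equal length of at least 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant profile has no defined correlation")
    return sxy / math.sqrt(sxx * syy)

def best_cofit_partner(fitness, gene):
    """Return (partner, cofitness) for the gene most correlated with `gene`.

    fitness: dict mapping gene name -> list of fitness values, one per experiment.
    """
    others = [g for g in fitness if g != gene]
    if not others:
        raise ValueError("need at least one other gene")
    scored = [(g, pearson(fitness[gene], fitness[g])) for g in others]
    return max(scored, key=lambda item: item[1])
```

    A poorly characterized gene whose best partner has a known role and a high correlation would be a candidate for annotation by association, in the spirit of the approach described above.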

  8. Estimating gene gain and loss rates in the presence of error in genome assembly and annotation using CAFE 3.

    PubMed

    Han, Mira V; Thomas, Gregg W C; Lugo-Martinez, Jose; Hahn, Matthew W

    2013-08-01

    Current sequencing methods produce large amounts of data, but genome assemblies constructed from these data are often fragmented and incomplete. Incomplete and error-filled assemblies result in many annotation errors, especially in the number of genes present in a genome. This means that methods attempting to estimate rates of gene duplication and loss often will be misled by such errors and that rates of gene family evolution will be consistently overestimated. Here, we present a method that takes these errors into account, allowing one to accurately infer rates of gene gain and loss among genomes even with low assembly and annotation quality. The method is implemented in the newest version of the software package CAFE, along with several other novel features. We demonstrate the accuracy of the method with extensive simulations and reanalyze several previously published data sets. Our results show that errors in genome annotation do lead to higher inferred rates of gene gain and loss but that CAFE 3 sufficiently accounts for these errors to provide accurate estimates of important evolutionary parameters.

  9. The Plant Structure Ontology, a Unified Vocabulary of Anatomy and Morphology of a Flowering Plant

    PubMed Central

    Ilic, Katica; Kellogg, Elizabeth A.; Jaiswal, Pankaj; Zapata, Felipe; Stevens, Peter F.; Vincent, Leszek P.; Avraham, Shulamit; Reiser, Leonore; Pujar, Anuradha; Sachs, Martin M.; Whitman, Noah T.; McCouch, Susan R.; Schaeffer, Mary L.; Ware, Doreen H.; Stein, Lincoln D.; Rhee, Seung Y.

    2007-01-01

    Formal description of plant phenotypes and standardized annotation of gene expression and protein localization data require uniform terminology that accurately describes plant anatomy and morphology. This facilitates cross species comparative studies and quantitative comparison of phenotypes and expression patterns. A major drawback is variable terminology that is used to describe plant anatomy and morphology in publications and genomic databases for different species. The same terms are sometimes applied to different plant structures in different taxonomic groups. Conversely, similar structures are named by their species-specific terms. To address this problem, we created the Plant Structure Ontology (PSO), the first generic ontological representation of anatomy and morphology of a flowering plant. The PSO is intended for a broad plant research community, including bench scientists, curators in genomic databases, and bioinformaticians. The initial releases of the PSO integrated existing ontologies for Arabidopsis (Arabidopsis thaliana), maize (Zea mays), and rice (Oryza sativa); more recent versions of the ontology encompass terms relevant to Fabaceae, Solanaceae, additional cereal crops, and poplar (Populus spp.). Databases such as The Arabidopsis Information Resource, Nottingham Arabidopsis Stock Centre, Gramene, MaizeGDB, and SOL Genomics Network are using the PSO to describe expression patterns of genes and phenotypes of mutants and natural variants and are regularly contributing new annotations to the Plant Ontology database. The PSO is also used in specialized public databases, such as BRENDA, GENEVESTIGATOR, NASCArrays, and others. Over 10,000 gene annotations and phenotype descriptions from participating databases can be queried and retrieved using the Plant Ontology browser. The PSO, as well as contributed gene associations, can be obtained at www.plantontology.org. PMID:17142475

  10. Xander: employing a novel method for efficient gene-targeted metagenomic assembly

    DOE PAGES

    Wang, Qiong; Fish, Jordan A.; Gilman, Mariah; ...

    2015-08-05

    Here, metagenomics can provide important insight into microbial communities. However, assembling metagenomic datasets has proven to be computationally challenging. Current methods often assemble only fragmented partial genes. We present a novel method for targeting assembly of specific protein-coding genes. This method combines a de Bruijn graph, as used in standard assembly approaches, and a protein profile hidden Markov model (HMM) for the gene of interest, as used in standard annotation approaches. These are used to create a novel combined weighted assembly graph. Xander performs both assembly and annotation concomitantly using information incorporated in this graph. We demonstrate the utility of this approach by assembling contigs for one phylogenetic marker gene and for two functional marker genes, first on Human Microbiome Project (HMP)-defined community Illumina data and then on 21 rhizosphere soil metagenomic datasets from three different crops totaling over 800 Gbp of unassembled data. We compared our method to a recently published bulk metagenome assembly method and a recently published gene-targeted assembler and found our method produced more, longer, and higher quality gene sequences. In conclusion, Xander combines gene assignment with the rapid assembly of full-length or near full-length functional genes from metagenomic data without requiring bulk assembly or post-processing to find genes of interest. HMMs used for assembly can be tailored to the targeted genes, allowing flexibility to improve annotation over generic annotation pipelines.
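
    One of the two ingredients named in this abstract, the de Bruijn graph, can be sketched in a few lines. This is only the generic textbook construction, with nodes as (k-1)-mers and an edge for every k-mer observed in a read; it does not include the profile HMM weighting that Xander adds, and the function name is hypothetical.

```python
from collections import defaultdict

def de_bruijn_graph(reads, k):
    """Build a de Bruijn graph from reads.

    Nodes are (k-1)-mers; each k-mer in a read adds an edge from its
    prefix (first k-1 bases) to its suffix (last k-1 bases).
    Returns a dict mapping node -> set of successor nodes.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)
```

    A gene-targeted assembler would then search paths through such a graph, scoring them against a model of the gene of interest rather than assembling every path.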

  11. Protannotator: a semiautomated pipeline for chromosome-wise functional annotation of the "missing" human proteome.

    PubMed

    Islam, Mohammad T; Garg, Gagan; Hancock, William S; Risk, Brian A; Baker, Mark S; Ranganathan, Shoba

    2014-01-03

    The chromosome-centric human proteome project (C-HPP) aims to define the complete set of proteins encoded in each human chromosome. The neXtProt database (September 2013) lists 20,128 proteins for the human proteome, of which 3831 human proteins (∼19%) are considered "missing" according to the standard metrics table (released September 27, 2013). In support of the C-HPP initiative, we have extended the annotation strategy developed for human chromosome 7 "missing" proteins into a semiautomated pipeline to functionally annotate the "missing" human proteome. This pipeline integrates a suite of bioinformatics analysis and annotation software tools to identify homologues and map putative functional signatures, gene ontology, and biochemical pathways. From sequential BLAST searches, we have primarily identified homologues from reviewed nonhuman mammalian proteins with protein evidence for 1271 (33.2%) "missing" proteins, followed by 703 (18.4%) homologues from reviewed nonhuman mammalian proteins and subsequently 564 (14.7%) homologues from reviewed human proteins. Functional annotations for 1945 (50.8%) "missing" proteins were also determined. To accelerate the identification of "missing" proteins from proteomics studies, we generated proteotypic peptides in silico. Matching these proteotypic peptides to ENCODE proteogenomic data resulted in proteomic evidence for 107 (2.8%) of the 3831 "missing" proteins, while evidence from a recent membrane proteomic study supported the existence of another 15 "missing" proteins. The chromosome-wise functional annotation of all "missing" proteins is freely available to the scientific community through our web server (http://biolinfo.org/protannotator).

  12. nGASP - the nematode genome annotation assessment project

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Coghlan, A; Fiedler, T J; McKay, S J

    2008-12-19

    While the C. elegans genome is extensively annotated, relatively little information is available for other Caenorhabditis species. The nematode genome annotation assessment project (nGASP) was launched to objectively assess the accuracy of protein-coding gene prediction software in C. elegans, and to apply this knowledge to the annotation of the genomes of four additional Caenorhabditis species and other nematodes. Seventeen groups worldwide participated in nGASP, and submitted 47 prediction sets for 10 Mb of the C. elegans genome. Predictions were compared to reference gene sets consisting of confirmed or manually curated gene models from WormBase. The most accurate gene-finders were 'combiner' algorithms, which made use of transcript- and protein-alignments and multi-genome alignments, as well as gene predictions from other gene-finders. Gene-finders that used alignments of ESTs, mRNAs and proteins came in second place. There was a tie for third place between gene-finders that used multi-genome alignments and ab initio gene-finders. The median gene level sensitivity of combiners was 78% and their specificity was 42%, which is nearly the same accuracy as reported for combiners in the human genome. C. elegans genes with exons of unusual hexamer content, as well as those with many exons, short exons, long introns, a weak translation start signal, weak splice sites, or poorly conserved orthologs were the most challenging for gene-finders.
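
    The gene-level sensitivity and specificity figures quoted in this abstract can be illustrated with a minimal sketch. Two assumptions are made that the abstract does not state: a predicted gene counts as correct only if its exon coordinates match a reference gene exactly, and specificity follows the gene-finding convention of correct predictions divided by all predictions (what other fields call precision). The function name and coordinates are hypothetical.

```python
def gene_level_accuracy(reference, predicted):
    """Gene-level sensitivity and specificity for a set of gene predictions.

    reference, predicted: iterables of gene models, each a tuple of
    (start, end) exon coordinates. Exact-match scoring is an assumption.
    Returns (sensitivity, specificity).
    """
    ref_set, pred_set = set(reference), set(predicted)
    if not ref_set or not pred_set:
        raise ValueError("both gene sets must be non-empty")
    correct = len(ref_set & pred_set)
    sensitivity = correct / len(ref_set)   # fraction of reference genes recovered
    specificity = correct / len(pred_set)  # fraction of predictions that are correct
    return sensitivity, specificity
```

    Under these definitions, a gene-finder that predicts many extra or slightly misplaced gene models keeps its sensitivity but loses specificity.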

  13. Active subnetwork recovery with a mechanism-dependent scoring function; with application to angiogenesis and organogenesis studies

    PubMed Central

    2013-01-01

    Background The learning active subnetworks problem involves finding subnetworks of a bio-molecular network that are active in a particular condition. Many approaches integrate observation data (e.g., gene expression) with the network topology to find candidate subnetworks. Increasingly, pathway databases contain additional annotation information that can be mined to improve prediction accuracy, e.g., interaction mechanism (e.g., transcription, microRNA, cleavage) annotations. We introduce a mechanism-based approach to active subnetwork recovery which exploits such annotations. We suggest that neighboring interactions in a network tend to be co-activated in a way that depends on the “correlation” of their mechanism annotations; e.g., neighboring phosphorylation and de-phosphorylation interactions may be more likely to be co-activated than neighboring phosphorylation and covalent bonding interactions. Results Our method iteratively learns the mechanism correlations and finds the most likely active subnetwork. We use a probabilistic graphical model with a Markov Random Field component which creates dependencies between the states (active or non-active) of neighboring interactions, and which incorporates a mechanism-based component into the function. We apply a heuristic, EM-based algorithm suitable for the problem. We validated our method’s performance using simulated data in networks downloaded from GeneGO against the same approach without the mechanism-based component, and two other existing methods. We validated our method’s performance in correctly recovering (1) the true interaction states, and (2) global network properties of the original network against these other methods. We applied our method to networks generated from time-course gene expression studies in angiogenesis and lung organogenesis and validated the findings from a biological perspective against current literature.
Conclusions The advantage of our mechanism-based approach is best seen in networks composed of connected regions with a large number of interactions annotated with a subset of mechanisms, e.g., a regulatory region of transcription interactions, or a cleavage cascade region. When applied to real datasets, our method recovered novel and biologically meaningful putative interactions, e.g., interactions from an integrin signaling pathway using the angiogenesis dataset, and a group of regulatory microRNA interactions in an organogenesis network. PMID:23432934

  14. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Deutschbauer, Adam; Price, Morgan N.; Wetmore, Kelly M.

    Mutant phenotypes provide strong clues to the functions of the underlying genes and could allow annotation of the millions of sequenced yet uncharacterized bacterial genes. However, it is not known how many genes have a phenotype under laboratory conditions, how many phenotypes are biologically interpretable for predicting gene function, and what experimental conditions are optimal to maximize the number of genes with a phenotype. To address these issues, we measured the mutant fitness of 1,586 genes of the ethanol-producing bacterium Zymomonas mobilis ZM4 across 492 diverse experiments and found statistically significant phenotypes for 89% of all assayed genes. Thus, in Z. mobilis, most genes have a functional consequence under laboratory conditions. We demonstrate that 41% of Z. mobilis genes have both a strong phenotype and a similar fitness pattern (cofitness) to another gene, and are therefore good candidates for functional annotation using mutant fitness. Among 502 poorly characterized Z. mobilis genes, we identified a significant cofitness relationship for 174. For 57 of these genes without a specific functional annotation, we found additional evidence to support the biological significance of these gene-gene associations, and in 33 instances, we were able to predict specific physiological or biochemical roles for the poorly characterized genes. Last, we identified a set of 79 diverse mutant fitness experiments in Z. mobilis that are nearly as biologically informative as the entire set of 492 experiments. Therefore, our work provides a blueprint for the functional annotation of diverse bacteria using mutant fitness.
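
    The idea of cofitness in the record above (two genes sharing a similar fitness pattern across experiments) can be illustrated with a minimal sketch. It assumes that cofitness is scored as the Pearson correlation between two genes' fitness profiles; the abstract does not state the exact measure, and the gene names and fitness values below are made up for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("a constant profile has no defined correlation")
    return cov / sqrt(vx * vy)


# Hypothetical fitness values, one per experiment (not real data).
fitness = {
    "geneA": [-2.1, 0.1, -1.8, 0.0, -2.5],
    "geneB": [-1.9, 0.2, -1.5, 0.1, -2.2],
    "geneC": [0.1, -1.7, 0.2, -2.0, 0.0],
}


def best_cofit_partner(gene, profiles):
    """Return (correlation, partner) for the most similar other gene."""
    scored = [
        (pearson(profiles[gene], profile), other)
        for other, profile in profiles.items()
        if other != gene
    ]
    return max(scored)


if __name__ == "__main__":
    print(best_cofit_partner("geneA", fitness))
```

    In this toy data, geneA and geneB lose fitness in the same experiments, so geneB is returned as the closest partner of geneA. A real analysis would also need a significance threshold, which this sketch omits.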

  15. Analysis and Functional Annotation of an Expressed Sequence Tag Collection for Tropical Crop Sugarcane

    PubMed Central

    Vettore, André L.; da Silva, Felipe R.; Kemper, Edson L.; Souza, Glaucia M.; da Silva, Aline M.; Ferro, Maria Inês T.; Henrique-Silva, Flavio; Giglioti, Éder A.; Lemos, Manoel V.F.; Coutinho, Luiz L.; Nobrega, Marina P.; Carrer, Helaine; França, Suzelei C.; Bacci, Maurício; Goldman, Maria Helena S.; Gomes, Suely L.; Nunes, Luiz R.; Camargo, Luis E.A.; Siqueira, Walter J.; Van Sluys, Marie-Anne; Thiemann, Otavio H.; Kuramae, Eiko E.; Santelli, Roberto V.; Marino, Celso L.; Targon, Maria L.P.N.; Ferro, Jesus A.; Silveira, Henrique C.S.; Marini, Danyelle C.; Lemos, Eliana G.M.; Monteiro-Vitorello, Claudia B.; Tambor, José H.M.; Carraro, Dirce M.; Roberto, Patrícia G.; Martins, Vanderlei G.; Goldman, Gustavo H.; de Oliveira, Regina C.; Truffi, Daniela; Colombo, Carlos A.; Rossi, Magdalena; de Araujo, Paula G.; Sculaccio, Susana A.; Angella, Aline; Lima, Marleide M.A.; de Rosa, Vicente E.; Siviero, Fábio; Coscrato, Virginia E.; Machado, Marcos A.; Grivet, Laurent; Di Mauro, Sonia M.Z.; Nobrega, Francisco G.; Menck, Carlos F.M.; Braga, Marilia D.V.; Telles, Guilherme P.; Cara, Frank A.A.; Pedrosa, Guilherme; Meidanis, João; Arruda, Paulo

    2003-01-01

    To contribute to our understanding of the genome complexity of sugarcane, we undertook a large-scale expressed sequence tag (EST) program. More than 260,000 cDNA clones were partially sequenced from 26 standard cDNA libraries generated from different sugarcane tissues. After the processing of the sequences, 237,954 high-quality ESTs were identified. These ESTs were assembled into 43,141 putative transcripts. Of the assembled sequences, 35.6% presented no matches with existing sequences in public databases. A global analysis of the whole SUCEST data set indicated that 14,409 assembled sequences (33% of the total) contained at least one cDNA clone with a full-length insert. Annotation of the 43,141 assembled sequences associated almost 50% of the putative identified sugarcane genes with protein metabolism, cellular communication/signal transduction, bioenergetics, and stress responses. Inspection of the translated assembled sequences for conserved protein domains revealed 40,821 amino acid sequences with 1415 Pfam domains. Reassembling the consensus sequences of the 43,141 transcripts revealed a 22% redundancy in the first assembling. This indicated that possibly 33,620 unique genes had been identified and indicated that >90% of the sugarcane expressed genes were tagged. PMID:14613979
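
    The counts reported in the record above can be cross-checked with a few lines of arithmetic. This is only a consistency check on the published figures; the variable names are chosen here for illustration.

```python
# Figures taken from the abstract above.
transcripts = 43_141    # assembled putative transcripts
unique_genes = 33_620   # estimated unique genes after reassembly
full_length = 14_409    # assembled sequences with a full-length insert

# Redundancy implied by the two reported counts.
redundancy = 1 - unique_genes / transcripts
# Share of assembled sequences containing a full-length clone.
full_length_fraction = full_length / transcripts
# Gene count obtained if exactly 22% redundancy is applied.
genes_at_22_percent = round(transcripts * (1 - 0.22))

print(f"redundancy: {redundancy:.1%}")                      # 22.1%
print(f"full-length fraction: {full_length_fraction:.1%}")  # 33.4%
print(genes_at_22_percent)                                  # 33650
```

    The reported 33,620 unique genes correspond to a redundancy of about 22.1%, which agrees with the rounded 22% in the abstract; applying exactly 22% would give 33,650.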

  16. A computational approach to candidate gene prioritization for X-linked mental retardation using annotation-based binary filtering and motif-based linear discriminatory analysis

    PubMed Central

    2011-01-01

    Background Several computational candidate gene selection and prioritization methods have recently been developed. These in silico selection and prioritization techniques are usually based on two central approaches - the examination of similarities to known disease genes and/or the evaluation of functional annotation of genes. Each of these approaches has its own caveats. Here we employ a previously described method of candidate gene prioritization based mainly on gene annotation, in combination with a technique based on the evaluation of pertinent sequence motifs or signatures, in an attempt to refine the gene prioritization approach. We apply this approach to X-linked mental retardation (XLMR), a group of heterogeneous disorders for which some of the underlying genetics is known. Results The gene annotation-based binary filtering method yielded a ranked list of putative XLMR candidate genes with good plausibility of being associated with the development of mental retardation. In parallel, a motif finding approach based on linear discriminatory analysis (LDA) was employed to identify short sequence patterns that may discriminate XLMR from non-XLMR genes. High rates (>80%) of correct classification were achieved, suggesting that the identification of these motifs effectively captures genomic signals associated with XLMR vs. non-XLMR genes. The computational tools developed for the motif-based LDA are integrated into the freely available genomic analysis portal Galaxy (http://main.g2.bx.psu.edu/). Nine genes (APLN, ZC4H2, MAGED4, MAGED4B, RAP2C, FAM156A, FAM156B, TBL1X, and UXT) were highlighted as highly ranked XLMR candidate genes. Conclusions The combination of gene annotation information and sequence motif-orientated computational candidate gene prediction methods highlights an added benefit in generating a list of plausible candidate genes, as has been demonstrated for XLMR.
Reviewers: This article was reviewed by Dr Barbara Bardoni (nominated by Prof Juergen Brosius); Prof Neil Smalheiser and Dr Dustin Holloway (nominated by Prof Charles DeLisi). PMID:21668950

  17. NCBI prokaryotic genome annotation pipeline.

    PubMed

    Tatusova, Tatiana; DiCuccio, Michael; Badretdin, Azat; Chetvernin, Vyacheslav; Nawrocki, Eric P; Zaslavsky, Leonid; Lomsadze, Alexandre; Pruitt, Kim D; Borodovsky, Mark; Ostell, James

    2016-08-19

    Recent technological advances have opened unprecedented opportunities for large-scale sequencing and analysis of populations of pathogenic species in disease outbreaks, as well as for large-scale diversity studies aimed at expanding our knowledge across the whole domain of prokaryotes. To meet the challenge of timely interpretation of structure, function and meaning of this vast genetic information, a comprehensive approach to automatic genome annotation is critically needed. In collaboration with Georgia Tech, NCBI has developed a new approach to genome annotation that combines alignment based methods with methods of predicting protein-coding and RNA genes and other functional elements directly from sequence. A new gene finding tool, GeneMarkS+, uses the combined evidence of protein and RNA placement by homology as an initial map of annotation to generate and modify ab initio gene predictions across the whole genome. Thus, the new NCBI's Prokaryotic Genome Annotation Pipeline (PGAP) relies more on sequence similarity when confident comparative data are available, while it relies more on statistical predictions in the absence of external evidence. The pipeline provides a framework for generation and analysis of annotation on the full breadth of prokaryotic taxonomy. For additional information on PGAP see https://www.ncbi.nlm.nih.gov/genome/annotation_prok/ and the NCBI Handbook, https://www.ncbi.nlm.nih.gov/books/NBK174280/. Published by Oxford University Press on behalf of Nucleic Acids Research 2016. This work is written by (a) US Government employee(s) and is in the public domain in the US.

  18. Curated genome annotation of Oryza sativa ssp. japonica and comparative genome analysis with Arabidopsis thaliana

    PubMed Central

    Itoh, Takeshi; Tanaka, Tsuyoshi; Barrero, Roberto A.; Yamasaki, Chisato; Fujii, Yasuyuki; Hilton, Phillip B.; Antonio, Baltazar A.; Aono, Hideo; Apweiler, Rolf; Bruskiewich, Richard; Bureau, Thomas; Burr, Frances; Costa de Oliveira, Antonio; Fuks, Galina; Habara, Takuya; Haberer, Georg; Han, Bin; Harada, Erimi; Hiraki, Aiko T.; Hirochika, Hirohiko; Hoen, Douglas; Hokari, Hiroki; Hosokawa, Satomi; Hsing, Yue; Ikawa, Hiroshi; Ikeo, Kazuho; Imanishi, Tadashi; Ito, Yukiyo; Jaiswal, Pankaj; Kanno, Masako; Kawahara, Yoshihiro; Kawamura, Toshiyuki; Kawashima, Hiroaki; Khurana, Jitendra P.; Kikuchi, Shoshi; Komatsu, Setsuko; Koyanagi, Kanako O.; Kubooka, Hiromi; Lieberherr, Damien; Lin, Yao-Cheng; Lonsdale, David; Matsumoto, Takashi; Matsuya, Akihiro; McCombie, W. Richard; Messing, Joachim; Miyao, Akio; Mulder, Nicola; Nagamura, Yoshiaki; Nam, Jongmin; Namiki, Nobukazu; Numa, Hisataka; Nurimoto, Shin; O’Donovan, Claire; Ohyanagi, Hajime; Okido, Toshihisa; OOta, Satoshi; Osato, Naoki; Palmer, Lance E.; Quetier, Francis; Raghuvanshi, Saurabh; Saichi, Naomi; Sakai, Hiroaki; Sakai, Yasumichi; Sakata, Katsumi; Sakurai, Tetsuya; Sato, Fumihiko; Sato, Yoshiharu; Schoof, Heiko; Seki, Motoaki; Shibata, Michie; Shimizu, Yuji; Shinozaki, Kazuo; Shinso, Yuji; Singh, Nagendra K.; Smith-White, Brian; Takeda, Jun-ichi; Tanino, Motohiko; Tatusova, Tatiana; Thongjuea, Supat; Todokoro, Fusano; Tsugane, Mika; Tyagi, Akhilesh K.; Vanavichit, Apichart; Wang, Aihui; Wing, Rod A.; Yamaguchi, Kaori; Yamamoto, Mayu; Yamamoto, Naoyuki; Yu, Yeisoo; Zhang, Hao; Zhao, Qiang; Higo, Kenichi; Burr, Benjamin; Gojobori, Takashi; Sasaki, Takuji

    2007-01-01

    We present here the annotation of the complete genome of rice Oryza sativa L. ssp. japonica cultivar Nipponbare. All functional annotations for proteins and non-protein-coding RNA (npRNA) candidates were manually curated. Functions were identified or inferred in 19,969 (70%) of the proteins, and 131 possible npRNAs (including 58 antisense transcripts) were found. Almost 5000 annotated protein-coding genes were found to be disrupted in insertional mutant lines, which will accelerate future experimental validation of the annotations. The rice loci were determined by using cDNA sequences obtained from rice and other representative cereals. Our conservative estimate based on these loci and an extrapolation suggested that the gene number of rice is ∼32,000, which is smaller than previous estimates. We conducted comparative analyses between rice and Arabidopsis thaliana and found that both genomes possessed several lineage-specific genes, which might account for the observed differences between these species, while they had similar sets of predicted functional domains among the protein sequences. A system to control translational efficiency seems to be conserved across large evolutionary distances. Moreover, the evolutionary process of protein-coding genes was examined. Our results suggest that natural selection may have played a role for duplicated genes in both species, so that duplication was suppressed or favored in a manner that depended on the function of a gene. PMID:17210932

  19. Multi-label literature classification based on the Gene Ontology graph.

    PubMed

    Jin, Bo; Muller, Brian; Zhai, Chengxiang; Lu, Xinghua

    2008-12-08

    The Gene Ontology is a controlled vocabulary for representing knowledge related to genes and proteins in a computable form. The current effort of manually annotating proteins with the Gene Ontology is outpaced by the rate of accumulation of biomedical knowledge in literature, which urges the development of text mining approaches to facilitate the process by automatically extracting the Gene Ontology annotation from literature. The task is usually cast as a text classification problem, and contemporary methods are confronted with unbalanced training data and the difficulties associated with multi-label classification. In this research, we investigated the methods of enhancing automatic multi-label classification of biomedical literature by utilizing the structure of the Gene Ontology graph. We have studied three graph-based multi-label classification algorithms, including a novel stochastic algorithm and two top-down hierarchical classification methods for multi-label literature classification. We systematically evaluated and compared these graph-based classification algorithms to a conventional flat multi-label algorithm. The results indicate that, through utilizing the information from the structure of the Gene Ontology graph, the graph-based multi-label classification methods can significantly improve predictions of the Gene Ontology terms implied by the analyzed text. Furthermore, the graph-based multi-label classifiers are capable of suggesting Gene Ontology annotations (to curators) that are closely related to the true annotations even if they fail to predict the true ones directly. A software package implementing the studied algorithms is available for the research community. Through utilizing the information from the structure of the Gene Ontology graph, the graph-based multi-label classification methods have better potential than the conventional flat multi-label classification approach to facilitate protein annotation based on the literature.

  20. Genome-wide annotation of the soybean WRKY family and functional characterization of genes involved in response to Phakopsora pachyrhizi infection.

    PubMed

    Bencke-Malato, Marta; Cabreira, Caroline; Wiebke-Strohm, Beatriz; Bücker-Neto, Lauro; Mancini, Estefania; Osorio, Marina B; Homrich, Milena S; Turchetto-Zolet, Andreia Carina; De Carvalho, Mayra C C G; Stolf, Renata; Weber, Ricardo L M; Westergaard, Gastón; Castagnaro, Atílio P; Abdelnoor, Ricardo V; Marcelino-Guimarães, Francismar C; Margis-Pinheiro, Márcia; Bodanese-Zanettini, Maria Helena

    2014-09-10

    Many previous studies have shown that soybean WRKY transcription factors are involved in the plant response to biotic and abiotic stresses. Phakopsora pachyrhizi is the causal agent of Asian Soybean Rust, one of the most important soybean diseases. There is evidence that WRKYs are involved in the resistance of some soybean genotypes against that fungus. WRKY genes were underrepresented in the existing annotation of the soybean genome. In the present study, a genome-wide annotation of the soybean WRKY family was carried out and members involved in the response to P. pachyrhizi were identified. A search of soybean genomic databases led to the annotation of 182 WRKY-encoding genes and the identification of 33 putative pseudogenes. Genes involved in the response to P. pachyrhizi infection were identified using superSAGE, RNA-Seq of microdissected lesions and microarray experiments. Seventy-five genes were differentially expressed during fungal infection. The expression of eight WRKY genes was validated by RT-qPCR. The expression of these genes in a resistant genotype was earlier and/or stronger compared with a susceptible genotype in response to P. pachyrhizi infection. Soybean somatic embryos were transformed in order to overexpress or silence WRKY genes. Embryos overexpressing a WRKY gene were obtained, but they were unable to convert into plants. When infected with P. pachyrhizi, the leaves of the silenced transgenic line showed a higher number of lesions than the wild-type plants. The present study reports a genome-wide annotation of the soybean WRKY family. The participation of some members in response to P. pachyrhizi infection was demonstrated. The results contribute to the elucidation of gene function and suggest the manipulation of WRKYs as a strategy to increase fungal resistance in soybean plants.

  1. Metabolic Pathway Assignment of Plant Genes based on Phylogenetic Profiling–A Feasibility Study

    PubMed Central

    Weißenborn, Sandra; Walther, Dirk

    2017-01-01

    Despite many developed experimental and computational approaches, functional gene annotation remains challenging. With the rapidly growing number of sequenced genomes, the concept of phylogenetic profiling, which predicts functional links between genes that share a common co-occurrence pattern across different genomes, has gained renewed attention as it promises to annotate gene functions based on presence/absence calls alone. We applied phylogenetic profiling to the problem of metabolic pathway assignments of plant genes with a particular focus on secondary metabolism pathways. We determined phylogenetic profiles for 40,960 metabolic pathway enzyme genes with assigned EC numbers from 24 plant species based on sequence and pathway annotation data from KEGG and Ensembl Plants. For gene sequence family assignments, needed to determine the presence or absence of particular gene functions in the given plant species, we included data of all 39 species available at the Ensembl Plants database and established gene families based on pairwise sequence identities and annotation information. Aside from performing profiling comparisons, we used machine learning approaches to predict pathway associations from phylogenetic profiles alone. Selected metabolic pathways were indeed found to be composed of gene families of greater than expected phylogenetic profile similarity. This was particularly evident for primary metabolism pathways, whereas for secondary pathways, both the available annotation in different species as well as the abstraction of functional association via distinct pathways proved limiting. While phylogenetic profile similarity was generally not found to correlate with gene co-expression, direct physical interactions of proteins were reflected by a significantly increased profile similarity suggesting an application of phylogenetic profiling methods as a filtering step in the identification of protein-protein interactions. 
This feasibility study highlights the potential and challenges associated with phylogenetic profiling methods for the detection of functional relationships between genes as well as the need to enlarge the set of plant genes with proven secondary metabolism involvement as well as the limitations of distinct pathways as abstractions of relationships between genes. PMID:29163570
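
    Phylogenetic profiling, as described in the record above, compares presence/absence patterns of gene families across genomes. The following minimal sketch scores two profiles with the Jaccard index. The choice of Jaccard is an assumption made here for illustration, since the abstract does not name the similarity measure, and the family names and profiles are invented.

```python
def jaccard(profile_a, profile_b):
    """Jaccard similarity of two presence/absence profiles (0/1 per species)."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same species")
    pairs = list(zip(profile_a, profile_b))
    both = sum(1 for a, b in pairs if a and b)
    either = sum(1 for a, b in pairs if a or b)
    return both / either if either else 0.0


# Hypothetical profiles over six species: 1 = family present, 0 = absent.
profiles = {
    "family_1": [1, 1, 0, 1, 0, 1],
    "family_2": [1, 1, 0, 1, 0, 0],
    "family_3": [0, 0, 1, 0, 1, 0],
}

if __name__ == "__main__":
    print(jaccard(profiles["family_1"], profiles["family_2"]))  # 0.75
    print(jaccard(profiles["family_1"], profiles["family_3"]))  # 0.0
```

    Families that co-occur across species (family_1 and family_2 here) score high and would be candidates for a shared pathway, while families with complementary patterns score low.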

  2. The language of gene ontology: a Zipf's law analysis.

    PubMed

    Kalankesh, Leila Ranandeh; Stevens, Robert; Brass, Andy

    2012-06-07

    Most major genome projects and sequence databases provide a GO annotation of their data, either automatically or through human annotators, creating a large corpus of data written in the language of GO. Texts written in natural language show a statistical power law behaviour, Zipf's law, the exponent of which can provide useful information on the nature of the language being used. We have therefore explored the hypothesis that collections of GO annotations will show similar statistical behaviours to natural language. Annotations from the Gene Ontology Annotation project were found to follow Zipf's law. Surprisingly, the measured power law exponents were consistently different between annotation captured using the three GO sub-ontologies in the corpora (function, process and component). On filtering the corpora using GO evidence codes we found that the value of the measured power law exponent responded in a predictable way as a function of the evidence codes used to support the annotation. Techniques from computational linguistics can provide new insights into the annotation process. GO annotations show similar statistical behaviours to those seen in natural language with measured exponents that provide a signal which correlates with the nature of the evidence codes used to support the annotations, suggesting that the measured exponent might provide a signal regarding the information content of the annotation.
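
    The power-law exponent mentioned in the record above can be estimated in several ways. The sketch below uses an ordinary least-squares fit of log frequency against log rank, which is a simple estimator chosen here for illustration and not necessarily the one the authors used. The term identifiers are placeholders, not real GO terms.

```python
from collections import Counter
from math import log


def zipf_exponent(tokens):
    """Estimate the Zipf exponent as minus the slope of log(freq) vs log(rank)."""
    counts = sorted(Counter(tokens).values(), reverse=True)
    if len(counts) < 2:
        raise ValueError("need at least two distinct tokens")
    xs = [log(rank) for rank in range(1, len(counts) + 1)]
    ys = [log(count) for count in counts]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return -slope


# Hypothetical annotation corpus whose frequencies are exactly 12 / rank.
corpus = ["term_a"] * 12 + ["term_b"] * 6 + ["term_c"] * 4 + ["term_d"] * 3

if __name__ == "__main__":
    print(round(zipf_exponent(corpus), 6))  # 1.0
```

    Applied separately to annotations grouped by sub-ontology or by evidence code, such an estimate would allow the kind of exponent comparison the abstract describes.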

  3. PathFinder: reconstruction and dynamic visualization of metabolic pathways.

    PubMed

    Goesmann, Alexander; Haubrock, Martin; Meyer, Folker; Kalinowski, Jörn; Giegerich, Robert

    2002-01-01

    Beyond methods for a gene-wise annotation and analysis of sequenced genomes, new automated methods for functional analysis on a higher level are needed. The identification of realized metabolic pathways provides valuable information on gene expression and regulation. Detection of incomplete pathways helps to improve a constantly evolving genome annotation or discover alternative biochemical pathways. To utilize automated genome analysis on the level of metabolic pathways, new methods for the dynamic representation and visualization of pathways are needed. PathFinder is a tool for the dynamic visualization of metabolic pathways based on annotation data. Pathways are represented as directed acyclic graphs; graph layout algorithms accomplish the dynamic drawing and visualization of the metabolic maps. A more detailed analysis of the input data on the level of biochemical pathways helps to identify genes and detect improper parts of annotations. As a Relational Database Management System (RDBMS) based internet application, PathFinder reads a list of EC-numbers or a given annotation in EMBL- or Genbank-format and dynamically generates pathway graphs.

  4. Large-scale identification and characterization of alternative splicing variants of human gene transcripts using 56 419 completely sequenced and manually annotated full-length cDNAs

    PubMed Central

    Takeda, Jun-ichi; Suzuki, Yutaka; Nakao, Mitsuteru; Barrero, Roberto A.; Koyanagi, Kanako O.; Jin, Lihua; Motono, Chie; Hata, Hiroko; Isogai, Takao; Nagai, Keiichi; Otsuki, Tetsuji; Kuryshev, Vladimir; Shionyu, Masafumi; Yura, Kei; Go, Mitiko; Thierry-Mieg, Jean; Thierry-Mieg, Danielle; Wiemann, Stefan; Nomura, Nobuo; Sugano, Sumio; Gojobori, Takashi; Imanishi, Tadashi

    2006-01-01

    We report the first genome-wide identification and characterization of alternative splicing in human gene transcripts based on analysis of the full-length cDNAs. Applying both manual and computational analyses for 56 419 completely sequenced and precisely annotated full-length cDNAs selected for the H-Invitational human transcriptome annotation meetings, we identified 6877 alternative splicing genes with 18 297 different alternative splicing variants. A total of 37 670 exons were involved in these alternative splicing events. The encoded protein sequences were affected in 6005 of the 6877 genes. Notably, alternative splicing affected protein motifs in 3015 genes, subcellular localizations in 2982 genes and transmembrane domains in 1348 genes. We also identified interesting patterns of alternative splicing, in which two distinct genes seemed to be bridged, nested or having overlapping protein coding sequences (CDSs) of different reading frames (multiple CDS). In these cases, completely unrelated proteins are encoded by a single locus. Genome-wide annotations of alternative splicing, relying on full-length cDNAs, should lay firm groundwork for exploring in detail the diversification of protein function, which is mediated by the fast expanding universe of alternative splicing variants. PMID:16914452

  5. A transversal approach to predict gene product networks from ontology-based similarity

    PubMed Central

    Chabalier, Julie; Mosser, Jean; Burgun, Anita

    2007-01-01

    Background Interpretation of transcriptomic data is usually made through a "standard" approach which consists in clustering the genes according to their expression patterns and exploiting Gene Ontology (GO) annotations within each expression cluster. This approach makes it difficult to underline functional relationships between gene products that belong to different expression clusters. To address this issue, we propose a transversal analysis that aims to predict functional networks based on a combination of GO processes and data expression. Results The transversal approach presented in this paper consists in computing the semantic similarity between gene products in a Vector Space Model. Through a weighting scheme over the annotations, we take into account the representativity of the terms that annotate a gene product. Comparing annotation vectors results in a matrix of gene product similarities. Combined with expression data, the matrix is displayed as a set of functional gene networks. The transversal approach was applied to 186 genes related to the enterocyte differentiation stages. This approach resulted in 18 functional networks proved to be biologically relevant. These results were compared with those obtained through a standard approach and with an approach based on information content similarity. Conclusion Complementary to the standard approach, the transversal approach offers new insight into the cellular mechanisms and reveals new research hypotheses by combining gene product networks based on semantic similarity, and data expression. PMID:17605807
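
    The vector space comparison described in the record above can be sketched as follows. The sketch assumes an inverse-document-frequency style weight as a stand-in for the authors' weighting scheme, which the abstract does not specify, and uses cosine similarity between annotation vectors. Gene product names and term labels are hypothetical.

```python
from math import log, sqrt

# Hypothetical gene products and their annotation terms.
annotations = {
    "gp1": {"t_lipid", "t_transport"},
    "gp2": {"t_lipid", "t_transport", "t_binding"},
    "gp3": {"t_cycle", "t_binding"},
}


def term_weights(annotations):
    """Weight each term so that rarer terms count more (IDF-like, assumed)."""
    n = len(annotations)
    terms = set().union(*annotations.values())
    return {
        t: log(n / sum(1 for a in annotations.values() if t in a)) + 1.0
        for t in terms
    }


def cosine(terms_a, terms_b, weights):
    """Cosine similarity between two weighted annotation vectors."""
    dot = sum(weights[t] ** 2 for t in terms_a & terms_b)
    norm_a = sqrt(sum(weights[t] ** 2 for t in terms_a))
    norm_b = sqrt(sum(weights[t] ** 2 for t in terms_b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity_matrix(annotations):
    """Pairwise similarities as a nested dict: matrix[a][b]."""
    weights = term_weights(annotations)
    return {
        a: {b: cosine(ta, tb, weights) for b, tb in annotations.items()}
        for a, ta in annotations.items()
    }


if __name__ == "__main__":
    for name, row in similarity_matrix(annotations).items():
        print(name, {k: round(v, 3) for k, v in row.items()})
```

    Thresholding such a matrix, and combining it with expression data, is one way to arrive at the functional networks the abstract mentions; that combination step is not shown here.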

  6. Common data model for natural language processing based on two existing standard information models: CDA+GrAF.

    PubMed

    Meystre, Stéphane M; Lee, Sanghoon; Jung, Chai Young; Chevrier, Raphaël D

    2012-08-01

    An increasing need for collaboration and resources sharing in the Natural Language Processing (NLP) research and development community motivates efforts to create and share a common data model and a common terminology for all information annotated and extracted from clinical text. We have combined two existing standards: the HL7 Clinical Document Architecture (CDA), and the ISO Graph Annotation Format (GrAF; in development), to develop such a data model entitled "CDA+GrAF". We experimented with several methods to combine these existing standards, and eventually selected a method wrapping separate CDA and GrAF parts in a common standoff annotation (i.e., separate from the annotated text) XML document. Two use cases, clinical document sections, and the 2010 i2b2/VA NLP Challenge (i.e., problems, tests, and treatments, with their assertions and relations), were used to create examples of such standoff annotation documents, and were successfully validated with the XML schemata provided with both standards. We developed a tool to automatically translate annotation documents from the 2010 i2b2/VA NLP Challenge format to GrAF, and automatically generated 50 annotation documents using this tool, all successfully validated. Finally, we adapted the XSL stylesheet provided with HL7 CDA to allow viewing annotation XML documents in a web browser, and plan to adapt existing tools for translating annotation documents between CDA+GrAF and the UIMA and GATE frameworks. This common data model may ease directly comparing NLP tools and applications, combining their output, transforming and "translating" annotations between different NLP applications, and eventually "plug-and-play" of different modules in NLP applications. Copyright © 2011 Elsevier Inc. All rights reserved.

  7. WordCluster: detecting clusters of DNA words and genomic elements

    PubMed Central

    2011-01-01

    Background Many k-mers (or DNA words) and genomic elements are known to be spatially clustered in the genome. Well-established examples are the genes, TFBSs, CpG dinucleotides, microRNA genes and ultra-conserved non-coding regions. Currently, no algorithm exists to find these clusters in a statistically comprehensible way. The detection of clustering often relies on densities and sliding-window approaches or arbitrarily chosen distance thresholds. Results We introduce here an algorithm to detect clusters of DNA words (k-mers), or any other genomic element, based on the distance between consecutive copies and an assigned statistical significance. We implemented the method in a web server connected to a MySQL backend, which also determines the co-localization with gene annotations. We demonstrate the usefulness of this approach by detecting the clusters of CAG/CTG (cytosine contexts that can be methylated in undifferentiated cells), showing that the degree of methylation varies drastically between inside and outside of the clusters. As another example, we used WordCluster to search for statistically significant clusters of olfactory receptor (OR) genes in the human genome. Conclusions WordCluster seems to predict biologically meaningful clusters of DNA words (k-mers) and genomic entities. The implementation of the method into a web server is available at http://bioinfo2.ugr.es/wordCluster/wordCluster.php including additional features like the detection of co-localization with gene regions or the annotation enrichment tool for functional analysis of overlapped genes. PMID:21261981
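
    The record above builds clusters from the distances between consecutive copies of a word. The sketch below shows only that first ingredient: it locates the copies of a word and chains neighbors whose gap stays within a limit. The fixed `max_gap` is a placeholder assumption. WordCluster itself assigns a statistical significance rather than relying on an arbitrary threshold, and that step is not reproduced here. The function names and the test sequence are invented.

```python
def word_positions(sequence, word):
    """Start positions of all (possibly overlapping) copies of word."""
    positions = []
    start = sequence.find(word)
    while start != -1:
        positions.append(start)
        start = sequence.find(word, start + 1)
    return positions


def chain_clusters(positions, max_gap, min_copies=2):
    """Group sorted positions into runs whose consecutive gaps are <= max_gap.

    max_gap is a placeholder; a real method would derive it from a
    statistical model of the expected distances.
    """
    clusters, current = [], []
    for pos in positions:
        if current and pos - current[-1] > max_gap:
            if len(current) >= min_copies:
                clusters.append(current)
            current = []
        current.append(pos)
    if len(current) >= min_copies:
        clusters.append(current)
    return clusters


# Toy sequence: two dense groups of CAG and one isolated copy.
toy = "CAGCAGCAG" + "T" * 20 + "CAGACAG" + "T" * 20 + "CAG"

if __name__ == "__main__":
    hits = word_positions(toy, "CAG")
    print(hits)                             # [0, 3, 6, 29, 33, 56]
    print(chain_clusters(hits, max_gap=5))  # [[0, 3, 6], [29, 33]]
```

    The isolated copy at position 56 is dropped because it has no neighbor within the gap limit.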

  8. Prior knowledge based mining functional modules from Yeast PPI networks with gene ontology

    PubMed Central

    2010-01-01

    Background In the literature, there are many fruitful algorithmic approaches for identifying functional modules in protein-protein interaction (PPI) networks. Because large-scale interaction data are accumulating for multiple organisms, and because existing PPI databases leave some interactions unrecorded, there is still an urgent need for novel computational techniques that can analyze interaction data sets correctly and scalably. Indeed, there are a number of large-scale biological data sets providing indirect evidence for protein-protein interaction relationships. Results The main aim of this paper is to present a prior knowledge based mining strategy to identify functional modules from PPI networks with the aid of Gene Ontology. A higher similarity value in Gene Ontology means that two gene products are more functionally related to each other, so it is better to group such gene products into one functional module. We study two approaches: (i) encoding the functional pairs into the existing PPI networks; and (ii) using these functional pairs as pairwise constraints to supervise the existing functional module identification algorithms. A topology-based modularity metric and complex annotation in MIPS are used to evaluate the functional modules identified by these two approaches. Conclusions The experimental results on Yeast PPI networks and GO have shown that the prior knowledge based learning methods perform better than the existing algorithms. PMID:21172053

  9. MalaCards: an integrated compendium for diseases and their annotation

    PubMed Central

    Rappaport, Noa; Nativ, Noam; Stelzer, Gil; Twik, Michal; Guan-Golan, Yaron; Iny Stein, Tsippi; Bahir, Iris; Belinky, Frida; Morrey, C. Paul; Safran, Marilyn; Lancet, Doron

    2013-01-01

    Comprehensive disease classification, integration and annotation are crucial for biomedical discovery. At present, disease compilation is incomplete, heterogeneous and often lacking systematic inquiry mechanisms. We introduce MalaCards, an integrated database of human maladies and their annotations, modeled on the architecture and strategy of the GeneCards database of human genes. MalaCards mines and merges 44 data sources to generate a computerized card for each of 16 919 human diseases. Each MalaCard contains disease-specific prioritized annotations, as well as inter-disease connections, empowered by the GeneCards relational database, its searches and GeneDecks set analyses. First, we generate a disease list from 15 ranked sources, using disease-name unification heuristics. Next, we use four schemes to populate MalaCards sections: (i) directly interrogating disease resources, to establish integrated disease names, synonyms, summaries, drugs/therapeutics, clinical features, genetic tests and anatomical context; (ii) searching GeneCards for related publications, and for associated genes with corresponding relevance scores; (iii) analyzing disease-associated gene sets in GeneDecks to yield affiliated pathways, phenotypes, compounds and GO terms, sorted by a composite relevance score and presented with GeneCards links; and (iv) searching within MalaCards itself, e.g. for additional related diseases and anatomical context. The latter forms the basis for the construction of a disease network, based on shared MalaCards annotations, embodying associations based on etiology, clinical features and clinical conditions. This broadly disposed network has a power-law degree distribution, suggesting that this might be an inherent property of such networks. Work in progress includes hierarchical malady classification, ontological mapping and disease set analyses, striving to make MalaCards an even more effective tool for biomedical research. 
Database URL: http://www.malacards.org/ PMID:23584832

  10. Xander: employing a novel method for efficient gene-targeted metagenomic assembly.

    PubMed

    Wang, Qiong; Fish, Jordan A; Gilman, Mariah; Sun, Yanni; Brown, C Titus; Tiedje, James M; Cole, James R

    2015-01-01

    Metagenomics can provide important insight into microbial communities. However, assembling metagenomic datasets has proven to be computationally challenging. Current methods often assemble only fragmented partial genes. We present a novel method for targeting assembly of specific protein-coding genes. This method combines a de Bruijn graph, as used in standard assembly approaches, and a protein profile hidden Markov model (HMM) for the gene of interest, as used in standard annotation approaches. These are used to create a novel combined weighted assembly graph. Xander performs both assembly and annotation concomitantly using information incorporated in this graph. We demonstrate the utility of this approach by assembling contigs for one phylogenetic marker gene and for two functional marker genes, first on Human Microbiome Project (HMP)-defined community Illumina data and then on 21 rhizosphere soil metagenomic datasets from three different crops totaling over 800 Gbp of unassembled data. We compared our method to a recently published bulk metagenome assembly method and a recently published gene-targeted assembler and found our method produced more, longer, and higher quality gene sequences. Xander combines gene assignment with the rapid assembly of full-length or near full-length functional genes from metagenomic data without requiring bulk assembly or post-processing to find genes of interest. HMMs used for assembly can be tailored to the targeted genes, allowing flexibility to improve annotation over generic annotation pipelines. This method is implemented as open source software and is available at https://github.com/rdpstaff/Xander_assembler.
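
    The de Bruijn graph half of the approach described above can be illustrated with a minimal sketch. It assumes error-free toy reads and leaves out the profile-HMM weighting that Xander adds; the function names `build_de_bruijn` and `walk_unambiguous` are hypothetical and are not taken from the Xander code base.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


def walk_unambiguous(graph, start):
    """Extend a contig from `start` while the current node has exactly one successor."""
    contig, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:  # stop on cycles instead of looping forever
            break
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig


reads = ["ATGGCG", "GGCGTA"]  # two overlapping toy reads
graph = build_de_bruijn(reads, k=4)
contig = walk_unambiguous(graph, "ATG")
print(contig)
```

    A gene-targeted assembler would replace the "exactly one successor" rule with a search guided by model scores; that part is not attempted here.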

  11. ExpTreeDB: web-based query and visualization of manually annotated gene expression profiling experiments of human and mouse from GEO.

    PubMed

    Ni, Ming; Ye, Fuqiang; Zhu, Juanjuan; Li, Zongwei; Yang, Shuai; Yang, Bite; Han, Lu; Wu, Yongge; Chen, Ying; Li, Fei; Wang, Shengqi; Bo, Xiaochen

    2014-12-01

    Numerous public microarray datasets are valuable resources for the scientific community. Several online tools have made great strides in using these data by querying related datasets with users' own gene signatures or expression profiles. However, dataset annotation and result presentation still need to be improved. ExpTreeDB is a database that allows for queries on human and mouse microarray experiments from Gene Expression Omnibus with gene signatures or profiles. Compared with similar applications, ExpTreeDB pays more attention to dataset annotations and result visualization. We introduced a multiple-level annotation system to depict and organize original experiments. For example, a tamoxifen-treated cell line experiment is hierarchically annotated as 'agent→drug→estrogen receptor antagonist→tamoxifen'. Consequently, retrieved results are displayed as interactive tree-structured graphics, which provide an overview of related experiments and may direct users to key items of interest. The database is freely available at http://biotech.bmi.ac.cn/ExpTreeDB. The web site is implemented in Perl, PHP, R, MySQL and Apache. © The Author 2014. Published by Oxford University Press.

  12. Global Identification and Characterization of Transcriptionally Active Regions in the Rice Genome

    PubMed Central

    Stolc, Viktor; Deng, Wei; He, Hang; Korbel, Jan; Chen, Xuewei; Tongprasit, Waraporn; Ronald, Pamela; Chen, Runsheng; Gerstein, Mark; Wang Deng, Xing

    2007-01-01

    Genome tiling microarray studies have consistently documented rich transcriptional activity beyond the annotated genes. However, systematic characterization and transcriptional profiling of the putative novel transcripts on the genome scale are still lacking. We report here the identification of 25,352 and 27,744 transcriptionally active regions (TARs) not encoded by annotated exons in the rice (Oryza sativa) subspecies japonica and indica, respectively. The non-exonic TARs account for approximately two thirds of the total TARs detected by tiling arrays and represent transcripts likely conserved between japonica and indica. Transcription of 21,018 (83%) japonica non-exonic TARs was verified through expression profiling in 10 tissue types using a re-array in which annotated genes and TARs were each represented by five independent probes. Subsequent analyses indicate that about 80% of the japonica TARs that were not assigned to annotated exons can be assigned to various putatively functional or structural elements of the rice genome, including splice variants, uncharacterized portions of incompletely annotated genes, antisense transcripts, duplicated gene fragments, and potential non-coding RNAs. These results provide a systematic characterization of non-exonic transcripts in rice and thus expand the current view of the complexity and dynamics of the rice transcriptome. PMID:17372628

  13. Genome-wide transcription start site profiling in biofilm-grown Burkholderia cenocepacia J2315.

    PubMed

    Sass, Andrea M; Van Acker, Heleen; Förstner, Konrad U; Van Nieuwerburgh, Filip; Deforce, Dieter; Vogel, Jörg; Coenye, Tom

    2015-10-13

    Burkholderia cenocepacia is a soil-dwelling Gram-negative Betaproteobacterium with an important role as an opportunistic pathogen in humans. Infections with B. cenocepacia are very difficult to treat because of the organism's high intrinsic resistance to most antibiotics. Biofilm formation further adds to this antibiotic resistance. B. cenocepacia harbours a large, multi-replicon genome with a high GC-content; the reference genome of strain J2315 includes 7374 annotated genes. This study aims to annotate transcription start sites and identify novel transcripts on a whole genome scale. RNA extracted from B. cenocepacia J2315 biofilms was analysed by differential RNA-sequencing and the resulting dataset compared to data derived from conventional, global RNA-sequencing. Transcription start sites were annotated and further analysed according to their position relative to annotated genes. A total of 4010 transcription start sites were mapped over the whole B. cenocepacia genome, and the primary transcription start sites of 2089 genes expressed in B. cenocepacia biofilms were defined. For 64 genes a start codon alternative to the annotated one was proposed. Substantial antisense transcription for 105 genes and two novel protein coding sequences were identified. The distribution of internal transcription start sites can be used to identify genomic islands in B. cenocepacia. A potassium pump strongly induced only under biofilm conditions was found and 15 non-coding small RNAs highly expressed in biofilms were discovered. Mapping transcription start sites across the B. cenocepacia genome added relevant information to the J2315 annotation. Genes and novel regulatory RNAs putatively involved in B. cenocepacia biofilm formation were identified. These findings will help in understanding regulation of B. cenocepacia biofilm formation.

  14. Exogean: a framework for annotating protein-coding genes in eukaryotic genomic DNA

    PubMed Central

    Djebali, Sarah; Delaplace, Franck; Crollius, Hugues Roest

    2006-01-01

    Background: Accurate and automatic gene identification in eukaryotic genomic DNA is more than ever of crucial importance to efficiently exploit the large volume of assembled genome sequences available to the community. Automatic methods have always been considered less reliable than human expertise. This is illustrated in the EGASP project, where reference annotations against which all automatic methods are measured are generated by human annotators and experimentally verified. We hypothesized that replicating the accuracy of human annotators in an automatic method could be achieved by formalizing the rules and decisions that they use, in a mathematical formalism. Results: We have developed Exogean, a flexible framework based on directed acyclic colored multigraphs (DACMs) that can represent biological objects (for example, mRNA, ESTs, protein alignments, exons) and relationships between them. Graphs are analyzed to process the information according to rules that replicate those used by human annotators. Simple individual starting objects given as input to Exogean are thus combined and synthesized into complex objects such as protein coding transcripts. Conclusion: We show here, in the context of the EGASP project, that Exogean is currently the method that best reproduces protein coding gene annotations from human experts, in terms of identifying at least one exact coding sequence per gene. We discuss current limitations of the method and several avenues for improvement. PMID:16925841

  15. eRAM: encyclopedia of rare disease annotations for precision medicine.

    PubMed

    Jia, Jinmeng; An, Zhongxin; Ming, Yue; Guo, Yongli; Li, Wei; Liang, Yunxiang; Guo, Dongming; Li, Xin; Tai, Jun; Chen, Geng; Jin, Yaqiong; Liu, Zhimei; Ni, Xin; Shi, Tieliu

    2018-01-04

    Rare diseases affect over a hundred million people worldwide, and most of these patients are not accurately diagnosed or effectively treated. The limited knowledge of rare diseases forms the biggest obstacle to improving their treatment. Detailed clinical phenotyping is considered a keystone of deciphering genes and realizing precision medicine for rare diseases. Here, we present a standardized system for various types of rare diseases, called encyclopedia of Rare disease Annotations for Precision Medicine (eRAM). eRAM was built by text-mining nearly 10 million scientific publications and electronic medical records, and integrating various data in existing recognized databases (such as Unified Medical Language System (UMLS), Human Phenotype Ontology, Orphanet, OMIM, GWAS). eRAM systematically incorporates currently available data on clinical manifestations and molecular mechanisms of rare diseases and uncovers many novel associations among diseases. eRAM provides enriched annotations for 15 942 rare diseases, yielding 6147 human disease related phenotype terms, 31 661 mammalian phenotype terms, 10 202 symptoms from UMLS, 18 815 genes and 92 580 genotypes. eRAM can not only provide information about rare disease mechanisms but also help clinicians make accurate diagnostic and therapeutic decisions for rare diseases. eRAM can be freely accessed at http://www.unimd.org/eram/. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.

  16. Using Comparative Genomics for Inquiry-Based Learning to Dissect Virulence of Escherichia coli O157:H7 and Yersinia pestis

    PubMed Central

    Baumler, David J.; Banta, Lois M.; Hung, Kai F.; Schwarz, Jodi A.; Cabot, Eric L.; Glasner, Jeremy D.; Perna, Nicole T.

    2012-01-01

    Genomics and bioinformatics are topics of increasing interest in undergraduate biological science curricula. Many existing exercises focus on gene annotation and analysis of a single genome. In this paper, we present two educational modules designed to enable students to learn and apply fundamental concepts in comparative genomics using examples related to bacterial pathogenesis. Students first examine alignments of genomes of Escherichia coli O157:H7 strains isolated from three food-poisoning outbreaks using the multiple-genome alignment tool Mauve. Students investigate conservation of virulence factors using the Mauve viewer and by browsing annotations available in the ASAP (A Systematic Annotation Package for Community Analysis of Genomes) database. In the second module, students use an alignment of five Yersinia pestis genomes to analyze single-nucleotide polymorphisms of three genes to classify strains into biovar groups. Students are then given sequences of bacterial DNA amplified from the teeth of corpses from the first and second pandemics of the bubonic plague and asked to classify these new samples. Learning-assessment results reveal student improvement in self-efficacy and content knowledge, as well as students' ability to use BLAST to identify genomic islands and conduct analyses of virulence factors from E. coli O157:H7 or Y. pestis. Each of these educational modules offers educators new ready-to-implement resources for integrating comparative genomic topics into their curricula. PMID:22383620

  17. Initial description of primate-specific cystine-knot Prometheus genes and differential gene expansions of D-dopachrome tautomerase genes

    PubMed Central

    Premzl, Marko

    2015-01-01

    Using eutherian comparative genomic analysis protocol and public genomic sequence data sets, the present work attempted to update and revise two gene data sets. The most comprehensive third party annotation gene data sets of eutherian adenohypophysis cystine-knot genes (128 complete coding sequences), and d-dopachrome tautomerases and macrophage migration inhibitory factor genes (30 complete coding sequences) were annotated. For example, the present study first described primate-specific cystine-knot Prometheus genes, as well as differential gene expansions of D-dopachrome tautomerase genes. Furthermore, new frameworks of future experiments of two eutherian gene data sets were proposed. PMID:25941635

  18. BG7: A New Approach for Bacterial Genome Annotation Designed for Next Generation Sequencing Data

    PubMed Central

    Pareja-Tobes, Pablo; Manrique, Marina; Pareja-Tobes, Eduardo; Pareja, Eduardo; Tobes, Raquel

    2012-01-01

    BG7 is a new system for de novo bacterial, archaeal and viral genome annotation based on a new approach specifically designed for annotating genomes sequenced with next generation sequencing technologies. The system is versatile and able to annotate genes even at the preliminary assembly stage of the genome. It is especially efficient at detecting unexpected genes horizontally acquired from distant bacterial or archaeal genomes, phages, plasmids, and mobile elements. From the initial phases of the gene annotation process, BG7 exploits the massive availability of annotated protein sequences in databases. BG7 predicts ORFs and infers their function based on protein similarity with a wide set of reference proteins, integrating the ORF prediction and functional annotation phases in a single step. BG7 is especially tolerant to sequencing errors in start and stop codons, to frameshifts, and to assembly or scaffolding errors. The system is also tolerant to the high level of gene fragmentation which is frequently found in incompletely assembled genomes. The current version of BG7, which is developed in Java, takes advantage of Amazon Web Services (AWS) cloud computing features, but it can also be run locally on any operating system. BG7 is a fast, automated and scalable system that can cope with the challenge of analyzing the huge number of genomes that are being sequenced with NGS technologies. Its capabilities and efficiency were demonstrated in the 2011 EHEC Germany outbreak, in which BG7 was used to produce the first annotations the day after the first entero-hemorrhagic E. coli genome sequences were made publicly available. The suitability of BG7 for genome annotation has been demonstrated for Illumina, 454, Ion Torrent, and PacBio sequencing technologies. Moreover, thanks to its plasticity, our system could be very easily adapted to work with new technologies in the future. PMID:23185310

  19. Challenges in Whole-Genome Annotation of Pyrosequenced Eukaryotic Genomes

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Kuo, Alan; Grigoriev, Igor

    2009-04-17

    Pyrosequencing technologies such as 454/Roche and Solexa/Illumina vastly lower the cost of nucleotide sequencing compared to the traditional Sanger method, and thus promise to greatly expand the number of sequenced eukaryotic genomes. However, the new technologies also bring new challenges such as shorter reads and new kinds and higher rates of sequencing errors, which complicate genome assembly and gene prediction. At JGI we are deploying 454 technology for the sequencing and assembly of ever-larger eukaryotic genomes. Here we describe our first whole-genome annotation of a purely 454-sequenced fungal genome that is larger than a yeast (>30 Mbp). The pezizomycotine (filamentous ascomycote) Aspergillus carbonarius belongs to the Aspergillus section Nigri species complex, members of which are significant as platforms for bioenergy and bioindustrial technology, as members of soil microbial communities and players in the global carbon cycle, and as agricultural toxigens. Application of a modified version of the standard JGI Annotation Pipeline has so far predicted ~10k genes. About 12% of these preliminary annotations suffer a potential frameshift error, which is somewhat higher than the ~9% rate in the Sanger-sequenced and conventionally assembled and annotated genome of fellow Aspergillus section Nigri member A. niger. Also, >90% of A. niger genes have potential homologs in the A. carbonarius preliminary annotation. We conclude, and with further annotation and comparative analysis expect to confirm, that 454 sequencing strategies provide a promising substrate for annotation of modestly sized eukaryotic genomes. We will also present results of annotation of a number of other pyrosequenced fungal genomes of bioenergy interest.

  20. Evolutionary interrogation of human biology in well-annotated genomic framework of rhesus macaque.

    PubMed

    Zhang, Shi-Jian; Liu, Chu-Jun; Yu, Peng; Zhong, Xiaoming; Chen, Jia-Yu; Yang, Xinzhuang; Peng, Jiguang; Yan, Shouyu; Wang, Chenqu; Zhu, Xiaotong; Xiong, Jingwei; Zhang, Yong E; Tan, Bertrand Chin-Ming; Li, Chuan-Yun

    2014-05-01

    With genome sequence and composition highly analogous to human, rhesus macaque represents a unique reference for evolutionary studies of human biology. Here, we developed a comprehensive genomic framework of rhesus macaque, the RhesusBase2, for evolutionary interrogation of human genes and the associated regulations. A total of 1,667 next-generation sequencing (NGS) data sets were processed, integrated, and evaluated, generating 51.2 million new functional annotation records. With extensive NGS annotations, RhesusBase2 refined the fine-scale structures in 30% of the macaque Ensembl transcripts, reporting an accurate, up-to-date set of macaque gene models. On the basis of these annotations and accurate macaque gene models, we further developed an NGS-oriented Molecular Evolution Gateway to access and visualize macaque annotations in reference to human orthologous genes and associated regulations (www.rhesusbase.org/molEvo). We highlighted the application of this well-annotated genomic framework in generating hypothetical link of human-biased regulations to human-specific traits, by using mechanistic characterization of the DIEXF gene as an example that provides novel clues to the understanding of digestive system reduction in human evolution. On a global scale, we also identified a catalog of 9,295 human-biased regulatory events, which may represent novel elements that have a substantial impact on shaping human transcriptome and possibly underpin recent human phenotypic evolution. Taken together, we provide an NGS data-driven, information-rich framework that will broadly benefit genomics research in general and serves as an important resource for in-depth evolutionary studies of human biology.

  1. Characterizing gene sets using discriminative random walks with restart on heterogeneous biological networks.

    PubMed

    Blatti, Charles; Sinha, Saurabh

    2016-07-15

    Analysis of co-expressed gene sets typically involves testing for enrichment of different annotations or 'properties' such as biological processes, pathways, transcription factor binding sites, etc., one property at a time. This common approach ignores any known relationships among the properties or the genes themselves. It is believed that known biological relationships among genes and their many properties may be exploited to more accurately reveal commonalities of a gene set. Previous work has sought to achieve this by building biological networks that combine multiple types of gene-gene or gene-property relationships, and performing network analysis to identify other genes and properties most relevant to a given gene set. Most existing network-based approaches for recognizing genes or annotations relevant to a given gene set collapse information about different properties to simplify (homogenize) the networks. We present a network-based method for ranking genes or properties related to a given gene set. Such related genes or properties are identified from among the nodes of a large, heterogeneous network of biological information. Our method involves a random walk with restarts, performed on an initial network with multiple node and edge types that preserve more of the original, specific property information than current methods that operate on homogeneous networks. In this first stage of our algorithm, we find the properties that are the most relevant to the given gene set and extract a subnetwork of the original network, comprising only these relevant properties. We then re-rank genes by their similarity to the given gene set, based on a second random walk with restarts, performed on the above subnetwork. We demonstrate the effectiveness of this algorithm for ranking genes related to Drosophila embryonic development and aggressive responses in the brains of social animals. DRaWR was implemented as an R package available at veda.cs.illinois.edu/DRaWR. Contact: blatti@illinois.edu. Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press.
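
    A single random walk with restart, the building block used in both stages described above, can be sketched as follows. This is a generic power-iteration illustration on an unweighted, undirected toy network; it is not the DRaWR R implementation, and the function name, the restart probability of 0.3 and the toy node labels are all assumptions made for the example.

```python
def random_walk_with_restart(adj, seeds, restart=0.3, tol=1e-10, max_iter=1000):
    """Return stationary visiting probabilities for a walk that restarts at `seeds`.

    `adj` maps every node to a list of neighbours; the sketch assumes each node
    has at least one neighbour so that no probability mass is lost.
    """
    nodes = sorted(adj)
    p0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {n: restart * p0[n] for n in nodes}  # jump back to the query set
        for u in nodes:
            share = (1.0 - restart) * p[u] / len(adj[u])
            for v in adj[u]:  # otherwise step to a uniformly chosen neighbour
                nxt[v] += share
        delta = sum(abs(nxt[n] - p[n]) for n in nodes)
        p = nxt
        if delta < tol:
            break
    return p


# Toy heterogeneous network: gene nodes (g1..g3) and property nodes (GO:a, GO:b).
adj = {
    "g1": ["GO:a"],
    "g2": ["GO:a", "GO:b"],
    "g3": ["GO:b"],
    "GO:a": ["g1", "g2"],
    "GO:b": ["g2", "g3"],
}
scores = random_walk_with_restart(adj, seeds={"g1"})
for node, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(node, round(score, 4))
```

    In a two-stage scheme like the one in the abstract, the top-scoring property nodes from a first walk would define a subnetwork, and a second walk on that subnetwork would produce the final gene ranking.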

  2. The what, where, how and why of gene ontology—a primer for bioinformaticians

    PubMed Central

    du Plessis, Louis; Škunca, Nives

    2011-01-01

    With high-throughput technologies providing vast amounts of data, it has become more important to provide systematic, quality annotations. The Gene Ontology (GO) project is the largest resource for cataloguing gene function. Nonetheless, its use is not yet ubiquitous and is still fraught with pitfalls. In this review, we provide a short primer to the GO for bioinformaticians. We summarize important aspects of the structure of the ontology, describe sources and types of functional annotations, survey measures of GO annotation similarity, review typical uses of GO and discuss other important considerations pertaining to the use of GO in bioinformatics applications. PMID:21330331

  3. EST Express: PHP/MySQL based automated annotation of ESTs from expression libraries

    PubMed Central

    Smith, Robin P; Buchser, William J; Lemmon, Marcus B; Pardinas, Jose R; Bixby, John L; Lemmon, Vance P

    2008-01-01

    Background: Several biological techniques result in the acquisition of functional sets of cDNAs that must be sequenced and analyzed. The emergence of redundant databases such as UniGene and centralized annotation engines such as Entrez Gene has allowed the development of software that can analyze a great number of sequences in a matter of seconds. Results: We have developed "EST Express", a suite of analytical tools that identify and annotate ESTs originating from specific mRNA populations. The software consists of a user-friendly GUI powered by PHP and MySQL that allows for online collaboration between researchers and continuity with UniGene, Entrez Gene and RefSeq. Two key features of the software include a novel, simplified Entrez Gene parser and tools to manage cDNA library sequencing projects. We have tested the software on a large data set (2,016 samples) produced by subtractive hybridization. Conclusion: EST Express is an open-source, cross-platform web server application that imports sequences from cDNA libraries, such as those generated through subtractive hybridization or yeast two-hybrid screens. It then provides several layers of annotation based on Entrez Gene and RefSeq to allow the user to highlight useful genes and manage cDNA library projects. PMID:18402700

  4. Sequencing, annotation and comparative analysis of nine BACs of giant panda (Ailuropoda melanoleuca).

    PubMed

    Zheng, Yang; Cai, Jing; Li, JianWen; Li, Bo; Lin, Runmao; Tian, Feng; Wang, XiaoLing; Wang, Jun

    2010-01-01

    A 10-fold BAC library for giant panda was constructed and nine BACs were selected to generate finished sequences. These BACs could be used as a validation resource for the de novo assembly accuracy of the whole genome shotgun sequencing reads of giant panda newly generated by the Illumina GA sequencing technology. Complete Sanger sequencing, assembly, annotation and comparative analysis were carried out on the selected BACs, which have a combined length of 878 kb. Homologue search and de novo prediction methods were used to annotate genes and repeats. Twelve protein coding genes were predicted, seven of which could be functionally annotated. The seven genes have an average gene size of about 41 kb, an average coding size of about 1.2 kb and an average exon number of 6 per gene. In addition, seven tRNA genes were found. About 27 percent of the BAC sequence is composed of repeats. A phylogenetic tree was constructed using the neighbor-joining algorithm across five species (giant panda, human, dog, cat and mouse), which reconfirms dog as the species most closely related to giant panda. Our results provide detailed sequence and structure information for new genes and repeats of giant panda, which will be helpful for further studies on the giant panda.

  5. EST Express: PHP/MySQL based automated annotation of ESTs from expression libraries.

    PubMed

    Smith, Robin P; Buchser, William J; Lemmon, Marcus B; Pardinas, Jose R; Bixby, John L; Lemmon, Vance P

    2008-04-10

    Several biological techniques result in the acquisition of functional sets of cDNAs that must be sequenced and analyzed. The emergence of redundant databases such as UniGene and centralized annotation engines such as Entrez Gene has allowed the development of software that can analyze a great number of sequences in a matter of seconds. We have developed "EST Express", a suite of analytical tools that identify and annotate ESTs originating from specific mRNA populations. The software consists of a user-friendly GUI powered by PHP and MySQL that allows for online collaboration between researchers and continuity with UniGene, Entrez Gene and RefSeq. Two key features of the software include a novel, simplified Entrez Gene parser and tools to manage cDNA library sequencing projects. We have tested the software on a large data set (2,016 samples) produced by subtractive hybridization. EST Express is an open-source, cross-platform web server application that imports sequences from cDNA libraries, such as those generated through subtractive hybridization or yeast two-hybrid screens. It then provides several layers of annotation based on Entrez Gene and RefSeq to allow the user to highlight useful genes and manage cDNA library projects.

  6. Large-scale, multi-genome analysis of alternate open reading frames in bacteria and archaea.

    PubMed

    Veloso, Felipe; Riadi, Gonzalo; Aliaga, Daniela; Lieph, Ryan; Holmes, David S

    2005-01-01

    Analysis of over 300,000 annotated genes in 105 bacterial and archaeal genomes reveals an unexpectedly high frequency of large (>300 nucleotides) alternate open reading frames (ORFs). Especially notable is the very high frequency of alternate ORFs in frames +3 and -1 (where the annotated gene is defined as frame +1). The occurrence of alternate ORFs is correlated with genomic G+C content and is strongly influenced by synonymous codon usage bias. The frequency of alternate ORFs in frame -1 is also influenced by the occurrence of codons encoding leucine and serine in frame +1. Although some alternate ORFs have been shown to encode proteins, many others are probably not expressed because they lack appropriate signals for transcription and translation. These latter can be mis-annotated by automatic gene finding programs leading to errors in public databases. Especially prone to mis-annotation is frame -1, because it exhibits a potential codon usage and theoretical capacity to encode proteins with an amino acid composition most similar to real genes. Some alternate ORFs are conserved across bacterial or archaeal species, and can give rise to misannotated "conserved hypothetical" genes, while others are unique to a genome and are misidentified as "hypothetical orphan" genes, contributing significantly to the orphan gene paradox.
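
    The kind of scan behind such an analysis can be sketched as a six-frame search for long ORFs. The sketch assumes an ORF runs from the first ATG to the next in-frame stop codon and labels frames by their offset on each strand; the frame-naming convention used in the study (relative to an annotated gene) may differ, and the function names are hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_frame(seq, offset, min_len):
    """Yield (start, end) of ATG..stop ORFs of at least min_len nt in one frame."""
    start = None
    for i in range(offset, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == "ATG":
            start = i
        elif start is not None and codon in STOPS:
            if i + 3 - start >= min_len:
                yield (start, i + 3)
            start = None


def six_frame_orfs(seq, min_len=300):
    """List (frame, start, end); minus-strand coordinates refer to the reverse complement."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    found = []
    for offset in range(3):
        for s, e in orfs_in_frame(seq, offset, min_len):
            found.append((f"+{offset + 1}", s, e))
        for s, e in orfs_in_frame(rc, offset, min_len):
            found.append((f"-{offset + 1}", s, e))
    return found


toy = "ATG" + "GCT" * 5 + "TAA"  # one short ORF in frame +1
print(six_frame_orfs(toy, min_len=9))
```

    The default `min_len=300` mirrors the ">300 nucleotides" threshold quoted in the abstract; the toy call lowers it only so that a 21-nt example is reported.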

  7. Gene coexpression measures in large heterogeneous samples using count statistics.

    PubMed

    Wang, Y X Rachel; Waterman, Michael S; Huang, Haiyan

    2014-11-18

    With the advent of high-throughput technologies making large-scale gene expression data readily available, developing appropriate computational tools to process these data and distill insights into systems biology has been an important part of the "big data" challenge. Gene coexpression is one of the earliest techniques developed that is still widely in use for functional annotation, pathway analysis, and, most importantly, the reconstruction of gene regulatory networks, based on gene expression data. However, most coexpression measures do not specifically account for local features in expression profiles. For example, it is very likely that the patterns of gene association may change or only exist in a subset of the samples, especially when the samples are pooled from a range of experiments. We propose two new gene coexpression statistics based on counting local patterns of gene expression ranks to take into account the potentially diverse nature of gene interactions. In particular, one of our statistics is designed for time-course data with local dependence structures, such as time series coupled over a subregion of the time domain. We provide asymptotic analysis of their distributions and power, and evaluate their performance against a wide range of existing coexpression measures on simulated and real data. Our new statistics are fast to compute, robust against outliers, and show comparable and often better general performance.
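
    The idea of counting local rank patterns can be illustrated with a deliberately simplified score: the fraction of windows of k consecutive samples in which two genes show the same rank order. This is an assumption-laden illustration of the general idea, not the statistics defined in the paper, and the names `rank_pattern` and `local_pattern_score` are hypothetical.

```python
def rank_pattern(values):
    """Return the rank order of values as a tuple, e.g. (2.0, 0.5, 1.0) -> (2, 0, 1).

    Ties are broken by position, which is an arbitrary choice for this sketch.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, idx in enumerate(order):
        ranks[idx] = rank
    return tuple(ranks)


def local_pattern_score(x, y, k=3):
    """Fraction of length-k sliding windows where x and y share a rank pattern."""
    n_windows = len(x) - k + 1
    matches = sum(
        rank_pattern(x[i:i + k]) == rank_pattern(y[i:i + k])
        for i in range(n_windows)
    )
    return matches / n_windows


# Two toy expression profiles that agree only in the first window.
x = [1, 2, 3, 9, 1, 5]
y = [10, 20, 30, 2, 8, 4]
print(local_pattern_score(x, y, k=3))
```

    Because only rank order inside each window matters, a single extreme value affects at most k windows, which gives some intuition for the robustness to outliers mentioned in the abstract.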

  8. Dense Subgraphs with Restrictions and Applications to Gene Annotation Graphs

    NASA Astrophysics Data System (ADS)

    Saha, Barna; Hoch, Allison; Khuller, Samir; Raschid, Louiqa; Zhang, Xiao-Ning

    In this paper, we focus on finding complex annotation patterns representing novel and interesting hypotheses from gene annotation data. We define a generalization of the densest subgraph problem by adding an additional distance restriction (defined by a separate metric) to the nodes of the subgraph. We show that while this generalization makes the problem NP-hard for arbitrary metrics, when the metric comes from the distance metric of a tree, or an interval graph, the problem can be solved optimally in polynomial time. We also show that the densest subgraph problem with a specified subset of vertices that have to be included in the solution can be solved optimally in polynomial time. In addition, we consider other extensions when not just one solution needs to be found, but we wish to list all subgraphs of almost maximum density as well. We apply this method to a dataset of genes and their annotations obtained from The Arabidopsis Information Resource (TAIR). A user evaluation confirms that the patterns found in the distance restricted densest subgraph for a dataset of photomorphogenesis genes are indeed validated in the literature; a control dataset validates that these are not random patterns. Interestingly, the complex annotation patterns potentially lead to new and as yet unknown hypotheses. We perform experiments to determine the properties of the dense subgraphs, as we vary parameters, including the number of genes and the distance.
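
    For orientation, the unrestricted densest subgraph objective (edges divided by nodes) can be sketched with the well-known greedy peeling heuristic, which repeatedly removes a minimum-degree node and keeps the densest intermediate graph. This is a swapped-in approximation technique, not the exact polynomial-time algorithms of the paper, and it ignores the distance restriction entirely; the function name and the toy graph are assumptions.

```python
def greedy_densest_subgraph(adj):
    """Return (nodes, density) of the densest graph seen while peeling.

    `adj` maps each node to a collection of neighbours in an undirected graph
    and must be non-empty; density is |edges| / |nodes|.
    """
    adj = {u: set(vs) for u, vs in adj.items()}  # private, mutable copy
    n_edges = sum(len(vs) for vs in adj.values()) // 2
    best_nodes, best_density = set(adj), n_edges / len(adj)
    while len(adj) > 1:
        # Peel a node of minimum degree (ties broken by name for determinism).
        u = min(adj, key=lambda n: (len(adj[n]), n))
        n_edges -= len(adj[u])
        for v in adj[u]:
            adj[v].discard(u)
        del adj[u]
        density = n_edges / len(adj)
        if density > best_density:
            best_nodes, best_density = set(adj), density
    return best_nodes, best_density


# Toy graph: a 4-clique (a, b, c, d) with one pendant node e attached to a.
toy = {
    "a": ["b", "c", "d", "e"],
    "b": ["a", "c", "d"],
    "c": ["a", "b", "d"],
    "d": ["a", "b", "c"],
    "e": ["a"],
}
nodes, density = greedy_densest_subgraph(toy)
print(sorted(nodes), density)
```

    In an annotation graph, nodes would be genes and annotation terms; a restricted variant would additionally have to check the distance condition on the retained nodes.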

  9. Microarray data mining using Bioconductor packages.

    PubMed

    Nie, Haisheng; Neerincx, Pieter B T; van der Poel, Jan; Ferrari, Francesco; Bicciato, Silvio; Leunissen, Jack A M; Groenen, Martien A M

    2009-07-16

    This paper describes the results of a Gene Ontology (GO) term enrichment analysis of chicken microarray data using the Bioconductor packages. By checking the enriched GO terms in three contrasts, MM8-PM8, MM8-MA8, and MM8-MM24, of the provided microarray data during this workshop, this analysis aimed to investigate the host reactions in chickens occurring shortly after a secondary challenge with either a homologous or heterologous species of Eimeria. The results of GO enrichment analysis using GO terms annotated to chicken genes and GO terms annotated to chicken-human orthologous genes were also compared. Furthermore, a locally adaptive statistical procedure (LAP) was performed to test differentially expressed chromosomal regions, rather than individual genes, in the chicken genome after Eimeria challenge. GO enrichment analysis identified significant (raw p-value < 0.05) GO terms for all three contrasts included in the analysis. Some of the GO terms were linked, in general, to primary or secondary immune responses, indicating that GO enrichment analysis is a useful approach for analyzing microarray data. The comparisons of GO enrichment results using chicken gene information and chicken-human orthologous gene information showed more refined GO terms related to immune responses when using chicken-human orthologous gene information; this suggests that using chicken-human orthologous gene information has higher power to detect significant GO terms with more refined functionality. Furthermore, three chromosome regions were identified to be significantly up-regulated in contrast MM8-PM8 (q-value < 0.01). Overall, this paper describes a practical approach to analyze microarray data in farm animals where the genome information is still incomplete. For farm animals, such as chicken, with currently limited gene annotation, borrowing gene annotation information from orthologous genes in well-annotated species, such as human, will help improve the pathway analysis results substantially. Furthermore, the LAP approach is a relatively new and very useful method for microarray analysis.

  10. COGNATE: comparative gene annotation characterizer.

    PubMed

    Wilbrandt, Jeanne; Misof, Bernhard; Niehuis, Oliver

    2017-07-17

    The comparison of gene and genome structures across species has the potential to reveal major trends of genome evolution. However, such a comparative approach is currently hampered by a lack of standardization (e.g., Elliott TA, Gregory TR, Philos Trans Royal Soc B: Biol Sci 370:20140331, 2015). For example, testing the hypothesis that the total amount of coding sequences is a reliable measure of potential proteome diversity (Wang M, Kurland CG, Caetano-Anollés G, PNAS 108:11954, 2011) requires the application of standardized definitions of coding sequence and genes to create both comparable and comprehensive data sets and corresponding summary statistics. However, such standard definitions either do not exist or are not consistently applied. These circumstances call for a standard at the descriptive level using a minimum of parameters as well as an undeviating use of standardized terms, and for software that infers the required data under these strict definitions. The acquisition of a comprehensive, descriptive, and standardized set of parameters and summary statistics for genome publications and further analyses can thus greatly benefit from the availability of an easy to use standard tool. We developed a new open-source command-line tool, COGNATE (Comparative Gene Annotation Characterizer), which uses a given genome assembly and its annotation of protein-coding genes for a detailed description of the respective gene and genome structure parameters. Additionally, we revised the standard definitions of gene and genome structures and provide the definitions used by COGNATE as a working draft suggestion for further reference. Complete parameter lists and summary statistics are inferred using this set of definitions to allow down-stream analyses and to provide an overview of the genome and gene repertoire characteristics. 
    COGNATE is written in Perl and freely available at the ZFMK homepage ( https://www.zfmk.de/en/COGNATE ) and on GitHub ( https://github.com/ZFMK/COGNATE ). The tool COGNATE allows comparing genome assemblies and structural elements on multiple levels (e.g., scaffold or contig sequence, gene). It clearly enhances comparability between analyses. Thus, COGNATE can provide the important standardization of both genome and gene structure parameter disclosure as well as data acquisition for future comparative analyses. With the establishment of comprehensive descriptive standards and the extensive availability of genomes, an encompassing database will become possible.

  11. Transcriptome assembly, gene annotation and tissue gene expression atlas of the rainbow trout

    USDA-ARS's Scientific Manuscript database

    Efforts to obtain a comprehensive genome sequence for rainbow trout are ongoing and will be complemented by transcriptome information that will enhance genome assembly and annotation. Previously, we reported a transcriptome reference sequence using a 19X coverage of Sanger and 454-pyrosequencing dat...

  12. SNPdbe: constructing an nsSNP functional impacts database.

    PubMed

    Schaefer, Christian; Meier, Alice; Rost, Burkhard; Bromberg, Yana

    2012-02-15

    Many existing databases annotate experimentally characterized single nucleotide polymorphisms (SNPs). Each non-synonymous SNP (nsSNP) changes one amino acid in the gene product (single amino acid substitution; SAAS). This change can either affect protein function or be neutral in that respect. Most polymorphisms lack experimental annotation of their functional impact. Here, we introduce SNPdbe, the SNP database of effects, with predictions of computationally annotated functional impacts of SNPs. Database entries represent nsSNPs in dbSNP and the 1000 Genomes collection, as well as variants from UniProt and PMD. SAASs come from >2600 organisms, with 'human' being the most prevalent. The impact of each SAAS on protein function is predicted using the SNAP and SIFT algorithms and augmented with experimentally derived function/structure information and disease associations from PMD, OMIM and UniProt. SNPdbe is consistently updated and easily augmented with new sources of information. The database is available as a MySQL dump and via a web front end that allows searches with any combination of organism names, sequences and mutation IDs. http://www.rostlab.org/services/snpdbe.

  13. TubercuList--10 years after.

    PubMed

    Lew, Jocelyne M; Kapopoulou, Adamandia; Jones, Louis M; Cole, Stewart T

    2011-01-01

    TubercuList (http://tuberculist.epfl.ch/), the relational database that presents genome-derived information about H37Rv, the paradigm strain of Mycobacterium tuberculosis, has been active for ten years and now presents its twentieth release. Here, we describe some of the recent changes that have resulted from manual annotation with information from the scientific literature. Through manual curation, TubercuList strives to provide current gene-based information and is thus distinguished from other online sources of genome sequence data for M. tuberculosis. New, mostly small, genes have been discovered and the coordinates of some existing coding sequences have been changed when bioinformatics or experimental data suggest that this is required. Nucleotides that are polymorphic between different sources of H37Rv are annotated and gene essentiality data have been updated. A host of functional information has been gleaned from the literature and many new activities of proteins and RNAs have been included. To facilitate basic and translational research, TubercuList also provides links to other specialized databases that present diverse datasets such as 3D-structures, expression profiles, drug development criteria and drug resistance information, in addition to direct access to PubMed articles pertinent to particular genes. TubercuList has been and remains a highly valuable tool for the tuberculosis research community with >75,000 visitors per month. Copyright © 2010 Elsevier Ltd. All rights reserved.

  14. A transducer for microbial sensory rhodopsin that adopts GTG as a start codon is identified in Haloarcula marismortui.

    PubMed

    Fu, Hsu-Yuan; Lu, Yen-Hsu; Yi, Hsiu-Ping; Yang, Chii-Shen

    2013-04-05

    Microbial sensory rhodopsins are known to mediate phototaxis, and all of the known sensory rhodopsins execute this function with a specific cognate transducer that has two-transmembrane (2-TM) regions. In the genome of Haloarcula marismortui, a total of six rhodopsin genes were annotated, and we previously showed three of them to be the ion type and suggested the other three as sensory type, even though the candidate transducer gene, htr, for HmSRI was missing the 2-TM region that is found in all of the other known transducers. Here we show that this htr gene features a preceding 2-TM region when the alternative start codon GTG, located 291 nucleotides upstream of the originally annotated open reading frame (ORF), is used; the resulting gene is named htrI in this study. Overexpression of HmHtrI showed that it exists as a membrane protein, and several biophysical assays confirmed that it functionally interacts with HmSRI. Together with our previous reverse-transcriptase-PCR results and phototaxis measurements, these findings indicate that the new ORF of the originally predicted soluble htr gene product encodes a membrane protein with a 2-TM region, HmHtrI, which serves as the cognate transducer for HmSRI. HmHtrI is therefore the first sensory rhodopsin transducer shown to adopt a start codon other than ATG. Copyright © 2013 Elsevier B.V. All rights reserved.

  15. Enriching regulatory networks by bootstrap learning using optimised GO-based gene similarity and gene links mined from PubMed abstracts

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Taylor, Ronald C.; Sanfilippo, Antonio P.; McDermott, Jason E.

    2011-02-18

    Transcriptional regulatory networks are being determined using “reverse engineering” methods that infer connections based on correlations in gene state. Corroboration of such networks through independent means such as evidence from the biomedical literature is desirable. Here, we explore a novel approach, a bootstrapping version of our previous Cross-Ontological Analytic method (XOA) that can be used for semi-automated annotation and verification of inferred regulatory connections, as well as for discovery of additional functional relationships between the genes. First, we use our annotation and network expansion method on a biological network learned entirely from the literature. We show how new relevant links between genes can be iteratively derived using a gene similarity measure based on the Gene Ontology that is optimized on the input network at each iteration. Second, we apply our method to annotation, verification, and expansion of a set of regulatory connections found by the Context Likelihood of Relatedness algorithm.

  16. Lynx web services for annotations and systems analysis of multi-gene disorders.

    PubMed

    Sulakhe, Dinanath; Taylor, Andrew; Balasubramanian, Sandhya; Feng, Bo; Xie, Bingqing; Börnigen, Daniela; Dave, Utpal J; Foster, Ian T; Gilliam, T Conrad; Maltsev, Natalia

    2014-07-01

    Lynx is a web-based integrated systems biology platform that supports annotation and analysis of experimental data and generation of weighted hypotheses on molecular mechanisms contributing to human phenotypes and disorders of interest. Lynx has integrated multiple classes of biomedical data (genomic, proteomic, pathways, phenotypic, toxicogenomic, contextual and others) from various public databases as well as manually curated data from our group and collaborators (LynxKB). Lynx provides tools for gene list enrichment analysis using multiple functional annotations and network-based gene prioritization. Lynx provides access to the integrated database and the analytical tools via REST based Web Services (http://lynx.ci.uchicago.edu/webservices.html). This comprises data retrieval services for specific functional annotations, services to search across the complete LynxKB (powered by Lucene), and services to access the analytical tools built within the Lynx platform. © The Author(s) 2014. Published by Oxford University Press on behalf of Nucleic Acids Research.

  17. Issues with RNA-seq analysis in non-model organisms: A salmonid example.

    PubMed

    Sundaram, Arvind; Tengs, Torstein; Grimholt, Unni

    2017-10-01

    High throughput sequencing (HTS) is useful for many purposes as exemplified by the other topics included in this special issue. The purpose of this paper is to look into the unique challenges of using this technology in non-model organisms where resources such as genomes, functional genome annotations or genome complexity provide obstacles not met in model organisms. To describe these challenges, we narrow our scope to RNA sequencing used to study differential gene expression in response to pathogen challenge. As a demonstration species we chose Atlantic salmon, which has a sequenced genome with poor annotation and an added complexity due to many duplicated genes. We find that our RNA-seq analysis pipeline distinguishes between duplicates despite high sequence identity. However, annotation issues cause problems in linking differentially expressed genes to pathways. Also, comparing results between approaches and species is complicated by the lack of standardized annotation. Copyright © 2017 Elsevier Ltd. All rights reserved.

  18. Automated Gene Ontology annotation for anonymous sequence data.

    PubMed

    Hennig, Steffen; Groth, Detlef; Lehrach, Hans

    2003-07-01

    Gene Ontology (GO) is the most widely accepted attempt to construct a unified and structured vocabulary for the description of genes and their products in any organism. Annotation by GO terms is performed in most of the current genome projects, which besides generality has the advantage of being very convenient for computer based classification methods. However, direct use of GO in small sequencing projects is not easy, especially for species not commonly represented in public databases. We present a software package (GOblet), which performs annotation based on GO terms for anonymous cDNA or protein sequences. It uses the species independent GO structure and vocabulary together with a series of protein databases collected from various sites, to perform a detailed GO annotation by sequence similarity searches. The sensitivity and the reference protein sets can be selected by the user. GOblet runs automatically and is available as a public service on our web server. The paper also addresses the reliability of automated GO annotations by using a reference set of more than 6000 human proteins. The GOblet server is accessible at http://goblet.molgen.mpg.de.

  19. Text-mined phenotype annotation and vector-based similarity to improve identification of similar phenotypes and causative genes in monogenic disease patients.

    PubMed

    Saklatvala, Jake R; Dand, Nick; Simpson, Michael A

    2018-05-01

    The genetic diagnosis of rare monogenic diseases using exome/genome sequencing requires the true causal variant(s) to be identified from tens of thousands of observed variants. Typically a virtual gene panel approach is taken whereby only variants in genes known to cause phenotypes resembling the patient under investigation are considered. With the number of known monogenic gene-disease pairs exceeding 5,000, manual curation of personalized virtual panels using exhaustive knowledge of the genetic basis of the human monogenic phenotypic spectrum is challenging. We present improved probabilistic methods for estimating phenotypic similarity based on Human Phenotype Ontology annotation. A limitation of existing methods for evaluating a disease's similarity to a reference set is that reference diseases are typically represented as a series of binary (present/absent) observations of phenotypic terms. We evaluate a quantified disease reference set, using term frequency in phenotypic text descriptions to approximate term relevance. We demonstrate an improved ability to identify related diseases through the use of a quantified reference set, and that vector space similarity measures perform better than established information content-based measures. These improvements enable the generation of bespoke virtual gene panels, facilitating more accurate and efficient interpretation of genomic variant profiles from individuals with rare Mendelian disorders. These methods are available online at https://atlas.genetics.kcl.ac.uk/~jake/cgi-bin/patient_sim.py. © 2018 Wiley Periodicals, Inc.

  20. The Evolution of Human Cells in Terms of Protein Innovation

    PubMed Central

    Sardar, Adam J.; Oates, Matt E.; Fang, Hai; Forrest, Alistair R.R.; Kawaji, Hideya; Gough, Julian; Rackham, Owen J.L.

    2014-01-01

    Humans are composed of hundreds of cell types. As the genomic DNA of each somatic cell is identical, cell type is determined by what is expressed and when. Until recently, little has been reported about the determinants of human cell identity, particularly from the joint perspective of gene evolution and expression. Here, we chart the evolutionary past of all documented human cell types via the collective histories of proteins, the principal product of gene expression. FANTOM5 data provide cell-type–specific digital expression of human protein-coding genes and the SUPERFAMILY resource is used to provide protein domain annotation. The evolutionary epoch in which each protein was created is inferred by comparison with domain annotation of all other completely sequenced genomes. Studying the distribution across epochs of genes expressed in each cell type reveals insights into human cellular evolution in terms of protein innovation. For each cell type, its history of protein innovation is charted based on the genes it expresses. Combining the histories of all cell types enables us to create a timeline of cell evolution. This timeline identifies the possibility that our common ancestor Coelomata (cavity-forming animals) provided the innovation required for the innate immune system, whereas cells which now form the brain of human have followed a trajectory of continually accumulating novel proteins since Opisthokonta (boundary of animals and fungi). We conclude that exaptation of existing domain architectures into new contexts is the dominant source of cell-type–specific domain architectures. PMID:24692656

  1. Jointly characterizing epigenetic dynamics across multiple human cell types

    PubMed Central

    An, Lin; Yue, Feng; Hardison, Ross C

    2016-01-01

    Advanced sequencing technologies have generated a plethora of data for many chromatin marks in multiple tissues and cell types, yet there is a lack of a generalized tool for making optimal use of those data. A major challenge is to quantitatively model the epigenetic dynamics across both the genome and many cell types for understanding their impacts on differential gene regulation and disease. We introduce IDEAS, an integrative and discriminative epigenome annotation system, for jointly characterizing epigenetic landscapes in many cell types and detecting differential regulatory regions. A key distinction between our method and existing state-of-the-art algorithms is that IDEAS integrates epigenomes of many cell types simultaneously in a way that preserves the position-dependent and cell type-specific information at fine scales, thereby greatly improving segmentation accuracy and producing comparable annotations across cell types. PMID:27095202

  2. A manually annotated Actinidia chinensis var. chinensis (kiwifruit) genome highlights the challenges associated with draft genomes and gene prediction in plants.

    PubMed

    Pilkington, Sarah M; Crowhurst, Ross; Hilario, Elena; Nardozza, Simona; Fraser, Lena; Peng, Yongyan; Gunaseelan, Kularajathevan; Simpson, Robert; Tahir, Jibran; Deroles, Simon C; Templeton, Kerry; Luo, Zhiwei; Davy, Marcus; Cheng, Canhong; McNeilage, Mark; Scaglione, Davide; Liu, Yifei; Zhang, Qiong; Datson, Paul; De Silva, Nihal; Gardiner, Susan E; Bassett, Heather; Chagné, David; McCallum, John; Dzierzon, Helge; Deng, Cecilia; Wang, Yen-Yi; Barron, Lorna; Manako, Kelvina; Bowen, Judith; Foster, Toshi M; Erridge, Zoe A; Tiffin, Heather; Waite, Chethi N; Davies, Kevin M; Grierson, Ella P; Laing, William A; Kirk, Rebecca; Chen, Xiuyin; Wood, Marion; Montefiori, Mirco; Brummell, David A; Schwinn, Kathy E; Catanach, Andrew; Fullerton, Christina; Li, Dawei; Meiyalaghan, Sathiyamoorthy; Nieuwenhuizen, Niels; Read, Nicola; Prakash, Roneel; Hunter, Don; Zhang, Huaibi; McKenzie, Marian; Knäbel, Mareike; Harris, Alastair; Allan, Andrew C; Gleave, Andrew; Chen, Angela; Janssen, Bart J; Plunkett, Blue; Ampomah-Dwamena, Charles; Voogd, Charlotte; Leif, Davin; Lafferty, Declan; Souleyre, Edwige J F; Varkonyi-Gasic, Erika; Gambi, Francesco; Hanley, Jenny; Yao, Jia-Long; Cheung, Joey; David, Karine M; Warren, Ben; Marsh, Ken; Snowden, Kimberley C; Lin-Wang, Kui; Brian, Lara; Martinez-Sanchez, Marcela; Wang, Mindy; Ileperuma, Nadeesha; Macnee, Nikolai; Campin, Robert; McAtee, Peter; Drummond, Revel S M; Espley, Richard V; Ireland, Hilary S; Wu, Rongmei; Atkinson, Ross G; Karunairetnam, Sakuntala; Bulley, Sean; Chunkath, Shayhan; Hanley, Zac; Storey, Roy; Thrimawithana, Amali H; Thomson, Susan; David, Charles; Testolin, Raffaele; Huang, Hongwen; Hellens, Roger P; Schaffer, Robert J

    2018-04-16

    Most published genome sequences are drafts, and most are dominated by computational gene prediction. Draft genomes typically incorporate considerable sequence data that are not assigned to chromosomes, and predicted genes without quality confidence measures. The current Actinidia chinensis (kiwifruit) 'Hongyang' draft genome has 164 Mb of sequences unassigned to pseudo-chromosomes, and omissions have been identified in the gene models. A second genome of an A. chinensis (genotype Red5) was fully sequenced. This new sequence resulted in a 554.0 Mb assembly with all but 6 Mb assigned to pseudo-chromosomes. Pseudo-chromosomal comparisons showed a considerable number of translocation events have occurred following a whole genome duplication (WGD) event some consistent with centromeric Robertsonian-like translocations. RNA sequencing data from 12 tissues and ab initio analysis informed a genome-wide manual annotation, using the WebApollo tool. In total, 33,044 gene loci represented by 33,123 isoforms were identified, named and tagged for quality of evidential support. Of these 3114 (9.4%) were identical to a protein within 'Hongyang' The Kiwifruit Information Resource (KIR v2). Some proportion of the differences will be varietal polymorphisms. However, as most computationally predicted Red5 models required manual re-annotation this proportion is expected to be small. The quality of the new gene models was tested by fully sequencing 550 cloned 'Hort16A' cDNAs and comparing with the predicted protein models for Red5 and both the original 'Hongyang' assembly and the revised annotation from KIR v2. Only 48.9% and 63.5% of the cDNAs had a match with 90% identity or better to the original and revised 'Hongyang' annotation, respectively, compared with 90.9% to the Red5 models. Our study highlights the need to take a cautious approach to draft genomes and computationally predicted genes. 
    Our use of the manual annotation tool WebApollo facilitated manual checking and correction of gene models, enabling improvement of computational prediction. This utility was especially relevant for certain types of gene families such as the EXPANSIN-like genes. Finally, this high quality gene set will supply the kiwifruit and general plant community with a new tool for genomics and other comparative analysis.

  3. The SEED and the Rapid Annotation of microbial genomes using Subsystems Technology (RAST)

    PubMed Central

    Overbeek, Ross; Olson, Robert; Pusch, Gordon D.; Olsen, Gary J.; Davis, James J.; Disz, Terry; Edwards, Robert A.; Gerdes, Svetlana; Parrello, Bruce; Shukla, Maulik; Vonstein, Veronika; Wattam, Alice R.; Xia, Fangfang; Stevens, Rick

    2014-01-01

    In 2004, the SEED (http://pubseed.theseed.org/) was created to provide consistent and accurate genome annotations across thousands of genomes and as a platform for discovering and developing de novo annotations. The SEED is a constantly updated integration of genomic data with a genome database, web front end, API and server scripts. It is used by many scientists for predicting gene functions and discovering new pathways. In addition to being a powerful database for bioinformatics research, the SEED also houses subsystems (collections of functionally related protein families) and their derived FIGfams (protein families), which represent the core of the RAST annotation engine (http://rast.nmpdr.org/). When a new genome is submitted to RAST, genes are called and their annotations are made by comparison to the FIGfam collection. If the genome is made public, it is then housed within the SEED and its proteins populate the FIGfam collection. This annotation cycle has proven to be a robust and scalable solution to the problem of annotating the exponentially increasing number of genomes. To date, >12 000 users worldwide have annotated >60 000 distinct genomes using RAST. Here we describe the interconnectedness of the SEED database and RAST, the RAST annotation pipeline and updates to both resources. PMID:24293654

  4. The SEED and the Rapid Annotation of microbial genomes using Subsystems Technology (RAST).

    PubMed

    Overbeek, Ross; Olson, Robert; Pusch, Gordon D; Olsen, Gary J; Davis, James J; Disz, Terry; Edwards, Robert A; Gerdes, Svetlana; Parrello, Bruce; Shukla, Maulik; Vonstein, Veronika; Wattam, Alice R; Xia, Fangfang; Stevens, Rick

    2014-01-01

    In 2004, the SEED (http://pubseed.theseed.org/) was created to provide consistent and accurate genome annotations across thousands of genomes and as a platform for discovering and developing de novo annotations. The SEED is a constantly updated integration of genomic data with a genome database, web front end, API and server scripts. It is used by many scientists for predicting gene functions and discovering new pathways. In addition to being a powerful database for bioinformatics research, the SEED also houses subsystems (collections of functionally related protein families) and their derived FIGfams (protein families), which represent the core of the RAST annotation engine (http://rast.nmpdr.org/). When a new genome is submitted to RAST, genes are called and their annotations are made by comparison to the FIGfam collection. If the genome is made public, it is then housed within the SEED and its proteins populate the FIGfam collection. This annotation cycle has proven to be a robust and scalable solution to the problem of annotating the exponentially increasing number of genomes. To date, >12 000 users worldwide have annotated >60 000 distinct genomes using RAST. Here we describe the interconnectedness of the SEED database and RAST, the RAST annotation pipeline and updates to both resources.

  5. Gene calling and bacterial genome annotation with BG7.

    PubMed

    Tobes, Raquel; Pareja-Tobes, Pablo; Manrique, Marina; Pareja-Tobes, Eduardo; Kovach, Evdokim; Alekhin, Alexey; Pareja, Eduardo

    2015-01-01

    New massive sequencing technologies are providing many bacterial genome sequences from diverse taxa but a refined annotation of these genomes is crucial for obtaining scientific findings and new knowledge. Thus, bacterial genome annotation has emerged as a key point to investigate in bacteria. Any efficient tool designed specifically to annotate bacterial genomes sequenced with massively parallel technologies has to consider the specific features of bacterial genomes (absence of introns and scarcity of nonprotein-coding sequence) and of next-generation sequencing (NGS) technologies (presence of errors and not perfectly assembled genomes). These features make it convenient to focus on coding regions and, hence, on protein sequences that are the elements directly related to biological functions. In this chapter we describe how to annotate bacterial genomes with BG7, an open-source tool based on a protein-centered gene calling/annotation paradigm. BG7 is specifically designed for the annotation of bacterial genomes sequenced with NGS. The tool is tolerant of sequence errors and retains its capabilities when annotating highly fragmented genomes or mixed sequences coming from several genomes (such as those obtained from metagenomic samples). BG7 has been designed with scalability as a requirement, with a computing infrastructure completely based on cloud computing (Amazon Web Services).

  6. Exploiting proteomic data for genome annotation and gene model validation in Aspergillus niger.

    PubMed

    Wright, James C; Sugden, Deana; Francis-McIntyre, Sue; Riba-Garcia, Isabel; Gaskell, Simon J; Grigoriev, Igor V; Baker, Scott E; Beynon, Robert J; Hubbard, Simon J

    2009-02-04

    Proteomic data is a potentially rich, but arguably unexploited, data source for genome annotation. Peptide identifications from tandem mass spectrometry provide prima facie evidence for gene predictions and can discriminate among a set of candidate gene models. Here we apply this to the recently sequenced Aspergillus niger fungal genome from the Joint Genome Institute (JGI) and another predicted protein set from another A. niger sequence. Tandem mass spectra (MS/MS) were acquired from 1D gel electrophoresis bands and searched against all available gene models using Average Peptide Scoring (APS) and reverse database searching to produce confident identifications at an acceptable false discovery rate (FDR). A total of 405 identified peptide sequences were mapped to 214 different A. niger genomic loci to which 4093 predicted gene models clustered, 2872 of which contained the mapped peptides. Interestingly, 13 (6%) of these loci either had no preferred predicted gene model or the genome annotators' chosen "best" model for that genomic locus was not found to be the most parsimonious match to the identified peptides. The peptides identified also boosted confidence in predicted gene structures spanning 54 introns from different gene models. This work highlights the potential of integrating experimental proteomics data into genomic annotation pipelines much as expressed sequence tag (EST) data has been. A comparison with the published genome from another strain of A. niger sequenced by DSM showed that a number of the gene models or proteins with proteomics evidence did not occur in both genomes, further highlighting the utility of the method.

  7. Unique Features of the Loblolly Pine (Pinus taeda L.) Megagenome Revealed Through Sequence Annotation

    PubMed Central

    Wegrzyn, Jill L.; Liechty, John D.; Stevens, Kristian A.; Wu, Le-Shin; Loopstra, Carol A.; Vasquez-Gross, Hans A.; Dougherty, William M.; Lin, Brian Y.; Zieve, Jacob J.; Martínez-García, Pedro J.; Holt, Carson; Yandell, Mark; Zimin, Aleksey V.; Yorke, James A.; Crepeau, Marc W.; Puiu, Daniela; Salzberg, Steven L.; de Jong, Pieter J.; Mockaitis, Keithanne; Main, Doreen; Langley, Charles H.; Neale, David B.

    2014-01-01

    The largest genus in the conifer family Pinaceae is Pinus, with over 100 species. The size and complexity of their genomes (∼20–40 Gb, 2n = 24) have delayed the arrival of a well-annotated reference sequence. In this study, we present the annotation of the first whole-genome shotgun assembly of loblolly pine (Pinus taeda L.), which comprises 20.1 Gb of sequence. The MAKER-P annotation pipeline combined evidence-based alignments and ab initio predictions to generate 50,172 gene models, of which 15,653 are classified as high confidence. Clustering these gene models with 13 other plant species resulted in 20,646 gene families, of which 1554 are predicted to be unique to conifers. Among the conifer gene families, 159 are composed exclusively of loblolly pine members. The gene models for loblolly pine have the highest median and mean intron lengths of 24 fully sequenced plant genomes. Conifer genomes are full of repetitive DNA, with the most significant contributions from long-terminal-repeat retrotransposons. In-depth analysis of the tandem and interspersed repetitive content yielded a combined estimate of 82%. PMID:24653211

  8. First generation annotations for the fathead minnow (Pimephales promelas) genome

    EPA Science Inventory

    Ab initio gene prediction and evidence alignment were used to produce the first annotations for the fathead minnow SOAPdenovo genome assembly. Additionally, a genome browser hosted at genome.setac.org provides simplified access to the annotation data in context with fathead minno...

  9. Evolutionary characterization of pig interferon-inducible transmembrane gene family and member expression dynamics in tracheobronchial lymph nodes of pigs infected with swine respiratory disease viruses.

    PubMed

    Miller, Laura C; Jiang, Zhihua; Sang, Yongming; Harhay, Gregory P; Lager, Kelly M

    2014-06-15

    Studies have found that a cluster of duplicated gene loci encoding the interferon-inducible transmembrane protein (IFITM) family has antiviral activity against several viruses, including influenza A virus. The gene family has 5 and 7 members in humans and mice, respectively. Here, we confirm the current annotation of pig IFITM1, IFITM2, IFITM3, IFITM5, IFITM1L1 and IFITM1L4, manually annotate IFITM1L2, IFITM1L3, IFITM5L, IFITM3L1 and IFITM3L2, and provide expressed sequence tag (EST) and/or mRNA evidence, not contained within the NCBI Reference Sequence database (RefSeq), for the existence of IFITM6, IFITM7 and a new IFITM1-like (IFITM1LN) gene in pigs. Phylogenetic analyses showed seven porcine IFITM genes with highly conserved human/mouse orthologs known to have anti-viral activity. Digital Gene Expression Tag Profiling (DGETP) of swine tracheobronchial lymph nodes (TBLN) of pigs infected with swine influenza virus (SIV), porcine pseudorabies virus, porcine reproductive and respiratory syndrome virus or porcine circovirus type 2 over 14 days post-inoculation (dpi) showed that gene expression abundance differs dramatically among pig IFITM family members, ranging from 0 to over 3000 tags per million. In particular, SIV up-regulated IFITM1 by 5.9 fold at 3 dpi. A Bayesian framework further identified pig IFITM1 and IFITM3 as differentially expressed genes in the overall transcriptome analysis. In addition to being a component of protein complexes involved in homotypic adhesion, IFITM1 is also associated with pathways related to regulation of cell proliferation, and IFITM3 is involved in immune responses. Published by Elsevier B.V.

  10. Construction of an annotated corpus to support biomedical information extraction

    PubMed Central

    Thompson, Paul; Iqbal, Syed A; McNaught, John; Ananiadou, Sophia

    2009-01-01

    Background Information Extraction (IE) is a component of text mining that facilitates knowledge discovery by automatically locating instances of interesting biomedical events from huge document collections. As events are usually centred on verbs and nominalised verbs, understanding the syntactic and semantic behaviour of these words is highly important. Corpora annotated with information concerning this behaviour can constitute a valuable resource in the training of IE components and resources. Results We have defined a new scheme for annotating sentence-bound gene regulation events, centred on both verbs and nominalised verbs. For each event instance, all participants (arguments) in the same sentence are identified and assigned a semantic role from a rich set of 13 roles tailored to biomedical research articles, together with a biological concept type linked to the Gene Regulation Ontology. To our knowledge, our scheme is unique within the biomedical field in terms of the range of event arguments identified. Using the scheme, we have created the Gene Regulation Event Corpus (GREC), consisting of 240 MEDLINE abstracts, in which events relating to gene regulation and expression have been annotated by biologists. A novel method of evaluating various different facets of the annotation task showed that average inter-annotator agreement rates fall within the range of 66% - 90%. Conclusion The GREC is a unique resource within the biomedical field, in that it annotates not only core relationships between entities, but also a range of other important details about these relationships, e.g., location, temporal, manner and environmental conditions. As such, it is specifically designed to support bio-specific tool and resource development. It has already been used to acquire semantic frames for inclusion within the BioLexicon (a lexical, terminological resource to aid biomedical text mining). 
    Initial experiments have also shown that the corpus may viably be used to train IE components, such as semantic role labellers. The corpus and annotation guidelines are freely available for academic purposes. PMID:19852798

  11. Comparative Reannotation of 21 Aspergillus Genomes

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Salamov, Asaf; Riley, Robert; Kuo, Alan

    2013-03-08

    We used comparative gene modeling to reannotate 21 Aspergillus genomes. Initial automatic annotation of individual genomes may contain errors of different kinds, e.g. missing genes, incorrect exon-intron structures, 'chimeras' that fuse 2 or more real genes, or, alternatively, real genes split into 2 or more models. The main premise behind the comparative modeling approach is that for closely related genomes most orthologous families have the same conserved gene structure. The algorithm maps all gene models predicted in each individual Aspergillus genome to the other genomes and, for each locus, selects from potentially many competing models the one which most closely resembles the orthologous genes from other genomes. This procedure is iterated until no further change in gene models is observed. For Aspergillus genomes we predicted in total 4503 new gene models (~2% per genome), supported by comparative analysis, additionally correcting ~18% of old gene models. This resulted in a total of 4065 more genes with annotated PFAM domains (~3% increase per genome). Analysis of a few genomes with EST/transcriptomics data shows that the new annotation sets also have a higher number of EST-supported splice sites at exon-intron boundaries.

  12. De Novo Transcriptome Analysis of Two Seahorse Species (Hippocampus erectus and H. mohnikei) and the Development of Molecular Markers for Population Genetics

    PubMed Central

    Lin, Qiang; Luo, Wei; Wan, Shiming; Gao, Zexia

    2016-01-01

    Seahorse conservation has been performed utilizing various strategies for many decades, and a deeper understanding of genomic information is necessary to more efficiently protect the germplasm resources of seahorse species. However, little genetic information about seahorses currently exists in the public databases. In this study, high-throughput RNA sequencing for two seahorse species, Hippocampus erectus and H. mohnikei, was carried out, and de novo assembly generated 37,506 unigenes for H. erectus and 36,113 unigenes for H. mohnikei. Among them, 17,338 (46.23%) unigenes for H. erectus and 17,900 (49.57%) for H. mohnikei were successfully annotated based on the information available from the public databases. By comparing the unigenes of the two seahorse species, 7,802 candidate orthologous genes were identified, and 5,268 genes among them could be annotated. In addition, gene ontology analysis of the two species was similarly performed on biological processes, cellular components, and molecular functions. Twenty-four and twenty-one unigenes in H. erectus and H. mohnikei, respectively, were annotated in the biosynthesis of unsaturated fatty acids pathways, and both seahorses lacked the Δ12 and Δ15 desaturases. A total of 8,992 and 9,116 SSR loci were obtained from H. erectus and H. mohnikei unigenes, respectively. Dozens of SSRs were developed and then applied to assess the population genetic diversity, as well as cross-amplified in a related species, H. trimaculatus. The HO and HE values of the tested populations for H. erectus, H. mohnikei, and H. trimaculatus were medium. These resources would facilitate the conservation of the species through a better understanding of the genomics and comparative genome analysis within the Hippocampus genus. PMID:27128031

  14. Tissue-specific Proteogenomic Analysis of Plutella xylostella Larval Midgut Using a Multialgorithm Pipeline

    PubMed Central

    Zhu, Xun; Xie, Shangbo; Armengaud, Jean; Xie, Wen; Guo, Zhaojiang; Kang, Shi; Wu, Qingjun; Wang, Shaoli; Xia, Jixing; He, Rongjun; Zhang, Youjun

    2016-01-01

    The diamondback moth, Plutella xylostella (L.), is the major cosmopolitan pest of brassica and other cruciferous crops. Its larval midgut is a dynamic tissue that interfaces with a wide variety of toxicological and physiological processes. The draft sequence of the P. xylostella genome was recently released, but its annotation remains challenging because of the low sequence coverage of this branch of life and the poor description of exon/intron splicing rules for these insects. Peptide sequencing by computational assignment of tandem mass spectra to genome sequence information provides an independent experimental approach for confirming or refuting protein predictions, a concept that has been termed proteogenomics. In this study, we carried out an in-depth proteogenomic analysis to complement genome annotation of P. xylostella larval midgut based on shotgun HPLC-ESI-MS/MS data by means of a multialgorithm pipeline. A total of 876,341 tandem mass spectra were searched against the predicted P. xylostella protein sequences and a whole-genome six-frame translation database. Based on a data set comprising 2694 novel genome search specific peptides, we discovered 439 novel protein-coding genes and corrected 128 existing gene models. To get the most accurate data to seed further insect genome annotation, more than half of the novel protein-coding genes, i.e. 235 of 439, were further validated after RT-PCR amplification and sequencing of the corresponding transcripts. Furthermore, we validated 53 novel alternative splicings. Finally, a total of 6764 proteins were identified, resulting in one of the most comprehensive proteogenomic studies of a nonmodel animal. As the first tissue-specific proteogenomics analysis of P. xylostella, this study provides the fundamental basis for high-throughput proteomics and functional genomics approaches aimed at deciphering the molecular mechanisms of resistance and controlling this pest. PMID:26902207

  15. Transcriptome analysis reveals the time of the fourth round of genome duplication in common carp (Cyprinus carpio)

    PubMed Central

    2012-01-01

    Background Common carp (Cyprinus carpio) is thought to have undergone one extra round of genome duplication compared to zebrafish. Transcriptome analysis has been used to study the existence and timing of genome duplication in species for which genome sequences are incomplete. Large-scale transcriptome data for the common carp genome should help reveal the timing of the additional duplication event. Results We have sequenced the transcriptome of common carp using 454 pyrosequencing. After assembling the 454 contigs and the published common carp sequences together, we obtained 49,669 contigs and identified genes using homology searches and an ab initio method. We identified 4,651 orthologous pairs between common carp and zebrafish and found 129,984 paralogous pairs within the common carp. An estimation of the synonymous substitution rate in the orthologous pairs indicated that common carp and zebrafish diverged 120 million years ago (MYA). We identified one round of genome duplication in common carp and estimated that it had occurred 5.6 to 11.3 MYA. In zebrafish, no genome duplication event after speciation was observed, suggesting that, compared to zebrafish, common carp had undergone an additional genome duplication event. We annotated the common carp contigs with Gene Ontology terms and KEGG pathways. Compared with zebrafish gene annotations, we found that a set of biological processes and pathways were enriched in common carp. Conclusions The assembled contigs helped us to estimate the time of the fourth-round of genome duplication in common carp. The resource that we have built as part of this study will help advance functional genomics and genome annotation studies in the future. PMID:22424280

  16. Improved systematic tRNA gene annotation allows new insights into the evolution of mitochondrial tRNA structures and into the mechanisms of mitochondrial genome rearrangements

    PubMed Central

    Jühling, Frank; Pütz, Joern; Bernt, Matthias; Donath, Alexander; Middendorf, Martin; Florentz, Catherine; Stadler, Peter F.

    2012-01-01

    Transfer RNAs (tRNAs) are present in all types of cells as well as in organelles. tRNAs of animal mitochondria show a low level of primary sequence conservation and exhibit ‘bizarre’ secondary structures, lacking complete domains of the common cloverleaf. Such sequences are hard to detect and hence frequently missed in computational analyses and mitochondrial genome annotation. Here, we introduce an automatic annotation procedure for mitochondrial tRNA genes in Metazoa based on sequence and structural information in manually curated covariance models. The method, applied to re-annotate 1876 available metazoan mitochondrial RefSeq genomes, makes it possible to distinguish between remaining functional genes and degrading ‘pseudogenes’, even at early stages of divergence. The subsequent analysis of a comprehensive set of mitochondrial tRNA genes gives new insights into the evolution of structures of mitochondrial tRNA sequences as well as into the mechanisms of genome rearrangements. We find frequent losses of tRNA genes concentrated in basal Metazoa, frequent independent losses of individual parts of tRNA genes, particularly in Arthropoda, and widespread conserved overlaps of tRNAs in opposite reading direction. Direct evidence for several recent Tandem Duplication-Random Loss events is gained, demonstrating that this mechanism has an impact on the appearance of new mitochondrial gene orders. PMID:22139921

  17. Bioinformatics for spermatogenesis: annotation of male reproduction based on proteomics

    PubMed Central

    Zhou, Tao; Zhou, Zuo-Min; Guo, Xue-Jiang

    2013-01-01

    Proteomics strategies have been widely used in the field of male reproduction, both in basic and clinical research. Bioinformatics methods are indispensable in proteomics-based studies and are used for data presentation, database construction and functional annotation. In the present review, we focus on the functional annotation of gene lists obtained through qualitative or quantitative methods, summarizing the common and male reproduction specialized proteomics databases. We introduce several integrated tools used to find the hidden biological significance from the data obtained. We further describe in detail the information on male reproduction derived from Gene Ontology analyses, pathway analyses and biomedical analyses. We provide an overview of bioinformatics annotations in spermatogenesis, from gene function to biological function and from biological function to clinical application. On the basis of recently published proteomics studies and associated data, we show that bioinformatics methods help us to discover drug targets for sperm motility and to scan for cancer-testis genes. In addition, we summarize the online resources relevant to male reproduction research for the exploration of the regulation of spermatogenesis. PMID:23852026

  18. A draft annotation and overview of the human genome

    PubMed Central

    Wright, Fred A; Lemon, William J; Zhao, Wei D; Sears, Russell; Zhuo, Degen; Wang, Jian-Ping; Yang, Hee-Yung; Baer, Troy; Stredney, Don; Spitzner, Joe; Stutz, Al; Krahe, Ralf; Yuan, Bo

    2001-01-01

    Background The recent draft assembly of the human genome provides a unified basis for describing genomic structure and function. The draft is sufficiently accurate to provide useful annotation, enabling direct observations of previously inferred biological phenomena. Results We report here a functionally annotated human gene index placed directly on the genome. The index is based on the integration of public transcript, protein, and mapping information, supplemented with computational prediction. We describe numerous global features of the genome and examine the relationship of various genetic maps with the assembly. In addition, initial sequence analysis reveals highly ordered chromosomal landscapes associated with paralogous gene clusters and distinct functional compartments. Finally, these annotation data were synthesized to produce observations of gene density and number that accord well with historical estimates. Such a global approach had previously been described only for chromosomes 21 and 22, which together account for 2.2% of the genome. Conclusions We estimate that the genome contains 65,000-75,000 transcriptional units, with exon sequences comprising 4%. The creation of a comprehensive gene index requires the synthesis of all available computational and experimental evidence. PMID:11516338

  19. The GermOnline cross-species systems browser provides comprehensive information on genes and gene products relevant for sexual reproduction.

    PubMed

    Gattiker, Alexandre; Niederhauser-Wiederkehr, Christa; Moore, James; Hermida, Leandro; Primig, Michael

    2007-01-01

    We report a novel release of the GermOnline knowledgebase covering genes relevant for the cell cycle, gametogenesis and fertility. GermOnline was extended into a cross-species systems browser including information on DNA sequence annotation, gene expression and the function of gene products. The database covers eight model organisms and Homo sapiens, for which complete genome annotation data are available. The database is now built around a sophisticated genome browser (Ensembl), our own microarray information management and annotation system (MIMAS) used to extensively describe experimental data obtained with high-density oligonucleotide microarrays (GeneChips) and a comprehensive system for online editing of database entries (MediaWiki). The RNA data include results from classical microarrays as well as tiling arrays that yield information on RNA expression levels, transcript start sites and lengths as well as exon composition. Members of the research community are solicited to help GermOnline curators keep database entries on genes and gene products complete and accurate. The database is accessible at http://www.germonline.org/.

  20. Transcriptome Assembly, Gene Annotation and Tissue Gene Expression Atlas of the Rainbow Trout

    PubMed Central

    Salem, Mohamed; Paneru, Bam; Al-Tobasei, Rafet; Abdouni, Fatima; Thorgaard, Gary H.; Rexroad, Caird E.; Yao, Jianbo

    2015-01-01

    Efforts to obtain a comprehensive genome sequence for rainbow trout are ongoing and will be complemented by transcriptome information that will enhance genome assembly and annotation. Previously, transcriptome reference sequences were reported using data from different sources. Although the previous work added a great wealth of sequences, a complete and well-annotated transcriptome is still needed. In addition, gene expression in different tissues was not completely addressed in the previous studies. In this study, non-normalized cDNA libraries were sequenced from 13 different tissues of a single doubled haploid rainbow trout from the same source used for the rainbow trout genome sequence. A total of ~1.167 billion paired-end reads were de novo assembled using the Trinity RNA-Seq assembler yielding 474,524 contigs > 500 base-pairs. Of them, 287,593 had homologies to the NCBI non-redundant protein database. The longest contig of each cluster was selected as a reference, yielding 44,990 representative contigs. A total of 4,146 contigs (9.2%), including 710 full-length sequences, did not match any mRNA sequences in the current rainbow trout genome reference. Mapping reads to the reference genome identified an additional 11,843 transcripts not annotated in the genome. A digital gene expression atlas revealed 7,678 housekeeping and 4,021 tissue-specific genes. Expression of about 16,000–32,000 genes (35–71% of the identified genes) accounted for basic and specialized functions of each tissue. White muscle and stomach had the least complex transcriptomes, with high percentages of their total mRNA contributed by a small number of genes. Brain, testis and intestine, in contrast, had complex transcriptomes, with large numbers of genes involved in their expression patterns. This study provides comprehensive de novo transcriptome information that is suitable for functional and comparative genomics studies in rainbow trout, including annotation of the genome. PMID:25793877

  1. Evaluating Hierarchical Structure in Music Annotations

    PubMed Central

    McFee, Brian; Nieto, Oriol; Farbood, Morwaread M.; Bello, Juan Pablo

    2017-01-01

    Music exhibits structure at multiple scales, ranging from motifs to large-scale functional components. When inferring the structure of a piece, different listeners may attend to different temporal scales, which can result in disagreements when they describe the same piece. In the field of music informatics research (MIR), it is common to use corpora annotated with structural boundaries at different levels. By quantifying disagreements between multiple annotators, previous research has yielded several insights relevant to the study of music cognition. First, annotators tend to agree when structural boundaries are ambiguous. Second, this ambiguity seems to depend on musical features, time scale, and genre. Furthermore, it is possible to tune current annotation evaluation metrics to better align with these perceptual differences. However, previous work has not directly analyzed the effects of hierarchical structure because the existing methods for comparing structural annotations are designed for “flat” descriptions, and do not readily generalize to hierarchical annotations. In this paper, we extend and generalize previous work on the evaluation of hierarchical descriptions of musical structure. We derive an evaluation metric which can compare hierarchical annotations holistically across multiple levels. Using this metric, we investigate inter-annotator agreement on the multilevel annotations of two different music corpora, investigate the influence of acoustic properties on hierarchical annotations, and evaluate existing hierarchical segmentation algorithms against the distribution of inter-annotator agreement. PMID:28824514

  2. EuroPineDB: a high-coverage web database for maritime pine transcriptome

    PubMed Central

    2011-01-01

    Background Pinus pinaster is an economically and ecologically important species that is becoming a woody gymnosperm model. Its enormous genome size makes whole-genome sequencing approaches hard to apply. Therefore, the expressed portion of the genome has to be characterised and the results and annotations have to be stored in dedicated databases. Description EuroPineDB is the largest sequence collection available for a single pine species, Pinus pinaster (maritime pine), since it comprises 951 641 raw sequence reads obtained from non-normalised cDNA libraries and high-throughput sequencing from adult (xylem, phloem, roots, stem, needles, cones, strobili) and embryonic (germinated embryos, buds, callus) maritime pine tissues. Using open-source tools, sequences were optimally pre-processed, assembled, and extensively annotated (GO, EC and KEGG terms, descriptions, SNPs, SSRs, ORFs and InterPro codes). As a result, a 10.5× P. pinaster genome was covered and assembled in 55 322 UniGenes. A total of 32 919 (59.5%) of P. pinaster UniGenes were annotated with at least one description, revealing at least 18 466 different genes. The complete database, which is designed to be scalable, maintainable, and expandable, is freely available at: http://www.scbi.uma.es/pindb/. It can be retrieved by gene libraries, pine species, annotations, UniGenes and microarrays (i.e., the sequences are distributed in two-colour microarrays; this is the only conifer database that provides this information) and will be periodically updated. Small assemblies can be viewed using a dedicated visualisation tool that connects them with SNPs. Any sequence or annotation set shown on-screen can be downloaded. Retrieval mechanisms for sequences and gene annotations are provided. Conclusions The EuroPineDB with its integrated information can be used to reveal new knowledge, offers an easy-to-use collection of information to directly support experimental work (including microarray hybridisation), and provides deeper knowledge on the maritime pine transcriptome. PMID:21762488

  3. Mining a database of single amplified genomes from Red Sea brine pool extremophiles—improving reliability of gene function prediction using a profile and pattern matching algorithm (PPMA)

    PubMed Central

    Grötzinger, Stefan W.; Alam, Intikhab; Ba Alawi, Wail; Bajic, Vladimir B.; Stingl, Ulrich; Eppinger, Jörg

    2014-01-01

    Reliable functional annotation of genomic data is the key-step in the discovery of novel enzymes. Intrinsic sequencing data quality problems of single amplified genomes (SAGs) and poor homology of novel extremophile's genomes pose significant challenges for the attribution of functions to the coding sequences identified. The anoxic deep-sea brine pools of the Red Sea are a promising source of novel enzymes with unique evolutionary adaptation. Sequencing data from Red Sea brine pool cultures and SAGs are annotated and stored in the Integrated Data Warehouse of Microbial Genomes (INDIGO) data warehouse. Low sequence homology of annotated genes (no similarity for 35% of these genes) may translate into false positives when searching for specific functions. The Profile and Pattern Matching (PPM) strategy described here was developed to eliminate false positive annotations of enzyme function before progressing to labor-intensive hyper-saline gene expression and characterization. It utilizes InterPro-derived Gene Ontology (GO)-terms (which represent enzyme function profiles) and annotated relevant PROSITE IDs (which are linked to an amino acid consensus pattern). The PPM algorithm was tested on 15 protein families, which were selected based on scientific and commercial potential. An initial list of 2577 enzyme commission (E.C.) numbers was translated into 171 GO-terms and 49 consensus patterns. A subset of INDIGO-sequences consisting of 58 SAGs from six different taxons of bacteria and archaea were selected from six different brine pool environments. Those SAGs code for 74,516 genes, which were independently scanned for the GO-terms (profile filter) and PROSITE IDs (pattern filter). Following stringent reliability filtering, the non-redundant hits (106 profile hits and 147 pattern hits) are classified as reliable, if at least two relevant descriptors (GO-terms and/or consensus patterns) are present. 
    Scripts for annotation, as well as for the PPM algorithm, are available through the INDIGO website. PMID:24778629

  4. VideoANT: Extending Online Video Annotation beyond Content Delivery

    ERIC Educational Resources Information Center

    Hosack, Bradford

    2010-01-01

    This paper expands the boundaries of video annotation in education by outlining the need for extended interaction in online video use, identifying the challenges faced by existing video annotation tools, and introducing Video-ANT, a tool designed to create text-based annotations integrated within the time line of a video hosted online. Several…

  5. High-throughput comparison, functional annotation, and metabolic modeling of plant genomes using the PlantSEED resource

    USDA-ARS's Scientific Manuscript database

    The increasing number of sequenced plant genomes is placing new demands on the methods applied to analyze, annotate, and model these genomes. Today's annotation pipelines result in inconsistent gene assignments that complicate comparative analyses and prevent efficient construction of metabolic mode...

  6. The root transcriptome for North American ginseng assembled and profiled across seasonal development

    PubMed Central

    2013-01-01

    Background Ginseng, including North American ginseng (Panax quinquefolius L.), is one of the most widely used medicinal plants. Its success is thought to be due to a diverse collection of ginsenosides that serve as its major bioactive compounds. However, few genomic resources exist and the details concerning its various biosynthetic pathways remain poorly understood. As the root is the primary tissue harvested commercially for ginsenosides, next generation sequencing was applied to the characterization and assembly of the root transcriptome throughout seasonal development. Transcripts showing homology to ginsenoside biosynthesis enzymes were profiled in greater detail. Results RNA extracts from root samples from seven development stages of North American ginseng were subjected to 454 sequencing, filtered for quality and used in the de novo assembly of a collective root reference transcriptome consisting of 41,623 transcripts. Annotation efforts using a number of public databases resulted in detailed annotation information for 34,801 (84%) transcripts. In addition, 3,955 genes were assigned to metabolic pathways using the Kyoto Encyclopedia of Genes and Genomes. Among our results, we found all of the known enzymes involved in the ginsenoside backbone biosynthesis and used co-expression analysis to identify a number of candidate sequences involved in the latter stages of the ginsenoside biosynthesis pathway. Transcript profiles suggest ginsenoside biosynthesis occurs at distinct stages of development. Conclusions The assembly generated provides a comprehensive annotated reference for future transcriptomic study of North American ginseng. A collection of putative ginsenoside biosynthesis genes was identified and candidate genes predicted from the lesser understood downstream stages of biosynthesis. Transcript expression profiles across seasonal development suggest a primary dammarane-type ginsenoside biosynthesis occurs just prior to plant senescence, with secondary ginsenoside production occurring throughout development. Data from the study provide a valuable resource for conducting future ginsenoside biosynthesis research in this important medicinal plant. PMID:23957709

  7. Mitochondrial Disease Sequence Data Resource (MSeqDR): a global grass-roots consortium to facilitate deposition, curation, annotation, and integrated analysis of genomic data for the mitochondrial disease clinical and research communities.

    PubMed

    Falk, Marni J; Shen, Lishuang; Gonzalez, Michael; Leipzig, Jeremy; Lott, Marie T; Stassen, Alphons P M; Diroma, Maria Angela; Navarro-Gomez, Daniel; Yeske, Philip; Bai, Renkui; Boles, Richard G; Brilhante, Virginia; Ralph, David; DaRe, Jeana T; Shelton, Robert; Terry, Sharon F; Zhang, Zhe; Copeland, William C; van Oven, Mannis; Prokisch, Holger; Wallace, Douglas C; Attimonelli, Marcella; Krotoski, Danuta; Zuchner, Stephan; Gai, Xiaowu

    2015-03-01

    Success rates for genomic analyses of highly heterogeneous disorders can be greatly improved if a large cohort of patient data is assembled to enhance collective capabilities for accurate sequence variant annotation, analysis, and interpretation. Indeed, molecular diagnostics requires the establishment of robust data resources to enable data sharing that informs accurate understanding of genes, variants, and phenotypes. The "Mitochondrial Disease Sequence Data Resource (MSeqDR) Consortium" is a grass-roots effort facilitated by the United Mitochondrial Disease Foundation to identify and prioritize specific genomic data analysis needs of the global mitochondrial disease clinical and research community. A central Web portal (https://mseqdr.org) facilitates the coherent compilation, organization, annotation, and analysis of sequence data from both nuclear and mitochondrial genomes of individuals and families with suspected mitochondrial disease. This Web portal provides users with a flexible and expandable suite of resources to enable variant-, gene-, and exome-level sequence analysis in a secure, Web-based, and user-friendly fashion. Users can also elect to share data with other MSeqDR Consortium members, or even the general public, either by custom annotation tracks or through the use of a convenient distributed annotation system (DAS) mechanism. A range of data visualization and analysis tools are provided to facilitate user interrogation and understanding of genomic, and ultimately phenotypic, data of relevance to mitochondrial biology and disease. Currently available tools for nuclear and mitochondrial gene analyses include an MSeqDR GBrowse instance that hosts optimized mitochondrial disease and mitochondrial DNA (mtDNA) specific annotation tracks, as well as an MSeqDR locus-specific database (LSDB) that curates variant data on more than 1300 genes that have been implicated in mitochondrial disease and/or encode mitochondria-localized proteins. 
MSeqDR is integrated with a diverse array of mtDNA data analysis tools that are both freestanding and incorporated into an online exome-level dataset curation and analysis resource (GEM.app) that is being optimized to support needs of the MSeqDR community. In addition, MSeqDR supports mitochondrial disease phenotyping and ontology tools, and provides variant pathogenicity assessment features that enable community review, feedback, and integration with the public ClinVar variant annotation resource. A centralized Web-based informed consent process is being developed, with implementation of a Global Unique Identifier (GUID) system to integrate data deposited on a given individual from different sources. Community-based data deposition into MSeqDR has already begun. Future efforts will enhance capabilities to incorporate phenotypic data that enhance genomic data analyses. MSeqDR will fill the existing void in bioinformatics tools and centralized knowledge that are necessary to enable efficient nuclear and mtDNA genomic data interpretation by a range of stakeholders across both clinical diagnostic and research settings. Ultimately, MSeqDR is focused on empowering the global mitochondrial disease community to better define and explore mitochondrial diseases. Copyright © 2014 Elsevier Inc. All rights reserved.

  8. Mitochondrial Disease Sequence Data Resource (MSeqDR): A global grass-roots consortium to facilitate deposition, curation, annotation, and integrated analysis of genomic data for the mitochondrial disease clinical and research communities

    PubMed Central

    Falk, Marni J.; Shen, Lishuang; Gonzalez, Michael; Leipzig, Jeremy; Lott, Marie T.; Stassen, Alphons P.M.; Diroma, Maria Angela; Navarro-Gomez, Daniel; Yeske, Philip; Bai, Renkui; Boles, Richard G.; Brilhante, Virginia; Ralph, David; DaRe, Jeana T.; Shelton, Robert; Terry, Sharon; Zhang, Zhe; Copeland, William C.; van Oven, Mannis; Prokisch, Holger; Wallace, Douglas C.; Attimonelli, Marcella; Krotoski, Danuta; Zuchner, Stephan; Gai, Xiaowu

    2014-01-01

    Success rates for genomic analyses of highly heterogeneous disorders can be greatly improved if a large cohort of patient data is assembled to enhance collective capabilities for accurate sequence variant annotation, analysis, and interpretation. Indeed, molecular diagnostics requires the establishment of robust data resources to enable data sharing that informs accurate understanding of genes, variants, and phenotypes. The “Mitochondrial Disease Sequence Data Resource (MSeqDR) Consortium” is a grass-roots effort facilitated by the United Mitochondrial Disease Foundation to identify and prioritize specific genomic data analysis needs of the global mitochondrial disease clinical and research community. A central Web portal (https://mseqdr.org) facilitates the coherent compilation, organization, annotation, and analysis of sequence data from both nuclear and mitochondrial genomes of individuals and families with suspected mitochondrial disease. This Web portal provides users with a flexible and expandable suite of resources to enable variant-, gene-, and exome-level sequence analysis in a secure, Web-based, and user-friendly fashion. Users can also elect to share data with other MSeqDR Consortium members, or even the general public, either by custom annotation tracks or through use of a convenient distributed annotation system (DAS) mechanism. A range of data visualization and analysis tools are provided to facilitate user interrogation and understanding of genomic, and ultimately phenotypic, data of relevance to mitochondrial biology and disease. Currently available tools for nuclear and mitochondrial gene analyses include an MSeqDR GBrowse instance that hosts optimized mitochondrial disease and mitochondrial DNA (mtDNA) specific annotation tracks, as well as an MSeqDR locus-specific database (LSDB) that curates variant data on more than 1,300 genes that have been implicated in mitochondrial disease and/or encode mitochondria-localized proteins. 
MSeqDR is integrated with a diverse array of mtDNA data analysis tools that are both freestanding and incorporated into an online exome-level dataset curation and analysis resource (GEM.app) that is being optimized to support needs of the MSeqDR community. In addition, MSeqDR supports mitochondrial disease phenotyping and ontology tools, and provides variant pathogenicity assessment features that enable community review, feedback, and integration with the public ClinVar variant annotation resource. A centralized Web-based informed consent process is being developed, with implementation of a Global Unique Identifier (GUID) system to integrate data deposited on a given individual from different sources. Community-based data deposition into MSeqDR has already begun. Future efforts will enhance capabilities to incorporate phenotypic data that enhance genomic data analyses. MSeqDR will fill the existing void in bioinformatics tools and centralized knowledge that are necessary to enable efficient nuclear and mtDNA genomic data interpretation by a range of stakeholders across both clinical diagnostic and research settings. Ultimately, MSeqDR is focused on empowering the global mitochondrial disease community to better define and explore mitochondrial disease. PMID:25542617

  9. Transcriptome database resource and gene expression atlas for the rose

    PubMed Central

    2012-01-01

    Background For centuries roses have been selected based on a number of traits. Little information exists on the genetic and molecular basis that contributes to these traits, mainly because information on expressed genes for this economically important ornamental plant is scarce. Results Here, we used a combination of Illumina and 454 sequencing technologies to generate information on Rosa sp. transcripts using RNA from various tissues and in response to biotic and abiotic stresses. A total of 80714 transcript clusters were identified and 76611 peptides have been predicted, among which 20997 have been clustered into 13900 protein families. BLASTp hits in closely related Rosaceae species revealed that about half of the predicted peptides in the strawberry and peach genomes have orthologs in the Rosa dataset. Digital expression was obtained using RNA samples from organs at different development stages and under different stress conditions. qPCR validated the digital expression data for a selection of 23 genes with high or low expression levels. Comparative gene expression analyses between the different tissues and organs allowed the identification of clusters that are highly enriched in given tissues or under particular conditions, demonstrating the usefulness of the digital gene expression analysis. A web interface, ROSAseq, was created that allows data interrogation by BLAST, subsequent analysis of DNA clusters and access to thorough transcript annotation including best BLAST matches on Fragaria vesca, Prunus persica and Arabidopsis. The rose peptides dataset was used to create the ROSAcyc resource pathway database that allows access to the putative genes and enzymatic pathways. Conclusions The study provides useful information on Rosa expressed genes, with thorough annotation and an overview of expression patterns for transcripts with good accuracy. PMID:23164410

  10. Computational annotation of genes differentially expressed along olive fruit development

    PubMed Central

    Galla, Giulio; Barcaccia, Gianni; Ramina, Angelo; Collani, Silvio; Alagna, Fiammetta; Baldoni, Luciana; Cultrera, Nicolò GM; Martinelli, Federico; Sebastiani, Luca; Tonutti, Pietro

    2009-01-01

    Background Olea europaea L. is a traditional tree crop of the Mediterranean basin with a high economic impact worldwide. Unlike for other fruit tree species, little is known about the physiological and molecular basis of olive fruit development, and few sequences of genes and gene products are available for olive in public databases. This study deals with the identification of large sets of differentially expressed genes in developing olive fruits and the subsequent computational annotation by means of different software tools. Results mRNA from fruits of the cv. Leccino sampled at three different stages [i.e., initial fruit set (stage 1), completed pit hardening (stage 2) and veraison (stage 3)] was used for the identification of differentially expressed genes putatively involved in main processes along fruit development. Four subtractive hybridization libraries were constructed: forward and reverse between stage 1 and 2 (libraries A and B), and 2 and 3 (libraries C and D). All sequenced clones (1,132 in total) were analyzed through BlastX against non-redundant NCBI databases and about 60% of them showed similarity to known proteins. A total of 89 out of 642 differentially expressed unique sequences was further investigated by Real-Time PCR, showing a validation of the SSH results as high as 69%. Library-specific cDNA repertoires were annotated according to the three main vocabularies of the gene ontology (GO): cellular component, biological process and molecular function. BlastX analysis, GO terms mapping and annotation analysis were performed using the Blast2GO software, a research tool designed with the main purpose of enabling GO based data mining on sequence sets for which no GO annotation is yet available. Bioinformatic analysis pointed out a significantly different distribution of the annotated sequences for each GO category, when comparing the three fruit developmental stages. 
The olive fruit-specific transcriptome dataset was used to query all known KEGG (Kyoto Encyclopaedia of Genes and Genomes) metabolic pathways for characterizing and positioning retrieved EST records. The integration of the olive sequence datasets within the MapMan platform for microarray analysis allowed the identification of specific biosynthetic pathways useful for the definition of key functional categories in time course analyses for gene groups. Conclusion The bioinformatic annotation of all gene sequences was useful to shed light on metabolic pathways and transcriptional aspects related to carbohydrates, fatty acids, secondary metabolites, transcription factors and hormones as well as response to biotic and abiotic stresses throughout olive drupe development. These results represent a first step toward both functional genomics and systems biology research for understanding the gene functions and regulatory networks in olive fruit growth and ripening. PMID:19852839

  11. 'RetinoGenetics': a comprehensive mutation database for genes related to inherited retinal degeneration.

    PubMed

    Ran, Xia; Cai, Wei-Jun; Huang, Xiu-Feng; Liu, Qi; Lu, Fan; Qu, Jia; Wu, Jinyu; Jin, Zi-Bing

    2014-01-01

    Inherited retinal degeneration (IRD), a leading cause of human blindness worldwide, is exceptionally heterogeneous with clinical heterogeneity and genetic variety. During the past decades, tremendous efforts have been made to explore the complex heterogeneity, and massive mutations have been identified in different genes underlying IRD with the significant advancement of sequencing technology. In this study, we developed a comprehensive database, 'RetinoGenetics', which contains informative knowledge about all known IRD-related genes and mutations for IRD. 'RetinoGenetics' currently contains 4270 mutations in 186 genes, with detailed information associated with 164 phenotypes from 934 publications and various types of functional annotations. Then extensive annotations were performed to each gene using various resources, including Gene Ontology, KEGG pathways, protein-protein interaction, mutational annotations and gene-disease network. Furthermore, by using the search functions, convenient browsing ways and intuitive graphical displays, 'RetinoGenetics' could serve as a valuable resource for unveiling the genetic basis of IRD. Taken together, 'RetinoGenetics' is an integrative, informative and updatable resource for IRD-related genetic predispositions. Database URL: http://www.retinogenetics.org/. © The Author(s) 2014. Published by Oxford University Press.

  12. PRGdb 3.0: a comprehensive platform for prediction and analysis of plant disease resistance genes.

    PubMed

    Osuna-Cruz, Cristina M; Paytuvi-Gallart, Andreu; Di Donato, Antimo; Sundesha, Vicky; Andolfo, Giuseppe; Aiese Cigliano, Riccardo; Sanseverino, Walter; Ercolano, Maria R

    2018-01-04

    The Plant Resistance Genes database (PRGdb; http://prgdb.org) has been redesigned with a new user interface, new sections, new tools and new data for genetic improvement, allowing easy access not only to the plant science research community but also to breeders who want to improve plant disease resistance. The home page offers an overview of easy-to-read search boxes that streamline data queries and directly show plant species for which data from candidate or cloned genes have been collected. Bulk data files and curated resistance gene annotations are made available for each plant species hosted. The new Gene Model view offers detailed information on each cloned resistance gene structure to highlight shared attributes with other genes. PRGdb 3.0 offers 153 reference resistance genes and 177 072 annotated candidate Pathogen Receptor Genes (PRGs). Compared to the previous release, the number of putative genes has been increased from 106 to 177 K from 76 sequenced Viridiplantae and algae genomes. The DRAGO 2 tool, which automatically annotates and predicts PRGs from DNA and amino acid sequences with high accuracy and sensitivity, has been added. BLAST search has been implemented to offer users the opportunity to annotate and compare their own sequences. The improved section on plant diseases displays useful information linked to genes and genomes to connect complementary data and better address specific needs. Through a revised and enlarged collection of data, the development of new tools and a renewed portal, PRGdb 3.0 engages the plant science community in developing a consensus plan to improve knowledge and strategies to fight diseases that afflict main crops and other plants. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.

  13. RNA-Seq analysis and annotation of a draft blueberry genome assembly identifies candidate genes involved in fruit ripening, biosynthesis of bioactive compounds, and stage-specific alternative splicing.

    PubMed

    Gupta, Vikas; Estrada, April D; Blakley, Ivory; Reid, Rob; Patel, Ketan; Meyer, Mason D; Andersen, Stig Uggerhøj; Brown, Allan F; Lila, Mary Ann; Loraine, Ann E

    2015-01-01

    Blueberries are a rich source of antioxidants and other beneficial compounds that can protect against disease. Identifying genes involved in synthesis of bioactive compounds could enable the breeding of berry varieties with enhanced health benefits. Toward this end, we annotated a previously sequenced draft blueberry genome assembly using RNA-Seq data from five stages of berry fruit development and ripening. Genome-guided assembly of RNA-Seq read alignments combined with output from ab initio gene finders produced around 60,000 gene models, of which more than half were similar to proteins from other species, typically the grape Vitis vinifera. Comparison of gene models to the PlantCyc database of metabolic pathway enzymes identified candidate genes involved in synthesis of bioactive compounds, including bixin, an apocarotenoid with potential disease-fighting properties, and defense-related cyanogenic glycosides, which are toxic. Cyanogenic glycoside (CG) biosynthetic enzymes were highly expressed in green fruit, and a candidate CG detoxification enzyme was up-regulated during fruit ripening. Candidate genes for ethylene, anthocyanin, and 400 other biosynthetic pathways were also identified. Homology-based annotation using Blast2GO and InterPro assigned Gene Ontology terms to around 15,000 genes. RNA-Seq expression profiling showed that blueberry growth, maturation, and ripening involve dynamic gene expression changes, including coordinated up- and down-regulation of metabolic pathway enzymes and transcriptional regulators. Analysis of RNA-seq alignments identified developmentally regulated alternative splicing, promoter use, and 3' end formation. We report genome sequence, gene models, functional annotations, and RNA-Seq expression data that provide an important new resource enabling high throughput studies in blueberry.

  14. Exploiting proteomic data for genome annotation and gene model validation in Aspergillus niger

    PubMed Central

    Wright, James C; Sugden, Deana; Francis-McIntyre, Sue; Riba-Garcia, Isabel; Gaskell, Simon J; Grigoriev, Igor V; Baker, Scott E; Beynon, Robert J; Hubbard, Simon J

    2009-01-01

    Background Proteomic data is a potentially rich, but arguably unexploited, data source for genome annotation. Peptide identifications from tandem mass spectrometry provide prima facie evidence for gene predictions and can discriminate among a set of candidate gene models. Here we apply this to the recently sequenced Aspergillus niger fungal genome from the Joint Genome Institute (JGI) and another predicted protein set from another A. niger sequence. Tandem mass spectra (MS/MS) were acquired from 1D gel electrophoresis bands and searched against all available gene models using Average Peptide Scoring (APS) and reverse database searching to produce confident identifications at an acceptable false discovery rate (FDR). Results 405 identified peptide sequences were mapped to 214 different A. niger genomic loci to which 4093 predicted gene models clustered, 2872 of which contained the mapped peptides. Interestingly, 13 (6%) of these loci either had no preferred predicted gene model or the genome annotators' chosen "best" model for that genomic locus was not found to be the most parsimonious match to the identified peptides. The peptides identified also boosted confidence in predicted gene structures spanning 54 introns from different gene models. Conclusion This work highlights the potential of integrating experimental proteomics data into genomic annotation pipelines much as expressed sequence tag (EST) data has been. A comparison with the published genome from another strain of A. niger sequenced by DSM showed that a number of the gene models or proteins with proteomics evidence did not occur in both genomes, further highlighting the utility of the method. PMID:19193216
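
    Reverse (decoy) database searching, as mentioned in this abstract, is commonly used to estimate an FDR by counting how many decoy hits pass a score threshold. The following is only a minimal sketch of that general idea. The function name, the (score, is_decoy) input layout, the example scores and the simple decoys/targets estimator are all assumptions made for illustration; this is not the Average Peptide Scoring procedure used by the authors.

```python
from typing import Iterable, Tuple


def decoy_fdr(matches: Iterable[Tuple[float, bool]], threshold: float) -> float:
    """Estimate FDR as (#decoy hits) / (#target hits) at or above `threshold`.

    `matches` is a hypothetical list of (score, is_decoy) pairs, where
    is_decoy is True when the hit came from the reversed database.
    """
    targets = 0
    decoys = 0
    for score, is_decoy in matches:
        if score >= threshold:
            if is_decoy:
                decoys += 1
            else:
                targets += 1
    if targets == 0:
        # No accepted target hits: nothing to report, so return 0.0.
        return 0.0
    return decoys / targets


# Made-up scores, for illustration only.
example_matches = [
    (9.1, False), (8.7, False), (8.2, True),
    (7.9, False), (7.5, False), (3.0, True),
]
example_fdr = decoy_fdr(example_matches, threshold=7.0)
```

    In practice the threshold is lowered or raised until the estimated FDR reaches the level considered acceptable for the study.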

  15. Evidence-based gene models for structural and functional annotations of the oil palm genome.

    PubMed

    Chan, Kuang-Lim; Tatarinova, Tatiana V; Rosli, Rozana; Amiruddin, Nadzirah; Azizi, Norazah; Halim, Mohd Amin Ab; Sanusi, Nik Shazana Nik Mohd; Jayanthi, Nagappan; Ponomarenko, Petr; Triska, Martin; Solovyev, Victor; Firdaus-Raih, Mohd; Sambanthamurthi, Ravigadevi; Murphy, Denis; Low, Eng-Ti Leslie

    2017-09-08

    Oil palm is an important source of edible oil. The importance of the crop, as well as its long breeding cycle (10-12 years) has led to the sequencing of its genome in 2013 to pave the way for genomics-guided breeding. Nevertheless, the first set of gene predictions, although useful, had many fragmented genes. Classification and characterization of genes associated with traits of interest, such as those for fatty acid biosynthesis and disease resistance, were also limited. Lipid-, especially fatty acid (FA)-related genes are of particular interest for the oil palm as they specify oil yields and quality. This paper presents the characterization of the oil palm genome using different gene prediction methods and comparative genomics analysis, identification of FA biosynthesis and disease resistance genes, and the development of an annotation database and bioinformatics tools. Using two independent gene-prediction pipelines, Fgenesh++ and Seqping, 26,059 oil palm genes with transcriptome and RefSeq support were identified from the oil palm genome. These coding regions of the genome have a characteristic broad distribution of GC3 (fraction of cytosine and guanine in the third position of a codon) with over half the GC3-rich genes (GC3 ≥ 0.75286) being intronless. In comparison, only one-seventh of the oil palm genes identified are intronless. Using comparative genomics analysis, characterization of conserved domains and active sites, and expression analysis, 42 key genes involved in FA biosynthesis in oil palm were identified. For three of them, namely EgFABF, EgFABH and EgFAD3, segmental duplication events were detected. Our analysis also identified 210 candidate resistance genes in six classes, grouped by their protein domain structures. 
We present an accurate and comprehensive annotation of the oil palm genome, focusing on analysis of important categories of genes (GC3-rich and intronless), as well as those associated with important functions, such as FA biosynthesis and disease resistance. The study demonstrated the advantages of having an integrated approach to gene prediction and developed a computational framework for combining multiple genome annotations. These results, available in the oil palm annotation database (http://palmxplore.mpob.gov.my), will provide important resources for studies on the genomes of oil palm and related crops. This article was reviewed by Alexander Kel, Igor Rogozin, and Vladimir A. Kuznetsov.
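
    The abstract defines GC3 as the fraction of cytosine and guanine in the third position of a codon and uses 0.75286 as the cutoff for GC3-rich genes. A minimal sketch of that calculation follows. The function names and the example sequence are hypothetical, and the handling of trailing partial codons is an assumption; the abstract does not describe how the authors implemented it.

```python
def gc3_fraction(cds: str) -> float:
    """Return the fraction of G or C bases at third codon positions.

    Assumes `cds` is an in-frame coding sequence; any trailing bases
    that do not form a complete codon are ignored.
    """
    seq = cds.upper()
    usable = len(seq) - len(seq) % 3      # length covered by whole codons
    third_bases = seq[2:usable:3]         # bases at positions 3, 6, 9, ...
    if not third_bases:
        raise ValueError("sequence is shorter than one codon")
    gc_count = sum(base in "GC" for base in third_bases)
    return gc_count / len(third_bases)


def is_gc3_rich(cds: str, cutoff: float = 0.75286) -> bool:
    """Apply the GC3-rich cutoff quoted in the abstract (GC3 >= 0.75286)."""
    return gc3_fraction(cds) >= cutoff


# Made-up sequence: codons ATG, GCC, AAA, TTC have third bases G, C, A, C.
example_gc3 = gc3_fraction("ATGGCCAAATTC")
```

    For the made-up sequence above, three of the four third-position bases are G or C, so the value is 0.75, just below the quoted cutoff.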

  16. CpGAVAS, an integrated web server for the annotation, visualization, analysis, and GenBank submission of completely sequenced chloroplast genome sequences

    PubMed Central

    2012-01-01

    Background The complete sequences of chloroplast genomes provide a wealth of information regarding the evolutionary history of species. With the advance of next-generation sequencing technology, the number of completely sequenced chloroplast genomes is expected to increase exponentially, and powerful computational tools for annotating these genome sequences are urgently needed. Results We have developed a web server, CpGAVAS. The server accepts a complete chloroplast genome sequence as input. First, it predicts protein-coding and rRNA genes based on the identification and mapping of the most similar, full-length protein, cDNA and rRNA sequences by integrating results from Blastx, Blastn, protein2genome and est2genome programs. Second, tRNA genes and inverted repeats (IR) are identified using tRNAscan, ARAGORN and vmatch respectively. Third, it calculates the summary statistics for the annotated genome. Fourth, it generates a circular map ready for publication. Fifth, it can create a Sequin file for GenBank submission. Last, it allows the extraction of protein and mRNA sequences for a given list of genes and species. The annotation results in GFF3 format can be edited using any compatible annotation editing tool. The edited annotations can then be uploaded to CpGAVAS for update and re-analyses repeatedly. Using known chloroplast genome sequences as test set, we show that CpGAVAS performs comparably to another application, DOGMA, while having several superior functionalities. Conclusions CpGAVAS allows the semi-automatic and complete annotation of a chloroplast genome sequence, and the visualization, editing and analysis of the annotation results. It will become an indispensable tool for researchers studying chloroplast genomes. The software is freely accessible from http://www.herbalgenomics.org/cpgavas. PMID:23256920
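
    The abstract notes that annotation results are provided in GFF3 format, which is a nine-column, tab-separated text format. As a rough illustration of what reading such a record involves, here is a minimal sketch of a single-line parser. The function name and the example record (sequence ID, source, coordinates and attributes) are invented for illustration and are not taken from actual server output.

```python
from urllib.parse import unquote

GFF3_COLUMNS = ("seqid", "source", "type", "start", "end",
                "score", "strand", "phase")


def parse_gff3_line(line: str):
    """Parse one GFF3 feature line into a dict, or return None for
    comment lines, directives and blank lines."""
    if not line.strip() or line.startswith("#"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("a GFF3 feature line must have 9 tab-separated columns")
    record = dict(zip(GFF3_COLUMNS, fields[:8]))
    record["start"] = int(record["start"])   # 1-based, inclusive
    record["end"] = int(record["end"])
    attributes = {}
    if fields[8] != ".":
        # Column 9 holds ';'-separated tag=value pairs, URL-escaped.
        for pair in fields[8].split(";"):
            if pair:
                key, _, value = pair.partition("=")
                attributes[unquote(key)] = unquote(value)
    record["attributes"] = attributes
    return record


# Hypothetical record, for illustration only.
example_line = "chrC\texample\tgene\t100\t1161\t.\t+\t.\tID=gene1;Name=psbA"
example_record = parse_gff3_line(example_line)
```

    A record parsed this way can be edited and written back out as tab-separated text, which is roughly what the edit and re-upload cycle described in the abstract relies on.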

  17. Manual Gene Ontology annotation workflow at the Mouse Genome Informatics Database.

    PubMed

    Drabkin, Harold J; Blake, Judith A

    2012-01-01

    The Mouse Genome Database, the Gene Expression Database and the Mouse Tumor Biology database are integrated components of the Mouse Genome Informatics (MGI) resource (http://www.informatics.jax.org). The MGI system presents both a consensus view and an experimental view of the knowledge concerning the genetics and genomics of the laboratory mouse. From genotype to phenotype, this information resource integrates information about genes, sequences, maps, expression analyses, alleles, strains and mutant phenotypes. Comparative mammalian data are also presented particularly in regards to the use of the mouse as a model for the investigation of molecular and genetic components of human diseases. These data are collected from literature curation as well as downloads of large datasets (SwissProt, LocusLink, etc.). MGI is one of the founding members of the Gene Ontology (GO) and uses the GO for functional annotation of genes. Here, we discuss the workflow associated with manual GO annotation at MGI, from literature collection to display of the annotations. Peer-reviewed literature is collected mostly from a set of journals available electronically. Selected articles are entered into a master bibliography and indexed to one of eight areas of interest such as 'GO' or 'homology' or 'phenotype'. Each article is then either indexed to a gene already contained in the database or funneled through a separate nomenclature database to add genes. The master bibliography and associated indexing provide information for various curator-reports such as 'papers selected for GO that refer to genes with NO GO annotation'. Once indexed, curators who have expertise in appropriate disciplines enter pertinent information. MGI makes use of several controlled vocabularies that ensure uniform data encoding, enable robust analysis and support the construction of complex queries. These vocabularies range from pick-lists to structured vocabularies such as the GO. 
All data associations are supported with statements of evidence as well as access to source publications.

  18. Annotated Bibliography of Textbooks and Reference Materials in Marine Sciences. Provisional Edition. Intergovernmental Oceanographic Commission, Technical Series.

    ERIC Educational Resources Information Center

    United Nations Educational, Scientific, and Cultural Organization, Paris (France). Intergovernmental Oceanographic Commission.

    Presented is an annotated bibliography based on selected materials from a preliminary survey of existing bibliographies, publishers' listings, and other sources. It is intended to serve educators and researchers, especially those in countries where marine sciences are just developing. One hundred annotated and 450 non-annotated entries are…

  19. ConsPred: a rule-based (re-)annotation framework for prokaryotic genomes.

    PubMed

    Weinmaier, Thomas; Platzer, Alexander; Frank, Jeroen; Hellinger, Hans-Jörg; Tischler, Patrick; Rattei, Thomas

    2016-11-01

    The rapidly growing number of available prokaryotic genome sequences requires fully automated and high-quality software solutions for their initial and re-annotation. Here we present ConsPred, a prokaryotic genome annotation framework that performs intrinsic gene predictions, homology searches, predictions of non-coding genes as well as CRISPR repeats and integrates all evidence into a consensus annotation. ConsPred achieves comprehensive, high-quality annotations based on rules and priorities, similar to decision-making in manual curation and avoids conflicting predictions. Parameters controlling the annotation process are configurable by the user. ConsPred has been used in the institutions of the authors for more than 5 years and can easily be extended and adapted to specific needs. The ConsPred algorithm for producing a consensus from the varying scores of multiple gene prediction programs approaches manual curation in accuracy. Its rule-based approach for choosing final predictions avoids overriding previous manual curations. ConsPred is implemented in Java, Perl and Shell and is freely available under the Creative Commons license as a stand-alone in-house pipeline or as an Amazon Machine Image for cloud computing, see https://sourceforge.net/projects/conspred/. Contact: thomas.rattei@univie.ac.at. Supplementary information: Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.

  20. Improving the genome annotation of the acarbose producer Actinoplanes sp. SE50/110 by sequencing enriched 5'-ends of primary transcripts.

    PubMed

    Schwientek, Patrick; Neshat, Armin; Kalinowski, Jörn; Klein, Andreas; Rückert, Christian; Schneiker-Bekel, Susanne; Wendler, Sergej; Stoye, Jens; Pühler, Alfred

    2014-11-20

    Actinoplanes sp. SE50/110 is the producer of the alpha-glucosidase inhibitor acarbose, which is an economically relevant and potent drug in the treatment of type-2 diabetes mellitus. In this study, we present the detection of transcription start sites on this genome by sequencing enriched 5'-ends of primary transcripts. Altogether, 1427 putative transcription start sites were initially identified. With help of the annotated genome sequence, 661 transcription start sites were found to belong to the leader region of protein-coding genes with the surprising result that roughly 20% of these genes rank among the class of leaderless transcripts. Next, conserved promoter motifs were identified for protein-coding genes with and without leader sequences. The mapped transcription start sites were finally used to improve the annotation of the Actinoplanes sp. SE50/110 genome sequence. Concerning protein-coding genes, 41 translation start sites were corrected and 9 novel protein-coding genes could be identified. In addition to this, 122 previously undetermined non-coding RNA (ncRNA) genes of Actinoplanes sp. SE50/110 were defined. Focusing on antisense transcription start sites located within coding genes or their leader sequences, it was discovered that 96 of those ncRNA genes belong to the class of antisense RNA (asRNA) genes. The remaining 26 ncRNA genes were found outside of known protein-coding genes. Four chosen examples of prominent ncRNA genes, namely the transfer messenger RNA gene ssrA, the ribonuclease P class A RNA gene rnpB, the cobalamin riboswitch RNA gene cobRS, and the selenocysteine-specific tRNA gene selC, are presented in more detail. This study demonstrates that sequencing of enriched 5'-ends of primary transcripts and the identification of transcription start sites are valuable tools for advanced genome annotation of Actinoplanes sp. SE50/110 and most probably also for other bacteria. Copyright © 2014 Elsevier B.V. All rights reserved.

  1. De novo characterisation of the greenlip abalone transcriptome (Haliotis laevigata) with a focus on the heat shock protein 70 (HSP70) family.

    PubMed

    Shiel, Brett P; Hall, Nathan E; Cooke, Ira R; Robinson, Nicholas A; Strugnell, Jan M

    2015-02-01

    Abalone (Haliotis) are economically important molluscs for fisheries and aquaculture industries worldwide. Despite this, genomic resources for abalone and molluscs are still limited. Here we present a description and functional annotation of the greenlip abalone (Haliotis laevigata) transcriptome. We present a focused analysis on the heat shock protein 70 (HSP70) family of genes with putative functions affecting temperature stress and immunity. A total of ~38 million paired end Illumina reads were obtained, resulting in a Trinity assembly of 222,172 contigs with minimum length of 200 base pairs and maximum length of 33 kilobases. The 20,702 contigs were annotated with gene descriptions by BLAST. We created a program to maximise the number of functionally annotated genes, and over 10,000 contigs were assigned Gene ontologies (GO terms). By using CateGOrizer, immunity related GO terms for stressors such as heat, hypoxia, oxidative stress and wounding received the highest counts. Twenty-six contigs with homology to the HSP70 family of genes were identified. Ninety-one putative single-nucleotide polymorphisms were observed in the abalone HSP70 contigs. Eleven of these were considered non-synonymous. The annotated transcriptome described in this study will be a useful basis for future work investigating the genetic response of abalone to stress.

  2. Validation of MIMGO: a method to identify differentially expressed GO terms in a microarray dataset

    PubMed Central

    2012-01-01

    Background We previously proposed an algorithm for the identification of GO terms that commonly annotate genes whose expression is upregulated or downregulated in some microarray data compared with in other microarray data. We call these “differentially expressed GO terms” and have named the algorithm “matrix-assisted identification method of differentially expressed GO terms” (MIMGO). MIMGO can also identify microarray data in which genes annotated with a differentially expressed GO term are upregulated or downregulated. However, MIMGO has not yet been validated on a real microarray dataset using all available GO terms. Findings We combined Gene Set Enrichment Analysis (GSEA) with MIMGO to identify differentially expressed GO terms in a yeast cell cycle microarray dataset. GSEA followed by MIMGO (GSEA + MIMGO) correctly identified (p < 0.05) microarray data in which genes annotated to differentially expressed GO terms are upregulated. We found that GSEA + MIMGO was slightly less effective than, or comparable to, GSEA (Pearson), a method that uses Pearson’s correlation as a metric, at detecting true differentially expressed GO terms. However, unlike other methods including GSEA (Pearson), GSEA + MIMGO can comprehensively identify the microarray data in which genes annotated with a differentially expressed GO term are upregulated or downregulated. Conclusions MIMGO is a reliable method to identify differentially expressed GO terms comprehensively. PMID:23232071

  3. CuGene as a tool to view and explore genomic data

    NASA Astrophysics Data System (ADS)

    Haponiuk, Michał; Pawełkowicz, Magdalena; Przybecki, Zbigniew; Nowak, Robert M.

    2017-08-01

    Integrated CuGene is an easy-to-use, open-source, on-line tool that can be used to browse, analyze, and query genomic data and annotations. It places annotation tracks beneath genome coordinate positions, allowing rapid visual correlation of different types of information. It also allows users to upload and display their own experimental results or annotation sets. An important function of the application is the ability to find similarity between sequences using four algorithms of differing accuracy. The presented tool was tested on real genomic data and is extensively used by the Polish Consortium of Cucumber Genome Sequencing.

  4. De novo transcriptome assembly and quantification reveal differentially expressed genes between soft-seed and hard-seed pomegranate (Punica granatum L.).

    PubMed

    Xue, Hui; Cao, Shangyin; Li, Haoxian; Zhang, Jie; Niu, Juan; Chen, Lina; Zhang, Fuhong; Zhao, Diguang

    2017-01-01

    Pomegranate (Punica granatum L.) belongs to Punicaceae, and is valued for its social, ecological, economic, and aesthetic values, as well as more recently for its health benefits. The 'Tunisia' variety has softer seeds and big arils that are easily swallowed. It is a widely popular fruit; however, the molecular mechanisms of the formation of hard and soft seeds are not yet clear. We conducted a de novo assembly of the seed transcriptome in P. granatum L. and revealed differential gene expression between the soft-seed and hard-seed pomegranate varieties. A total of 35.1 Gb of data were acquired in this study, including 280,881,106 raw reads. Additionally, de novo transcriptome assembly generated 132,287 transcripts and 105,743 representative unigenes; approximately 13,805 unigenes (37.7%) were longer than 1,000 bp. Using bioinformatics annotation libraries, a total of 76,806 unigenes were annotated and, among the high-quality reads, 72.63% had at least one significant match to an existing gene model. Gene expression and differentially expressed genes were analyzed. The seed formation of the two pomegranate cultivars involves lignin biosynthesis and metabolism, including some genes encoding laccase and peroxidase, WRKY, MYB, and NAC transcription factors. In the hard-seed pomegranate, lignin-related genes and cellulose synthesis-related genes were highly expressed; in soft-seed pomegranates, expression of genes related to flavonoids and programmed cell death was slightly higher. We validated selection of the identified genes using qRT-PCR. This is the first transcriptome analysis of P. granatum L. This transcription sequencing greatly enriched the pomegranate molecular database, and the high-quality SSRs generated in this study will aid the gene cloning from pomegranate in the future. It provides important insights into the molecular mechanisms underlying the formation of soft seeds in pomegranate.

  5. De novo transcriptome assembly and quantification reveal differentially expressed genes between soft-seed and hard-seed pomegranate (Punica granatum L.)

    PubMed Central

    Xue, Hui; Cao, Shangyin; Li, Haoxian; Zhang, Jie; Niu, Juan; Chen, Lina; Zhang, Fuhong; Zhao, Diguang

    2017-01-01

    Pomegranate (Punica granatum L.) belongs to Punicaceae, and is valued for its social, ecological, economic, and aesthetic values, as well as more recently for its health benefits. The ‘Tunisia’ variety has softer seeds and big arils that are easily swallowed. It is a widely popular fruit; however, the molecular mechanisms of the formation of hard and soft seeds are not yet clear. We conducted a de novo assembly of the seed transcriptome in P. granatum L. and revealed differential gene expression between the soft-seed and hard-seed pomegranate varieties. A total of 35.1 Gb of data were acquired in this study, including 280,881,106 raw reads. Additionally, de novo transcriptome assembly generated 132,287 transcripts and 105,743 representative unigenes; approximately 13,805 unigenes (37.7%) were longer than 1,000 bp. Using bioinformatics annotation libraries, a total of 76,806 unigenes were annotated and, among the high-quality reads, 72.63% had at least one significant match to an existing gene model. Gene expression and differentially expressed genes were analyzed. The seed formation of the two pomegranate cultivars involves lignin biosynthesis and metabolism, including some genes encoding laccase and peroxidase, WRKY, MYB, and NAC transcription factors. In the hard-seed pomegranate, lignin-related genes and cellulose synthesis-related genes were highly expressed; in soft-seed pomegranates, expression of genes related to flavonoids and programmed cell death was slightly higher. We validated selection of the identified genes using qRT-PCR. This is the first transcriptome analysis of P. granatum L. This transcription sequencing greatly enriched the pomegranate molecular database, and the high-quality SSRs generated in this study will aid the gene cloning from pomegranate in the future. It provides important insights into the molecular mechanisms underlying the formation of soft seeds in pomegranate. PMID:28594931

  6. FARO server: Meta-analysis of gene expression by matching gene expression signatures to a compendium of public gene expression data.

    PubMed

    Manijak, Mieszko P; Nielsen, Henrik B

    2011-06-11

    Although systematic analysis of gene annotation is a powerful tool for interpreting gene expression data, it is sometimes blurred by incomplete gene annotation, missing expression responses of key genes, and secondary gene expression responses. These shortcomings may be partially circumvented by instead matching gene expression signatures to signatures of other experiments. To facilitate this we present the Functional Association Response by Overlap (FARO) server, which matches input signatures to a compendium of 242 gene expression signatures extracted from more than 1700 Arabidopsis microarray experiments. We thereby present a publicly available tool for robust characterization of Arabidopsis gene expression experiments, which can point to similar experimental factors in other experiments. The server is available at http://www.cbs.dtu.dk/services/faro/.

  7. Comprehensive Annotation of the Parastagonospora nodorum Reference Genome Using Next-Generation Genomics, Transcriptomics and Proteogenomics

    PubMed Central

    Dodhia, Kejal; Stoll, Thomas; Hastie, Marcus; Furuki, Eiko; Ellwood, Simon R.; Williams, Angela H.; Tan, Yew-Foon; Testa, Alison C.; Gorman, Jeffrey J.; Oliver, Richard P.

    2016-01-01

    Parastagonospora nodorum, the causal agent of Septoria nodorum blotch (SNB), is an economically important pathogen of wheat (Triticum spp.), and a model for the study of necrotrophic pathology and genome evolution. The reference P. nodorum strain SN15 was the first Dothideomycete with a published genome sequence, and has been used as the basis for comparison within and between species. Here we present an updated reference genome assembly with corrections of SNP and indel errors in the underlying genome assembly from deep resequencing data as well as extensive manual annotation of gene models using transcriptomic and proteomic sources of evidence (https://github.com/robsyme/Parastagonospora_nodorum_SN15). The updated assembly and annotation includes 8,366 genes with modified protein sequence and 866 new genes. This study shows the benefits of using a wide variety of experimental methods allied to expert curation to generate a reliable set of gene models. PMID:26840125

  8. A-MADMAN: Annotation-based microarray data meta-analysis tool

    PubMed Central

    Bisognin, Andrea; Coppe, Alessandro; Ferrari, Francesco; Risso, Davide; Romualdi, Chiara; Bicciato, Silvio; Bortoluzzi, Stefania

    2009-01-01

    Background Publicly available datasets of microarray gene expression signals represent an unprecedented opportunity for extracting genomic relevant information and validating biological hypotheses. However, the exploitation of this exceptionally rich mine of information is still hampered by the lack of appropriate computational tools, able to overcome the critical issues raised by meta-analysis. Results This work presents A-MADMAN, an open source web application which allows the retrieval, annotation, organization and meta-analysis of gene expression datasets obtained from Gene Expression Omnibus. A-MADMAN addresses and resolves several open issues in the meta-analysis of gene expression data. Conclusion A-MADMAN allows i) the batch retrieval from Gene Expression Omnibus and the local organization of raw data files and of any related meta-information, ii) the re-annotation of samples to fix incomplete, or otherwise inadequate, metadata and to create user-defined batches of data, iii) the integrative analysis of data obtained from different Affymetrix platforms through custom chip definition files and meta-normalization. Software and documentation are available on-line at . PMID:19563634

  9. Patome: a database server for biological sequence annotation and analysis in issued patents and published patent applications.

    PubMed

    Lee, Byungwook; Kim, Taehyung; Kim, Seon-Kyu; Lee, Kwang H; Lee, Doheon

    2007-01-01

    With the advent of automated and high-throughput techniques, the number of patent applications containing biological sequences has been increasing rapidly. However, they have attracted relatively little attention compared to other sequence resources. We have built a database server called Patome, which contains biological sequence data disclosed in patents and published applications, as well as their analysis information. The analysis is divided into two steps. The first is an annotation step in which the disclosed sequences were annotated with RefSeq database. The second is an association step where the sequences were linked to Entrez Gene, OMIM and GO databases, and their results were saved as a gene-patent table. From the analysis, we found that 55% of human genes were associated with patenting. The gene-patent table can be used to identify whether a particular gene or disease is related to patenting. Patome is available at http://www.patome.org/; the information is updated bimonthly.

  10. Ab initio reconstruction of transcriptomes of pluripotent and lineage committed cells reveals gene structures of thousands of lincRNAs

    PubMed Central

    Guttman, Mitchell; Garber, Manuel; Levin, Joshua Z.; Donaghey, Julie; Robinson, James; Adiconis, Xian; Fan, Lin; Koziol, Magdalena J.; Gnirke, Andreas; Nusbaum, Chad; Rinn, John L.; Lander, Eric S.; Regev, Aviv

    2010-01-01

    RNA-Seq provides an unbiased way to study a transcriptome, including both coding and non-coding genes. To date, most RNA-Seq studies have critically depended on existing annotations, and thus focused on expression levels and variation in known transcripts. Here, we present Scripture, a method to reconstruct the transcriptome of a mammalian cell using only RNA-Seq reads and the genome sequence. We apply it to mouse embryonic stem cells, neuronal precursor cells, and lung fibroblasts to accurately reconstruct the full-length gene structures for the vast majority of known expressed genes. We identify substantial variation in protein-coding genes, including thousands of novel 5′-start sites, 3′-ends, and internal coding exons. We then determine the gene structures of over a thousand lincRNA and antisense loci. Our results open the way to direct experimental manipulation of thousands of non-coding RNAs, and demonstrate the power of ab initio reconstruction to render a comprehensive picture of mammalian transcriptomes. PMID:20436462

  11. Bioinformatics approaches to predict target genes from transcription factor binding data.

    PubMed

    Essebier, Alexandra; Lamprecht, Marnie; Piper, Michael; Bodén, Mikael

    2017-12-01

    Transcription factors regulate gene expression and play an essential role in development by maintaining proliferative states, driving cellular differentiation and determining cell fate. Transcription factors are capable of regulating multiple genes over potentially long distances making target gene identification challenging. Currently available experimental approaches to detect distal interactions have multiple weaknesses that have motivated the development of computational approaches. Although an improvement over experimental approaches, existing computational approaches are still limited in their application, with different weaknesses depending on the approach. Here, we review computational approaches with a focus on data dependency, cell type specificity and usability. With the aim of identifying transcription factor target genes, we apply available approaches to typical transcription factor experimental datasets. We show that approaches are not always capable of annotating all transcription factor binding sites; binding sites should be treated disparately; and a combination of approaches can increase the biological relevance of the set of genes identified as targets. Copyright © 2017 Elsevier Inc. All rights reserved.

  12. Natural Hot Spots for Gain of Multiple Resistances: Arsenic and Antibiotic Resistances in Heterotrophic, Aerobic Bacteria from Marine Hydrothermal Vent Fields

    PubMed Central

    Farias, Pedro; Espírito Santo, Christophe; Branco, Rita; Francisco, Romeu; Santos, Susana; Hansen, Lars; Sorensen, Soren

    2015-01-01

    Microorganisms are responsible for multiple antibiotic resistances that have been associated with resistance/tolerance to heavy metals, with consequences to public health. Many genes conferring these resistances are located on mobile genetic elements, easily exchanged among phylogenetically distant bacteria. The objective of the present work was to isolate arsenic-, antimonite-, and antibiotic-resistant strains and to determine the existence of plasmids harboring antibiotic/arsenic/antimonite resistance traits in phenotypically resistant strains, in a nonanthropogenically impacted environment. The hydrothermal Lucky Strike field in the Azores archipelago (North Atlantic, between 11°N and 38°N), at the Mid-Atlantic Ridge, protected under the OSPAR Convention, was sampled as a metal-rich pristine environment. A total of 35 strains from 8 different species were isolated in the presence of arsenate, arsenite, and antimonite. ACR3 and arsB genes were amplified from the sediment's total DNA, and 4 isolates also carried ACR3 genes. Phenotypic multiple resistances were found in all strains, and 7 strains had recoverable plasmids. Purified plasmids were sequenced by Illumina and assembled by EDENA V3, and contig annotation was performed using the “Rapid Annotation using the Subsystems Technology” server. Determinants of resistance to copper, zinc, cadmium, cobalt, and chromium as well as to the antibiotics β-lactams and fluoroquinolones were found in the 3 sequenced plasmids. Genes coding for heavy metal resistance and antibiotic resistance in the same mobile element were found, suggesting the possibility of horizontal gene transfer and distribution of these resistances in the bacterial population. PMID:25636836

  13. Warehousing re-annotated cancer genes for biomarker meta-analysis.

    PubMed

    Orsini, M; Travaglione, A; Capobianco, E

    2013-07-01

    Translational research in cancer genomics assigns a fundamental role to bioinformatics in support of candidate gene prioritization with regard to both biomarker discovery and target identification for drug development. Efforts in both such directions rely on the existence and constant update of large repositories of gene expression data and omics records obtained from a variety of experiments. Users who interactively interrogate such repositories may have problems in retrieving sample fields that present limited associated information, due for instance to incomplete entries or sometimes unusable files. Cancer-specific data sources present similar problems. Given that source integration usually improves data quality, one of the objectives is keeping the computational complexity sufficiently low to allow an optimal assimilation and mining of all the information. In particular, the scope of integrating intraomics data can be to improve the exploration of gene co-expression landscapes, while the scope of integrating interomics sources can be that of establishing genotype-phenotype associations. Both integrations are relevant to cancer biomarker meta-analysis, as the proposed study demonstrates. Our approach is based on re-annotating cancer-specific data available at the EBI's ArrayExpress repository and building a data warehouse aimed at biomarker discovery and validation studies. Cancer genes are organized by tissue, with biomedical and clinical evidence combined to increase reproducibility and consistency of results. For better comparative evaluation, multiple queries have been designed to efficiently address all types of experiments and platforms, and allow for retrieval of sample-related information, such as cell line, disease state and clinical aspects. Copyright © 2013 Elsevier Ireland Ltd. All rights reserved.

  14. RRE: a tool for the extraction of non-coding regions surrounding annotated genes from genomic datasets.

    PubMed

    Lazzarato, F; Franceschinis, G; Botta, M; Cordero, F; Calogero, R A

    2004-11-01

    RRE allows the extraction of non-coding regions surrounding a coding sequence [i.e. gene upstream region, 5'-untranslated region (5'-UTR), introns, 3'-UTR, downstream region] from annotated genomic datasets available at NCBI. RRE parser and web-based interface are accessible at http://www.bioinformatica.unito.it/bioinformatics/rre/rre.html

  15. The Planteome database: an integrated resource for reference ontologies, plant genomics and phenomics

    PubMed Central

    Cooper, Laurel; Meier, Austin; Laporte, Marie-Angélique; Elser, Justin L; Mungall, Chris; Sinn, Brandon T; Cavaliere, Dario; Carbon, Seth; Dunn, Nathan A; Smith, Barry; Qu, Botong; Preece, Justin; Zhang, Eugene; Todorovic, Sinisa; Gkoutos, Georgios; Doonan, John H; Stevenson, Dennis W; Arnaud, Elizabeth

    2018-01-01

    Abstract The Planteome project (http://www.planteome.org) provides a suite of reference and species-specific ontologies for plants and annotations to genes and phenotypes. Ontologies serve as common standards for semantic integration of a large and growing corpus of plant genomics, phenomics and genetics data. The reference ontologies include the Plant Ontology, Plant Trait Ontology and the Plant Experimental Conditions Ontology developed by the Planteome project, along with the Gene Ontology, Chemical Entities of Biological Interest, Phenotype and Attribute Ontology, and others. The project also provides access to species-specific Crop Ontologies developed by various plant breeding and research communities from around the world. We provide integrated data on plant traits, phenotypes, and gene function and expression from 95 plant taxa, annotated with reference ontology terms. The Planteome project is developing a plant gene annotation platform; Planteome Noctua, to facilitate community engagement. All the Planteome ontologies are publicly available and are maintained at the Planteome GitHub site (https://github.com/Planteome) for sharing, tracking revisions and new requests. The annotated data are freely accessible from the ontology browser (http://browser.planteome.org/amigo) and our data repository. PMID:29186578

  16. Candidate Gene Identification with SNP Marker-Based Fine Mapping of Anthracnose Resistance Gene Co-4 in Common Bean.

    PubMed

    Burt, Andrew J; William, H Manilal; Perry, Gregory; Khanal, Raja; Pauls, K Peter; Kelly, James D; Navabi, Alireza

    2015-01-01

    Anthracnose, caused by Colletotrichum lindemuthianum, is an important fungal disease of common bean (Phaseolus vulgaris). Alleles at the Co-4 locus confer resistance to a number of races of C. lindemuthianum. A population of 94 F4:5 recombinant inbred lines of a cross between resistant black bean genotype B09197 and susceptible navy bean cultivar Nautica was used to identify markers associated with resistance in bean chromosome 8 (Pv08) where Co-4 is localized. Three SCAR markers with known linkage to Co-4 and a panel of single nucleotide markers were used for genotyping. A refined physical region on Pv08 with significant association with anthracnose resistance identified by markers was used in BLAST searches with the genomic sequence of common bean accession G19833. Thirty-two unique annotated candidate genes were identified that spanned a physical region of 936.46 kb. A majority of the annotated genes identified had functional similarity to leucine rich repeats/receptor like kinase domains. Three annotated genes had similarity to 1,3-β-glucanase domains. There were sequence similarities between some of the annotated genes found in the study and the genes associated with phosphoinositide-specific phospholipases C associated with Co-x and the COK-4 loci found in previous studies. It is possible that the Co-4 locus is structured as a group of genes with functional domains dominated by protein tyrosine kinase along with leucine rich repeats/nucleotide binding site, phospholipases C as well as β-glucanases.

  17. Candidate Gene Identification with SNP Marker-Based Fine Mapping of Anthracnose Resistance Gene Co-4 in Common Bean

    PubMed Central

    Burt, Andrew J.; William, H. Manilal; Perry, Gregory; Khanal, Raja; Pauls, K. Peter; Kelly, James D.; Navabi, Alireza

    2015-01-01

    Anthracnose, caused by Colletotrichum lindemuthianum, is an important fungal disease of common bean (Phaseolus vulgaris). Alleles at the Co–4 locus confer resistance to a number of races of C. lindemuthianum. A population of 94 F4:5 recombinant inbred lines of a cross between resistant black bean genotype B09197 and susceptible navy bean cultivar Nautica was used to identify markers associated with resistance in bean chromosome 8 (Pv08) where Co–4 is localized. Three SCAR markers with known linkage to Co–4 and a panel of single nucleotide markers were used for genotyping. A refined physical region on Pv08 with significant association with anthracnose resistance identified by markers was used in BLAST searches with the genomic sequence of common bean accession G19833. Thirty-two unique annotated candidate genes were identified that spanned a physical region of 936.46 kb. A majority of the annotated genes identified had functional similarity to leucine rich repeats/receptor like kinase domains. Three annotated genes had similarity to 1,3-β-glucanase domains. There were sequence similarities between some of the annotated genes found in the study and the genes associated with phosphoinositide-specific phospholipases C associated with Co-x and the COK–4 loci found in previous studies. It is possible that the Co–4 locus is structured as a group of genes with functional domains dominated by protein tyrosine kinase along with leucine rich repeats/nucleotide binding site, phospholipases C as well as β-glucanases. PMID:26431031

  18. Multi-Atlas Segmentation using Partially Annotated Data: Methods and Annotation Strategies.

    PubMed

    Koch, Lisa M; Rajchl, Martin; Bai, Wenjia; Baumgartner, Christian F; Tong, Tong; Passerat-Palmbach, Jonathan; Aljabar, Paul; Rueckert, Daniel

    2017-08-22

    Multi-atlas segmentation is a widely used tool in medical image analysis, providing robust and accurate results by learning from annotated atlas datasets. However, the availability of fully annotated atlas images for training is limited due to the time required for the labelling task. Segmentation methods requiring only a proportion of each atlas image to be labelled could therefore reduce the workload on expert raters tasked with annotating atlas images. To address this issue, we first re-examine the labelling problem common in many existing approaches and formulate its solution in terms of a Markov Random Field energy minimisation problem on a graph connecting atlases and the target image. This provides a unifying framework for multi-atlas segmentation. We then show how modifications in the graph configuration of the proposed framework enable the use of partially annotated atlas images and investigate different partial annotation strategies. The proposed method was evaluated on two Magnetic Resonance Imaging (MRI) datasets for hippocampal and cardiac segmentation. Experiments were performed aimed at (1) recreating existing segmentation techniques with the proposed framework and (2) demonstrating the potential of employing sparsely annotated atlas data for multi-atlas segmentation.

  19. MGDB: a comprehensive database of genes involved in melanoma.

    PubMed

    Zhang, Di; Zhu, Rongrong; Zhang, Hanqian; Zheng, Chun-Hou; Xia, Junfeng

    2015-01-01

    The Melanoma Gene Database (MGDB) is a manually curated catalog of molecular genetic data relating to genes involved in melanoma. The main purpose of this database is to establish a network of melanoma related genes and to facilitate the mechanistic study of melanoma tumorigenesis. The entries describing the relationships between melanoma and genes in the current release were manually extracted from PubMed abstracts; the release contains 527 human melanoma genes to date (422 protein-coding and 105 non-coding genes). Each melanoma gene was annotated in seven different aspects (General Information, Expression, Methylation, Mutation, Interaction, Pathway and Drug). In addition, manually curated literature references have also been provided to support the inclusion of the gene in MGDB and establish its association with melanoma. MGDB has a user-friendly web interface with multiple browse and search functions. We hope MGDB will enrich our knowledge about melanoma genetics and serve as a useful complement to the existing public resources. Database URL: http://bioinfo.ahu.edu.cn:8080/Melanoma/index.jsp. © The Author(s) 2015. Published by Oxford University Press.

  20. Integrating alternative splicing detection into gene prediction.

    PubMed

    Foissac, Sylvain; Schiex, Thomas

    2005-02-10

    Alternative splicing (AS) is now considered a major actor in transcriptome/proteome diversity and it cannot be neglected in the annotation process of a new genome. Despite considerable progress in terms of accuracy in computational gene prediction, the ability to reliably predict AS variants when there is local experimental evidence of it remains an open challenge for gene finders. We have used a new integrative approach that allows AS detection to be incorporated into ab initio gene prediction. This method relies on the analysis of genomically aligned transcript sequences (ESTs and/or cDNAs), and has been implemented in the dynamic programming algorithm of the graph-based gene finder EuGENE. Given a genomic sequence and a set of aligned transcripts, this new version identifies the set of transcripts carrying evidence of alternative splicing events, and provides, in addition to the classical optimal gene prediction, alternative optimal predictions (among those which are consistent with the AS events detected). This allows for multiple annotations of a single gene in a way such that each predicted variant is supported by transcript evidence (but not necessarily with full-length coverage). This automatic combination of experimental data analysis and ab initio gene finding offers an ideal integration of alternatively spliced gene prediction inside a single annotation pipeline.

  1. PhytoPath: an integrative resource for plant pathogen genomics.

    PubMed

    Pedro, Helder; Maheswari, Uma; Urban, Martin; Irvine, Alistair George; Cuzick, Alayne; McDowall, Mark D; Staines, Daniel M; Kulesha, Eugene; Hammond-Kosack, Kim Elizabeth; Kersey, Paul Julian

    2016-01-04

    PhytoPath (www.phytopathdb.org) is a resource for genomic and phenotypic data from plant pathogen species, that integrates phenotypic data for genes from PHI-base, an expertly curated catalog of genes with experimentally verified pathogenicity, with the Ensembl tools for data visualization and analysis. The resource is focused on fungi, protists (oomycetes) and bacterial plant pathogens that have genomes that have been sequenced and annotated. Genes with associated PHI-base data can be easily identified across all plant pathogen species using a BioMart-based query tool and visualized in their genomic context on the Ensembl genome browser. The PhytoPath resource contains data for 135 genomic sequences from 87 plant pathogen species, and 1364 genes curated for their role in pathogenicity and as targets for chemical intervention. Support for community annotation of gene models is provided using the WebApollo online gene editor, and we are working with interested communities to improve reference annotation for selected species. © The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research.

  2. APPRIS 2017: principal isoforms for multiple gene sets

    PubMed Central

    Rodriguez-Rivas, Juan; Di Domenico, Tomás; Vázquez, Jesús; Valencia, Alfonso

    2018-01-01

    The APPRIS database (http://appris-tools.org) uses protein structural and functional features and information from cross-species conservation to annotate splice isoforms in protein-coding genes. APPRIS selects a single protein isoform, the ‘principal’ isoform, as the reference for each gene based on these annotations. A single main splice isoform reflects the biological reality for most protein-coding genes, and APPRIS principal isoforms are the best predictors of these main protein isoforms. Here, we present the updates to the database, new developments that include the addition of three new species (chimpanzee, Drosophila melanogaster and Caenorhabditis elegans), the expansion of APPRIS to cover the RefSeq gene set and the UniProtKB proteome for six species, and refinements in the core methods that make up the annotation pipeline. In addition, APPRIS now provides a measure of reliability for individual principal isoforms and updates with each release of the GENCODE/Ensembl and RefSeq reference sets. The individual GENCODE/Ensembl, RefSeq and UniProtKB reference gene sets for six organisms have been merged to produce common sets of splice variants. PMID:29069475

  3. Differentially Expressed Genes in Resistant and Susceptible Common Bean (Phaseolus vulgaris L.) Genotypes in Response to Fusarium oxysporum f. sp. phaseoli

    PubMed Central

    Xue, Renfeng; Wu, Jing; Zhu, Zhendong; Wang, Lanfen; Wang, Xiaoming; Wang, Shumin; Blair, Matthew W.

    2015-01-01

    Fusarium wilt of common bean (Phaseolus vulgaris L.), caused by Fusarium oxysporum Schlechtend.:Fr. f.sp. phaseoli (Fop), is one of the most important diseases of common beans worldwide. Few natural sources of resistance to Fop exist, and they provide only moderate or partial levels of protection. Despite the economic importance of the disease across multiple crops, only a few Fop-induced genes have been analyzed in legumes. Therefore, our goal was to identify transcriptionally regulated genes during an incompatible interaction between common bean and the Fop pathogen using the cDNA amplified fragment length polymorphism (cDNA-AFLP) technique. We generated a total of 8,730 transcript-derived fragments (TDFs) with 768 primer pairs based on the comparison of a moderately resistant and a susceptible genotype. In total, 423 TDFs (4.9%) displayed altered expression patterns after inoculation with Fop inoculum. We obtained full amplicon sequences for 122 selected TDFs, of which 98 were identified as annotated known genes in different functional categories based on their putative functions, 10 were predicted but non-annotated genes and 14 were not homologous to any known genes. The 98 TDFs encoding genes of known putative function were classified as related to metabolism (22), signal transduction (21), protein synthesis and processing (20), development and cytoskeletal organization (12), transport of proteins (7), gene expression and RNA metabolism (4), redox reactions (4), defense and stress responses (3), energy metabolism (3), and hormone responses (2). Based on the analyses of homology, 19 TDFs from different functional categories were chosen for expression analysis using quantitative RT-PCR. The genes found to be important here were implicated at various steps of pathogen infection and will allow a better understanding of the mechanisms of defense and resistance to Fop and similar pathogens. The differential response genes discovered here could also be used as molecular markers in association mapping or QTL analysis. PMID:26030070

  4. NeuroTransDB: highly curated and structured transcriptomic metadata for neurodegenerative diseases.

    PubMed

    Bagewadi, Shweta; Adhikari, Subash; Dhrangadhariya, Anjani; Irin, Afroza Khanam; Ebeling, Christian; Namasivayam, Aishwarya Alex; Page, Matthew; Hofmann-Apitius, Martin; Senger, Philipp

    2015-01-01

    Neurodegenerative diseases are chronic debilitating conditions, characterized by progressive loss of neurons that represent a significant health care burden as the global elderly population continues to grow. Over the past decade, high-throughput technologies such as the Affymetrix GeneChip microarrays have provided new perspectives into the pathomechanisms underlying neurodegeneration. Public transcriptomic data repositories, namely Gene Expression Omnibus and curated ArrayExpress, enable researchers to conduct integrative meta-analysis, increasing the power to detect differentially regulated genes in disease and explore patterns of gene dysregulation across biologically related studies. The reliability of retrospective, large-scale integrative analyses depends on an appropriate combination of related datasets, in turn requiring detailed meta-annotations capturing the experimental setup. In most cases, we observe huge variation in compliance to defined standards for submitted metadata in public databases. Much of the information needed to complete or refine meta-annotations is distributed in the associated publications. For example, tissue preparation or comorbidity information is frequently described in an article's supplementary tables. Several value-added databases have employed additional manual efforts to overcome this limitation. However, none of these databases explicate annotations that distinguish human and animal models in the neurodegeneration context. Therefore, adopting a more specific disease focus, in combination with dedicated disease ontologies, will better empower the selection of comparable studies with refined annotations to address the research question at hand. In this article, we describe the detailed development of NeuroTransDB, a manually curated database containing metadata annotations for neurodegenerative studies. The database contains more than 20 dimensions of metadata annotations within 31 mouse, 5 rat and 45 human studies, defined in collaboration with domain disease experts. We elucidate the step-by-step guidelines used to critically prioritize studies from public archives and their metadata curation and discuss the key challenges encountered. Curated metadata for Alzheimer's disease gene expression studies are available for download. Database URL: www.scai.fraunhofer.de/NeuroTransDB.html. © The Author(s) 2015. Published by Oxford University Press.

  5. NeuroTransDB: highly curated and structured transcriptomic metadata for neurodegenerative diseases

    PubMed Central

    Bagewadi, Shweta; Adhikari, Subash; Dhrangadhariya, Anjani; Irin, Afroza Khanam; Ebeling, Christian; Namasivayam, Aishwarya Alex; Page, Matthew; Hofmann-Apitius, Martin

    2015-01-01

    Neurodegenerative diseases are chronic debilitating conditions, characterized by progressive loss of neurons that represent a significant health care burden as the global elderly population continues to grow. Over the past decade, high-throughput technologies such as the Affymetrix GeneChip microarrays have provided new perspectives into the pathomechanisms underlying neurodegeneration. Public transcriptomic data repositories, namely Gene Expression Omnibus and curated ArrayExpress, enable researchers to conduct integrative meta-analysis, increasing the power to detect differentially regulated genes in disease and explore patterns of gene dysregulation across biologically related studies. The reliability of retrospective, large-scale integrative analyses depends on an appropriate combination of related datasets, in turn requiring detailed meta-annotations capturing the experimental setup. In most cases, we observe huge variation in compliance to defined standards for submitted metadata in public databases. Much of the information needed to complete or refine meta-annotations is distributed in the associated publications. For example, tissue preparation or comorbidity information is frequently described in an article’s supplementary tables. Several value-added databases have employed additional manual efforts to overcome this limitation. However, none of these databases explicate annotations that distinguish human and animal models in the neurodegeneration context. Therefore, adopting a more specific disease focus, in combination with dedicated disease ontologies, will better empower the selection of comparable studies with refined annotations to address the research question at hand. In this article, we describe the detailed development of NeuroTransDB, a manually curated database containing metadata annotations for neurodegenerative studies. The database contains more than 20 dimensions of metadata annotations within 31 mouse, 5 rat and 45 human studies, defined in collaboration with domain disease experts. We elucidate the step-by-step guidelines used to critically prioritize studies from public archives and their metadata curation and discuss the key challenges encountered. Curated metadata for Alzheimer’s disease gene expression studies are available for download. Database URL: www.scai.fraunhofer.de/NeuroTransDB.html PMID:26475471

  6. A Factor Graph Approach to Automated GO Annotation

    PubMed Central

    Spetale, Flavio E.; Tapia, Elizabeth; Krsticevic, Flavia; Roda, Fernando; Bulacio, Pilar

    2016-01-01

    As volume of genomic data grows, computational methods become essential for providing a first glimpse onto gene annotations. Automated Gene Ontology (GO) annotation methods based on hierarchical ensemble classification techniques are particularly interesting when interpretability of annotation results is a main concern. In these methods, raw GO-term predictions computed by base binary classifiers are leveraged by checking the consistency of predefined GO relationships. Both formal leveraging strategies, with main focus on annotation precision, and heuristic alternatives, with main focus on scalability issues, have been described in literature. In this contribution, a factor graph approach to the hierarchical ensemble formulation of the automated GO annotation problem is presented. In this formal framework, a core factor graph is first built based on the GO structure and then enriched to take into account the noisy nature of GO-term predictions. Hence, starting from raw GO-term predictions, an iterative message passing algorithm between nodes of the factor graph is used to compute marginal probabilities of target GO-terms. Evaluations on Saccharomyces cerevisiae, Arabidopsis thaliana and Drosophila melanogaster protein sequences from the GO Molecular Function domain showed significant improvements over competing approaches, even when protein sequences were naively characterized by their physicochemical and secondary structure properties or when loose noisy annotation datasets were considered. Based on these promising results and using Arabidopsis thaliana annotation data, we extend our approach to the identification of most promising molecular function annotations for a set of proteins of unknown function in Solanum lycopersicum. PMID:26771463
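
    The idea of turning raw, possibly inconsistent GO-term scores into hierarchy-consistent marginal probabilities can be illustrated on a toy example. The sketch below is not the authors' factor graph or message-passing code: it enumerates all labelings of a tiny hypothetical hierarchy by brute force, which yields the same marginals that exact message passing would give on a small tree-structured graph. The term names, the scores, and the simple noise model (each raw score treated as an independent likelihood) are all assumptions made for illustration.

```python
from itertools import product


def consistent_marginals(scores, parent_of):
    """Return P(term = 1) for each term over hierarchy-consistent labelings.

    scores:    dict term -> raw classifier score in (0, 1) (hypothetical values)
    parent_of: dict child -> parent; a child may be positive only if its
               parent is positive (the usual GO "true path" constraint)
    """
    terms = sorted(scores)
    total = 0.0
    positive_mass = {t: 0.0 for t in terms}
    for labels in product((0, 1), repeat=len(terms)):
        x = dict(zip(terms, labels))
        # Hard consistency factor: discard labelings with a positive child
        # under a negative parent.
        if any(x[c] == 1 and x[p] == 0 for c, p in parent_of.items()):
            continue
        # Soft evidence factors: assumed independent per-term noise model.
        weight = 1.0
        for t in terms:
            weight *= scores[t] if x[t] else 1.0 - scores[t]
        total += weight
        for t in terms:
            if x[t]:
                positive_mass[t] += weight
    return {t: positive_mass[t] / total for t in terms}


# Toy hierarchy: "child" is a more specific term than "root".
marginals = consistent_marginals({"root": 0.9, "child": 0.8}, {"child": "root"})
print(marginals)
```

    Brute-force enumeration is exponential in the number of terms, so it only serves to show what the marginals mean; message passing is what makes the computation feasible on a realistic GO graph.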

  7. A Factor Graph Approach to Automated GO Annotation.

    PubMed

    Spetale, Flavio E; Tapia, Elizabeth; Krsticevic, Flavia; Roda, Fernando; Bulacio, Pilar

    2016-01-01

    As volume of genomic data grows, computational methods become essential for providing a first glimpse onto gene annotations. Automated Gene Ontology (GO) annotation methods based on hierarchical ensemble classification techniques are particularly interesting when interpretability of annotation results is a main concern. In these methods, raw GO-term predictions computed by base binary classifiers are leveraged by checking the consistency of predefined GO relationships. Both formal leveraging strategies, with main focus on annotation precision, and heuristic alternatives, with main focus on scalability issues, have been described in literature. In this contribution, a factor graph approach to the hierarchical ensemble formulation of the automated GO annotation problem is presented. In this formal framework, a core factor graph is first built based on the GO structure and then enriched to take into account the noisy nature of GO-term predictions. Hence, starting from raw GO-term predictions, an iterative message passing algorithm between nodes of the factor graph is used to compute marginal probabilities of target GO-terms. Evaluations on Saccharomyces cerevisiae, Arabidopsis thaliana and Drosophila melanogaster protein sequences from the GO Molecular Function domain showed significant improvements over competing approaches, even when protein sequences were naively characterized by their physicochemical and secondary structure properties or when loose noisy annotation datasets were considered. Based on these promising results and using Arabidopsis thaliana annotation data, we extend our approach to the identification of most promising molecular function annotations for a set of proteins of unknown function in Solanum lycopersicum.

  8. Non-Gaussian Distributions Affect Identification of Expression Patterns, Functional Annotation, and Prospective Classification in Human Cancer Genomes

    PubMed Central

    Marko, Nicholas F.; Weil, Robert J.

    2012-01-01

    Introduction: Gene expression data is often assumed to be normally-distributed, but this assumption has not been tested rigorously. We investigate the distribution of expression data in human cancer genomes and study the implications of deviations from the normal distribution for translational molecular oncology research. Methods: We conducted a central moments analysis of five cancer genomes and performed empiric distribution fitting to examine the true distribution of expression data both on the complete-experiment and on the individual-gene levels. We used a variety of parametric and nonparametric methods to test the effects of deviations from normality on gene calling, functional annotation, and prospective molecular classification using a sixth cancer genome. Results: Central moments analyses reveal statistically-significant deviations from normality in all of the analyzed cancer genomes. We observe as much as 37% variability in gene calling, 39% variability in functional annotation, and 30% variability in prospective, molecular tumor subclassification associated with this effect. Conclusions: Cancer gene expression profiles are not normally-distributed, either on the complete-experiment or on the individual-gene level. Instead, they exhibit complex, heavy-tailed distributions characterized by statistically-significant skewness and kurtosis. The non-Gaussian distribution of this data affects identification of differentially-expressed genes, functional annotation, and prospective molecular classification. These effects may be reduced in some circumstances, although not completely eliminated, by using nonparametric analytics. This analysis highlights two unreliable assumptions of translational cancer gene expression analysis: that “small” departures from normality in the expression data distributions are analytically-insignificant and that “robust” gene-calling algorithms can fully compensate for these effects. PMID:23118863
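
    A central moments analysis of the kind mentioned above rests on skewness and kurtosis, both of which are zero (in the excess-kurtosis convention) for a normal distribution. The following is a minimal sketch of those two statistics using population central moments; it is not the authors' pipeline, and the example values are made up for illustration.

```python
def central_moment(values, k):
    """k-th central moment (population form, dividing by n)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** k for v in values) / n


def skewness(values):
    """Third standardized moment; 0 for a symmetric distribution."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Fourth standardized moment minus 3; 0 for a normal distribution."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0


# Hypothetical expression values for one gene across five samples.
symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]
right_skewed = [1.0, 1.0, 1.0, 1.0, 10.0]
print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(right_skewed), excess_kurtosis(right_skewed))
```

    Testing whether such values differ significantly from zero requires a formal normality test, which this sketch does not attempt.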

  9. Using deep RNA sequencing for the structural annotation of the Laccaria bicolor mycorrhizal transcriptome.

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Larsen, P. E.; Trivedi, G.; Sreedasyam, A.

    2010-07-06

    Accurate structural annotation is important for prediction of function and required for in vitro approaches to characterize or validate the gene expression products. Despite significant efforts in the field, determination of the gene structure from genomic data alone is a challenging and inaccurate process. The ease of acquisition of transcriptomic sequence provides a direct route to identify expressed sequences and determine the correct gene structure. We developed methods to utilize RNA-seq data to correct errors in the structural annotation and extend the boundaries of current gene models using assembly approaches. The methods were validated with a transcriptomic data set derived from the fungus Laccaria bicolor, which develops a mycorrhizal symbiotic association with the roots of many tree species. Our analysis focused on the subset of 1501 gene models that are differentially expressed in the free living vs. mycorrhizal transcriptome and are expected to be important elements related to carbon metabolism, membrane permeability and transport, and intracellular signaling. Of the set of 1501 gene models, 1439 (96%) successfully generated modified gene models in which all error flags were successfully resolved and the sequences aligned to the genomic sequence. The remaining 4% (62 gene models) either had deviations from transcriptomic data that could not be spanned or generated sequence that did not align to genomic sequence. The outcome of this process is a set of high confidence gene models that can be reliably used for experimental characterization of protein function. 69% of expressed mycorrhizal JGI 'best' gene models deviated from the transcript sequence derived by this method. The transcriptomic sequence enabled correction of a majority of the structural inconsistencies and resulted in a set of validated models for 96% of the mycorrhizal genes. The method described here can be applied to improve gene structural annotation in other species, provided that there is a sequenced genome and a set of gene models.

  10. Variation analysis and gene annotation of eight MHC haplotypes: The MHC Haplotype Project

    PubMed Central

    Horton, Roger; Gibson, Richard; Coggill, Penny; Miretti, Marcos; Allcock, Richard J.; Almeida, Jeff; Forbes, Simon; Gilbert, James G. R.; Halls, Karen; Harrow, Jennifer L.; Hart, Elizabeth; Howe, Kevin; Jackson, David K.; Palmer, Sophie; Roberts, Anne N.; Sims, Sarah; Stewart, C. Andrew; Traherne, James A.; Trevanion, Steve; Wilming, Laurens; Rogers, Jane; de Jong, Pieter J.; Elliott, John F.; Sawcer, Stephen; Todd, John A.; Trowsdale, John

    2008-01-01

    The human major histocompatibility complex (MHC) is contained within about 4 Mb on the short arm of chromosome 6 and is recognised as the most variable region in the human genome. The primary aim of the MHC Haplotype Project was to provide a comprehensively annotated reference sequence of a single, human leukocyte antigen-homozygous MHC haplotype and to use it as a basis against which variations could be assessed from seven other similarly homozygous cell lines, representative of the most common MHC haplotypes in the European population. Comparison of the haplotype sequences, including four haplotypes not previously analysed, resulted in the identification of >44,000 variations, both substitutions and indels (insertions and deletions), which have been submitted to the dbSNP database. The gene annotation uncovered haplotype-specific differences and confirmed the presence of more than 300 loci, including over 160 protein-coding genes. Combined analysis of the variation and annotation datasets revealed 122 gene loci with coding substitutions of which 97 were non-synonymous. The haplotype (A3-B7-DR15; PGF cell line) designated as the new MHC reference sequence, has been incorporated into the human genome assembly (NCBI35 and subsequent builds), and constitutes the largest single-haplotype sequence of the human genome to date. The extensive variation and annotation data derived from the analysis of seven further haplotypes have been made publicly available and provide a framework and resource for future association studies of all MHC-associated diseases and transplant medicine. PMID:18193213

  11. First Pass Annotation of Promoters on Human Chromosome 22

    PubMed Central

    Scherf, Matthias; Klingenhoff, Andreas; Frech, Kornelie; Quandt, Kerstin; Schneider, Ralf; Grote, Korbinian; Frisch, Matthias; Gailus-Durner, Valérie; Seidel, Alexander; Brack-Werner, Ruth; Werner, Thomas

    2001-01-01

    The publication of the first almost complete sequence of a human chromosome (chromosome 22) is a major milestone in human genomics. Together with the sequence, an excellent annotation of genes was published which certainly will serve as an information resource for numerous future projects. We noted that the annotation did not cover regulatory regions; in particular, no promoter annotation has been provided. Here we present an analysis of the complete published chromosome 22 sequence for promoters. A recent breakthrough in specific in silico prediction of promoter regions enabled us to attempt large-scale prediction of promoter regions on chromosome 22. Scanning of sequence databases revealed only 20 experimentally verified promoters, of which 10 were correctly predicted by our approach. Nearly 40% of our 465 predicted promoter regions are supported by the currently available gene annotation. Promoter finding also provides a biologically meaningful method for “chromosomal scaffolding”, by which long genomic sequences can be divided into segments starting with a gene. As one example, the combination of promoter region prediction with exon/intron structure predictions greatly enhances the specificity of de novo gene finding. The present study demonstrates that it is possible to identify promoters in silico on the chromosomal level with sufficient reliability for experimental planning and indicates that a wealth of information about regulatory regions can be extracted from current large-scale (megabase) sequencing projects. Results are available on-line at http://genomatix.gsf.de/chr22/. PMID:11230158

  12. Global Network Alignment in the Context of Aging.

    PubMed

    Faisal, Fazle Elahi; Zhao, Han; Milenkovic, Tijana

    2015-01-01

    Analogous to sequence alignment, network alignment (NA) can be used to transfer biological knowledge across species between conserved network regions. NA faces two algorithmic challenges: 1) Which cost function to use to capture "similarities" between nodes in different networks? 2) Which alignment strategy to use to rapidly identify "high-scoring" alignments from all possible alignments? We "break down" existing state-of-the-art methods that use both different cost functions and different alignment strategies to evaluate each combination of their cost functions and alignment strategies. We find that a combination of the cost function of one method and the alignment strategy of another method beats the existing methods. Hence, we propose this combination as a novel superior NA method. Then, since human aging is hard to study experimentally due to long lifespan, we use NA to transfer aging-related knowledge from well annotated model species to poorly annotated human. By doing so, we produce novel human aging-related knowledge, which complements currently available knowledge about aging that has been obtained mainly by sequence alignment. We demonstrate significant similarity between topological and functional properties of our novel predictions and those of known aging-related genes. We are the first to use NA to learn more about aging.

  13. An integrative approach to inferring biologically meaningful gene modules.

    PubMed

    Cho, Ji-Hoon; Wang, Kai; Galas, David J

    2011-07-26

    The ability to construct biologically meaningful gene networks and modules is critical for contemporary systems biology. Though recent studies have demonstrated the power of using gene modules to shed light on the functioning of complex biological systems, most modules in these networks have shown little association with meaningful biological function. We have devised a method, Semantic Similarity-Integrated approach for Modularization (SSIM), which directly incorporates gene ontology (GO) annotation in the construction of gene modules in order to gain better functional association. SSIM integrates various gene-gene pairwise similarity values, including information obtained from gene expression, protein-protein interactions and GO annotations, in the construction of modules using affinity propagation clustering. We demonstrated the performance of the proposed method using data from two complex biological responses: 1. the osmotic shock response in Saccharomyces cerevisiae, and 2. the prion-induced pathogenic mouse model. In comparison with two previously reported algorithms, modules identified by SSIM showed significantly stronger association with biological functions. The incorporation of semantic similarity based on GO annotation with gene expression and protein-protein interaction data can greatly enhance the functional relevance of inferred gene modules. In addition, the SSIM approach can also reveal the hierarchical structure of gene modules to gain a broader functional view of the biological system. Hence, the proposed method can facilitate comprehensive and in-depth analysis of high throughput experimental data at the gene network level.

  14. GFam: a platform for automatic annotation of gene families.

    PubMed

    Sasidharan, Rajkumar; Nepusz, Tamás; Swarbreck, David; Huala, Eva; Paccanaro, Alberto

    2012-10-01

    We have developed GFam, a platform for automatic annotation of gene/protein families. GFam provides a framework for genome initiatives and model organism resources to build domain-based families and derive meaningful functional labels, and it offers a seamless approach to propagate functional annotation across periodic genome updates. GFam is a hybrid approach that uses a greedy algorithm to chain component domains from InterPro annotation provided by its 12 member resources, followed by a sequence-based connected component analysis of un-annotated sequence regions, to derive a consensus domain architecture for each sequence and subsequently generate families based on common architectures. For the proteome of Arabidopsis, our integrated approach increases sequence coverage by 7.2 percentage points and residue coverage by 14.6 percentage points relative to the best single-constituent database within InterPro. The true power of GFam lies in maximizing annotation provided by the different InterPro data sources that offer resource-specific coverage for different regions of a sequence. GFam's capability to capture higher sequence and residue coverage can be useful for genome annotation, comparative genomics and functional studies. GFam is general-purpose software and can be used for any collection of protein sequences. The software is open source and can be obtained from http://www.paccanarolab.org/software/gfam/.
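
    The abstract states only that a greedy algorithm chains component domains into an architecture; it does not give the selection rule. As an illustration of the general idea, the sketch below assumes one plausible rule (accept the longest hits first and skip any hit that overlaps an already accepted one). The function name, the tuple layout, and the domain identifiers are hypothetical and are not taken from GFam.

```python
def greedy_domain_chain(hits):
    """Pick a non-overlapping set of domain hits, longest first.

    hits: list of (start, end, domain_id) with 1-based inclusive coordinates.
    Returns the accepted hits sorted by position along the sequence.
    """
    chosen = []
    # Assumed priority: longer hits first, ties broken by start position.
    by_priority = sorted(hits, key=lambda h: (-(h[1] - h[0] + 1), h[0]))
    for start, end, domain in by_priority:
        # Accept only if the hit lies entirely before or after every chosen hit.
        if all(end < s or start > e for s, e, _ in chosen):
            chosen.append((start, end, domain))
    return sorted(chosen)


def residue_coverage(chosen, sequence_length):
    """Fraction of residues covered by the accepted (non-overlapping) hits."""
    return sum(end - start + 1 for start, end, _ in chosen) / sequence_length


# Hypothetical hits from different member databases on a 200-residue protein.
hits = [(1, 100, "A"), (90, 150, "B"), (160, 200, "C"), (120, 155, "D")]
chain = greedy_domain_chain(hits)
print([d for _, _, d in chain], residue_coverage(chain, 200))
```

    Grouping sequences that share the same ordered list of domain identifiers would then give architecture-based families in the sense described above.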

  15. Consistent prediction of GO protein localization.

    PubMed

    Spetale, Flavio E; Arce, Debora; Krsticevic, Flavia; Bulacio, Pilar; Tapia, Elizabeth

    2018-05-17

    The GO-Cellular Component (GO-CC) ontology provides a controlled vocabulary for the consistent description of the subcellular compartments or macromolecular complexes where proteins may act. Current machine learning-based methods used for the automated GO-CC annotation of proteins suffer from the inconsistency of individual GO-CC term predictions. Here, we present FGGA-CC+, a class of hierarchical graph-based classifiers for the consistent GO-CC annotation of protein coding genes at the subcellular compartment or macromolecular complex levels. To boost the accuracy of GO-CC predictions, we make use of the protein localization knowledge in the GO-Biological Process (GO-BP) annotations. As a result, FGGA-CC+ classifiers are built from annotation data in both the GO-CC and GO-BP ontologies. Due to their graph-based design, FGGA-CC+ classifiers are fully interpretable and their predictions amenable to expert analysis. Promising results on protein annotation data from five model organisms were obtained. Additionally, successful validation results in the annotation of a challenging subset of tandem duplicated genes in the tomato non-model organism were accomplished. Overall, these results suggest that FGGA-CC+ classifiers can indeed be useful for satisfying the huge demand of GO-CC annotation arising from ubiquitous high-throughput sequencing and proteomic projects.

  16. Interestingness measures and strategies for mining multi-ontology multi-level association rules from gene ontology annotations for the discovery of new GO relationships.

    PubMed

    Manda, Prashanti; McCarthy, Fiona; Bridges, Susan M

    2013-10-01

    The Gene Ontology (GO), a set of three sub-ontologies, is one of the most popular bio-ontologies used for describing gene product characteristics. GO annotation data containing terms from multiple sub-ontologies and at different levels in the ontologies is an important source of implicit relationships between terms from the three sub-ontologies. Data mining techniques such as association rule mining that are tailored to mine from multiple ontologies at multiple levels of abstraction are required for effective knowledge discovery from GO annotation data. We present a data mining approach, Multi-ontology data mining at All Levels (MOAL) that uses the structure and relationships of the GO to mine multi-ontology multi-level association rules. We introduce two interestingness measures: Multi-ontology Support (MOSupport) and Multi-ontology Confidence (MOConfidence) customized to evaluate multi-ontology multi-level association rules. We also describe a variety of post-processing strategies for pruning uninteresting rules. We use publicly available GO annotation data to demonstrate our methods with respect to two applications (1) the discovery of co-annotation suggestions and (2) the discovery of new cross-ontology relationships. Copyright © 2013 The Authors. Published by Elsevier Inc. All rights reserved.
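
    The abstract does not define MOSupport or MOConfidence, so the sketch below shows only the standard support and confidence of an association rule, computed after each gene's annotations are propagated to ancestor terms so that rules can be evaluated at several levels of the ontology. It is an illustration of the general technique, not the MOAL measures; the gene names, term labels, and parent links are hypothetical.

```python
def propagate(annotations, parents):
    """Expand each gene's term set with all ancestor terms.

    annotations: dict gene -> set of directly annotated terms
    parents:     dict term -> list of parent terms
    """
    def ancestors(term):
        seen, stack = set(), [term]
        while stack:
            for p in parents.get(stack.pop(), ()):
                if p not in seen:
                    seen.add(p)
                    stack.append(p)
        return seen

    return {
        gene: set(terms) | {a for t in terms for a in ancestors(t)}
        for gene, terms in annotations.items()
    }


def support(expanded, terms):
    """Fraction of genes annotated (directly or via descendants) with all terms."""
    terms = set(terms)
    return sum(1 for ts in expanded.values() if terms <= ts) / len(expanded)


def confidence(expanded, antecedent, consequent):
    """support(antecedent and consequent) / support(antecedent); 0.0 if undefined."""
    base = support(expanded, antecedent)
    if base == 0:
        return 0.0
    return support(expanded, set(antecedent) | set(consequent)) / base


# Hypothetical two-ontology example (MF = molecular function, BP = biological process).
parents = {"MF:kinase": ["MF:catalytic"], "BP:phosphorylation": ["BP:metabolic"]}
annotations = {
    "g1": {"MF:kinase", "BP:phosphorylation"},
    "g2": {"MF:kinase"},
    "g3": {"MF:catalytic", "BP:metabolic"},
    "g4": {"BP:phosphorylation"},
}
expanded = propagate(annotations, parents)
print(confidence(expanded, {"MF:kinase"}, {"BP:phosphorylation"}))
print(confidence(expanded, {"MF:catalytic"}, {"BP:metabolic"}))
```

    In this toy data the rule at the more general level has higher confidence than the rule at the specific level, which is the kind of effect a multi-level measure has to account for.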

  17. EcoGene 3.0

    PubMed Central

    Zhou, Jindan; Rudd, Kenneth E.

    2013-01-01

    EcoGene (http://ecogene.org) is a database and website devoted to continuously improving the structural and functional annotation of Escherichia coli K-12, one of the most well understood model organisms, represented by the MG1655(Seq) genome sequence and annotations. Major improvements to EcoGene in the past decade include (i) graphic presentations of genome map features; (ii) ability to design Boolean queries and Venn diagrams from EcoArray, EcoTopics or user-provided GeneSets; (iii) the genome-wide clone and deletion primer design tool, PrimerPairs; (iv) sequence searches using a customized EcoBLAST; (v) a Cross Reference table of synonymous gene and protein identifiers; (vi) proteome-wide indexing with GO terms; (vii) EcoTools access to >2000 complete bacterial genomes in EcoGene-RefSeq; (viii) establishment of a MySql relational database; and (ix) use of web content management systems. The biomedical literature is surveyed daily to provide citation and gene function updates. As of September 2012, the review of 37 397 abstracts and articles led to creation of 98 425 PubMed-Gene links and 5415 PubMed-Topic links. Annotation updates to Genbank U00096 are transmitted from EcoGene to NCBI. Experimental verifications include confirmation of a CTG start codon, pseudogene restoration and quality assurance of the Keio strain collection. PMID:23197660

  18. pGenN, a gene normalization tool for plant genes and proteins in scientific literature.

    PubMed

    Ding, Ruoyao; Arighi, Cecilia N; Lee, Jung-Youn; Wu, Cathy H; Vijay-Shanker, K

    2015-01-01

    Automatically detecting gene/protein names in the literature and connecting them to databases records, also known as gene normalization, provides a means to structure the information buried in free-text literature. Gene normalization is critical for improving the coverage of annotation in the databases, and is an essential component of many text mining systems and database curation pipelines. In this manuscript, we describe a gene normalization system specifically tailored for plant species, called pGenN (pivot-based Gene Normalization). The system consists of three steps: dictionary-based gene mention detection, species assignment, and intra species normalization. We have developed new heuristics to improve each of these phases. We evaluated the performance of pGenN on an in-house expertly annotated corpus consisting of 104 plant relevant abstracts. Our system achieved an F-value of 88.9% (Precision 90.9% and Recall 87.2%) on this corpus, outperforming state-of-art systems presented in BioCreative III. We have processed over 440,000 plant-related Medline abstracts using pGenN. The gene normalization results are stored in a local database for direct query from the pGenN web interface (proteininformationresource.org/pgenn/). The annotated literature corpus is also publicly available through the PIR text mining portal (proteininformationresource.org/iprolink/).

  19. EcoGene 3.0.

    PubMed

    Zhou, Jindan; Rudd, Kenneth E

    2013-01-01

    EcoGene (http://ecogene.org) is a database and website devoted to continuously improving the structural and functional annotation of Escherichia coli K-12, one of the most well understood model organisms, represented by the MG1655(Seq) genome sequence and annotations. Major improvements to EcoGene in the past decade include (i) graphic presentations of genome map features; (ii) ability to design Boolean queries and Venn diagrams from EcoArray, EcoTopics or user-provided GeneSets; (iii) the genome-wide clone and deletion primer design tool, PrimerPairs; (iv) sequence searches using a customized EcoBLAST; (v) a Cross Reference table of synonymous gene and protein identifiers; (vi) proteome-wide indexing with GO terms; (vii) EcoTools access to >2000 complete bacterial genomes in EcoGene-RefSeq; (viii) establishment of a MySql relational database; and (ix) use of web content management systems. The biomedical literature is surveyed daily to provide citation and gene function updates. As of September 2012, the review of 37 397 abstracts and articles led to creation of 98 425 PubMed-Gene links and 5415 PubMed-Topic links. Annotation updates to Genbank U00096 are transmitted from EcoGene to NCBI. Experimental verifications include confirmation of a CTG start codon, pseudogene restoration and quality assurance of the Keio strain collection.

  20. Characterization of gonadal transcriptomes from the turbot (Scophthalmus maximus).

    PubMed

    Hu, Yulong; Huang, Meng; Wang, Weiji; Guan, Jiantao; Kong, Jie

    2016-01-01

    The mechanisms underlying sexual reproduction and sex ratio determination remain unclear in turbot, a flatfish of great commercial value. Moreover, there is limited information in the turbot database regarding genes related to the reproductive system. Here, we conducted high-throughput transcriptome profiling of turbot gonad tissues to better understand their reproductive functions and to supply essential gene sequence information for marker-assisted selection programs in the turbot industry. In this study, two gonad libraries representing sex differences in Scophthalmus maximus yielded 453 818 high-quality reads that were assembled into 24 611 contigs and 33 713 singletons by using 454 pyrosequencing, 13 936 contigs and singletons (CS) of which were annotated using BLASTx. GO (Gene Ontology) and KEGG (Kyoto Encyclopedia of Genes and Genomes) pathway analyses revealed that various biological functions and processes were associated with many of the annotated CS. Expression analyses showed that 510 genes were differentially expressed in males versus females; 80% of these genes were annotated. In addition, 6484 and 6036 single nucleotide polymorphisms (SNPs) were identified in male and female libraries, respectively. This transcriptome resource will serve as the foundation for cDNA or SNP microarray construction, gene expression characterization, and sex-specific linkage mapping in turbot.

  1. SigmoID: a user-friendly tool for improving bacterial genome annotation through analysis of transcription control signals

    PubMed Central

    Damienikan, Aliaksandr U.

    2016-01-01

    The majority of bacterial genome annotations are currently automated and based on a ‘gene by gene’ approach. Regulatory signals and operon structures are rarely taken into account which often results in incomplete and even incorrect gene function assignments. Here we present SigmoID, a cross-platform (OS X, Linux and Windows) open-source application aiming at simplifying the identification of transcription regulatory sites (promoters, transcription factor binding sites and terminators) in bacterial genomes and providing assistance in correcting annotations in accordance with regulatory information. SigmoID combines a user-friendly graphical interface to well known command line tools with a genome browser for visualising regulatory elements in genomic context. Integrated access to online databases with regulatory information (RegPrecise and RegulonDB) and web-based search engines speeds up genome analysis and simplifies correction of genome annotation. We demonstrate some features of SigmoID by constructing a series of regulatory protein binding site profiles for two groups of bacteria: Soft Rot Enterobacteriaceae (Pectobacterium and Dickeya spp.) and Pseudomonas spp. Furthermore, we inferred over 900 transcription factor binding sites and alternative sigma factor promoters in the annotated genome of Pectobacterium atrosepticum. These regulatory signals control putative transcription units covering about 40% of the P. atrosepticum chromosome. Reviewing the annotation in cases where it didn’t fit with regulatory information allowed us to correct product and gene names for over 300 loci. PMID:27257541

  2. The Plant Ontology as a Tool for Comparative Plant Anatomy and Genomic Analyses

    PubMed Central

    Cooper, Laurel; Walls, Ramona L.; Elser, Justin; Gandolfo, Maria A.; Stevenson, Dennis W.; Smith, Barry; Preece, Justin; Athreya, Balaji; Mungall, Christopher J.; Rensing, Stefan; Hiss, Manuel; Lang, Daniel; Reski, Ralf; Berardini, Tanya Z.; Li, Donghui; Huala, Eva; Schaeffer, Mary; Menda, Naama; Arnaud, Elizabeth; Shrestha, Rosemary; Yamazaki, Yukiko; Jaiswal, Pankaj

    2013-01-01

    The Plant Ontology (PO; http://www.plantontology.org/) is a publicly available, collaborative effort to develop and maintain a controlled, structured vocabulary (‘ontology’) of terms to describe plant anatomy, morphology and the stages of plant development. The goals of the PO are to link (annotate) gene expression and phenotype data to plant structures and stages of plant development, using the data model adopted by the Gene Ontology. From its original design covering only rice, maize and Arabidopsis, the scope of the PO has been expanded to include all green plants. The PO was the first multispecies anatomy ontology developed for the annotation of genes and phenotypes. Also, to our knowledge, it was one of the first biological ontologies that provides translations (via synonyms) in non-English languages such as Japanese and Spanish. As of Release #18 (July 2012), there are about 2.2 million annotations linking PO terms to >110,000 unique data objects representing genes or gene models, proteins, RNAs, germplasm and quantitative trait loci (QTLs) from 22 plant species. In this paper, we focus on the plant anatomical entity branch of the PO, describing the organizing principles, resources available to users and examples of how the PO is integrated into other plant genomics databases and web portals. We also provide two examples of comparative analyses, demonstrating how the ontology structure and PO-annotated data can be used to discover the patterns of expression of the LEAFY (LFY) and terpene synthase (TPS) gene homologs. PMID:23220694

  3. Enhancing Expressivity of Document-Centered Collaboration with Multimodal Annotations

    ERIC Educational Resources Information Center

    Yoon, Dongwook

    2017-01-01

    As knowledge work moves online, digital documents have become a staple of human collaboration. To communicate beyond the constraints of time and space, remote and asynchronous collaborators create digital annotations over documents, substituting face-to-face meetings with online conversations. However, existing document annotation interfaces…

  4. The standard operating procedure of the DOE-JGI Microbial Genome Annotation Pipeline (MGAP v.4).

    PubMed

    Huntemann, Marcel; Ivanova, Natalia N; Mavromatis, Konstantinos; Tripp, H James; Paez-Espino, David; Palaniappan, Krishnaveni; Szeto, Ernest; Pillay, Manoj; Chen, I-Min A; Pati, Amrita; Nielsen, Torben; Markowitz, Victor M; Kyrpides, Nikos C

    2015-01-01

    The DOE-JGI Microbial Genome Annotation Pipeline performs structural and functional annotation of microbial genomes that are further included into the Integrated Microbial Genome comparative analysis system. MGAP is applied to assembled nucleotide sequence datasets that are provided via the IMG submission site. Dataset submission for annotation first requires project and associated metadata description in GOLD. The MGAP sequence data processing consists of feature prediction including identification of protein-coding genes, non-coding RNAs and regulatory RNA features, as well as CRISPR elements. Structural annotation is followed by assignment of protein product names and functions.

  5. DaGO-Fun: tool for Gene Ontology-based functional analysis using term information content measures.

    PubMed

    Mazandu, Gaston K; Mulder, Nicola J

    2013-09-25

    The use of Gene Ontology (GO) data in protein analyses has largely contributed to the improved outcomes of these analyses. Several GO semantic similarity measures have been proposed in recent years and provide tools that allow the integration of biological knowledge embedded in the GO structure into different biological analyses. There is a need for a unified tool that provides the scientific community with the opportunity to explore these different GO similarity measure approaches and their biological applications. We have developed DaGO-Fun, an online tool available at http://web.cbio.uct.ac.za/ITGOM, which incorporates many different GO similarity measures for exploring, analyzing and comparing GO terms and proteins within the context of GO. It uses GO data and UniProt proteins with their GO annotations as provided by the Gene Ontology Annotation (GOA) project to precompute GO term information content (IC), enabling rapid response to user queries. The DaGO-Fun online tool presents the advantage of integrating all the relevant IC-based GO similarity measures, including topology- and annotation-based approaches to facilitate effective exploration of these measures, thus enabling users to choose the most relevant approach for their application. Furthermore, this tool includes several biological applications related to GO semantic similarity scores, including the retrieval of genes based on their GO annotations, the clustering of functionally related genes within a set, and term enrichment analysis.
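
    To illustrate what an annotation-based information content (IC) measure looks like, here is a minimal sketch, not DaGO-Fun's implementation. It assumes the common definition IC(t) = -log p(t), where p(t) is the fraction of proteins annotated with term t or any of its descendants, and uses the Resnik measure (IC of the most informative common ancestor) as one well-known example of an IC-based similarity. The toy ontology, term names and protein identifiers are all hypothetical.

```python
import math

# Hypothetical toy ontology: term -> set of parent terms (not real GO terms).
PARENTS = {
    "root": set(),
    "binding": {"root"},
    "catalysis": {"root"},
    "dna_binding": {"binding"},
    "rna_binding": {"binding"},
}

# Hypothetical direct annotations: protein -> set of terms.
ANNOTATIONS = {
    "P1": {"dna_binding"},
    "P2": {"rna_binding"},
    "P3": {"catalysis"},
    "P4": {"dna_binding", "catalysis"},
}


def ancestors(term):
    """Return the term together with all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content():
    """IC(t) = -log(fraction of proteins annotated with t or a descendant)."""
    counts = {t: 0 for t in PARENTS}
    for terms in ANNOTATIONS.values():
        # Propagate each direct annotation up to every ancestor term.
        covered = set().union(*(ancestors(t) for t in terms))
        for t in covered:
            counts[t] += 1
    n = len(ANNOTATIONS)
    return {t: -math.log(c / n) for t, c in counts.items() if c > 0}


def resnik(t1, t2, ic):
    """Resnik similarity: IC of the most informative common ancestor."""
    common = ancestors(t1) & ancestors(t2)
    return max(ic[t] for t in common)


IC = information_content()
print(resnik("dna_binding", "rna_binding", IC))
print(resnik("dna_binding", "catalysis", IC))
```

    Precomputing the IC table once, as the abstract describes, is what makes each later similarity query a cheap lookup over shared ancestors.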

  6. The Complex Transcriptional Response of Acaryochloris marina to Different Oxygen Levels.

    PubMed

    Hernández-Prieto, Miguel A; Lin, Yuankui; Chen, Min

    2017-02-09

    Ancient oxygenic photosynthetic prokaryotes produced oxygen as a waste product, but existed for a long time under an oxygen-free (anoxic) atmosphere, before an oxic atmosphere emerged. The change in oxygen levels in the atmosphere influenced the chemistry and structure of many enzymes that contained prosthetic groups that were inactivated by oxygen. In the genome of Acaryochloris marina, multiple gene copies exist for proteins that are normally encoded by a single gene copy in other cyanobacteria. Using high throughput RNA sequencing to profile transcriptome responses from cells grown under microoxic and hyperoxic conditions, we detected 8446 transcripts out of the 8462 annotated genes in the Cyanobase database. Two-thirds of the 50 most abundant transcripts are key proteins in photosynthesis. Microoxic conditions negatively affected the levels of expression of genes encoding photosynthetic complexes, with the exception of some subunits. In addition to the known regulation of the multiple copies of psbA, we detected a similar transcriptional pattern for psbJ and psbU, which might play a key role in the altered components of photosystem II. Furthermore, regulation of genes encoding proteins important for reactive oxygen species-scavenging is discussed at genome level, including, for the first time, specific small RNAs having possible regulatory roles under varying oxygen levels. Copyright © 2017 Hernandez-Prieto et al.

  7. The Complex Transcriptional Response of Acaryochloris marina to Different Oxygen Levels

    PubMed Central

    Hernández-Prieto, Miguel A.; Lin, Yuankui; Chen, Min

    2016-01-01

    Ancient oxygenic photosynthetic prokaryotes produced oxygen as a waste product, but existed for a long time under an oxygen-free (anoxic) atmosphere, before an oxic atmosphere emerged. The change in oxygen levels in the atmosphere influenced the chemistry and structure of many enzymes that contained prosthetic groups that were inactivated by oxygen. In the genome of Acaryochloris marina, multiple gene copies exist for proteins that are normally encoded by a single gene copy in other cyanobacteria. Using high throughput RNA sequencing to profile transcriptome responses from cells grown under microoxic and hyperoxic conditions, we detected 8446 transcripts out of the 8462 annotated genes in the Cyanobase database. Two-thirds of the 50 most abundant transcripts are key proteins in photosynthesis. Microoxic conditions negatively affected the levels of expression of genes encoding photosynthetic complexes, with the exception of some subunits. In addition to the known regulation of the multiple copies of psbA, we detected a similar transcriptional pattern for psbJ and psbU, which might play a key role in the altered components of photosystem II. Furthermore, regulation of genes encoding proteins important for reactive oxygen species-scavenging is discussed at genome level, including, for the first time, specific small RNAs having possible regulatory roles under varying oxygen levels. PMID:27974439

  8. The Biofuel Feedstock Genomics Resource: a web-based portal and database to enable functional genomics of plant biofuel feedstock species.

    PubMed

    Childs, Kevin L; Konganti, Kranti; Buell, C Robin

    2012-01-01

    Major feedstock sources for future biofuel production are likely to be high biomass producing plant species such as poplar, pine, switchgrass, sorghum and maize. One active area of research in these species is genome-enabled improvement of lignocellulosic biofuel feedstock quality and yield. To facilitate genomic-based investigations in these species, we developed the Biofuel Feedstock Genomic Resource (BFGR), a database and web-portal that provides high-quality, uniform and integrated functional annotation of gene and transcript assembly sequences from species of interest to lignocellulosic biofuel feedstock researchers. The BFGR includes sequence data from 54 species and permits researchers to view, analyze and obtain annotation at the gene, transcript, protein and genome level. Annotation of biochemical pathways permits the identification of key genes and transcripts central to the improvement of lignocellulosic properties in these species. The integrated nature of the BFGR in terms of annotation methods, orthologous/paralogous relationships and linkage to seven species with complete genome sequences allows comparative analyses for biofuel feedstock species with limited sequence resources. Database URL: http://bfgr.plantbiology.msu.edu.

  9. Monitoring transcription initiation activities in rat and dog.

    PubMed

    Lizio, Marina; Mukarram, Abdul Kadir; Ohno, Mizuho; Watanabe, Shoko; Itoh, Masayoshi; Hasegawa, Akira; Lassmann, Timo; Severin, Jessica; Harshbarger, Jayson; Abugessaisa, Imad; Kasukawa, Takeya; Hon, Chung Chau; Carninci, Piero; Hayashizaki, Yoshihide; Forrest, Alistair R R; Kawaji, Hideya

    2017-11-28

    The promoter landscape of several non-human model organisms is far from complete. As a part of FANTOM5 data collection, we generated 13 profiles of transcription initiation activities in dog and rat aortic smooth muscle cells, mesenchymal stem cells and hepatocytes by employing CAGE (Cap Analysis of Gene Expression) technology combined with single molecule sequencing. Our analyses show that the CAGE profiles recapitulate known transcription start sites (TSSs) consistently, in addition to uncovering novel TSSs. Our dataset can thus be used with high confidence to support gene annotation in dog and rat species. We identified 28,497 and 23,147 CAGE peaks, or promoter regions, for rat and dog respectively, and associated them with known genes. This approach could be seen as a standard method for improvement of existing gene models, as well as discovery of novel genes. Given that the FANTOM5 data collection includes dog and rat matched cell types in human and mouse as well, these data would also be useful for cross-species studies.

  10. Reduce Manual Curation by Combining Gene Predictions from Multiple Annotation Engines, a Case Study of Start Codon Prediction

    PubMed Central

    Ederveen, Thomas H. A.; Overmars, Lex; van Hijum, Sacha A. F. T.

    2013-01-01

    Nowadays, prokaryotic genomes are sequenced faster than the capacity to manually curate gene annotations. Automated genome annotation engines (AGEs) provide users a straightforward and complete solution for predicting ORF coordinates and function. For many labs, the use of AGEs is therefore essential to decrease the time necessary for annotating a given prokaryotic genome. However, it is not uncommon for AGEs to provide different and sometimes conflicting predictions. Combining multiple AGEs might allow for more accurate predictions. Here we analyzed the ab initio open reading frame (ORF) calling performance of different AGEs based on curated genome annotations of eight strains from different bacterial species with GC% ranging from 35–52%. We present a case study which demonstrates a novel way of comparative genome annotation, using combinations of AGEs in a pre-defined order (or path) to predict ORF start codons. The order of AGE combinations is from high to low specificity, where the specificity is based on the eight genome annotations. For each AGE combination we are able to derive a so-called projected confidence value, which is the average specificity of ORF start codon prediction based on the eight genomes. The projected confidence enables estimating the likelihood of a correct prediction for a particular ORF start codon by a particular AGE combination, pinpointing ORFs whose start codons are notoriously difficult to predict. We correctly predict start codons for 90.5±4.8% of the genes in a genome (based on the eight genomes) with an accuracy of 81.1±7.6%. Our consensus-path methodology allows a marked improvement over majority voting (9.7±4.4%) and, with an optimal path, ORF start prediction sensitivity is gained while maintaining a high specificity. PMID:23675487
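
    The contrast between majority voting and an ordered consensus path can be sketched as follows. This is one possible reading of the abstract, not the authors' code: it assumes a path is an ordered list of engine combinations, each paired with a projected confidence, and that a combination "fires" when all of its engines report the same start coordinate. Engine names, coordinates and confidence values are hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the start coordinate called by most engines, or None on a tie."""
    ranked = Counter(predictions.values()).most_common()
    if not ranked:
        return None
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


def consensus_path(predictions, path):
    """Walk the combinations in order; accept the first one whose engines agree.

    `path` is a list of (engine_names, projected_confidence) pairs, sorted
    from high to low specificity.
    """
    for engines, confidence in path:
        if not all(e in predictions for e in engines):
            continue  # this combination cannot be evaluated for this ORF
        starts = {predictions[e] for e in engines}
        if len(starts) == 1:
            return starts.pop(), confidence
    return None, 0.0


# Hypothetical start-codon calls for one ORF from four engines.
calls = {"engineA": 120, "engineB": 120, "engineC": 150, "engineD": 150}

# Hypothetical path with made-up projected confidence values.
path = [
    (("engineA", "engineB", "engineC"), 0.95),
    (("engineA", "engineB"), 0.90),
    (("engineC",), 0.70),
]

print(majority_vote(calls))         # tie between 120 and 150
print(consensus_path(calls, path))  # second combination agrees on 120
```

    In this toy case majority voting cannot decide, whereas the ordered path still returns a call together with an estimate of how far to trust it.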

  11. Improved annotation of the insect vector of citrus greening disease: biocuration by a diverse genomics community

    PubMed Central

    Hosmani, Prashant S.; Villalobos-Ayala, Krystal; Miller, Sherry; Shippy, Teresa; Flores, Mirella; Rosendale, Andrew; Cordola, Chris; Bell, Tracey; Mann, Hannah; DeAvila, Gabe; DeAvila, Daniel; Moore, Zachary; Buller, Kyle; Ciolkevich, Kathryn; Nandyal, Samantha; Mahoney, Robert; Van Voorhis, Joshua; Dunlevy, Megan; Farrow, David; Hunter, David; Morgan, Taylar; Shore, Kayla; Guzman, Victoria; Izsak, Allison; Dixon, Danielle E.; Cridge, Andrew; Cano, Liliana; Cao, Xiaolong; Jiang, Haobo; Leng, Nan; Johnson, Shannon; Cantarel, Brandi L.; Richards, Stephen; English, Adam; Shatters, Robert G.; Childers, Chris; Chen, Mei-Ju; Hunter, Wayne; Cilia, Michelle; Mueller, Lukas A.; Munoz-Torres, Monica; Nelson, David; Poelchau, Monica F.; Benoit, Joshua B.; Wiersma-Koch, Helen; D’Elia, Tom; Brown, Susan J.

    2017-01-01

    The Asian citrus psyllid (Diaphorina citri Kuwayama) is the insect vector of the bacterium Candidatus Liberibacter asiaticus (CLas), the pathogen associated with citrus Huanglongbing (HLB, citrus greening). HLB threatens citrus production worldwide. Suppression or reduction of the insect vector using chemical insecticides has been the primary method to inhibit the spread of citrus greening disease. Accurate structural and functional annotation of the Asian citrus psyllid genome, as well as a clear understanding of the interactions between the insect and CLas, are required for development of new molecular-based HLB control methods. A draft assembly of the D. citri genome has been generated and annotated with automated pipelines. However, knowledge transfer from well-curated reference genomes such as that of Drosophila melanogaster to newly sequenced ones is challenging due to the complexity and diversity of insect genomes. To identify and improve gene models as potential targets for pest control, we manually curated several gene families with a focus on genes that have key functional roles in D. citri biology and CLas interactions. This community effort produced 530 manually curated gene models across developmental, physiological, RNAi regulatory and immunity-related pathways. As previously shown in the pea aphid, RNAi machinery genes putatively involved in the microRNA pathway have been specifically duplicated. A comprehensive transcriptome enabled us to identify a number of gene families that are either missing or misassembled in the draft genome. In order to develop biocuration as a training experience, we included undergraduate and graduate students from multiple institutions, as well as experienced annotators from the insect genomics research community. The resulting gene set (OGS v1.0) combines both automatically predicted and manually curated gene models. Database URL: https://citrusgreening.org/ PMID:29220441

  12. Functional annotation of regulatory pathways.

    PubMed

    Pandey, Jayesh; Koyutürk, Mehmet; Kim, Yohan; Szpankowski, Wojciech; Subramaniam, Shankar; Grama, Ananth

    2007-07-01

    Standardized annotations of biomolecules in interaction networks (e.g. Gene Ontology) provide comprehensive understanding of the function of individual molecules. Extending such annotations to pathways is a critical component of functional characterization of cellular signaling at the systems level. We propose a framework for projecting gene regulatory networks onto the space of functional attributes using multigraph models, with the objective of deriving statistically significant pathway annotations. We first demonstrate that annotations of pairwise interactions do not generalize to indirect relationships between processes. Motivated by this result, we formalize the problem of identifying statistically overrepresented pathways of functional attributes. We establish the hardness of this problem by demonstrating the non-monotonicity of common statistical significance measures. We propose a statistical model that emphasizes the modularity of a pathway, evaluating its significance based on the coupling of its building blocks. We complement the statistical model by an efficient algorithm and software, Narada, for computing significant pathways in large regulatory networks. Comprehensive results from our methods applied to the Escherichia coli transcription network demonstrate that our approach is effective in identifying known, as well as novel biological pathway annotations. Narada is implemented in Java and is available at http://www.cs.purdue.edu/homes/jpandey/narada/.

  13. Elucidation of primary metabolic pathways in Aspergillus species: orphaned research in characterizing orphan genes.

    PubMed

    Andersen, Mikael Rørdam

    2014-11-01

    Primary metabolism affects all phenotypical traits of filamentous fungi. Particular examples include reacting to extracellular stimuli, producing precursor molecules required for cell division and morphological changes as well as providing monomer building blocks for production of secondary metabolites and extracellular enzymes. In this review, all annotated genes from four Aspergillus species have been examined. In this process, it becomes evident that 80-96% of the genes (depending on the species) are still without verified function. A significant proportion of the genes with verified metabolic functions are assigned to secondary or extracellular metabolism, leaving only 2-4% of the annotated genes within primary metabolism. It is clear that primary metabolism has not received the same attention in the post-genomic era as many other research areas--despite its role at the very centre of cellular function. However, several methods can be employed to use the metabolic networks in tandem with comparative genomics to accelerate functional assignment of genes in primary metabolism. In particular, gaps in metabolic pathways can be used to assign functions to orphan genes. In this review, applications of this from the Aspergillus genes will be examined, and it is proposed that, where feasible, this should be a standard part of functional annotation of fungal genomes. © The Author 2014. Published by Oxford University Press.

  14. RASTtk: A modular and extensible implementation of the RAST algorithm for building custom annotation pipelines and annotating batches of genomes

    DOE PAGES

    Brettin, Thomas; Davis, James J.; Disz, Terry; ...

    2015-02-10

    The RAST (Rapid Annotation using Subsystem Technology) annotation engine was built in 2008 to annotate bacterial and archaeal genomes. It works by offering a standard software pipeline for identifying genomic features (i.e., protein-encoding genes and RNA) and annotating their functions. Recently, in order to make RAST a more useful research tool and to keep pace with advancements in bioinformatics, it has become desirable to build a version of RAST that is both customizable and extensible. In this paper, we describe the RAST tool kit (RASTtk), a modular version of RAST that enables researchers to build custom annotation pipelines. RASTtk offers a choice of software for identifying and annotating genomic features as well as the ability to add custom features to an annotation job. RASTtk also accommodates the batch submission of genomes and the ability to customize annotation protocols for batch submissions. This is the first major software restructuring of RAST since its inception.

  15. Management and analysis of genomic functional and phenotypic controlled annotations to support biomedical investigation and practice.

    PubMed

    Masseroli, Marco

    2007-07-01

    The growing body of available genomic information provides new opportunities for novel research approaches and original biomedical applications that can provide effective data management and analysis support. In fact, integration and comprehensive evaluation of available controlled data can highlight information patterns that lead to the unveiling of new biomedical knowledge. Here, we describe Genome Function INtegrated Discover (GFINDer), a Web-accessible three-tier multidatabase system we developed to automatically enrich lists of user-classified genes with several functional and phenotypic controlled annotations, and to statistically evaluate them in order to identify annotation categories significantly over- or underrepresented in each considered gene class. Genomic controlled annotations from Gene Ontology (GO), KEGG, Pfam, InterPro, and Online Mendelian Inheritance in Man (OMIM) were integrated in GFINDer and several categorical tests were implemented for their analysis. A controlled vocabulary of inherited disorder phenotypes was obtained by normalizing and hierarchically structuring disease accompanying signs and symptoms from OMIM Clinical Synopsis sections. GFINDer's modular architecture is well suited for further system expansion and for sustaining increasing workload. Testing results showed that GFINDer analyses can highlight gene functional and phenotypic characteristics and differences, demonstrating its value in supporting genomic biomedical approaches aiming at understanding the complex biomolecular mechanisms underlying patho-physiological phenotypes, and in helping the transfer of genomic results to medical practice.
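
    The abstract does not say which categorical tests GFINDer implements. As an illustration of how over-representation of an annotation category in a gene class is commonly assessed, the sketch below computes a one-sided hypergeometric p-value; the function name and the example counts are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(k: int, n: int, K: int, N: int) -> float:
    """P(X >= k) for X ~ Hypergeometric.

    N: genes in the background set
    K: background genes carrying the annotation
    n: genes in the class of interest
    k: genes in the class that carry the annotation
    """
    total = comb(N, n)
    upper = min(n, K)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total


# Hypothetical counts: 20 background genes, 5 annotated, class of 4 with 3 hits.
p_value = hypergeom_upper_tail(k=3, n=4, K=5, N=20)
print(f"p = {p_value:.4f}")
```

    Under-representation can be assessed the same way from the lower tail, and a real analysis over many categories would also need a multiple-testing correction.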

  16. Annotating ebony on the fly.

    PubMed

    Kohn, Michael H; Wittkopp, Patricia J

    2007-07-01

    The distinctive black phenotype of ebony mutants has made it one of the most widely used phenotypic markers in Drosophila genetics. Without doubt, ebony showcases the fruits of the fly community's labours to annotate gene function. As of this writing, FlyBase lists 142 references, 1277 fly stocks, 15 phenotypes and 44 alleles. In addition to its namesake pigmentation phenotype, ebony mutants affect other traits, including phototaxis and courtship. With phenotypic consequences of ebony variants readily apparent in the laboratory, does natural selection also see them in the wild? In this issue of Molecular Ecology, Pool & Aquadro investigated this question and found signs of natural selection on the ebony gene that appear to have resulted from selection for darker pigmentation at higher elevations in sub-Saharan populations of Drosophila melanogaster. Such findings from population genomic analysis of wild-derived strains should be included in gene annotations to provide a more holistic view of a gene's function. The evolutionary annotation of ebony added by Pool & Aquadro substantiates that pigmentation can be adaptive and implicates elevation as an important selective factor. This is important progress because the selective factors seem to differ between populations and species. In addition, the study raises issues to consider when extrapolating from selection at the molecular level to selection at the phenotypic level.

  17. A Uniform System For The Annotation Of Human microRNA Genes And The Evolution Of The Human microRNAome

    PubMed Central

    Fromm, Bastian; Billipp, Tyler; Peck, Liam E.; Johansen, Morten; Tarver, James E.; King, Benjamin L.; Newcomb, James M.; Sempere, Lorenzo F.; Flatmark, Kjersti; Hovig, Eivind; Peterson, Kevin J.

    2016-01-01

    Although microRNAs (miRNAs) are among the most intensively studied molecules of the past 20 years, determining what is and what is not a miRNA has not been straightforward. Here, we present a uniform system for the annotation and nomenclature of miRNA genes. We show that fewer than a third of the 1,881 human miRBase entries, and only approximately 16% of the 7,095 metazoan miRBase entries, are robustly supported as miRNA genes. Furthermore, we show that the human repertoire of miRNAs has been shaped by periods of intense miRNA innovation, and that mature gene products show a very different tempo and mode of sequence evolution than star products. We establish a new open access database -- MirGeneDB (http://mirgenedb.org) -- to catalog this set of robustly supported miRNAs, which complements the efforts of miRBase, but differs from it by annotating the mature versus star products, and by imposing an evolutionary hierarchy upon this curated and consistently named repertoire. PMID:26473382

  18. Phage phenomics: Physiological approaches to characterize novel viral proteins

    ScienceCinema

    Sanchez, Savannah E. [San Diego State Univ., San Diego, CA (United States); Cuevas, Daniel A. [San Diego State Univ., San Diego, CA (United States); Rostron, Jason E. [San Diego State Univ., San Diego, CA (United States); Liang, Tiffany Y. [San Diego State Univ., San Diego, CA (United States); Pivaroff, Cullen G. [San Diego State Univ., San Diego, CA (United States); Haynes, Matthew R. [San Diego State Univ., San Diego, CA (United States); Nulton, Jim [San Diego State Univ., San Diego, CA (United States); Felts, Ben [San Diego State Univ., San Diego, CA (United States); Bailey, Barbara A. [San Diego State Univ., San Diego, CA (United States); Salamon, Peter [San Diego State Univ., San Diego, CA (United States); Edwards, Robert A. [San Diego State Univ., San Diego, CA (United States); Argonne National Lab. (ANL), Argonne, IL (United States); Burgin, Alex B. [Broad Institute, Cambridge, MA (United States); Segall, Anca M. [San Diego State Univ., San Diego, CA (United States); Rohwer, Forest [San Diego State Univ., San Diego, CA (United States)

    2018-06-21

    Current investigations into phage-host interactions are dependent on extrapolating knowledge from (meta)genomes. Interestingly, 60 - 95% of all phage sequences share no homology to current annotated proteins. As a result, a large proportion of phage genes are annotated as hypothetical. This reality heavily affects the annotation of both structural and auxiliary metabolic genes. Here we present phenomic methods designed to capture the physiological response(s) of a selected host during expression of one of these unknown phage genes. Multi-phenotype Assay Plates (MAPs) are used to monitor the diversity of host substrate utilization and subsequent biomass formation, while metabolomics provides bi-product analysis by monitoring metabolite abundance and diversity. Both tools are used simultaneously to provide a phenotypic profile associated with expression of a single putative phage open reading frame (ORF). Thus, representative results for both methods are compared, highlighting the phenotypic profile differences of a host carrying either putative structural or metabolic phage genes. In addition, the visualization techniques and high throughput computational pipelines that facilitated experimental analysis are presented.

  19. Experimental annotation of the human genome using microarray technology.

    PubMed

    Shoemaker, D D; Schadt, E E; Armour, C D; He, Y D; Garrett-Engele, P; McDonagh, P D; Loerch, P M; Leonardson, A; Lum, P Y; Cavet, G; Wu, L F; Altschuler, S J; Edwards, S; King, J; Tsang, J S; Schimmack, G; Schelter, J M; Koch, J; Ziman, M; Marton, M J; Li, B; Cundiff, P; Ward, T; Castle, J; Krolewski, M; Meyer, M R; Mao, M; Burchard, J; Kidd, M J; Dai, H; Phillips, J W; Linsley, P S; Stoughton, R; Scherer, S; Boguski, M S

    2001-02-15

    The most important product of the sequencing of a genome is a complete, accurate catalogue of genes and their products, primarily messenger RNA transcripts and their cognate proteins. Such a catalogue cannot be constructed by computational annotation alone; it requires experimental validation on a genome scale. Using 'exon' and 'tiling' arrays fabricated by ink-jet oligonucleotide synthesis, we devised an experimental approach to validate and refine computational gene predictions and define full-length transcripts on the basis of co-regulated expression of their exons. These methods can provide more accurate gene numbers and allow the detection of mRNA splice variants and identification of the tissue- and disease-specific conditions under which genes are expressed. We apply our technique to chromosome 22q under 69 experimental condition pairs, and to the entire human genome under two experimental conditions. We discuss implications for more comprehensive, consistent and reliable genome annotation, more efficient, full-length complementary DNA cloning strategies and application to complex diseases.

  20. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Sanchez, Savannah E.; Cuevas, Daniel A.; Rostron, Jason E.

    Current investigations into phage-host interactions are dependent on extrapolating knowledge from (meta)genomes. Interestingly, 60-95% of all phage sequences share no homology to current annotated proteins. As a result, a large proportion of phage genes are annotated as hypothetical. This reality heavily affects the annotation of both structural and auxiliary metabolic genes. Here we present phenomic methods designed to capture the physiological response(s) of a selected host during expression of one of these unknown phage genes. Multi-phenotype Assay Plates (MAPs) are used to monitor the diversity of host substrate utilization and subsequent biomass formation, while metabolomics provides by-product analysis by monitoring metabolite abundance and diversity. Both tools are used simultaneously to provide a phenotypic profile associated with expression of a single putative phage open reading frame (ORF). Thus, representative results for both methods are compared, highlighting the phenotypic profile differences of a host carrying either putative structural or metabolic phage genes. In addition, the visualization techniques and high-throughput computational pipelines that facilitated experimental analysis are presented.

  1. Next Generation Models for Storage and Representation of Microbial Biological Annotation

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Quest, Daniel J; Land, Miriam L; Brettin, Thomas S

    2010-01-01

    Background Traditional genome annotation systems were developed in a very different computing era, one where the World Wide Web was just emerging. Consequently, these systems are built as centralized black boxes focused on generating high quality annotation submissions to GenBank/EMBL supported by expert manual curation. The exponential growth of sequence data drives a growing need for increasingly higher quality and automatically generated annotation. Typical annotation pipelines utilize traditional database technologies, clustered computing resources, Perl, C, and UNIX file systems to process raw sequence data, identify genes, and predict and categorize gene function. These technologies tightly couple the annotation software system to hardware and third party software (e.g. relational database systems and schemas). This makes annotation systems hard to reproduce, inflexible to modification over time, difficult to assess, difficult to partition across multiple geographic sites, and difficult to understand for those who are not domain experts. These systems are not readily open to scrutiny and therefore not scientifically tractable. The advent of Semantic Web standards such as Resource Description Framework (RDF) and OWL Web Ontology Language (OWL) enables us to construct systems that address these challenges in a new comprehensive way. Results Here, we develop a framework for linking traditional data to OWL-based ontologies in genome annotation. We show how data standards can decouple hardware and third party software tools from annotation pipelines, thereby making annotation pipelines easier to reproduce and assess. An illustrative example shows how TURTLE (Terse RDF Triple Language) can be used as a human readable, but also semantically-aware, equivalent to GenBank/EMBL files. Conclusions The power of this approach lies in its ability to assemble annotation data from multiple databases across multiple locations into a representation that is understandable to researchers. In this way, all researchers, experimental and computational, will more easily understand the informatics processes constructing genome annotation and ultimately be able to help improve the systems that produce them.

  2. ISEScan: automated identification of insertion sequence elements in prokaryotic genomes.

    PubMed

    Xie, Zhiqun; Tang, Haixu

    2017-11-01

    Insertion sequence (IS) elements are the smallest but most abundant autonomous transposable elements in prokaryotic genomes and play a key role in prokaryotic genome organization and evolution. With fast-growing genomic data, it is becoming increasingly critical for biology researchers to be able to accurately and automatically annotate ISs in prokaryotic genome sequences. The available automatic IS annotation systems either provide only incomplete IS annotation or rely on the availability of existing genome annotations. Here, we present a new IS element annotation pipeline to address these issues. ISEScan is a highly sensitive software pipeline based on profile hidden Markov models constructed from manually curated IS elements. ISEScan performs better than existing IS annotation systems when tested on prokaryotic genomes with curated annotations of IS elements. Applying it to 2784 prokaryotic genomes, we report the global distribution of IS families across taxonomic clades in Archaea and Bacteria. ISEScan is implemented in Python and released as open source software at https://github.com/xiezhq/ISEScan. Contact: hatang@indiana.edu. Supplementary data are available at Bioinformatics online. © The Author 2017. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com

  3. Literature-based concept profiles for gene annotation: the issue of weighting.

    PubMed

    Jelier, Rob; Schuemie, Martijn J; Roes, Peter-Jan; van Mulligen, Erik M; Kors, Jan A

    2008-05-01

    Text-mining has been used to link biomedical concepts, such as genes or biological processes, to each other for annotation purposes or the generation of new hypotheses. To relate two concepts to each other several authors have used the vector space model, as vectors can be compared efficiently and transparently. Using this model, a concept is characterized by a list of associated concepts, together with weights that indicate the strength of the association. The associated concepts in the vectors and their weights are derived from a set of documents linked to the concept of interest. An important issue with this approach is the determination of the weights of the associated concepts. Various schemes have been proposed to determine these weights, but no comparative studies of the different approaches are available. Here we compare several weighting approaches in a large-scale classification experiment. Three different techniques were evaluated: (1) weighting based on averaging, an empirical approach; (2) the log likelihood ratio, a test-based measure; (3) the uncertainty coefficient, an information-theory-based measure. The weighting schemes were applied in a system that annotates genes with Gene Ontology codes. As the gold standard for our study we used the annotations provided by the Gene Ontology Annotation project. Classification performance was evaluated by means of the receiver operating characteristics (ROC) curve using the area under the curve (AUC) as the measure of performance. All methods performed well with median AUC scores greater than 0.84, and scored considerably higher than a binary approach without any weighting. Especially for the more specific Gene Ontology codes excellent performance was observed. The differences between the methods were small when considering the whole experiment. However, the number of documents that were linked to a concept proved to be an important variable. When larger amounts of texts were available for the generation of the concepts' vectors, the performance of the methods diverged considerably, with the uncertainty coefficient then outperforming the two other methods.
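
    As an illustration of the vector space model described in this abstract, the following minimal Python sketch compares sparse concept profiles and scores the resulting ranking with a pairwise AUC. It is not the authors' system: the abstract does not name the similarity function, so cosine similarity is an assumption, and the concept names, weights, gene names and gold-standard labels are invented for illustration.

```python
import math


def cosine(profile_a, profile_b):
    """Cosine similarity between two sparse concept profiles (dict: concept -> weight)."""
    shared = set(profile_a) & set(profile_b)
    dot = sum(profile_a[c] * profile_b[c] for c in shared)
    norm_a = math.sqrt(sum(w * w for w in profile_a.values()))
    norm_b = math.sqrt(sum(w * w for w in profile_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty profile is treated as unrelated to everything
    return dot / (norm_a * norm_b)


def auc(scores, labels):
    """Area under the ROC curve by pairwise comparison; ties count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: one Gene Ontology code profile and three gene profiles.
go_profile = {"apoptosis": 0.9, "caspase": 0.7, "mitochondrion": 0.3}
gene_profiles = {
    "geneA": {"apoptosis": 0.8, "caspase": 0.5},
    "geneB": {"ribosome": 0.9, "translation": 0.6},
    "geneC": {"apoptosis": 0.2, "mitochondrion": 0.9},
}
gold = {"geneA": True, "geneB": False, "geneC": True}  # invented gold standard

names = sorted(gene_profiles)
scores = [cosine(gene_profiles[g], go_profile) for g in names]
labels = [gold[g] for g in names]
print(auc(scores, labels))
```

    The weights in each profile are where the three schemes compared in the abstract (averaging, log likelihood ratio, uncertainty coefficient) would differ; the comparison and evaluation steps stay the same.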

  4. Unified Sequence-Based Association Tests Allowing for Multiple Functional Annotations and Meta-analysis of Noncoding Variation in Metabochip Data.

    PubMed

    He, Zihuai; Xu, Bin; Lee, Seunggeun; Ionita-Laza, Iuliana

    2017-09-07

    Substantial progress has been made in the functional annotation of genetic variation in the human genome. Integrative analysis that incorporates such functional annotations into sequencing studies can aid the discovery of disease-associated genetic variants, especially those with unknown function and located outside protein-coding regions. Direct incorporation of one functional annotation as weight in existing dispersion and burden tests can suffer substantial loss of power when the functional annotation is not predictive of the risk status of a variant. Here, we have developed unified tests that can utilize multiple functional annotations simultaneously for integrative association analysis with efficient computational techniques. We show that the proposed tests significantly improve power when variant risk status can be predicted by functional annotations. Importantly, when functional annotations are not predictive of risk status, the proposed tests incur only minimal loss of power in relation to existing dispersion and burden tests, and under certain circumstances they can even have improved power by learning a weight that better approximates the underlying disease model in a data-adaptive manner. The tests can be constructed with summary statistics of existing dispersion and burden tests for sequencing data, therefore allowing meta-analysis of multiple studies without sharing individual-level data. We applied the proposed tests to a meta-analysis of noncoding rare variants in Metabochip data on 12,281 individuals from eight studies for lipid traits. By incorporating the Eigen functional score, we detected significant associations between noncoding rare variants in SLC22A3 and low-density lipoprotein and total cholesterol, associations that are missed by standard dispersion and burden tests. Copyright © 2017 American Society of Human Genetics. Published by Elsevier Inc. All rights reserved.

  5. Transcriptome Analysis of Fat Bodies from Two Brown Planthopper (Nilaparvata lugens) Populations with Different Virulence Levels in Rice

    PubMed Central

    Chen, Hongdan; Lai, Wenxiang; Fu, Qiang; Lou, Yonggen

    2014-01-01

    Background The brown planthopper (BPH), Nilaparvata lugens (Stål), one of the most serious rice insect pests in Asia, can quickly overcome rice resistance by evolving new virulent populations. The insect fat body plays essential roles in the life cycles of insects and in plant-insect interactions. However, whether differences in fat body transcriptomes exist between insect populations with different virulence levels and whether the transcriptomic differences are related to insect virulence remain largely unknown. Methodology/Principal Findings In this study, we performed transcriptome-wide analyses on the fat bodies of two BPH populations with different virulence levels in rice. The populations were derived from rice variety TN1 (TN1 population) and Mudgo (M population). In total, 33,776 and 32,332 unigenes from the fat bodies of TN1 and M populations, respectively, were generated using Illumina technology. Gene ontology annotations and Kyoto Encyclopedia of Genes and Genomes (KEGG) orthology classifications indicated that genes related to metabolism and immunity were significantly active in the fat bodies. In addition, a total of 339 unigenes showed homology to genes of yeast-like symbionts (YLSs) from 12 genera and endosymbiotic bacteria Wolbachia. A comparative analysis of the two transcriptomes generated 7,860 differentially expressed genes. GO annotations and enrichment analysis of KEGG pathways indicated these differentially expressed transcripts might be involved in metabolism and immunity. Finally, 105 differentially expressed genes from YLSs and Wolbachia were identified, genes which might be associated with the formation of different virulent populations. Conclusions/Significance This study was the first to compare the fat-body transcriptomes of two BPH populations having different virulence traits and to find genes that may be related to this difference. Our findings provide a molecular resource for future investigations of fat bodies and will be useful in examining the interactions between the fat body and virulence variation in the BPH. PMID:24533099

  6. Integrating text mining into the MGI biocuration workflow

    PubMed Central

    Dowell, K.G.; McAndrews-Hill, M.S.; Hill, D.P.; Drabkin, H.J.; Blake, J.A.

    2009-01-01

    A major challenge for functional and comparative genomics resource development is the extraction of data from the biomedical literature. Although text mining for biological data is an active research field, few applications have been integrated into production literature curation systems such as those of the model organism databases (MODs). Not only are most available biological natural language (bioNLP) and information retrieval and extraction solutions difficult to adapt to existing MOD curation workflows, but many also have high error rates or are unable to process documents available in those formats preferred by scientific journals. In September 2008, Mouse Genome Informatics (MGI) at The Jackson Laboratory initiated a search for dictionary-based text mining tools that we could integrate into our biocuration workflow. MGI has rigorous document triage and annotation procedures designed to identify appropriate articles about mouse genetics and genome biology. We currently screen ∼1000 journal articles a month for Gene Ontology terms, gene mapping, gene expression, phenotype data and other key biological information. Although we do not foresee that curation tasks will ever be fully automated, we are eager to implement named entity recognition (NER) tools for gene tagging that can help streamline our curation workflow and simplify gene indexing tasks within the MGI system. Gene indexing is an MGI-specific curation function that involves identifying which mouse genes are being studied in an article, then associating the appropriate gene symbols with the article reference number in the MGI database. Here, we discuss our search process, performance metrics and success criteria, and how we identified a short list of potential text mining tools for further evaluation. We provide an overview of our pilot projects with NCBO's Open Biomedical Annotator and Fraunhofer SCAI's ProMiner. In doing so, we prove the potential for the further incorporation of semi-automated processes into the curation of the biomedical literature. PMID:20157492

  7. Integrating text mining into the MGI biocuration workflow.

    PubMed

    Dowell, K G; McAndrews-Hill, M S; Hill, D P; Drabkin, H J; Blake, J A

    2009-01-01

    A major challenge for functional and comparative genomics resource development is the extraction of data from the biomedical literature. Although text mining for biological data is an active research field, few applications have been integrated into production literature curation systems such as those of the model organism databases (MODs). Not only are most available biological natural language (bioNLP) and information retrieval and extraction solutions difficult to adapt to existing MOD curation workflows, but many also have high error rates or are unable to process documents available in those formats preferred by scientific journals. In September 2008, Mouse Genome Informatics (MGI) at The Jackson Laboratory initiated a search for dictionary-based text mining tools that we could integrate into our biocuration workflow. MGI has rigorous document triage and annotation procedures designed to identify appropriate articles about mouse genetics and genome biology. We currently screen approximately 1000 journal articles a month for Gene Ontology terms, gene mapping, gene expression, phenotype data and other key biological information. Although we do not foresee that curation tasks will ever be fully automated, we are eager to implement named entity recognition (NER) tools for gene tagging that can help streamline our curation workflow and simplify gene indexing tasks within the MGI system. Gene indexing is an MGI-specific curation function that involves identifying which mouse genes are being studied in an article, then associating the appropriate gene symbols with the article reference number in the MGI database. Here, we discuss our search process, performance metrics and success criteria, and how we identified a short list of potential text mining tools for further evaluation. We provide an overview of our pilot projects with NCBO's Open Biomedical Annotator and Fraunhofer SCAI's ProMiner. In doing so, we prove the potential for the further incorporation of semi-automated processes into the curation of the biomedical literature.

  8. Transcriptome analysis of fat bodies from two brown planthopper (Nilaparvata lugens) populations with different virulence levels in rice.

    PubMed

    Yu, Haixin; Ji, Rui; Ye, Wenfeng; Chen, Hongdan; Lai, Wenxiang; Fu, Qiang; Lou, Yonggen

    2014-01-01

    The brown planthopper (BPH), Nilaparvata lugens (Stål), one of the most serious rice insect pests in Asia, can quickly overcome rice resistance by evolving new virulent populations. The insect fat body plays essential roles in the life cycles of insects and in plant-insect interactions. However, whether differences in fat body transcriptomes exist between insect populations with different virulence levels and whether the transcriptomic differences are related to insect virulence remain largely unknown. In this study, we performed transcriptome-wide analyses on the fat bodies of two BPH populations with different virulence levels in rice. The populations were derived from rice variety TN1 (TN1 population) and Mudgo (M population). In total, 33,776 and 32,332 unigenes from the fat bodies of TN1 and M populations, respectively, were generated using Illumina technology. Gene ontology annotations and Kyoto Encyclopedia of Genes and Genomes (KEGG) orthology classifications indicated that genes related to metabolism and immunity were significantly active in the fat bodies. In addition, a total of 339 unigenes showed homology to genes of yeast-like symbionts (YLSs) from 12 genera and endosymbiotic bacteria Wolbachia. A comparative analysis of the two transcriptomes generated 7,860 differentially expressed genes. GO annotations and enrichment analysis of KEGG pathways indicated these differentially expressed transcripts might be involved in metabolism and immunity. Finally, 105 differentially expressed genes from YLSs and Wolbachia were identified, genes which might be associated with the formation of different virulent populations. This study was the first to compare the fat-body transcriptomes of two BPH populations having different virulence traits and to find genes that may be related to this difference. Our findings provide a molecular resource for future investigations of fat bodies and will be useful in examining the interactions between the fat body and virulence variation in the BPH.

  9. HPIDB 2.0: a curated database for host–pathogen interactions

    PubMed Central

    Ammari, Mais G.; Gresham, Cathy R.; McCarthy, Fiona M.; Nanduri, Bindu

    2016-01-01

    Identification and analysis of host–pathogen interactions (HPI) is essential to study infectious diseases. However, HPI data are sparse in existing molecular interaction databases, especially for agricultural host–pathogen systems. Therefore, resources that annotate, predict and display the HPI that underpin infectious diseases are critical for developing novel intervention strategies. HPIDB 2.0 (http://www.agbase.msstate.edu/hpi/main.html) is a resource for HPI data, and contains 45,238 manually curated entries in the current release. Since the first description of the database in 2010, multiple enhancements to HPIDB data and interface services were made that are described here. Notably, HPIDB 2.0 now provides targeted biocuration of molecular interaction data. As a member of the International Molecular Exchange consortium, annotations provided by HPIDB 2.0 curators meet community standards to provide detailed contextual experimental information and facilitate data sharing. Moreover, HPIDB 2.0 provides access to rapidly available community annotations that capture minimum molecular interaction information to address immediate researcher needs for HPI network analysis. In addition to curation, HPIDB 2.0 integrates HPI from existing external sources and contains tools to infer additional HPI where annotated data are scarce. Compared to other interaction databases, our data collection approach ensures HPIDB 2.0 users access the most comprehensive HPI data from a wide range of pathogens and their hosts (594 pathogen and 70 host species, as of February 2016). Improvements also include enhanced search capacity, addition of Gene Ontology functional information, and implementation of network visualization. The changes made to HPIDB 2.0 content and interface ensure that users, especially agricultural researchers, are able to easily access and analyse high quality, comprehensive HPI data. All HPIDB 2.0 data are updated regularly, are publicly available for direct download, and are disseminated to other molecular interaction resources. Database URL: http://www.agbase.msstate.edu/hpi/main.html PMID:27374121

  10. A statistical approach to identify, monitor, and manage incomplete curated data sets.

    PubMed

    Howe, Douglas G

    2018-04-02

    Many biological knowledge bases gather data through expert curation of published literature. High data volume, selective partial curation, delays in access, and publication of data prior to the ability to curate it can result in incomplete curation of published data. Knowing which data sets are incomplete and how incomplete they are remains a challenge. Awareness that a data set may be incomplete is important for proper interpretation and for avoiding flawed hypothesis generation, and can justify further exploration of published literature for additional relevant data. Computational methods to assess data set completeness are needed. One such method is presented here. In this work, a multivariate linear regression model was used to identify genes in the Zebrafish Information Network (ZFIN) Database having incomplete curated gene expression data sets. Starting with 36,655 gene records from ZFIN, data aggregation, cleansing, and filtering reduced the set to 9870 gene records suitable for training and testing the model to predict the number of expression experiments per gene. Feature engineering and selection identified the following predictive variables: the number of journal publications; the number of journal publications already attributed for gene expression annotation; the percent of journal publications already attributed for expression data; the gene symbol; and the number of transgenic constructs associated with each gene. Twenty-five percent of the gene records (2483 genes) were used to train the model. The remaining 7387 genes were used to test the model. One hundred and twenty-two and 165 of the 7387 tested genes were identified as missing expression annotations based on their residuals being outside the model lower or upper 95% confidence interval, respectively. The model had precision of 0.97 and recall of 0.71 at the negative 95% confidence interval and precision of 0.76 and recall of 0.73 at the positive 95% confidence interval. This method can be used to identify data sets that are incompletely curated, as demonstrated using the gene expression data set from ZFIN. This information can help both database resources and data consumers gauge when it may be useful to look further for published data to augment the existing expertly curated information.
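
    The residual-based flagging described in this abstract can be sketched in Python as follows. This is a simplified illustration, not the published model: it assumes a single predictor (number of publications) instead of the multivariate model, approximates the 95% interval as 1.96 residual standard deviations estimated on the training set, and uses invented toy numbers. All function names are hypothetical.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for one predictor; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def residual_sd(xs, ys, slope, intercept):
    """Residual standard deviation with n - 2 degrees of freedom."""
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    return math.sqrt(sum(r * r for r in residuals) / (len(residuals) - 2))


def flag(xs, ys, slope, intercept, sd, z=1.96):
    """Indices whose residual falls below or above the approximate 95% band."""
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    low = [i for i, r in enumerate(residuals) if r < -z * sd]
    high = [i for i, r in enumerate(residuals) if r > z * sd]
    return low, high


def precision_recall(flagged, truly_incomplete):
    """Precision and recall of the flagged indices against a known truth set."""
    flagged, truth = set(flagged), set(truly_incomplete)
    tp = len(flagged & truth)
    precision = tp / len(flagged) if flagged else 0.0
    recall = tp / len(truth) if truth else 0.0
    return precision, recall


# Invented training data: publications per gene vs. expression experiments.
train_x = list(range(1, 11))
train_y = [2 * x + (0.5 if x % 2 else -0.5) for x in train_x]
slope, intercept = fit_line(train_x, train_y)
sd = residual_sd(train_x, train_y, slope, intercept)

# Invented test genes: index 2 has far fewer experiments than predicted,
# index 3 has far more.
test_x = [3, 6, 8, 12]
test_y = [6.0, 12.5, 4.0, 30.0]
low, high = flag(test_x, test_y, slope, intercept, sd)
print(low, high)
print(precision_recall(low, [1, 2]))
```

    Fitting on one subset and flagging on another mirrors the train/test split in the abstract; the precision and recall values there come from the authors' own evaluation and are not reproduced by this toy data.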

  11. Recognition of the polycistronic nature of human genes is critical to understanding the genotype-phenotype relationship.

    PubMed

    Brunet, Marie A; Levesque, Sébastien A; Hunting, Darel J; Cohen, Alan A; Roucou, Xavier

    2018-05-01

    Technological advances promise unprecedented opportunities for whole exome sequencing and proteomic analyses of populations. Currently, data from genome and exome sequencing or proteomic studies are searched against reference genome annotations. This provides the foundation for research and clinical screening for genetic causes of pathologies. However, current genome annotations substantially underestimate the proteomic information encoded within a gene. Numerous studies have now demonstrated the expression and function of alternative (mainly small, sometimes overlapping) ORFs within mature gene transcripts. This has important consequences for the correlation of phenotypes and genotypes. Most alternative ORFs are not yet annotated because of a lack of evidence, and this absence from databases precludes their detection by standard proteomic methods, such as mass spectrometry. Here, we demonstrate how current approaches tend to overlook alternative ORFs, hindering the discovery of new genetic drivers and fundamental research. We discuss available tools and techniques to improve identification of proteins from alternative ORFs and finally suggest a novel annotation system to permit a more complete representation of the transcriptomic and proteomic information contained within a gene. Given the crucial challenge of distinguishing functional ORFs from random ones, the suggested pipeline emphasizes both experimental data and conservation signatures. The addition of alternative ORFs in databases will render identification less serendipitous and advance the pace of research and genomic knowledge. This review highlights the urgent medical and research need to incorporate alternative ORFs in current genome annotations and thus permit their inclusion in hypotheses and models, which relate phenotypes and genotypes. © 2018 Brunet et al.; Published by Cold Spring Harbor Laboratory Press.

  12. An integrative approach to inferring biologically meaningful gene modules

    PubMed Central

    2011-01-01

    Background The ability to construct biologically meaningful gene networks and modules is critical for contemporary systems biology. Though recent studies have demonstrated the power of using gene modules to shed light on the functioning of complex biological systems, most modules in these networks have shown little association with meaningful biological function. We have devised a method which directly incorporates gene ontology (GO) annotation in construction of gene modules in order to gain better functional association. Results We have devised a method, Semantic Similarity-Integrated approach for Modularization (SSIM) that integrates various gene-gene pairwise similarity values, including information obtained from gene expression, protein-protein interactions and GO annotations, in the construction of modules using affinity propagation clustering. We demonstrated the performance of the proposed method using data from two complex biological responses: (1) the osmotic shock response in Saccharomyces cerevisiae, and (2) the prion-induced pathogenic mouse model. In comparison with two previously reported algorithms, modules identified by SSIM showed significantly stronger association with biological functions. Conclusions The incorporation of semantic similarity based on GO annotation with gene expression and protein-protein interaction data can greatly enhance the functional relevance of inferred gene modules. In addition, the SSIM approach can also reveal the hierarchical structure of gene modules to gain a broader functional view of the biological system. Hence, the proposed method can facilitate comprehensive and in-depth analysis of high-throughput experimental data at the gene network level. PMID:21791051

  13. The BioC-BioGRID corpus: full text articles annotated for curation of protein–protein and genetic interactions

    PubMed Central

    Kim, Sun; Chatr-aryamontri, Andrew; Chang, Christie S.; Oughtred, Rose; Rust, Jennifer; Wilbur, W. John; Comeau, Donald C.; Dolinski, Kara; Tyers, Mike

    2017-01-01

    A great deal of information on the molecular genetics and biochemistry of model organisms has been reported in the scientific literature. However, this data is typically described in free text form and is not readily amenable to computational analyses. To this end, the BioGRID database systematically curates the biomedical literature for genetic and protein interaction data. This data is provided in a standardized computationally tractable format and includes structured annotation of experimental evidence. BioGRID curation necessarily involves substantial human effort by expert curators who must read each publication to extract the relevant information. Computational text-mining methods offer the potential to augment and accelerate manual curation. To facilitate the development of practical text-mining strategies, a new challenge was organized in BioCreative V for the BioC task, the collaborative Biocurator Assistant Task. This was a non-competitive, cooperative task in which the participants worked together to build BioC-compatible modules into an integrated pipeline to assist BioGRID curators. As an integral part of this task, a test collection of full text articles was developed that contained both biological entity annotations (gene/protein and organism/species) and molecular interaction annotations (protein–protein and genetic interactions (PPIs and GIs)). This collection, which we call the BioC-BioGRID corpus, was annotated by four BioGRID curators over three rounds of annotation and contains 120 full text articles curated in a dataset representing two major model organisms, namely budding yeast and human. The BioC-BioGRID corpus contains annotations for 6409 mentions of genes and their Entrez Gene IDs, 186 mentions of organism names and their NCBI Taxonomy IDs, 1867 mentions of PPIs and 701 annotations of PPI experimental evidence statements, 856 mentions of GIs and 399 annotations of GI evidence statements. The purpose, characteristics and possible future uses of the BioC-BioGRID corpus are detailed in this report. Database URL: http://bioc.sourceforge.net/BioC-BioGRID.html PMID:28077563

  14. LS-SNP/PDB: annotated non-synonymous SNPs mapped to Protein Data Bank structures.

    PubMed

    Ryan, Michael; Diekhans, Mark; Lien, Stephanie; Liu, Yun; Karchin, Rachel

    2009-06-01

    LS-SNP/PDB is a new WWW resource for genome-wide annotation of human non-synonymous (amino acid changing) SNPs. It serves high-quality protein graphics rendered with UCSF Chimera molecular visualization software. The system is kept up-to-date by an automated, high-throughput build pipeline that systematically maps human nsSNPs onto Protein Data Bank structures and annotates several biologically relevant features. LS-SNP/PDB is available at (http://ls-snp.icm.jhu.edu/ls-snp-pdb) and via links from protein data bank (PDB) biology and chemistry tabs, UCSC Genome Browser Gene Details and SNP Details pages and PharmGKB Gene Variants Downloads/Cross-References pages.

  15. The standard operating procedure of the DOE-JGI Microbial Genome Annotation Pipeline (MGAP v.4)

    DOE PAGES

    Huntemann, Marcel; Ivanova, Natalia N.; Mavromatis, Konstantinos; ...

    2015-10-26

    The DOE-JGI Microbial Genome Annotation Pipeline performs structural and functional annotation of microbial genomes that are further included into the Integrated Microbial Genome comparative analysis system. MGAP is applied to assembled nucleotide sequence datasets that are provided via the IMG submission site. Dataset submission for annotation first requires project and associated metadata description in GOLD. The MGAP sequence data processing consists of feature prediction including identification of protein-coding genes, non-coding RNAs and regulatory RNA features, as well as CRISPR elements. Structural annotation is followed by assignment of protein product names and functions.

  16. The standard operating procedure of the DOE-JGI Microbial Genome Annotation Pipeline (MGAP v.4)

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Huntemann, Marcel; Ivanova, Natalia N.; Mavromatis, Konstantinos

    The DOE-JGI Microbial Genome Annotation Pipeline performs structural and functional annotation of microbial genomes that are further included into the Integrated Microbial Genome comparative analysis system. MGAP is applied to assembled nucleotide sequence datasets that are provided via the IMG submission site. Dataset submission for annotation first requires project and associated metadata description in GOLD. The MGAP sequence data processing consists of feature prediction including identification of protein-coding genes, non-coding RNAs and regulatory RNA features, as well as CRISPR elements. Structural annotation is followed by assignment of protein product names and functions.

  17. Genome-wide profiling of 24 hr diel rhythmicity in the water flea, Daphnia pulex: network analysis reveals rhythmic gene expression and enhances functional gene annotation.

    PubMed

    Rund, Samuel S C; Yoo, Boyoung; Alam, Camille; Green, Taryn; Stephens, Melissa T; Zeng, Erliang; George, Gary F; Sheppard, Aaron D; Duffield, Giles E; Milenković, Tijana; Pfrender, Michael E

    2016-08-18

    Marine and freshwater zooplankton exhibit daily rhythmic patterns of behavior and physiology which may be regulated directly by the light:dark (LD) cycle and/or a molecular circadian clock. One of the best-studied zooplankton taxa, the freshwater crustacean Daphnia, has a 24 h diel vertical migration (DVM) behavior whereby the organism travels up and down through the water column daily. DVM plays a critical role in resource tracking and the behavioral avoidance of predators and damaging ultraviolet radiation. However, there is little information at the transcriptional level linking the expression patterns of genes to the rhythmic physiology/behavior of Daphnia. Here we analyzed genome-wide temporal transcriptional patterns from Daphnia pulex collected over a 44 h time period under a 12:12 LD cycle (diel) conditions using a cosine-fitting algorithm. We used a comprehensive network modeling and analysis approach to identify novel co-regulated rhythmic genes that have similar network topological properties and functional annotations as rhythmic genes identified by the cosine-fitting analyses. Furthermore, we used the network approach to predict with high accuracy novel gene-function associations, thus enhancing current functional annotations available for genes in this ecologically relevant model species. Our results reveal that genes in many functional groupings exhibit 24 h rhythms in their expression patterns under diel conditions. We highlight the rhythmic expression of immunity, oxidative detoxification, and sensory process genes. We discuss differences in the chronobiology of D. pulex from other well-characterized terrestrial arthropods. This research adds to a growing body of literature suggesting the genetic mechanisms governing rhythmicity in crustaceans may be divergent from other arthropod lineages including insects. 
Lastly, these results highlight the power of using a network analysis approach to identify differential gene expression and provide novel functional annotation.
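
    The cosine-fitting step mentioned in the abstract above can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes samples taken every 4 h over 48 h (the abstract does not state the sampling interval), a fixed 24 h period, and uses synthetic expression values. The function name `cosinor_fit` is hypothetical.

```python
import math


def cosinor_fit(times_h, values, period_h=24.0):
    """Fit y = mesor + amplitude * cos(w * (t - peak_time)) by projection.

    This shortcut is valid only when the samples are equally spaced and
    span a whole number of periods, so that the cosine and sine regressors
    are orthogonal. Irregular sampling needs a full least-squares solve.
    """
    n = len(values)
    w = 2.0 * math.pi / period_h
    mesor = sum(values) / n
    # Projections onto the cosine and sine components.
    a = 2.0 / n * sum(y * math.cos(w * t) for t, y in zip(times_h, values))
    b = 2.0 / n * sum(y * math.sin(w * t) for t, y in zip(times_h, values))
    amplitude = math.hypot(a, b)
    peak_time_h = (math.atan2(b, a) / w) % period_h
    return mesor, amplitude, peak_time_h


# Synthetic gene: mean level 10, amplitude 3, peak expression at hour 6.
# The 4 h sampling interval is an assumption made for this illustration.
times = list(range(0, 48, 4))
expr = [10.0 + 3.0 * math.cos(2.0 * math.pi * (t - 6) / 24.0) for t in times]
mesor, amplitude, peak = cosinor_fit(times, expr)
```

    A gene would then be called rhythmic when the fitted amplitude is large relative to the residual noise; the significance test for that call is outside the scope of this sketch.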

  18. Transcriptome profiling of claw muscle of the mud crab (Scylla paramamosain) at different fattening stages

    PubMed Central

    Jiang, Qingling; Bao, Chenchang; Yang, Ya’nan; Liu, An; Liu, Fang; Huang, Huiyang; Ye, Haihui

    2017-01-01

    In crustaceans, muscle growth and development are complicated, and substantial knowledge gaps remain. In this study, the claw muscle, hepatopancreas and nervous tissue of the mud crab (Scylla paramamosain) were collected at three fattening stages and sequenced on the Illumina platform. A total of 127.87 Gb of clean data was obtained, with no less than 3.94 Gb generated for each sample and cycleQ30 percentages above 86.13% for all samples. De Bruijn assembly of these clean data produced 94,853 unigenes, of which 50,059 were found in claw muscle. A total of 121 differentially expressed genes (DEGs) were revealed in claw muscle from the three fattening stages with a Padj value < 0.01, including 63 genes with annotation. Functional annotation and enrichment analysis showed that the DEG clusters represented the predominant gene catalog with roles in biochemical processes (glycolysis, phosphorylation and regulation of transcription), molecular function (ATP binding, 6-phosphofructokinase activity, and sequence-specific DNA binding) and cellular component (6-phosphofructokinase complex, plasma membrane, and integral component of membrane). qRT-PCR was employed to further validate certain DEGs. Single nucleotide polymorphism (SNP) analysis obtained 159,322, 125,963 and 166,279 potential SNPs from the muscle transcriptome at stage B, stage C and stage D, respectively. In addition, sixteen neuropeptide transcripts were predicted in the claw muscle. The present study provides a comprehensive transcriptome of claw muscle of S. paramamosain during fattening, providing a basis for screening the functional genes that may affect muscle growth of S. paramamosain. PMID:29141033

  19. Detection of alternative splice variants at the proteome level in Aspergillus flavus.

    PubMed

    Chang, Kung-Yen; Georgianna, D Ryan; Heber, Steffen; Payne, Gary A; Muddiman, David C

    2010-03-05

    Identification of proteins from proteolytic peptides or intact proteins plays an essential role in proteomics. Researchers use search engines to match the acquired peptide sequences to the target proteins. However, search engines depend on protein databases to provide candidates for consideration. Alternative splicing (AS), the mechanism by which the exons of pre-mRNAs can be spliced and rearranged to generate distinct mRNA and therefore protein variants, enables higher eukaryotic organisms, with only a limited number of genes, to have the requisite complexity and diversity at the proteome level. Multiple alternative isoforms from one gene often share common segments of sequences. However, many protein databases only include a limited number of isoforms to keep redundancy minimal. As a result, the database search might not identify a target protein even with high quality tandem MS data and accurate intact precursor ion mass. We computationally predicted an exhaustive list of putative isoforms of Aspergillus flavus proteins from 20 371 expressed sequence tags to investigate whether an alternative splicing protein database can assign a greater proportion of mass spectrometry data. The newly constructed AS database provided 9807 new alternatively spliced variants in addition to 12 832 previously annotated proteins. The searches of the existing tandem MS spectra data set using the AS database identified 29 new proteins encoded by 26 genes. Nine fungal genes appeared to have multiple protein isoforms. In addition to the discovery of splice variants, the AS database also showed potential to improve genome annotation. In summary, the introduction of an alternative splicing database helps identify more proteins and unveils more information about a proteome.

  20. The Essential Genome of Escherichia coli K-12

    PubMed Central

    2018-01-01

    Transposon-directed insertion site sequencing (TraDIS) is a high-throughput method coupling transposon mutagenesis with short-fragment DNA sequencing. It is commonly used to identify essential genes. Single gene deletion libraries are considered the gold standard for identifying essential genes. Currently, the TraDIS method has not been benchmarked against such libraries, and therefore, it remains unclear whether the two methodologies are comparable. To address this, a high-density transposon library was constructed in Escherichia coli K-12. Essential genes predicted from sequencing of this library were compared to existing essential gene databases. To decrease false-positive identification of essential genes, statistical data analysis included corrections for both gene length and genome length. Through this analysis, new essential genes and genes previously incorrectly designated essential were identified. We show that manual analysis of TraDIS data reveals novel features that would not have been detected by statistical analysis alone. Examples include short essential regions within genes, orientation-dependent effects, and fine-resolution identification of genome and protein features. Recognition of these insertion profiles in transposon mutagenesis data sets will assist genome annotation of less well characterized genomes and provides new insights into bacterial physiology and biochemistry. PMID:29463657

  1. Genome-wide screening of indicator genes for assessing the potential carcinogenic risk of Nanjing city drinking water.

    PubMed

    Zhang, Rui; Cheng, Shupei; Li, Aimin; Sun, Jie; Zhang, Yan; Zhang, Xuxiang

    2011-07-01

    Effects of all pollutants existing in the Nanjing city drinking water (DWNC) on mouse gene transcription levels were measured to assess the DWNC carcinogenic risks and to identify candidate indicator genes for assessing and providing early warning of cancer risks. Transcriptional expression levels of 14,000 hepatic genes for the treatment group mice (Mus musculus, ICR) fed with DWNC for 90 days were detected using the GeneChip(®) Mouse Genome 430A 2.0 array. The analysis indicated that the transcriptional levels of 294 genes were up-regulated and 542 were down-regulated. Of these genes, 12 that were identified as involved in at least five different types of cancer were further analyzed. An interrogation of the Kyoto Encyclopedia of Genes and Genomes (KEGG) revealed that three of the 12 genes (ITGAV, CCND1 and SMAD2) were mapped to cancer pathways. Gene Ontology (GO) function annotation also showed that they were associated with the functional categories of cell cycle regulation, adhesion, apoptosis and signal transduction, among others, which are closely implicated in tumorigenesis and progression. The correlations between their aberrant expression and the genesis and progression of cancers have been further documented by a number of studies. These results suggest that potential toxicity and carcinogenic risks were associated with DWNC. Moreover, ITGAV, CCND1 and SMAD2 were identified as the most likely candidate indicator genes for the assessment of the combined carcinogenic risk of all pollutants existing in DWNC.

  2. Lessons learned from the initial sequencing of the pig genome: comparative analysis of an 8 Mb region of pig chromosome 17

    PubMed Central

    Hart, Elizabeth A; Caccamo, Mario; Harrow, Jennifer L; Humphray, Sean J; Gilbert, James GR; Trevanion, Steve; Hubbard, Tim; Rogers, Jane; Rothschild, Max F

    2007-01-01

    Background We describe here the sequencing, annotation and comparative analysis of an 8 Mb region of pig chromosome 17, which provides a useful test region to assess coverage and quality for the pig genome sequencing project. We report our findings comparing the annotation of draft sequence assembled at different depths of coverage. Results Within this region we annotated 71 loci, of which 53 are orthologous to human known coding genes. When compared to the syntenic regions in human (20q13.13-q13.33) and mouse (chromosome 2, 167.5 Mb-178.3 Mb), this region was found to be highly conserved with respect to gene order. The most notable difference between the three species is the presence of a large expansion of zinc finger coding genes and pseudogenes on mouse chromosome 2 between Edn3 and Phactr3 that is absent from pig and human. All of our annotation has been made publicly available in the Vertebrate Genome Annotation browser, VEGA. We assessed the impact of coverage on sequence assembly across this region and found, as expected, that increased sequence depth resulted in fewer, longer contigs. One-third of our annotated loci could not be fully re-aligned back to the low coverage version of the sequence, principally because the transcripts are fragmented over several contigs. Conclusion We have demonstrated the considerable advantages of sequencing at increased read depths and discuss the implications that lower coverage sequence may have on subsequent comparative and functional studies, particularly those involving complex loci such as GNAS. PMID:17705864

  3. Highlighting the Need for Systems-Level Experimental Characterization of Plant Metabolic Enzymes.

    PubMed

    Engqvist, Martin K M

    2016-01-01

    The biology of living organisms is determined by the action and interaction of a large number of individual gene products, each with specific functions. Discovering and annotating the function of gene products is key to our understanding of these organisms. Controlled experiments and bioinformatic predictions both contribute to functional gene annotation. For most species it is difficult to gain an overview of what portion of gene annotations are based on experiments and what portion represent predictions. Here, I survey the current state of experimental knowledge of enzymes and metabolism in Arabidopsis thaliana as well as eleven economically important crops and forestry trees - with a particular focus on reactions involving organic acids in central metabolism. I illustrate the limited availability of experimental data for functional annotation of enzymes in most of these species. Many enzymes involved in metabolism of citrate, malate, fumarate, lactate, and glycolate in crops and forestry trees have not been characterized. Furthermore, enzymes involved in key biosynthetic pathways which shape important traits in crops and forestry trees have not been characterized. I argue for the development of novel high-throughput platforms with which limited functional characterization of gene products can be performed quickly and relatively cheaply. I refer to this approach as systems-level experimental characterization. The data collected from such platforms would form a layer intermediate between bioinformatic gene function predictions and in-depth experimental studies of these functions. Such a data layer would greatly aid in the pursuit of understanding a multiplicity of biological processes in living organisms.

  4. Archaeal Clusters of Orthologous Genes (arCOGs): An Update and Application for Analysis of Shared Features between Thermococcales, Methanococcales, and Methanobacteriales

    PubMed Central

    Makarova, Kira S.; Wolf, Yuri I.; Koonin, Eugene V.

    2015-01-01

    With the continuously accelerating genome sequencing from diverse groups of archaea and bacteria, accurate identification of gene orthology and availability of readily expandable clusters of orthologous genes are essential for the functional annotation of new genomes. We report an update of the collection of archaeal Clusters of Orthologous Genes (arCOGs) to cover, on average, 91% of the protein-coding genes in 168 archaeal genomes. The new arCOGs were constructed using refined algorithms for orthology identification combined with extensive manual curation, including incorporation of the results of several completed and ongoing research projects in archaeal genomics. A new level of classification is introduced, superclusters that unite two or more arCOGs and more completely reflect gene family evolution than individual, disconnected arCOGs. Assessment of the current archaeal genome annotation in public databases indicates that consistent use of arCOGs can significantly improve the annotation quality. In addition to their utility for genome annotation, arCOGs also are a platform for phylogenomic analysis. We explore this aspect of arCOGs by performing a phylogenomic study of the Thermococci that are traditionally viewed as the basal branch of the Euryarchaeota. The results of phylogenomic analysis that involved both comparison of multiple phylogenetic trees and a search for putative derived shared characters by using phyletic patterns extracted from the arCOGs reveal a likely evolutionary relationship between the Thermococci, Methanococci, and Methanobacteria. The arCOGs are expected to be instrumental for a comprehensive phylogenomic study of the archaea. PMID:25764277

  5. Large-Scale Sequencing of Two Regions in Human Chromosome 7q22: Analysis of 650 kb of Genomic Sequence around the EPO and CUTL1 Loci Reveals 17 Genes

    PubMed Central

    Glöckner, Gernot; Scherer, Stephen; Schattevoy, Ruben; Boright, Andrew; Weber, Jacqueline; Tsui, Lap-Chee; Rosenthal, André

    1998-01-01

    We have sequenced and annotated two genomic regions located in the Giemsa negative band q22 of human chromosome 7. The first region defined by the erythropoietin (EPO) locus is 228 kb in length and contains 13 genes. Whereas 3 genes (GNB2, EPO, PCOLCE) were known previously on the mRNA level, we have been able to identify 10 novel genes using a newly developed automatic annotation tool RUMMAGE-DP, which comprises >26 different programs mainly for exon prediction, homology searches, and compositional and repeat analysis. For precise annotation we have also resequenced ESTs identified to the region and assembled them to build large cDNAs. In addition, we have investigated the differential splicing of genes. Using these tools we annotated 4 of the 10 genes as a zonadhesin, a transferrin homolog, a nucleoporin-like gene, and an actin gene. Two genes showed weak similarity to an insulin-like receptor and a neuronal protein with a leucine-rich amino-terminal domain. Four predicted genes (CDS1–CDS4) that have been confirmed on the mRNA level showed no similarity to known proteins and a potential function could not be assigned. The second region in 7q22 defined by the CUTL1 (CCAAT displacement protein and its splice variant) locus is 416 kb in length and contains three known genes, including PMSL12, APS, CUTL1, and a novel gene (CDS5). The CUTL1 locus, consisting of two splice variants (CDP and CASP), occupies >300 kb. Based on the G,C profile an isochore switch can be defined between the CUTL1 gene and the APS and PMSL12 genes. [Clones 37G3, 164c7, and 235f8 are deposited in GenBank under accession no. AF053356; clone 123e15, accession no. AF024533; 186d2, accession no. AF024534; 46f6, accession no. AF006752; 50h2, accession no. AF047825; and 76h2, accession no. AF030453] PMID:9799793

  6. Gene discovery in the hamster: a comparative genomics approach for gene annotation by sequencing of hamster testis cDNAs

    PubMed Central

    Oduru, Sreedhar; Campbell, Janee L; Karri, SriTulasi; Hendry, William J; Khan, Shafiq A; Williams, Simon C

    2003-01-01

    Background Complete genome annotation will likely be achieved through a combination of computer-based analysis of available genome sequences combined with direct experimental characterization of expressed regions of individual genomes. We have utilized a comparative genomics approach involving the sequencing of randomly selected hamster testis cDNAs to begin to identify genes not previously annotated on the human, mouse, rat and Fugu (pufferfish) genomes. Results 735 distinct sequences were analyzed for their relatedness to known sequences in public databases. Eight of these sequences were derived from previously unidentified genes and expression of these genes in testis was confirmed by Northern blotting. The genomic locations of each sequence were mapped in human, mouse, rat and pufferfish, where applicable, and the structure of their cognate genes was derived using computer-based predictions, genomic comparisons and analysis of uncharacterized cDNA sequences from human and macaque. Conclusion The use of a comparative genomics approach resulted in the identification of eight cDNAs that correspond to previously uncharacterized genes in the human genome. The proteins encoded by these genes included a new member of the kinesin superfamily, a SET/MYND-domain protein, and six proteins for which no specific function could be predicted. Each gene was expressed primarily in testis, suggesting that they may play roles in the development and/or function of testicular cells. PMID:12783626

  7. pGenN, a Gene Normalization Tool for Plant Genes and Proteins in Scientific Literature

    PubMed Central

    Ding, Ruoyao; Arighi, Cecilia N.; Lee, Jung-Youn; Wu, Cathy H.; Vijay-Shanker, K.

    2015-01-01

    Background Automatically detecting gene/protein names in the literature and connecting them to databases records, also known as gene normalization, provides a means to structure the information buried in free-text literature. Gene normalization is critical for improving the coverage of annotation in the databases, and is an essential component of many text mining systems and database curation pipelines. Methods In this manuscript, we describe a gene normalization system specifically tailored for plant species, called pGenN (pivot-based Gene Normalization). The system consists of three steps: dictionary-based gene mention detection, species assignment, and intra species normalization. We have developed new heuristics to improve each of these phases. Results We evaluated the performance of pGenN on an in-house expertly annotated corpus consisting of 104 plant relevant abstracts. Our system achieved an F-value of 88.9% (Precision 90.9% and Recall 87.2%) on this corpus, outperforming state-of-art systems presented in BioCreative III. We have processed over 440,000 plant-related Medline abstracts using pGenN. The gene normalization results are stored in a local database for direct query from the pGenN web interface (proteininformationresource.org/pgenn/). The annotated literature corpus is also publicly available through the PIR text mining portal (proteininformationresource.org/iprolink/). PMID:26258475

  8. MicroScope-an integrated resource for community expertise of gene functions and comparative analysis of microbial genomic and metabolic data.

    PubMed

    Médigue, Claudine; Calteau, Alexandra; Cruveiller, Stéphane; Gachet, Mathieu; Gautreau, Guillaume; Josso, Adrien; Lajus, Aurélie; Langlois, Jordan; Pereira, Hugo; Planel, Rémi; Roche, David; Rollin, Johan; Rouy, Zoe; Vallenet, David

    2017-09-12

    The overwhelming list of new bacterial genomes becoming available on a daily basis makes accurate genome annotation an essential step that ultimately determines the relevance of thousands of genomes stored in public databanks. The MicroScope platform (http://www.genoscope.cns.fr/agc/microscope) is an integrative resource that supports systematic and efficient revision of microbial genome annotation, data management and comparative analysis. Starting from the results of our syntactic, functional and relational annotation pipelines, MicroScope provides an integrated environment for the expert annotation and comparative analysis of prokaryotic genomes. It combines tools and graphical interfaces to analyze genomes and to perform the manual curation of gene function in a comparative genomics and metabolic context. In this article, we describe the free-of-charge MicroScope services for the annotation and analysis of microbial (meta)genomes, transcriptomic and re-sequencing data. Then, the functionalities of the platform are presented in a way that provides practical guidance to nonspecialists in bioinformatics. Newly integrated analysis tools (i.e. prediction of virulence and resistance genes in bacterial genomes) and a recently developed original method (the pan-genome graph representation) are also described. Integrated environments such as MicroScope clearly contribute, through the user community, to maintaining accurate resources. © The Author 2017. Published by Oxford University Press.

  9. Tissue-specific Proteogenomic Analysis of Plutella xylostella Larval Midgut Using a Multialgorithm Pipeline.

    PubMed

    Zhu, Xun; Xie, Shangbo; Armengaud, Jean; Xie, Wen; Guo, Zhaojiang; Kang, Shi; Wu, Qingjun; Wang, Shaoli; Xia, Jixing; He, Rongjun; Zhang, Youjun

    2016-06-01

    The diamondback moth, Plutella xylostella (L.), is the major cosmopolitan pest of brassica and other cruciferous crops. Its larval midgut is a dynamic tissue that interfaces with a wide variety of toxicological and physiological processes. The draft sequence of the P. xylostella genome was recently released, but its annotation remains challenging because of the low sequence coverage of this branch of life and the poor description of exon/intron splicing rules for these insects. Peptide sequencing by computational assignment of tandem mass spectra to genome sequence information provides an experimentally independent approach for confirming or refuting protein predictions, a concept that has been termed proteogenomics. In this study, we carried out an in-depth proteogenomic analysis to complement genome annotation of P. xylostella larval midgut based on shotgun HPLC-ESI-MS/MS data by means of a multialgorithm pipeline. A total of 876,341 tandem mass spectra were searched against the predicted P. xylostella protein sequences and a whole-genome six-frame translation database. Based on a data set comprising 2694 novel genome search specific peptides, we discovered 439 novel protein-coding genes and corrected 128 existing gene models. To get the most accurate data to seed further insect genome annotation, more than half of the novel protein-coding genes, i.e. 235 of 439, were further validated after RT-PCR amplification and sequencing of the corresponding transcripts. Furthermore, we validated 53 novel alternative splicings. Finally, a total of 6764 proteins were identified, resulting in one of the most comprehensive proteogenomic studies of a nonmodel animal. As the first tissue-specific proteogenomics analysis of P. xylostella, this study provides the fundamental basis for high-throughput proteomics and functional genomics approaches aimed at deciphering the molecular mechanisms of resistance and controlling this pest.
© 2016 by The American Society for Biochemistry and Molecular Biology, Inc.

  10. GRN2SBML: automated encoding and annotation of inferred gene regulatory networks complying with SBML.

    PubMed

    Vlaic, Sebastian; Hoffmann, Bianca; Kupfer, Peter; Weber, Michael; Dräger, Andreas

    2013-09-01

    GRN2SBML automatically encodes gene regulatory networks derived from several inference tools in the Systems Biology Markup Language (SBML). Through a graphical user interface, the networks can be annotated via the simple object access protocol (SOAP)-based application programming interface of BioMart Central Portal and the Minimum Information Required in the Annotation of Models (MIRIAM) registry. Additionally, we provide an R-package, which processes the output of supported inference algorithms and automatically passes all required parameters to GRN2SBML. Therefore, GRN2SBML closes a gap in the processing pipeline between the inference of gene regulatory networks and their subsequent analysis, visualization and storage. GRN2SBML is freely available under the GNU General Public License version 3 and can be downloaded from http://www.hki-jena.de/index.php/0/2/490. General information on GRN2SBML, examples and tutorials are available at the tool's web page.

  11. Sheep genome functional annotation reveals proximal regulatory elements contributed to the evolution of modern breeds.

    PubMed

    Naval-Sanchez, Marina; Nguyen, Quan; McWilliam, Sean; Porto-Neto, Laercio R; Tellam, Ross; Vuocolo, Tony; Reverter, Antonio; Perez-Enciso, Miguel; Brauning, Rudiger; Clarke, Shannon; McCulloch, Alan; Zamani, Wahid; Naderi, Saeid; Rezaei, Hamid Reza; Pompanon, Francois; Taberlet, Pierre; Worley, Kim C; Gibbs, Richard A; Muzny, Donna M; Jhangiani, Shalini N; Cockett, Noelle; Daetwyler, Hans; Kijas, James

    2018-02-28

    Domestication fundamentally reshaped animal morphology, physiology and behaviour, offering the opportunity to investigate the molecular processes driving evolutionary change. Here we assess sheep domestication and artificial selection by comparing genome sequence from 43 modern breeds (Ovis aries) and their Asian mouflon ancestor (O. orientalis) to identify selection sweeps. Next, we provide a comparative functional annotation of the sheep genome, validated using experimental ChIP-Seq of sheep tissue. Using these annotations, we evaluate the impact of selection and domestication on regulatory sequences and find that sweeps are significantly enriched for protein coding genes, proximal regulatory elements of genes and genome features associated with active transcription. Finally, we find individual sites displaying strong allele frequency divergence are enriched for the same regulatory features. Our data demonstrate that remodelling of gene expression is likely to have been one of the evolutionary forces that drove phenotypic diversification of this common livestock species.

  12. Statistical assessment of crosstalk enrichment between gene groups in biological networks.

    PubMed

    McCormack, Theodore; Frings, Oliver; Alexeyenko, Andrey; Sonnhammer, Erik L L

    2013-01-01

    Analyzing groups of functionally coupled genes or proteins in the context of global interaction networks has become an important aspect of bioinformatic investigations. Assessing the statistical significance of crosstalk enrichment between or within groups of genes can be a valuable tool for functional annotation of experimental gene sets. Here we present CrossTalkZ, a statistical method and software to assess the significance of crosstalk enrichment between pairs of gene or protein groups in large biological networks. We demonstrate that the standard z-score is generally an appropriate and unbiased statistic. We further evaluate the ability of four different methods to reliably recover crosstalk within known biological pathways. We conclude that the methods preserving the second-order topological network properties perform best. Finally, we show how CrossTalkZ can be used to annotate experimental gene sets using known pathway annotations and that its performance at this task is superior to gene enrichment analysis (GEA). CrossTalkZ (available at http://sonnhammer.sbc.su.se/download/software/CrossTalkZ/) is implemented in C++, easy to use, fast, accepts various input file formats, and produces a number of statistics. These include z-score, p-value, false discovery rate, and a test of normality for the null distributions.
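
    The z-score and p-value reported by tools of this kind can be illustrated with a short sketch. It is not the CrossTalkZ implementation (which is written in C++): the link counts are made up, the null distribution is assumed to come from randomized networks generated elsewhere, and the use of the sample standard deviation and a normal approximation are assumptions of this example.

```python
import math
import statistics


def crosstalk_z_score(observed_links, null_links):
    """Compare an observed inter-group link count with a null distribution.

    null_links holds the link counts between the same two gene groups in
    randomized versions of the network (hypothetical input).
    """
    mu = statistics.mean(null_links)
    sigma = statistics.stdev(null_links)  # sample standard deviation (assumption)
    z = (observed_links - mu) / sigma
    # Two-sided p-value under a normal approximation of the null.
    p_two_sided = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_two_sided


# Made-up counts: 30 links observed, about 20 expected by chance.
observed = 30
null_counts = [18, 22, 20, 19, 21]
z, p = crosstalk_z_score(observed, null_counts)
```

    The normal approximation is only reasonable if the null counts are roughly normally distributed, which is why a normality test on the null distribution, as listed in the abstract, is a useful companion statistic.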

  13. Annotation of the Asian Citrus Psyllid Genome Reveals a Reduced Innate Immune System

    PubMed Central

    Arp, Alex P.; Hunter, Wayne B.; Pelz-Stelinski, Kirsten S.

    2016-01-01

    Citrus production worldwide is currently facing significant losses due to citrus greening disease, also known as Huanglongbing. The citrus greening bacterium, Candidatus Liberibacter asiaticus (CLas), is a persistent propagative pathogen transmitted by the Asian citrus psyllid, Diaphorina citri Kuwayama (Hemiptera: Liviidae). Hemipterans characterized to date lack a number of insect immune genes, including those associated with the Imd pathway targeting Gram-negative bacteria. The D. citri draft genome was used to characterize the immune defense genes present in D. citri. Predicted mRNAs identified by screening the published D. citri annotated draft genome were manually searched using a custom database of immune genes from previously annotated insect genomes. Toll and JAK/STAT pathways, general defense genes Dual oxidase, Nitric oxide synthase, prophenoloxidase, and cellular immune defense genes were present in D. citri. In contrast, D. citri lacked genes for the Imd pathway, most antimicrobial peptides, 1,3-β-glucan recognition proteins (GNBPs), and complete peptidoglycan recognition proteins. These data suggest that D. citri has a reduced immune capability similar to that observed in A. pisum, P. humanus, and R. prolixus. The absence of immune system genes from the D. citri genome may facilitate CLas infections, and is possibly compensated for by their relationship with their microbial endosymbionts. PMID:27965582

  14. RATT: Rapid Annotation Transfer Tool

    PubMed Central

    Otto, Thomas D.; Dillon, Gary P.; Degrave, Wim S.; Berriman, Matthew

    2011-01-01

    Second-generation sequencing technologies have made large-scale sequencing projects commonplace. However, making use of these datasets often requires gene function to be ascribed genome wide. Although tool development has kept pace with the changes in sequence production, for tasks such as mapping, de novo assembly or visualization, genome annotation remains a challenge. We have developed a method to rapidly provide accurate annotation for new genomes using previously annotated genomes as a reference. The method, implemented in a tool called RATT (Rapid Annotation Transfer Tool), transfers annotations from a high-quality reference to a new genome on the basis of conserved synteny. We demonstrate that a Mycobacterium tuberculosis genome or a single 2.5 Mb chromosome from a malaria parasite can be annotated in less than five minutes with only modest computational resources. RATT is available at http://ratt.sourceforge.net. PMID:21306991

  15. DaGO-Fun: tool for Gene Ontology-based functional analysis using term information content measures

    PubMed Central

    2013-01-01

    Background: The use of Gene Ontology (GO) data in protein analyses has largely contributed to the improved outcomes of these analyses. Several GO semantic similarity measures have been proposed in recent years and provide tools that allow the integration of biological knowledge embedded in the GO structure into different biological analyses. There is a need for a unified tool that provides the scientific community with the opportunity to explore these different GO similarity measure approaches and their biological applications. Results: We have developed DaGO-Fun, an online tool available at http://web.cbio.uct.ac.za/ITGOM, which incorporates many different GO similarity measures for exploring, analyzing and comparing GO terms and proteins within the context of GO. It uses GO data and UniProt proteins with their GO annotations as provided by the Gene Ontology Annotation (GOA) project to precompute GO term information content (IC), enabling rapid response to user queries. Conclusions: The DaGO-Fun online tool presents the advantage of integrating all the relevant IC-based GO similarity measures, including topology- and annotation-based approaches, to facilitate effective exploration of these measures, thus enabling users to choose the most relevant approach for their application. Furthermore, this tool includes several biological applications related to GO semantic similarity scores, including the retrieval of genes based on their GO annotations, the clustering of functionally related genes within a set, and term enrichment analysis. PMID:24067102
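
    As a rough illustration of what precomputing term information content can look like, the sketch below uses the common annotation-based definition IC(t) = -log p(t), where p(t) is the fraction of annotated proteins carrying term t or any of its descendants. The abstract does not say exactly how DaGO-Fun computes IC, so this definition, the function names, and the tiny ontology with made-up term names are all assumptions for illustration.

```python
import math


def ancestors(term, parents):
    """Return the term together with all of its ancestors in the ontology DAG."""
    seen = set()
    stack = [term]
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(parents.get(t, ()))
    return seen


def information_content(annotations, parents):
    """Annotation-based IC: log(N / n_t), i.e. -log of the term's frequency.

    annotations: protein -> set of directly annotated terms
    parents:     term -> list of parent terms
    Each direct annotation is propagated to every ancestor term.
    """
    covered = {}  # term -> set of proteins annotated to it or a descendant
    for protein, terms in annotations.items():
        for term in terms:
            for t in ancestors(term, parents):
                covered.setdefault(t, set()).add(protein)
    total = len(annotations)
    return {t: math.log(total / len(p)) for t, p in covered.items()}


# Hypothetical four-term ontology and four annotated proteins.
parents = {
    "root": [],
    "binding": ["root"],
    "dna_binding": ["binding"],
    "catalysis": ["root"],
}
annotations = {
    "P1": {"dna_binding"},
    "P2": {"binding"},
    "P3": {"catalysis"},
    "P4": {"dna_binding", "catalysis"},
}

ic = information_content(annotations, parents)
for term in sorted(ic, key=ic.get):
    print(f"{term}: {ic[term]:.3f}")
```

    More specific terms annotate fewer proteins and therefore receive higher IC, while the root, which covers every protein, receives an IC of zero.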

  16. Data on the genome-wide identification of CNL R-genes in Setaria italica (L.) P. Beauv.

    PubMed

    Andersen, Ethan J; Nepal, Madhav P

    2017-08-01

    We report data associated with the identification of 242 disease resistance genes (R-genes) in the genome of Setaria italica as presented in "Genetic diversity of disease resistance genes in foxtail millet (Setaria italica L.)" (Andersen and Nepal, 2017) [1]. Our data describe the structure and evolution of the Coiled-coil, Nucleotide-binding site, Leucine-rich repeat (CNL) R-genes in foxtail millet. The CNL genes were identified through rigorous extraction and analysis of recently available plant genome sequences using cutting-edge analytical software. Data visualization includes gene structure diagrams, chromosomal syntenic maps, a chromosomal density plot, and a maximum-likelihood phylogenetic tree comparing Sorghum bicolor, Panicum virgatum, Setaria italica, and Arabidopsis thaliana. Compilation of InterProScan annotations, Gene Ontology (GO) annotations, and Basic Local Alignment Search Tool (BLAST) results for the 242 R-genes identified in the foxtail millet genome are also included in tabular format.

  17. Visualizing conserved gene location across microbe genomes

    NASA Astrophysics Data System (ADS)

    Shaw, Chris D.

    2009-01-01

    This paper introduces an analysis-based zoomable visualization technique for displaying the location of genes across many related species of microbes. The purpose of this visualization is to enable a biologist to examine the layout of genes in the organism of interest with respect to the gene organization of related organisms. During the genomic annotation process, the ability to observe gene organization in common with previously annotated genomes can help a biologist better confirm the structure and function of newly analyzed microbe DNA sequences. We have developed a visualization and analysis tool that enables the biologist to observe and examine gene organization among genomes, in the context of the primary sequence of interest. This paper describes the visualization and analysis steps, and presents a case study using a number of Rickettsia genomes.

  18. Comparative genomics approaches to understanding and manipulating plant metabolism.

    PubMed

    Bradbury, Louis M T; Niehaus, Tom D; Hanson, Andrew D

    2013-04-01

    Over 3000 genomes, including numerous plant genomes, are now sequenced. However, their annotation remains problematic as illustrated by the many conserved genes with no assigned function, vague annotations such as 'kinase', or even wrong ones. Around 40% of genes of unknown function that are conserved between plants and microbes are probably metabolic enzymes or transporters; finding functions for these genes is a major challenge. Comparative genomics has correctly predicted functions for many such genes by analyzing genomic context, and gene fusions, distributions and co-expression. Comparative genomics complements genetic and biochemical approaches to dissect metabolism, continues to increase in power and decrease in cost, and has a pivotal role in modeling and engineering by helping identify functions for all metabolic genes. Copyright © 2012 Elsevier Ltd. All rights reserved.

  19. Concept annotation in the CRAFT corpus.

    PubMed

    Bada, Michael; Eckert, Miriam; Evans, Donald; Garcia, Kristin; Shipley, Krista; Sitnikov, Dmitry; Baumgartner, William A; Cohen, K Bretonnel; Verspoor, Karin; Blake, Judith A; Hunter, Lawrence E

    2012-07-09

    Manually annotated corpora are critical for the training and evaluation of automated methods to identify concepts in biomedical text. This paper presents the concept annotations of the Colorado Richly Annotated Full-Text (CRAFT) Corpus, a collection of 97 full-length, open-access biomedical journal articles that have been annotated both semantically and syntactically to serve as a research resource for the biomedical natural-language-processing (NLP) community. CRAFT identifies all mentions of nearly all concepts from nine prominent biomedical ontologies and terminologies: the Cell Type Ontology, the Chemical Entities of Biological Interest ontology, the NCBI Taxonomy, the Protein Ontology, the Sequence Ontology, the entries of the Entrez Gene database, and the three subontologies of the Gene Ontology. The first public release includes the annotations for 67 of the 97 articles, reserving two sets of 15 articles for future text-mining competitions (after which these too will be released). Concept annotations were created based on a single set of guidelines, which has enabled us to achieve consistently high interannotator agreement. As the initial 67-article release contains more than 560,000 tokens (and the full set more than 790,000 tokens), our corpus is among the largest gold-standard annotated biomedical corpora. Unlike most others, the journal articles that comprise the corpus are drawn from diverse biomedical disciplines and are marked up in their entirety. Additionally, with a concept-annotation count of nearly 100,000 in the 67-article subset (and more than 140,000 in the full collection), the scale of conceptual markup is also among the largest of comparable corpora. The concept annotations of the CRAFT Corpus have the potential to significantly advance biomedical text mining by providing a high-quality gold standard for NLP systems. 
    The corpus, annotation guidelines, and other associated resources are freely available at http://bionlp-corpora.sourceforge.net/CRAFT/index.shtml.

  20. Concept annotation in the CRAFT corpus

    PubMed Central

    2012-01-01

    Background: Manually annotated corpora are critical for the training and evaluation of automated methods to identify concepts in biomedical text. Results: This paper presents the concept annotations of the Colorado Richly Annotated Full-Text (CRAFT) Corpus, a collection of 97 full-length, open-access biomedical journal articles that have been annotated both semantically and syntactically to serve as a research resource for the biomedical natural-language-processing (NLP) community. CRAFT identifies all mentions of nearly all concepts from nine prominent biomedical ontologies and terminologies: the Cell Type Ontology, the Chemical Entities of Biological Interest ontology, the NCBI Taxonomy, the Protein Ontology, the Sequence Ontology, the entries of the Entrez Gene database, and the three subontologies of the Gene Ontology. The first public release includes the annotations for 67 of the 97 articles, reserving two sets of 15 articles for future text-mining competitions (after which these too will be released). Concept annotations were created based on a single set of guidelines, which has enabled us to achieve consistently high interannotator agreement. Conclusions: As the initial 67-article release contains more than 560,000 tokens (and the full set more than 790,000 tokens), our corpus is among the largest gold-standard annotated biomedical corpora. Unlike most others, the journal articles that comprise the corpus are drawn from diverse biomedical disciplines and are marked up in their entirety. Additionally, with a concept-annotation count of nearly 100,000 in the 67-article subset (and more than 140,000 in the full collection), the scale of conceptual markup is also among the largest of comparable corpora. The concept annotations of the CRAFT Corpus have the potential to significantly advance biomedical text mining by providing a high-quality gold standard for NLP systems.
    The corpus, annotation guidelines, and other associated resources are freely available at http://bionlp-corpora.sourceforge.net/CRAFT/index.shtml. PMID:22776079

  1. Gene Ontology-Based Analysis of Zebrafish Omics Data Using the Web Tool Comparative Gene Ontology.

    PubMed

    Ebrahimie, Esmaeil; Fruzangohar, Mario; Moussavi Nik, Seyyed Hani; Newman, Morgan

    2017-10-01

    Gene Ontology (GO) analysis is a powerful tool in systems biology, which uses a defined nomenclature to annotate genes/proteins within three categories: "Molecular Function," "Biological Process," and "Cellular Component." GO analysis can assist in revealing functional mechanisms underlying observed patterns in transcriptomic, genomic, and proteomic data. The already extensive and increasing use of zebrafish for modeling genetic and other diseases highlights the need to develop a GO analytical tool for this organism. The web tool Comparative GO was originally developed for GO analysis of bacterial data in 2013 (www.comparativego.com). We have now upgraded and elaborated this web tool for analysis of zebrafish genetic data using GOs and annotations from the Gene Ontology Consortium.

  2. Major Histocompatibility Complex Genes Map to Two Chromosomes in an Evolutionarily Ancient Reptile, the Tuatara Sphenodon punctatus

    PubMed Central

    Miller, Hilary C.; O’Meally, Denis; Ezaz, Tariq; Amemiya, Chris; Marshall-Graves, Jennifer A.; Edwards, Scott

    2015-01-01

    Major histocompatibility complex (MHC) genes are a central component of the vertebrate immune system and usually exist in a single genomic region. However, considerable differences in MHC organization and size exist between different vertebrate lineages. Reptiles occupy a key evolutionary position for understanding how variation in MHC structure evolved in vertebrates, but information on the structure of the MHC region in reptiles is limited. In this study, we investigate the organization and cytogenetic location of MHC genes in the tuatara (Sphenodon punctatus), the sole extant representative of the early-diverging reptilian order Rhynchocephalia. Sequencing and mapping of 12 clones containing class I and II MHC genes from a bacterial artificial chromosome library indicated that the core MHC region is located on chromosome 13q. However, duplication and translocation of MHC genes outside of the core region was evident, because additional class I MHC genes were located on chromosome 4p. We found a total of seven class I sequences and 11 class II β sequences, with evidence for duplication and pseudogenization of genes within the tuatara lineage. The tuatara MHC is characterized by high repeat content and low gene density compared with other species and we found no antigen processing or MHC framework genes on the MHC gene-containing clones. Our findings indicate substantial differences in MHC organization in tuatara compared with mammalian and avian MHCs and highlight the dynamic nature of the MHC. Further sequencing and annotation of tuatara and other reptile MHCs will determine if the tuatara MHC is representative of nonavian reptiles in general. PMID:25953959

  3. chromoWIZ: a web tool to query and visualize chromosome-anchored genes from cereal and model genomes.

    PubMed

    Nussbaumer, Thomas; Kugler, Karl G; Schweiger, Wolfgang; Bader, Kai C; Gundlach, Heidrun; Spannagl, Manuel; Poursarebani, Naser; Pfeifer, Matthias; Mayer, Klaus F X

    2014-12-10

    Over the last few years, reference genome sequences of several economically and scientifically important cereals and model plants have become available. Despite the agricultural significance of these crops, only a small number of tools exist that allow users to inspect and visualize the genomic position of genes of interest in an interactive manner. We present chromoWIZ, a web tool that allows visualizing the genomic positions of relevant genes and comparing these data between different plant genomes. Genes can be queried using gene identifiers, functional annotations, or sequence homology in four grass species (Triticum aestivum, Hordeum vulgare, Brachypodium distachyon, Oryza sativa). The distribution of the anchored genes is visualized along the chromosomes by using heat maps. Custom gene expression measurements, differential expression information, and gene-to-group mappings can be uploaded and can be used for further filtering. This tool is mainly designed for breeders and plant researchers who are interested in the location and the distribution of candidate genes as well as in the syntenic relationships between different grass species. chromoWIZ is freely available and accessible online at http://mips.helmholtz-muenchen.de/plant/chromoWIZ/index.jsp.

  4. Transcription start site associated RNAs (TSSaRNAs) are ubiquitous in all domains of life.

    PubMed

    Zaramela, Livia S; Vêncio, Ricardo Z N; ten-Caten, Felipe; Baliga, Nitin S; Koide, Tie

    2014-01-01

    A plethora of non-coding RNAs has been discovered using high-resolution transcriptomics tools, indicating that transcriptional and post-transcriptional regulation is much more complex than previously appreciated. Small RNAs associated with transcription start sites of annotated coding regions (TSSaRNAs) are pervasive in both eukaryotes and bacteria. Here, we provide evidence for existence of TSSaRNAs in several archaeal transcriptomes including: Halobacterium salinarum, Pyrococcus furiosus, Methanococcus maripaludis, and Sulfolobus solfataricus. We validated TSSaRNAs from the model archaeon Halobacterium salinarum NRC-1 by deep sequencing two independent small-RNA enriched (RNA-seq) and a primary-transcript enriched (dRNA-seq) strand-specific libraries. We identified 652 transcripts, of which 179 were shown to be primary transcripts (∼7% of the annotated genome). Distinct growth-associated expression patterns between TSSaRNAs and their cognate genes were observed, indicating a possible role in environmental responses that may result from RNA polymerase with varying pausing rhythms. This work shows that TSSaRNAs are ubiquitous across all domains of life.

  5. Protein complex prediction in large ontology attributed protein-protein interaction networks.

    PubMed

    Zhang, Yijia; Lin, Hongfei; Yang, Zhihao; Wang, Jian; Li, Yanpeng; Xu, Bo

    2013-01-01

    Protein complexes are important for unraveling the secrets of cellular organization and function. Many computational approaches have been developed to predict protein complexes in protein-protein interaction (PPI) networks. However, most existing approaches focus mainly on the topological structure of PPI networks, and largely ignore the gene ontology (GO) annotation information. In this paper, we constructed ontology attributed PPI networks with PPI data and GO resource. After constructing ontology attributed networks, we proposed a novel approach called CSO (clustering based on network structure and ontology attribute similarity). Structural information and GO attribute information are complementary in ontology attributed networks. CSO can effectively take advantage of the correlation between frequent GO annotation sets and the dense subgraph for protein complex prediction. Our proposed CSO approach was applied to four different yeast PPI data sets and predicted many well-known protein complexes. The experimental results showed that CSO was valuable in predicting protein complexes and achieved state-of-the-art performance.

  6. Virus-Clip: a fast and memory-efficient viral integration site detection tool at single-base resolution with annotation capability.

    PubMed

    Ho, Daniel W H; Sze, Karen M F; Ng, Irene O L

    2015-08-28

    Viral integration into the human genome upon infection is an important risk factor for various human malignancies. We developed a viral integration site detection tool called Virus-Clip, which makes use of information extracted from soft-clipped sequencing reads to identify exact positions of human and virus breakpoints of integration events. With initial read alignment to the virus reference genome and streamlined procedures, Virus-Clip delivers a simple, fast and memory-efficient solution to viral integration site detection. Moreover, it can also automatically annotate the integration events with the corresponding affected human genes. Virus-Clip has been verified using whole-transcriptome sequencing data and its detection was validated to have satisfactory sensitivity and specificity. Marked advancement in performance was detected, compared to existing tools. It is applicable to various types of data including whole-genome sequencing, whole-transcriptome sequencing, and targeted sequencing. Virus-Clip is available at http://web.hku.hk/~dwhho/Virus-Clip.zip.
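
    The soft-clipping idea in the abstract above can be sketched in a few lines. For a read aligned to a reference, the CIGAR string in the SAM record marks which read ends were soft-clipped (operation S); the reference coordinate where the aligned part stops is a candidate breakpoint, and the clipped bases are the part that did not match the reference. The code below is only an illustration of that general principle, not Virus-Clip's implementation; the function names and example reads are hypothetical.

```python
import re

# CIGAR operations defined by the SAM format.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
# Operations that advance the position on the reference sequence.
REF_CONSUMING = set("MDN=X")


def parse_cigar(cigar):
    """Split a CIGAR string such as '60M40S' into [(60, 'M'), (40, 'S')]."""
    return [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]


def soft_clip_breakpoints(pos, cigar, seq):
    """Return (side, breakpoint, clipped_bases) for each soft-clipped read end.

    pos: 1-based leftmost aligned reference position (SAM POS field)
    seq: read sequence as stored in SAM (soft-clipped bases included)
    The breakpoint is the aligned reference base adjacent to the clip.
    """
    # Hard clips (H) are not present in seq, so drop them first.
    core = [(n, op) for n, op in parse_cigar(cigar) if op != "H"]
    found = []
    if core and core[0][1] == "S":
        n = core[0][0]
        found.append(("left", pos, seq[:n]))
    if len(core) > 1 and core[-1][1] == "S":
        n = core[-1][0]
        ref_len = sum(k for k, op in core if op in REF_CONSUMING)
        found.append(("right", pos + ref_len - 1, seq[-n:]))
    return found


# Hypothetical reads: one clipped on the right, one clipped on the left.
print(soft_clip_breakpoints(100, "5M3S", "ACGTACCC"))
print(soft_clip_breakpoints(100, "3S5M", "GGGACGTA"))
```

    In a pipeline that first aligns reads to the virus reference, as the abstract describes, the clipped bases would presumably be the candidate human-derived portion to map against the human genome in a later step; that step is outside this sketch.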

  7. proGenomes: a resource for consistent functional and taxonomic annotations of prokaryotic genomes.

    PubMed

    Mende, Daniel R; Letunic, Ivica; Huerta-Cepas, Jaime; Li, Simone S; Forslund, Kristoffer; Sunagawa, Shinichi; Bork, Peer

    2017-01-04

    The availability of microbial genomes has opened many new avenues of research within microbiology. This has been driven primarily by comparative genomics approaches, which rely on accurate and consistent characterization of genomic sequences. It is nevertheless difficult to obtain consistent taxonomic and integrated functional annotations for defined prokaryotic clades. Thus, we developed proGenomes, a resource that provides user-friendly access to currently 25 038 high-quality genomes whose sequences and consistent annotations can be retrieved individually or by taxonomic clade. These genomes are assigned to 5306 consistent and accurate taxonomic species clusters based on previously established methodology. proGenomes also contains functional information for almost 80 million protein-coding genes, including a comprehensive set of general annotations and more focused annotations for carbohydrate-active enzymes and antibiotic resistance genes. Additionally, broad habitat information is provided for many genomes. All genomes and associated information can be downloaded by user-selected clade or multiple habitat-specific sets of representative genomes. We expect that the availability of high-quality genomes with comprehensive functional annotations will promote advances in clinical microbial genomics, functional evolution and other subfields of microbiology. proGenomes is available at http://progenomes.embl.de. © The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research.

  8. GenomeRNAi: a database for cell-based RNAi phenotypes.

    PubMed

    Horn, Thomas; Arziman, Zeynep; Berger, Juerg; Boutros, Michael

    2007-01-01

    RNA interference (RNAi) has emerged as a powerful tool to generate loss-of-function phenotypes in a variety of organisms. Combined with the sequence information of almost completely annotated genomes, RNAi technologies have opened new avenues to conduct systematic genetic screens for every annotated gene in the genome. As increasingly large datasets of RNAi-induced phenotypes become available, an important challenge remains the systematic integration and annotation of functional information. Genome-wide RNAi screens have been performed both in Caenorhabditis elegans and Drosophila for a variety of phenotypes and several RNAi libraries have become available to assess phenotypes for almost every gene in the genome. These screens were performed using different types of assays from visible phenotypes to focused transcriptional readouts and provide a rich data source for functional annotation across different species. The GenomeRNAi database provides access to published RNAi phenotypes obtained from cell-based screens and maps them to their genomic locus, including possible non-specific regions. The database also gives access to sequence information of RNAi probes used in various screens. It can be searched by phenotype, by gene, by RNAi probe or by sequence and is accessible at http://rnai.dkfz.de.

  9. GenomeRNAi: a database for cell-based RNAi phenotypes

    PubMed Central

    Horn, Thomas; Arziman, Zeynep; Berger, Juerg; Boutros, Michael

    2007-01-01

    RNA interference (RNAi) has emerged as a powerful tool to generate loss-of-function phenotypes in a variety of organisms. Combined with the sequence information of almost completely annotated genomes, RNAi technologies have opened new avenues to conduct systematic genetic screens for every annotated gene in the genome. As increasingly large datasets of RNAi-induced phenotypes become available, an important challenge remains the systematic integration and annotation of functional information. Genome-wide RNAi screens have been performed both in Caenorhabditis elegans and Drosophila for a variety of phenotypes and several RNAi libraries have become available to assess phenotypes for almost every gene in the genome. These screens were performed using different types of assays from visible phenotypes to focused transcriptional readouts and provide a rich data source for functional annotation across different species. The GenomeRNAi database provides access to published RNAi phenotypes obtained from cell-based screens and maps them to their genomic locus, including possible non-specific regions. The database also gives access to sequence information of RNAi probes used in various screens. It can be searched by phenotype, by gene, by RNAi probe or by sequence and is accessible at PMID:17135194

  10. Comprehensive cellular‐resolution atlas of the adult human brain

    PubMed Central

    Royall, Joshua J.; Sunkin, Susan M.; Ng, Lydia; Facer, Benjamin A.C.; Lesnar, Phil; Guillozet‐Bongaarts, Angie; McMurray, Bergen; Szafer, Aaron; Dolbeare, Tim A.; Stevens, Allison; Tirrell, Lee; Benner, Thomas; Caldejon, Shiella; Dalley, Rachel A.; Dee, Nick; Lau, Christopher; Nyhus, Julie; Reding, Melissa; Riley, Zackery L.; Sandman, David; Shen, Elaine; van der Kouwe, Andre; Varjabedian, Ani; Write, Michelle; Zollei, Lilla; Dang, Chinh; Knowles, James A.; Koch, Christof; Phillips, John W.; Sestan, Nenad; Wohnoutka, Paul; Zielke, H. Ronald; Hohmann, John G.; Jones, Allan R.; Bernard, Amy; Hawrylycz, Michael J.; Hof, Patrick R.; Fischl, Bruce

    2016-01-01

    Detailed anatomical understanding of the human brain is essential for unraveling its functional architecture, yet current reference atlases have major limitations such as lack of whole‐brain coverage, relatively low image resolution, and sparse structural annotation. We present the first digital human brain atlas to incorporate neuroimaging, high‐resolution histology, and chemoarchitecture across a complete adult female brain, consisting of magnetic resonance imaging (MRI), diffusion‐weighted imaging (DWI), and 1,356 large‐format cellular resolution (1 µm/pixel) Nissl and immunohistochemistry anatomical plates. The atlas is comprehensively annotated for 862 structures, including 117 white matter tracts and several novel cyto‐ and chemoarchitecturally defined structures, and these annotations were transferred onto the matching MRI dataset. Neocortical delineations were done for sulci, gyri, and modified Brodmann areas to link macroscopic anatomical and microscopic cytoarchitectural parcellations. Correlated neuroimaging and histological structural delineation allowed fine feature identification in MRI data and subsequent structural identification in MRI data from other brains. This interactive online digital atlas is integrated with existing Allen Institute for Brain Science gene expression atlases and is publicly accessible as a resource for the neuroscience community. J. Comp. Neurol. 524:3127–3481, 2016. © 2016 The Authors The Journal of Comparative Neurology Published by Wiley Periodicals, Inc. PMID:27418273

  11. A Review of Computational Methods for Finding Non-Coding RNA Genes

    PubMed Central

    Abbas, Qaisar; Raza, Syed Mansoor; Biyabani, Azizuddin Ahmed; Jaffar, Muhammad Arfan

    2016-01-01

    Finding non-coding RNA (ncRNA) genes has emerged over the past few years as a cutting-edge trend in bioinformatics. There are numerous computational intelligence (CI) challenges in the annotation and interpretation of ncRNAs because these tasks require domain-related expert knowledge of CI techniques. Moreover, many classes have been predicted but not yet experimentally verified by researchers. Recently, researchers have applied many CI methods to predict the classes of ncRNAs. However, the diverse CI approaches lack a definitive classification framework to take advantage of past studies. A few review papers have attempted to summarize CI approaches, but focused on particular methodological viewpoints. Accordingly, in this article, we summarize, in greater detail than previously available, the CI techniques for finding ncRNA genes. We differentiate our review from the existing bodies of research and discuss concisely the technical merits of various techniques. Lastly, we review the limitations of ncRNA gene-finding CI methods with a view towards the development of new computational tools. PMID:27918472

  12. GeneRIF indexing: sentence selection based on machine learning.

    PubMed

    Jimeno-Yepes, Antonio J; Sticco, J Caitlin; Mork, James G; Aronson, Alan R

    2013-05-31

    A Gene Reference Into Function (GeneRIF) describes novel functionality of genes. GeneRIFs are available from the National Center for Biotechnology Information (NCBI) Gene database. GeneRIF indexing is performed manually, and the intention of our work is to provide methods to support creating the GeneRIF entries. The creation of GeneRIF entries involves the identification of the genes mentioned in MEDLINE® citations and the sentences describing a novel function. We have compared several learning algorithms and several features extracted or derived from MEDLINE sentences to determine if a sentence should be selected for GeneRIF indexing. Features are derived from the sentences or using mechanisms to augment the information provided by them: assigning a discourse label using a previously trained model, for example. We show that machine learning approaches with specific feature combinations achieve results close to one of the annotators. We have evaluated different feature sets and learning algorithms. In particular, Naïve Bayes achieves better performance with a selection of features similar to one used in related work, which considers the location of the sentence, the discourse of the sentence and the functional terminology in it. The current performance is at a level similar to human annotation and it shows that machine learning can be used to automate the task of sentence selection for GeneRIF annotation. The current experiments are limited to the human species. We would like to see how the methodology can be extended to other species, specifically the normalization of gene mentions in other species.

  13. MINE: Module Identification in Networks

    PubMed Central

    2011-01-01

    Background: Graphical models of network associations are useful for both visualizing and integrating multiple types of association data. Identifying modules, or groups of functionally related gene products, is an important challenge in analyzing biological networks. However, existing tools to identify modules are insufficient when applied to dense networks of experimentally derived interaction data. To address this problem, we have developed an agglomerative clustering method that is able to identify highly modular sets of gene products within highly interconnected molecular interaction networks. Results: MINE outperforms MCODE, CFinder, NEMO, SPICi, and MCL in identifying non-exclusive, high modularity clusters when applied to the C. elegans protein-protein interaction network. The algorithm generally achieves superior geometric accuracy and modularity for annotated functional categories. In comparison with the most closely related algorithm, MCODE, the top clusters identified by MINE are consistently of higher density and MINE is less likely to designate overlapping modules as a single unit. MINE offers a high level of granularity with a small number of adjustable parameters, enabling users to fine-tune cluster results for input networks with differing topological properties. Conclusions: MINE was created in response to the challenge of discovering high quality modules of gene products within highly interconnected biological networks. The algorithm allows a high degree of flexibility and user-customisation of results with few adjustable parameters. MINE outperforms several popular clustering algorithms in identifying modules with high modularity and obtains good overall recall and precision of functional annotations in protein-protein interaction networks from both S. cerevisiae and C. elegans. PMID:21605434

  14. Plant Omics Data Center: An Integrated Web Repository for Interspecies Gene Expression Networks with NLP-Based Curation

    PubMed Central

    Ohyanagi, Hajime; Takano, Tomoyuki; Terashima, Shin; Kobayashi, Masaaki; Kanno, Maasa; Morimoto, Kyoko; Kanegae, Hiromi; Sasaki, Yohei; Saito, Misa; Asano, Satomi; Ozaki, Soichi; Kudo, Toru; Yokoyama, Koji; Aya, Koichiro; Suwabe, Keita; Suzuki, Go; Aoki, Koh; Kubo, Yasutaka; Watanabe, Masao; Matsuoka, Makoto; Yano, Kentaro

    2015-01-01

    Comprehensive integration of large-scale omics resources such as genomes, transcriptomes and metabolomes will provide deeper insights into broader aspects of molecular biology. For better understanding of plant biology, we aim to construct a next-generation sequencing (NGS)-derived gene expression network (GEN) repository for a broad range of plant species. So far we have incorporated information about 745 high-quality mRNA sequencing (mRNA-Seq) samples from eight plant species (Arabidopsis thaliana, Oryza sativa, Solanum lycopersicum, Sorghum bicolor, Vitis vinifera, Solanum tuberosum, Medicago truncatula and Glycine max) from the public short read archive, digitally profiled the entire set of gene expression profiles, and drawn GENs by using correspondence analysis (CA) to take advantage of gene expression similarities. In order to understand the evolutionary significance of the GENs from multiple species, they were linked according to the orthology of each node (gene) among species. In addition to other gene expression information, functional annotation of the genes will facilitate biological comprehension. Currently we are improving the given gene annotations with natural language processing (NLP) techniques and manual curation. Here we introduce the current status of our analyses and the web database, PODC (Plant Omics Data Center; http://bioinf.mind.meiji.ac.jp/podc/), now open to the public, providing GENs, functional annotations and additional comprehensive omics resources. PMID:25505034

  15. Large-Scale Collection and Analysis of Full-Length cDNAs from Brachypodium distachyon and Integration with Pooideae Sequence Resources

    PubMed Central

    Mochida, Keiichi; Uehara-Yamaguchi, Yukiko; Takahashi, Fuminori; Yoshida, Takuhiro; Sakurai, Tetsuya; Shinozaki, Kazuo

    2013-01-01

    A comprehensive collection of full-length cDNAs is essential for correct structural gene annotation and functional analyses of genes. We constructed a mixed full-length cDNA library from 21 different tissues of Brachypodium distachyon Bd21, and obtained 78,163 high quality expressed sequence tags (ESTs) from both ends of ca. 40,000 clones (including 16,079 contigs). We updated gene structure annotations of Brachypodium genes based on full-length cDNA sequences in comparison with the latest publicly available annotations. About 10,000 non-redundant gene models were supported by full-length cDNAs; ca. 6,000 showed some transcription unit modifications. We also found ca. 580 novel gene models, including 362 newly identified in Bd21. Using the updated transcription start sites, we searched a total of 580 plant cis-motifs in the −3 kb promoter regions and determined a genome-wide Brachypodium promoter architecture. Furthermore, we integrated the Brachypodium full-length cDNAs and updated gene structures with available sequence resources in wheat and barley in a web-accessible database, the RIKEN Brachypodium FL cDNA database. The database represents a “one-stop” information resource for all genomic information in the Pooideae, facilitating functional analysis of genes in this model grass plant and seamless knowledge transfer to the Triticeae crops. PMID:24130698

  16. The pig X and Y Chromosomes: structure, sequence, and evolution

    PubMed Central

    Skinner, Benjamin M.; Sargent, Carole A.; Churcher, Carol; Hunt, Toby; Herrero, Javier; Loveland, Jane E.; Dunn, Matt; Louzada, Sandra; Fu, Beiyuan; Chow, William; Gilbert, James; Austin-Guest, Siobhan; Beal, Kathryn; Carvalho-Silva, Denise; Cheng, William; Gordon, Daria; Grafham, Darren; Hardy, Matt; Harley, Jo; Hauser, Heidi; Howden, Philip; Howe, Kerstin; Lachani, Kim; Ellis, Peter J.I.; Kelly, Daniel; Kerry, Giselle; Kerwin, James; Ng, Bee Ling; Threadgold, Glen; Wileman, Thomas; Wood, Jonathan M.D.; Yang, Fengtang; Harrow, Jen; Affara, Nabeel A.; Tyler-Smith, Chris

    2016-01-01

    We have generated an improved assembly and gene annotation of the pig X Chromosome, and a first draft assembly of the pig Y Chromosome, by sequencing BAC and fosmid clones from Duroc animals and incorporating information from optical mapping and fiber-FISH. The X Chromosome carries 1033 annotated genes, 690 of which are protein coding. Gene order closely matches that found in primates (including humans) and carnivores (including cats and dogs), which is inferred to be ancestral. Nevertheless, several protein-coding genes present on the human X Chromosome were absent from the pig, and 38 pig-specific X-chromosomal genes were annotated, 22 of which were olfactory receptors. The pig Y-specific Chromosome sequence generated here comprises 30 megabases (Mb). A 15-Mb subset of this sequence was assembled, revealing two clusters of male-specific low copy number genes, separated by an ampliconic region including the HSFY gene family, which together make up most of the short arm. Both clusters contain palindromes with high sequence identity, presumably maintained by gene conversion. Many of the ancestral X-related genes previously reported in at least one mammalian Y Chromosome are represented either as active genes or partial sequences. This sequencing project has allowed us to identify genes—both single copy and amplified—on the pig Y Chromosome, to compare the pig X and Y Chromosomes for homologous sequences, and thereby to reveal mechanisms underlying pig X and Y Chromosome evolution. PMID:26560630

  17. GOGrapher: A Python library for GO graph representation and analysis.

    PubMed

    Muller, Brian; Richards, Adam J; Jin, Bo; Lu, Xinghua

    2009-07-07

    The Gene Ontology is the most commonly used controlled vocabulary for annotating proteins. The concepts in the ontology are organized as a directed acyclic graph, in which a node corresponds to a biological concept and a directed edge denotes the parent-child semantic relationship between a pair of terms. A large number of protein annotations further create links between proteins and their functional annotations, reflecting the contemporary knowledge about proteins and their functional relationships. This leads to a complex graph consisting of interleaved biological concepts and their associated proteins. What is needed is a simple, open source library that provides tools to not only create and view the Gene Ontology graph, but to analyze and manipulate it as well. Here we describe the development and use of GOGrapher, a Python library that can be used for the creation, analysis, manipulation, and visualization of Gene Ontology related graphs. An object-oriented approach was adopted to organize the hierarchy of the graph types and associated classes. An Application Programming Interface is provided through which different types of graphs can be pragmatically created, manipulated, and visualized. GOGrapher has been successfully utilized in multiple research projects, e.g., a graph-based multi-label text classifier for protein annotation. The GOGrapher project provides a reusable programming library designed for the manipulation and analysis of Gene Ontology graphs. The library is freely available for the scientific community to use and improve.
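
    The graph structure described in this abstract can be sketched in a few lines of plain Python. This is not GOGrapher's API, which the abstract does not show. It is a minimal illustration that stores a toy ontology as a child-to-parents mapping and propagates a protein's direct annotations to all ancestor terms. The term names, the `PARENTS` table and both function names are hypothetical.

```python
from typing import Dict, Set

# Toy child -> parents mapping forming a directed acyclic graph.
# Term names are invented for illustration and are not real GO identifiers.
PARENTS: Dict[str, Set[str]] = {
    "root": set(),
    "metabolism": {"root"},
    "signaling": {"root"},
    "lipid_metabolism": {"metabolism"},
    "lipid_signaling": {"lipid_metabolism", "signaling"},
}

def ancestors(term: str, parents: Dict[str, Set[str]]) -> Set[str]:
    """Return every term reachable from `term` by following parent edges."""
    seen: Set[str] = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen

def propagate(direct: Set[str], parents: Dict[str, Set[str]]) -> Set[str]:
    """Expand a protein's direct annotations to include all ancestor terms."""
    full = set(direct)
    for term in direct:
        full |= ancestors(term, parents)
    return full
```

    Because a term may have more than one parent, `lipid_signaling` reaches `root` along two paths. The `seen` set ensures each ancestor is visited once.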

  18. Transcriptome sequencing and annotation for the Jamaican fruit bat (Artibeus jamaicensis).

    PubMed

    Shaw, Timothy I; Srivastava, Anuj; Chou, Wen-Chi; Liu, Liang; Hawkinson, Ann; Glenn, Travis C; Adams, Rick; Schountz, Tony

    2012-01-01

    The Jamaican fruit bat (Artibeus jamaicensis) is one of the most common bats in the tropical Americas. It is thought to be a potential reservoir host of Tacaribe virus, an arenavirus closely related to the South American hemorrhagic fever viruses. We performed transcriptome sequencing and annotation from lung, kidney and spleen tissues using 454 and Illumina platforms to develop this species as an animal model. More than 100,000 contigs were assembled, with 25,000 genes that were functionally annotated. Of the remaining unannotated contigs, 80% were found within bat genomes or transcriptomes. Annotated genes are involved in a broad range of activities ranging from cellular metabolism to genome regulation through ncRNAs. Reciprocal BLAST best hits yielded 8,785 sequences that are orthologous to mouse, rat, cattle, horse and human. Species tree analysis of sequences from 2,378 loci was used to achieve 95% bootstrap support for the placement of bat as sister to the clade containing horse, dog, and cattle. Through substitution rate estimation between bat and human, 32 genes were identified with evidence for positive selection. We also identified 466 immune-related genes, which may be useful for studying Tacaribe virus infection of this species. The Jamaican fruit bat transcriptome dataset is a resource that should provide additional candidate markers for studying bat evolution and ecology, and tools for analysis of the host response and pathology of disease.
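
    The reciprocal-best-hit criterion mentioned in this abstract can be sketched as follows. This is a generic illustration, not the authors' pipeline: it assumes the BLAST results have already been parsed into (query, subject, bit score) tuples, and the sequence names in the usage note are invented.

```python
from typing import Dict, List, Tuple

Hit = Tuple[str, str, float]  # (query, subject, bit score)

def best_hits(hits: List[Hit]) -> Dict[str, str]:
    """Map each query to its single highest-scoring subject."""
    best: Dict[str, Tuple[str, float]] = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}

def reciprocal_best_hits(a_vs_b: List[Hit], b_vs_a: List[Hit]) -> List[Tuple[str, str]]:
    """Return (a, b) pairs where a's best hit is b and b's best hit is a."""
    forward = best_hits(a_vs_b)
    backward = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if backward.get(b) == a)
```

    For example, if `bat1` hits `mouse1` best and `mouse1` hits `bat1` best, the pair is kept. If `bat2` hits `mouse2` best but `mouse2` prefers `bat1`, the pair is dropped.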

  19. Improved detection of overrepresentation of Gene-Ontology annotations with parent child analysis.

    PubMed

    Grossmann, Steffen; Bauer, Sebastian; Robinson, Peter N; Vingron, Martin

    2007-11-15

    High-throughput experiments such as microarray hybridizations often yield long lists of genes found to share a certain characteristic such as differential expression. Exploring Gene Ontology (GO) annotations for such lists of genes has become a widespread practice to get first insights into the potential biological meaning of the experiment. The standard statistical approach to measuring overrepresentation of GO terms cannot cope with the dependencies resulting from the structure of GO because it analyzes each term in isolation. In particular, the fact that annotations are inherited from more specific descendant terms can result in certain types of false-positive results with potentially misleading biological interpretation, a phenomenon which we term the inheritance problem. We present here a novel approach to analysis of GO term overrepresentation that determines overrepresentation of terms in the context of annotations to the term's parents. This approach reduces the dependencies between the individual term's measurements, and thereby avoids producing false-positive results owing to the inheritance problem. ROC analysis using study sets with overrepresented GO terms showed a clear advantage for our approach over the standard algorithm with respect to the inheritance problem. Although there can be no gold standard for exploratory methods such as analysis of GO term overrepresentation, analysis of biological datasets suggests that our algorithm tends to identify the core GO terms that are most characteristic of the dataset being analyzed.
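
    One way to read the contrast drawn in this abstract is sketched below. The standard test draws the study set from the whole population, whereas the parent-child variant restricts both the population and the study set to genes annotated to the term's parents before applying the same hypergeometric test. This is a simplified illustration, not the authors' code: it assumes the study set is a subset of the population and that `parent_genes` has already been built from the parent annotations. All function names are hypothetical.

```python
from math import comb
from typing import Set

def hypergeom_upper_tail(pop: int, pop_hits: int, draws: int, draw_hits: int) -> float:
    """P(X >= draw_hits) when drawing `draws` items without replacement
    from `pop` items of which `pop_hits` are annotated."""
    upper = min(pop_hits, draws)
    favourable = sum(
        comb(pop_hits, k) * comb(pop - pop_hits, draws - k)
        for k in range(draw_hits, upper + 1)
    )
    return favourable / comb(pop, draws)

def term_for_term_p(population: Set[int], study: Set[int], term_genes: Set[int]) -> float:
    """Standard overrepresentation test: each term analyzed in isolation."""
    return hypergeom_upper_tail(
        len(population), len(term_genes & population),
        len(study), len(term_genes & study),
    )

def parent_child_p(population: Set[int], study: Set[int],
                   term_genes: Set[int], parent_genes: Set[int]) -> float:
    """Same test, but conditioned on genes annotated to the term's parents."""
    return term_for_term_p(population & parent_genes, study & parent_genes, term_genes)
```

    In a toy case with 100 genes, 10 of them annotated to the parent, 8 of those to the child term, and a 5-gene study set drawn entirely from the child term, the standard test reports a very small p-value for the child. The parent-child test gives 56/252 (about 0.22), reflecting that most genes under the parent carry the child annotation anyway.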

  20. Introduction to the fathead minnow genome browser and ...

    EPA Pesticide Factsheets

    Ab initio gene prediction and evidence alignment were used to produce the first annotations for the fathead minnow SOAPdenovo genome assembly. Additionally, a genome browser hosted at genome.setac.org provides simplified access to the annotation data in context with fathead minnow genomic sequence. This work is meant to extend the utility of the fathead minnow genome as a resource and enable the continued development of this species as a model organism. The fathead minnow (Pimephales promelas) is a laboratory model organism widely used in regulatory toxicity testing and ecotoxicology research. Despite the wealth of toxicological data for this organism, until recently genome-scale information was lacking for the species, which limited the utility of the species for pathway-based toxicity testing and research. As part of an EPA Pathfinder Innovation Project, next generation sequencing was applied to generate a draft genome assembly, which was published in 2016. However, application of those genome-scale sequencing resources was still limited by the lack of available gene annotations for fathead minnow. Here we report on development of a first generation genome annotation for fathead minnow and the dissemination of that information through a web-based browser that makes it easy to search for genes of interest, extract the corresponding sequence, identify intron and exon boundaries and regulatory regions, and align the computationally predicted genes with other supporti

  1. FARME DB: a functional antibiotic resistance element database

    PubMed Central

    Wallace, James C.; Port, Jesse A.; Smith, Marissa N.; Faustman, Elaine M.

    2017-01-01

    Antibiotic resistance (AR) is a major global public health threat but few resources exist that catalog AR genes outside of a clinical context. Current AR sequence databases are assembled almost exclusively from genomic sequences derived from clinical bacterial isolates and thus do not include many microbial sequences derived from environmental samples that confer resistance in functional metagenomic studies. These environmental metagenomic sequences often show little or no similarity to AR sequences from clinical isolates using standard classification criteria. In addition, existing AR databases provide no information about flanking sequences containing regulatory or mobile genetic elements. To help address this issue, we created an annotated database of DNA and protein sequences derived exclusively from environmental metagenomic sequences showing AR in laboratory experiments. Our Functional Antibiotic Resistant Metagenomic Element (FARME) database is a compilation of publicly available DNA sequences and predicted protein sequences conferring AR as well as regulatory elements, mobile genetic elements and predicted proteins flanking antibiotic resistant genes. FARME is the first database to focus on functional metagenomic AR gene elements and provides a resource to better understand AR in the 99% of bacteria which cannot be cultured and the relationship between environmental AR sequences and antibiotic resistant genes derived from cultured isolates. Database URL: http://staff.washington.edu/jwallace/farme PMID:28077567

  2. DOSim: an R package for similarity between diseases based on Disease Ontology.

    PubMed

    Li, Jiang; Gong, Binsheng; Chen, Xi; Liu, Tao; Wu, Chao; Zhang, Fan; Li, Chunquan; Li, Xiang; Rao, Shaoqi; Li, Xia

    2011-06-29

    The construction of the Disease Ontology (DO) has helped promote the investigation of diseases and disease risk factors. DO enables researchers to analyse disease similarity by adopting semantic similarity measures, and has expanded our understanding of the relationships between different diseases and of their classification. Simultaneously, similarities between genes can also be analysed by their associations with similar diseases. As a result, disease heterogeneity is better understood and insights into the molecular pathogenesis of similar diseases have been gained. However, bioinformatics tools that provide easy and straightforward ways to use DO to study disease and gene similarity simultaneously are required. We have developed an R-based software package (DOSim) to compute the similarity between diseases and to measure the similarity between human genes in terms of diseases. DOSim incorporates a DO-based enrichment analysis function that can be used to explore the disease feature of an independent gene set. A multilayered enrichment analysis (GO and KEGG annotation) function that helps users explore the biological meaning implied in a newly detected gene module is also part of the DOSim package. We used the disease similarity application to demonstrate the relationship between 128 different DO cancer terms. The hierarchical clustering of these 128 different cancers showed modular characteristics. In another case study, we used the gene similarity application on 361 obesity-related genes. The results revealed the complex pathogenesis of obesity. In addition, the gene module detection and gene module multilayered annotation functions in DOSim, when applied to these 361 obesity-related genes, helped extend our understanding of the complex pathogenesis of obesity risk phenotypes and the heterogeneity of obesity-related diseases. DOSim can be used to detect disease-driven gene modules, and to annotate the modules for functions and pathways. The DOSim package can also be used to visualise DO structure. DOSim can reflect the modular characteristic of disease related genes and promote our understanding of the complex pathogenesis of diseases. DOSim is available on the Comprehensive R Archive Network (CRAN) or http://bioinfo.hrbmu.edu.cn/dosim.

  3. Structuring osteosarcoma knowledge: an osteosarcoma-gene association database based on literature mining and manual annotation.

    PubMed

    Poos, Kathrin; Smida, Jan; Nathrath, Michaela; Maugg, Doris; Baumhoer, Daniel; Neumann, Anna; Korsching, Eberhard

    2014-01-01

    Osteosarcoma (OS) is the most common primary bone cancer exhibiting high genomic instability. This genomic instability affects multiple genes and microRNAs to a varying extent depending on patient and tumor subtype. Massive research is ongoing to identify genes including their gene products and microRNAs that correlate with disease progression and might be used as biomarkers for OS. However, the genomic complexity hampers the identification of reliable biomarkers. Up to now, clinico-pathological factors are the key determinants to guide prognosis and therapeutic treatments. Each day, new studies about OS are published and complicate the acquisition of information to support biomarker discovery and therapeutic improvements. Thus, it is necessary to provide a structured and annotated view on the current OS knowledge that is quickly and easily accessible to researchers of the field. Therefore, we developed a publicly available database and Web interface that serves as a resource for OS-associated genes and microRNAs. Genes and microRNAs were collected using an automated dictionary-based gene recognition procedure followed by manual review and annotation by experts of the field. In total, 911 genes and 81 microRNAs related to 1331 PubMed abstracts were collected (last update: 29 October 2013). Users can evaluate genes and microRNAs according to their potential prognostic and therapeutic impact, the experimental procedures, the sample types, the biological contexts and microRNA target gene interactions. Additionally, a pathway enrichment analysis of the collected genes highlights different aspects of OS progression. OS requires pathways commonly deregulated in cancer but also features OS-specific alterations like deregulated osteoclast differentiation. To our knowledge, this is the first effort of an OS database containing manually reviewed and annotated up-to-date OS knowledge. It might be a useful resource especially for the bone tumor research community, as specific information about genes or microRNAs is quickly and easily accessible. Hence, this platform can support the ongoing OS research and biomarker discovery. Database URL: http://osteosarcoma-db.uni-muenster.de. © The Author(s) 2014. Published by Oxford University Press.

  4. Structuring osteosarcoma knowledge: an osteosarcoma-gene association database based on literature mining and manual annotation

    PubMed Central

    Poos, Kathrin; Smida, Jan; Nathrath, Michaela; Maugg, Doris; Baumhoer, Daniel; Neumann, Anna; Korsching, Eberhard

    2014-01-01

    Osteosarcoma (OS) is the most common primary bone cancer exhibiting high genomic instability. This genomic instability affects multiple genes and microRNAs to a varying extent depending on patient and tumor subtype. Massive research is ongoing to identify genes including their gene products and microRNAs that correlate with disease progression and might be used as biomarkers for OS. However, the genomic complexity hampers the identification of reliable biomarkers. Up to now, clinico-pathological factors are the key determinants to guide prognosis and therapeutic treatments. Each day, new studies about OS are published and complicate the acquisition of information to support biomarker discovery and therapeutic improvements. Thus, it is necessary to provide a structured and annotated view on the current OS knowledge that is quickly and easily accessible to researchers of the field. Therefore, we developed a publicly available database and Web interface that serves as a resource for OS-associated genes and microRNAs. Genes and microRNAs were collected using an automated dictionary-based gene recognition procedure followed by manual review and annotation by experts of the field. In total, 911 genes and 81 microRNAs related to 1331 PubMed abstracts were collected (last update: 29 October 2013). Users can evaluate genes and microRNAs according to their potential prognostic and therapeutic impact, the experimental procedures, the sample types, the biological contexts and microRNA target gene interactions. Additionally, a pathway enrichment analysis of the collected genes highlights different aspects of OS progression. OS requires pathways commonly deregulated in cancer but also features OS-specific alterations like deregulated osteoclast differentiation. To our knowledge, this is the first effort of an OS database containing manually reviewed and annotated up-to-date OS knowledge. It might be a useful resource especially for the bone tumor research community, as specific information about genes or microRNAs is quickly and easily accessible. Hence, this platform can support the ongoing OS research and biomarker discovery. Database URL: http://osteosarcoma-db.uni-muenster.de PMID:24865352

  5. Genome-wide identification, characterization of sugar transporter genes in the silkworm Bombyx mori and role in Bombyx mori nucleopolyhedrovirus (BmNPV) infection.

    PubMed

    Govindaraj, Lekha; Gupta, Tania; Esvaran, Vijaya Gowri; Awasthi, Arvind Kumar; Ponnuvel, Kangayam M

    2016-04-01

    Sugar transporters play an essential role in controlling carbohydrate transport and are responsible for mediating the movement of sugars into cells. These genes exist as large multigene families within the insect genome. In insects, sugar transporters not only have a role in sugar transport, but may also act as receptors for virus entry. Genome-wide annotation of the silkworm Bombyx mori (B. mori) revealed 100 putative sugar transporter (BmST) genes, which exist as a large multigene family and were classified into 11 subfamilies through phylogenetic analysis. Chromosomes 27, 26 and 20 were found to possess the highest number of BmST paralogous genes, harboring 22, 7 and 6 genes, respectively. These genes occurred in clusters exhibiting the phenomenon of tandem gene duplication. The ovary, silk gland, hemocytes, midgut and Malpighian tubules were the different tissues/cells enriched with BmST gene expression. The BmST gene BGIBMGA001498 had the maximum number of EST transcripts (134) and was expressed exclusively in the Malpighian tubule. The EST transcripts of the BmST clustered genes on chromosome 27 were distributed in various tissues like testis, ovary, silk gland, Malpighian tubule, maxillary galea, prothoracic gland, epidermis, fat body and midgut. Three sugar transporter genes (BmST) were constitutively expressed in the susceptible race and were down regulated upon BmNPV infection at 12 h post infection (hpi). The expression pattern of these three genes was validated through real-time PCR in the midgut tissues at different time intervals from 0 to 30 hpi. In the susceptible B. mori race, sugar transporter genes were constitutively expressed, making the host succumb to viral infection. Copyright © 2015 Elsevier B.V. All rights reserved.

  6. Annotate-it: a Swiss-knife approach to annotation, analysis and interpretation of single nucleotide variation in human disease

    PubMed Central

    2012-01-01

    The increasing size and complexity of exome/genome sequencing data requires new tools for clinical geneticists to discover disease-causing variants. Bottlenecks in identifying the causative variation include poor cross-sample querying, constantly changing functional annotation and not considering existing knowledge concerning the phenotype. We describe a methodology that facilitates exploration of patient sequencing data towards identification of causal variants under different genetic hypotheses. Annotate-it facilitates handling, analysis and interpretation of high-throughput single nucleotide variant data. We demonstrate our strategy using three case studies. Annotate-it is freely available and test data are accessible to all users at http://www.annotate-it.org. PMID:23013645

  7. A new approach for annotation of transposable elements using small RNA mapping

    PubMed Central

    El Baidouri, Moaine; Kim, Kyung Do; Abernathy, Brian; Arikit, Siwaret; Maumus, Florian; Panaud, Olivier; Meyers, Blake C.; Jackson, Scott A.

    2015-01-01

    Transposable elements (TEs) are mobile genomic DNA sequences found in most organisms. They so densely populate the genomes of many eukaryotic species that they are often the major constituents. With the rapid generation of many plant genome sequencing projects over the past few decades, there is an urgent need for improved TE annotation as a prerequisite for genome-wide studies. Analogous to the use of RNA-seq for gene annotation, we propose a new method for de novo TE annotation that uses as a guide 24 nt-siRNAs that are a part of TE silencing pathways. We use this new approach, called TASR (for Transposon Annotation using Small RNAs), for de novo annotation of TEs in Arabidopsis, rice and soybean and demonstrate that this strategy can be successfully applied for de novo TE annotation in plants. Executable PERL is available for download from: http://tasr-pipeline.sourceforge.net/ PMID:25813049

  8. AbsIDconvert: An absolute approach for converting genetic identifiers at different granularities

    PubMed Central

    2012-01-01

    Background: High-throughput molecular biology techniques yield vast amounts of data, often by detecting small portions of ribonucleotides corresponding to specific identifiers. Existing bioinformatic methodologies categorize and compare these elements using inferred descriptive annotation given this sequence information irrespective of the fact that it may not be representative of the identifier as a whole. Results: All annotations, no matter the granularity, can be aligned to genomic sequences and therefore annotated by genomic intervals. We have developed AbsIDconvert, a methodology for converting between genomic identifiers by first mapping them onto a common universal coordinate system using an interval tree which is subsequently queried for overlapping identifiers. AbsIDconvert has many potential uses, including gene identifier conversion, identification of features within a genomic region, and cross-species comparisons. The utility is demonstrated in three case studies: 1) comparative genomic study mapping plasmodium gene sequences to corresponding human and mosquito transcriptional regions; 2) cross-species study of Incyte clone sequences; and 3) analysis of human Ensembl transcripts mapped by Affymetrix® and Agilent microarray probes. AbsIDconvert currently supports ID conversion of 53 species for a given list of input identifiers, genomic sequence, or genome intervals. Conclusion: AbsIDconvert provides an efficient and reliable mechanism for conversion between identifier domains of interest. The flexibility of this tool allows for custom definition identifier domains contingent upon the availability and determination of a genomic mapping interval. As the genomes and the sequences for genetic elements are further refined, this tool will become increasingly useful and accurate. AbsIDconvert is freely available as a web application or downloadable as a virtual machine at: http://bioinformatics.louisville.edu/abid/. PMID:22967011
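
    The coordinate-based conversion described in this abstract can be sketched as an overlap query over genomic intervals. The sketch below is not AbsIDconvert's implementation. For brevity it replaces the interval tree with a simpler stand-in: a per-chromosome list sorted by start, a binary search to discard features that start after the query ends, and a linear filter on end coordinates. The `Feature` and `OverlapIndex` names, the 1-based inclusive coordinates and the identifiers in the usage note are all assumptions.

```python
from bisect import bisect_right
from typing import Dict, Iterable, List, NamedTuple

class Feature(NamedTuple):
    chrom: str
    start: int       # 1-based, inclusive (an assumed convention)
    end: int         # inclusive
    identifier: str

class OverlapIndex:
    """Per-chromosome list of features sorted by start coordinate."""

    def __init__(self, features: Iterable[Feature]) -> None:
        self._by_chrom: Dict[str, List[Feature]] = {}
        for feature in features:
            self._by_chrom.setdefault(feature.chrom, []).append(feature)
        self._starts: Dict[str, List[int]] = {}
        for chrom, feats in self._by_chrom.items():
            feats.sort(key=lambda f: f.start)
            self._starts[chrom] = [f.start for f in feats]

    def overlapping(self, chrom: str, start: int, end: int) -> List[str]:
        """Identifiers of features sharing at least one base with [start, end]."""
        feats = self._by_chrom.get(chrom, [])
        # Features that start after the query ends cannot overlap it.
        cutoff = bisect_right(self._starts.get(chrom, []), end)
        return [f.identifier for f in feats[:cutoff] if f.end >= start]

def convert(source: Feature, index: OverlapIndex) -> List[str]:
    """Convert one identifier to those of another domain via shared coordinates."""
    return index.overlapping(source.chrom, source.start, source.end)
```

    For instance, a probe placed at chr1:180-190 converts to every indexed gene or transcript whose interval covers any of those bases. A true interval tree would answer the same query without scanning all earlier-starting features.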

  9. CNV Workshop: an integrated platform for high-throughput copy number variation discovery and clinical diagnostics.

    PubMed

    Gai, Xiaowu; Perin, Juan C; Murphy, Kevin; O'Hara, Ryan; D'arcy, Monica; Wenocur, Adam; Xie, Hongbo M; Rappaport, Eric F; Shaikh, Tamim H; White, Peter S

    2010-02-04

    Recent studies have shown that copy number variations (CNVs) are frequent in higher eukaryotes and associated with a substantial portion of inherited and acquired risk for various human diseases. The increasing availability of high-resolution genome surveillance platforms provides opportunity for rapidly assessing research and clinical samples for CNV content, as well as for determining the potential pathogenicity of identified variants. However, few informatics tools for accurate and efficient CNV detection and assessment currently exist. We developed a suite of software tools and resources (CNV Workshop) for automated, genome-wide CNV detection from a variety of SNP array platforms. CNV Workshop includes three major components: detection, annotation, and presentation of structural variants from genome array data. CNV detection utilizes a robust and genotype-specific extension of the Circular Binary Segmentation algorithm, and the use of additional detection algorithms is supported. Predicted CNVs are captured in a MySQL database that supports cohort-based projects and incorporates a secure user authentication layer and user/admin roles. To assist with determination of pathogenicity, detected CNVs are also annotated automatically for gene content, known disease loci, and gene-based literature references. Results are easily queried, sorted, filtered, and visualized via a web-based presentation layer that includes a GBrowse-based graphical representation of CNV content and relevant public data, integration with the UCSC Genome Browser, and tabular displays of genomic attributes for each CNV. To our knowledge, CNV Workshop represents the first cohesive and convenient platform for detection, annotation, and assessment of the biological and clinical significance of structural variants. CNV Workshop has been successfully utilized for assessment of genomic variation in healthy individuals and disease cohorts and is an ideal platform for coordinating multiple associated projects. Available on the web at: http://sourceforge.net/projects/cnv.

  10. Rice DB: an Oryza Information Portal linking annotation, subcellular location, function, expression, regulation, and evolutionary information for rice and Arabidopsis

    PubMed Central

    Narsai, Reena; Devenish, James; Castleden, Ian; Narsai, Kabir; Xu, Lin; Shou, Huixia; Whelan, James

    2013-01-01

    Omics research in Oryza sativa (rice) relies on the use of multiple databases to obtain different types of information to define gene function. We present Rice DB, an Oryza information portal that is a functional genomics database, linking gene loci to comprehensive annotations, expression data and the subcellular location of encoded proteins. Rice DB has been designed to integrate the direct comparison of rice with Arabidopsis (Arabidopsis thaliana), based on orthology or ‘expressology’, thus using and combining available information from two pre-eminent plant models. To establish Rice DB, gene identifiers (more than 40 types) and annotations from a variety of sources were compiled, functional information based on large-scale and individual studies was manually collated, hundreds of microarrays were analysed to generate expression annotations, and the occurrences of potential functional regulatory motifs in promoter regions were calculated. A range of computational subcellular localization predictions were also run for all putative proteins encoded in the rice genome, and experimentally confirmed protein localizations have been collated, curated and linked to functional studies in rice. A single search box allows anything from gene identifiers (for rice and/or Arabidopsis), motif sequences, subcellular location, to keyword searches to be entered, with the capability of Boolean searches (such as AND/OR). To demonstrate the utility of Rice DB, several examples are presented including a rice mitochondrial proteome, which draws on a variety of sources for subcellular location data within Rice DB. Comparisons of subcellular location, functional annotations, as well as transcript expression in parallel with Arabidopsis reveal examples of conservation between rice and Arabidopsis, using Rice DB (http://ricedb.plantenergy.uwa.edu.au). PMID:24147765

  11. Rice DB: an Oryza Information Portal linking annotation, subcellular location, function, expression, regulation, and evolutionary information for rice and Arabidopsis.

    PubMed

    Narsai, Reena; Devenish, James; Castleden, Ian; Narsai, Kabir; Xu, Lin; Shou, Huixia; Whelan, James

    2013-12-01

    Omics research in Oryza sativa (rice) relies on the use of multiple databases to obtain different types of information to define gene function. We present Rice DB, an Oryza information portal that is a functional genomics database, linking gene loci to comprehensive annotations, expression data and the subcellular location of encoded proteins. Rice DB has been designed to integrate the direct comparison of rice with Arabidopsis (Arabidopsis thaliana), based on orthology or 'expressology', thus using and combining available information from two pre-eminent plant models. To establish Rice DB, gene identifiers (more than 40 types) and annotations from a variety of sources were compiled, functional information based on large-scale and individual studies was manually collated, hundreds of microarrays were analysed to generate expression annotations, and the occurrences of potential functional regulatory motifs in promoter regions were calculated. A range of computational subcellular localization predictions were also run for all putative proteins encoded in the rice genome, and experimentally confirmed protein localizations have been collated, curated and linked to functional studies in rice. A single search box allows anything from gene identifiers (for rice and/or Arabidopsis), motif sequences, subcellular location, to keyword searches to be entered, with the capability of Boolean searches (such as AND/OR). To demonstrate the utility of Rice DB, several examples are presented including a rice mitochondrial proteome, which draws on a variety of sources for subcellular location data within Rice DB. Comparisons of subcellular location, functional annotations, as well as transcript expression in parallel with Arabidopsis reveals examples of conservation between rice and Arabidopsis, using Rice DB (http://ricedb.plantenergy.uwa.edu.au). © 2013 The Authors The Plant Journal © 2013 John Wiley & Sons Ltd.

  12. A high resolution atlas of gene expression in the domestic sheep (Ovis aries)

    PubMed Central

    Farquhar, Iseabail L.; Young, Rachel; Lefevre, Lucas; Pridans, Clare; Tsang, Hiu G.; Afrasiabi, Cyrus; Watson, Mick; Whitelaw, C. Bruce; Freeman, Tom C.; Archibald, Alan L.; Hume, David A.

    2017-01-01

    Sheep are a key source of meat, milk and fibre for the global livestock sector, and an important biomedical model. Global analysis of gene expression across multiple tissues has aided genome annotation and supported functional annotation of mammalian genes. We present a large-scale RNA-Seq dataset representing all the major organ systems from adult sheep and from several juvenile, neonatal and prenatal developmental time points. The Ovis aries reference genome (Oar v3.1) includes 27,504 genes (20,921 protein coding), of which 25,350 (19,921 protein coding) had detectable expression in at least one tissue in the sheep gene expression atlas dataset. Network-based cluster analysis of this dataset grouped genes according to their expression pattern. The principle of ‘guilt by association’ was used to infer the function of uncharacterised genes from their co-expression with genes of known function. We describe the overall transcriptional signatures present in the sheep gene expression atlas and assign those signatures, where possible, to specific cell populations or pathways. The findings are related to innate immunity by focusing on clusters with an immune signature, and to the advantages of cross-breeding by examining the patterns of genes exhibiting the greatest expression differences between purebred and crossbred animals. This high-resolution gene expression atlas for sheep is, to our knowledge, the largest transcriptomic dataset from any livestock species to date. It provides a resource to improve the annotation of the current reference genome for sheep, presenting a model transcriptome for ruminants and insight into gene, cell and tissue function at multiple developmental stages. PMID:28915238
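
    The 'guilt by association' step described above can be illustrated with a minimal sketch. It is not the authors' network-based clustering: it swaps in a simple pairwise Pearson-correlation vote, and the function names, gene names, expression values and the 0.9 threshold are all hypothetical choices made for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def infer_function(target, expression, known_functions, threshold=0.9):
    """Return the most common function among annotated genes whose profile
    correlates with the target at or above the threshold, or None."""
    votes = {}
    for gene, function in known_functions.items():
        if gene == target:
            continue
        if pearson(expression[target], expression[gene]) >= threshold:
            votes[function] = votes.get(function, 0) + 1
    return max(votes, key=votes.get) if votes else None


# Hypothetical expression profiles across four tissues.
expression = {
    "unknown1": [1, 2, 3, 4],
    "immA": [2, 4, 6, 8],
    "immB": [1, 2, 3, 5],
    "metC": [4, 3, 2, 1],
}
known_functions = {"immA": "immune", "immB": "immune", "metC": "metabolism"}
print(infer_function("unknown1", expression, known_functions))
```

    Here the uncharacterised gene tracks the two immune genes and is anti-correlated with the metabolic one, so the vote assigns it the immune label. A profile with no variation across tissues would make the correlation undefined, and this sketch does not handle that case.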

  13. A high resolution atlas of gene expression in the domestic sheep (Ovis aries).

    PubMed

    Clark, Emily L; Bush, Stephen J; McCulloch, Mary E B; Farquhar, Iseabail L; Young, Rachel; Lefevre, Lucas; Pridans, Clare; Tsang, Hiu G; Wu, Chunlei; Afrasiabi, Cyrus; Watson, Mick; Whitelaw, C Bruce; Freeman, Tom C; Summers, Kim M; Archibald, Alan L; Hume, David A

    2017-09-01

    Sheep are a key source of meat, milk and fibre for the global livestock sector, and an important biomedical model. Global analysis of gene expression across multiple tissues has aided genome annotation and supported functional annotation of mammalian genes. We present a large-scale RNA-Seq dataset representing all the major organ systems from adult sheep and from several juvenile, neonatal and prenatal developmental time points. The Ovis aries reference genome (Oar v3.1) includes 27,504 genes (20,921 protein coding), of which 25,350 (19,921 protein coding) had detectable expression in at least one tissue in the sheep gene expression atlas dataset. Network-based cluster analysis of this dataset grouped genes according to their expression pattern. The principle of 'guilt by association' was used to infer the function of uncharacterised genes from their co-expression with genes of known function. We describe the overall transcriptional signatures present in the sheep gene expression atlas and assign those signatures, where possible, to specific cell populations or pathways. The findings are related to innate immunity by focusing on clusters with an immune signature, and to the advantages of cross-breeding by examining the patterns of genes exhibiting the greatest expression differences between purebred and crossbred animals. This high-resolution gene expression atlas for sheep is, to our knowledge, the largest transcriptomic dataset from any livestock species to date. It provides a resource to improve the annotation of the current reference genome for sheep, presenting a model transcriptome for ruminants and insight into gene, cell and tissue function at multiple developmental stages.

  14. Similar Ratios of Introns to Intergenic Sequence across Animal Genomes

    PubMed Central

    Wörheide, Gert

    2017-01-01

    One central goal of genome biology is to understand how the usage of the genome differs between organisms. Our knowledge of genome composition, needed for downstream inferences, is critically dependent on gene annotations, yet problems associated with gene annotation and assembly errors are usually ignored in comparative genomics. Here, we analyze the genomes of 68 species across 12 animal phyla and some single-cell eukaryotes for general trends in genome composition and transcription, taking into account problems of gene annotation. We show that, regardless of genome size, the ratio of introns to intergenic sequence is comparable across essentially all animals, with nearly all deviations dominated by increased intergenic sequence. Genomes of model organisms have ratios much closer to 1:1, suggesting that the majority of published genomes of nonmodel organisms are underannotated and consequently omit substantial numbers of genes, with likely negative impact on evolutionary interpretations. Finally, our results also indicate that most animals transcribe half or more of their genomes, arguing against differences in genome usage between animal groups, and also suggesting that the transcribed portion is more dependent on genome size than previously thought. PMID:28633296
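
    The intron-to-intergenic ratio discussed above can be sketched as a small calculation over annotation intervals. This is a minimal illustration, not the authors' pipeline: the function names and the toy coordinates are hypothetical, intervals are assumed to be half-open [start, end) positions on a single sequence, and every exon is assumed to lie inside an annotated gene.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def total_length(intervals):
    """Number of bases covered by at least one interval."""
    return sum(end - start for start, end in merge_intervals(intervals))


def intron_intergenic_ratio(genome_length, genes, exons):
    """Ratio of intronic bases to intergenic bases."""
    genic = total_length(genes)
    exonic = total_length(exons)
    intronic = genic - exonic             # genic bases not covered by any exon
    intergenic = genome_length - genic    # bases outside every gene
    return intronic / intergenic


# Toy 1,000-base sequence with two genes (hypothetical coordinates).
genes = [(100, 400), (600, 800)]
exons = [(100, 150), (300, 400), (600, 650), (750, 800)]
print(intron_intergenic_ratio(1000, genes, exons))  # 250 / 500 = 0.5
```

    A missed gene adds its whole span to the intergenic total and so pushes the ratio down, which is why the abstract reads ratios well below 1:1 as a sign of underannotation. The sketch raises a division error if the annotation leaves no intergenic sequence at all.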

  15. Phage phenomics: Physiological approaches to characterize novel viral proteins

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Sanchez, Savannah E.; Cuevas, Daniel A.; Rostron, Jason E.

    Current investigations into phage-host interactions are dependent on extrapolating knowledge from (meta)genomes. Interestingly, 60-95% of all phage sequences share no homology to current annotated proteins. As a result, a large proportion of phage genes are annotated as hypothetical. This reality heavily affects the annotation of both structural and auxiliary metabolic genes. Here we present phenomic methods designed to capture the physiological response(s) of a selected host during expression of one of these unknown phage genes. Multi-phenotype Assay Plates (MAPs) are used to monitor the diversity of host substrate utilization and subsequent biomass formation, while metabolomics provides by-product analysis by monitoring metabolite abundance and diversity. Both tools are used simultaneously to provide a phenotypic profile associated with expression of a single putative phage open reading frame (ORF). Thus, representative results for both methods are compared, highlighting the phenotypic profile differences of a host carrying either putative structural or metabolic phage genes. In addition, the visualization techniques and high throughput computational pipelines that facilitated experimental analysis are presented.

  16. Phage phenomics: Physiological approaches to characterize novel viral proteins

    DOE PAGES

    Sanchez, Savannah E.; Cuevas, Daniel A.; Rostron, Jason E.; ...

    2015-06-11

    Current investigations into phage-host interactions are dependent on extrapolating knowledge from (meta)genomes. Interestingly, 60-95% of all phage sequences share no homology to current annotated proteins. As a result, a large proportion of phage genes are annotated as hypothetical. This reality heavily affects the annotation of both structural and auxiliary metabolic genes. Here we present phenomic methods designed to capture the physiological response(s) of a selected host during expression of one of these unknown phage genes. Multi-phenotype Assay Plates (MAPs) are used to monitor the diversity of host substrate utilization and subsequent biomass formation, while metabolomics provides by-product analysis by monitoring metabolite abundance and diversity. Both tools are used simultaneously to provide a phenotypic profile associated with expression of a single putative phage open reading frame (ORF). Thus, representative results for both methods are compared, highlighting the phenotypic profile differences of a host carrying either putative structural or metabolic phage genes. In addition, the visualization techniques and high throughput computational pipelines that facilitated experimental analysis are presented.

  17. Changes in the transcriptomic profiles of maize roots in response to iron-deficiency stress.

    PubMed

    Li, Yan; Wang, Nian; Zhao, Fengtao; Song, Xuejiao; Yin, Zhaohua; Huang, Rong; Zhang, Chunqing

    2014-07-01

    Plants are often subjected to iron (Fe)-deficiency stress because of the low solubility of Fe. Plants have evolved two distinct strategies to solubilize and transport Fe to acclimate to this abiotic stress condition. Transcriptomic profiling analysis was performed using Illumina digital gene expression to understand the mechanism underlying resistance responses of roots to Fe starvation in maize, an important Strategy II plant. A total of 3,427, 4,069, 4,881, and 2,610 genes had significantly changed expression levels after Fe-deficiency treatments of 1, 2, 4 or 7 days, respectively. Genes involved in 2'-deoxymugineic acid (DMA) synthesis, secretion, and Fe(III)-DMA uptake were significantly induced. Many genes related to plant hormones, protein kinases, and protein phosphatases responded to Fe-deficiency stress, suggesting that they have regulatory roles in the response to this stress. Functional annotation clustering analysis, using the Database for Annotation, Visualization and Integrated Discovery, revealed maize root responses to Fe starvation. This resulted in 38 functional annotation clusters: 25 for up-regulated genes, and 13 for down-regulated ones. These included genes encoding enzymes involved in the metabolism of carboxylic acids, isoprenoids and aromatic compounds, transporters, and stress response proteins. Our work provides integrated information for understanding maize response to Fe-deficiency stress.

  18. A detailed transcript-level probe annotation reveals alternative splicing based microarray platform differences

    PubMed Central

    Lee, Joseph C; Stiles, David; Lu, Jun; Cam, Margaret C

    2007-01-01

    Background: Microarrays are a popular tool used in experiments to measure gene expression levels. Improving the reproducibility of microarray results produced by different chips from various manufacturers is important to create comparable and combinable experimental results. Alternative splicing has been cited as a possible cause of differences in expression measurements across platforms, though no study to this point has been conducted to show its influence in cross-platform differences. Results: Using probe sequence data, a new microarray probe/transcript annotation was created based on the AceView Aug05 release that allowed for the categorization of genes based on their expression measurements' susceptibility to alternative splicing differences across microarray platforms. Examining gene expression data from multiple platforms in light of the new categorization, genes unsusceptible to alternative splicing differences showed higher signal agreement than those genes most susceptible to alternative splicing differences. The analysis gave rise to a different probe-level visualization method that can highlight probe differences according to transcript specificity. Conclusion: The results highlight the need for detailed probe annotation at the transcriptome level. The presence of alternative splicing within a given sample can affect gene expression measurements and is a contributing factor to overall technical differences across platforms. PMID:17708771

  19. GeneView: a comprehensive semantic search engine for PubMed.

    PubMed

    Thomas, Philippe; Starlinger, Johannes; Vowinkel, Alexander; Arzt, Sebastian; Leser, Ulf

    2012-07-01

    Research results are primarily published in scientific literature and curation efforts cannot keep up with the rapid growth of published literature. The plethora of knowledge remains hidden in large text repositories like MEDLINE. Consequently, life scientists have to spend a great amount of time searching for specific information. The enormous ambiguity among most names of biomedical objects such as genes, chemicals and diseases often produces too large and unspecific search results. We present GeneView, a semantic search engine for biomedical knowledge. GeneView is built upon a comprehensively annotated version of PubMed abstracts and openly available PubMed Central full texts. This semi-structured representation of biomedical texts enables a number of features extending classical search engines. For instance, users may search for entities using unique database identifiers or they may rank documents by the number of specific mentions they contain. Annotation is performed by a multitude of state-of-the-art text-mining tools for recognizing mentions from 10 entity classes and for identifying protein-protein interactions. GeneView currently contains annotations for >194 million entities from 10 classes for ∼21 million citations with 271,000 full text bodies. GeneView can be searched at http://bc3.informatik.hu-berlin.de/.
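
    Ranking documents by the number of mentions of a given entity, one of the features described above, can be sketched in a few lines. This is only an illustration of the idea and not GeneView's implementation: the function name, the (document, entity) tuple layout and the identifiers are hypothetical.

```python
from collections import Counter


def rank_by_mentions(annotations, entity_id):
    """Rank documents by how often they mention one entity.

    `annotations` is an iterable of (doc_id, entity_id) pairs, one pair per
    recognized mention. Returns (doc_id, count) pairs, most mentions first;
    ties are broken by document identifier.
    """
    counts = Counter(doc for doc, entity in annotations if entity == entity_id)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical annotations: one tuple per mention found by a text-mining tool.
annotations = [
    ("doc1", "GENE:7157"),
    ("doc2", "GENE:7157"),
    ("doc2", "GENE:7157"),
    ("doc3", "GENE:672"),
]
print(rank_by_mentions(annotations, "GENE:7157"))
```

    Searching by a unique database identifier rather than a gene name sidesteps the name ambiguity the abstract describes, since every synonym recognized in the text maps to the same identifier before counting.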

  20. Draft genome of the red harvester ant Pogonomyrmex barbatus.

    PubMed

    Smith, Chris R; Smith, Christopher D; Robertson, Hugh M; Helmkampf, Martin; Zimin, Aleksey; Yandell, Mark; Holt, Carson; Hu, Hao; Abouheif, Ehab; Benton, Richard; Cash, Elizabeth; Croset, Vincent; Currie, Cameron R; Elhaik, Eran; Elsik, Christine G; Favé, Marie-Julie; Fernandes, Vilaiwan; Gibson, Joshua D; Graur, Dan; Gronenberg, Wulfila; Grubbs, Kirk J; Hagen, Darren E; Viniegra, Ana Sofia Ibarraran; Johnson, Brian R; Johnson, Reed M; Khila, Abderrahman; Kim, Jay W; Mathis, Kaitlyn A; Munoz-Torres, Monica C; Murphy, Marguerite C; Mustard, Julie A; Nakamura, Rin; Niehuis, Oliver; Nigam, Surabhi; Overson, Rick P; Placek, Jennifer E; Rajakumar, Rajendhran; Reese, Justin T; Suen, Garret; Tao, Shu; Torres, Candice W; Tsutsui, Neil D; Viljakainen, Lumi; Wolschin, Florian; Gadau, Jürgen

    2011-04-05

    We report the draft genome sequence of the red harvester ant, Pogonomyrmex barbatus. The genome was sequenced using 454 pyrosequencing, and the current assembly and annotation were completed in less than 1 y. Analyses of conserved gene groups (more than 1,200 manually annotated genes to date) suggest a high-quality assembly and annotation comparable to recently sequenced insect genomes using Sanger sequencing. The red harvester ant is a model for studying reproductive division of labor, phenotypic plasticity, and sociogenomics. Although the genome of P. barbatus is similar to other sequenced hymenopterans (Apis mellifera and Nasonia vitripennis) in GC content and compositional organization, and possesses a complete CpG methylation toolkit, its predicted genomic CpG content differs markedly from the other hymenopterans. Gene networks involved in generating key differences between the queen and worker castes (e.g., wings and ovaries) show signatures of increased methylation and suggest that ants and bees may have independently co-opted the same gene regulatory mechanisms for reproductive division of labor. Gene family expansions (e.g., 344 functional odorant receptors) and pseudogene accumulation in chemoreception and P450 genes compared with A. mellifera and N. vitripennis are consistent with major life-history changes during the adaptive radiation of Pogonomyrmex spp., perhaps in parallel with the development of the North American deserts.

  1. Deep developmental transcriptome sequencing uncovers numerous new genes and enhances gene annotation in the sponge Amphimedon queenslandica.

    PubMed

    Fernandez-Valverde, Selene L; Calcino, Andrew D; Degnan, Bernard M

    2015-05-15

    The demosponge Amphimedon queenslandica is amongst the few early-branching metazoans with an assembled and annotated draft genome, making it an important species in the study of the origin and early evolution of animals. Current gene models in this species are largely based on in silico predictions and low coverage expressed sequence tag (EST) evidence. Amphimedon queenslandica protein-coding gene models are improved using deep RNA-Seq data from four developmental stages and CEL-Seq data from 82 developmental samples. Over 86% of previously predicted genes are retained in the new gene models, although 24% have additional exons; there is also a marked increase in the total number of annotated 3' and 5' untranslated regions (UTRs). Importantly, these new developmental transcriptome data reveal numerous previously unannotated protein-coding genes in the Amphimedon genome, increasing the total gene number by 25%, from 30,060 to 40,122. In general, Amphimedon genes have introns that are markedly smaller than those in other animals and most of the alternatively spliced genes in Amphimedon undergo intron-retention; exon-skipping is the least common mode of alternative splicing. Finally, in addition to canonical polyadenylation signal sequences, Amphimedon genes are enriched in a number of unique AT-rich motifs in their 3' UTRs. The inclusion of developmental transcriptome data has substantially improved the structure and composition of protein-coding gene models in Amphimedon queenslandica, providing a more accurate and comprehensive set of genes for functional and comparative studies. These improvements reveal that the Amphimedon genome comprises a remarkably high number of tightly packed genes. These genes have small introns and there is pervasive intron retention amongst alternatively spliced transcripts. These aspects of the sponge genome are more similar to unicellular opisthokont genomes than to other animal genomes.

  2. RNA-Seq analysis and transcriptome assembly for blackberry (Rubus sp. Var. Lochness) fruit.

    PubMed

    Garcia-Seco, Daniel; Zhang, Yang; Gutierrez-Mañero, Francisco J; Martin, Cathie; Ramos-Solano, Beatriz

    2015-01-22

    There is an increasing interest in berries, especially blackberries, in the diet, because of recent reports of their health benefits due to their high content of flavonoids. A broad range of genomic tools are available for other Rosaceae species but these tools are still lacking in the Rubus genus, thus limiting gene discovery and the breeding of improved varieties. De novo RNA-seq of ripe blackberries grown under field conditions was performed using Illumina HiSeq 2000. Almost 9 billion nucleotide bases were sequenced in total. Following assembly, 42,062 consensus sequences were detected. For functional annotation, 33,040 (NR), 32,762 (NT), 21,932 (Swiss-Prot), 20,134 (KEGG), 13,676 (COG), 24,168 (GO) consensus sequences were annotated using different databases; in total 34,552 annotated sequences were identified. For protein prediction analysis, the number of coding DNA sequences (CDS) that mapped to the protein database was 32,540. Non-redundant (NR) annotation showed that 25,418 genes (73.5%) had the highest similarity with Fragaria vesca subspecies vesca. Reanalysis was undertaken by aligning the reads with this reference genome for a deeper analysis of the transcriptome. We demonstrated that de novo assembly using Trinity, followed by annotation with Blast against different databases, was complementary to alignment to the reference sequence using SOAPaligner/SOAP2. The Fragaria reference genome belongs to a species in the same family as blackberry (Rosaceae) but to a different genus. Since blackberries are tetraploids, the possibility of artefactual gene chimeras resulting from mis-assembly was tested with one of the genes sequenced by RNAseq, Chalcone Synthase (CHS). cDNAs encoding this protein were cloned and sequenced. Primers designed to the assembled sequences accurately distinguished different contigs, at least for chalcone synthase genes.
We prepared and analysed transcriptome data from ripe blackberries, for which prior genomic information was limited. This new sequence information will improve the knowledge of this important and healthy fruit, providing an invaluable new tool for biological research.

  3. A comprehensive collection of annotations to interpret sequence variation in human mitochondrial transfer RNAs.

    PubMed

    Diroma, Maria Angela; Lubisco, Paolo; Attimonelli, Marcella

    2016-11-08

    The abundance of biological data characterizing the genomics era is contributing to a comprehensive understanding of human mitochondrial genetics. Nevertheless, many aspects are still unclear, specifically about the variability of the 22 human mitochondrial transfer RNA (tRNA) genes and their involvement in diseases. The complex enrichment and isolation of tRNAs in vitro leads to an incomplete knowledge of their post-transcriptional modifications and three-dimensional folding, essential for correct tRNA functioning. An accurate annotation of mitochondrial tRNA variants would be useful to mitochondrial researchers and clinicians, since most bioinformatics tools for variant annotation and prioritization available so far cannot shed light on the functional role of tRNA variations. To this aim, we updated our MToolBox pipeline for mitochondrial DNA analysis of high throughput and Sanger sequencing data by integrating tRNA variant annotations in order to identify and characterize relevant variants not only in protein coding regions, but also in tRNA genes. The annotation step in the pipeline now provides detailed information for variants mapping onto the 22 mitochondrial tRNAs. For each mt-tRNA position along the entire genome, the relative tRNA numbering, tRNA type, cloverleaf secondary domains (loops and stems), mature nucleotide and interactions in the three-dimensional folding were reported. Moreover, pathogenicity predictions for tRNA and rRNA variants were retrieved from the literature and integrated within the annotations provided by MToolBox, both in the stand-alone version and web-based tool at the Mitochondrial Disease Sequence Data Resource (MSeqDR) website. All the information available in the annotation step of MToolBox was exploited to generate custom tracks which can be displayed in the GBrowse instance at the MSeqDR website.
To the best of our knowledge, specific data regarding mitochondrial variants in tRNA genes were introduced for the first time in a tool for mitochondrial genome analysis, supporting the interpretation of genetic variants in specific genomic contexts.

  4. Towards the integration, annotation and association of historical microarray experiments with RNA-seq.

    PubMed

    Chavan, Shweta S; Bauer, Michael A; Peterson, Erich A; Heuck, Christoph J; Johann, Donald J

    2013-01-01

    Transcriptome analysis by microarrays has produced important advances in biomedicine. For instance in multiple myeloma (MM), microarray approaches led to the development of an effective disease subtyping via cluster assignment, and a 70 gene risk score. Both enabled an improved molecular understanding of MM, and have provided prognostic information for the purposes of clinical management. Many researchers are now transitioning to Next Generation Sequencing (NGS) approaches and RNA-seq in particular, due to its discovery-based nature, improved sensitivity, and dynamic range. Additionally, RNA-seq allows for the analysis of gene isoforms, splice variants, and novel gene fusions. Given the voluminous amounts of historical microarray data, there is now a need to associate and integrate microarray and RNA-seq data via advanced bioinformatic approaches. Custom software was developed following a model-view-controller (MVC) approach to integrate Affymetrix probe set-IDs, and gene annotation information from a variety of sources. The tool/approach employs an assortment of strategies to integrate, cross reference, and associate microarray and RNA-seq datasets. Output from a variety of transcriptome reconstruction and quantitation tools (e.g., Cufflinks) can be directly integrated, and/or associated with Affymetrix probe set data, as well as necessary gene identifiers and/or symbols from a diversity of sources. Strategies are employed to maximize the annotation and cross referencing process. Custom gene sets (e.g., MM 70 risk score (GEP-70)) can be specified, and the tool can be directly assimilated into an RNA-seq pipeline. A novel bioinformatic approach to aid in the facilitation of both annotation and association of historic microarray data, in conjunction with richer RNA-seq data, is now assisting with the study of MM cancer biology.

  5. Combined Chromatin and Expression Analysis Reveals Specific Regulatory Mechanisms within Cytokine Genes in the Macrophage Early Immune Response

    PubMed Central

    Emanuelsson, Olof; Sennblad, Bengt; Pirmoradian Najafabadi, Mohammad; Folkersen, Lasse; Mälarstig, Anders; Lagergren, Jens; Eriksson, Per; Hamsten, Anders; Odeberg, Jacob

    2012-01-01

    Macrophages play a critical role in innate immunity, and the expression of early response genes orchestrates much of the initial response of the immune system. Macrophages undergo extensive transcriptional reprogramming in response to inflammatory stimuli such as lipopolysaccharide (LPS). To identify gene transcription regulation patterns involved in early innate immune responses, we used two genome-wide approaches: gene expression profiling and chromatin immunoprecipitation-sequencing (ChIP-seq) analysis. We examined the effect of 2 h of LPS stimulation on early gene expression and its relation to chromatin remodeling (H3 acetylation; H3Ac) and promoter binding of Sp1 and RNA polymerase II phosphorylated at serine 5 (S5P RNAPII), which is a marker for transcriptional initiation. Our results indicate novel and alternative gene regulatory mechanisms for certain proinflammatory genes. We identified two groups of up-regulated inflammatory genes with respect to chromatin modification and promoter features. One group, including highly up-regulated genes such as tumor necrosis factor (TNF), was characterized by H3Ac, high CpG content and lack of TATA boxes. The second group, containing inflammatory mediators (interleukins and CCL chemokines), was up-regulated upon LPS stimulation despite lacking H3Ac in their annotated promoters, which were low in CpG content but did contain TATA boxes. Genome-wide analysis showed that few H3Ac peaks were unique to either +/−LPS condition. However, within these, an unpacking/expansion of already existing H3Ac peaks was observed upon LPS stimulation. In contrast, a significant proportion of S5P RNAPII peaks (approx 40%) was unique to either condition. Furthermore, data indicated a large portion of previously unannotated TSSs, particularly in LPS-stimulated macrophages, where only 28% of unique S5P RNAPII peaks overlap annotated promoters.
The regulation of the inflammatory response appears to occur in a very specific manner at the chromatin level for specific genes and this study highlights the level of fine-tuning that occurs in the immune response. PMID:22384210

  6. Wide coverage biomedical event extraction using multiple partially overlapping corpora

    PubMed Central

    2013-01-01

    Background: Biomedical events are key to understanding physiological processes and disease, and wide coverage extraction is required for comprehensive automatic analysis of statements describing biomedical systems in the literature. In turn, the training and evaluation of extraction methods requires manually annotated corpora. However, as manual annotation is time-consuming and expensive, any single event-annotated corpus can only cover a limited number of semantic types. Although combined use of several such corpora could potentially allow an extraction system to achieve broad semantic coverage, there has been little research into learning from multiple corpora with partially overlapping semantic annotation scopes. Results: We propose a method for learning from multiple corpora with partial semantic annotation overlap, and implement this method to improve our existing event extraction system, EventMine. An evaluation using seven event annotated corpora, including 65 event types in total, shows that learning from overlapping corpora can produce a single, corpus-independent, wide coverage extraction system that outperforms systems trained on single corpora and exceeds previously reported results on two established event extraction tasks from the BioNLP Shared Task 2011. Conclusions: The proposed method allows the training of a wide-coverage, state-of-the-art event extraction system from multiple corpora with partial semantic annotation overlap. The resulting single model makes broad-coverage extraction straightforward in practice by removing the need to either select a subset of compatible corpora or semantic types, or to merge results from several models trained on different individual corpora. Multi-corpus learning also allows annotation efforts to focus on covering additional semantic types, rather than aiming for exhaustive coverage in any single annotation effort, or extending the coverage of semantic types annotated in existing corpora. PMID:23731785

  7. Using PPI network autocorrelation in hierarchical multi-label classification trees for gene function prediction.

    PubMed

    Stojanova, Daniela; Ceci, Michelangelo; Malerba, Donato; Dzeroski, Saso

    2013-09-26

    Ontologies and catalogs of gene functions, such as the Gene Ontology (GO) and MIPS-FUN, assume that functional classes are organized hierarchically, that is, general functions include more specific ones. This has recently motivated the development of several machine learning algorithms for gene function prediction that leverage this hierarchical organization, where instances may belong to multiple classes. In addition, it is possible to exploit relationships among examples, since it is plausible that related genes tend to share functional annotations. Although these relationships have been identified and extensively studied in the area of protein-protein interaction (PPI) networks, they have not received much attention in hierarchical and multi-class gene function prediction. Relations between genes introduce autocorrelation in functional annotations and violate the assumption that instances are independently and identically distributed (i.i.d.), which underlies most machine learning algorithms. Although the explicit consideration of these relations brings additional complexity to the learning process, we expect substantial benefits in predictive accuracy of learned classifiers. This article demonstrates the benefits (in terms of predictive accuracy) of considering autocorrelation in multi-class gene function prediction. We develop a tree-based algorithm for considering network autocorrelation in the setting of Hierarchical Multi-label Classification (HMC). We empirically evaluate the proposed algorithm, called NHMC (Network Hierarchical Multi-label Classification), on 12 yeast datasets using each of the MIPS-FUN and GO annotation schemes and exploiting 2 different PPI networks. The results clearly show that taking autocorrelation into account improves the predictive performance of the learned models for predicting gene function.
Our newly developed method for HMC takes into account network information in the learning phase: When used for gene function prediction in the context of PPI networks, the explicit consideration of network autocorrelation increases the predictive performance of the learned models. Overall, we found that this holds for different gene features/ descriptions, functional annotation schemes, and PPI networks: Best results are achieved when the PPI network is dense and contains a large proportion of function-relevant interactions.
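
The idea that interacting genes tend to share annotations can be made concrete with a small measurement. The following is a minimal sketch, not the NHMC algorithm: it computes Moran's I, a standard autocorrelation statistic swapped in here for illustration, for one binary annotation over an unweighted, undirected network. The function name `morans_i` and the toy gene identifiers are assumptions introduced for this example.

```python
from typing import Dict, Iterable, Tuple


def morans_i(values: Dict[str, float], edges: Iterable[Tuple[str, str]]) -> float:
    """Moran's I of per-gene values over an undirected, unweighted network.

    Values near +1 mean linked genes carry similar labels, values near -1
    mean linked genes carry opposite labels, and values near 0 mean no
    network structure in the labels.
    """
    # Keep only edges whose endpoints both have a value; drop self-loops.
    edge_list = [(a, b) for a, b in edges if a in values and b in values and a != b]
    n = len(values)
    if n == 0 or not edge_list:
        return 0.0
    mean = sum(values.values()) / n
    dev = {gene: v - mean for gene, v in values.items()}
    denom = sum(d * d for d in dev.values())
    if denom == 0:
        return 0.0  # every gene has the same label, so there is nothing to correlate
    # Each undirected edge counts in both directions (w_ij = w_ji = 1).
    num = 2 * sum(dev[a] * dev[b] for a, b in edge_list)
    weight_sum = 2 * len(edge_list)
    return (n / weight_sum) * (num / denom)


# Toy data (hypothetical): 1 = gene annotated with some function, 0 = not.
annotated = {"g1": 1, "g2": 1, "g3": 0, "g4": 0}
clustered = [("g1", "g2"), ("g3", "g4")]  # interacting genes share labels
mixed = [("g1", "g3"), ("g2", "g4")]      # interacting genes differ

print(morans_i(annotated, clustered))  # 1.0
print(morans_i(annotated, mixed))      # -1.0
```

A strongly positive value on real data would indicate that the i.i.d. assumption mentioned in the abstract is violated for that annotation.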

  8. Manual Gene Ontology annotation workflow at the Mouse Genome Informatics Database

    PubMed Central

    Drabkin, Harold J.; Blake, Judith A.

    2012-01-01

    The Mouse Genome Database, the Gene Expression Database and the Mouse Tumor Biology database are integrated components of the Mouse Genome Informatics (MGI) resource (http://www.informatics.jax.org). The MGI system presents both a consensus view and an experimental view of the knowledge concerning the genetics and genomics of the laboratory mouse. From genotype to phenotype, this information resource integrates information about genes, sequences, maps, expression analyses, alleles, strains and mutant phenotypes. Comparative mammalian data are also presented particularly in regards to the use of the mouse as a model for the investigation of molecular and genetic components of human diseases. These data are collected from literature curation as well as downloads of large datasets (SwissProt, LocusLink, etc.). MGI is one of the founding members of the Gene Ontology (GO) and uses the GO for functional annotation of genes. Here, we discuss the workflow associated with manual GO annotation at MGI, from literature collection to display of the annotations. Peer-reviewed literature is collected mostly from a set of journals available electronically. Selected articles are entered into a master bibliography and indexed to one of eight areas of interest such as ‘GO’ or ‘homology’ or ‘phenotype’. Each article is then either indexed to a gene already contained in the database or funneled through a separate nomenclature database to add genes. The master bibliography and associated indexing provide information for various curator-reports such as ‘papers selected for GO that refer to genes with NO GO annotation’. Once indexed, curators who have expertise in appropriate disciplines enter pertinent information. MGI makes use of several controlled vocabularies that ensure uniform data encoding, enable robust analysis and support the construction of complex queries. These vocabularies range from pick-lists to structured vocabularies such as the GO. 
All data associations are supported with statements of evidence as well as access to source publications. PMID:23110975

  9. Genome improvement of the acarbose producer Actinoplanes sp. SE50/110 and annotation refinement based on RNA-seq analysis.

    PubMed

    Wolf, Timo; Schneiker-Bekel, Susanne; Neshat, Armin; Ortseifen, Vera; Wibberg, Daniel; Zemke, Till; Pühler, Alfred; Kalinowski, Jörn

    2017-06-10

    Actinoplanes sp. SE50/110 is the natural producer of acarbose, which is used in the treatment of diabetes mellitus type II. However, until now the transcriptional organization and regulation of the acarbose biosynthesis are only understood rudimentarily. The genome sequence of Actinoplanes sp. SE50/110 was known before, but was resequenced in this study to remove assembly artifacts and incorrect base callings. The annotation of the genome was refined in a multi-step approach, including modern bioinformatic pipelines, transcriptome and proteome data. A whole transcriptome RNA-seq library as well as an RNA-seq library enriched for primary 5'-ends were used for the detection of transcription start sites, to correct tRNA predictions, to identify novel transcripts like small RNAs and to improve the annotation through the correction of falsely annotated translation start sites. The transcriptome data sets were also applied to identify 31 cis-regulatory RNA structures, such as riboswitches or RNA thermometers as well as three leaderless transcribed short peptides found in putative attenuators upstream of genes for amino acid biosynthesis. The transcriptional organization of the acarbose biosynthetic gene cluster was elucidated in detail and fourteen novel biosynthetic gene clusters were suggested. The accurate genome sequence and precise annotation of the Actinoplanes sp. SE50/110 genome will be the foundation for future genetic engineering and systems biology studies. Copyright © 2017 Elsevier B.V. All rights reserved.

  10. ODG: Omics database generator - a tool for generating, querying, and analyzing multi-omics comparative databases to facilitate biological understanding.

    PubMed

    Guhlin, Joseph; Silverstein, Kevin A T; Zhou, Peng; Tiffin, Peter; Young, Nevin D

    2017-08-10

    Rapid generation of omics data in recent years has resulted in vast amounts of disconnected datasets without systemic integration and knowledge building, while individual groups have made customized, annotated datasets available on the web with few ways to link them to in-lab datasets. With so many research groups generating their own data, the ability to relate it to the larger genomic and comparative genomic context is becoming increasingly crucial to make full use of the data. The Omics Database Generator (ODG) allows users to create customized databases that utilize published genomics data integrated with experimental data which can be queried using a flexible graph database. When provided with omics and experimental data, ODG will create a comparative, multi-dimensional graph database. ODG can import definitions and annotations from other sources such as InterProScan, the Gene Ontology, ENZYME, UniPathway, and others. This annotation data can be especially useful for studying new or understudied species for which transcripts have only been predicted, and rapidly give additional layers of annotation to predicted genes. In better studied species, ODG can perform syntenic annotation translations or rapidly identify characteristics of a set of genes or nucleotide locations, such as hits from an association study. ODG provides a web-based user-interface for configuring the data import and for querying the database. Queries can also be run from the command-line and the database can be queried directly through programming language hooks available for most languages. ODG supports most common genomic formats as well as a generic, easy-to-use tab-separated value format for user-provided annotations. ODG is a user-friendly database generation and query tool that adapts to the supplied data to produce a comparative genomic database or multi-layered annotation database.
ODG provides rapid comparative genomic annotation and is therefore particularly useful for non-model or understudied species. For species for which more data are available, ODG can be used to conduct complex multi-omics, pattern-matching queries.

  11. MyWEST: my Web Extraction Software Tool for effective mining of annotations from web-based databanks.

    PubMed

    Masseroli, Marco; Stella, Andrea; Meani, Natalia; Alcalay, Myriam; Pinciroli, Francesco

    2004-12-12

    High-throughput technologies create the necessity to mine large amounts of gene annotations from diverse databanks, and to integrate the resulting data. Most databanks can be interrogated only via Web, for a single gene at a time, and query results are generally available only in the HTML format. Although some databanks provide batch retrieval of data via FTP, this requires expertise and resources for locally reimplementing the databank. We developed MyWEST, a tool aimed at researchers without extensive informatics skills or resources, which exploits user-defined templates to easily mine selected annotations from different Web-interfaced databanks, and aggregates and structures results in an automatically updated database. Using microarray results from a model system of retinoic acid-induced differentiation, MyWEST effectively gathered relevant annotations from various biomolecular databanks, highlighted significant biological characteristics and supported a global approach to the understanding of complex cellular mechanisms. MyWEST is freely available for non-profit use at http://www.medinfopoli.polimi.it/MyWEST/

  12. Identification of Candidate Genes Responsible for Stem Pith Production Using Expression Analysis in Solid-Stemmed Wheat.

    PubMed

    Oiestad, A J; Martin, J M; Cook, J; Varella, A C; Giroux, M J

    2017-07-01

    The wheat stem sawfly (WSS) is an economically important pest of wheat in the Northern Great Plains. The primary means of WSS control is resistance associated with the single quantitative trait locus (QTL) , which controls most stem solidness variation. The goal of this study was to identify stem solidness candidate genes via RNA-seq. This study made use of 28 single nucleotide polymorphism (SNP) markers derived from expressed sequence tags (ESTs) linked to contained within a 5.13 cM region. Allele specific expression of EST markers was examined in stem tissue for solid and hollow-stemmed pairs of two spring wheat near isogenic lines (NILs) differing for the QTL. Of the 28 ESTs, 13 were located within annotated genes and 10 had detectable stem expression. Annotated genes corresponding to four of the ESTs were differentially expressed between solid and hollow-stemmed NILs and represent possible stem solidness gene candidates. Further examination of the 5.13 cM region containing the 28 EST markers identified 260 annotated genes. Twenty of the 260 linked genes were up-regulated in hollow NIL stems, while only seven genes were up-regulated in solid NIL stems. An -methyltransferase within the region of interest was identified as a candidate based on differential expression between solid and hollow-stemmed NILs and putative function. Further study of these candidate genes may lead to the identification of the gene(s) controlling stem solidness and an increased ability to select for wheat stem solidness and manage WSS. Copyright © 2017 Crop Science Society of America.

  13. aGEM: an integrative system for analyzing spatial-temporal gene-expression information

    PubMed Central

    Jiménez-Lozano, Natalia; Segura, Joan; Macías, José Ramón; Vega, Juanjo; Carazo, José María

    2009-01-01

    Motivation: The work presented here describes the ‘anatomical Gene-Expression Mapping (aGEM)’ Platform, a development conceived to integrate phenotypic information with the spatial and temporal distributions of genes expressed in the mouse. The aGEM Platform has been built by extending the Distributed Annotation System (DAS) protocol, which was originally designed to share genome annotations over the WWW. DAS is a client-server system in which a single client integrates information from multiple distributed servers. Results: The aGEM Platform provides information to answer three main questions. (i) Which genes are expressed in a given mouse anatomical component? (ii) In which mouse anatomical structures are a given gene or set of genes expressed? And (iii) is there any correlation among these findings? Currently, this Platform includes several well-known mouse resources (EMAGE, GXD and GENSAT), hosting gene-expression data mostly obtained from in situ techniques together with a broad set of image-derived annotations. Availability: The Platform is optimized for Firefox 3.0 and it is accessed through a friendly and intuitive display: http://agem.cnb.csic.es Contact: natalia@cnb.csic.es Supplementary information: Supplementary data are available at http://bioweb.cnb.csic.es/VisualOmics/aGEM/home.html and http://bioweb.cnb.csic.es/VisualOmics/index_VO.html and Bioinformatics online. PMID:19592395

  14. LS-SNP: large-scale annotation of coding non-synonymous SNPs based on multiple information sources.

    PubMed

    Karchin, Rachel; Diekhans, Mark; Kelly, Libusha; Thomas, Daryl J; Pieper, Ursula; Eswar, Narayanan; Haussler, David; Sali, Andrej

    2005-06-15

    The NCBI dbSNP database lists over 9 million single nucleotide polymorphisms (SNPs) in the human genome, but currently contains limited annotation information. SNPs that result in amino acid residue changes (nsSNPs) are of critical importance in variation between individuals, including disease and drug sensitivity. We have developed LS-SNP, a genomic scale software pipeline to annotate nsSNPs. LS-SNP comprehensively maps nsSNPs onto protein sequences, functional pathways and comparative protein structure models, and predicts positions where nsSNPs destabilize proteins, interfere with the formation of domain-domain interfaces, have an effect on protein-ligand binding or severely impact human health. It currently annotates 28,043 validated SNPs that produce amino acid residue substitutions in human proteins from the SwissProt/TrEMBL database. Annotations can be viewed via a web interface either in the context of a genomic region or by selecting sets of SNPs, genes, proteins or pathways. These results are useful for identifying candidate functional SNPs within a gene, haplotype or pathway and in probing molecular mechanisms responsible for functional impacts of nsSNPs. http://www.salilab.org/LS-SNP CONTACT: rachelk@salilab.org http://salilab.org/LS-SNP/supp-info.pdf.
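
The abstract defines nsSNPs as SNPs that result in amino acid residue changes. As a minimal sketch of that definition only (it does not reproduce the LS-SNP pipeline), the following classifies a single-base change within one codon using the standard genetic code. The function name `classify_coding_snp` and its 0-based `position` argument are assumptions made for this example.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_coding_snp(ref_codon: str, position: int, alt_base: str):
    """Return (ref_aa, alt_aa, kind) for a single-base change in a codon.

    `position` is the 0-based index within the codon. A change to or from
    a stop codon ('*') is reported as non-synonymous in this sketch.
    """
    ref_codon = ref_codon.upper()
    alt_base = alt_base.upper()
    if ref_codon not in CODON_TABLE:
        raise ValueError(f"not a DNA codon: {ref_codon!r}")
    if position not in (0, 1, 2) or alt_base not in BASES:
        raise ValueError("position must be 0, 1 or 2 and alt_base one of T, C, A, G")
    alt_codon = ref_codon[:position] + alt_base + ref_codon[position + 1:]
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    kind = "synonymous" if ref_aa == alt_aa else "non-synonymous"
    return ref_aa, alt_aa, kind


print(classify_coding_snp("GAG", 1, "T"))  # ('E', 'V', 'non-synonymous')
print(classify_coding_snp("GAA", 2, "G"))  # ('E', 'E', 'synonymous')
```

A real annotation pipeline would also need transcript coordinates, strand handling and reading-frame lookup, all of which are left out here.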

  15. Development of Cymbidium ensifolium genic-SSR markers and their utility in genetic diversity and population structure analysis in cymbidiums.

    PubMed

    Li, Xiaobai; Jin, Feng; Jin, Liang; Jackson, Aaron; Huang, Cheng; Li, Kehu; Shu, Xiaoli

    2014-12-05

    Cymbidium is a genus of 68 species in the orchid family, with extremely high ornamental value. Marker-assisted selection has proven to be an effective strategy in accelerating plant breeding for many plant species. Analysis of the genetic background of cymbidiums by molecular markers can be of great value in assisting parental selection and breeding strategy design; however, in plants such as cymbidiums limited genomic resources exist. In order to obtain efficient markers, we deep sequenced the C. ensifolium transcriptome to identify simple sequence repeats derived from gene regions (genic-SSR). A total of 7,936 genic-SSR markers were identified. A total of 80 genic-SSRs were selected, and primers were designed according to their flanking sequences. Of the 80 genic-SSR primer sets, 62 were amplified in C. ensifolium successfully, and 55 showed polymorphism when cross-tested among 9 Cymbidium species comprising 59 accessions. Unigenes containing the 62 genic-SSRs were searched against Non-redundant (Nr), Gene Ontology database (GO), eukaryotic orthologous groups (KOGs) and Kyoto Encyclopedia of Genes and Genomes (KEGG) database. The search resulted in 53 matching Nr sequences, of which 39 had GO terms, 18 were assigned to KOGs, and 15 were annotated with KEGG. Genetic diversity and population structure were analyzed based on 55 polymorphic genic-SSR data among 59 accessions. The genetic distance averaged 0.3911, ranging from 0.016 to 0.618. The polymorphic index content (PIC) of 55 polymorphic markers averaged 0.407, ranging from 0.033 to 0.863. A model-based clustering analysis revealed that five genetic groups existed in the collection. Accessions from the same species were typically grouped together; however, C. goeringii accessions did not always form a separate cluster, suggesting that C. goeringii accessions were polyphyletic.
The genic-SSRs identified in this study constitute a set of markers that can be applied across multiple Cymbidium species and used for the evaluation of genetic relationships as well as qualitative and quantitative trait mapping studies. Genic-SSRs coupled with the functional annotations provided by the unigenes will aid in mapping candidate genes of specific function.
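
The abstract reports PIC values per marker without stating the formula. PIC is more commonly expanded as polymorphism information content, and a widely used definition is PIC = 1 - Σ p_i² - Σ_{i<j} 2 p_i² p_j², where p_i is the frequency of allele i. The sketch below assumes that definition; the abstract does not say which variant the authors used, and the function name `pic` is hypothetical.

```python
from typing import Sequence


def pic(allele_counts: Sequence[float]) -> float:
    """Polymorphism information content for one marker.

    Accepts allele counts or frequencies; they are normalized to sum to 1.
    Assumes PIC = 1 - sum(p_i^2) - sum_{i<j} 2 * p_i^2 * p_j^2.
    """
    total = sum(allele_counts)
    if total <= 0:
        raise ValueError("allele counts must sum to a positive number")
    p = [c / total for c in allele_counts]
    homozygosity = sum(x * x for x in p)
    cross_term = sum(
        2 * p[i] ** 2 * p[j] ** 2
        for i in range(len(p))
        for j in range(i + 1, len(p))
    )
    return 1.0 - homozygosity - cross_term


print(pic([0.5, 0.5]))  # 0.375, two equally frequent alleles
print(pic([10]))        # 0.0, a monomorphic marker carries no information
```

Under this definition a marker with many alleles at similar frequencies approaches 1, which is consistent with the reported range of 0.033 to 0.863 lying between 0 and 1.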

  16. Influence of white spot syndrome virus infection on hepatopancreas gene expression of 'Huanghai No. 2' shrimp (Fenneropenaeus chinensis)

    NASA Astrophysics Data System (ADS)

    Meng, Xianhong; Shi, Xiaoli; Kong, Jie; Luan, Sheng; Luo, Kun; Cao, Baoxiang; Liu, Ning; Lu, Xia; Li, Xupeng; Deng, Kangyu; Cao, Jiawang; Zhang, Yingxue; Zhang, Hengheng

    2017-10-01

    To elucidate the molecular response of shrimp hepatopancreas to white spot syndrome virus (WSSV) infection, microarray was applied to investigate the differentially expressed genes in the hepatopancreas of 'Huanghai No. 2' (Fenneropenaeus chinensis). A total of 59137 unigenes were designed onto a custom-made 60K Agilent chip. After infection, the gene expression profiles in the hepatopancreas of the shrimp with a lower viral load at early (48-96 h), peak (168-192 h) and late (264-288 h) infection phases were analyzed. Of 18704 differentially expressed genes, 6412 were annotated. In total, 5453 differentially expressed genes (1916 annotated) were expressed at all three phases, and most of the annotated genes were either up- or down-regulated continuously. These genes function diversely in, for example, immune response, cytoskeletal system, signal transduction, stress resistance, protein synthesis and processing, and metabolism, among others. Some of the immune-related genes, including antilipopolysaccharide factor, Kazal-type proteinase inhibitor, C-type lectin and serine protease encoding genes, were up-regulated after WSSV infection. These genes have been reported to be involved in the anti-WSSV responses. The expression of genes related to the cytoskeletal system, including β-actin and myosin but without tubulin genes, was down-regulated after WSSV infection. Astakine was found for the first time in the WSSV-infected F. chinensis. To further confirm the expression of differentially expressed genes, quantitative real-time PCR was performed to test the expression of eight randomly selected genes and verified the reliability and accuracy of the microarray expression analysis. The data will provide valuable information for understanding the immune mechanism of the shrimp's response to WSSV.

  17. SAMMD: Staphylococcus aureus microarray meta-database.

    PubMed

    Nagarajan, Vijayaraj; Elasri, Mohamed O

    2007-10-02

    Staphylococcus aureus is an important human pathogen, causing a wide variety of diseases ranging from superficial skin infections to severe life threatening infections. S. aureus is one of the leading causes of nosocomial infections. Its ability to resist multiple antibiotics poses a growing public health problem. In order to understand the mechanism of pathogenesis of S. aureus, several global expression profiles have been developed. These transcriptional profiles included regulatory mutants of S. aureus and growth of wild type under different growth conditions. The abundance of these profiles has generated a large amount of data without a uniform annotation system to comprehensively examine them. We report the development of the Staphylococcus aureus Microarray meta-database (SAMMD) which includes data from all the published transcriptional profiles. SAMMD is a web-accessible database that helps users to perform a variety of analyses against and within the existing transcriptional profiles. SAMMD is a relational database that uses MySQL as the back end and PHP/JavaScript/DHTML as the front end. The database is normalized and consists of five tables, which hold information about gene annotations, regulated gene lists, experimental details, references, and other details. SAMMD data is collected from the peer-reviewed published articles. Data extraction and conversion was done using Perl scripts while data entry was done through the phpMyAdmin tool. The database is accessible via a web interface that contains several features such as a simple search by ORF ID, gene name, gene product name, advanced search using gene lists, comparing among datasets, browsing, downloading, statistics, and help. The database is licensed under General Public License (GPL). SAMMD is hosted and available at http://www.bioinformatics.org/sammd/. Currently there are over 9500 entries for regulated genes, from 67 microarray experiments.
SAMMD will help staphylococcal scientists to analyze their expression data and understand it at global level. It will also allow scientists to compare and contrast their transcriptome to that of the other published transcriptomes.

  18. From data to function: functional modeling of poultry genomics data.

    PubMed

    McCarthy, F M; Lyons, E

    2013-09-01

    One of the challenges of functional genomics is to create a better understanding of the biological system being studied so that the data produced are leveraged to provide gains for agriculture, human health, and the environment. Functional modeling enables researchers to make sense of these data as it reframes a long list of genes or gene products (mRNA, ncRNA, and proteins) by grouping based upon function, be it individual molecular functions or interactions between these molecules or broader biological processes, including metabolic and signaling pathways. However, poultry researchers have been hampered by a lack of functional annotation data, tools, and training to use these data and tools. Moreover, this lack is becoming more critical as new sequencing technologies enable us to generate data not only for an increasingly diverse range of species but also individual genomes and populations of individuals. We discuss the impact of these new sequencing technologies on poultry research, with a specific focus on what functional modeling resources are available for poultry researchers. We also describe key strategies for researchers who wish to functionally model their own data, providing background information about functional modeling approaches, the data and tools to support these approaches, and the strengths and limitations of each. Specifically, we describe methods for functional analysis using Gene Ontology (GO) functional summaries, functional enrichment analysis, and pathways and network modeling. 
As annotation efforts begin to provide the fundamental data that underpin poultry functional modeling (such as improved gene identification, standardized gene nomenclature, temporal and spatial expression data and gene product function), tool developers are incorporating these data into new and existing tools that are used for functional modeling, and cyberinfrastructure is being developed to provide the necessary extendibility and scalability for storing and analyzing these data. This process will support the efforts of poultry researchers to make sense of their functional genomics data sets, and we provide here a starting point for researchers who wish to take advantage of these tools.

  19. SAMMD: Staphylococcus aureus Microarray Meta-Database

    PubMed Central

    Nagarajan, Vijayaraj; Elasri, Mohamed O

    2007-01-01

    Background Staphylococcus aureus is an important human pathogen, causing a wide variety of diseases ranging from superficial skin infections to severe life threatening infections. S. aureus is one of the leading causes of nosocomial infections. Its ability to resist multiple antibiotics poses a growing public health problem. In order to understand the mechanism of pathogenesis of S. aureus, several global expression profiles have been developed. These transcriptional profiles included regulatory mutants of S. aureus and growth of wild type under different growth conditions. The abundance of these profiles has generated a large amount of data without a uniform annotation system to comprehensively examine them. We report the development of the Staphylococcus aureus Microarray meta-database (SAMMD) which includes data from all the published transcriptional profiles. SAMMD is a web-accessible database that helps users to perform a variety of analyses against and within the existing transcriptional profiles. Description SAMMD is a relational database that uses MySQL as the back end and PHP/JavaScript/DHTML as the front end. The database is normalized and consists of five tables, which hold information about gene annotations, regulated gene lists, experimental details, references, and other details. SAMMD data is collected from the peer-reviewed published articles. Data extraction and conversion was done using Perl scripts while data entry was done through the phpMyAdmin tool. The database is accessible via a web interface that contains several features such as a simple search by ORF ID, gene name, gene product name, advanced search using gene lists, comparing among datasets, browsing, downloading, statistics, and help. The database is licensed under General Public License (GPL). Conclusion SAMMD is hosted and available at . Currently there are over 9500 entries for regulated genes, from 67 microarray experiments.
SAMMD will help staphylococcal scientists to analyze their expression data and understand it at global level. It will also allow scientists to compare and contrast their transcriptome to that of the other published transcriptomes. PMID:17910768

  20. Introduction to the fathead minnow genome browser and opportunities for collaborative development

    EPA Science Inventory

    Ab initio gene prediction and evidence alignment were used to produce the first annotations for the fathead minnow SOAPdenovo genome assembly. Additionally, a genome browser hosted at genome.setac.org provides simplified access to the annotation data in context with fathead minno...

  1. RELATIONSHIP BETWEEN PHYLOGENETIC DISTRIBUTION AND GENOMIC FEATURES IN NEUROSPORA CRASSA

    USDA-ARS's Scientific Manuscript database

    In the post-genome era, insufficient functional annotation of predicted genes greatly restricts the potential of mining genome data. We demonstrate that an evolutionary approach, which is independent of functional annotation, has great potential as a tool for genome analysis. We chose the genome o...

  2. Comparative genomics of metabolic capacities of regulons controlled by cis-regulatory RNA motifs in bacteria.

    PubMed

    Sun, Eric I; Leyn, Semen A; Kazanov, Marat D; Saier, Milton H; Novichkov, Pavel S; Rodionov, Dmitry A

    2013-09-02

    In silico comparative genomics approaches have been efficiently used for functional prediction and reconstruction of metabolic and regulatory networks. Riboswitches are metabolite-sensing structures often found in bacterial mRNA leaders controlling gene expression on transcriptional or translational levels. An increasing number of riboswitches and other cis-regulatory RNAs have been recently classified into numerous RNA families in the Rfam database. High conservation of these RNA motifs provides a unique advantage for their genomic identification and comparative analysis. A comparative genomics approach implemented in the RegPredict tool was used for reconstruction and functional annotation of regulons controlled by RNAs from 43 Rfam families in diverse taxonomic groups of Bacteria. The inferred regulons include ~5200 cis-regulatory RNAs and more than 12000 target genes in 255 microbial genomes. All predicted RNA-regulated genes were classified into specific and overall functional categories. Analysis of taxonomic distribution of these categories allowed us to establish major functional preferences for each analyzed cis-regulatory RNA motif family. Overall, most RNA motif regulons showed predictable functional content in accordance with their experimentally established effector ligands. Our results suggest that some RNA motifs (including thiamin pyrophosphate and cobalamin riboswitches that control the cofactor metabolism) are widespread and likely originated from the last common ancestor of all bacteria. However, many more analyzed RNA motifs are restricted to a narrow taxonomic group of bacteria and likely represent more recent evolutionary innovations. The reconstructed regulatory networks for major known RNA motifs substantially expand the existing knowledge of transcriptional regulation in bacteria. The inferred regulons can be used for genetic experiments, functional annotations of genes, metabolic reconstruction and evolutionary analysis.
The obtained genome-wide collection of reference RNA motif regulons is available in the RegPrecise database (http://regprecise.lbl.gov/).

  3. Cryptosporidium hominis gene catalog: a resource for the selection of novel Cryptosporidium vaccine candidates

    PubMed Central

    Ifeonu, Olukemi O.; Simon, Raphael; Tennant, Sharon M.; Sheoran, Abhineet S.; Daly, Maria C.; Felix, Victor; Kissinger, Jessica C.; Widmer, Giovanni; Levine, Myron M.; Tzipori, Saul; Silva, Joana C.

    2016-01-01

    Human cryptosporidiosis, caused primarily by Cryptosporidium hominis and a subset of Cryptosporidium parvum, is a major cause of moderate-to-severe diarrhea in children under 5 years of age in developing countries and can lead to nutritional stunting and death. Cryptosporidiosis is particularly severe and potentially lethal in immunocompromised hosts. Biological and technical challenges have impeded traditional vaccinology approaches to identify novel targets for the development of vaccines against C. hominis, the predominant species associated with human disease. We deemed that the existence of genomic resources for multiple species in the genus, including a much-improved genome assembly and annotation for C. hominis, makes a reverse vaccinology approach feasible. To this end, we sought to generate a searchable online resource, termed C. hominis gene catalog, which registers all C. hominis genes and their properties relevant for the identification and prioritization of candidate vaccine antigens, including physical attributes, properties related to antigenic potential and expression data. Using bioinformatic approaches, we identified ∼400 C. hominis genes containing properties typical of surface-exposed antigens, such as predicted glycosylphosphatidylinositol (GPI)-anchor motifs, multiple transmembrane motifs and/or signal peptides targeting the encoded protein to the secretory pathway. This set can be narrowed further, e.g. by focusing on potential GPI-anchored proteins lacking homologs in the human genome, but with homologs in the other Cryptosporidium species for which genomic data are available, and with low amino acid polymorphism. Additional selection criteria related to recombinant expression and purification include minimizing predicted post-translation modifications and potential disulfide bonds. Forty proteins satisfying these criteria were selected from 3745 proteins in the updated C. hominis annotation. 
The immunogenic potential of a few of these is currently being tested. Database URL: http://cryptogc.igs.umaryland.edu PMID:28095366
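
The selection logic described in this abstract (surface-exposed properties first, then narrowing by homology and polymorphism) can be sketched as a simple filter. This is an illustration under stated assumptions, not the catalog's actual query interface: the `GeneRecord` fields, the gene identifiers, and the `max_polymorphism` threshold are all hypothetical, and the abstract gives no numeric cut-offs.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class GeneRecord:
    """Hypothetical per-gene properties; field names are assumptions."""
    gene_id: str
    has_gpi_anchor: bool
    tm_domains: int
    has_signal_peptide: bool
    has_human_homolog: bool
    has_cryptosporidium_homolog: bool
    aa_polymorphism: float  # fraction of polymorphic residues (assumed unit)


def looks_surface_exposed(gene: GeneRecord) -> bool:
    # GPI-anchor motif, multiple transmembrane motifs, and/or a signal peptide.
    return gene.has_gpi_anchor or gene.tm_domains >= 2 or gene.has_signal_peptide


def prioritize(genes: Iterable[GeneRecord], max_polymorphism: float = 0.01) -> List[str]:
    """Narrow to GPI-anchored genes with no human homolog, a homolog in
    other Cryptosporidium species, and low amino acid polymorphism."""
    return [
        g.gene_id
        for g in genes
        if looks_surface_exposed(g)
        and g.has_gpi_anchor
        and not g.has_human_homolog
        and g.has_cryptosporidium_homolog
        and g.aa_polymorphism <= max_polymorphism
    ]


# Toy records (invented for illustration only).
catalog = [
    GeneRecord("geneA", True, 0, True, False, True, 0.002),
    GeneRecord("geneB", True, 0, False, True, True, 0.001),   # human homolog
    GeneRecord("geneC", False, 3, False, False, True, 0.000),  # no GPI anchor
    GeneRecord("geneD", True, 1, True, False, True, 0.050),   # too polymorphic
]
print(prioritize(catalog))  # ['geneA']
```

The additional criteria mentioned in the abstract, such as minimizing predicted post-translational modifications and disulfide bonds, could be added as further conditions in the same way.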

  4. Analysis and functional annotation of expressed sequence tags from in vitro cell lines of elasmobranchs: spiny dogfish shark (Squalus acanthias) and little skate (Leucoraja erinacea)

    PubMed Central

    Parton, Angela; Bayne, Christopher J.; Barnes, David W.

    2010-01-01

    Elasmobranchs are the most commonly used experimental models among the jawed, cartilaginous fish (Chondrichthyes). Previously we developed cell lines from embryos of two elasmobranchs, Squalus acanthias the spiny dogfish shark (SAE line), and Leucoraja erinacea the little skate (LEE-1 line). From these lines cDNA libraries were derived and expressed sequence tags (ESTs) generated. From the SAE cell line 4303 unique transcripts were identified, with 1848 of these representing unknown sequences (showing no BLASTX identification). From the LEE-1 cell line, 3660 unique transcripts were identified, and unknown, unique sequences totaled 1333. Gene Ontology (GO) annotation showed that GO assignments for the two cell lines were in general similar. These results suggest that the procedures used to derive the cell lines led to isolation of cell types of the same general embryonic origin from both species. The LEE-1 transcripts included GO categories “envelope” and “oxidoreductase activity” but the SAE transcripts did not. GO analysis of SAE transcripts identified the category “anatomical structure formation” that was not present in LEE-1 cells. Increased organelle compartments may exist within LEE-1 cells compared to SAE cells, and the higher oxidoreductase activity in LEE-1 cells may indicate a role for these cells in responses associated with innate immunity or in steroidogenesis. These EST libraries from elasmobranch cell lines provide information for assembly of genomic sequences and are useful in revealing gene diversity, new genes and molecular markers, as well as in providing means for elucidation of full-length cDNAs and probes for gene array analyses. This is the first study of this type with members of the Chondrichthyes. PMID:20471924

  5. Analysis and functional annotation of expressed sequence tags from in vitro cell lines of elasmobranchs: Spiny dogfish shark (Squalus acanthias) and little skate (Leucoraja erinacea).

    PubMed

    Parton, Angela; Bayne, Christopher J; Barnes, David W

    2010-09-01

    Elasmobranchs are the most commonly used experimental models among the jawed, cartilaginous fish (Chondrichthyes). Previously we developed cell lines from embryos of two elasmobranchs, Squalus acanthias the spiny dogfish shark (SAE line), and Leucoraja erinacea the little skate (LEE-1 line). From these lines cDNA libraries were derived and expressed sequence tags (ESTs) generated. From the SAE cell line 4303 unique transcripts were identified, with 1848 of these representing unknown sequences (showing no BLASTX identification). From the LEE-1 cell line, 3660 unique transcripts were identified, and unknown, unique sequences totaled 1333. Gene Ontology (GO) annotation showed that GO assignments for the two cell lines were in general similar. These results suggest that the procedures used to derive the cell lines led to isolation of cell types of the same general embryonic origin from both species. The LEE-1 transcripts included GO categories "envelope" and "oxidoreductase activity" but the SAE transcripts did not. GO analysis of SAE transcripts identified the category "anatomical structure formation" that was not present in LEE-1 cells. Increased organelle compartments may exist within LEE-1 cells compared to SAE cells, and the higher oxidoreductase activity in LEE-1 cells may indicate a role for these cells in responses associated with innate immunity or in steroidogenesis. These EST libraries from elasmobranch cell lines provide information for assembly of genomic sequences and are useful in revealing gene diversity, new genes and molecular markers, as well as in providing means for elucidation of full-length cDNAs and probes for gene array analyses. This is the first study of this type with members of the Chondrichthyes. Copyright 2010 Elsevier Inc. All rights reserved.

  6. Patome: a database server for biological sequence annotation and analysis in issued patents and published patent applications

    PubMed Central

    Lee, Byungwook; Kim, Taehyung; Kim, Seon-Kyu; Lee, Kwang H.; Lee, Doheon

    2007-01-01

    With the advent of automated and high-throughput techniques, the number of patent applications containing biological sequences has been increasing rapidly. However, they have attracted relatively little attention compared to other sequence resources. We have built a database server called Patome, which contains biological sequence data disclosed in patents and published applications, as well as their analysis information. The analysis is divided into two steps. The first is an annotation step in which the disclosed sequences were annotated against the RefSeq database. The second is an association step where the sequences were linked to the Entrez Gene, OMIM and GO databases, and the results were saved as a gene–patent table. From the analysis, we found that 55% of human genes were associated with patenting. The gene–patent table can be used to identify whether a particular gene or disease is related to patenting. Patome is available at ; the information is updated bimonthly. PMID:17085479

  7. PRAPI: post-transcriptional regulation analysis pipeline for Iso-Seq.

    PubMed

    Gao, Yubang; Wang, Huiyuan; Zhang, Hangxiao; Wang, Yongsheng; Chen, Jinfeng; Gu, Lianfeng

    2018-05-01

    Single-molecule real-time (SMRT) isoform sequencing (Iso-Seq) on the Pacific Biosciences (PacBio) platform has received increasing attention for its ability to explore full-length isoforms. Thus, comprehensive tools for Iso-Seq bioinformatics analysis are extremely useful. Here, we present a one-stop solution for Iso-Seq analysis, called PRAPI, which comprehensively analyzes alternative transcription initiation (ATI), alternative splicing (AS), alternative cleavage and polyadenylation (APA), natural antisense transcripts (NAT), and circular RNAs (circRNAs). PRAPI is capable of combining Iso-Seq full-length isoforms with short read data, such as RNA-Seq or polyadenylation site sequencing (PAS-seq), for differential expression analysis of NAT, AS, APA and circRNAs. Furthermore, PRAPI can annotate new genes and correct mis-annotated genes when gene annotation is available. Finally, PRAPI generates high-quality vector graphics to visualize and highlight the Iso-Seq results. The Dockerfile of PRAPI is available at http://www.bioinfor.org/tool/PRAPI. Contact: lfgu@fafu.edu.cn.

  8. Deep transcriptome annotation enables the discovery and functional characterization of cryptic small proteins

    PubMed Central

    Delcourt, Vivian; Lucier, Jean-François; Gagnon, Jules; Beaudoin, Maxime C; Vanderperre, Benoît; Breton, Marc-André; Motard, Julie; Jacques, Jean-François; Brunelle, Mylène; Gagnon-Arsenault, Isabelle; Fournier, Isabelle; Ouangraoua, Aida; Hunting, Darel J; Cohen, Alan A; Landry, Christian R; Scott, Michelle S

    2017-01-01

    Recent functional, proteomic and ribosome profiling studies in eukaryotes have concurrently demonstrated the translation of alternative open-reading frames (altORFs) in addition to annotated protein coding sequences (CDSs). We show that a large number of small proteins could in fact be coded by these altORFs. The putative alternative proteins translated from altORFs have orthologs in many species and contain functional domains. Evolutionary analyses indicate that altORFs often show more extreme conservation patterns than their CDSs. Thousands of alternative proteins are detected in proteomic datasets by reanalysis using a database containing predicted alternative proteins. This is illustrated with specific examples, including altMiD51, a 70 amino acid mitochondrial fission-promoting protein encoded in MiD51/Mief1/SMCR7L, a gene encoding an annotated protein promoting mitochondrial fission. Our results suggest that many genes are multicoding genes and code for a large protein and one or several small proteins. PMID:29083303

  9. Corpus annotation for mining biomedical events from literature

    PubMed Central

    Kim, Jin-Dong; Ohta, Tomoko; Tsujii, Jun'ichi

    2008-01-01

    Background: Advanced Text Mining (TM) such as semantic enrichment of papers, event or relation extraction, and intelligent Question Answering have increasingly attracted attention in the bio-medical domain. For such attempts to succeed, text annotation from the biological point of view is indispensable. However, due to the complexity of the task, semantic annotation has never been tried on a large scale, apart from relatively simple term annotation. Results: We have completed a new type of semantic annotation, event annotation, which is an addition to the existing annotations in the GENIA corpus. The corpus has already been annotated with POS (Parts of Speech), syntactic trees, terms, etc. The new annotation was made on half of the GENIA corpus, consisting of 1,000 Medline abstracts. It contains 9,372 sentences in which 36,114 events are identified. The major challenges during event annotation were (1) to design a scheme of annotation which meets specific requirements of text annotation, (2) to achieve biology-oriented annotation which reflects biologists' interpretation of text, and (3) to ensure the homogeneity of annotation quality across annotators. To meet these challenges, we introduced new concepts such as Single-facet Annotation and Semantic Typing, which have collectively contributed to successful completion of a large-scale annotation. Conclusion: The resulting event-annotated corpus is the largest and one of the best in quality among similar annotation efforts. We expect it to become a valuable resource for NLP (Natural Language Processing)-based TM in the bio-medical domain. PMID:18182099

  10. Fractal Clustering and Knowledge-driven Validation Assessment for Gene Expression Profiling.

    PubMed

    Wang, Lu-Yong; Balasubramanian, Ammaiappan; Chakraborty, Amit; Comaniciu, Dorin

    2005-01-01

    DNA microarray experiments generate a substantial amount of information about global gene expression. Gene expression profiles can be represented as points in multi-dimensional space. It is essential to identify relevant groups of genes in biomedical research. Clustering is helpful in pattern recognition in gene expression profiles. A number of clustering techniques have been introduced. However, these traditional methods mainly utilize shape-based assumptions or some distance metric to cluster the points in multi-dimensional linear Euclidean space. Their results show poor consistency with the functional annotation of genes in previous validation studies. From a different perspective, we propose a fractal clustering method to cluster genes using intrinsic (fractal) dimension from modern geometry. This method clusters points in such a way that points in the same cluster are more self-affine among themselves than to the points in other clusters. We assess this method using annotation-based validation assessment for gene clusters. It shows that this method is superior to other traditional methods in identifying functionally related gene groups.

  11. Investigating the mechanisms of glyphosate resistance in goosegrass (Eleusine indica (L.) Gaertn.) by RNA sequencing technology.

    PubMed

    Chen, Jingchao; Huang, Hongjuan; Wei, Shouhui; Huang, Zhaofeng; Wang, Xu; Zhang, Chaoxian

    2017-01-01

    Glyphosate is an important non-selective herbicide that is in common use worldwide. However, evolved glyphosate-resistant (GR) weeds significantly affect crop yields. Unfortunately, the mechanisms underlying resistance in GR weeds, such as goosegrass (Eleusine indica (L.) Gaertn.), an annual weed found worldwide, have not been fully elucidated. In this study, transcriptome analysis was conducted to further assess the potential mechanisms of glyphosate resistance in goosegrass. The RNA sequencing libraries generated 24 597 462 clean reads. De novo assembly analysis produced 48 852 UniGenes with an average length of 847 bp. All UniGenes were annotated using seven databases. Sixteen candidate differentially expressed genes selected by digital gene expression analysis were validated by quantitative real-time PCR (qRT-PCR). Among these UniGenes, the EPSPS and PFK genes were constitutively up-regulated in resistant (R) individuals and showed a higher copy number than that in susceptible (S) individuals. The expressions of four UniGenes relevant to photosynthesis were inhibited by glyphosate in S individuals, and this toxic response was confirmed by gas exchange analysis. Two UniGenes annotated as glutathione transferase (GST) were constitutively up-regulated in R individuals, and were induced by glyphosate both in R and S. In addition, the GST activities in R individuals were higher than in S. Our research confirmed that two UniGenes (PFK, EPSPS) were strongly associated with target resistance, and two GST-annotated UniGenes may play a role in metabolic glyphosate resistance in goosegrass. © 2016 The Authors The Plant Journal © 2016 John Wiley & Sons Ltd.

  12. Short interspersed nuclear elements (SINEs) are abundant in Solanaceae and have a family-specific impact on gene structure and genome organization.

    PubMed

    Seibt, Kathrin M; Wenke, Torsten; Muders, Katja; Truberg, Bernd; Schmidt, Thomas

    2016-05-01

    Short interspersed nuclear elements (SINEs) are highly abundant non-autonomous retrotransposons that are widespread in plants. They are short in size, non-coding, show high sequence diversity, and are therefore mostly not or not correctly annotated in plant genome sequences. Hence, comparative studies on genomic SINE populations are rare. To explore the structural organization and impact of SINEs, we comparatively investigated the genome sequences of the Solanaceae species potato (Solanum tuberosum), tomato (Solanum lycopersicum), wild tomato (Solanum pennellii), and two pepper cultivars (Capsicum annuum). Based on 8.5 Gbp sequence data, we annotated 82 983 SINE copies belonging to 10 families and subfamilies on a base pair level. Solanaceae SINEs are dispersed over all chromosomes with enrichments in distal regions. Depending on the genome assemblies and gene predictions, 30% of all SINE copies are associated with genes, particularly frequent in introns and untranslated regions (UTRs). The close association with genes is family specific. More than 10% of all genes annotated in the Solanaceae species investigated contain at least one SINE insertion, and we found genes harbouring up to 16 SINE copies. We demonstrate the involvement of SINEs in gene and genome evolution including the donation of splice sites, start and stop codons and exons to genes, enlargement of introns and UTRs, generation of tandem-like duplications and transduction of adjacent sequence regions. © 2016 The Authors The Plant Journal © 2016 John Wiley & Sons Ltd.

  13. Transcriptome sequencing reveals high isoform diversity in the ant Formica exsecta

    PubMed Central

    Paviala, Jenni; Morandin, Claire; Wheat, Christopher; Sundström, Liselotte; Helanterä, Heikki

    2017-01-01

    Transcriptome resources for social insects have the potential to provide new insight into polyphenism, i.e., how divergent phenotypes arise from the same genome. Here we present a transcriptome based on paired-end RNA sequencing data for the ant Formica exsecta (Formicidae, Hymenoptera). The RNA sequencing libraries were constructed from samples of several life stages of both sexes and female castes of queens and workers, in order to maximize representation of expressed genes. We first compare the performance of common assembly and scaffolding software (Trinity, Velvet-Oases, and SOAPdenovo-trans) in producing de novo assemblies. Second, we annotate the resulting expressed contigs to the currently published genomes of ants and other insects, including the honeybee, to filter genes that have annotation evidence of being true genes. Our pipeline resulted in a final assembly of altogether 39,262 mRNA transcripts, with an average coverage of >300X, belonging to 17,496 unique genes with annotation in the related ant species. Of these genes, 536 were unique to one caste or sex, highlighting the importance of comprehensive sampling. Our final assembly also showed expression of several splice variants in 6,975 genes, and we show that accounting for splice variants affects the outcome of downstream analyses such as gene ontologies. Our transcriptome provides an outstanding resource for future genetic studies on F. exsecta and other ant species, and the presented transcriptome assembly can be adapted to any non-model species that has genomic resources available from a related taxon. PMID:29177112

  14. Transcriptional dynamics of the developing sweet cherry (Prunus avium L.) fruit: sequencing, annotation and expression profiling of exocarp-associated genes

    PubMed Central

    Alkio, Merianne; Jonas, Uwe; Declercq, Myriam; Van Nocker, Steven; Knoche, Moritz

    2014-01-01

    The exocarp, or skin, of fleshy fruit is a specialized tissue that protects the fruit, attracts seed-dispersing fruit eaters, and has great economic relevance for fruit quality. Development of the exocarp involves regulated activities of many genes. This research analyzed global gene expression in the exocarp of developing sweet cherry (Prunus avium L., ‘Regina’), a fruit crop species with few public genomic resources. A catalog of transcript models (contigs) representing expressed genes was constructed from de novo assembled short complementary DNA (cDNA) sequences generated from developing fruit between flowering and maturity at 14 time points. Expression levels in each sample were estimated for 34 695 contigs from numbers of reads mapping to each contig. Contigs were annotated functionally based on BLAST, gene ontology and InterProScan analyses. Coregulated genes were detected using partitional clustering of expression patterns. The results are discussed with emphasis on genes putatively involved in cuticle deposition, cell wall metabolism and sugar transport. The high temporal resolution of the expression patterns presented here reveals finely tuned developmental specialization of individual members of gene families. Moreover, the de novo assembled sweet cherry fruit transcriptome with 7760 full-length protein coding sequences and over 20 000 other, annotated cDNA sequences together with their developmental expression patterns is expected to accelerate molecular research on this important tree fruit crop. PMID:26504533

  15. Plant Omics Data Center: an integrated web repository for interspecies gene expression networks with NLP-based curation.

    PubMed

    Ohyanagi, Hajime; Takano, Tomoyuki; Terashima, Shin; Kobayashi, Masaaki; Kanno, Maasa; Morimoto, Kyoko; Kanegae, Hiromi; Sasaki, Yohei; Saito, Misa; Asano, Satomi; Ozaki, Soichi; Kudo, Toru; Yokoyama, Koji; Aya, Koichiro; Suwabe, Keita; Suzuki, Go; Aoki, Koh; Kubo, Yasutaka; Watanabe, Masao; Matsuoka, Makoto; Yano, Kentaro

    2015-01-01

    Comprehensive integration of large-scale omics resources such as genomes, transcriptomes and metabolomes will provide deeper insights into broader aspects of molecular biology. For better understanding of plant biology, we aim to construct a next-generation sequencing (NGS)-derived gene expression network (GEN) repository for a broad range of plant species. So far we have incorporated information about 745 high-quality mRNA sequencing (mRNA-Seq) samples from eight plant species (Arabidopsis thaliana, Oryza sativa, Solanum lycopersicum, Sorghum bicolor, Vitis vinifera, Solanum tuberosum, Medicago truncatula and Glycine max) from the public short read archive, digitally profiled the entire set of gene expression profiles, and drawn GENs by using correspondence analysis (CA) to take advantage of gene expression similarities. In order to understand the evolutionary significance of the GENs from multiple species, they were linked according to the orthology of each node (gene) among species. In addition to other gene expression information, functional annotation of the genes will facilitate biological comprehension. Currently we are improving the given gene annotations with natural language processing (NLP) techniques and manual curation. Here we introduce the current status of our analyses and the web database, PODC (Plant Omics Data Center; http://bioinf.mind.meiji.ac.jp/podc/), now open to the public, providing GENs, functional annotations and additional comprehensive omics resources. © The Author 2014. Published by Oxford University Press on behalf of Japanese Society of Plant Physiologists.

  16. Annotation of UAV surveillance video

    NASA Astrophysics Data System (ADS)

    Howlett, Todd; Robertson, Mark A.; Manthey, Dan; Krol, John

    2004-08-01

    Significant progress toward the development of a video annotation capability is presented in this paper. Research and development of an object tracking algorithm applicable to UAV video is described. Object tracking is necessary for attaching the annotations to the objects of interest. A methodology and format are defined for encoding video annotations using the SMPTE Key-Length-Value encoding standard. This provides the following benefits: a non-destructive annotation, compliance with existing standards, video playback in systems that are not annotation enabled and support for a real-time implementation. A model real-time video annotation system is also presented, at a high level, using the MPEG-2 Transport Stream as the transmission medium. This work was accomplished to meet the Department of Defense's (DoD's) need for a video annotation capability. Current practices for creating annotated products are to capture a still image frame, annotate it using an Electric Light Table application, and then pass the annotated image on as a product. That is not adequate for reporting or downstream cueing. It is too slow and there is a severe loss of information. This paper describes a capability for annotating directly on the video.
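
    The abstract names Key-Length-Value (KLV) encoding but does not give the byte layout of the annotations. As a rough illustration of the general KLV idea, the sketch below packs a payload behind a 16-byte key and a BER-encoded length, which is the arrangement commonly associated with SMPTE KLV. The key bytes, the payload text, and the function names are placeholders chosen for this example; they are not the authors' format and not a registered SMPTE label.

```python
# Hypothetical 16-byte key for illustration only (not a registered label).
EXAMPLE_KEY = bytes.fromhex("060e2b34") + bytes(12)


def ber_length(n: int) -> bytes:
    """Encode a length in BER: one byte below 128, long form otherwise."""
    if n < 0:
        raise ValueError("length must be non-negative")
    if n < 128:
        return bytes([n])
    body = n.to_bytes((n.bit_length() + 7) // 8, "big")
    # Long form: high bit set, low 7 bits give the number of length bytes.
    return bytes([0x80 | len(body)]) + body


def klv_pack(key: bytes, value: bytes) -> bytes:
    """Concatenate key, BER length, and value into one KLV triplet."""
    if len(key) != 16:
        raise ValueError("this sketch assumes a 16-byte key")
    return key + ber_length(len(value)) + value


def klv_unpack(buf: bytes, offset: int = 0):
    """Read one KLV triplet; return (key, value, offset of the next triplet)."""
    key = buf[offset:offset + 16]
    first = buf[offset + 16]
    pos = offset + 17
    if first < 128:
        length = first
    else:
        n_bytes = first & 0x7F
        length = int.from_bytes(buf[pos:pos + n_bytes], "big")
        pos += n_bytes
    value = buf[pos:pos + length]
    return key, value, pos + length


if __name__ == "__main__":
    packet = klv_pack(EXAMPLE_KEY, b"track 7: vehicle")
    key, value, next_offset = klv_unpack(packet)
    print(value.decode(), next_offset == len(packet))
```

    Because each triplet carries its own length, a decoder that does not recognize a key can skip the value and continue, which is consistent with the abstract's point about playback on systems that are not annotation enabled.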

  17. The Aspergillus Genome Database: multispecies curation and incorporation of RNA-Seq data to improve structural gene annotations.

    PubMed

    Cerqueira, Gustavo C; Arnaud, Martha B; Inglis, Diane O; Skrzypek, Marek S; Binkley, Gail; Simison, Matt; Miyasato, Stuart R; Binkley, Jonathan; Orvis, Joshua; Shah, Prachi; Wymore, Farrell; Sherlock, Gavin; Wortman, Jennifer R

    2014-01-01

    The Aspergillus Genome Database (AspGD; http://www.aspgd.org) is a freely available web-based resource that was designed for Aspergillus researchers and is also a valuable source of information for the entire fungal research community. In addition to being a repository and central point of access to genome, transcriptome and polymorphism data, AspGD hosts a comprehensive comparative genomics toolbox that facilitates the exploration of precomputed orthologs among the 20 currently available Aspergillus genomes. AspGD curators perform gene product annotation based on review of the literature for four key Aspergillus species: Aspergillus nidulans, Aspergillus oryzae, Aspergillus fumigatus and Aspergillus niger. We have iteratively improved the structural annotation of Aspergillus genomes through the analysis of publicly available transcription data, mostly expressed sequence tags, as described in a previous NAR Database article (Arnaud et al. 2012). In this update, we report substantive structural annotation improvements for A. nidulans, A. oryzae and A. fumigatus genomes based on recently available RNA-Seq data. Over 26 000 loci were updated across these species; although those primarily comprise the addition and extension of untranslated regions (UTRs), the new analysis also enabled over 1000 modifications affecting the coding sequence of genes in each target genome.

  18. Assembling the Setaria italica L. Beauv. genome into nine chromosomes and insights into regions affecting growth and drought tolerance

    PubMed Central

    Tsai, Kevin J.; Lu, Mei-Yeh Jade; Yang, Kai-Jung; Li, Mengyun; Teng, Yuchuan; Chen, Shihmay; Ku, Maurice S. B.; Li, Wen-Hsiung

    2016-01-01

    The diploid C4 plant foxtail millet (Setaria italica L. Beauv.) is an important crop in many parts of Africa and Asia for the vast consumption of its grain and ability to grow in harsh environments, but remains understudied in terms of complete genomic architecture. To date, there have been only two genome assembly and annotation efforts with neither assembly reaching over 86% of the estimated genome size. We have combined de novo assembly with custom reference-guided improvements on a popular cultivar of foxtail millet and have achieved a genome assembly of 477 Mbp in length, which represents over 97% of the estimated 490 Mbp. The assembly anchors over 98% of the predicted genes to the nine assembled nuclear chromosomes and contains more functional annotation gene models than previous assemblies. Our annotation has identified a large number of unique gene ontology terms related to metabolic activities, a region of chromosome 9 with several growth factor proteins, and regions syntenic with pearl millet or maize genomic regions that have been previously shown to affect growth. The new assembly and annotation for this important species can be used for detailed investigation and future innovations in growth for millet and other grains. PMID:27734962
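
    The coverage figure in this abstract follows directly from the two sizes it reports. A minimal sketch of the calculation is shown here; the function name `assembly_coverage` is an illustrative assumption, and only the 477 Mbp and 490 Mbp values come from the abstract.

```python
def assembly_coverage(assembled_mbp: float, estimated_mbp: float) -> float:
    """Return assembled length as a fraction of the estimated genome size."""
    if estimated_mbp <= 0:
        raise ValueError("estimated genome size must be positive")
    return assembled_mbp / estimated_mbp


if __name__ == "__main__":
    # Sizes quoted in the abstract: 477 Mbp assembled, 490 Mbp estimated.
    coverage = assembly_coverage(477, 490)
    print(f"{coverage:.1%}")  # prints 97.3%
```

    The result agrees with the abstract's statement that the assembly represents over 97% of the estimated genome.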

  19. Assembling the Setaria italica L. Beauv. genome into nine chromosomes and insights into regions affecting growth and drought tolerance.

    PubMed

    Tsai, Kevin J; Lu, Mei-Yeh Jade; Yang, Kai-Jung; Li, Mengyun; Teng, Yuchuan; Chen, Shihmay; Ku, Maurice S B; Li, Wen-Hsiung

    2016-10-13

    The diploid C4 plant foxtail millet (Setaria italica L. Beauv.) is an important crop in many parts of Africa and Asia for the vast consumption of its grain and ability to grow in harsh environments, but remains understudied in terms of complete genomic architecture. To date, there have been only two genome assembly and annotation efforts with neither assembly reaching over 86% of the estimated genome size. We have combined de novo assembly with custom reference-guided improvements on a popular cultivar of foxtail millet and have achieved a genome assembly of 477 Mbp in length, which represents over 97% of the estimated 490 Mbp. The assembly anchors over 98% of the predicted genes to the nine assembled nuclear chromosomes and contains more functional annotation gene models than previous assemblies. Our annotation has identified a large number of unique gene ontology terms related to metabolic activities, a region of chromosome 9 with several growth factor proteins, and regions syntenic with pearl millet or maize genomic regions that have been previously shown to affect growth. The new assembly and annotation for this important species can be used for detailed investigation and future innovations in growth for millet and other grains.

  20. GermOnline 4.0 is a genomics gateway for germline development, meiosis and the mitotic cell cycle.

    PubMed

    Lardenois, Aurélie; Gattiker, Alexandre; Collin, Olivier; Chalmel, Frédéric; Primig, Michael

    2010-01-01

    GermOnline 4.0 is a cross-species database portal focusing on high-throughput expression data relevant for germline development, the meiotic cell cycle and mitosis in healthy versus malignant cells. It is thus a source of information for life scientists as well as clinicians who are interested in gene expression and regulatory networks. The GermOnline gateway provides unlimited access to information produced with high-density oligonucleotide microarrays (3'-UTR GeneChips), genome-wide protein-DNA binding assays and protein-protein interaction studies in the context of Ensembl genome annotation. Samples used to produce high-throughput expression data and to carry out genome-wide in vivo DNA binding assays are annotated via the MIAME-compliant Multiomics Information Management and Annotation System (MIMAS 3.0). Furthermore, the Saccharomyces Genomics Viewer (SGV) was developed and integrated into the gateway. SGV is a visualization tool that outputs genome annotation and DNA-strand specific expression data produced with high-density oligonucleotide tiling microarrays (Sc_tlg GeneChips) which cover the complete budding yeast genome on both DNA strands. It facilitates the interpretation of expression levels and transcript structures determined for various cell types cultured under different growth and differentiation conditions. Database URL: www.germonline.org/
