Science.gov

Sample records for functional gene annotation

  1. Exploiting ontology graph for predicting sparsely annotated gene function

    PubMed Central

    Wang, Sheng; Cho, Hyunghoon; Zhai, ChengXiang; Berger, Bonnie; Peng, Jian

    2015-01-01

    Motivation: Systematically predicting gene (or protein) function based on molecular interaction networks has become an important tool in refining and enhancing the existing annotation catalogs, such as the Gene Ontology (GO) database. However, functional labels with only a few (<10) annotated genes, which constitute about half of the GO terms in yeast, mouse and human, pose a unique challenge in that any prediction algorithm that independently considers each label faces a paucity of information and thus is prone to capture non-generalizable patterns in the data, resulting in poor predictive performance. There exist a variety of algorithms for function prediction, but none properly address this ‘overfitting’ issue of sparsely annotated functions, or do so in a manner scalable to tens of thousands of functions in the human catalog. Results: We propose a novel function prediction algorithm, clusDCA, which transfers information between similar functional labels to alleviate the overfitting problem for sparsely annotated functions. Our method is scalable to datasets with a large number of annotations. In a cross-validation experiment in yeast, mouse and human, our method greatly outperformed previous state-of-the-art function prediction algorithms in predicting sparsely annotated functions, without sacrificing the performance on labels with sufficient information. Furthermore, we show that our method can accurately predict genes that will be assigned a functional label that has no known annotations, based only on the ontology graph structure and genes associated with other labels, which further suggests that our method effectively utilizes the similarity between gene functions. Availability and implementation: https://github.com/wangshenguiuc/clusDCA. Contact: jianpeng@illinois.edu Supplementary information: Supplementary data are available at Bioinformatics online. PMID:26072504

  2. Functional annotation of human cytomegalovirus gene products: an update

    PubMed Central

    Van Damme, Ellen; Van Loock, Marnix

    2014-01-01

    Human cytomegalovirus is an opportunistic double-stranded DNA virus with one of the largest viral genomes known. The 235 kb genome is divided into a unique long (UL) and a unique short (US) region, which are flanked by terminal and internal repeats. The expression of HCMV genes is highly complex and involves the production of protein coding transcripts, polyadenylated long non-coding RNAs, polyadenylated anti-sense transcripts and a variety of non-polyadenylated RNAs such as microRNAs. Although the function of many of these transcripts is unknown, they are suggested to play a direct or regulatory role in the delicately orchestrated processes that ensure HCMV replication and life-long persistence. This review focuses on annotating the complete viral genome based on three sources of information. First, previous reviews were used as a template for the functional keywords to ensure continuity; second, the UniProt database was used to further enrich the functional database; and finally, the literature was manually curated for novel functions of HCMV gene products. Novel discoveries were discussed in light of the viral life cycle. This functional annotation highlights still poorly understood regions of the genome but, more importantly, it can give insight into functional clusters and/or may be helpful in the analysis of future transcriptomics and proteomics studies. PMID:24904534

  3. Functional annotation of rare gene aberration drivers of pancreatic cancer

    PubMed Central

    Tsang, Yiu Huen; Dogruluk, Turgut; Tedeschi, Philip M.; Wardwell-Ozgo, Joanna; Lu, Hengyu; Espitia, Maribel; Nair, Nikitha; Minelli, Rosalba; Chong, Zechen; Chen, Fengju; Chang, Qing Edward; Dennison, Jennifer B.; Dogruluk, Armel; Li, Min; Ying, Haoqiang; Bertino, Joseph R.; Gingras, Marie-Claude; Ittmann, Michael; Kerrigan, John; Chen, Ken; Creighton, Chad J.; Eterovic, Karina; Mills, Gordon B.; Scott, Kenneth L.

    2016-01-01

    As we enter the era of precision medicine, characterization of cancer genomes will directly influence therapeutic decisions in the clinic. Here we describe a platform enabling functionalization of rare gene mutations through their high-throughput construction, molecular barcoding and delivery to cancer models for in vivo tumour driver screens. We apply these technologies to identify oncogenic drivers of pancreatic ductal adenocarcinoma (PDAC). This approach reveals oncogenic activity for rare gene aberrations in genes including NAD Kinase (NADK), which regulates NADP(H) homeostasis and cellular redox state. We further validate mutant NADK, whose expression provides gain-of-function enzymatic activity leading to a reduction in cellular reactive oxygen species and tumorigenesis, and show that depletion of wild-type NADK in PDAC cell lines attenuates cancer cell growth in vitro and in vivo. These data indicate that annotating rare aberrations can reveal important cancer signalling pathways representing additional therapeutic targets. PMID:26806015

  4. Functional annotation of rare gene aberration drivers of pancreatic cancer.

    PubMed

    Tsang, Yiu Huen; Dogruluk, Turgut; Tedeschi, Philip M; Wardwell-Ozgo, Joanna; Lu, Hengyu; Espitia, Maribel; Nair, Nikitha; Minelli, Rosalba; Chong, Zechen; Chen, Fengju; Chang, Qing Edward; Dennison, Jennifer B; Dogruluk, Armel; Li, Min; Ying, Haoqiang; Bertino, Joseph R; Gingras, Marie-Claude; Ittmann, Michael; Kerrigan, John; Chen, Ken; Creighton, Chad J; Eterovic, Karina; Mills, Gordon B; Scott, Kenneth L

    2016-01-01

    As we enter the era of precision medicine, characterization of cancer genomes will directly influence therapeutic decisions in the clinic. Here we describe a platform enabling functionalization of rare gene mutations through their high-throughput construction, molecular barcoding and delivery to cancer models for in vivo tumour driver screens. We apply these technologies to identify oncogenic drivers of pancreatic ductal adenocarcinoma (PDAC). This approach reveals oncogenic activity for rare gene aberrations in genes including NAD Kinase (NADK), which regulates NADP(H) homeostasis and cellular redox state. We further validate mutant NADK, whose expression provides gain-of-function enzymatic activity leading to a reduction in cellular reactive oxygen species and tumorigenesis, and show that depletion of wild-type NADK in PDAC cell lines attenuates cancer cell growth in vitro and in vivo. These data indicate that annotating rare aberrations can reveal important cancer signalling pathways representing additional therapeutic targets. PMID:26806015

  5. Measuring semantic similarities by combining gene ontology annotations and gene co-function networks

    SciTech Connect

    Peng, Jiajie; Uygun, Sahra; Kim, Taehyong; Wang, Yadong; Rhee, Seung Y.; Chen, Jin

    2015-02-14

    Background: Gene Ontology (GO) has been used widely to study functional relationships between genes. The current semantic similarity measures rely only on GO annotations and GO structure. This limits the power of GO-based similarity because of the limited proportion of genes that are annotated to GO in most organisms. Results: We introduce a novel approach called NETSIM (network-based similarity measure) that incorporates information from gene co-function networks in addition to using the GO structure and annotations. Using metabolic reaction maps of yeast, Arabidopsis, and human, we demonstrate that NETSIM can improve the accuracy of GO term similarities. We also demonstrate that NETSIM works well even for genomes with sparser gene annotation data. We applied NETSIM on large Arabidopsis gene families such as cytochrome P450 monooxygenases to group the members functionally and show that this grouping could facilitate functional characterization of genes in these families. Conclusions: Using NETSIM as an example, we demonstrated that the performance of a semantic similarity measure could be significantly improved after incorporating genome-specific information. NETSIM incorporates both GO annotations and gene co-function network data as a priori knowledge in the model. Therefore, functional similarities of GO terms that are not explicitly encoded in GO but are relevant in a taxon-specific manner become measurable when GO annotations are limited.

  6. Measuring semantic similarities by combining gene ontology annotations and gene co-function networks

    DOE PAGES Beta

    Peng, Jiajie; Uygun, Sahra; Kim, Taehyong; Wang, Yadong; Rhee, Seung Y.; Chen, Jin

    2015-02-14

    Background: Gene Ontology (GO) has been used widely to study functional relationships between genes. The current semantic similarity measures rely only on GO annotations and GO structure. This limits the power of GO-based similarity because of the limited proportion of genes that are annotated to GO in most organisms. Results: We introduce a novel approach called NETSIM (network-based similarity measure) that incorporates information from gene co-function networks in addition to using the GO structure and annotations. Using metabolic reaction maps of yeast, Arabidopsis, and human, we demonstrate that NETSIM can improve the accuracy of GO term similarities. We also demonstrate that NETSIM works well even for genomes with sparser gene annotation data. We applied NETSIM on large Arabidopsis gene families such as cytochrome P450 monooxygenases to group the members functionally and show that this grouping could facilitate functional characterization of genes in these families. Conclusions: Using NETSIM as an example, we demonstrated that the performance of a semantic similarity measure could be significantly improved after incorporating genome-specific information. NETSIM incorporates both GO annotations and gene co-function network data as a priori knowledge in the model. Therefore, functional similarities of GO terms that are not explicitly encoded in GO but are relevant in a taxon-specific manner become measurable when GO annotations are limited.

  7. Information theory applied to the sparse gene ontology annotation network to predict novel gene function

    PubMed Central

    Tao, Ying; Li, Jianrong

    2010-01-01

    Motivation: Despite advances in the gene annotation process, the functions of a large portion of the gene products remain insufficiently characterized. In addition, the “in silico” prediction of novel Gene Ontology (GO) annotations for partially characterized gene functions or processes is highly dependent on reverse genetic or functional genomics approaches. Results: We propose a novel approach, Information Theory-based Semantic Similarity (ITSS), to automatically predict molecular functions of genes based on Gene Ontology annotations. We have demonstrated using a 10-fold cross-validation that the ITSS algorithm obtains prediction accuracies (Precision 97%, Recall 77%) comparable to other machine learning algorithms when applied to similarly densely annotated portions of the GO datasets. In addition, this method can generate highly accurate predictions in sparsely annotated portions of GO, where previous algorithms failed to do so. As a result, our technique generates an order of magnitude more gene function predictions than previous methods. Further, this paper presents the first historical rollback validation for the predicted GO annotations, which may represent more realistic conditions for an evaluation than the generally used cross-validation evaluations. By manually assessing a random sample of 100 predictions conducted in a historical rollback evaluation, we estimate that a minimum precision of 51% (95% confidence interval: 43%–58%) can be achieved for the human GO Annotation file dated 2003. Availability: The program is available on request. The 97,732 positive predictions of novel gene annotations from the 2005 GO Annotation dataset are available at http://phenos.bsd.uchicago.edu/mphenogo/prediction_result_2005.txt. PMID:17646340
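
    The precision and recall figures quoted in this abstract follow the usual definitions over sets of (gene, term) predictions. The sketch below only illustrates those definitions; the function name and the toy gene and term labels are hypothetical and are not taken from the ITSS program.

```python
def precision_recall(predicted, actual):
    """Return (precision, recall) for two collections of (gene, term) pairs.

    precision = correct predictions / all predictions
    recall    = correct predictions / all true annotations
    """
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Toy data: placeholder gene and term labels, not real annotations.
predicted_pairs = [("gene1", "termA"), ("gene2", "termB"), ("gene3", "termC")]
held_out_pairs = [("gene1", "termA"), ("gene2", "termB"),
                  ("gene4", "termD"), ("gene5", "termE")]

precision, recall = precision_recall(predicted_pairs, held_out_pairs)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

    In a cross-validation setting the held-out pairs would be the annotations withheld in one fold; in a historical rollback they would be annotations added after the dated annotation file.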

  8. The Duplicated Genes Database: Identification and Functional Annotation of Co-Localised Duplicated Genes across Genomes

    PubMed Central

    Bretaudeau, Anthony; Sallou, Olivier; Diot, Christian; Demeure, Olivier; Lecerf, Frédéric

    2012-01-01

    Background: There has been a surge in studies linking genome structure and gene expression, with special focus on duplicated genes. Although initially duplicated from the same sequence, duplicated genes can diverge strongly over evolution and take on different functions or regulated expression. However, information on the function and expression of duplicated genes remains sparse. Identifying groups of duplicated genes in different genomes and characterizing their expression and function would therefore be of great interest to the research community. The ‘Duplicated Genes Database’ (DGD) was developed for this purpose. Methodology: Nine species were included in the DGD. For each species, BLAST analyses were conducted on peptide sequences corresponding to the genes mapped on the same chromosome. Groups of duplicated genes were defined based on these pairwise BLAST comparisons and the genomic location of the genes. For each group, Pearson correlations between gene expression data and semantic similarities between functional GO annotations were also computed when the relevant information was available. Conclusions: The Duplicated Genes Database provides a list of co-localised and duplicated genes for several species with the available gene co-expression level and semantic similarity value of functional annotation. Adding these data to the groups of duplicated genes provides biological information that can prove useful to gene expression analyses. The Duplicated Genes Database can be freely accessed through the DGD website at http://dgd.genouest.org. PMID:23209799
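
    As an illustration of the per-group co-expression value mentioned in this abstract, the sketch below computes Pearson correlations between the expression profiles of genes in one group. Summarising a group by the mean over all gene pairs is an assumption made here for illustration; the abstract does not say how the DGD aggregates pairwise values, and the gene identifiers and expression values are invented.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / sqrt(var_x * var_y)


def group_coexpression(profiles):
    """Mean pairwise Pearson correlation within one group of genes.

    profiles maps a gene identifier to its list of expression values.
    Averaging over pairs is an illustrative choice, not the DGD's documented method.
    """
    pairs = list(combinations(sorted(profiles), 2))
    if not pairs:
        raise ValueError("a group needs at least two genes")
    return sum(pearson(profiles[a], profiles[b]) for a, b in pairs) / len(pairs)


# Invented expression profiles for a hypothetical group of three genes.
example_group = {
    "geneA": [1.0, 2.0, 3.0],
    "geneB": [2.0, 4.0, 6.0],
    "geneC": [3.0, 2.0, 1.0],
}
print(group_coexpression(example_group))
```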

  9. Combining evidence, biomedical literature and statistical dependence: new insights for functional annotation of gene sets

    PubMed Central

    Aubry, Marc; Monnier, Annabelle; Chicault, Celine; de Tayrac, Marie; Galibert, Marie-Dominique; Burgun, Anita; Mosser, Jean

    2006-01-01

    Background: Large-scale genomic studies based on transcriptome technologies provide clusters of genes that need to be functionally annotated. The Gene Ontology (GO) implements a controlled vocabulary organised into three hierarchies: cellular components, molecular functions and biological processes. This terminology allows a coherent and consistent description of the knowledge about gene functions. The GO terms related to genes come primarily from semi-automatic annotations made by trained biologists (annotation based on evidence) or text-mining of the published scientific literature (literature profiling). Results: We report an original functional annotation method based on a combination of evidence and literature that overcomes the weaknesses and the limitations of each approach. It relies on the Gene Ontology Annotation database (GOA Human) and the PubGene biomedical literature index. We support these annotations with statistically associated GO terms and retrieve associative relations across the three GO hierarchies to emphasise the major pathways in which a gene cluster is involved. Both annotation methods and associative relations were quantitatively evaluated with a reference set of 7397 genes and a multi-cluster study of 14 clusters. We also validated the biological appropriateness of our hybrid method with the annotation of a single gene (cdc2) and that of a down-regulated cluster of 37 genes identified by a transcriptome study of an in vitro enterocyte differentiation model (CaCo-2 cells). Conclusion: The combination of both approaches is more informative than either separate approach: literature mining can enrich an annotation based only on evidence. Text-mining of the literature can also find valuable associated MEDLINE references that confirm the relevance of the annotation. Eventually, GO term networks can be built with associative relations in order to highlight cooperative and competitive pathways and their connected molecular functions. PMID:16674810

  10. Expression profiling of hypothetical genes in Desulfovibrio vulgaris leads to improved functional annotation

    SciTech Connect

    Elias, Dwayne A.; Mukhopadhyay, Aindrila; Joachimiak, Marcin P.; Drury, Elliott C.; Redding, Alyssa M.; Yen, Huei-Che B.; Fields, Matthew W.; Hazen, Terry C.; Arkin, Adam P.; Keasling, Jay D.; Wall, Judy D.

    2008-10-27

    Hypothetical and conserved hypothetical genes account for >30% of sequenced bacterial genomes. For the sulfate-reducing bacterium Desulfovibrio vulgaris Hildenborough, 347 of the 3634 genes were annotated as conserved hypothetical (9.5%) along with 887 hypothetical genes (24.4%). Given the large fraction of the genome, it is plausible that some of these genes serve critical cellular roles. The study goals were to determine which genes were expressed and provide a more functionally based annotation. To accomplish this, expression profiles of 1234 hypothetical and conserved genes were used from transcriptomic datasets of 11 environmental stresses, complemented with shotgun LC-MS/MS and AMT tag proteomic data. Genes were divided into putatively polycistronic operons and those predicted to be monocistronic, then classified by basal expression levels and grouped according to changes in expression for one or multiple stresses. 1212 of these genes were transcribed with 786 producing detectable proteins. There was no evidence for expression of 17 predicted genes. Except for the latter, monocistronic gene annotation was expanded using the above criteria along with matching Clusters of Orthologous Groups. Polycistronic genes were annotated in the same manner with inferences from their proximity to more confidently annotated genes. Two targeted deletion mutants were used as test cases to determine the relevance of the inferred functional annotations.
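
    The percentages in this abstract can be re-derived from the gene counts it reports. The short check below uses only those counts; the variable and function names are chosen here for illustration.

```python
# Gene counts as stated in the abstract for D. vulgaris Hildenborough.
total_genes = 3634
conserved_hypothetical = 347
hypothetical = 887


def percent(part, whole):
    """Share of `part` in `whole`, as a percentage rounded to one decimal."""
    return round(100 * part / whole, 1)


# Both categories together give the 1234 genes whose expression was profiled.
combined = conserved_hypothetical + hypothetical

print(percent(conserved_hypothetical, total_genes))  # 9.5
print(percent(hypothetical, total_genes))            # 24.4
print(combined, percent(combined, total_genes))      # 1234 34.0
```

    The combined share of about 34% is consistent with the statement that such genes account for more than 30% of sequenced bacterial genomes.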

  11. Algal functional annotation tool

    Energy Science and Technology Software Center (ESTSC)

    2012-07-12

    Abstract BACKGROUND: Progress in genome sequencing is proceeding at an exponential pace, and several new algal genomes are becoming available every year. One of the challenges facing the community is the association of protein sequences encoded in the genomes with biological function. While most genome assembly projects generate annotations for predicted protein sequences, they are usually limited and integrate functional terms from a limited number of databases. Another challenge is the use of annotations to interpret large lists of 'interesting' genes generated by genome-scale datasets. Previously, these gene lists had to be analyzed across several independent biological databases, often on a gene-by-gene basis. In contrast, several annotation databases, such as DAVID, integrate data from multiple functional databases and reveal underlying biological themes of large gene lists. While several such databases have been constructed for animals, none is currently available for the study of algae. Due to renewed interest in algae as potential sources of biofuels and the emergence of multiple algal genome sequences, a significant need has arisen for such a database to process the growing compendiums of algal genomic data. DESCRIPTION: The Algal Functional Annotation Tool is a web-based comprehensive analysis suite integrating annotation data from several pathway, ontology, and protein family databases. The current version provides annotation for the model alga Chlamydomonas reinhardtii, and in the future will include additional genomes. The site allows users to interpret large gene lists by identifying associated functional terms, and their enrichment. Additionally, expression data for several experimental conditions were compiled and analyzed to provide an expression-based enrichment search. A tool to search for functionally-related genes based on gene expression across these conditions is also provided. Other features include dynamic visualization of genes on KEGG pathway maps and batch gene identifier conversion. CONCLUSIONS: The Algal Functional Annotation Tool aims to provide an integrated data-mining environment for algal genomics by combining data from multiple annotation databases into a centralized tool. This site is designed to expedite the process of functional annotation and the interpretation of gene lists, such as those derived from high-throughput RNA-seq experiments. The tool is publicly available at http://pathways.mcdb.ucla.edu.

  12. Algal functional annotation tool

    SciTech Connect

    2012-07-12

    Abstract BACKGROUND: Progress in genome sequencing is proceeding at an exponential pace, and several new algal genomes are becoming available every year. One of the challenges facing the community is the association of protein sequences encoded in the genomes with biological function. While most genome assembly projects generate annotations for predicted protein sequences, they are usually limited and integrate functional terms from a limited number of databases. Another challenge is the use of annotations to interpret large lists of 'interesting' genes generated by genome-scale datasets. Previously, these gene lists had to be analyzed across several independent biological databases, often on a gene-by-gene basis. In contrast, several annotation databases, such as DAVID, integrate data from multiple functional databases and reveal underlying biological themes of large gene lists. While several such databases have been constructed for animals, none is currently available for the study of algae. Due to renewed interest in algae as potential sources of biofuels and the emergence of multiple algal genome sequences, a significant need has arisen for such a database to process the growing compendiums of algal genomic data. DESCRIPTION: The Algal Functional Annotation Tool is a web-based comprehensive analysis suite integrating annotation data from several pathway, ontology, and protein family databases. The current version provides annotation for the model alga Chlamydomonas reinhardtii, and in the future will include additional genomes. The site allows users to interpret large gene lists by identifying associated functional terms, and their enrichment. Additionally, expression data for several experimental conditions were compiled and analyzed to provide an expression-based enrichment search. A tool to search for functionally-related genes based on gene expression across these conditions is also provided. Other features include dynamic visualization of genes on KEGG pathway maps and batch gene identifier conversion. CONCLUSIONS: The Algal Functional Annotation Tool aims to provide an integrated data-mining environment for algal genomics by combining data from multiple annotation databases into a centralized tool. This site is designed to expedite the process of functional annotation and the interpretation of gene lists, such as those derived from high-throughput RNA-seq experiments. The tool is publicly available at http://pathways.mcdb.ucla.edu.

  13. Global profiling of Shewanella oneidensis MR-1: Expression of hypothetical genes and improved functional annotations

    SciTech Connect

    Picone, Alex F.; Galperin, Michael Y.; Romine, Margaret; Higdon, Roger; Makarova, Kira S.; Kolker, Natali; Anderson, Gordon A; Qiu, Xiaoyun; Babnigg, Gyorgy; Beliaev, Alexander S; Edlefsen, Paul; Elias, Dwayne A.; Gorby, Yuri A.; Holzman, Ted; Klappenbach, Joel; Konstantinidis, Konstantinos T; Land, Miriam L; Lipton, Mary S.; McCue, Lee Ann; Monroe, Matthew; Pasa-Tolic, Ljiljana; Pinchuk, Grigoriy; Purvine, Samuel; Serres, Margrethe H.; Tsapin, Sasha; Zakrajsek, Brian A.; Zhu, Wenguang; Zhou, Jizhong; Larimer, Frank W; Lawrence, Charles E.; Riley, Monica; Collart, Frank; Yates III, John R.; Smith, Richard D.; Nealson, Kenneth H.; Fredrickson, James K; Tiedje, James M.

    2005-01-01

    The gamma-proteobacterium Shewanella oneidensis strain MR-1 is a metabolically versatile organism that can reduce a wide range of organic compounds, metal ions, and radionuclides. Similar to most other sequenced organisms, approximately 40% of the predicted ORFs in the S. oneidensis genome were annotated as uncharacterized "hypothetical" genes. We implemented an integrative approach by using experimental and computational analyses to provide more detailed insight into gene function. Global expression profiles were determined for cells after UV irradiation and under aerobic and suboxic growth conditions. Transcriptomic and proteomic analyses confidently identified 538 hypothetical genes as expressed in S. oneidensis cells both as mRNAs and proteins (33% of all predicted hypothetical proteins). Publicly available analysis tools and databases and the expression data were applied to improve the annotation of these genes. The annotation results were scored by using a seven-category schema that ranked both confidence and precision of the functional assignment. We were able to identify homologs for nearly all of these hypothetical proteins (97%), but could confidently assign exact biochemical functions for only 16 proteins (category 1; 3%). Altogether, computational and experimental evidence provided functional assignments or insights for 240 more genes (categories 2-5; 45%). These functional annotations advance our understanding of genes involved in vital cellular processes, including energy conversion, ion transport, secondary metabolism, and signal transduction. We propose that this integrative approach offers a valuable means to undertake the enormous challenge of characterizing the rapidly growing number of hypothetical proteins with each newly sequenced genome.

  14. Visual annotation display (VLAD): a tool for finding functional themes in lists of genes.

    PubMed

    Richardson, Joel E; Bult, Carol J

    2015-10-01

    Experiments that employ genome scale technology platforms frequently result in lists of tens to thousands of genes with potential significance to a specific biological process or disease. Searching for biologically relevant connections among the genes or gene products in these lists is a common data analysis task. We have implemented a software application for uncovering functional themes in sets of genes based on their annotations to bio-ontologies, such as the gene ontology and the mammalian phenotype ontology. The application, called VisuaL Annotation Display (VLAD), performs a statistical analysis to test for the enrichment of ontology terms in a set of genes submitted by a researcher. The results for each analysis using VLAD include a table of ontology terms, sorted in decreasing order of significance. Each row contains the term, statistics such as the number of annotated terms, the p value, etc., and the symbols of annotated genes. An accompanying graphical display shows portions of the ontology hierarchy, where node sizes are scaled based on p values. Although numerous ontology term enrichment programs already exist, VLAD is unique in that it allows users to upload their own annotation files and ontologies for customized term enrichment analyses, supports the analysis of multiple gene sets at once, provides interfaces to customize graphical output, and is tightly integrated with functional and biological details about mouse genes in the Mouse Genome Informatics (MGI) database. VLAD is available as a web-based application from the MGI web site (http://proto.informatics.jax.org/prototypes/vlad/). PMID:26047590
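
    Term enrichment of the kind described in this abstract is often computed with a one-sided hypergeometric test. The abstract does not state which test VLAD uses, so the sketch below is a generic illustration under that assumption; the function names, term labels and counts are hypothetical and no multiple-testing correction is applied.

```python
from math import comb


def hypergeom_enrichment_p(population, annotated, sample, hits):
    """P(X >= hits) for X ~ Hypergeometric(population, annotated, sample).

    population: number of genes in the background set
    annotated:  background genes annotated to the term
    sample:     number of genes in the submitted list
    hits:       submitted genes annotated to the term
    """
    if not (0 <= annotated <= population and 0 <= sample <= population):
        raise ValueError("annotated and sample must lie within the population")
    upper = min(annotated, sample)
    if not (0 <= hits <= upper):
        raise ValueError("hits must lie between 0 and min(annotated, sample)")
    # math.comb(n, k) returns 0 when k > n, so impossible draws contribute nothing.
    tail = sum(comb(annotated, k) * comb(population - annotated, sample - k)
               for k in range(hits, upper + 1))
    return tail / comb(population, sample)


def rank_terms(term_counts, population, sample):
    """Return (term, p value) pairs with the most significant term first.

    term_counts maps a term label to (annotated in background, hits in sample).
    """
    scored = [(term, hypergeom_enrichment_p(population, n_annotated, sample, hits))
              for term, (n_annotated, hits) in term_counts.items()]
    return sorted(scored, key=lambda item: item[1])


# Hypothetical counts: 10 background genes, 3 submitted genes.
example_counts = {"termA": (4, 2), "termB": (4, 0)}
for term, p_value in rank_terms(example_counts, population=10, sample=3):
    print(term, round(p_value, 4))
```

    Sorting by ascending p value mirrors the table described above, in which terms appear in decreasing order of significance.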

  15. Gene fusions and gene duplications: relevance to genomic annotation and functional analysis

    PubMed Central

    Serres, Margrethe H; Riley, Monica

    2005-01-01

    Background: Escherichia coli, a model organism, provides information for annotation of other genomes. Our analysis of its genome has shown that proteins encoded by fused genes need special attention. Such composite (multimodular) proteins consist of two or more components (modules) encoding distinct functions. Multimodular proteins have been found to complicate both annotation and generation of sequence-similar groups. Previous work overstated the number of multimodular proteins in E. coli. This work corrects the identification of modules by including sequence information from proteins in 50 sequenced microbial genomes. Results: Multimodular E. coli K-12 proteins were identified from sequence similarities between their component modules and non-fused proteins in 50 genomes and from the literature. We found 109 multimodular proteins in E. coli containing either two or three modules. Most modules had standalone sequence relatives in other genomes. The separated modules together with all the single (un-fused) proteins constitute the sum of all unimodular proteins of E. coli. Pairwise sequence relationships among all E. coli unimodular proteins generated 490 sequence-similar, paralogous groups. Groups ranged in size from 92 to 2 members and had varying degrees of relatedness among their members. Some E. coli enzyme groups were compared to homologs in other bacterial genomes. Conclusion: The deleterious effects of multimodular proteins on annotation and on the formation of groups of paralogs are emphasized. To improve annotation results, all multimodular proteins in an organism should be detected and, when known, each function should be connected with its location in the sequence of the protein. When transferring functions by sequence similarity, alignment locations must be noted, particularly when alignments cover only part of the sequences, in order to enable transfer of the correct function. Separating multimodular proteins into module units makes it possible to generate protein groups related by both sequence and function, avoiding mixing of unrelated sequences. Organisms differ in sizes of groups of sequence-related proteins. A sample comparison of orthologs to selected E. coli paralogous groups correlates with known physiological and taxonomic relationships between the organisms. PMID:15757509

  16. Comparative analysis of grapevine whole-genome gene predictions, functional annotation, categorization and integration of the predicted gene sequences

    PubMed Central

    2012-01-01

    Background: The first draft assembly and gene prediction of the grapevine genome (8X base coverage) was made available to the scientific community in 2007, and functional annotation was developed on this gene prediction. Since then additional Sanger sequences were added to the 8X sequences pool and a new version of the genomic sequence with superior base coverage (12X) was produced. Results: In order to more efficiently annotate the function of the genes predicted in the new assembly, it is important to build on as much of the previous work as possible, by transferring 8X annotation of the genome to the 12X version. The 8X and 12X assemblies and gene predictions of the grapevine genome were compared to answer the question, “Can we uniquely map 8X predicted genes to 12X predicted genes?” The results show that while the assemblies and gene structure predictions are too different to make a complete mapping between them, most genes (18,725) showed a one-to-one relationship between 8X predicted genes and the latest version of 12X predicted genes. In addition, reshuffled genomic sequence structures appeared. These highlight regions of the genome where the gene predictions need to be taken with caution. Based on the new grapevine gene functional annotation and in-depth functional categorization, twenty-eight new molecular networks have been created for VitisNet while the existing networks were updated. Conclusions: The outcomes of this study provide a functional annotation of the 12X genes, an update of VitisNet, the system of the grapevine molecular networks, and a new functional categorization of genes. Data are available at the VitisNet website (http://www.sdstate.edu/ps/research/vitis/pathways.cfm). PMID:22554261

  17. Annotation of plant gene function via combined genomics, metabolomics and informatics.

    PubMed

    Tohge, Takayuki; Fernie, Alisdair R

    2012-01-01

    Given the ever-expanding number of model plant species for which complete genome sequences are available and the abundance of bio-resources such as knockout mutants, wild accessions and advanced breeding populations, there is a rising burden for gene functional annotation. In this protocol, annotation of plant gene function using combined co-expression gene analysis, metabolomics and informatics is provided (Figure 1). This approach is based on the theory of using target genes of known function to allow the identification of non-annotated genes likely to be involved in a certain metabolic process, with the identification of target compounds via metabolomics. Strategies are put forward for applying this information on populations generated by both forward and reverse genetics approaches, although none of these is effortless. By corollary, this approach can also be used to characterise unknown peaks representing new or specific secondary metabolites in limited tissues, plant species or stress treatments, which is currently an important challenge in understanding plant metabolism. PMID:22733029

  18. Annotation of Plant Gene Function via Combined Genomics, Metabolomics and Informatics

    PubMed Central

    Tohge, Takayuki; Fernie, Alisdair R.

    2012-01-01

    Given the ever-expanding number of model plant species for which complete genome sequences are available and the abundance of bio-resources such as knockout mutants, wild accessions and advanced breeding populations, there is a rising burden for gene functional annotation. In this protocol, annotation of plant gene function using combined co-expression gene analysis, metabolomics and informatics is provided (Figure 1). This approach is based on the theory of using target genes of known function to allow the identification of non-annotated genes likely to be involved in a certain metabolic process, with the identification of target compounds via metabolomics. Strategies are put forward for applying this information on populations generated by both forward and reverse genetics approaches, although none of these is effortless. By corollary, this approach can also be used to characterise unknown peaks representing new or specific secondary metabolites in limited tissues, plant species or stress treatments, which is currently an important challenge in understanding plant metabolism. PMID:22733029

  19. Finding New Order in Biological Functions from the Network Structure of Gene Annotations

    PubMed Central

    Glass, Kimberly; Girvan, Michelle

    2015-01-01

    The Gene Ontology (GO) provides biologists with a controlled terminology that describes how genes are associated with functions and how functional terms are related to one another. These term-term relationships encode how scientists conceive the organization of biological functions, and they take the form of a directed acyclic graph (DAG). Here, we propose that the network structure of gene-term annotations made using GO can be employed to establish an alternative approach for grouping functional terms that captures intrinsic functional relationships that are not evident in the hierarchical structure established in the GO DAG. Instead of relying on an externally defined organization for biological functions, our approach connects biological functions together if they are performed by the same genes, as indicated in a compendium of gene annotation data from numerous different sources. We show that grouping terms by this alternate scheme provides a new framework with which to describe and predict the functions of experimentally identified sets of genes. PMID:26588252
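
    A minimal sketch of the grouping idea in the abstract above: two functional terms are linked whenever at least one gene carries both annotations. The function name, the input format and the use of the shared-gene count as an edge weight are assumptions for illustration; the abstract does not state how edges are weighted or how terms are subsequently grouped.

```python
from collections import defaultdict
from itertools import combinations


def term_network(annotations):
    """Build a term-term network from (gene, term) annotation pairs.

    Returns a dict mapping each (term_a, term_b) pair, with term_a < term_b,
    to the number of genes annotated with both terms.
    """
    terms_by_gene = defaultdict(set)
    for gene, term in annotations:
        terms_by_gene[gene].add(term)

    edges = defaultdict(int)
    for terms in terms_by_gene.values():
        # Every pair of terms sharing this gene gains one unit of weight.
        for term_a, term_b in combinations(sorted(terms), 2):
            edges[(term_a, term_b)] += 1
    return dict(edges)


# Hypothetical annotations, for illustration only.
example = [
    ("geneA", "T1"), ("geneA", "T2"),
    ("geneB", "T1"), ("geneB", "T2"), ("geneB", "T3"),
    ("geneC", "T3"),
]
print(term_network(example))
```

    Note that the links depend only on which genes share terms, not on parent-child relationships in the GO DAG, which is the contrast the abstract draws.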

  20. Finding New Order in Biological Functions from the Network Structure of Gene Annotations.

    PubMed

    Glass, Kimberly; Girvan, Michelle

    2015-11-01

    The Gene Ontology (GO) provides biologists with a controlled terminology that describes how genes are associated with functions and how functional terms are related to one another. These term-term relationships encode how scientists conceive the organization of biological functions, and they take the form of a directed acyclic graph (DAG). Here, we propose that the network structure of gene-term annotations made using GO can be employed to establish an alternative approach for grouping functional terms that captures intrinsic functional relationships that are not evident in the hierarchical structure established in the GO DAG. Instead of relying on an externally defined organization for biological functions, our approach connects biological functions together if they are performed by the same genes, as indicated in a compendium of gene annotation data from numerous different sources. We show that grouping terms by this alternate scheme provides a new framework with which to describe and predict the functions of experimentally identified sets of genes. PMID:26588252

  1. Annotating genes of known and unknown function by large-scale coexpression analysis.

    PubMed

    Horan, Kevin; Jang, Charles; Bailey-Serres, Julia; Mittler, Ron; Shelton, Christian; Harper, Jeff F; Zhu, Jian-Kang; Cushman, John C; Gollery, Martin; Girke, Thomas

    2008-05-01

    About 40% of the proteins encoded in eukaryotic genomes are proteins of unknown function (PUFs). Their functional characterization remains one of the main challenges in modern biology. In this study we identified the PUF encoding genes from Arabidopsis (Arabidopsis thaliana) using a combination of sequence similarity, domain-based, and empirical approaches. Large-scale gene expression analyses of 1,310 publicly available Affymetrix chips were performed to associate the identified PUF genes with regulatory networks and biological processes of known function. To generate quality results, the study was restricted to expression sets with replicated samples. First, genome-wide clustering and gene function enrichment analysis of clusters allowed us to associate 1,541 PUF genes with tightly coexpressed genes for proteins of known function (PKFs). Over 70% of them could be assigned to more specific biological process annotations than the ones available in the current Gene Ontology release. The most highly overrepresented functional categories in the obtained clusters were ribosome assembly, photosynthesis, and cell wall pathways. Interestingly, the majority of the PUF genes appeared to be controlled by the same regulatory networks as most PKF genes, because clusters enriched in PUF genes were extremely rare. Second, large-scale analysis of differentially expressed genes was applied to identify a comprehensive set of abiotic stress-response genes. This analysis resulted in the identification of 269 PKF and 104 PUF genes that responded to a wide variety of abiotic stresses, whereas 608 PKF and 206 PUF genes responded predominantly to specific stress treatments. The provided coexpression and differentially expressed gene data represent an important resource for guiding future functional characterization experiments of PUF and PKF genes. 
    Finally, the public Plant Gene Expression Database (http://bioweb.ucr.edu/PED) was developed as part of this project to provide efficient access and mining tools for the vast gene expression data of this study. PMID:18354039
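
    The abstract above mentions gene function enrichment analysis of coexpression clusters without naming the statistical test. One common choice is a one-sided hypergeometric test, sketched below under that assumption; the function and parameter names are hypothetical.

```python
from math import comb


def enrichment_p_value(population, annotated, cluster, annotated_in_cluster):
    """One-sided hypergeometric p-value for over-representation.

    population: number of genes in the background set
    annotated: number of background genes carrying the functional category
    cluster: number of genes in the cluster
    annotated_in_cluster: number of cluster genes carrying the category

    Returns P(X >= annotated_in_cluster) when `cluster` genes are drawn
    at random without replacement from the background.
    """
    total = comb(population, cluster)
    upper = min(annotated, cluster)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, cluster - k)
        for k in range(annotated_in_cluster, upper + 1)
    )
    return tail / total


# Toy numbers: 10 genes, 3 annotated, and a cluster of 3 that contains all of them.
print(enrichment_p_value(10, 3, 3, 3))
```

    In a genome-wide analysis many categories are tested for each cluster, so the resulting p-values would normally need a multiple-testing correction before interpretation.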

  2. Expression profiling of hypothetical genes in Desulfovibrio vulgaris leads to improved functional annotation

    SciTech Connect

    Elias, Dwayne A.; Mukhopadhyay, Aindrila; Joachimiak, Marcine P.; Drury, Elliott C.; Redding, Alyssa M.; Yen, Huei-Che B.; Fields, Matthew; Hazen, Terry C.; Arkin, Adam P.; Keasling, Jay D.; Wall, Judy D.

    2009-03-17

    Hypothetical (HyP) and conserved HyP genes account for >30% of sequenced bacterial genomes. For the sulfate-reducing bacterium Desulfovibrio vulgaris Hildenborough, 347 of the 3634 genes were annotated as conserved HyP (9.5%) along with 887 HyP genes (24.4%). Given the large fraction of the genome, it is plausible that some of these genes serve critical cellular roles. The study goals were to determine which genes were expressed and provide a more functionally based annotation. To accomplish this, expression profiles of 1234 HyP and conserved genes were used from transcriptomic datasets of 11 environmental stresses, complemented with shotgun LC–MS/MS and AMT tag proteomic data. Genes were divided into putatively polycistronic operons and those predicted to be monocistronic, then classified by basal expression levels and grouped according to changes in expression for one or multiple stresses. One thousand two hundred and twelve of these genes were transcribed with 786 producing detectable proteins. There was no evidence for expression of 17 predicted genes.

  3. Genome, Functional Gene Annotation, and Nuclear Transformation of the Heterokont Oleaginous Alga Nannochloropsis oceanica CCMP1779

    PubMed Central

    Tsai, Chia-Hong; Bullard, Blair; Cornish, Adam J.; Harvey, Christopher; Reca, Ida-Barbara; Thornburg, Chelsea; Achawanantakun, Rujira; Buehl, Christopher J.; Campbell, Michael S.; Cavalier, David; Childs, Kevin L.; Clark, Teresa J.; Deshpande, Rahul; Erickson, Erika; Armenia Ferguson, Ann; Handee, Witawas; Kong, Que; Li, Xiaobo; Liu, Bensheng; Lundback, Steven; Peng, Cheng; Roston, Rebecca L.; Sanjaya; Simpson, Jeffrey P.; TerBush, Allan; Warakanont, Jaruswan; Zäuner, Simone; Farre, Eva M.; Hegg, Eric L.; Jiang, Ning; Kuo, Min-Hao; Lu, Yan; Niyogi, Krishna K.; Ohlrogge, John; Osteryoung, Katherine W.; Shachar-Hill, Yair; Sears, Barbara B.; Sun, Yanni; Takahashi, Hideki; Yandell, Mark; Shiu, Shin-Han; Benning, Christoph

    2012-01-01

    Unicellular marine algae have promise for providing sustainable and scalable biofuel feedstocks, although no single species has emerged as a preferred organism. Moreover, adequate molecular and genetic resources prerequisite for the rational engineering of marine algal feedstocks are lacking for most candidate species. Heterokonts of the genus Nannochloropsis naturally have high cellular oil content and are already in use for industrial production of high-value lipid products. First success in applying reverse genetics by targeted gene replacement makes Nannochloropsis oceanica an attractive model to investigate the cell and molecular biology and biochemistry of this fascinating organism group. Here we present the assembly of the 28.7 Mb genome of N. oceanica CCMP1779. RNA sequencing data from nitrogen-replete and nitrogen-depleted growth conditions support a total of 11,973 genes, of which in addition to automatic annotation some were manually inspected to predict the biochemical repertoire for this organism. Among others, more than 100 genes putatively related to lipid metabolism, 114 predicted transcription factors, and 109 transcriptional regulators were annotated. Comparison of the N. oceanica CCMP1779 gene repertoire with the recently published N. gaditana genome identified 2,649 genes likely specific to N. oceanica CCMP1779. Many of these N. oceanica–specific genes have putative orthologs in other species or are supported by transcriptional evidence. However, because similarity-based annotations are limited, functions of most of these species-specific genes remain unknown. Aside from the genome sequence and its analysis, protocols for the transformation of N. oceanica CCMP1779 are provided. 
    The availability of genomic and transcriptomic data for Nannochloropsis oceanica CCMP1779, along with efficient transformation protocols, provides a blueprint for future detailed gene functional analysis and genetic engineering of Nannochloropsis species by a growing academic community focused on this genus. PMID:23166516

  4. Gene Ontology Annotations and Resources

    PubMed Central

    2013-01-01

    The Gene Ontology (GO) Consortium (GOC, http://www.geneontology.org) is a community-based bioinformatics resource that classifies gene product function through the use of structured, controlled vocabularies. Over the past year, the GOC has implemented several processes to increase the quantity, quality and specificity of GO annotations. First, the number of manual, literature-based annotations has grown at an increasing rate. Second, as a result of a new ‘phylogenetic annotation’ process, manually reviewed, homology-based annotations are becoming available for a broad range of species. Third, the quality of GO annotations has been improved through a streamlined process for, and automated quality checks of, GO annotations deposited by different annotation groups. Fourth, the consistency and correctness of the ontology itself has increased by using automated reasoning tools. Finally, the GO has been expanded not only to cover new areas of biology through focused interaction with experts, but also to capture greater specificity in all areas of the ontology using tools for adding new combinatorial terms. The GOC works closely with other ontology developers to support integrated use of terminologies. The GOC supports its user community through the use of e-mail lists, social media and web-based resources. PMID:23161678

  5. Integrating biological knowledge based on functional annotations for biclustering of gene expression data.

    PubMed

    Nepomuceno, Juan A; Troncoso, Alicia; Nepomuceno-Chamorro, Isabel A; Aguilar-Ruiz, Jesús S

    2015-05-01

    Gene expression data analysis is based on the assumption that co-expressed genes imply co-regulated genes. This assumption is being reformulated because the co-expression of a group of genes may be the result of an independent activation with respect to the same experimental condition and not due to the same regulatory regime. For this reason, traditional techniques have recently been improved with the use of prior biological knowledge from open-access repositories together with gene expression data. Biclustering is an unsupervised machine learning technique that searches for patterns in gene expression data matrices. A scatter search-based biclustering algorithm that integrates biological information is proposed in this paper. In addition to the gene expression data matrix, the only input of the algorithm is a direct annotation file that relates each gene to a set of terms from a biological repository where genes are annotated. Two different biological measures, FracGO and SimNTO, are proposed to integrate this information by adding it to the fitness function to be optimized in the scatter search scheme. The measure FracGO is based on biological enrichment and SimNTO is based on the overlap among GO annotations of pairs of genes. Experimental results evaluate the proposed algorithm on two datasets and show that the algorithm performs better when biological knowledge is integrated. Moreover, an analysis and comparison of the two biological measures is presented, and it is concluded that the differences depend on both the data source and how the annotation file has been built when GO is used. It is also shown that the proposed algorithm obtains a greater number of enriched biclusters than other classical biclustering algorithms typically used as benchmarks, and an analysis of the overlap among biclusters reveals that the biclusters obtained have low overlap.
    The proposed methodology is a general-purpose algorithm which allows the integration of biological information from several sources and can be extended to other biclustering algorithms based on the optimization of a merit function. PMID:25843807

  6. The Saccharomyces Genome Database: Gene Product Annotation of Function, Process, and Component.

    PubMed

    Cherry, J Michael

    2015-01-01

    An ontology is a highly structured form of controlled vocabulary. Each entry in the ontology is commonly called a term. These terms are used when talking about an annotation. However, each term has a definition that, like the definition of a word found within a dictionary, provides the complete usage and detailed explanation of the term. It is critical to consult a term's definition because the distinction between terms can be subtle. The use of ontologies in biology started as a way of unifying communication between scientific communities and to provide a standard dictionary for different topics, including molecular functions, biological processes, mutant phenotypes, chemical properties and structures. The creation of ontology terms and their definitions often requires debate to reach agreement but the result has been a unified descriptive language used to communicate knowledge. In addition to terms and definitions, ontologies require a relationship used to define the type of connection between terms. In an ontology, a term can have more than one parent term, the term above it in an ontology, as well as more than one child, the term below it in the ontology. Many ontologies are used to construct annotations in the Saccharomyces Genome Database (SGD), as in all modern biological databases; however, Gene Ontology (GO), a descriptive system used to categorize gene function, is the most extensively used ontology in SGD annotations. Examples included in this protocol illustrate the structure and features of this ontology. PMID:26631125

  7. Insyght: navigating amongst abundant homologues, syntenies and gene functional annotations in bacteria, it's that symbolic!

    PubMed Central

    Lacroix, Thomas; Loux, Valentin; Gendrault, Annie; Hoebeke, Mark; Gibrat, Jean-François

    2014-01-01

    High-throughput techniques have considerably increased the potential of comparative genomics whilst simultaneously posing many new challenges. One of those challenges involves efficiently mining the large amount of data produced and exploring the landscape of both conserved and idiosyncratic genomic regions across multiple genomes. Domains of application of these analyses are diverse: identification of evolutionary events, inference of gene functions, detection of niche-specific genes or phylogenetic profiling. Insyght is a comparative genomic visualization tool that combines three complementary displays: (i) a table for thoroughly browsing amongst homologues, (ii) a comparator of orthologue functional annotations and (iii) a genomic organization view designed to improve the legibility of rearrangements and distinctive loci. The latter display combines symbolic and proportional graphical paradigms. Synchronized navigation across multiple species and interoperability between the views are core features of Insyght. A gene filter mechanism is provided that helps the user to build a biologically relevant gene set according to multiple criteria such as presence/absence of homologues and/or various annotations. We illustrate the use of Insyght with scenarios. Currently, only Bacteria and Archaea are supported. A public instance is available at http://genome.jouy.inra.fr/Insyght. The tool is freely downloadable for private data set analysis. PMID:25249626

  8. Analysis of bovine mammary gland EST and functional annotation of the Bos taurus gene index.

    PubMed

    Sonstegard, Tad S; Capuco, Anthony V; White, Joseph; Van Tassell, Curtis P; Connor, Erin E; Cho, Jennifer; Sultana, Razvan; Shade, Larry; Wray, James E; Wells, Kevin D; Quackenbush, John

    2002-07-01

    Functional genomic studies of the mammary gland require an appropriate collection of cDNA sequences to assess gene expression patterns from the different developmental and operational states of underlying cell types. To better capture the range of gene expression, a normalized cDNA library was constructed from pooled bovine mammary tissues, and 23,202 expressed sequence tags (EST) were produced and deposited into GenBank. Assembly of these EST with sequences in the Bos taurus Gene Index (BtGI) helped to form 5751 of the current 23,883 tentative consensus (TC) sequences. The majority (87%) of these 5751 assemblies contained only one to three mammary-derived EST. In contrast, 18% of the mammary EST assembled with TC sequences corresponding to 12 genes. These results suggest library normalization was only partially effective, because the reduction in EST for genes abundantly transcribed during lactation could be attributed to pooling. For better assessment of novel content in the mammary library and to add to existing annotation of all bovine sequence elements, gene ontology assignments, and comparative sequence analyses against human genome sequence, human and rodent gene indices, and an index of orthologous alignments of genes across eukaryotes (TOGA) were performed, and results were added to existing BtGI annotation. Over 35,000 of the bovine elements significantly matched human genome sequence, and the positions of some alignments (3%) were unique relative to those using human expressed sequences. Because 3445 TC sequences had no significant match with any data set, mammary-derived cDNA clones representing 23 of these elements were analyzed further for expression and novelty. Only one clone met criteria suggesting the corresponding gene was a divergent ortholog or expressed sequence unique to cattle. 
    These results demonstrate that bovine sequence expression data serve as a resource for characterizing mammalian transcriptomes and identifying those genes potentially unique to ruminants. PMID:12140684

  9. Identification and functional annotation of mycobacterial septum formation genes using cell division mutants of Escherichia coli.

    PubMed

    Gaiwala Sharma, Sujata S; Kishore, Vimal; Raghunand, Tirumalai R

    2016-01-01

    The major virulence trait of Mycobacterium tuberculosis is its ability to enter a latent state in the face of robust host immunity. Clues to the molecular basis of latency can emerge from understanding the mechanism of cell division, beginning with identification of proteins involved in this process. Using complementation of Escherichia coli mutants, we functionally annotated M. tuberculosis and Mycobacterium smegmatis homologs of divisome proteins FtsW and AmiC. Our results demonstrate that E. coli can be used as a surrogate model to discover mycobacterial cell division genes, and should prove invaluable in delineating the mechanisms of this fundamental process in mycobacteria. PMID:26577656

  10. Computational algorithms to predict Gene Ontology annotations

    PubMed Central

    2015-01-01

    Background: Gene function annotations, which are associations between a gene and a term of a controlled vocabulary describing gene functional features, are of paramount importance in modern biology. Datasets of these annotations, such as the ones provided by the Gene Ontology Consortium, are used to design novel biological experiments and interpret their results. Despite their importance, these sources of information have some known issues. They are incomplete, since biological knowledge is far from being definitive and it rapidly evolves, and some erroneous annotations may be present. Since the curation process of novel annotations is a costly procedure, in both economic and time terms, computational tools that can reliably predict likely annotations, and thus quicken the discovery of new gene annotations, are very useful. Methods: We used a set of computational algorithms and weighting schemes to infer novel gene annotations from a set of known ones. We used the latent semantic analysis approach, implementing two popular algorithms (Latent Semantic Indexing and Probabilistic Latent Semantic Analysis) and propose a novel method, the Semantic IMproved Latent Semantic Analysis, which adds a clustering step on the set of considered genes. Furthermore, we propose the improvement of these algorithms by weighting the annotations in the input set. Results: We tested our methods and their weighted variants on the Gene Ontology annotation sets of the genes of three model organisms (Bos taurus, Danio rerio and Drosophila melanogaster). The methods showed their ability to predict novel gene annotations, and the weighting procedures led to a valuable improvement, although the obtained results vary according to the dimension of the input annotation set and the considered algorithm. Conclusions: Out of the three considered methods, the Semantic IMproved Latent Semantic Analysis is the one that provides the best results.
    In particular, when coupled with a proper weighting policy, it is able to predict a significant number of novel annotations, proving to be a helpful tool in supporting scientists in the curation process of gene functional annotations. PMID:25916950

  11. Identification and functional annotation of lncRNA genes with hypermethylation in colorectal cancer.

    PubMed

    Liao, Qi; He, Weiling; Liu, Jianfa; Cen, Yi; Luo, Liang; Yu, Chengliang; Li, Yang; Chen, Sitong; Duan, Shiwei

    2015-11-10

    Colorectal cancer (CRC) is one of the leading causes of mortality worldwide. DNA methylation is an important epigenetic modification in CRC. Although a number of studies on DNA methylation of protein-coding genes have been carried out, only a few address the methylation of genes encoding long noncoding RNAs (lncRNAs). In this study, we identified 761 lncRNA genes with DNA hypermethylation in CRC using a free MethylCap-seq dataset. Integration of lncRNA expression and methylation datasets showed that the expression of lncRNAs is negatively correlated with DNA methylation (p<0.01). A co-methylation network was also constructed to annotate the functions of unknown lncRNAs. Our results showed that a total of 364 lncRNAs were annotated with at least one GO biological process term. The current data-mining work is likely to provide informative clues for biological researchers to further understand the role of lncRNAs in the development of CRC. PMID:26172871

  12. Functional Annotation of Genes Overlapping Copy Number Variants in Autistic Patients: Focus on Axon Pathfinding

    PubMed Central

    Sbacchi, Silvia; Acquadro, Francesco; Calò, Ignazio; Calì, Francesco; Romano, Valentino

    2010-01-01

    We used Gene Ontology (GO) and pathway analyses to uncover the common functions associated with the genes overlapping Copy Number Variants (CNVs) in autistic patients. Our data sources were four published studies [1-4]. We first applied a two-step enrichment strategy for autism-specific genes. From the four studies we extracted a list of 2928 genes overlapping a total of 328 CNVs in patients, and selected a sub-group of 2044 genes after excluding those that are also involved in CNVs reported in the Database of Genomic Variants (enrichment step 1). We then selected from the step 1-enriched list a sub-group of 514 genes, each of which was found to be deleted or duplicated in at least two patients (enrichment step 2). The number of statistically significant processes and pathways identified by the Database for Annotation, Visualization and Integrated Discovery and Ingenuity Pathways Analysis software with the step 2-enriched list was significantly higher than with the step 1-enriched list. In addition, statistically significant GO terms, biofunctions and pathways related to nervous system development and function were exclusively identified by the step 2-enriched list of genes. Interestingly, 21 genes were associated with axon growth and pathfinding. These genes and others associated with the nervous system in this study represent a new set of autism candidate genes deserving further investigation. In summary, our results suggest that autism’s “connectivity genes” in some patients affect very early phases of neurodevelopment, i.e., earlier than synaptogenesis. PMID:20885821
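
    The two-step enrichment described in the abstract above amounts to a set difference followed by a recurrence filter. The sketch below shows that logic on toy data; the function name, the data layout (one gene set per patient) and the gene labels are assumptions for illustration and do not reproduce the study's actual pipeline.

```python
from collections import Counter


def two_step_enrichment(patient_cnv_genes, control_cnv_genes, min_patients=2):
    """Apply the two filtering steps to per-patient CNV gene sets.

    patient_cnv_genes: dict mapping patient ID -> set of genes overlapping that patient's CNVs
    control_cnv_genes: genes overlapping CNVs in a reference database of common variants
    Returns (step1, step2) as sets of gene names.
    """
    all_genes = set().union(*patient_cnv_genes.values())

    # Step 1: drop genes that also overlap CNVs in the reference database.
    step1 = all_genes - set(control_cnv_genes)

    # Step 2: keep genes affected in at least `min_patients` patients.
    counts = Counter(
        gene for genes in patient_cnv_genes.values() for gene in set(genes)
    )
    step2 = {gene for gene in step1 if counts[gene] >= min_patients}
    return step1, step2


# Toy data with hypothetical gene labels.
patients = {
    "p1": {"A", "B", "C"},
    "p2": {"B", "C", "D"},
    "p3": {"C", "E"},
}
step1, step2 = two_step_enrichment(patients, control_cnv_genes={"C"})
print(sorted(step1), sorted(step2))
```

    Gene C is removed in step 1 despite being the most recurrent, because it also appears in the reference set; only B survives both steps.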

  13. First survey and functional annotation of prohormone and convertase genes in the pig

    PubMed Central

    2012-01-01

    Background: The pig is a biomedical model to study human and livestock traits. Many of these traits are controlled by neuropeptides that result from the cleavage of prohormones by prohormone convertases. Only 45 prohormones have been confirmed in the pig. Sequence homology can be ineffective to annotate prohormone genes in sequenced species like the pig due to the multifactorial nature of the prohormone processing. The goal of this study is to undertake the first complete survey of prohormone and prohormone convertase genes in the pig genome. These genes were functionally annotated based on 35 gene expression microarray experiments. The cleavage sites of prohormone sequences into potentially active neuropeptides were predicted. Results: We identified 95 unique prohormone genes, 2 alternative calcitonin-related sequences, 8 prohormone convertases and 1 cleavage facilitator in the pig genome 10.2 assembly and trace archives. Of these, 11 pig prohormone genes have not been reported in the UniProt, UniGene or Gene databases. These genes are intermedin, cortistatin, insulin-like 5, orexigenic neuropeptide QRFP, prokineticin 2, prolactin-releasing peptide, parathyroid hormone 2, urocortin, urocortin 2, urocortin 3, and urotensin 2-related peptide. In addition, a novel neuropeptide S was identified in the pig genome correcting the previously reported pig sequence that is identical to the rabbit sequence. Most differentially expressed prohormone genes were under-expressed in pigs experiencing immune challenge relative to the un-challenged controls, in non-pregnant relative to pregnant sows, in old relative to young embryos, and in non-neural relative to neural tissues. The cleavage prediction based on human sequences had the best performance with a correct classification rate of cleaved and non-cleaved sites of 92% suggesting that the processing of prohormones in pigs is similar to humans.
    The cleavage prediction models did not find conclusive evidence supporting the production of the bioactive neuropeptides urocortin 2, urocortin 3, torsin family 2 member A, tachykinin 4, islet amyloid polypeptide, and calcitonin receptor-stimulating peptide 2 in the pig. Conclusions: The present genomic and functional characterization supports the use of the pig as an effective animal model to gain a deeper understanding of prohormones, prohormone convertases and neuropeptides in biomedical and agricultural research. PMID:23153308

  14. Gene Expression and Functional Annotation of the Human Ciliary Body Epithelia

    PubMed Central

    Janssen, Sarah F.; Gorgels, Theo G. M. F.; Bossers, Koen; ten Brink, Jacoline B.; Essing, Anke H. W.; Nagtegaal, Martijn; van der Spek, Peter J.; Jansonius, Nomdo M.; Bergen, Arthur A. B.

    2012-01-01

    Purpose: The ciliary body (CB) of the human eye consists of the non-pigmented (NPE) and pigmented (PE) neuro-epithelia. We investigated the gene expression of NPE and PE, to shed light on the molecular mechanisms underlying the most important functions of the CB. We also developed molecular signatures for the NPE and PE and studied possible new clues for glaucoma. Methods: We isolated NPE and PE cells from seven healthy human donor eyes using laser dissection microscopy. Next, we performed RNA isolation, amplification, labeling and hybridization against 44×k Agilent microarrays. For microarray confirmations, we used a literature study, RT-PCRs, and immunohistochemical stainings. We analyzed the gene expression data with R and with the knowledge database Ingenuity. Results: The gene expression profiles and functional annotations of the NPE and PE were highly similar. We found that the most important functionalities of the NPE and PE were related to developmental processes, neural nature of the tissue, endocrine and metabolic signaling, and immunological functions. In total 1576 genes differed statistically significantly between NPE and PE. Of these genes, at least 3 were cell-specific for the NPE and 143 for the PE. Finally, we observed high expression in the (N)PE of 35 genes previously implicated in molecular mechanisms related to glaucoma. Conclusion: Our gene expression analysis suggested that the NPE and PE of the CB were quite similar. Nonetheless, cell-type specific differences were found. The molecular machineries of the human NPE and PE are involved in a range of neuro-endocrinological, developmental and immunological functions, and perhaps glaucoma. PMID:23028713

  15. Human cell adhesion molecules: annotated functional subtypes and overrepresentation of addiction-associated genes.

    PubMed

    Zhong, Xiaoming; Drgonova, Jana; Li, Chuan-Yun; Uhl, George R

    2015-09-01

    Human cell adhesion molecules (CAMs) are essential for proper development, modulation, and maintenance of interactions between cells and cell-to-cell (and matrix-to-cell) communication about these interactions. Despite the differential functional significance of these roles, there have been surprisingly few systematic studies to enumerate the universe of CAMs and identify specific CAMs in distinct functions. In this paper, we update and review the set of human genes likely to encode CAMs with searches of databases, literature reviews, and annotations. We describe likely CAMs and functional subclasses, including CAMs that have a primary function in information exchange (iCAMs), CAMs involved in focal adhesions, CAM gene products that are preferentially involved with stereotyped and morphologically identifiable connections between cells (e.g., adherens junctions, gap junctions), and smaller numbers of CAM genes in other classes. We discuss a novel proposed mechanism involving selective anchoring of the constituents of iCAM-containing lipid rafts in zones of close neuronal apposition to membranes expressing iCAM binding partners. We also discuss data from genetic and genomic studies of addiction in humans and mouse models to highlight the ways in which CAM variation may contribute to a specific brain-based disorder such as addiction. Specific examples include changes in CAM mRNA splicing mediated by differences in the addiction-associated splicing regulator RBFOX1/A2BP1 and CAM expression in dopamine neurons. PMID:25988664

  16. Annotation of gene function in citrus using gene expression information and co-expression networks

    PubMed Central

    2014-01-01

    Background The genus Citrus encompasses major cultivated plants such as sweet orange, mandarin, lemon and grapefruit, among the world’s most economically important fruit crops. With increasing volumes of transcriptomics data available for these species, Gene Co-expression Network (GCN) analysis is a viable option for predicting gene function at a genome-wide scale. GCN analysis is based on a “guilt-by-association” principle whereby genes encoding proteins involved in similar and/or related biological processes may exhibit similar expression patterns across diverse sets of experimental conditions. While bioinformatics resources such as GCN analysis are widely available for efficient gene function prediction in model plant species including Arabidopsis, soybean and rice, in citrus these tools are not yet developed. Results We have constructed a comprehensive GCN for citrus inferred from 297 publicly available Affymetrix Genechip Citrus Genome microarray datasets, providing gene co-expression relationships at a genome-wide scale (33,000 transcripts). The comprehensive citrus GCN consists of a global GCN (condition-independent) and four condition-dependent GCNs that survey the sweet orange species only, all citrus fruit tissues, all citrus leaf tissues, or stress-exposed plants. All of these GCNs are clustered using genome-wide, gene-centric (guide) and graph clustering algorithms for flexibility of gene function prediction. For each putative cluster, gene ontology (GO) enrichment and gene expression specificity analyses were performed to enhance gene function, expression and regulation pattern prediction. The guide-gene approach was used to infer novel roles of genes involved in disease susceptibility and vitamin C metabolism, and graph-clustering approaches were used to investigate isoprenoid/phenylpropanoid metabolism in citrus peel, and citric acid catabolism via the GABA shunt in citrus fruit. 
    Conclusions Integration of citrus gene co-expression networks, functional enrichment analysis and gene expression information provides opportunities to infer gene function in citrus. We present a publicly accessible tool, Network Inference for Citrus Co-Expression (NICCE, http://citrus.adelaide.edu.au/nicce/home.aspx), for the gene co-expression analysis in citrus. PMID:25023870
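
    The guide-gene form of the "guilt-by-association" principle described in this record can be illustrated with a minimal sketch. It assumes Pearson correlation as the co-expression measure and a fixed cutoff of 0.9; the abstract specifies neither, and the gene names, expression values and function names below are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        # A flat profile carries no co-expression signal.
        return 0.0
    return cov / (sd_x * sd_y)


def coexpressed_with(guide, expression, threshold=0.9):
    """Genes whose profile correlates with the guide gene at or above threshold.

    The threshold is an illustrative assumption, not a value from the record.
    """
    reference = expression[guide]
    hits = []
    for gene, profile in expression.items():
        if gene == guide:
            continue
        r = pearson(reference, profile)
        if r >= threshold:
            hits.append((gene, r))
    # Strongest co-expression partners first.
    return sorted(hits, key=lambda item: -item[1])


# Hypothetical expression values across four conditions.
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.1, 6.0, 8.2],   # rises with geneA
    "geneC": [4.0, 3.0, 2.0, 1.0],   # anti-correlated with geneA
    "geneD": [1.0, 1.0, 1.0, 1.0],   # flat profile
}
neighbours = coexpressed_with("geneA", expression)
```

    In this toy data only geneB passes the cutoff, so a known function of geneA would be proposed as a candidate function for geneB.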

  17. Functional Annotation of Hierarchical Modularity

    PubMed Central

    Padmanabhan, Kanchana; Wang, Kuangyu; Samatova, Nagiza F.

    2012-01-01

    In biological networks of molecular interactions in a cell, network motifs that are biologically relevant are also functionally coherent, or form functional modules. These functionally coherent modules combine in a hierarchical manner into larger, less cohesive subsystems, thus revealing one of the essential design principles of system-level cellular organization and function: hierarchical modularity. Arguably, hierarchical modularity has not been explicitly taken into consideration by most, if not all, functional annotation systems. As a result, the existing methods would often fail to assign a statistically significant functional coherence score to biologically relevant molecular machines. We developed a methodology for hierarchical functional annotation. Given the hierarchical taxonomy of functional concepts (e.g., Gene Ontology) and the association of individual genes or proteins with these concepts (e.g., GO terms), our method will assign a Hierarchical Modularity Score (HMS) to each node in the hierarchy of functional modules; the HMS value measures the functional coherence of each module in the hierarchy. While existing methods annotate each module with a set of “enriched” functional terms in a bag of genes, our complementary method provides the hierarchical functional annotation of the modules and their hierarchically organized components. A hierarchical organization of functional modules often comes as a by-product of cluster analysis of gene expression data or protein interaction data. Otherwise, our method will automatically build such a hierarchy by directly incorporating the functional taxonomy information into the hierarchy search process and by allowing multi-functional genes to be part of more than one component in the hierarchy. In addition, its underlying HMS scoring metric ensures that functional specificity of the terms across different levels of the hierarchical taxonomy is properly treated. We have evaluated our method using Saccharomyces cerevisiae data from KEGG and MIPS databases and several other computationally derived and curated datasets. The code and additional supplemental files can be obtained from http://code.google.com/p/functional-annotation-of-hierarchical-modularity/ (Accessed 2012 March 13). PMID:22496762

  18. Phylogenetic molecular function annotation

    NASA Astrophysics Data System (ADS)

    Engelhardt, Barbara E.; Jordan, Michael I.; Repo, Susanna T.; Brenner, Steven E.

    2009-07-01

    It is now easier to discover thousands of protein sequences in a new microbial genome than it is to biochemically characterize the specific activity of a single protein of unknown function. The molecular functions of protein sequences have typically been predicted using homology-based computational methods, which rely on the principle that homologous proteins share a similar function. However, some protein families include groups of proteins with different molecular functions. A phylogenetic approach for predicting molecular function (sometimes called "phylogenomics") is an effective means to predict protein molecular function. These methods incorporate functional evidence from all members of a family that have functional characterizations using the evolutionary history of the protein family to make robust predictions for the uncharacterized proteins. However, they are often difficult to apply on a genome-wide scale because of the time-consuming step of reconstructing the phylogenies of each protein to be annotated. Our automated approach for function annotation using phylogeny, the SIFTER (Statistical Inference of Function Through Evolutionary Relationships) methodology, uses a statistical graphical model to compute the probabilities of molecular functions for unannotated proteins. Our benchmark tests showed that SIFTER provides accurate functional predictions on various protein families, outperforming other available methods.

  19. Phylogenetic molecular function annotation

    PubMed Central

    Engelhardt, Barbara E; Jordan, Michael I; Repo, Susanna T; Brenner, Steven E

    2010-01-01

    It is now easier to discover thousands of protein sequences in a new microbial genome than it is to biochemically characterize the specific activity of a single protein of unknown function. The molecular functions of protein sequences have typically been predicted using homology-based computational methods, which rely on the principle that homologous proteins share a similar function. However, some protein families include groups of proteins with different molecular functions. A phylogenetic approach for predicting molecular function (sometimes called “phylogenomics”) is an effective means to predict protein molecular function. These methods incorporate functional evidence from all members of a family that have functional characterizations using the evolutionary history of the protein family to make robust predictions for the uncharacterized proteins. However, they are often difficult to apply on a genome-wide scale because of the time-consuming step of reconstructing the phylogenies of each protein to be annotated. Our automated approach for function annotation using phylogeny, the SIFTER (Statistical Inference of Function Through Evolutionary Relationships) methodology, uses a statistical graphical model to compute the probabilities of molecular functions for unannotated proteins. Our benchmark tests showed that SIFTER provides accurate functional predictions on various protein families, outperforming other available methods. PMID:20664722

  20. PFP: Automated prediction of gene ontology functional annotations with confidence scores using protein sequence data.

    PubMed

    Hawkins, Troy; Chitale, Meghana; Luban, Stanislav; Kihara, Daisuke

    2009-02-15

    Protein function prediction is a central problem in bioinformatics, increasing in importance recently due to the rapid accumulation of biological data awaiting interpretation. Sequence data represents the bulk of this new stock and is the obvious target for consideration as input, as newly sequenced organisms often lack any other type of biological characterization. We have previously introduced PFP (Protein Function Prediction) as our sequence-based predictor of Gene Ontology (GO) functional terms. PFP interprets the results of a PSI-BLAST search by extracting and scoring individual functional attributes, searching a wide range of E-value sequence matches, and utilizing conventional data mining techniques to fill in missing information. We have shown it to be effective in predicting both specific and low-resolution functional attributes when sufficient data is unavailable. Here we describe (1) significant improvements to the PFP infrastructure, including the addition of prediction significance and confidence scores, (2) a thorough benchmark of performance and comparisons to other related prediction methods, and (3) applications of PFP predictions to genome-scale data. We applied PFP predictions to uncharacterized protein sequences from 15 organisms. Among these sequences, 60-90% could be annotated with a GO molecular function term at high confidence (≥80%). We also applied our predictions to the protein-protein interaction network of the Malaria plasmodium (Plasmodium falciparum). High confidence GO biological process predictions (≥90%) from PFP increased the number of fully enriched interactions in this dataset from 23% of interactions to 94%. Our benchmark comparison shows significant performance improvement of PFP relative to GOtcha, InterProScan, and PSI-BLAST predictions. This is consistent with the performance of PFP as the overall best predictor in both the AFP-SIG '05 and CASP7 function (FN) assessments. PFP is available as a web service at http://dragon.bio.purdue.edu/pfp/. PMID:18655063

  1. An atlas of tissue-specific conserved coexpression for functional annotation and disease gene prediction

    PubMed Central

    Piro, Rosario Michael; Ala, Ugo; Molineris, Ivan; Grassi, Elena; Bracco, Chiara; Perego, Gian Paolo; Provero, Paolo; Di Cunto, Ferdinando

    2011-01-01

    Gene coexpression relationships that are phylogenetically conserved between human and mouse have been shown to provide important clues about gene function that can be efficiently used to identify promising candidate genes for human hereditary disorders. In the past, such approaches have considered mostly generic gene expression profiles that cover multiple tissues and organs. The individual genes of multicellular organisms, however, can participate in different transcriptional programs, operating at scales as different as single-cell types, tissues, organs, body regions or the entire organism. Therefore, systematic analysis of tissue-specific coexpression could be, in principle, a very powerful strategy to dissect those functional relationships among genes that emerge only in particular tissues or organs. In this report, we show that, in fact, conserved coexpression as determined from tissue-specific and condition-specific data sets can predict many functional relationships that are not detected by analyzing heterogeneous microarray data sets. More importantly, we find that, when combined with disease networks, the simultaneous use of both generic (multi-tissue) and tissue-specific conserved coexpression allows a more efficient prediction of human disease genes than the use of generic conserved coexpression alone. Using this strategy, we were able to identify high-probability candidates for 238 orphan disease loci. We provide proof of concept that this combined use of generic and tissue-specific conserved coexpression can be very useful to prioritize the mutational candidates obtained from deep-sequencing projects, even in the case of genetic disorders as heterogeneous as XLMR. PMID:21654723

  2. In Silico Functional Annotation of Genomic Variation.

    PubMed

    Butkiewicz, Mariusz; Bush, William S

    2016-01-01

    This unit describes the concepts and practical techniques for annotating genomic variants in the human genome to estimate their functional significance. With the rapid increase of available whole exome and whole genome sequencing information for human studies, annotation techniques have become progressively more important for highlighting and prioritizing nucleotide variants and their potential impact on genes and other genetic constructs. Here, we present an overview of different types of variant annotation approaches and elaborate on their foundations, assumptions, and the downstream consequences of their use. Computational approaches and tools to assign annotations and to identify variants are reviewed. Further, the general philosophy of assigning potential function to a genetic change within the biological context of a disease is discussed. © 2016 by John Wiley & Sons, Inc. PMID:26724722

  3. Evidence-Based Annotation of Gene Function in Shewanella oneidensis MR-1 Using Genome-Wide Fitness Profiling across 121 Conditions

    PubMed Central

    Deutschbauer, Adam; Price, Morgan N.; Wetmore, Kelly M.; Shao, Wenjun; Baumohl, Jason K.; Xu, Zhuchen; Nguyen, Michelle; Tamse, Raquel; Davis, Ronald W.; Arkin, Adam P.

    2011-01-01

    Most genes in bacteria are experimentally uncharacterized and cannot be annotated with a specific function. Given the great diversity of bacteria and the ease of genome sequencing, high-throughput approaches to identify gene function experimentally are needed. Here, we use pools of tagged transposon mutants in the metal-reducing bacterium Shewanella oneidensis MR-1 to probe the mutant fitness of 3,355 genes in 121 diverse conditions including different growth substrates, alternative electron acceptors, stresses, and motility. We find that 2,350 genes have a pattern of fitness that is significantly different from random and 1,230 of these genes (37% of our total assayed genes) have enough signal to show strong biological correlations. We find that genes in all functional categories have phenotypes, including hundreds of hypotheticals, and that potentially redundant genes (over 50% amino acid identity to another gene in the genome) are also likely to have distinct phenotypes. Using fitness patterns, we were able to propose specific molecular functions for 40 genes or operons that lacked specific annotations or had incomplete annotations. In one example, we demonstrate that the previously hypothetical gene SO_3749 encodes a functional acetylornithine deacetylase, thus filling a missing step in S. oneidensis metabolism. Additionally, we demonstrate that the orphan histidine kinase SO_2742 and orphan response regulator SO_2648 form a signal transduction pathway that activates expression of acetyl-CoA synthase and is required for S. oneidensis to grow on acetate as a carbon source. Lastly, we demonstrate that gene expression and mutant fitness are poorly correlated and that mutant fitness generates more confident predictions of gene function than does gene expression. The approach described here can be applied generally to create large-scale gene-phenotype maps for evidence-based annotation of gene function in prokaryotes. PMID:22125499

  4. Functional annotation of rare gene aberration drivers of pancreatic cancer

    Cancer.gov

    As we enter the era of precision medicine, characterization of cancer genomes will directly influence therapeutic decisions in the clinic. Here we describe a platform enabling functionalization of rare gene mutations through their high-throughput construction, molecular barcoding and delivery to cancer models for in vivo tumour driver screens. We apply these technologies to identify oncogenic drivers of pancreatic ductal adenocarcinoma (PDAC).

  5. Morgan's legacy: fruit flies and the functional annotation of conserved genes.

    PubMed

    Bellen, Hugo J; Yamamoto, Shinya

    2015-09-24

    In 1915, "The Mechanism of Mendelian Heredity" was published by four prominent Drosophila geneticists. They discovered that genes form linkage groups on chromosomes inherited in a Mendelian fashion and laid the genetic foundation that promoted Drosophila as a model organism. Flies continue to offer great opportunities, including studies in the field of functional genomics. PMID:26406362

  6. Morgan’s Legacy: Fruit Flies and the Functional Annotation of Conserved Genes

    PubMed Central

    Bellen, Hugo J.; Yamamoto, Shinya

    2016-01-01

    In 1915, “The Mechanism of Mendelian Heredity” was published by four prominent Drosophila geneticists. They discovered that genes form linkage groups on chromosomes inherited in a Mendelian fashion and laid the genetic foundation that promoted Drosophila as a model organism. Flies continue to offer great opportunities, including studies in the field of functional genomics. PMID:26406362

  7. Developing of the Computer Method for Annotation of Bacterial Genes

    PubMed Central

    Golyshev, Mikhail A.; Korotkov, Eugene V.

    2015-01-01

    In recent years, a great number of bacterial genomes have been sequenced. One of the most important challenges of computational genomics is now the functional annotation of nucleic acid sequences. In this study we present a computational method and an annotation system for predicting biological functions using phylogenetic profiles. The phylogenetic profile of a gene was created by searching for similarities between the nucleotide sequence of the gene and 1204 reference genomes, followed by estimation of the statistical significance of the similarities found. The profiles of genes with known functions were used to predict possible functions and functional groups for new genes. We conducted the functional annotation for genes from 104 bacterial genomes and compared the functions predicted by our system with the already known functions. For genes that had already been annotated, the known function matched the predicted function 63% of the time, and 86% of the time the known function was found within the top five predicted functions. In addition, our system increased the share of annotated genes by 19%. The developed system may be used as an alternative or complementary system to the current annotation systems. PMID:26770195
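
    The profile-based prediction described in this record can be sketched as follows. This is a simplified illustration, not the authors' system: it assumes each profile has already been reduced to the set of reference genomes with a statistically significant match, and it uses Jaccard similarity to compare profiles, which the abstract does not state. All genome identifiers and function names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of genome identifiers."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def predict_functions(query_profile, annotated_profiles, top_k=5):
    """Rank known functions by their best profile similarity to the query.

    annotated_profiles is a list of (function, profile) pairs, one per
    annotated gene. top_k=5 mirrors the "top five predicted functions"
    evaluation mentioned in the abstract.
    """
    best = {}
    for function, profile in annotated_profiles:
        score = jaccard(query_profile, profile)
        if score > best.get(function, -1.0):
            best[function] = score
    # Highest similarity first; ties broken alphabetically for determinism.
    ranked = sorted(best.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_k]


# Hypothetical profiles: sets of reference genomes with a significant hit.
annotated = [
    ("DNA repair", {"g1", "g2", "g3", "g4"}),
    ("DNA repair", {"g1", "g2"}),
    ("flagellar assembly", {"g5", "g6"}),
    ("amino acid transport", {"g1", "g5", "g7"}),
]
query = {"g1", "g2", "g3"}
predictions = predict_functions(query, annotated)
```

    Here the query profile shares three of four genomes with an annotated DNA repair gene, so that function ranks first.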

  8. Functional Annotation of Cotesia congregata Bracovirus: Identification of Viral Genes Expressed in Parasitized Host Immune Tissues

    PubMed Central

    Thézé, Julien; Cambier, Sébastien; Poulain, Julie; Da Silva, Corinne; Bézier, Annie; Musset, Karine; Moreau, Sébastien J. M.; Drezen, Jean-Michel

    2014-01-01

    ABSTRACT Bracoviruses (BVs) from the Polydnaviridae family are symbiotic viruses used as biological weapons by parasitoid wasps to manipulate lepidopteran host physiology and induce parasitism success. BV particles are produced by wasp ovaries and injected along with the eggs into the caterpillar host body, where viral gene expression is necessary for wasp development. Recent sequencing of the proviral genome of Cotesia congregata BV (CcBV) identified 222 predicted virulence genes present on 35 proviral segments integrated into the wasp genome. To date, the expressions of only a few selected candidate virulence genes have been studied in the caterpillar host, and we lacked a global vision of viral gene expression. In this study, a large-scale transcriptomic analysis by 454 sequencing of two immune tissues (fat body and hemocytes) of parasitized Manduca sexta caterpillar hosts allowed the detection of expression of 88 CcBV genes expressed 24 h after the onset of parasitism. We linked the expression profiles of these genes to several factors, showing that different regulatory mechanisms control viral gene expression in the host. These factors include the presence of signal peptides in encoded proteins, diversification of promoter regions, and, more surprisingly, gene position on the proviral genome. Indeed, most genes for which expression could be detected are localized in particular proviral regions globally producing higher numbers of circles. Moreover, this polydnavirus (PDV) transcriptomic analysis also reveals that a majority of CcBV genes possess at least one intron and an arthropod transcription start site, consistent with an insect origin of these virulence genes. IMPORTANCE Bracoviruses (BVs) are symbiotic polydnaviruses used by parasitoid wasps to manipulate lepidopteran host physiology, ensuring wasp offspring survival. To date, the expressions of only a few selected candidate BV virulence genes have been studied in caterpillar hosts. 
    We performed a large-scale analysis of BV gene expression in two immune tissues of Manduca sexta caterpillars parasitized by Cotesia congregata wasps. Genes for which expression could be detected corresponded to genes localized in particular regions of the viral genome globally producing higher numbers of circles. Our study thus brings an original global vision of viral gene expression and paves the way to the determination of the regulatory mechanisms enabling the expression of BV genes in targeted organisms, such as major insect pests. In addition, we identify sequence features suggesting that most BV virulence genes were acquired from insect genomes. PMID:24872581

  9. Structural and functional annotation of the porcine immunome

    PubMed Central

    2013-01-01

    Background The domestic pig is known as an excellent model for human immunology and the two species share many pathogens. Susceptibility to infectious disease is one of the major constraints on swine performance, yet the structure and function of genes comprising the pig immunome are not well-characterized. The completion of the pig genome provides the opportunity to annotate the pig immunome, and compare and contrast pig and human immune systems. Results The Immune Response Annotation Group (IRAG) used computational curation and manual annotation of the swine genome assembly 10.2 (Sscrofa10.2) to refine the currently available automated annotation of 1,369 immunity-related genes through sequence-based comparison to genes in other species. Within these genes, we annotated 3,472 transcripts. Annotation provided evidence for gene expansions in several immune response families, and identified artiodactyl-specific expansions in the cathelicidin and type 1 Interferon families. We found gene duplications for 18 genes, including 13 immune response genes and five non-immune response genes discovered in the annotation process. Manual annotation provided evidence for many new alternative splice variants and 8 gene duplications. Over 1,100 transcripts without porcine sequence evidence were detected using cross-species annotation. We used a functional approach to discover and accurately annotate porcine immune response genes. A co-expression clustering analysis of transcriptomic data from selected experimental infections or immune stimulations of blood, macrophages or lymph nodes identified a large cluster of genes that exhibited a correlated positive response upon infection across multiple pathogens or immune stimuli. Interestingly, this gene cluster (cluster 4) is enriched for known general human immune response genes, yet contains many un-annotated porcine genes. 
    A phylogenetic analysis of the encoded proteins of cluster 4 genes showed that 15% exhibited an accelerated evolution as compared to 4.1% across the entire genome. Conclusions This extensive annotation dramatically extends the genome-based knowledge of the molecular genetics and structure of a major portion of the porcine immunome. Our complementary functional approach using co-expression during immune response has provided new putative immune response annotation for over 500 porcine genes. Our phylogenetic analysis of this core immunome cluster confirms rapid evolutionary change in this set of genes, and that, as in other species, such genes are important components of the pig’s adaptation to pathogen challenge over evolutionary time. These comprehensive and integrated analyses increase the value of the porcine genome sequence and provide important tools for global analyses and data-mining of the porcine immune response. PMID:23676093

  10. The GOA database: gene Ontology annotation updates for 2015.

    PubMed

    Huntley, Rachael P; Sawford, Tony; Mutowo-Meullenet, Prudence; Shypitsyna, Aleksandra; Bonilla, Carlos; Martin, Maria J; O'Donovan, Claire

    2015-01-01

    The Gene Ontology Annotation (GOA) resource (http://www.ebi.ac.uk/GOA) provides evidence-based Gene Ontology (GO) annotations to proteins in the UniProt Knowledgebase (UniProtKB). Manual annotations provided by UniProt curators are supplemented by manual and automatic annotations from model organism databases and specialist annotation groups. GOA currently supplies 368 million GO annotations to almost 54 million proteins in more than 480,000 taxonomic groups. The resource now provides annotations to five times the number of proteins it did 4 years ago. As a member of the GO Consortium, we adhere to the most up-to-date Consortium-agreed annotation guidelines via the use of quality control checks that ensure that the GOA resource supplies high-quality functional information to proteins from a wide range of species. Annotations from GOA are freely available and are accessible through a powerful web browser as well as a variety of annotation file formats. PMID:25378336

  11. The GOA database: Gene Ontology annotation updates for 2015

    PubMed Central

    Huntley, Rachael P.; Sawford, Tony; Mutowo-Meullenet, Prudence; Shypitsyna, Aleksandra; Bonilla, Carlos; Martin, Maria J.; O'Donovan, Claire

    2015-01-01

    The Gene Ontology Annotation (GOA) resource (http://www.ebi.ac.uk/GOA) provides evidence-based Gene Ontology (GO) annotations to proteins in the UniProt Knowledgebase (UniProtKB). Manual annotations provided by UniProt curators are supplemented by manual and automatic annotations from model organism databases and specialist annotation groups. GOA currently supplies 368 million GO annotations to almost 54 million proteins in more than 480 000 taxonomic groups. The resource now provides annotations to five times the number of proteins it did 4 years ago. As a member of the GO Consortium, we adhere to the most up-to-date Consortium-agreed annotation guidelines via the use of quality control checks that ensure that the GOA resource supplies high-quality functional information to proteins from a wide range of species. Annotations from GOA are freely available and are accessible through a powerful web browser as well as a variety of annotation file formats. PMID:25378336

  12. Functional Annotation Analytics of Rhodopseudomonas palustris Genomes

    PubMed Central

    Simmons, Shaneka S.; Isokpehi, Raphael D.; Brown, Shyretha D.; McAllister, Donee L.; Hall, Charnia C.; McDuffy, Wanaki M.; Medley, Tamara L.; Udensi, Udensi K.; Rajnarayanan, Rajendram V.; Ayensu, Wellington K.; Cohly, Hari H.P.

    2011-01-01

    Rhodopseudomonas palustris, a nonsulphur purple photosynthetic bacterium, has been extensively investigated for its metabolic versatility, including the ability to produce hydrogen gas from sunlight and biomass. The availability of the finished genome sequences of six R. palustris strains (BisA53, BisB18, BisB5, CGA009, HaA2 and TIE-1) combined with online bioinformatics software for integrated analysis presents new opportunities to determine the genomic basis of metabolic versatility and ecological lifestyles of the bacterial species. The purpose of this investigation was to compare the functional annotations available for multiple R. palustris genomes to identify annotations that can be further investigated for strain-specific or uniquely shared phenotypic characteristics. A total of 2,355 protein family Pfam domain annotations were clustered based on presence or absence in the six genomes. The clustering process identified groups of functional annotations including those that could be verified as strain-specific or uniquely shared phenotypes. For example, genes encoding water/glycerol transport were present in the genome sequences of strains CGA009 and BisB5, but absent in strains BisA53, BisB18, HaA2 and TIE-1. Protein structural homology modeling predicted that the two orthologous 240 aa R. palustris aquaporins have water-specific transport function. Based on observations in other microbes, the presence of aquaporin in R. palustris strains may improve freeze tolerance in natural conditions of rapid freezing such as nitrogen fixation at low temperatures where access to liquid water is a limiting factor for nitrogenase activation. In the case of adaptive loss of aquaporin genes, strains may be better adapted to survive in conditions of high-sugar content such as fermentation of biomass for biohydrogen production. Finally, web-based resources were developed to allow for interactive, user-defined selection of the relationship between protein family annotations and the R. palustris genomes. PMID:22084572
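
    Clustering annotations by presence or absence across the six genomes can be sketched in its simplest form as grouping domains that share an identical presence pattern. This is an assumed simplification; the abstract does not name the clustering procedure. The strain names come from the record, while the domain labels and their strain assignments are hypothetical.

```python
from collections import defaultdict

# The six strains named in the record, in a fixed order.
STRAINS = ("BisA53", "BisB18", "BisB5", "CGA009", "HaA2", "TIE-1")


def group_by_pattern(domain_to_strains):
    """Group domains that share the same presence/absence pattern.

    Returns a dict mapping a tuple of booleans (one per strain, in STRAINS
    order) to the sorted list of domains with that pattern.
    """
    groups = defaultdict(list)
    for domain, present_in in domain_to_strains.items():
        pattern = tuple(strain in present_in for strain in STRAINS)
        groups[pattern].append(domain)
    return {pattern: sorted(domains) for pattern, domains in groups.items()}


def strains_of(pattern):
    """Translate a boolean pattern back into the strain names it covers."""
    return tuple(s for s, present in zip(STRAINS, pattern) if present)


# Hypothetical domain labels and strain assignments.
domains = {
    "domA": set(STRAINS),              # present in every genome
    "domB": {"BisB5", "CGA009"},       # shared by exactly two strains
    "domC": {"CGA009", "BisB5"},       # same pattern as domB
    "domD": {"TIE-1"},                 # strain-specific
}
groups = group_by_pattern(domains)
shared = {strains_of(p): d for p, d in groups.items()}
```

    Patterns covering a single strain correspond to candidate strain-specific annotations, and patterns covering a small subset correspond to uniquely shared ones, such as the CGA009 and BisB5 example in the abstract.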

  13. Metagenomic gene annotation by a homology-independent approach

    SciTech Connect

    Froula, Jeff; Zhang, Tao; Salmeen, Annette; Hess, Matthias; Kerfeld, Cheryl A.; Wang, Zhong; Du, Changbin

    2011-06-02

    Fully understanding the genetic potential of a microbial community requires functional annotation of all the genes it encodes. The recently developed deep metagenome sequencing approach has enabled rapid identification of millions of genes from a complex microbial community without cultivation. Current homology-based gene annotation fails to detect distantly related or structural homologs. Furthermore, homology searches with millions of genes are very computationally intensive. To overcome these limitations, we developed rhModeller, a homology-independent software pipeline to efficiently annotate genes from metagenomic sequencing projects. Using cellulases and carbonic anhydrases as two independent test cases, we demonstrated that rhModeller is much faster than HMMER but with comparable accuracy, at 94.5% and 99.9% accuracy, respectively. More importantly, rhModeller has the ability to detect novel proteins that do not share significant homology to any known protein families. As ~50% of the 2 million genes derived from the cow rumen metagenome failed to be annotated based on sequence homology, we tested whether rhModeller could be used to annotate these genes. Preliminary results suggest that rhModeller is robust in the presence of missense and frameshift mutations, two common errors in metagenomic genes. Applying the pipeline to the cow rumen genes identified 4,990 novel cellulase candidates and 8,196 novel carbonic anhydrase candidates. In summary, we expect rhModeller to dramatically increase the speed and quality of metagenomic gene annotation.

  14. COFECO: composite function annotation enriched by protein complex data

    PubMed Central

    Sun, Choong-Hyun; Kim, Min-Sung; Han, Youngwoong; Yi, Gwan-Su

    2009-01-01

    COFECO is a web-based tool for a composite annotation of protein complexes, KEGG pathways and Gene Ontology (GO) terms within a class of genes and their orthologs under study. Widely used functional enrichment tools using GO and KEGG pathways create large lists of annotations that make it difficult to derive consolidated information and often include over-generalized terms. The interrelationship of annotation terms can be more clearly delineated by integrating the information of physically interacting proteins with biological pathways and GO terms. COFECO has the following advanced characteristics: (i) The composite annotation sets of correlated functions and cellular processes for a given gene set can be identified in a more comprehensive and specific way by the employment of protein complex data together with GO and KEGG pathways as annotation resources. (ii) Orthology-based integrative annotations among different species complement the defective annotations in an individual genome and provide the information of evolutionarily conserved correlations. (iii) A term filtering feature enables users to collect the specified annotations enriched with selected function terms. (iv) A cross-comparison of annotation results between two different datasets is possible. In addition, COFECO provides a web-based GO hierarchical viewer and KEGG pathway viewer where the enrichment results can be summarized and further explored. COFECO is freely accessible at http://piech.kaist.ac.kr/cofeco. PMID:19429688

  15. Functional Annotation of Small Noncoding RNAs Target Genes Provides Evidence for a Deregulated Ubiquitin-Proteasome Pathway in Spinocerebellar Ataxia Type 1

    PubMed Central

    Persengiev, Stephan; Kondova, Ivanela; Bontrop, Ronald E.

    2012-01-01

    Spinocerebellar ataxia type 1 (SCA1) is a neurodegenerative disorder caused by the expansion of CAG repeats in the ataxin 1 (ATXN1) gene. In affected cerebellar neurons of patients, mutant ATXN1 accumulates in ubiquitin-positive nuclear inclusions, indicating that protein misfolding is involved in SCA1 pathogenesis. In this study, we functionally annotated the target genes of the small noncoding RNAs (ncRNAs) that were selectively activated in the affected brain compartments. The primary targets of these RNAs, which exhibited a significant enrichment in the cerebellum and cortex of SCA1 patients, were members of the ubiquitin-proteasome system. Thus, we identified and functionally annotated a plausible regulatory pathway that may serve as a potential target to modulate the outcome of neurodegenerative diseases. PMID:23094141

  16. JGI Plant Genomics Gene Annotation Pipeline

    SciTech Connect

    Shu, Shengqiang; Rokhsar, Dan; Goodstein, David; Hayes, David; Mitros, Therese

    2014-07-14

    Plant genomes vary in size and are highly complex, with a high proportion of repeats, genome duplication and tandem duplication. Genes encode a wealth of information useful in studying an organism, so high-quality and stable gene annotation is critical. Thanks to advances in sequencing technology, the genomes and transcriptomes of many plant species have been sequenced. To use these large amounts of sequence data for gene annotation or re-annotation in a timely fashion, an automatic pipeline is needed. The JGI plant genomics gene annotation pipeline, called integrated gene call (IGC), is our effort toward this aim, aided by an RNA-seq transcriptome assembly pipeline. It utilizes several gene predictors based on homolog peptides and transcript ORFs. See Methods for detail. Here we present genome annotations of JGI flagship green plants produced by this pipeline (except for chlamy, which is done by a third party), plus Arabidopsis and rice. The genome annotations of these species and others are used in our gene family build pipeline and are accessible via the JGI Phytozome portal, whose URL and front page snapshot are shown below.

  17. The Gene Wiki: community intelligence applied to human gene annotation

    PubMed Central

    Huss, Jon W.; Lindenbaum, Pierre; Martone, Michael; Roberts, Donabel; Pizarro, Angel; Valafar, Faramarz; Hogenesch, John B.; Su, Andrew I.

    2010-01-01

    Annotating the function of all human genes is a critical, yet formidable, challenge. Current gene annotation efforts focus on centralized curation resources, but it is increasingly clear that this approach does not scale with the rapid growth of the biomedical literature. The Gene Wiki utilizes an alternative and complementary model based on the principle of community intelligence. Directly integrated within the online encyclopedia, Wikipedia, the goal of this effort is to build a gene-specific review article for every gene in the human genome, where each article is collaboratively written, continuously updated and community reviewed. Previously, we described the creation of Gene Wiki ‘stubs’ for approximately 9000 human genes. Here, we describe ongoing systematic improvements to these articles to increase their utility. Moreover, we retrospectively examine the community usage and improvement of the Gene Wiki, providing evidence of a critical mass of users and editors. Gene Wiki articles are freely accessible within the Wikipedia web site, and additional links and information are available at http://en.wikipedia.org/wiki/Portal:Gene_Wiki. PMID:19755503

  18. Functional genomics tools applied to plant metabolism: a survey on plant respiration, its connections and the annotation of complex gene functions

    PubMed Central

    Araújo, Wagner L.; Nunes-Nesi, Adriano; Williams, Thomas C. R.

    2012-01-01

    The application of post-genomic techniques in plant respiration studies has greatly improved our ability to assign functions to gene products. It has also revealed previously unappreciated interactions between distal elements of metabolism. Such results have reinforced the need to consider plant respiratory metabolism as part of a complex network, and making sense of such interactions will ultimately require the construction of predictive and mechanistic models. Transcriptomics, proteomics, metabolomics, and the quantification of metabolic flux will be of great value in creating such models, both by facilitating the annotation of complex gene function and determining model structure, and by furnishing the quantitative data required to test them. In this review, we highlight how these experimental approaches have contributed to our current understanding of plant respiratory metabolism and its interplay with associated processes (e.g., photosynthesis, photorespiration, and nitrogen metabolism). We also discuss how data from these techniques may be integrated, with the ultimate aim of identifying mechanisms that control and regulate plant respiration and discovering novel gene functions with potential biotechnological implications. PMID:22973288

  19. Characterizing the state of the art in the computational assignment of gene function: lessons from the first critical assessment of functional annotation (CAFA)

    PubMed Central

    2013-01-01

    The assignment of gene function remains a difficult but important task in computational biology. The establishment of the first Critical Assessment of Functional Annotation (CAFA) was aimed at increasing progress in the field. We present an independent analysis of the results of CAFA, aimed at identifying challenges in assessment and at understanding trends in prediction performance. We found that well-accepted methods based on sequence similarity (i.e., BLAST) have a dominant effect. Many of the most informative predictions turned out to be either recovering existing knowledge about sequence similarity or were "post-dictions" already documented in the literature. These results indicate that deep challenges remain in even defining the task of function assignment, with a particular difficulty posed by the problem of defining function in a way that is not dependent on either flawed gold standards or the input data itself. In particular, we suggest that using the Gene Ontology (or other similar systematizations of function) as a gold standard is unlikely to be the way forward. PMID:23630983

  20. Genetic Annotation of Gain-Of-Function Screens Using RNA Interference and in Situ Hybridization of Candidate Genes in the Drosophila Wing

    PubMed Central

    Molnar, Cristina; Casado, Mar; López-Varea, Ana; Cruz, Cristina; de Celis, Jose F.

    2012-01-01

    Gain-of-function screens in Drosophila are an effective method with which to identify genes that affect the development of particular structures or cell types. It has been found that a fraction of 2–10% of the genes tested, depending on the particularities of the screen, results in a discernible phenotype when overexpressed. However, it is not clear to what extent a gain-of-function phenotype generated by overexpression is informative about the normal function of the gene. Thus, very few reports attempt to correlate the loss- and overexpression phenotype for collections of genes identified in gain-of-function screens. In this work we use RNA interference and in situ hybridization to annotate a collection of 123 P-GS insertions that in combination with different Gal4 drivers affect the size and/or patterning of the wing. We identify the gene causing the overexpression phenotype by expressing, in a background of overexpression, RNA interference for the genes affected by each P-GS insertion. Then, we compare the loss and gain-of-function phenotypes obtained for each gene and relate them to its expression pattern in the wing disc. We find that 52% of genes identified by their overexpression phenotype are required during normal development. However, only in 9% of the cases analyzed was there some complementarity between the gain- and loss-of-function phenotype, suggesting that, in general, the overexpression phenotypes would not be indicative of the normal requirements of the gene. PMID:22798488

  1. Metabolomics as a Hypothesis-Generating Functional Genomics Tool for the Annotation of Arabidopsis thaliana Genes of “Unknown Function”

    PubMed Central

    Quanbeck, Stephanie M.; Brachova, Libuse; Campbell, Alexis A.; Guan, Xin; Perera, Ann; He, Kun; Rhee, Seung Y.; Bais, Preeti; Dickerson, Julie A.; Dixon, Philip; Wohlgemuth, Gert; Fiehn, Oliver; Barkan, Lenore; Lange, Iris; Lange, B. Markus; Lee, Insuk; Cortes, Diego; Salazar, Carolina; Shuman, Joel; Shulaev, Vladimir; Huhman, David V.; Sumner, Lloyd W.; Roth, Mary R.; Welti, Ruth; Ilarslan, Hilal; Wurtele, Eve S.; Nikolau, Basil J.

    2012-01-01

    Metabolomics is the methodology that identifies and measures global pools of small molecules (of less than about 1,000 Da) of a biological sample, which are collectively called the metabolome. Metabolomics can therefore reveal the metabolic outcome of a genetic or environmental perturbation of a metabolic regulatory network, and thus provide insights into the structure and regulation of that network. Because of the chemical complexity of the metabolome and limitations associated with individual analytical platforms for determining the metabolome, it is currently difficult to capture the complete metabolome of an organism or tissue, which is in contrast to genomics and transcriptomics. This paper describes the analysis of Arabidopsis metabolomics data sets acquired by a consortium that includes five analytical laboratories, bioinformaticists, and biostatisticians, which aims to develop and validate metabolomics as a hypothesis-generating functional genomics tool. The consortium is determining the metabolomes of Arabidopsis T-DNA mutant stocks, grown in standardized controlled environment optimized to minimize environmental impacts on the metabolomes. Metabolomics data were generated with seven analytical platforms, and the combined data is being provided to the research community to formulate initial hypotheses about genes of unknown function (GUFs). A public database (www.PlantMetabolomics.org) has been developed to provide the scientific community with access to the data along with tools to allow for its interactive analysis. Exemplary datasets are discussed to validate the approach, which illustrate how initial hypotheses can be generated from the consortium-produced metabolomics data, integrated with prior knowledge to provide a testable hypothesis concerning the functionality of GUFs. PMID:22645570

  2. Functional Annotation, Genome Organization and Phylogeny of the Grapevine (Vitis vinifera) Terpene Synthase Gene Family Based on Genome Assembly, FLcDNA Cloning, and Enzyme Assays

    PubMed Central

    2010-01-01

    Background Terpenoids are among the most important constituents of grape flavour and wine bouquet, and serve as useful metabolite markers in viticulture and enology. Based on the initial 8-fold sequencing of a nearly homozygous Pinot noir inbred line, 89 putative terpenoid synthase genes (VvTPS) were predicted by in silico analysis of the grapevine (Vitis vinifera) genome assembly [1]. The finding of this very large VvTPS family, combined with the importance of terpenoid metabolism for the organoleptic properties of grapevine berries and finished wines, prompted a detailed examination of this gene family at the genomic level as well as an investigation into VvTPS biochemical functions. Results We present findings from the analysis of the updated 12-fold sequencing and assembly of the grapevine genome that place the number of predicted VvTPS genes at 69 putatively functional VvTPS, 20 partial VvTPS, and 63 VvTPS probable pseudogenes. Gene discovery and annotation included information about gene architecture and chromosomal location. A dense cluster of 45 VvTPS is localized on chromosome 18. Extensive FLcDNA cloning, gene synthesis, and protein expression enabled functional characterization of 39 VvTPS; this is the largest number of functionally characterized TPS for any species reported to date. Of these enzymes, 23 have unique functions and/or phylogenetic locations within the plant TPS gene family. Phylogenetic analyses of the TPS gene family showed that while most VvTPS form species-specific gene clusters, there are several examples of gene orthology with TPS of other plant species, representing perhaps more ancient VvTPS, which have maintained functions independent of speciation. Conclusions The highly expanded VvTPS gene family underpins the prominence of terpenoid metabolism in grapevine. We provide a detailed experimental functional annotation of 39 members of this important gene family in grapevine and comprehensive information about gene structure and phylogeny for the entire currently known VvTPS gene family. PMID:20964856

  3. Gene calling and bacterial genome annotation with BG7.

    PubMed

    Tobes, Raquel; Pareja-Tobes, Pablo; Manrique, Marina; Pareja-Tobes, Eduardo; Kovach, Evdokim; Alekhin, Alexey; Pareja, Eduardo

    2015-01-01

    New massive sequencing technologies are providing many bacterial genome sequences from diverse taxa but a refined annotation of these genomes is crucial for obtaining scientific findings and new knowledge. Thus, bacterial genome annotation has emerged as a key point to investigate in bacteria. Any efficient tool designed specifically to annotate bacterial genomes sequenced with massively parallel technologies has to consider the specific features of bacterial genomes (absence of introns and scarcity of nonprotein-coding sequence) and of next-generation sequencing (NGS) technologies (presence of errors and not perfectly assembled genomes). These features make it convenient to focus on coding regions and, hence, on protein sequences that are the elements directly related to biological functions. In this chapter we describe how to annotate bacterial genomes with BG7, an open-source tool based on a protein-centered gene calling/annotation paradigm. BG7 is specifically designed for the annotation of bacterial genomes sequenced with NGS. This tool is tolerant of sequence errors and maintains its capabilities for the annotation of highly fragmented genomes or for annotating mixed sequences coming from several genomes (such as those obtained through metagenomics samples). BG7 has been designed with scalability as a requirement, with a computing infrastructure completely based on cloud computing (Amazon Web Services). PMID:25343866

  4. Detection of gene annotations and protein-protein interaction associated disorders through transitive relationships between integrated annotations

    PubMed Central

    2015-01-01

    Background Increasingly high amounts of heterogeneous and valuable controlled biomolecular annotations are available, but far from exhaustive and scattered in many databases. Several annotation integration and prediction approaches have been proposed, but these issues are still unsolved. We previously created a Genomic and Proteomic Knowledge Base (GPKB) that efficiently integrates many distributed biomolecular annotation and interaction data of several organisms, including 32,956,102 gene annotations, 273,522,470 protein annotations and 277,095 protein-protein interactions (PPIs). Results By comprehensively leveraging transitive relationships defined by the numerous association data integrated in GPKB, we developed a software procedure that effectively detects and supplements consistent biomolecular annotations not present in the integrated sources. According to some defined logic rules, it does so only when the semantic type of data and of their relationships, as well as the cardinality of the relationships, allow identifying molecular biology compliant annotations. Thanks to controlled consistency and quality enforced on data integrated in GPKB, and to the procedures used to avoid error propagation during their automatic processing, we could reliably identify many annotations, which we integrated in GPKB. They comprise 3,144 gene to pathway and 21,942 gene to biological function annotations of many organisms, and 1,027 candidate associations between 317 genetic disorders and 782 human PPIs. Overall estimated recall and precision of our approach were 90.56% and 96.61%, respectively. Co-functional evaluation of genes with known function showed high functional similarity between genes with new detected and known annotation to the same pathway; considering also the new detected gene functional annotations enhanced such functional similarity, which resembled the one existing between genes known to be annotated to the same pathway. Strong evidence was also found in the literature for the candidate associations detected between Cystic fibrosis disorder and the PPIs between the CFTR_HUMAN, DERL1_HUMAN, RNF5_HUMAN, AHSA1_HUMAN and GOPC_HUMAN proteins, and between the CHIP_HUMAN and HSP7C_HUMAN proteins. Conclusions Although identified gene annotations and PPI-genetic disorder candidate associations require biological validation, our approach intrinsically provides their in silico evidence based on available data. Public availability within the GPKB (http://www.bioinformatics.deib.polimi.it/GPKB/) of all identified and integrated annotations offers a valuable resource fostering new biomedical-molecular knowledge discoveries. PMID:26046679
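
    The idea of detecting annotations through transitive relationships can be illustrated with a minimal Python sketch. It assumes two hypothetical association tables (gene to protein, protein to pathway) and simply chains them; the identifiers are invented, and the logic rules on semantic type and cardinality that the abstract mentions are not modelled here.

```python
def infer_transitive(gene_to_protein, protein_to_pathway, known):
    """Chain gene->protein and protein->pathway links into gene->pathway pairs.

    Only pairs absent from `known` are returned, i.e. candidate new annotations.
    This is an illustration, not the GPKB procedure: no rule filtering is applied.
    """
    inferred = set()
    for gene, proteins in gene_to_protein.items():
        for protein in proteins:
            for pathway in protein_to_pathway.get(protein, ()):
                pair = (gene, pathway)
                if pair not in known:
                    inferred.add(pair)
    return inferred


# Hypothetical toy data (identifiers are made up for the example).
gene_to_protein = {"geneA": ["protA1", "protA2"], "geneB": ["protB1"]}
protein_to_pathway = {"protA1": ["pathway1"], "protA2": ["pathway2"], "protB1": ["pathway1"]}
known = {("geneA", "pathway1")}

candidates = infer_transitive(gene_to_protein, protein_to_pathway, known)
```

    With the toy data, the already known pair is skipped and two candidate gene to pathway annotations remain.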

  5. Functional Annotation of Rheumatoid Arthritis and Osteoarthritis Associated Genes by Integrative Genome-Wide Gene Expression Profiling Analysis

    PubMed Central

    Li, Zhan-Chun; Xiao, Jie; Peng, Jin-Liang; Chen, Jian-Wei; Ma, Tao; Cheng, Guang-Qi; Dong, Yu-Qi; Wang, Wei-li; Liu, Zu-De

    2014-01-01

    Background Rheumatoid arthritis (RA) and osteoarthritis (OA) are two major types of joint diseases that share multiple common symptoms. However, their pathological mechanisms remain largely unknown. The aim of our study is to identify RA- and OA-related genes and gain an insight into the underlying genetic basis of these diseases. Methods We collected 11 whole genome-wide expression profiling datasets from RA and OA cohorts and performed a meta-analysis to comprehensively investigate their expression signatures. This method can avoid some pitfalls of single dataset analyses. Results and Conclusion We found that several biological pathways (i.e., the immunity, inflammation and apoptosis related pathways) are commonly involved in the development of both RA and OA, whereas several other pathways (i.e., vasopressin-related pathway, regulation of autophagy, endocytosis, calcium transport and endoplasmic reticulum stress related pathways) differ significantly between RA and OA. This study provides novel insights into the molecular mechanisms underlying these diseases, thereby aiding their diagnosis and treatment. PMID:24551036

  6. GFam: a platform for automatic annotation of gene families

    PubMed Central

    Sasidharan, Rajkumar; Nepusz, Tamás; Swarbreck, David; Huala, Eva; Paccanaro, Alberto

    2012-01-01

    We have developed GFam, a platform for automatic annotation of gene/protein families. GFam provides a framework for genome initiatives and model organism resources to build domain-based families, derive meaningful functional labels and offers a seamless approach to propagate functional annotation across periodic genome updates. GFam is a hybrid approach that uses a greedy algorithm to chain component domains from InterPro annotation provided by its 12 member resources followed by a sequence-based connected component analysis of un-annotated sequence regions to derive consensus domain architecture for each sequence and subsequently generate families based on common architectures. Our integrated approach increases sequence coverage by 7.2 percentage points and residue coverage by 14.6 percentage points relative to the best single-constituent database within InterPro for the proteome of Arabidopsis. The true power of GFam lies in maximizing annotation provided by the different InterPro data sources that offer resource-specific coverage for different regions of a sequence. GFam’s capability to capture higher sequence and residue coverage can be useful for genome annotation, comparative genomics and functional studies. GFam is a general-purpose software and can be used for any collection of protein sequences. The software is open source and can be obtained from http://www.paccanarolab.org/software/gfam/. PMID:22790981
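
    A greedy chaining of domains and the resulting residue coverage can be sketched in a few lines of Python. This is only an illustration under stated assumptions: domains are hypothetical (start, end, name) tuples with 1-based inclusive coordinates, the greedy rule is "longest first, skip anything that overlaps an accepted domain", and none of this is taken from the GFam source code, whose actual selection criteria are not given in the abstract.

```python
def greedy_domain_chain(domains):
    """Pick non-overlapping domains greedily, longest first.

    `domains` is a list of (start, end, name) with 1-based inclusive coordinates.
    Returns the accepted domains sorted by start position.
    """
    chosen = []
    by_length = sorted(domains, key=lambda d: (-(d[1] - d[0] + 1), d[0]))
    for start, end, name in by_length:
        # Accept the domain only if it overlaps none of the accepted ones.
        if all(end < s or start > e for s, e, _ in chosen):
            chosen.append((start, end, name))
    return sorted(chosen)


def residue_coverage(chain, seq_len):
    """Fraction of residues covered by a chain of non-overlapping domains."""
    return sum(end - start + 1 for start, end, _ in chain) / seq_len


# Hypothetical domain hits on a 300-residue sequence.
hits = [(1, 100, "A"), (80, 150, "B"), (120, 200, "C"), (210, 250, "D")]
chain = greedy_domain_chain(hits)
coverage = residue_coverage(chain, 300)
```

    In this toy case domain B is dropped because it overlaps the longer domain A, leaving A, C and D, which together cover 222 of 300 residues.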

  7. BABELOMICS: a suite of web tools for functional annotation and analysis of groups of genes in high-throughput experiments

    PubMed Central

    Al-Shahrour, Fátima; Minguez, Pablo; Vaquerizas, Juan M.; Conde, Lucía; Dopazo, Joaquín

    2005-01-01

    We present Babelomics, a complete suite of web tools for the functional analysis of groups of genes in high-throughput experiments, which includes the use of information on Gene Ontology terms, InterPro motifs, KEGG pathways, Swiss-Prot keywords, analysis of predicted transcription factor binding sites, chromosomal positions and presence in tissues with determined histological characteristics, through five integrated modules: FatiGO (fast assignment and transference of information), FatiWise, transcription factor association test, GenomeGO and tissues mining tool, respectively. Additionally, another module, FatiScan, provides a new procedure that integrates biological information in combination with experimental results in order to find groups of genes with modest but coordinate significant differential behaviour. FatiScan is highly sensitive and is capable of finding significant asymmetries in the distribution of genes of common function across a list of ordered genes even if these asymmetries are not extreme. The strong multiple-testing nature of the contrasts made by the tools is taken into account. All the tools are integrated in the gene expression analysis package GEPAS. Babelomics is the natural evolution of our tool FatiGO (which analysed almost 22 000 experiments during the last year) to include more sources of information and new modes of using it. Babelomics can be found at . PMID:15980512

  8. JAFA: a protein function annotation meta-server

    PubMed Central

    Friedberg, Iddo; Harder, Tim; Godzik, Adam

    2006-01-01

    With the high number of sequences and structures streaming in from genomic projects, there is a need for more powerful and sophisticated annotation tools. Most problematic of the annotation efforts is predicting gene and protein function. Over the past few years there has been considerable progress in automated protein function prediction, using a diverse set of methods. Nevertheless, no single method reports all the information possible, and molecular biologists resort to ‘shopping around’ using different methods: a cumbersome and time-consuming practice. Here we present the Joined Assembly of Function Annotations, or JAFA server. JAFA queries several function prediction servers with a protein sequence and assembles the returned predictions in a legible, non-redundant format. In this manner, JAFA combines the predictions of several servers to provide a comprehensive view of what are the predicted functions of the proteins. JAFA also offers its own output, and the individual programs' predictions for further processing. JAFA is available for use from . PMID:16845030

  9. Conceptualization of molecular findings by mining gene annotations

    PubMed Central

    2013-01-01

    Background The Gene Ontology (GO) is an ontology representing molecular biology concepts related to genes and their products. Current annotations from the GO Consortium tend to be highly specific, and contemporary genome-scale studies often return a long list of genes of potential interest, such as genes in a cancer tumor that are differentially expressed relative to normal tissue. It is therefore a challenging task to reveal, at a conceptual level, the major functional themes in which genes are involved. Presently, there is a need for tools capable of revealing such themes through mining and representing semantic information in an objective and quantitative manner. Methods In this study, we utilized the hierarchical organization of the GO to derive a more abstract representation of the major biological processes of a list of genes based on their annotations. We cast the task as follows: given a list of genes, identify non-disjoint, functionally coherent subsets, such that the functions of the genes in a subset are summarized by an informative GO term that accurately captures the semantic information of the original annotations. Results We evaluated different metrics for assessing information loss when merging GO terms, and different statistical schemes to assess the functional coherence of a set of genes. We found that the best discriminative power was achieved by using a combination of the information-content-based measure as the information-loss metric, and the graph-based statistics derived from a Steiner tree connecting genes in an augmented GO graph. Conclusions Our methods provide an objective and quantitative approach to capturing the major directions of gene functions in a context-specific fashion. PMID:24564884
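
    An information-content-based loss for merging a specific GO term into an ancestor can be sketched as follows. The abstract does not state its formula, so this uses the commonly used definition IC(t) = -log p(t), where p(t) is the fraction of annotated genes carrying term t or any of its descendants, and defines the loss as IC(specific) minus IC(ancestor). The term names and gene counts are hypothetical.

```python
import math


def information_content(term, term_to_genes, total_genes):
    """IC(t) = -log p(t); p(t) is the share of genes annotated to t or its descendants.

    `term_to_genes` is assumed to hold already-propagated gene sets.
    """
    p = len(term_to_genes[term]) / total_genes
    return -math.log(p)


def information_loss(specific, ancestor, term_to_genes, total_genes):
    """Information given up when `specific` is summarized by `ancestor`."""
    return (information_content(specific, term_to_genes, total_genes)
            - information_content(ancestor, term_to_genes, total_genes))


# Hypothetical propagated annotations over 8 genes.
term_to_genes = {
    "root": {"g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8"},
    "GO:parent": {"g1", "g2", "g3", "g4"},
    "GO:child": {"g1", "g2"},
}
loss = information_loss("GO:child", "GO:parent", term_to_genes, 8)
```

    Here the child term covers a quarter of the genes and the parent half, so merging child into parent loses log 2 nats of information, while merging all the way to the root would lose the child's full information content.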

  10. Critical Assessment of Function Annotation Meeting, 2011

    SciTech Connect

    Friedberg, Iddo

    2015-01-21

    The Critical Assessment of Function Annotation meeting was held July 14-15, 2011 at the Austria Conference Center in Vienna, Austria. There were 73 registered delegates at the meeting. We thank the DOE for this award. It helped us organize and support a scientific meeting AFP 2011 as a special interest group (SIG) meeting associated with the ISMB 2011 conference. The conference was held in Vienna, Austria, in July 2011. The AFP SIG was held on July 15-16, 2011 (immediately preceding the conference). The meeting consisted of two components, the first being a series of talks (invited and contributed) and discussion sections dedicated to protein function research, with an emphasis on the theory and practice of computational methods utilized in functional annotation. The second component provided a large-scale assessment of computational methods through participation in the Critical Assessment of Functional Annotation (CAFA).

  11. Exploring inconsistencies in genome-wide protein function annotations: a machine learning approach

    PubMed Central

    Andorf, Carson; Dobbs, Drena; Honavar, Vasant

    2007-01-01

    Background Incorrectly annotated sequence data are becoming more commonplace as databases increasingly rely on automated techniques for annotation. Hence, there is an urgent need for computational methods for checking consistency of such annotations against independent sources of evidence and detecting potential annotation errors. We show how a machine learning approach designed to automatically predict a protein's Gene Ontology (GO) functional class can be employed to identify potential gene annotation errors. Results In a set of 211 previously annotated mouse protein kinases, we found that 201 of the GO annotations returned by AmiGO appear to be inconsistent with the UniProt functions assigned to their human counterparts. In contrast, 97% of the predicted annotations generated using a machine learning approach were consistent with the UniProt annotations of the human counterparts, as well as with available annotations for these mouse protein kinases in the Mouse Kinome database. Conclusion We conjecture that most of our predicted annotations are, therefore, correct and suggest that the machine learning approach developed here could be routinely used to detect potential errors in GO annotations generated by high-throughput gene annotation projects. Editors' Note: Authors from the original publication (Okazaki et al.: Nature 2002, 420:563–73) have provided their response to Andorf et al., directly following the correspondence. PMID:17683567

  12. KEGG as a reference resource for gene and protein annotation

    PubMed Central

    Kanehisa, Minoru; Sato, Yoko; Kawashima, Masayuki; Furumichi, Miho; Tanabe, Mao

    2016-01-01

    KEGG (http://www.kegg.jp/ or http://www.genome.jp/kegg/) is an integrated database resource for biological interpretation of genome sequences and other high-throughput data. Molecular functions of genes and proteins are associated with ortholog groups and stored in the KEGG Orthology (KO) database. The KEGG pathway maps, BRITE hierarchies and KEGG modules are developed as networks of KO nodes, representing high-level functions of the cell and the organism. Currently, more than 4000 complete genomes are annotated with KOs in the KEGG GENES database, which can be used as a reference data set for KO assignment and subsequent reconstruction of KEGG pathways and other molecular networks. As an annotation resource, the following improvements have been made. First, each KO record is re-examined and associated with protein sequence data used in experiments of functional characterization. Second, the GENES database now includes viruses, plasmids, and the addendum category for functionally characterized proteins that are not represented in complete genomes. Third, new automatic annotation servers, BlastKOALA and GhostKOALA, are made available utilizing the non-redundant pangenome data set generated from the GENES database. As a resource for translational bioinformatics, various data sets are created for antimicrobial resistance and drug interaction networks. PMID:26476454

  13. KEGG as a reference resource for gene and protein annotation.

    PubMed

    Kanehisa, Minoru; Sato, Yoko; Kawashima, Masayuki; Furumichi, Miho; Tanabe, Mao

    2016-01-01

    KEGG (http://www.kegg.jp/ or http://www.genome.jp/kegg/) is an integrated database resource for biological interpretation of genome sequences and other high-throughput data. Molecular functions of genes and proteins are associated with ortholog groups and stored in the KEGG Orthology (KO) database. The KEGG pathway maps, BRITE hierarchies and KEGG modules are developed as networks of KO nodes, representing high-level functions of the cell and the organism. Currently, more than 4000 complete genomes are annotated with KOs in the KEGG GENES database, which can be used as a reference data set for KO assignment and subsequent reconstruction of KEGG pathways and other molecular networks. As an annotation resource, the following improvements have been made. First, each KO record is re-examined and associated with protein sequence data used in experiments of functional characterization. Second, the GENES database now includes viruses, plasmids, and the addendum category for functionally characterized proteins that are not represented in complete genomes. Third, new automatic annotation servers, BlastKOALA and GhostKOALA, are made available utilizing the non-redundant pangenome data set generated from the GENES database. As a resource for translational bioinformatics, various data sets are created for antimicrobial resistance and drug interaction networks. PMID:26476454

  14. GeneDecks: paralog hunting and gene-set distillation with GeneCards annotation.

    PubMed

    Stelzer, Gil; Inger, Aron; Olender, Tsviya; Iny-Stein, Tsippi; Dalah, Irina; Harel, Arye; Safran, Marilyn; Lancet, Doron

    2009-12-01

    Sophisticated genomic navigation strongly benefits from a capacity to establish a similarity metric among genes. GeneDecks is a novel analysis tool that provides such a metric by highlighting shared descriptors between pairs of genes, based on the rich annotation within the GeneCards compendium of human genes. The current implementation addresses information about pathways, protein domains, Gene Ontology (GO) terms, mouse phenotypes, mRNA expression patterns, disorders, drug relationships, and sequence-based paralogy. GeneDecks has two modes: (1) Paralog Hunter, which seeks functional paralogs based on combinatorial similarity of attributes; and (2) Set Distiller, which ranks descriptors by their degree of sharing within a given gene set. GeneDecks enables the elucidation of unsuspected putative functional paralogs, and a refined scrutiny of various gene-sets (e.g., from high-throughput experiments) for discovering relevant biological patterns. PMID:20001862
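
    The Set Distiller idea described above, ranking descriptors by how widely they are shared within a gene set, can be sketched in a few lines. This is only an illustration under stated assumptions: the gene names, the descriptor strings, and the `distill` function are hypothetical, and plain sharing frequency stands in for whatever scoring GeneDecks actually uses.

```python
from collections import Counter


def distill(gene_set, descriptors):
    """Rank descriptors by the fraction of genes in gene_set that carry them.

    descriptors maps a gene name to an iterable of descriptor strings
    (hypothetical toy data, not GeneCards content).
    """
    counts = Counter()
    for gene in gene_set:
        # set() so a descriptor listed twice for one gene is counted once
        counts.update(set(descriptors.get(gene, ())))
    n = len(gene_set)
    # Most shared first; ties broken alphabetically for a stable order
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(name, count / n) for name, count in ranked]


toy_descriptors = {
    "geneA": ["pathway:X", "domain:Y"],
    "geneB": ["pathway:X"],
    "geneC": ["pathway:X", "domain:Y", "disorder:Z"],
}
ranking = distill(["geneA", "geneB", "geneC"], toy_descriptors)
for name, share in ranking:
    print(f"{name}\t{share:.2f}")
```

    In this toy set the pathway descriptor is shared by all three genes and therefore ranks first.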

  15. An Integrated Framework for Functional Annotation of Protein Structural Domains.

    PubMed

    Deng, Lei; Chen, Zhigang

    2015-01-01

    Structural domains are evolutionary and functional units of proteins and play a critical role in comparative and functional genomics. Computational assignment of domain function with high reliability is essential for understanding whole-protein functions. However, functional annotations are conventionally assigned onto full-length proteins rather than associating specific functions to the individual structural domains. In this article, we present Structural Domain Annotation (SDA), a novel computational approach to predict functions for SCOP structural domains. The SDA method integrates heterogeneous information sources, including structure alignment based protein-SCOP mapping features, InterPro2GO mapping information, PSSM Profiles, and sequence neighborhood features, with a Bayesian network. By large-scale annotating Gene Ontology terms to SCOP domains with SDA, we obtained a database of SCOP domain to Gene Ontology mappings, which contains ~162,000 out of the approximately 166,900 domains in SCOPe 2.03 (>97 percent) and their predicted Gene Ontology functions. We have benchmarked SDA using a single-domain protein dataset and an independent dataset from different species. Comparative studies show that SDA significantly outperforms the existing function prediction methods for structural domains in terms of coverage and maximum F-measure. PMID:26357331

  16. Sequencing, De Novo Assembly, and Annotation of the Transcriptome of the Endangered Freshwater Pearl Bivalve, Cristaria plicata, Provides Novel Insights into Functional Genes and Marker Discovery

    PubMed Central

    Kang, Se Won; Hwang, Hee-Ju; Park, So Young; Park, Eun Bi; Chung, Jong Min; Song, Dae Kwon; Kim, Changmu; Kim, Soonok; Lee, Jun Sang; Han, Yeon Soo; Park, Hong Seog; Lee, Yong Seok

    2016-01-01

    Background: The freshwater mussel Cristaria plicata (Bivalvia: Eulamellibranchia: Unionidae) is an economically important species in molluscan aquaculture due to its use in pearl farming. The species has been listed as endangered in South Korea due to the loss of natural habitats caused by anthropogenic activities. The decreasing population and a lack of genomic information on the species is concerning for environmentalists and conservationists. In this study, we conducted a de novo transcriptome sequencing and annotation analysis of C. plicata using Illumina HiSeq 2500 next-generation sequencing (NGS) technology, the Trinity assembler, and bioinformatics databases to prepare a sustainable resource for the identification of candidate genes involved in immunity, defense, and reproduction. Results: The C. plicata transcriptome analysis included a total of 286,152,584 raw reads and 281,322,837 clean reads. The de novo assembly identified a total of 453,931 contigs and 374,794 non-redundant unigenes with average lengths of 731.2 and 737.1 bp, respectively. Furthermore, 100% coverage of C. plicata mitochondrial genes within two unigenes supported the quality of the assembler. In total, 84,274 unigenes showed homology to entries in at least one database, and 23,246 unigenes were allocated to one or more Gene Ontology (GO) terms. The most prominent GO biological process, cellular component, and molecular function categories (level 2) were cellular process, membrane, and binding, respectively. A total of 4,776 unigenes were mapped to 123 biological pathways in the KEGG database. Based on the GO terms and KEGG annotation, the unigenes were suggested to be involved in immunity, stress responses, sex-determination, and reproduction. A total of 17,251 cDNA simple sequence repeats (cSSRs) were identified from 61,141 unigenes (size of >1 kb) with the most abundant being dinucleotide repeats. Conclusions: This dataset represents the first transcriptome analysis of the endangered mollusc, C. plicata. The transcriptome provides a comprehensive sequence resource for the conservation of genetic information in this species and enrichment of the genetic database. The development of molecular markers will assist in the genetic improvement of C. plicata. PMID:26872384

  17. Construction of coffee transcriptome networks based on gene annotation semantics.

    PubMed

    Castillo, Luis F; Galeano, Narmer; Isaza, Gustavo A; Gaitán, Alvaro

    2012-01-01

    Gene annotation is a process that encompasses multiple approaches on the analysis of nucleic acids or protein sequences in order to assign structural and functional characteristics to gene models. When thousands of gene models are being described in an organism genome, construction and visualization of gene networks impose novel challenges in the understanding of complex expression patterns and the generation of new knowledge in genomics research. In order to take advantage of accumulated text data after conventional gene sequence analysis, this work applied semantics in combination with visualization tools to build transcriptome networks from a set of coffee gene annotations. A set of selected coffee transcriptome sequences, chosen by the quality of the sequence comparison reported by Basic Local Alignment Search Tool (BLAST) and Interproscan, were filtered out by coverage, identity, length of the query, and e-values. Meanwhile, term descriptors for molecular biology and biochemistry were obtained along the Wordnet dictionary in order to construct a Resource Description Framework (RDF) using Ruby scripts and Methontology to find associations between concepts. Relationships between sequence annotations and semantic concepts were graphically represented through a total of 6845 oriented vectors, which were reduced to 745 non-redundant associations. A large gene network connecting transcripts by way of relational concepts was created where detailed connections remain to be validated for biological significance based on current biochemical and genetics frameworks. Besides reusing text information in the generation of gene connections and for data mining purposes, this tool development opens the possibility to visualize complex and abundant transcriptome data, and triggers the formulation of new hypotheses in metabolic pathways analysis. PMID:22829576

  18. SFannotation: A Simple and Fast Protein Function Annotation System.

    PubMed

    Yu, Dong Su; Kim, Byung Kwon

    2014-06-01

    Owing to the generation of vast amounts of sequencing data by using cost-effective, high-throughput sequencing technologies with improved computational approaches, many putative proteins have been discovered after assembly and structural annotation. Putative proteins are typically annotated using a functional annotation system that uses extant databases, but the expansive size of these databases often causes a bottleneck for rapid functional annotation. We developed SFannotation, a simple and fast functional annotation system that rapidly annotates putative proteins against four extant databases, Swiss-Prot, TIGRFAMs, Pfam, and the non-redundant sequence database, by using a best-hit approach with BLASTP and HMMSEARCH. PMID:25031571
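
    A best-hit approach of the kind mentioned above can be illustrated with a short sketch. It assumes BLAST tabular output in the standard 12-column layout (query ID in column 1, subject ID in column 2, bit score in column 12) and picks the highest bit score per query. The `best_hits` function and the sample rows are hypothetical, and SFannotation's own selection criteria may differ.

```python
def best_hits(tabular_text):
    """Return {query_id: (subject_id, bitscore)} keeping the top hit per query.

    Assumes tab-separated rows in the standard 12-column BLAST tabular
    layout, where column 12 is the bit score.
    """
    best = {}
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        bitscore = float(fields[11])
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return best


# Hypothetical sample rows: two hits for q1, one hit for q2
example = "\n".join([
    "q1\tsubjA\t98.0\t100\t2\t0\t1\t100\t1\t100\t1e-50\t200",
    "q1\tsubjB\t99.0\t100\t1\t0\t1\t100\t1\t100\t1e-60\t250",
    "q2\tsubjC\t80.0\t90\t18\t0\t1\t90\t5\t94\t1e-20\t95.5",
])
hits = best_hits(example)
for query, (subject, score) in sorted(hits.items()):
    print(query, subject, score)
```

    Keeping only one hit per query is what makes the approach fast, at the cost of ignoring lower-ranked hits that may carry better-characterized annotations.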

  19. The Gene Ontology Annotation (GOA) Database: sharing knowledge in Uniprot with Gene Ontology.

    PubMed

    Camon, Evelyn; Magrane, Michele; Barrell, Daniel; Lee, Vivian; Dimmer, Emily; Maslen, John; Binns, David; Harte, Nicola; Lopez, Rodrigo; Apweiler, Rolf

    2004-01-01

    The Gene Ontology Annotation (GOA) database (http://www.ebi.ac.uk/GOA) aims to provide high-quality electronic and manual annotations to the UniProt Knowledgebase (Swiss-Prot, TrEMBL and PIR-PSD) using the standardized vocabulary of the Gene Ontology (GO). As a supplementary archive of GO annotation, GOA promotes a high level of integration of the knowledge represented in UniProt with other databases. This is achieved by converting UniProt annotation into a recognized computational format. GOA provides annotated entries for nearly 60,000 species (GOA-SPTr) and is the largest and most comprehensive open-source contributor of annotations to the GO Consortium annotation effort. By integrating GO annotations from other model organism groups, GOA consolidates specialized knowledge and expertise to ensure the data remain a key reference for up-to-date biological information. Furthermore, the GOA database fully endorses the Human Proteomics Initiative by prioritizing the annotation of proteins likely to benefit human health and disease. In addition to a non-redundant set of annotations to the human proteome (GOA-Human) and monthly releases of its GO annotation for all species (GOA-SPTr), a series of GO mapping files and specific cross-references in other databases are also regularly distributed. GOA can be queried through a simple user-friendly web interface or downloaded in a parsable format via the EBI and GO FTP websites. The GOA data set can be used to enhance the annotation of particular model organism or gene expression data sets, although increasingly it has been used to evaluate GO predictions generated from text mining or protein interaction experiments. In 2004, the GOA team will build on its success and will continue to supplement the functional annotation of UniProt and work towards enhancing the ability of scientists to access all available biological information. 
Researchers wishing to query or contribute to the GOA project are encouraged to email: goa@ebi.ac.uk. PMID:14681408

  20. A categorization approach to automated ontological function annotation

    PubMed Central

    Verspoor, Karin; Cohn, Judith; Mniszewski, Susan; Joslyn, Cliff

    2006-01-01

    Automated function prediction (AFP) methods increasingly use knowledge discovery algorithms to map sequence, structure, literature, and/or pathway information about proteins whose functions are unknown into functional ontologies, typically (a portion of) the Gene Ontology (GO). While there are a growing number of methods within this paradigm, the general problem of assessing the accuracy of such prediction algorithms has not been seriously addressed. We present first an application for function prediction from protein sequences using the POSet Ontology Categorizer (POSOC) to produce new annotations by analyzing collections of GO nodes derived from annotations of protein BLAST neighborhoods. We then also present hierarchical precision and hierarchical recall as new evaluation metrics for assessing the accuracy of any predictions in hierarchical ontologies, and discuss results on a test set of protein sequences. We show that our method provides substantially improved hierarchical precision (measure of predictions made that are correct) when applied to the nearest BLAST neighbors of target proteins, as compared with simply imputing that neighborhood's annotations to the target. Moreover, when our method is applied to a broader BLAST neighborhood, hierarchical precision is enhanced even further. In all cases, such increased hierarchical precision performance is purchased at a modest expense of hierarchical recall (measure of all annotations that get predicted at all). PMID:16672243
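
    To make the two metrics concrete, the sketch below computes hierarchical precision and recall in one commonly used ancestor-closure formulation: both the predicted and the true term sets are extended with all of their ancestors, and the overlap is divided by the size of each extended set. This is an assumption for illustration and not necessarily the exact definition used by the authors. The toy ontology and function names are hypothetical.

```python
def ancestors(term, parents):
    """Return the term together with all of its ancestors in the ontology DAG."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def closure(terms, parents):
    """Union of the ancestor sets of every term in terms."""
    result = set()
    for term in terms:
        result |= ancestors(term, parents)
    return result


def hierarchical_pr(predicted, true, parents):
    """Hierarchical precision and recall over ancestor-closed term sets."""
    pred_closed = closure(predicted, parents)
    true_closed = closure(true, parents)
    overlap = len(pred_closed & true_closed)
    precision = overlap / len(pred_closed) if pred_closed else 0.0
    recall = overlap / len(true_closed) if true_closed else 0.0
    return precision, recall


# Toy ontology (child -> parents): A is the root, B and C are its children,
# and D is a child of B.
toy_parents = {"B": ["A"], "C": ["A"], "D": ["B"]}
hp, hr = hierarchical_pr({"D"}, {"B"}, toy_parents)
print(hp, hr)
```

    Here the prediction D is more specific than the true annotation B, so every true term is recovered (recall 1.0) while one of the three predicted terms is unsupported (precision 2/3).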

  1. HMM-Based Gene Annotation Methods

    SciTech Connect

    Haussler, David; Hughey, Richard; Karplus, Keven

    1999-09-20

    Development of new statistical methods and computational tools to identify genes in human genomic DNA, and to provide clues to their functions by identifying features such as transcription factor binding sites, tissue-specific expression and splicing patterns, and remote homologies at the protein level with genes of known function.

  2. Comprehensive comparative homeobox gene annotation in human and mouse

    PubMed Central

    Wilming, Laurens G.; Boychenko, Veronika; Harrow, Jennifer L.

    2015-01-01

    Homeobox genes are a group of genes coding for transcription factors with a DNA-binding helix-turn-helix structure called a homeodomain and which play a crucial role in pattern formation during embryogenesis. Many homeobox genes are located in clusters and some of these, most notably the HOX genes, are known to have antisense or opposite strand long non-coding RNA (lncRNA) genes that play a regulatory role. Because automated annotation of both gene clusters and non-coding genes is fraught with difficulty (over-prediction, under-prediction, inaccurate transcript structures), we set out to manually annotate all homeobox genes in the mouse and human genomes. This includes all supported splice variants, pseudogenes and both antisense and flanking lncRNAs. One of the areas where manual annotation has a significant advantage is the annotation of duplicated gene clusters. After comprehensive annotation of all homeobox genes and their antisense genes in human and in mouse, we found some discrepancies with the current gene set in RefSeq regarding exact gene structures and coding versus pseudogene locus biotype. We also identified previously un-annotated pseudogenes in the DUX, Rhox and Obox gene clusters, which helped us re-evaluate and update the gene nomenclature in these regions. We found that human homeobox genes are enriched in antisense lncRNA loci, some of which are known to play a role in gene or gene cluster regulation, compared to their mouse orthologues. Of the annotated set of 241 human protein-coding homeobox genes, 98 have an antisense locus (41%) while of the 277 orthologous mouse genes, only 62 protein coding gene have an antisense locus (22%), based on publicly available transcriptional evidence. PMID:26412852

  3. Gene Model Annotations for Drosophila melanogaster: Impact of High-Throughput Data

    PubMed Central

    Matthews, Beverley B.; dos Santos, Gilberto; Crosby, Madeline A.; Emmert, David B.; St. Pierre, Susan E.; Gramates, L. Sian; Zhou, Pinglei; Schroeder, Andrew J.; Falls, Kathleen; Strelets, Victor; Russo, Susan M.; Gelbart, William M.

    2015-01-01

    We report the current status of the FlyBase annotated gene set for Drosophila melanogaster and highlight improvements based on high-throughput data. The FlyBase annotated gene set consists entirely of manually annotated gene models, with the exception of some classes of small non-coding RNAs. All gene models have been reviewed using evidence from high-throughput datasets, primarily from the modENCODE project. These datasets include RNA-Seq coverage data, RNA-Seq junction data, transcription start site profiles, and translation stop-codon read-through predictions. New annotation guidelines were developed to take into account the use of the high-throughput data. We describe how this flood of new data was incorporated into thousands of new and revised annotations. FlyBase has adopted a philosophy of excluding low-confidence and low-frequency data from gene model annotations; we also do not attempt to represent all possible permutations for complex and modularly organized genes. This has allowed us to produce a high-confidence, manageable gene annotation dataset that is available at FlyBase (http://flybase.org). Interesting aspects of new annotations include new genes (coding, non-coding, and antisense), many genes with alternative transcripts with very long 3' UTRs (up to 15–18 kb), and a stunning mismatch in the number of male-specific genes (approximately 13% of all annotated gene models) vs. female-specific genes (less than 1%). The number of identified pseudogenes and mutations in the sequenced strain also increased significantly. We discuss remaining challenges, for instance, identification of functional small polypeptides and detection of alternative translation starts. PMID:26109357

  4. Gene Model Annotations for Drosophila melanogaster: Impact of High-Throughput Data.

    PubMed

    Matthews, Beverley B; Dos Santos, Gilberto; Crosby, Madeline A; Emmert, David B; St Pierre, Susan E; Gramates, L Sian; Zhou, Pinglei; Schroeder, Andrew J; Falls, Kathleen; Strelets, Victor; Russo, Susan M; Gelbart, William M

    2015-08-01

    We report the current status of the FlyBase annotated gene set for Drosophila melanogaster and highlight improvements based on high-throughput data. The FlyBase annotated gene set consists entirely of manually annotated gene models, with the exception of some classes of small non-coding RNAs. All gene models have been reviewed using evidence from high-throughput datasets, primarily from the modENCODE project. These datasets include RNA-Seq coverage data, RNA-Seq junction data, transcription start site profiles, and translation stop-codon read-through predictions. New annotation guidelines were developed to take into account the use of the high-throughput data. We describe how this flood of new data was incorporated into thousands of new and revised annotations. FlyBase has adopted a philosophy of excluding low-confidence and low-frequency data from gene model annotations; we also do not attempt to represent all possible permutations for complex and modularly organized genes. This has allowed us to produce a high-confidence, manageable gene annotation dataset that is available at FlyBase (http://flybase.org). Interesting aspects of new annotations include new genes (coding, non-coding, and antisense), many genes with alternative transcripts with very long 3' UTRs (up to 15-18 kb), and a stunning mismatch in the number of male-specific genes (approximately 13% of all annotated gene models) vs. female-specific genes (less than 1%). The number of identified pseudogenes and mutations in the sequenced strain also increased significantly. We discuss remaining challenges, for instance, identification of functional small polypeptides and detection of alternative translation starts. PMID:26109357

  5. dbWFA: a web-based database for functional annotation of Triticum aestivum transcripts

    PubMed Central

    Vincent, Jonathan; Dai, Zhanwu; Ravel, Catherine; Choulet, Frédéric; Mouzeyar, Said; Bouzidi, M. Fouad; Agier, Marie; Martre, Pierre

    2013-01-01

    The functional annotation of genes based on sequence homology with genes from model species genomes is time-consuming because it is necessary to mine several unrelated databases. The aim of the present work was to develop a functional annotation database for common wheat Triticum aestivum (L.). The database, named dbWFA, is based on the reference NCBI UniGene set, an expressed gene catalogue built by expressed sequence tag clustering, and on full-length coding sequences retrieved from the TriFLDB database. Information from good-quality heterogeneous sources, including annotations for model plant species Arabidopsis thaliana (L.) Heynh. and Oryza sativa L., was gathered and linked to T. aestivum sequences through BLAST-based homology searches. Even though the complexity of the transcriptome cannot yet be fully appreciated, we developed a tool to easily and promptly obtain information from multiple functional annotation systems (Gene Ontology, MapMan bin codes, MIPS Functional Categories, PlantCyc pathway reactions and TAIR gene families). The use of dbWFA is illustrated here with several query examples. We were able to assign a putative function to 45% of the UniGenes and 81% of the full-length coding sequences from TriFLDB. Moreover, comparison of the annotation of the whole T. aestivum UniGene set along with curated annotations of the two model species assessed the accuracy of the annotation provided by dbWFA. To further illustrate the use of dbWFA, genes specifically expressed during the early cell division or late storage polymer accumulation phases of T. aestivum grain development were identified using a clustering analysis and then annotated using dbWFA. The annotation of these two sets of genes was consistent with previous analyses of T. aestivum grain transcriptomes and proteomes. Database URL: urgi.versailles.inra.fr/dbWFA/ PMID:23660284

  6. Evolutionary Trace Annotation of Protein Function in the Structural Proteome

    PubMed Central

    Erdin, Serkan; Ward, R. Matthew; Venner, Eric

    2010-01-01

    By design, structural genomics (SG) solves many structures that cannot be assigned function based on homology to known proteins. Alternative function annotation methods are therefore needed and this study focuses on function prediction with three-dimensional (3D) templates: small structural motifs built of just a few functionally critical residues. Although experimentally proven functional residues are scarce, we show here that Evolutionary Trace (ET) rankings of residue importance are sufficient to build 3D templates, match them, and then assign Gene Ontology (GO) functions in enzymes and non-enzymes alike. In a high specificity mode, this Evolutionary Trace Annotation (ETA) method covered half (53%) of the 2384 annotated SG protein controls. Three-quarters (76%) of predictions were both correct and complete. The positive predictive value for all GO depths (all-depth PPV) was 84%, and it rose to 94% over GO depths 1–3 (depth 3 PPV). In a high sensitivity mode coverage rose significantly (84%) while accuracy fell moderately: 68% of predictions were both correct and complete, all-depth PPV was 75%, and depth 3 PPV was 86%. These data concur with prior mutational experiments showing that ET rank information identifies key functional determinants in proteins. In practice, ETA predicted functions in 42% of 3461 un-annotated SG proteins. In 529 cases (including 280 non-enzymes and 21 for metal ion ligands) the expected accuracy is 84% at any GO depth and 94% down to GO depth 3, while for the remaining 931 the expected accuracies are 60% and 71%, respectively. Thus local structural comparisons of evolutionarily important residues can help decipher protein functions to known reliability levels and without prior assumption on functional mechanisms. ETA is available at http://mammoth.bcm.tmc.edu/eta. PMID:20036248

  7. Software Suite for Gene and Protein Annotation Prediction and Similarity Search.

    PubMed

    Chicco, Davide; Masseroli, Marco

    2015-01-01

    In the computational biology community, machine learning algorithms are key instruments for many applications, including the prediction of gene functions based upon the available biomolecular annotations. Additionally, they may also be employed to compute similarity between genes or proteins. Here, we describe and discuss a software suite we developed to implement and make publicly available some of these prediction methods and a computational technique based upon Latent Semantic Indexing (LSI), which leverages both inferred and available annotations to search for semantically similar genes. The suite consists of three components. BioAnnotationPredictor is a computational software module to predict new gene functions based upon Singular Value Decomposition of available annotations. SimilBio is a Web module that leverages annotations available or predicted by BioAnnotationPredictor to discover similarities between genes via LSI. The suite also includes SemSim, a new Web service built upon these modules to allow accessing them programmatically. We integrated SemSim in the Bio Search Computing framework (http://www.bioinformatics.deib.polimi.it/bio-seco/seco/), where users can exploit the Search Computing technology to run multi-topic complex queries on multiple integrated Web services. Accordingly, researchers may obtain ranked answers involving the computation of the functional similarity between genes in support of biomedical knowledge discovery. PMID:26357324

  8. A robust data-driven approach for gene ontology annotation

    PubMed Central

    Li, Yanpeng; Yu, Hong

    2014-01-01

    Gene ontology (GO) and GO annotation are important resources for biological information management and knowledge discovery, but the speed of manual annotation has become a major bottleneck of database curation. The BioCreative IV GO annotation task aims to evaluate the performance of systems that automatically assign GO terms to genes based on the narrative sentences in biomedical literature. This article presents our work in this task as well as the experimental results after the competition. For the evidence sentence extraction subtask, we built a binary classifier to identify evidence sentences using reference distance estimator (RDE), a recently proposed semi-supervised learning method that learns new features from around 10 million unlabeled sentences, achieving an F1 of 19.3% in exact match and 32.5% in relaxed match. In the post-submission experiment, we obtained 22.1% and 35.7% F1 performance by incorporating bigram features in RDE learning. In both development and test sets, the RDE-based method achieved over 20% relative improvement on F1 and AUC performance against classical supervised learning methods, e.g. support vector machine and logistic regression. For the GO term prediction subtask, we developed an information retrieval-based method to retrieve the GO term most relevant to each evidence sentence using a ranking function that combined cosine similarity and the frequency of GO terms in documents, and a filtering method based on high-level GO classes. The best performance of our submitted runs was 7.8% F1 and 22.2% hierarchy F1. We found that the incorporation of frequency information and hierarchy filtering substantially improved the performance. In the post-submission evaluation, we obtained a 10.6% F1 using a simpler setting. Overall, the experimental analysis showed our approaches were robust in both tasks. PMID:25425037
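
    The retrieval step described above, ranking candidate terms for a sentence by cosine similarity combined with term frequency, might look roughly like the sketch below. The abstract does not give the ranking function, so the combination used here (cosine similarity scaled by a log-frequency factor with weight `alpha`) is purely an assumption, as are the term labels, counts, and function names.

```python
import math
from collections import Counter


def cosine(a_tokens, b_tokens):
    """Cosine similarity between two bags of words."""
    a, b = Counter(a_tokens), Counter(b_tokens)
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_terms(sentence, term_names, term_freq, alpha=0.5):
    """Rank term IDs for a sentence, best first.

    score = cosine * (1 + alpha * log(1 + frequency)); this formula is an
    illustrative assumption, not the one used in the paper.
    """
    tokens = sentence.lower().split()
    scored = []
    for term_id, name in term_names.items():
        similarity = cosine(tokens, name.lower().split())
        boost = 1 + alpha * math.log1p(term_freq.get(term_id, 0))
        scored.append((similarity * boost, term_id))
    # Highest score first; ties broken by term ID for a stable order
    return [term_id for _, term_id in sorted(scored, key=lambda s: (-s[0], s[1]))]


# Hypothetical term labels and document frequencies
toy_terms = {"TERM_A": "protein phosphorylation", "TERM_B": "dna binding"}
toy_freq = {"TERM_A": 12, "TERM_B": 40}
ranked_terms = rank_terms(
    "the kinase mediates protein phosphorylation in vivo", toy_terms, toy_freq
)
print(ranked_terms)
```

    Because the frequency factor multiplies the similarity, a frequent term with no lexical overlap still scores zero, which is one simple way to keep frequency from overriding relevance.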

  9. GOtcha: a new method for prediction of protein function assessed by the annotation of seven genomes

    PubMed Central

    Martin, David MA; Berriman, Matthew; Barton, Geoffrey J

    2004-01-01

    Background: The function of a novel gene product is typically predicted by transitive assignment of annotation from similar sequences. We describe a novel method, GOtcha, for predicting gene product function by annotation with Gene Ontology (GO) terms. GOtcha predicts GO term associations with term-specific probability (P-score) measures of confidence. Term-specific probabilities are a novel feature of GOtcha and allow the identification of conflicts or uncertainty in annotation. Results: The GOtcha method was applied to the recently sequenced genome for Plasmodium falciparum and six other genomes. GOtcha was compared quantitatively for retrieval of assigned GO terms against direct transitive assignment from the highest scoring annotated BLAST search hit (TOPBLAST). GOtcha exploits information deep into the 'twilight zone' of similarity search matches, making use of much information that is otherwise discarded by more simplistic approaches. At a P-score cutoff of 50%, GOtcha provided 60% better recovery of annotation terms and 20% higher selectivity than annotation with TOPBLAST at an E-value cutoff of 10⁻⁴. Conclusions: The GOtcha method is a useful tool for genome annotators. It has identified both errors and omissions in the original Plasmodium falciparum annotation and is being adopted by many other genome sequencing projects. PMID:15550167

  10. CATH: comprehensive structural and functional annotations for genome sequences

    PubMed Central

    Sillitoe, Ian; Lewis, Tony E.; Cuff, Alison; Das, Sayoni; Ashford, Paul; Dawson, Natalie L.; Furnham, Nicholas; Laskowski, Roman A.; Lee, David; Lees, Jonathan G.; Lehtinen, Sonja; Studer, Romain A.; Thornton, Janet; Orengo, Christine A.

    2015-01-01

    The latest version of the CATH-Gene3D protein structure classification database (4.0, http://www.cathdb.info) provides annotations for over 235 000 protein domain structures and includes 25 million domain predictions. This article provides an update on the major developments in the 2 years since the last publication in this journal including: significant improvements to the predictive power of our functional families (FunFams); the release of our ‘current’ putative domain assignments (CATH-B); a new, strictly non-redundant data set of CATH domains suitable for homology benchmarking experiments (CATH-40) and a number of improvements to the web pages. PMID:25348408

  11. Transcriptome assembly, gene annotation and tissue gene expression atlas of the rainbow trout.

    PubMed

    Salem, Mohamed; Paneru, Bam; Al-Tobasei, Rafet; Abdouni, Fatima; Thorgaard, Gary H; Rexroad, Caird E; Yao, Jianbo

    2015-01-01

    Efforts to obtain a comprehensive genome sequence for rainbow trout are ongoing and will be complemented by transcriptome information that will enhance genome assembly and annotation. Previously, transcriptome reference sequences were reported using data from different sources. Although the previous work added a great wealth of sequences, a complete and well-annotated transcriptome is still needed. In addition, gene expression in different tissues was not completely addressed in the previous studies. In this study, non-normalized cDNA libraries were sequenced from 13 different tissues of a single doubled haploid rainbow trout from the same source used for the rainbow trout genome sequence. A total of ~1.167 billion paired-end reads were de novo assembled using the Trinity RNA-Seq assembler yielding 474,524 contigs > 500 base-pairs. Of them, 287,593 had homologies to the NCBI non-redundant protein database. The longest contig of each cluster was selected as a reference, yielding 44,990 representative contigs. A total of 4,146 contigs (9.2%), including 710 full-length sequences, did not match any mRNA sequences in the current rainbow trout genome reference. Mapping reads to the reference genome identified an additional 11,843 transcripts not annotated in the genome. A digital gene expression atlas revealed 7,678 housekeeping and 4,021 tissue-specific genes. Expression of about 16,000-32,000 genes (35-71% of the identified genes) accounted for basic and specialized functions of each tissue. White muscle and stomach had the least complex transcriptomes, with high percentages of their total mRNA contributed by a small number of genes. Brain, testis and intestine, in contrast, had complex transcriptomes, with large numbers of genes involved in their expression patterns. This study provides comprehensive de novo transcriptome information that is suitable for functional and comparative genomics studies in rainbow trout, including annotation of the genome. PMID:25793877

  12. GeneDB—an annotation database for pathogens

    PubMed Central

    Logan-Klumpler, Flora J.; De Silva, Nishadi; Boehme, Ulrike; Rogers, Matthew B.; Velarde, Giles; McQuillan, Jacqueline A.; Carver, Tim; Aslett, Martin; Olsen, Christian; Subramanian, Sandhya; Phan, Isabelle; Farris, Carol; Mitra, Siddhartha; Ramasamy, Gowthaman; Wang, Haiming; Tivey, Adrian; Jackson, Andrew; Houston, Robin; Parkhill, Julian; Holden, Matthew; Harb, Omar S.; Brunk, Brian P.; Myler, Peter J.; Roos, David; Carrington, Mark; Smith, Deborah F.; Hertz-Fowler, Christiane; Berriman, Matthew

    2012-01-01

    GeneDB (http://www.genedb.org) is a genome database for prokaryotic and eukaryotic pathogens and closely related organisms. The resource provides a portal to genome sequence and annotation data, which is primarily generated by the Pathogen Genomics group at the Wellcome Trust Sanger Institute. It combines data from completed and ongoing genome projects with curated annotation, which is readily accessible from a web based resource. The development of the database in recent years has focused on providing database-driven annotation tools and pipelines, as well as catering for increasingly frequent assembly updates. The website has been significantly redesigned to take advantage of current web technologies, and improve usability. The current release stores 41 data sets, of which 17 are manually curated and maintained by biologists, who review and incorporate data from the scientific literature, as well as other sources. GeneDB is primarily a production and annotation database for the genomes of predominantly pathogenic organisms. PMID:22116062

  13. High-throughput functional annotation and data mining with the Blast2GO suite.

    PubMed

    Götz, Stefan; García-Gómez, Juan Miguel; Terol, Javier; Williams, Tim D; Nagaraj, Shivashankar H; Nueda, María José; Robles, Montserrat; Talón, Manuel; Dopazo, Joaquín; Conesa, Ana

    2008-06-01

    Functional genomics technologies have been widely adopted in the biological research of both model and non-model species. An efficient functional annotation of DNA or protein sequences is a major requirement for the successful application of these approaches as functional information on gene products is often the key to the interpretation of experimental results. Therefore, there is an increasing need for bioinformatics resources which are able to cope with large amounts of sequence data, produce valuable annotation results and are easily accessible to laboratories where functional genomics projects are being undertaken. We present the Blast2GO suite as an integrated and biologist-oriented solution for the high-throughput and automatic functional annotation of DNA or protein sequences based on the Gene Ontology vocabulary. The most outstanding Blast2GO features are: (i) the combination of various annotation strategies and tools controlling type and intensity of annotation, (ii) the numerous graphical features such as the interactive GO-graph visualization for gene-set function profiling or descriptive charts, (iii) the general sequence management features and (iv) high-throughput capabilities. We used the Blast2GO framework to carry out a detailed analysis of annotation behaviour through homology transfer and its impact in functional genomics research. Our aim is to offer biologists useful information to take into account when addressing the task of functionally characterizing their sequence data. PMID:18445632

  14. High-throughput functional annotation and data mining with the Blast2GO suite

    PubMed Central

    Götz, Stefan; García-Gómez, Juan Miguel; Terol, Javier; Williams, Tim D.; Nagaraj, Shivashankar H.; Nueda, María José; Robles, Montserrat; Talón, Manuel; Dopazo, Joaquín; Conesa, Ana

    2008-01-01

    Functional genomics technologies have been widely adopted in the biological research of both model and non-model species. An efficient functional annotation of DNA or protein sequences is a major requirement for the successful application of these approaches as functional information on gene products is often the key to the interpretation of experimental results. Therefore, there is an increasing need for bioinformatics resources which are able to cope with large amounts of sequence data, produce valuable annotation results and are easily accessible to laboratories where functional genomics projects are being undertaken. We present the Blast2GO suite as an integrated and biologist-oriented solution for the high-throughput and automatic functional annotation of DNA or protein sequences based on the Gene Ontology vocabulary. The most outstanding Blast2GO features are: (i) the combination of various annotation strategies and tools controlling type and intensity of annotation, (ii) the numerous graphical features such as the interactive GO-graph visualization for gene-set function profiling or descriptive charts, (iii) the general sequence management features and (iv) high-throughput capabilities. We used the Blast2GO framework to carry out a detailed analysis of annotation behaviour through homology transfer and its impact in functional genomics research. Our aim is to offer biologists useful information to take into account when addressing the task of functionally characterizing their sequence data. PMID:18445632

  15. Automatic extraction of gene ontology annotation and its correlation with clusters in protein networks

    PubMed Central

    Daraselia, Nikolai; Yuryev, Anton; Egorov, Sergei; Mazo, Ilya; Ispolatov, Iaroslav

    2007-01-01

    Background Uncovering cellular roles of a protein is a task of tremendous importance and complexity that requires dedicated experimental work as well as often sophisticated data mining and processing tools. Protein functions, often referred to as its annotations, are believed to manifest themselves through topology of the networks of inter-protein interactions. In particular, there is a growing body of evidence that proteins performing the same function are more likely to interact with each other than with proteins with other functions. However, since functional annotation and protein network topology are often studied separately, the direct relationship between them has not been comprehensively demonstrated. In addition to having general biological significance, such demonstration would further validate the data extraction and processing methods used to compose protein annotation and protein-protein interactions datasets. Results We developed a method for automatic extraction of protein functional annotation from scientific text based on the Natural Language Processing (NLP) technology. For the protein annotation extracted from the entire PubMed, we evaluated the precision and recall rates, and compared the performance of the automatic extraction technology to that of manual curation used in public Gene Ontology (GO) annotation. In the second part of our presentation, we reported a large-scale investigation into the correspondence between communities in the literature-based protein networks and GO annotation groups of functionally related proteins. We found a comprehensive two-way match: proteins within biological annotation groups form significantly denser linked network clusters than expected by chance and, conversely, densely linked network communities exhibit a pronounced non-random overlap with GO groups. We also expanded the publicly available GO biological process annotation using the relations extracted by our NLP technology. An increase in the number and size of GO groups without any noticeable decrease of the link density within the groups indicated that this expansion significantly broadens the public GO annotation without diluting its quality. We revealed that functional GO annotation correlates mostly with clustering in a physical interaction protein network, while its overlap with indirect regulatory network communities is two to three times smaller. Conclusion Protein functional annotations extracted by the NLP technology expand and enrich the existing GO annotation system. The GO functional modularity correlates mostly with the clustering in the physical interaction network, suggesting an essential role for the structural organization maintained by these interactions. Reciprocally, clustering of proteins in physical interaction networks can serve as evidence for their functional similarity. PMID:17620146

  16. GLAD: an Online Database of Gene List Annotation for Drosophila

    PubMed Central

    Hu, Yanhui; Comjean, Aram; Perkins, Lizabeth A.; Perrimon, Norbert; Mohr, Stephanie E.

    2015-01-01

    We present a resource of high quality lists of functionally related Drosophila genes, e.g. based on protein domains (kinases, transcription factors, etc.) or cellular function (e.g. autophagy, signal transduction). To establish these lists, we relied on different inputs, including curation from databases or the literature and mapping from other species. Moreover, as an added curation and quality control step, we asked experts in relevant fields to review many of the lists. The resource is available online for scientists to search and view, and is editable based on community input. Annotation of gene groups is an ongoing effort and scientific need will typically drive decisions regarding which gene lists to pursue. We anticipate that the number of lists will increase over time; that the composition of some lists will grow and/or change over time as new information becomes available; and that the lists will benefit the scientific community, e.g. at experimental design and data analysis stages. Based on this, we present an easily updatable online database, available at www.flyrnai.org/glad, at which gene group lists can be viewed, searched and downloaded. PMID:26157507

  17. Dizeez: an online game for human gene-disease annotation.

    PubMed

    Loguercio, Salvatore; Good, Benjamin M; Su, Andrew I

    2013-01-01

    Structured gene annotations are a foundation upon which many bioinformatics and statistical analyses are built. However the structured annotations available in public databases are a sparse representation of biological knowledge as a whole. The rate of biomedical data generation is such that centralized biocuration efforts struggle to keep up. New models for gene annotation need to be explored that expand the pace at which we are able to structure biomedical knowledge. Recently, online games have emerged as an effective way to recruit, engage and organize large numbers of volunteers to help address difficult biological challenges. For example, games have been successfully developed for protein folding (Foldit), multiple sequence alignment (Phylo) and RNA structure design (EteRNA). Here we present Dizeez, a simple online game built with the purpose of structuring knowledge of gene-disease associations. Preliminary results from game play online and at scientific conferences suggest that Dizeez is producing valid gene-disease annotations not yet present in any public database. These early results provide a basic proof of principle that online games can be successfully applied to the challenge of gene annotation. Dizeez is available at http://genegames.org. PMID:23951102

  18. AnnotQTL: a new tool to gather functional and comparative information on a genomic region

    PubMed Central

    Lecerf, F.; Bretaudeau, A.; Sallou, O.; Desert, C.; Blum, Y.; Lagarrigue, S.; Demeure, O.

    2011-01-01

    AnnotQTL is a web tool designed to aggregate functional annotations from different prominent web sites by minimizing the redundancy of information. Although thousands of QTL regions have been identified in livestock species, most of them are large and contain many genes. This tool was therefore designed to assist the characterization of genes in a QTL interval region as a step towards selecting the best candidate genes. It localizes the gene to a specific region (using NCBI and Ensembl data) and adds the functional annotations available from other databases (Gene Ontology, Mammalian Phenotype, HGNC and Pubmed). Both human genome and mouse genome can be aligned with the studied region to detect synteny and segment conservation, which is useful for running inter-species comparisons of QTL locations. Finally, custom marker lists can be included in the results display to select the genes that are closest to your most significant markers. We use examples to demonstrate that in just a couple of hours, AnnotQTL is able to identify all the genes located in regions identified by a full genome scan, with some highlighted based on both location and function, thus considerably increasing the chances of finding good candidate genes. AnnotQTL is available at http://annotqtl.genouest.org. PMID:21596783

  19. A model selection criterion for model-based clustering of annotated gene expression data.

    PubMed

    Gallopin, Mélina; Celeux, Gilles; Jaffrézic, Florence; Rau, Andrea

    2015-11-01

    In co-expression analyses of gene expression data, it is often of interest to interpret clusters of co-expressed genes with respect to a set of external information, such as a potentially incomplete list of functional properties for which a subset of genes may be annotated. Based on the framework of finite mixture models, we propose a model selection criterion that takes into account such external gene annotations, providing an efficient tool for selecting a relevant number of clusters and clustering model. This criterion, called the integrated completed annotated likelihood (ICAL), is defined by adding an entropy term to a penalized likelihood to measure the concordance between a clustering partition and the external annotation information. The ICAL leads to the choice of a model that is more easily interpretable with respect to the known functional gene annotations. We illustrate the interest of this model selection criterion in conjunction with Gaussian mixture models on simulated gene expression data and on real RNA-seq data. PMID:26461845

  20. Functional annotation of introns in mitochondrial genome - a brief review.

    PubMed

    Anandakumar, Shanmugam; Ravindran, Suda Parimala; Shanmughavel, Piramanayagam

    2016-03-01

    The present study aims to decipher the non-coding regions present in mitochondrial genomes that cause diseases in humans and to predict their functional roles through a comparative genomics approach followed by functional annotation of these segments. PMID:24845436

  1. De Novo Assembly, Functional Annotation and Comparative Analysis of Withania somnifera Leaf and Root Transcriptomes to Identify Putative Genes Involved in the Withanolides Biosynthesis

    PubMed Central

    Gupta, Parul; Goel, Ridhi; Pathak, Sumya; Srivastava, Apeksha; Singh, Surya Pratap; Sangwan, Rajender Singh; Asif, Mehar Hasan; Trivedi, Prabodh Kumar

    2013-01-01

    Withania somnifera is one of the most valuable medicinal plants used in Ayurvedic and other indigenous medicine systems due to bioactive molecules known as withanolides. As genomic information regarding this plant is very limited, little information is available about biosynthesis of withanolides. To facilitate the basic understanding about the withanolide biosynthesis pathways, we performed transcriptome sequencing for Withania leaf (101L) and root (101R) which specifically synthesize withaferin A and withanolide A, respectively. Pyrosequencing yielded 834,068 and 721,755 reads which were assembled into 89,548 and 114,814 unique sequences from 101L and 101R, respectively. A total of 47,885 (101L) and 54,123 (101R) could be annotated using TAIR10, NR, tomato and potato databases. Gene Ontology and KEGG analyses provided a detailed view of all the enzymes involved in withanolide backbone synthesis. Our analysis identified members of cytochrome P450, glycosyltransferase and methyltransferase gene families with unique presence or differential expression in leaf and root, which might be involved in synthesis of tissue-specific withanolides. We also detected simple sequence repeats (SSRs) in transcriptome data for use in future genetic studies. The comprehensive sequence resource developed for Withania in this study will help to elucidate the biosynthetic pathway for tissue-specific synthesis of secondary plant products in non-model plant organisms and will be helpful in developing strategies for enhanced biosynthesis of withanolides through biotechnological approaches. PMID:23667511

  2. CATH FunFHMMer web server: protein functional annotations using functional family assignments.

    PubMed

    Das, Sayoni; Sillitoe, Ian; Lee, David; Lees, Jonathan G; Dawson, Natalie L; Ward, John; Orengo, Christine A

    2015-07-01

    The widening function annotation gap in protein databases and the increasing number and diversity of the proteins being sequenced presents new challenges to protein function prediction methods. Multidomain proteins complicate the protein sequence-structure-function relationship further as new combinations of domains can expand the functional repertoire, creating new proteins and functions. Here, we present the FunFHMMer web server, which provides Gene Ontology (GO) annotations for query protein sequences based on the functional classification of the domain-based CATH-Gene3D resource. Our server also provides valuable information for the prediction of functional sites. The predictive power of FunFHMMer has been validated on a set of 95 proteins where FunFHMMer performs better than BLAST, Pfam and CDD. Recent validation by an independent international competition ranks FunFHMMer as one of the top function prediction methods in predicting GO annotations for both the Biological Process and Molecular Function Ontology. The FunFHMMer web server is available at http://www.cathdb.info/search/by_funfhmmer. PMID:25964299

  3. Functional annotation of a full-length mouse cDNA collection

    SciTech Connect

    Kawai, J.; Shinagawa, A.; Shibata, K.; Yoshino, M.; Itoh, M.; Ishii, Y.; Arakawa, T.; Hara, A.; Fukunishi, Y.; Konno, H.; Adachi, J.; Fukuda, S.; Aizawa, K.; Izawa, M.; Nishi, K.; Kiyosawa, H.; Kondo, S.; Yamanaka, I.; Saito, T.; Okazaki, Y.; Gojobori, T.; Bono, H.; Kasukawa, T.; Saito, R.; Kadota, K.; Matsuda, H.; Ashburner, M.; Batalov, S.; Casavant, T.; Fleischmann, W.; Gaasterland, T.; Gissi, C.; King, B.; Kochiwa, H.; Kuehl, P.; Lewis, S.; Matsuo, Y.; Nikaido, I.; Pesole, G.; Quackenbush, J.; Schriml, L.M.; Staubli, F.; Suzuki, R.; Tomita, M.; Wagner, L.; Washio, T.; Sakai, K.; Okido, T.; Furuno, M.; Aono, H.; Baldarelli, R.; Barsh, G.; Blake, J.; Boffelli, D.; Bojunga, N.; Carninci, P.; de Bonaldo, M.F.; Brownstein, M.J.; Bult, C.; Fletcher, C.; Fujita, M.; Gariboldi, M.; Gustincich, S.; Hill, D.; Hofmann, M.; Hume, D.A.; Kamiya, M.; Lee, N.H.; Lyons, P.; Marchionni, L.; Mashima, J.; Mazzarelli, J.; Mombaerts, P.; Nordone, P.; Ring, B.; Ringwald, M.; Rodriguez, I.; Sakamoto, N.; Sasaki, H.; Sato, K.; Schonbach, C.; Seya, T.; Shibata, Y.; Storch, K.-F.; Suzuki, H.; Toyo-oka, K.; Wang, K.H.; Weitz, C.; Whittaker, C.; Wilming, L.; Wynshaw-Boris, A.; Yoshida, K.; Hasegawa, Y.; Kawaji, H.; Kohtsuki, S.; Hayashizaki, Y.; RIKEN Genome Exploration Research Group Phase II T; FANTOM Consortium

    2001-01-01

    The RIKEN Mouse Gene Encyclopedia Project, a systematic approach to determining the full coding potential of the mouse genome, involves collection and sequencing of full-length complementary DNAs and physical mapping of the corresponding genes to the mouse genome. We organized an international functional annotation meeting (FANTOM) to annotate the first 21,076 cDNAs to be analyzed in this project. Here we describe the first RIKEN clone collection, which is one of the largest described for any organism. Analysis of these cDNAs extends known gene families and identifies new ones.

  4. Functional annotation of colon cancer risk SNPs

    PubMed Central

    Yao, Lijing; Tak, Yu Gyoung; Berman, Benjamin P.; Farnham, Peggy J.

    2014-01-01

    Colorectal cancer (CRC) is a leading cause of cancer-related deaths in the United States. Genome-wide association studies (GWAS) have identified single nucleotide polymorphisms (SNPs) associated with increased risk for CRC. A molecular understanding of the functional consequences of this genetic variation has been complicated because each GWAS SNP is a surrogate for hundreds of other SNPs, most of which are located in non-coding regions. Here we use genomic and epigenomic information to test the hypothesis that the GWAS SNPs and/or correlated SNPs are in elements that regulate gene expression, and identify 23 promoters and 28 enhancers. Using gene expression data from normal and tumour cells, we identify 66 putative target genes of the risk-associated enhancers (10 of which were also identified by promoter SNPs). Employing CRISPR nucleases, we delete one risk-associated enhancer and identify genes showing altered expression. We suggest that similar studies be performed to characterize all CRC risk-associated enhancers. PMID:25268989

  5. Lynx web services for annotations and systems analysis of multi-gene disorders

    PubMed Central

    Sulakhe, Dinanath; Taylor, Andrew; Balasubramanian, Sandhya; Feng, Bo; Xie, Bingqing; Börnigen, Daniela; Dave, Utpal J.; Foster, Ian T.; Gilliam, T. Conrad; Maltsev, Natalia

    2014-01-01

    Lynx is a web-based integrated systems biology platform that supports annotation and analysis of experimental data and generation of weighted hypotheses on molecular mechanisms contributing to human phenotypes and disorders of interest. Lynx has integrated multiple classes of biomedical data (genomic, proteomic, pathways, phenotypic, toxicogenomic, contextual and others) from various public databases as well as manually curated data from our group and collaborators (LynxKB). Lynx provides tools for gene list enrichment analysis using multiple functional annotations and network-based gene prioritization. Lynx provides access to the integrated database and the analytical tools via REST based Web Services (http://lynx.ci.uchicago.edu/webservices.html). This comprises data retrieval services for specific functional annotations, services to search across the complete LynxKB (powered by Lucene), and services to access the analytical tools built within the Lynx platform. PMID:24948611

  6. Lynx web services for annotations and systems analysis of multi-gene disorders.

    PubMed

    Sulakhe, Dinanath; Taylor, Andrew; Balasubramanian, Sandhya; Feng, Bo; Xie, Bingqing; Börnigen, Daniela; Dave, Utpal J; Foster, Ian T; Gilliam, T Conrad; Maltsev, Natalia

    2014-07-01

    Lynx is a web-based integrated systems biology platform that supports annotation and analysis of experimental data and generation of weighted hypotheses on molecular mechanisms contributing to human phenotypes and disorders of interest. Lynx has integrated multiple classes of biomedical data (genomic, proteomic, pathways, phenotypic, toxicogenomic, contextual and others) from various public databases as well as manually curated data from our group and collaborators (LynxKB). Lynx provides tools for gene list enrichment analysis using multiple functional annotations and network-based gene prioritization. Lynx provides access to the integrated database and the analytical tools via REST based Web Services (http://lynx.ci.uchicago.edu/webservices.html). This comprises data retrieval services for specific functional annotations, services to search across the complete LynxKB (powered by Lucene), and services to access the analytical tools built within the Lynx platform. PMID:24948611

  7. Predicting gene ontology annotations of orphan GWAS genes using protein-protein interactions

    PubMed Central

    2014-01-01

    Background The number of genome-wide association studies (GWAS) has increased rapidly in the past couple of years, resulting in the identification of genes associated with different diseases. The next step in translating these findings into biomedically useful information is to find out the mechanism of the action of these genes. However, GWAS studies often implicate genes whose functions are currently unknown; for example, MYEOV, ANKLE1, TMEM45B and ORAOV1 are found to be associated with breast cancer, but their molecular function is unknown. Results We carried out Bayesian inference of Gene Ontology (GO) term annotations of genes by employing the directed acyclic graph structure of GO and the network of protein-protein interactions (PPIs). The approach is designed based on the fact that two proteins that interact biophysically would be in physical proximity of each other, would possess complementary molecular function, and play a role in related biological processes. Predicted GO terms were ranked according to their relative association scores and the approach was evaluated quantitatively by plotting the precision versus recall values and F-scores (the harmonic mean of precision and recall) versus varying thresholds. Precisions of ~58% and ~40% for localization and functions respectively of proteins were determined at a threshold of ~30 (top 30 GO terms in the ranked list). Comparison with function prediction based on semantic similarity among nodes in an ontology and incorporation of those similarities in a k-nearest neighbor classifier confirmed that our results compared favorably. Conclusions This approach was applied to predict the cellular component and molecular function GO terms of all human proteins that have interacting partners possessing at least one known GO annotation. The list of predictions is available at http://severus.dbmi.pitt.edu/engo/GOPRED.html. We present the algorithm, evaluations and the results of the computational predictions, especially for genes identified in GWAS studies to be associated with diseases, which are of translational interest. PMID:24708602
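    The evaluation described in this abstract (precision, recall and F-score computed over the top-ranked GO terms at a varying threshold) can be illustrated with a minimal sketch. It is not the authors' code: the function name `precision_recall_f`, the term labels and the toy ranked list are all hypothetical, and the threshold is assumed to mean "keep the top k terms of the ranked list".

```python
def precision_recall_f(ranked_terms, true_terms, k):
    """Precision, recall and F-score for the top-k entries of a ranked list.

    ranked_terms: predicted labels, best score first (hypothetical input).
    true_terms:   set of labels known to be correct for the protein.
    k:            threshold, i.e. how many top-ranked predictions to keep.
    """
    top_k = ranked_terms[:k]
    hits = sum(1 for term in top_k if term in true_terms)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(true_terms) if true_terms else 0.0
    denom = precision + recall
    # F-score is the harmonic mean of precision and recall.
    f_score = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_score


# Toy data with made-up labels, for illustration only.
ranked = ["term_a", "term_b", "term_c", "term_d"]
truth = {"term_a", "term_c", "term_e"}

for k in (1, 2, 4):
    p, r, f = precision_recall_f(ranked, truth, k)
    print(f"k={k}: precision={p:.3f} recall={r:.3f} F={f:.3f}")
```

    Sweeping k over a range of thresholds and plotting the resulting values gives curves of the kind the abstract mentions; how the authors aggregated scores across proteins is not stated here, so the sketch handles a single protein only.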

  8. Algal Functional Annotation Tool from the DOE-UCLA Institute for Genomics and Proteomics

    DOE Data Explorer

    Lopez, David

    The Algal Functional Annotation Tool is a bioinformatics resource to visualize pathway maps, identify enriched biological terms, or convert gene identifiers to elucidate biological function in silico. These types of analysis have been catered to support lists of gene identifiers, such as those coming from transcriptome gene expression analysis. By analyzing the functional annotation of an interesting set of genes, common biological motifs may be elucidated and a first-pass analysis can point further research in the right direction. Currently, the following databases have been parsed, processed, and added to the tool: 1) Kyoto Encyclopedia of Genes and Genomes (KEGG) Pathways Database, 2) MetaCyc Encyclopedia of Metabolic Pathways, 3) Panther Pathways Database, 4) Reactome Pathways Database, 5) Gene Ontology, 6) MapMan Ontology, 7) KOG (Eukaryotic Clusters of Orthologous Groups), 8) Pfam, 9) InterPro.

  9. Draft Genome Sequence and Gene Annotation of Stemphylium lycopersici Strain CIDEFI-216.

    PubMed

    Franco, Mario E E; López, Silvina; Medina, Rocio; Saparrat, Mario C N; Balatti, Pedro

    2015-01-01

    Stemphylium lycopersici is a plant-pathogenic fungus that is widely distributed throughout the world. In tomatoes, it is one of the etiological agents of gray leaf spot disease. Here, we report the first draft genome sequence of S. lycopersici, including its gene structure and functional annotation. PMID:26404600

  10. Draft Genome Sequence and Gene Annotation of Stemphylium lycopersici Strain CIDEFI-216

    PubMed Central

    Franco, Mario E. E.; López, Silvina; Medina, Rocio; Saparrat, Mario C. N.

    2015-01-01

    Stemphylium lycopersici is a plant-pathogenic fungus that is widely distributed throughout the world. In tomatoes, it is one of the etiological agents of gray leaf spot disease. Here, we report the first draft genome sequence of S. lycopersici, including its gene structure and functional annotation. PMID:26404600

  11. Comparative Analysis of Functional Metagenomic Annotation and the Mappability of Short Reads

    PubMed Central

    Carr, Rogan; Borenstein, Elhanan

    2014-01-01

    To assess the functional capacities of microbial communities, including those inhabiting the human body, shotgun metagenomic reads are often aligned to a database of known genes. Such homology-based annotation practices critically rely on the assumption that short reads can map to orthologous genes of similar function. This assumption, however, and the various factors that impact short read annotation, have not been systematically evaluated. To address this challenge, we generated an extremely large database of simulated reads (totaling 15.9 Gb), spanning over 500,000 microbial genes and 170 curated genomes and including, for many genomes, every possible read of a given length. We annotated each read using common metagenomic protocols, fully characterizing the effect of read length, sequencing error, phylogeny, database coverage, and mapping parameters. We additionally rigorously quantified gene-, genome-, and protocol-specific annotation biases. Overall, our findings provide a first comprehensive evaluation of the capabilities and limitations of functional metagenomic annotation, providing crucial goal-specific best-practice guidelines to inform future metagenomic research. PMID:25148512

  12. Transcriptome Assembly, Gene Annotation and Tissue Gene Expression Atlas of the Rainbow Trout

    PubMed Central

    Salem, Mohamed; Paneru, Bam; Al-Tobasei, Rafet; Abdouni, Fatima; Thorgaard, Gary H.; Rexroad, Caird E.; Yao, Jianbo

    2015-01-01

    Efforts to obtain a comprehensive genome sequence for rainbow trout are ongoing and will be complemented by transcriptome information that will enhance genome assembly and annotation. Previously, transcriptome reference sequences were reported using data from different sources. Although the previous work added a great wealth of sequences, a complete and well-annotated transcriptome is still needed. In addition, gene expression in different tissues was not completely addressed in the previous studies. In this study, non-normalized cDNA libraries were sequenced from 13 different tissues of a single doubled haploid rainbow trout from the same source used for the rainbow trout genome sequence. A total of ~1.167 billion paired-end reads were de novo assembled using the Trinity RNA-Seq assembler yielding 474,524 contigs > 500 base-pairs. Of them, 287,593 had homologies to the NCBI non-redundant protein database. The longest contig of each cluster was selected as a reference, yielding 44,990 representative contigs. A total of 4,146 contigs (9.2%), including 710 full-length sequences, did not match any mRNA sequences in the current rainbow trout genome reference. Mapping reads to the reference genome identified an additional 11,843 transcripts not annotated in the genome. A digital gene expression atlas revealed 7,678 housekeeping and 4,021 tissue-specific genes. Expression of about 16,000–32,000 genes (35–71% of the identified genes) accounted for basic and specialized functions of each tissue. White muscle and stomach had the least complex transcriptomes, with high percentages of their total mRNA contributed by a small number of genes. Brain, testis and intestine, in contrast, had complex transcriptomes, with large numbers of genes involved in their expression patterns. This study provides comprehensive de novo transcriptome information that is suitable for functional and comparative genomics studies in rainbow trout, including annotation of the genome. PMID:25793877

  13. Proteomic Detection of Non-Annotated Protein-Coding Genes in Pseudomonas fluorescens Pf0-1

    SciTech Connect

    Kim, Wook; Silby, Mark W.; Purvine, Samuel O.; Nicoll, Julie S.; Hixson, Kim K.; Monroe, Matthew E.; Nicora, Carrie D.; Lipton, Mary S.; Levy, Stuart B.

    2009-12-24

    Genome sequences are annotated by computational prediction of coding sequences, followed by similarity searches such as BLAST, which provide a layer of (possible) functional information. While the existence of processes such as alternative splicing complicates matters for eukaryote genomes, the view of bacterial genomes as a linear series of closely spaced genes leads to the assumption that computational annotations which predict such arrangements completely describe the coding capacity of bacterial genomes. We undertook a proteomic study to identify proteins expressed by Pseudomonas fluorescens Pf0-1 from genes which were not predicted during the genome annotation. Mapping peptides to the Pf0-1 genome sequence identified sixteen non-annotated protein-coding regions, of which nine were antisense to predicted genes, six were intergenic, and one read in the same direction as an annotated gene but in a different frame. The expression of all but one of the newly discovered genes was verified by RT-PCR. Few clues as to the function of the new genes were gleaned from informatic analyses, but potential orthologues in other Pseudomonas genomes were identified for eight of the new genes. The 16 newly identified genes improve the quality of the Pf0-1 genome annotation, and the detection of antisense protein-coding genes indicates the under-appreciated complexity of bacterial genome organization.

  14. Annotation of human chromosome 21 for relevance to Down syndrome: gene structure and expression analysis.

    PubMed

    Gardiner, Katheleen; Slavov, Dobromir; Bechtel, Lawrence; Davisson, Muriel

    2002-06-01

    Down syndrome is caused by an extra copy of human chromosome 21 and the resultant dosage-related overexpression of genes contained within it. To efficiently direct experiments to determine specific gene-phenotype correlations, it is necessary to identify all genes within 21q and assess their functional associations and expression patterns. Analysis of the complete finished sequence of 21q resulted in the annotation of 225 genes and gene models, most of which were incomplete and/or had little or no experimental verification. Here we correct or complete the genomic structures of 16 genes, 4 of which were not reported in the annotation of the complete sequence. Our data include the identification of six genes encoding short or ambiguous open reading frames; the identification of three cases in which alternative splicing produces two structurally unrelated protein sequences; and the identification of six genes encoding proteins with functional motifs, two genes with unusually low similarity to their orthologous mouse proteins, and four genes with significant conservation in Drosophila melanogaster. We further demonstrate that an additional nine gene models represent bona fide transcripts and develop expression patterns for these genes plus nine additional novel chromosome 21 genes and four paralogous genes mapping elsewhere in the human genome. These data have implications for generating complete transcript maps of chromosome 21 and for the entire human genome, and for defining expression abnormalities in Down syndrome and mouse models. PMID:12036298

  15. COGNIZER: A Framework for Functional Annotation of Metagenomic Datasets

    PubMed Central

    Bose, Tungadri; Haque, Mohammed Monzoorul; Reddy, CVSK; Mande, Sharmila S.

    2015-01-01

    Background Recent advances in sequencing technologies have resulted in an unprecedented increase in the number of metagenomes that are being sequenced world-wide. Given their volume, functional annotation of metagenomic sequence datasets requires specialized computational tools/techniques. In spite of having high accuracy, existing stand-alone functional annotation tools require end-users to perform compute-intensive homology searches of metagenomic datasets against "multiple" databases prior to functional analysis. Although web-based functional annotation servers address to some extent the problem of availability of compute resources, uploading and analyzing huge volumes of sequence data on a shared public web-service has its own set of limitations. In this study, we present COGNIZER, a comprehensive stand-alone annotation framework which enables end-users to functionally annotate sequences constituting metagenomic datasets. The COGNIZER framework provides multiple workflow options. A subset of these options employs a novel directed-search strategy which helps in reducing the overall compute requirements for end-users. The COGNIZER framework includes a cross-mapping database that enables end-users to simultaneously derive/infer KEGG, Pfam, GO, and SEED subsystem information from the COG annotations. Results Validation experiments performed with real-world metagenomes and metatranscriptomes, generated using diverse sequencing technologies, indicate that the novel directed-search strategy employed in COGNIZER helps in reducing the compute requirements without significant loss in annotation accuracy. A comparison of COGNIZER's results with pre-computed benchmark values indicates the reliability of the cross-mapping database employed in COGNIZER. Conclusion The COGNIZER framework is capable of comprehensively annotating any metagenomic or metatranscriptomic dataset from varied sequencing platforms in functional terms. Multiple search options in COGNIZER provide end-users the flexibility of choosing a homology search protocol based on available compute resources. The cross-mapping database in COGNIZER is of high utility since it enables end-users to directly infer/derive KEGG, Pfam, GO, and SEED subsystem annotations from COG categorizations. Furthermore, availability of COGNIZER as a stand-alone scalable implementation is expected to make it a valuable annotation tool in the field of metagenomic research. Availability and Implementation A Linux implementation of COGNIZER is freely available for download from the following links: http://metagenomics.atc.tcs.com/cognizer, https://metagenomics.atc.tcs.com/function/cognizer. PMID:26561344

  16. Comprehensive annotation of secondary metabolite biosynthetic genes and gene clusters of Aspergillus nidulans, A. fumigatus, A. niger and A. oryzae

    PubMed Central

    2013-01-01

    Background Secondary metabolite production, a hallmark of filamentous fungi, is an expanding area of research for the Aspergilli. These compounds are potent chemicals, ranging from deadly toxins to therapeutic antibiotics to potential anti-cancer drugs. The genome sequences for multiple Aspergilli have been determined, and provide a wealth of predictive information about secondary metabolite production. Sequence analysis and gene overexpression strategies have enabled the discovery of novel secondary metabolites and the genes involved in their biosynthesis. The Aspergillus Genome Database (AspGD) provides a central repository for gene annotation and protein information for Aspergillus species. These annotations include Gene Ontology (GO) terms, phenotype data, gene names and descriptions and they are crucial for interpreting both small- and large-scale data and for aiding in the design of new experiments that further Aspergillus research. Results We have manually curated Biological Process GO annotations for all genes in AspGD with recorded functions in secondary metabolite production, adding new GO terms that specifically describe each secondary metabolite. We then leveraged these new annotations to predict roles in secondary metabolism for genes lacking experimental characterization. As a starting point for manually annotating Aspergillus secondary metabolite gene clusters, we used antiSMASH (antibiotics and Secondary Metabolite Analysis SHell) and SMURF (Secondary Metabolite Unknown Regions Finder) algorithms to identify potential clusters in A. nidulans, A. fumigatus, A. niger and A. oryzae, which we subsequently refined through manual curation. Conclusions This set of 266 manually curated secondary metabolite gene clusters will facilitate the investigation of novel Aspergillus secondary metabolites. PMID:23617571

  17. The Arabidopsis Information Resource (TAIR): improved gene annotation and new tools

    PubMed Central

    Lamesch, Philippe; Berardini, Tanya Z.; Li, Donghui; Swarbreck, David; Wilks, Christopher; Sasidharan, Rajkumar; Muller, Robert; Dreher, Kate; Alexander, Debbie L.; Garcia-Hernandez, Margarita; Karthikeyan, Athikkattuvalasu S.; Lee, Cynthia H.; Nelson, William D.; Ploetz, Larry; Singh, Shanker; Wensel, April; Huala, Eva

    2012-01-01

    The Arabidopsis Information Resource (TAIR, http://arabidopsis.org) is a genome database for Arabidopsis thaliana, an important reference organism for many fundamental aspects of biology as well as basic and applied plant biology research. TAIR serves as a central access point for Arabidopsis data, annotates gene function and expression patterns using controlled vocabulary terms, and maintains and updates the A. thaliana genome assembly and annotation. TAIR also provides researchers with an extensive set of visualization and analysis tools. Recent developments include several new genome releases (TAIR8, TAIR9 and TAIR10) in which the A. thaliana assembly was updated, pseudogenes and transposon genes were re-annotated, and new data from proteomics and next generation transcriptome sequencing were incorporated into gene models and splice variants. Other highlights include progress on functional annotation of the genome and the release of several new tools including Textpresso for Arabidopsis which provides the capability to carry out full text searches on a large body of research literature. PMID:22140109

  18. Biocuration of functional annotation at the European nucleotide archive.

    PubMed

    Gibson, Richard; Alako, Blaise; Amid, Clara; Cerdeño-Tárraga, Ana; Cleland, Iain; Goodgame, Neil; Ten Hoopen, Petra; Jayathilaka, Suran; Kay, Simon; Leinonen, Rasko; Liu, Xin; Pallreddy, Swapna; Pakseresht, Nima; Rajan, Jeena; Rosselló, Marc; Silvester, Nicole; Smirnov, Dmitriy; Toribio, Ana Luisa; Vaughan, Daniel; Zalunin, Vadim; Cochrane, Guy

    2016-01-01

    The European Nucleotide Archive (ENA; http://www.ebi.ac.uk/ena) is a repository for the submission, maintenance and presentation of nucleotide sequence data and related sample and experimental information. In this article we report on ENA in 2015 regarding general activity, notable published data sets and major achievements. This is followed by a focus on sustainable biocuration of functional annotation, an area which has particularly felt the pressure of sequencing growth. The importance of functional annotation, how it can be submitted and the shifting role of the biocurator in the context of increasing volumes of data are all discussed. PMID:26615190

  19. Biocuration of functional annotation at the European nucleotide archive

    PubMed Central

    Gibson, Richard; Alako, Blaise; Amid, Clara; Cerdeño-Tárraga, Ana; Cleland, Iain; Goodgame, Neil; ten Hoopen, Petra; Jayathilaka, Suran; Kay, Simon; Leinonen, Rasko; Liu, Xin; Pallreddy, Swapna; Pakseresht, Nima; Rajan, Jeena; Rosselló, Marc; Silvester, Nicole; Smirnov, Dmitriy; Toribio, Ana Luisa; Vaughan, Daniel; Zalunin, Vadim; Cochrane, Guy

    2016-01-01

    The European Nucleotide Archive (ENA; http://www.ebi.ac.uk/ena) is a repository for the submission, maintenance and presentation of nucleotide sequence data and related sample and experimental information. In this article we report on ENA in 2015 regarding general activity, notable published data sets and major achievements. This is followed by a focus on sustainable biocuration of functional annotation, an area which has particularly felt the pressure of sequencing growth. The importance of functional annotation, how it can be submitted and the shifting role of the biocurator in the context of increasing volumes of data are all discussed. PMID:26615190

  20. AIGO: Towards a unified framework for the Analysis and the Inter-comparison of GO functional annotations

    PubMed Central

    2011-01-01

    Background In response to the rapid growth of available genome sequences, efforts have been made to develop automatic inference methods to functionally characterize them. Pipelines that infer functional annotation are now routinely used to produce new annotations at a genome scale and for a broad variety of species. These pipelines differ widely in their inference algorithms, confidence thresholds and data sources for reasoning. This heterogeneity makes a comparison of the relative merits of each approach extremely complex. The evaluation of the quality of the resultant annotations is also challenging given there is often no existing gold-standard against which to evaluate precision and recall. Results In this paper, we present a pragmatic approach to the study of functional annotations. An ensemble of 12 metrics, describing various aspects of functional annotations, is defined and implemented in a unified framework, which facilitates their systematic analysis and inter-comparison. The use of this framework is demonstrated on three illustrative examples: analysing the outputs of state-of-the-art inference pipelines, comparing electronic versus manual annotation methods, and monitoring the evolution of publicly available functional annotations. The framework is part of the AIGO library (http://code.google.com/p/aigo) for the Analysis and the Inter-comparison of the products of Gene Ontology (GO) annotation pipelines. The AIGO library also provides functionalities to easily load, analyse, manipulate and compare functional annotations and also to plot and export the results of the analysis in various formats. Conclusions This work is a step toward developing a unified framework for the systematic study of GO functional annotations. This framework has been designed so that new metrics on GO functional annotations can be added in a very straightforward way. PMID:22054122

  1. OryzaExpress: An Integrated Database of Gene Expression Networks and Omics Annotations in Rice

    PubMed Central

    Hamada, Kazuki; Hongo, Kohei; Suwabe, Keita; Shimizu, Akifumi; Nagayama, Taishi; Abe, Reina; Kikuchi, Shunsuke; Yamamoto, Naoki; Fujii, Takaaki; Yokoyama, Koji; Tsuchida, Hiroko; Sano, Kazumi; Mochizuki, Takako; Oki, Nobuhiko; Horiuchi, Youko; Fujita, Masahiro; Watanabe, Masao; Matsuoka, Makoto; Kurata, Nori; Yano, Kentaro

    2011-01-01

    Similarity of gene expression profiles provides important clues for understanding the biological functions of genes, biological processes and metabolic pathways related to genes. A gene expression network (GEN) is an ideal choice to grasp such expression profile similarities among genes simultaneously. For GEN construction, the Pearson correlation coefficient (PCC) has been widely used as an index to evaluate the similarities of expression profiles for gene pairs. However, calculation of PCCs for all gene pairs requires large amounts of both time and computer resources. Based on correspondence analysis, we developed a new method for GEN construction, which takes minimal time even for large-scale expression data with general computational circumstances. Moreover, our method requires no prior parameters to remove sample redundancies in the data set. Using the new method, we constructed rice GENs from large-scale microarray data stored in a public database. We then collected and integrated various principal rice omics annotations in public and distinct databases. The integrated information contains annotations of genome, transcriptome and metabolic pathways. We thus developed the integrated database OryzaExpress for browsing GENs with an interactive and graphical viewer and principal omics annotations (http://riceball.lab.nig.ac.jp/oryzaexpress/). With integration of Arabidopsis GEN data from ATTED-II, OryzaExpress also allows us to compare GENs between rice and Arabidopsis. Thus, OryzaExpress is a comprehensive rice database that exploits powerful omics approaches from all perspectives in plant science and leads to systems biology. PMID:21186175
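
    The pairwise PCC baseline mentioned in this abstract can be sketched in a few lines. This is only an illustration of the conventional all-pairs approach, not the correspondence-analysis method the authors developed; the gene names, expression values and the 0.9 threshold are assumptions made for the example.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat profile carries no correlation signal
    return cov / sqrt(var_x * var_y)


def pcc_network(profiles, threshold=0.9):
    """Return edges (gene_a, gene_b, pcc) whose PCC meets the threshold.

    Every gene pair is visited, so the cost grows quadratically with gene count.
    """
    edges = []
    for (gene_a, prof_a), (gene_b, prof_b) in combinations(sorted(profiles.items()), 2):
        r = pearson(prof_a, prof_b)
        if r >= threshold:
            edges.append((gene_a, gene_b, round(r, 3)))
    return edges


# Hypothetical expression values across four samples (assumption, not real data).
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
}
edges = pcc_network(profiles)
```

    The nested loop over all gene pairs is what makes this baseline expensive for large-scale expression data, which is the cost the abstract refers to.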

  2. Functional annotations for the Saccharomyces cerevisiae genome: the knowns and the known unknowns

    PubMed Central

    Christie, Karen R.; Hong, Eurie L.; Cherry, J. Michael

    2011-01-01

    The quest to characterize each of the genes of the yeast Saccharomyces cerevisiae has propelled the development and application of novel high-throughput (HTP) experimental techniques. To handle the enormous amount of information generated by these techniques, new bioinformatics tools and resources are needed. Gene Ontology (GO) annotations curated by the Saccharomyces Genome Database (SGD) have facilitated the development of algorithms that analyze HTP data and help predict functions for poorly characterized genes in S. cerevisiae and other organisms. Here, we describe how published results are incorporated into GO annotations at SGD and why researchers can benefit from using these resources wisely to analyze their HTP data and predict gene functions. PMID:19577472

  3. Disentangling the Effects of Colocalizing Genomic Annotations to Functionally Prioritize Non-coding Variants within Complex-Trait Loci

    PubMed Central

    Trynka, Gosia; Westra, Harm-Jan; Slowikowski, Kamil; Hu, Xinli; Xu, Han; Stranger, Barbara E.; Klein, Robert J.; Han, Buhm; Raychaudhuri, Soumya

    2015-01-01

    Identifying genomic annotations that differentiate causal from trait-associated variants is essential to fine mapping disease loci. Although many studies have identified non-coding functional annotations that overlap disease-associated variants, these annotations often colocalize, complicating the ability to use these annotations for fine mapping causal variation. We developed a statistical approach (Genomic Annotation Shifter [GoShifter]) to assess whether enriched annotations are able to prioritize causal variation. GoShifter defines the null distribution of an annotation overlapping an allele by locally shifting annotations; this approach is less sensitive to biases arising from local genomic structure than commonly used enrichment methods that depend on SNP matching. Local shifting also allows GoShifter to identify independent causal effects from colocalizing annotations. Using GoShifter, we confirmed that variants in expression quantitative trait loci drive gene-expression changes through DNase-I hypersensitive sites (DHSs) near transcription start sites and independently through 3′ UTR regulation. We also showed that (1) 15%–36% of trait-associated loci map to DHSs independently of other annotations; (2) loci associated with breast cancer and rheumatoid arthritis harbor potentially causal variants near the summits of histone marks rather than full peak bodies; (3) variants associated with height are highly enriched in embryonic stem cell DHSs; and (4) we can effectively prioritize causal variation at specific loci. PMID:26140449
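
    The idea of building a null distribution by locally shifting annotations can be illustrated with a simplified sketch. It is not the GoShifter implementation: it assumes a single locus with fixed boundaries, half-open annotation intervals that lie inside the locus, circular wrap-around at the locus end, and an empirical p-value with a +1 correction. All function names and coordinates are hypothetical.

```python
import random


def count_overlaps(snps, intervals):
    """Number of SNP positions that fall inside any half-open interval [start, end)."""
    return sum(any(s <= pos < e for s, e in intervals) for pos in snps)


def shift_intervals(intervals, offset, locus_start, locus_end):
    """Shift intervals by offset inside the locus, wrapping around its end."""
    length = locus_end - locus_start
    shifted = []
    for s, e in intervals:
        new_s = locus_start + (s - locus_start + offset) % length
        new_e = new_s + (e - s)
        if new_e <= locus_end:
            shifted.append((new_s, new_e))
        else:  # interval crosses the locus end: split it in two
            shifted.append((new_s, locus_end))
            shifted.append((locus_start, locus_start + (new_e - locus_end)))
    return shifted


def local_shift_pvalue(snps, intervals, locus_start, locus_end, n_iter=1000, seed=0):
    """Empirical p-value: share of random shifts with overlap >= observed."""
    rng = random.Random(seed)
    observed = count_overlaps(snps, intervals)
    hits = 0
    for _ in range(n_iter):
        offset = rng.randrange(1, locus_end - locus_start)
        shifted = shift_intervals(intervals, offset, locus_start, locus_end)
        if count_overlaps(snps, shifted) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_iter + 1)


# Hypothetical locus of 1,000 bp with one annotation and two linked variants.
observed, p_value = local_shift_pvalue(
    snps=[120, 130], intervals=[(100, 150)], locus_start=0, locus_end=1000
)
```

    Because the annotations are moved only within the locus, the local density of annotations and variants is preserved in every null draw, which is the property the abstract contrasts with SNP-matching enrichment methods.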

  4. Automated Eukaryotic Gene Structure Annotation Using EVidenceModeler and the Program to Assemble Spliced Alignments

    SciTech Connect

    Haas, B J; Salzberg, S L; Zhu, W; Pertea, M; Allen, J E; Orvis, J; White, O; Buell, C R; Wortman, J R

    2007-12-10

    EVidenceModeler (EVM) is presented as an automated eukaryotic gene structure annotation tool that reports eukaryotic gene structures as a weighted consensus of all available evidence. EVM, when combined with the Program to Assemble Spliced Alignments (PASA), yields a comprehensive, configurable annotation system that predicts protein-coding genes and alternatively spliced isoforms. Our experiments on both rice and human genome sequences demonstrate that EVM produces automated gene structure annotation approaching the quality of manual curation.

  5. RNAmmer: consistent and rapid annotation of ribosomal RNA genes

    PubMed Central

    Lagesen, Karin; Hallin, Peter; Rødland, Einar Andreas; Stærfeldt, Hans-Henrik; Rognes, Torbjørn; Ussery, David W.

    2007-01-01

    The publication of a complete genome sequence is usually accompanied by annotations of its genes. In contrast to protein coding genes, genes for ribosomal RNA (rRNA) are often poorly or inconsistently annotated. This makes comparative studies based on rRNA genes difficult. We have therefore created computational predictors for the major rRNA species from all kingdoms of life and compiled them into a program called RNAmmer. The program uses hidden Markov models trained on data from the 5S ribosomal RNA database and the European ribosomal RNA database project. A pre-screening step makes the method fast with little loss of sensitivity, enabling the analysis of a complete bacterial genome in less than a minute. Results from running RNAmmer on a large set of genomes indicate that the location of rRNAs can be predicted with a very high level of accuracy. Novel, unannotated rRNAs are also predicted in many genomes. The software as well as the genome analysis results are available at the CBS web server. PMID:17452365

  6. RNAmmer: consistent and rapid annotation of ribosomal RNA genes.

    PubMed

    Lagesen, Karin; Hallin, Peter; Rødland, Einar Andreas; Staerfeldt, Hans-Henrik; Rognes, Torbjørn; Ussery, David W

    2007-01-01

    The publication of a complete genome sequence is usually accompanied by annotations of its genes. In contrast to protein coding genes, genes for ribosomal RNA (rRNA) are often poorly or inconsistently annotated. This makes comparative studies based on rRNA genes difficult. We have therefore created computational predictors for the major rRNA species from all kingdoms of life and compiled them into a program called RNAmmer. The program uses hidden Markov models trained on data from the 5S ribosomal RNA database and the European ribosomal RNA database project. A pre-screening step makes the method fast with little loss of sensitivity, enabling the analysis of a complete bacterial genome in less than a minute. Results from running RNAmmer on a large set of genomes indicate that the location of rRNAs can be predicted with a very high level of accuracy. Novel, unannotated rRNAs are also predicted in many genomes. The software as well as the genome analysis results are available at the CBS web server. PMID:17452365

  7. The Mouse Functional Genome Database (MfunGD): functional annotation of proteins in the light of their cellular context.

    PubMed

    Ruepp, Andreas; Doudieu, Octave Noubibou; van den Oever, Jos; Brauner, Barbara; Dunger-Kaltenbach, Irmtraud; Fobo, Gisela; Frishman, Goar; Montrone, Corinna; Skornia, Christine; Wanka, Steffi; Rattei, Thomas; Pagel, Philipp; Riley, Louise; Frishman, Dmitrij; Surmeli, Dimitrij; Tetko, Igor V; Oesterheld, Matthias; Stümpflen, Volker; Mewes, H Werner

    2006-01-01

    MfunGD (http://mips.gsf.de/genre/proj/mfungd/) provides a resource for annotated mouse proteins and their occurrence in protein networks. Manual annotation concentrates on proteins which are found to interact physically with other proteins. Accordingly, manually curated information from a protein-protein interaction database (MPPI) and a database of mammalian protein complexes is interconnected with MfunGD. Protein function annotation is performed using the Functional Catalogue (FunCat) annotation scheme which is widely used for the analysis of protein networks. The dataset is also supplemented with information about the literature that was used in the annotation process as well as links to the SIMAP Fasta database, the Pedant protein analysis system and cross-references to external resources. Proteins that so far were not manually inspected are annotated automatically by a graphical probabilistic model and/or superparamagnetic clustering. The database is continuously expanding to include the rapidly growing amount of functional information about gene products from mouse. MfunGD is implemented in GenRE, a J2EE-based component-oriented multi-tier architecture following the separation of concern principle. PMID:16381934

  8. Identification of sample annotation errors in gene expression datasets.

    PubMed

    Lohr, Miriam; Hellwig, Birte; Edlund, Karolina; Mattsson, Johanna S M; Botling, Johan; Schmidt, Marcus; Hengstler, Jan G; Micke, Patrick; Rahnenführer, Jörg

    2015-12-01

    The comprehensive transcriptomic analysis of clinically annotated human tissue has found widespread use in oncology, cell biology, immunology, and toxicology. In cancer research, microarray-based gene expression profiling has successfully been applied to subclassify disease entities, predict therapy response, and identify cellular mechanisms. Public accessibility of raw data, together with corresponding information on clinicopathological parameters, offers the opportunity to reuse previously analyzed data and to gain statistical power by combining multiple datasets. However, results and conclusions obviously depend on the reliability of the available information. Here, we propose gene expression-based methods for identifying sample misannotations in public transcriptomic datasets. Sample mix-up can be detected by a classifier that differentiates between samples from male and female patients. Correlation analysis identifies multiple measurements of material from the same sample. The analysis of 45 datasets (including 4913 patients) revealed that erroneous sample annotation, affecting 40 % of the analyzed datasets, may be a more widespread phenomenon than previously thought. Removal of erroneously labelled samples may influence the results of the statistical evaluation in some datasets. Our methods may help to identify individual datasets that contain numerous discrepancies and could be routinely included into the statistical analysis of clinical gene expression data. PMID:26608184
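
    A sex-based consistency check of the kind described here can be sketched as follows. The abstract does not say which features the authors' classifier uses; this sketch assumes a simple rule comparing a female-expressed marker (XIST) with a Y-linked marker (RPS4Y1), and the sample IDs and expression values are invented for illustration.

```python
def predict_sex(expr, female_marker="XIST", male_marker="RPS4Y1"):
    """Call 'F' when the female marker is expressed above the male marker, else 'M'."""
    return "F" if expr[female_marker] > expr[male_marker] else "M"


def flag_sex_mismatches(samples):
    """Return IDs of samples whose annotated sex disagrees with the expression-based call."""
    return [
        sample_id
        for sample_id, (annotated_sex, expr) in sorted(samples.items())
        if predict_sex(expr) != annotated_sex
    ]


# Hypothetical log-scale expression values; not taken from any real dataset.
samples = {
    "s1": ("F", {"XIST": 9.1, "RPS4Y1": 2.3}),
    "s2": ("M", {"XIST": 2.0, "RPS4Y1": 10.4}),
    "s3": ("M", {"XIST": 8.7, "RPS4Y1": 2.1}),  # annotated male, looks female
}
suspects = flag_sex_mismatches(samples)
```

    A flagged sample is only a candidate for a mix-up; a trained classifier with a calibrated threshold would be needed before relabelling or removing it.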

  9. The Viral MetaGenome Annotation Pipeline (VMGAP): an automated tool for the functional annotation of viral Metagenomic shotgun sequencing data

    PubMed Central

    Lorenzi, Hernan A.; Hoover, Jeff; Inman, Jason; Safford, Todd; Murphy, Sean; Kagan, Leonid; Williamson, Shannon J.

    2011-01-01

    In the past few years, the field of metagenomics has been growing at an accelerated pace, particularly in response to advancements in new sequencing technologies. The large volume of sequence data from novel organisms generated by metagenomic projects has triggered the development of specialized databases and tools focused on particular groups of organisms or data types. Here we describe a pipeline for the functional annotation of viral metagenomic sequence data. The Viral MetaGenome Annotation Pipeline (VMGAP) takes advantage of a number of specialized databases, such as collections of mobile genetic elements and environmental metagenomes, to improve the classification and functional prediction of viral gene products. The pipeline assigns a functional term to each predicted protein sequence following a suite of comprehensive analyses whose results are ranked according to a priority rules hierarchy. Additional annotation is provided in the form of enzyme commission (EC) numbers, GO/MeGO terms and Hidden Markov Models together with supporting evidence. PMID:21886867
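
    Ranking analysis results by a priority hierarchy can be pictured with a minimal sketch. The evidence types, their order and the fallback label below are assumptions chosen for illustration; the abstract does not list the actual VMGAP rules.

```python
# Hypothetical evidence types, listed from most to least trusted.
PRIORITY = ["curated_hmm", "blast_high_identity", "domain_hit", "blast_low_identity"]


def assign_function(evidence, priority=PRIORITY, default="hypothetical protein"):
    """Return (term, source) from the highest-priority evidence type that produced a result."""
    for source in priority:
        term = evidence.get(source)
        if term:
            return term, source
    return default, None


# One predicted protein with two competing results (invented example values).
term, source = assign_function(
    {"blast_low_identity": "phage protein", "domain_hit": "terminase large subunit"}
)
```

    Returning the evidence source alongside the term keeps the supporting evidence attached to each assignment.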

  10. Manual Gene Ontology annotation workflow at the Mouse Genome Informatics Database

    PubMed Central

    Drabkin, Harold J.; Blake, Judith A.

    2012-01-01

    The Mouse Genome Database, the Gene Expression Database and the Mouse Tumor Biology database are integrated components of the Mouse Genome Informatics (MGI) resource (http://www.informatics.jax.org). The MGI system presents both a consensus view and an experimental view of the knowledge concerning the genetics and genomics of the laboratory mouse. From genotype to phenotype, this information resource integrates information about genes, sequences, maps, expression analyses, alleles, strains and mutant phenotypes. Comparative mammalian data are also presented, particularly with regard to the use of the mouse as a model for the investigation of molecular and genetic components of human diseases. These data are collected from literature curation as well as downloads of large datasets (SwissProt, LocusLink, etc.). MGI is one of the founding members of the Gene Ontology (GO) and uses the GO for functional annotation of genes. Here, we discuss the workflow associated with manual GO annotation at MGI, from literature collection to display of the annotations. Peer-reviewed literature is collected mostly from a set of journals available electronically. Selected articles are entered into a master bibliography and indexed to one of eight areas of interest such as ‘GO’ or ‘homology’ or ‘phenotype’. Each article is then either indexed to a gene already contained in the database or funneled through a separate nomenclature database to add genes. The master bibliography and associated indexing provide information for various curator-reports such as ‘papers selected for GO that refer to genes with NO GO annotation’. Once indexed, curators who have expertise in appropriate disciplines enter pertinent information. MGI makes use of several controlled vocabularies that ensure uniform data encoding, enable robust analysis and support the construction of complex queries. These vocabularies range from pick-lists to structured vocabularies such as the GO. All data associations are supported with statements of evidence as well as access to source publications. PMID:23110975

  11. Assessment of protein set coherence using functional annotations

    PubMed Central

    Chagoyen, Monica; Carazo, Jose M; Pascual-Montano, Alberto

    2008-01-01

    Background Analysis of large-scale experimental datasets frequently produces one or more sets of proteins that are subsequently mined for functional interpretation and validation. To this end, a number of computational methods have been devised that rely on the analysis of functional annotations. Although current methods provide valuable information (e.g. significantly enriched annotations, pairwise functional similarities), they do not specifically measure the degree of homogeneity of a protein set. Results In this work we present a method that scores the degree of functional homogeneity, or coherence, of a set of proteins on the basis of the global similarity of their functional annotations. The method uses statistical hypothesis testing to assess the significance of the set in the context of the functional space of a reference set. As such, it can be used as a first step in the validation of sets expected to be homogeneous prior to further functional interpretation. Conclusion We evaluate our method by analysing known biologically relevant sets as well as random ones. The known relevant sets comprise macromolecular complexes, cellular components and pathways described for Saccharomyces cerevisiae, which are mostly significantly coherent. Finally, we illustrate the usefulness of our approach for validating 'functional modules' obtained from computational analysis of protein-protein interaction networks. Matlab code and supplementary data are available at PMID:18937846
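
    One way to picture a coherence score with a significance test is sketched below. The abstract does not specify the similarity measure or the test, so this sketch assumes mean pairwise Jaccard similarity between annotation sets and an empirical p-value from random protein sets of the same size drawn from the reference set. The protein IDs and annotation terms are hypothetical, and a set needs at least two proteins.

```python
import random
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two annotation sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def coherence(proteins, annotations):
    """Mean pairwise annotation similarity across a protein set (needs >= 2 proteins)."""
    pairs = list(combinations(proteins, 2))
    return sum(jaccard(annotations[p], annotations[q]) for p, q in pairs) / len(pairs)


def coherence_pvalue(proteins, annotations, n_iter=1000, seed=0):
    """Share of random same-size sets from the reference that score at least as high."""
    rng = random.Random(seed)
    reference = sorted(annotations)
    observed = coherence(proteins, annotations)
    hits = sum(
        coherence(rng.sample(reference, len(proteins)), annotations) >= observed
        for _ in range(n_iter)
    )
    return observed, (hits + 1) / (n_iter + 1)


# Hypothetical reference set: protein -> set of annotation terms.
annotations = {
    "p1": {"ribosome", "translation"},
    "p2": {"ribosome", "translation"},
    "p3": {"ribosome", "translation", "rRNA binding"},
    "p4": {"glycolysis"},
    "p5": {"DNA repair"},
    "p6": {"transport"},
    "p7": {"kinase"},
    "p8": {"proteolysis"},
}
score, p_value = coherence_pvalue(["p1", "p2", "p3"], annotations)
```

    A low p-value here means that few random sets from the reference are as internally similar as the tested set, which matches the abstract's use of the reference set as the functional background.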

  12. Data for constructing insect genome content matrices for phylogenetic analysis and functional annotation

    PubMed Central

    Rosenfeld, Jeffrey; Foox, Jonathan; DeSalle, Rob

    2015-01-01

    Twenty-one fully sequenced and well annotated insect genomes were used to construct genome content matrices for phylogenetic analysis and functional annotation of insect genomes. To examine the role of e-value cutoff in ortholog determination we used scaled e-value cutoffs and a single linkage clustering approach. The present communication includes (1) a list of the genomes used to construct the genome content phylogenetic matrices, (2) a nexus file with the data matrices used in phylogenetic analysis, (3) a nexus file with the Newick trees generated by phylogenetic analysis, (4) an Excel file listing the Core (CORE) genes and Unique (UNI) genes found in five insect groups, and (5) a figure showing a plot of consistency index (CI) versus percent of unannotated genes that are apomorphies in the data set for gene losses and gains and bar plots of gains and losses for four consistency index (CI) cutoffs.
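
    The effect of an e-value cutoff on single linkage clustering can be shown with a short sketch. It assumes pairwise similarity hits are already available as (query, subject, e-value) tuples and uses a union-find structure; the gene names, hit values and the two cutoffs are invented, and the authors' scaling of cutoffs is not reproduced here.

```python
def single_linkage_clusters(hits, evalue_cutoff):
    """Group genes into clusters: two genes join when any hit between them
    has an e-value at or below the cutoff (single linkage, via union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for gene_a, gene_b, evalue in hits:
        root_a, root_b = find(gene_a), find(gene_b)
        if evalue <= evalue_cutoff and root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for gene in parent:
        clusters.setdefault(find(gene), set()).add(gene)
    return sorted(sorted(c) for c in clusters.values())


# Hypothetical pairwise similarity hits: (query, subject, e-value).
hits = [
    ("flyA", "beeA", 1e-50),
    ("beeA", "antA", 1e-12),
    ("flyB", "beeB", 1e-8),
    ("antA", "flyB", 1e-3),
]
strict = single_linkage_clusters(hits, 1e-20)
loose = single_linkage_clusters(hits, 1e-5)
```

    With the strict cutoff only flyA and beeA are grouped; with the looser cutoff antA joins them through its hit to beeA, which is the chaining behaviour typical of single linkage.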

  13. Data for constructing insect genome content matrices for phylogenetic analysis and functional annotation.

    PubMed

    Rosenfeld, Jeffrey; Foox, Jonathan; DeSalle, Rob

    2016-03-01

    Twenty-one fully sequenced and well annotated insect genomes were used to construct genome content matrices for phylogenetic analysis and functional annotation of insect genomes. To examine the role of e-value cutoff in ortholog determination we used scaled e-value cutoffs and a single linkage clustering approach. The present communication includes (1) a list of the genomes used to construct the genome content phylogenetic matrices, (2) a nexus file with the data matrices used in phylogenetic analysis, (3) a nexus file with the Newick trees generated by phylogenetic analysis, (4) an Excel file listing the Core (CORE) genes and Unique (UNI) genes found in five insect groups, and (5) a figure showing a plot of consistency index (CI) versus percent of unannotated genes that are apomorphies in the data set for gene losses and gains and bar plots of gains and losses for four consistency index (CI) cutoffs. PMID:26862572

  14. Use of Gene Ontology Annotation to understand the peroxisome proteome in humans.

    PubMed

    Mutowo-Meullenet, Prudence; Huntley, Rachael P; Dimmer, Emily C; Alam-Faruque, Yasmin; Sawford, Tony; Jesus Martin, Maria; O'Donovan, Claire; Apweiler, Rolf

    2013-01-01

    The Gene Ontology (GO) is the de facto standard for the functional description of gene products, providing a consistent, information-rich terminology applicable across species and information repositories. The UniProt Consortium uses both manual and automatic GO annotation approaches to curate UniProt Knowledgebase (UniProtKB) entries. The selection of a protein set prioritized for manual annotation has implications for the characteristics of the information provided to users working in a specific field or interested in particular pathways or processes. In this article, we describe an organelle-focused, manual curation initiative targeting proteins from the human peroxisome. We discuss the steps taken to define the peroxisome proteome and the challenges encountered in defining the boundaries of this protein set. We illustrate with the use of examples how GO annotations now capture cell and tissue type information and the advantages that such an annotation approach provides to users. Database URL: http://www.ebi.ac.uk/GOA/ and http://www.uniprot.org. PMID:23327938

  15. Pegasus: a comprehensive annotation and prediction tool for detection of driver gene fusions in cancer

    PubMed Central

    2014-01-01

    Background The extraordinary success of imatinib in the treatment of BCR-ABL1 associated cancers underscores the need to identify novel functional gene fusions in cancer. RNA sequencing offers a genome-wide view of expressed transcripts, uncovering biologically functional gene fusions. Although several bioinformatics tools are already available for the detection of putative fusion transcripts, candidate event lists are plagued with non-functional read-through events, reverse transcriptase template switching events, incorrect mapping, and other systematic errors. Such lists lack any indication of oncogenic relevance, and they are too large for exhaustive experimental validation. Results We have designed and implemented a pipeline, Pegasus, for the annotation and prediction of biologically functional gene fusion candidates. Pegasus provides a common interface for various gene fusion detection tools, reconstruction of novel fusion proteins, reading-frame-aware annotation of preserved/lost functional domains, and data-driven classification of oncogenic potential. Pegasus dramatically streamlines the search for oncogenic gene fusions, bridging the gap between raw RNA-Seq data and a final, tractable list of candidates for experimental validation. Conclusion We show the effectiveness of Pegasus in predicting new driver fusions in 176 RNA-Seq samples of glioblastoma multiforme (GBM) and 23 cases of anaplastic large cell lymphoma (ALCL). Contact: fa2306@columbia.edu. PMID:25183062

  16. Gene Model Annotations for Drosophila melanogaster: The Rule-Benders

    PubMed Central

    Crosby, Madeline A.; Gramates, L. Sian; dos Santos, Gilberto; Matthews, Beverley B.; St. Pierre, Susan E.; Zhou, Pinglei; Schroeder, Andrew J.; Falls, Kathleen; Emmert, David B.; Russo, Susan M.; Gelbart, William M.

    2015-01-01

    In the context of the FlyBase annotated gene models in Drosophila melanogaster, we describe the many exceptional cases we have curated from the literature or identified in the course of FlyBase analysis. These range from atypical but common examples such as dicistronic and polycistronic transcripts, noncanonical splices, trans-spliced transcripts, noncanonical translation starts, and stop-codon readthroughs, to single exceptional cases such as ribosomal frameshifting and HAC1-type intron processing. In FlyBase, exceptional genes and transcripts are flagged with Sequence Ontology terms and/or standardized comments. Because some of the rule-benders create problems for handlers of high-throughput data, we discuss plans for flagging these cases in bulk data downloads. PMID:26109356

  17. Optimizing high performance computing workflow for protein functional annotation

    PubMed Central

    Stanberry, Larissa; Rekepalli, Bhanu; Liu, Yuan; Giblock, Paul; Higdon, Roger; Montague, Elizabeth; Broomall, William; Kolker, Natali; Kolker, Eugene

    2014-01-01

    Functional annotation of newly sequenced genomes is one of the major challenges in modern biology. With modern sequencing technologies, the protein sequence universe is rapidly expanding. Newly sequenced bacterial genomes alone contain over 7.5 million proteins. The rate of data generation has far surpassed that of protein annotation. The volume of protein data makes manual curation infeasible, whereas a high compute cost limits the utility of existing automated approaches. In this work, we present an improved and optimized automated workflow to enable large-scale protein annotation. The workflow uses high performance computing architectures and a low complexity classification algorithm to assign proteins into existing clusters of orthologous groups of proteins. On the basis of the Position-Specific Iterative Basic Local Alignment Search Tool, the algorithm ensures at least 80% specificity and sensitivity of the resulting classifications. The workflow utilizes highly scalable parallel applications for classification and sequence alignment. Using Extreme Science and Engineering Discovery Environment supercomputers, the workflow processed 1,200,000 newly sequenced bacterial proteins. With the rapid expansion of the protein sequence universe, the proposed workflow will enable scientists to annotate big genome data. PMID:25313296

  18. Optimizing high performance computing workflow for protein functional annotation.

    PubMed

    Stanberry, Larissa; Rekepalli, Bhanu; Liu, Yuan; Giblock, Paul; Higdon, Roger; Montague, Elizabeth; Broomall, William; Kolker, Natali; Kolker, Eugene

    2014-09-10

    Functional annotation of newly sequenced genomes is one of the major challenges in modern biology. With modern sequencing technologies, the protein sequence universe is rapidly expanding. Newly sequenced bacterial genomes alone contain over 7.5 million proteins. The rate of data generation has far surpassed that of protein annotation. The volume of protein data makes manual curation infeasible, whereas a high compute cost limits the utility of existing automated approaches. In this work, we present an improved and optimized automated workflow to enable large-scale protein annotation. The workflow uses high performance computing architectures and a low complexity classification algorithm to assign proteins into existing clusters of orthologous groups of proteins. On the basis of the Position-Specific Iterative Basic Local Alignment Search Tool, the algorithm ensures at least 80% specificity and sensitivity of the resulting classifications. The workflow utilizes highly scalable parallel applications for classification and sequence alignment. Using Extreme Science and Engineering Discovery Environment supercomputers, the workflow processed 1,200,000 newly sequenced bacterial proteins. With the rapid expansion of the protein sequence universe, the proposed workflow will enable scientists to annotate big genome data. PMID:25313296

  19. miRDB: an online resource for microRNA target prediction and functional annotations

    PubMed Central

    Wong, Nathan; Wang, Xiaowei

    2015-01-01

    MicroRNAs (miRNAs) are small non-coding RNAs that are extensively involved in many physiological and disease processes. One major challenge in miRNA studies is the identification of genes regulated by miRNAs. To this end, we have developed an online resource, miRDB (http://mirdb.org), for miRNA target prediction and functional annotations. Here, we describe recently updated features of miRDB, including 2.1 million predicted gene targets regulated by 6709 miRNAs. In addition to presenting precompiled prediction data, a new feature is the web server interface that allows submission of user-provided sequences for miRNA target prediction. In this way, users have the flexibility to study any custom miRNAs or target genes of interest. Another major update of miRDB is related to functional miRNA annotations. Although thousands of miRNAs have been identified, many of the reported miRNAs are not likely to play active functional roles or may even have been falsely identified as miRNAs from high-throughput studies. To address this issue, we have performed combined computational analyses and literature mining, and identified 568 and 452 functional miRNAs in humans and mice, respectively. These miRNAs, as well as associated functional annotations, are presented in the FuncMir Collection in miRDB. PMID:25378301

  20. An Approach to Function Annotation for Proteins of Unknown Function (PUFs) in the Transcriptome of Indian Mulberry

    PubMed Central

    Dhanyalakshmi, K. H.; Naika, Mahantesha B. N.; Sajeevan, R. S.; Mathew, Oommen K.; Shafi, K. Mohamed; Sowdhamini, Ramanathan; N. Nataraja, Karaba

    2016-01-01

    Modern sequencing technologies are generating large volumes of information at the transcriptome and genome level. Translation of this information into biological meaning lags far behind, so a significant portion of the proteins discovered remain proteins of unknown function (PUFs). Attempts to uncover the functional significance of PUFs are limited by the lack of easy and high-throughput functional annotation tools. Here, we report an approach to assign putative functions to PUFs identified in the transcriptome of mulberry, a perennial tree commonly cultivated as a host of the silkworm. We utilized the mulberry PUFs generated from leaf tissues exposed to drought stress at the whole-plant level. A sequence- and structure-based computational analysis predicted the probable function of the PUFs. For rapid and easy annotation of PUFs, we developed an automated pipeline by integrating diverse bioinformatics tools, designated as PUFs Annotation Server (PUFAS), which also provides a web service API (Application Programming Interface) for large-scale analysis up to a genome. The expression analysis of three selected PUFs annotated by the pipeline revealed abiotic stress responsiveness of the genes, and hence their potential role in stress acclimation pathways. The automated pipeline developed here could be extended to assign functions to PUFs from any organism in general. The PUFAS web server is available at http://caps.ncbs.res.in/pufas/ and the web service is accessible at http://capservices.ncbs.res.in/help/pufas. PMID:26982336

  1. Transcriptomal changes and functional annotation of the developing non-human primate choroid plexus

    PubMed Central

    Ek, C. Joakim; Nathanielsz, Peter; Li, Cun; Mallard, Carina

    2015-01-01

    The choroid plexuses are small organs that protrude into each brain ventricle, producing cerebrospinal fluid (CSF) that constantly bathes the brain. These organs differentiate early in development just after neural closure, at a stage when the brain is little vascularized. In recent years the plexus has been shown to have a much more active role in brain development than previously appreciated, as it can influence both neurogenesis and neural migration by secreting factors into the CSF. However, much of choroid plexus developmental function is still unclear. Most previous studies on this organ have been undertaken in rodents, but translation into humans is not straightforward since they have a different timing of brain maturation processes. We have collected choroid plexus from three fetal gestational ages of a non-human primate, the baboon, which has much closer brain development to humans. The transcriptome of the plexuses was determined by next generation sequencing, and Ingenuity Pathway Analysis software was used to annotate functions and enrichment of pathways of changes in the transcriptome. The number of unique transcripts decreased with development and the majority of differentially expressed transcripts were down-regulated through development, suggesting a more complex and active plexus earlier in fetal development. The functional annotation indicated changes across widespread biological functions in plexus development. In particular we find age-dependent regulation of genes associated with annotation categories: Gene Expression, Development of Cardiovascular System, Nervous System Development and Molecular Transport. Our observations support the idea that the choroid plexus has roles in shaping brain development. PMID:25814924

  2. Functional annotation of the human chromosome 7 "missing" proteins: a bioinformatics approach.

    PubMed

    Ranganathan, Shoba; Khan, Javed M; Garg, Gagan; Baker, Mark S

    2013-06-01

    The chromosome-centric human proteome project aims to systematically map all human proteins, chromosome by chromosome, in a gene-centric manner through dedicated efforts from national and international teams. This mapping will lead to a knowledge-based resource defining the full set of proteins encoded in each chromosome and laying the foundation for the development of a standardized approach to analyze the massive proteomic data sets currently being generated. The neXtProt database lists 946 proteins as the human proteome of chromosome 7. However, 170 (18%) proteins of human chromosome 7 have no evidence at the proteomic, antibody, or structural levels and are considered "missing" in this study as they lack experimental support. We have developed a protocol for the functional annotation of these "missing" proteins by integrating several bioinformatics analysis and annotation tools, sequential BLAST homology searches, protein domain/motif and gene ontology (GO) mapping, and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis. Using the BLAST search strategy, homologues for reviewed non-human mammalian proteins with protein evidence were identified for 90 "missing" proteins while another 38 had reviewed non-human mammalian homologues. Putative functional annotations were assigned to 27 of the remaining 43 novel proteins. Proteotypic peptides have been computationally generated to facilitate rapid identification of these proteins. Four of the "missing" chromosome 7 proteins have been substantiated by the ENCODE proteogenomic peptide data. PMID:23308364

  3. Protein Function Annotation By Local Binding Site Surface Similarity

    PubMed Central

    Spitzer, Russell; Cleves, Ann E.; Varela, Rocco; Jain, Ajay N.

    2013-01-01

    Hundreds of protein crystal structures exist for proteins whose function cannot be confidently determined from sequence similarity. Surflex-PSIM, a previously reported surface-based protein similarity algorithm, provides an alternative method for hypothesizing function for such proteins. The method now supports fully automatic binding site detection and is fast enough to screen comprehensive databases of protein binding sites. The binding site detection methodology was validated on apo/holo cognate protein pairs, correctly identifying 91% of ligand binding sites in holo structures and 88% in apo structures where corresponding sites existed. For correctly detected apo binding sites, the cognate holo site was the most similar binding site 87% of the time. PSIM was used to screen a set of proteins that had poorly characterized functions at the time of crystallization, but were later biochemically annotated. Using a fully automated protocol, this set of 8 proteins was screened against approximately 60,000 ligand binding sites from the PDB. PSIM correctly identified functional matches that pre-dated query protein biochemical annotation for five out of the eight query proteins. A panel of twelve currently unannotated proteins was also screened, resulting in a large number of statistically significant binding site matches, some of which suggest likely functions for the poorly characterized proteins. PMID:24166661

  4. Computational analysis of transcriptome of Indian major carp, Labeo rohita (Hamilton-Buchanan, 1822) for functional annotation

    PubMed Central

    Nagpure, Naresh Sahebrao; Rashid, Iliyas; Pathak, Ajey Kumar; Singh, Mahender; Singh, Shri Prakash; Sarkar, Uttam Kumar

    2012-01-01

    A total of 1671 ESTs of Labeo rohita were retrieved from the dbEST database and analysed for functional annotation using various computational approaches. The result indicated 1387 non-redundant (184 contigs and 1203 singletons) putative transcripts with an average length of 542 bp. These 1387 transcript sequences were matched against Refseq_RNA, UniGene and Swiss-Prot at a high threshold cut-off for functional annotation, with the help of gene ontology and SSR markers. We developed extensive Perl-based modules for processing all alignment files, comparing and extracting common hits from all files at a threshold, evaluating statistics for alignment results and assigning gene ontology terms. In this study, 92 putative transcripts were predicted as orthologous genes and, among those, 44 putative transcripts were annotated with gene ontology terms. The annotated orthologous genes are associated with some very important proteins of L. rohita involved in biotic and abiotic stresses, glucose metabolism of spermatogenic cells, etc. The unidentified transcripts, if found important in expression profiling, can be a vital resource after re-sequencing. The predicted genes can further be used for enhancing productivity and controlling disease of L. rohita. PMID:23275698

  5. Combining heterogeneous data sources for accurate functional annotation of proteins

    PubMed Central

    2013-01-01

    Combining heterogeneous sources of data is essential for accurate prediction of protein function. The task is complicated by the fact that while sequence-based features can be readily compared across species, most other data are species-specific. In this paper, we present a multi-view extension to GOstruct, a structured-output framework for function annotation of proteins. The extended framework can learn from disparate data sources, with each data source provided to the framework in the form of a kernel. Our empirical results demonstrate that the multi-view framework is able to utilize all available information, yielding better performance than sequence-based models trained across species and models trained from collections of data within a given species. This version of GOstruct participated in the recent Critical Assessment of Functional Annotations (CAFA) challenge; since then we have significantly improved the natural language processing component of the method, which now provides performance that is on par with that provided by sequence information. The GOstruct framework is available for download at http://strut.sourceforge.net. PMID:23514123

  6. FACT: Functional annotation transfer between proteins with similar feature architectures

    PubMed Central

    2010-01-01

    Background The increasing number of sequenced genomes provides the basis for exploring the genetic and functional diversity within the tree of life. Only a tiny fraction of the encoded proteins undergoes a thorough experimental characterization. For the remainder, bioinformatics annotation tools are the only means to infer their function. Exploiting significant sequence similarities to already characterized proteins, commonly taken as evidence for homology, is the prevalent method to deduce functional equivalence. Such methods fail when homologs are too diverged, or when they have assumed a different function. Finally, due to convergent evolution, functional equivalence is not necessarily linked to common ancestry. Therefore complementary approaches are required to identify functional equivalents. Results We present the Feature Architecture Comparison Tool (FACT; http://www.cibiv.at/FACT) to search for functionally equivalent proteins. FACT uses the similarity between feature architectures of two proteins, i.e., the arrangements of functional domains, secondary structure elements and compositional properties, as a proxy for their functional equivalence. A scoring function measures feature architecture similarities, which enables searching for functional equivalents in entire proteomes. Our evaluation of 9,570 EC classified enzymes revealed that FACT, using the full feature set, outperformed the existing architecture-based approaches by identifying significantly more functional equivalents as highest scoring proteins. We show that FACT can identify functional equivalents that share no significant sequence similarity. However, when the highest scoring protein of FACT is also the protein with the highest local sequence similarity, it is in 99% of the cases functionally equivalent to the query. We demonstrate the versatility of FACT by identifying a missing link in the yeast glutathione metabolism and also by searching for the human GolgA5 equivalent in Trypanosoma brucei. Conclusions FACT facilitates a quick and sensitive search for functionally equivalent proteins in entire proteomes. FACT is complementary to approaches using sequence similarity to identify proteins with the same function. Thus, FACT is particularly useful when functional equivalents need to be identified in evolutionarily distant species, or when functional equivalents are not homologous. The most reliable annotation transfers, however, are achieved when feature architecture similarity and sequence similarity are jointly taken into account. PMID:20696036

  7. CoMAGC: a corpus with multi-faceted annotations of gene-cancer relations

    PubMed Central

    2013-01-01

    Background In order to access the large amount of information in biomedical literature about genes implicated in various cancers both efficiently and accurately, the aid of text mining (TM) systems is invaluable. Current TM systems do target either gene-cancer relations or biological processes involving genes and cancers, but the former type produces information not comprehensive enough to explain how a gene affects a cancer, and the latter does not provide a concise summary of gene-cancer relations. Results In this paper, we present a corpus for the development of TM systems that are specifically targeting gene-cancer relations but are still able to capture complex information in biomedical sentences. We describe CoMAGC, a corpus with multi-faceted annotations of gene-cancer relations. In CoMAGC, a piece of annotation is composed of four semantically orthogonal concepts that together express 1) how a gene changes, 2) how a cancer changes and 3) the causality between the gene and the cancer. The multi-faceted annotations are shown to have high inter-annotator agreement. In addition, we show that the annotations in CoMAGC allow us to infer the prospective roles of genes in cancers and to classify the genes into three classes according to the inferred roles. We encode the mapping between multi-faceted annotations and gene classes into 10 inference rules. The inference rules produce results with high accuracy as measured against human annotations. CoMAGC consists of 821 sentences on prostate, breast and ovarian cancers. Currently, we deal with changes in gene expression levels among other types of gene changes. The corpus is available at http://biopathway.org/CoMAGC under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0). Conclusions The corpus will be an important resource for the development of advanced TM systems on gene-cancer relations. PMID:24225062

  8. The development of PIPA: an integrated and automated pipeline for genome-wide protein function annotation

    PubMed Central

    Yu, Chenggang; Zavaljevski, Nela; Desai, Valmik; Johnson, Seth; Stevens, Fred J; Reifman, Jaques

    2008-01-01

    Background Automated protein function prediction methods are needed to keep pace with high-throughput sequencing. With the existence of many programs and databases for inferring different protein functions, a pipeline that properly integrates these resources will benefit from the advantages of each method. However, integrated systems usually do not provide mechanisms to generate customized databases to predict particular protein functions. Here, we describe a tool termed PIPA (Pipeline for Protein Annotation) that has these capabilities. Results PIPA annotates protein functions by combining the results of multiple programs and databases, such as InterPro and the Conserved Domains Database, into common Gene Ontology (GO) terms. The major algorithms implemented in PIPA are: (1) a profile database generation algorithm, which generates customized profile databases to predict particular protein functions, (2) an automated ontology mapping generation algorithm, which maps various classification schemes into GO, and (3) a consensus algorithm to reconcile annotations from the integrated programs and databases. PIPA's profile generation algorithm is employed to construct the enzyme profile database CatFam, which predicts catalytic functions described by Enzyme Commission (EC) numbers. Validation tests show that CatFam yields average recall and precision larger than 95.0%. CatFam is integrated with PIPA. We use an association rule mining algorithm to automatically generate mappings between terms of two ontologies from annotated sample proteins. Incorporating the ontologies' hierarchical topology into the algorithm increases the number of generated mappings. In particular, it generates 40.0% additional mappings from the Clusters of Orthologous Groups (COG) to EC numbers and a six-fold increase in mappings from COG to GO terms. The mappings to EC numbers show a very high precision (99.8%) and recall (96.6%), while the mappings to GO terms show moderate precision (80.0%) and low recall (33.0%). Our consensus algorithm for GO annotation is based on the computation and propagation of likelihood scores associated with GO terms. The test results suggest that, for a given recall, the application of the consensus algorithm yields higher precision than when consensus is not used. Conclusion The algorithms implemented in PIPA provide automated genome-wide protein function annotation based on reconciled predictions from multiple resources. PMID:18221520

  9. Functional annotation of intrinsically disordered domains by their amino acid content using IDD Navigator.

    PubMed

    Patil, Ashwini; Teraguchi, Shunsuke; Dinh, Huy; Nakai, Kenta; Standley, Daron M

    2012-01-01

    Function prediction of intrinsically disordered domains (IDDs) using sequence similarity methods is limited by their high mutability and prevalence of low complexity regions. We describe a novel method for identifying similar IDDs by a similarity metric based on amino acid composition and identify significantly overrepresented Gene Ontology (GO) and Pfam domain annotations within highly similar IDDs. Applications and extensions of the proposed method are discussed, in particular with respect to protein functional annotation. We test the predicted annotations in a large-scale survey of IDDs in mouse and find that the proposed method provides significantly greater protein coverage in terms of function prediction than traditional sequence alignment methods like BLAST. As a proof of concept we examined several disorder-containing proteins: GRA15 and ROP16, both encoded in the parasitic protozoa T. gondii; Cyclon, a mostly uncharacterized protein involved in the regulation of immune cell death; STIM1, a protein essential for regulating calcium levels in the endoplasmic reticulum. We show that the overrepresented GO terms are consistent with recently-reported biological functions. We implemented the method in the web server IDD Navigator. IDD Navigator is available at http://sysimm.ifrec.osaka-u.ac.jp/disorder/beta.php. PMID:22174272
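
    The idea of comparing disordered domains by amino acid composition rather than by alignment can be sketched as follows. This is only an illustration: the abstract does not specify the similarity metric used by IDD Navigator, so the Euclidean distance between 20-dimensional composition vectors, and the function names, are assumptions made here.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def composition(sequence):
    """Return the fraction of each standard amino acid in the sequence."""
    sequence = sequence.upper()
    counts = [sequence.count(residue) for residue in AMINO_ACIDS]
    total = sum(counts)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [count / total for count in counts]


def composition_distance(seq_a, seq_b):
    """Euclidean distance between composition vectors (an assumed metric,
    not necessarily the one used by the published method)."""
    return math.dist(composition(seq_a), composition(seq_b))


# Residue order does not matter, only composition: these two toy
# sequences cannot be aligned well but have identical composition.
same = composition_distance("SSPPGGEE", "EGPSEGPS")
different = composition_distance("AAAA", "KKKK")
```

    Because the vectors ignore residue order, such a measure is insensitive to the rearrangements and low-complexity repeats that make alignment of disordered regions difficult, which is the motivation the abstract gives for a composition-based approach.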

  10. Functional Annotation of Putative Regulatory Elements at Cancer Susceptibility Loci

    PubMed Central

    Rosse, Stephanie A; Auer, Paul L; Carlson, Christopher S

    2014-01-01

    Most cancer-associated genetic variants identified from genome-wide association studies (GWAS) do not obviously change protein structure, leading to the hypothesis that the associations are attributable to regulatory polymorphisms. Translating genetic associations into mechanistic insights can be facilitated by knowledge of the causal regulatory variant (or variants) responsible for the statistical signal. Experimental validation of candidate functional variants is onerous, making bioinformatic approaches necessary to prioritize candidates for laboratory analysis. Thus, a systematic approach for recognizing functional (and, therefore, likely causal) variants in noncoding regions is an important step toward interpreting cancer risk loci. This review provides a detailed introduction to current regulatory variant annotations, followed by an overview of how to leverage these resources to prioritize candidate functional polymorphisms in regulatory regions. PMID:25288875

  11. Mining GO Annotations for Improving Annotation Consistency

    PubMed Central

    Faria, Daniel; Schlicker, Andreas; Pesquita, Catia; Bastos, Hugo; Ferreira, António E. N.; Albrecht, Mario; Falcão, André O.

    2012-01-01

    Despite the structure and objectivity provided by the Gene Ontology (GO), the annotation of proteins is a complex task that is subject to errors and inconsistencies. Electronically inferred annotations in particular are widely considered unreliable. However, given that manual curation of all GO annotations is unfeasible, it is imperative to improve the quality of electronically inferred annotations. In this work, we analyze the full GO molecular function annotation of UniProtKB proteins, and discuss some of the issues that affect their quality, focusing particularly on the lack of annotation consistency. Based on our analysis, we estimate that 64% of the UniProtKB proteins are incompletely annotated, and that inconsistent annotations affect 83% of the protein functions and at least 23% of the proteins. Additionally, we present and evaluate a data mining algorithm, based on the association rule learning methodology, for identifying implicit relationships between molecular function terms. The goal of this algorithm is to assist GO curators in updating GO and correcting and preventing inconsistent annotations. Our algorithm predicted 501 relationships with an estimated precision of 94%, whereas the basic association rule learning methodology predicted 12,352 relationships with a precision below 9%. PMID:22848383
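
    For readers unfamiliar with association rule learning, the basic methodology that the abstract uses as its baseline can be sketched as below. This is not the authors' improved algorithm. The input format, the function name, the thresholds and the toy annotations are hypothetical; only the standard definitions of support and confidence are relied upon.

```python
from collections import Counter
from itertools import permutations


def mine_term_rules(annotations, min_support, min_confidence):
    """Find rules 'term_a implies term_b' from per-protein term sets.

    support    = fraction of proteins annotated with both terms
    confidence = proteins with both terms / proteins with term_a
    annotations: dict mapping protein id -> set of terms (hypothetical format).
    """
    term_counts = Counter()
    pair_counts = Counter()
    for terms in annotations.values():
        terms = set(terms)
        term_counts.update(terms)
        # Ordered pairs, because a -> b and b -> a are different rules.
        pair_counts.update(permutations(sorted(terms), 2))

    n_proteins = len(annotations)
    rules = []
    for (term_a, term_b), both in pair_counts.items():
        support = both / n_proteins
        confidence = both / term_counts[term_a]
        if support >= min_support and confidence >= min_confidence:
            rules.append((term_a, term_b, support, confidence))
    return sorted(rules)


# Toy annotations with made-up protein ids, for illustration only.
toy = {
    "P1": {"kinase activity", "ATP binding"},
    "P2": {"kinase activity", "ATP binding"},
    "P3": {"ATP binding"},
    "P4": {"DNA binding"},
}
rules = mine_term_rules(toy, min_support=0.5, min_confidence=0.9)
```

    In the toy data every protein with "kinase activity" also has "ATP binding", so that rule passes with confidence 1.0, while the reverse rule reaches only 2/3 and is rejected. A rule that holds for most proteins but is missing on a few points to candidates for incomplete or inconsistent annotation.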

  12. A Syngeneic Variance Library for Functional Annotation of Human Variation: Application to BRCA2

    PubMed Central

    Hucl, Tomas; Rago, Carlo; Gallmeier, Eike; Brody, Jonathan R.; Gorospe, Myriam; Kern, Scott E.

    2008-01-01

    The enormous scope of natural human genetic variation is now becoming defined. Accurately annotating these variants, and identifying those with clinical importance, is often difficult through functional assays. We explored systematic annotation by using homologous recombination to modify a native gene in hemizygous (wt/Δexon) human cancer cells, generating a novel syngeneic variance library (SyVaL). We created a SyVaL of BRCA2 variants: nondeleterious, proposed deleterious, deleterious, and of uncertain significance. We found that the null states BRCA2Δex11/Δex11 and BRCA2Δex11/Y3308X were deleterious as assessed by a loss of RAD51 focus formation on genotoxic damage and by acquisition of toxic hypersensitivity to mitomycin C and etoposide, whereas BRCA2Δex11/Y3308Y, BRCA2Δex11/P3292L, and BRCA2Δex11/P3280H had wild-type function. A proposed phosphorylation site at codon 3291 affecting function was confirmed by substitution of an acidic residue (glutamate, BRCA2Δex11/S3291E) for the native serine, but in contrast to a prior report, phosphorylation was dispensable (alanine, BRCA2Δex11/S3291A) for BRCA2-governed cellular phenotypes. These results show that SyVaLs offer a means to comprehensively annotate gene function, facilitating numerical and unambiguous readouts. SyVaLs may be especially useful for genes in which functional assays using exogenous expression are toxic or otherwise unreliable. They also offer a stable, distributable cellular resource for further research. PMID:18593900

  13. A New Strategy to Identify and Annotate Human RPE-Specific Gene Expression

    PubMed Central

    Booij, Judith C.; ten Brink, Jacoline B.; Swagemakers, Sigrid M. A.; Verkerk, Annemieke J. M. H.; Essing, Anke H. W.; van der Spek, Peter J.; Bergen, Arthur A. B.

    2010-01-01

    Background To identify and functionally annotate cell type-specific gene expression in the human retinal pigment epithelium (RPE), a key tissue involved in age-related macular degeneration and retinitis pigmentosa. Methodology RPE, photoreceptor and choroidal cells were isolated from selected freshly frozen healthy human donor eyes using laser microdissection. RNA isolation, amplification and hybridization to 44 k microarrays were carried out according to Agilent specifications. Bioinformatics was carried out using Rosetta Resolver, DAVID and Ingenuity software. Principal Findings Our previous 22 k analysis of the RPE transcriptome showed that the RPE has high levels of protein synthesis and strong energy demands, and is exposed to high levels of oxidative stress and a variable degree of inflammation. We currently use a complementary new strategy aimed at the identification and functional annotation of RPE-specific expressed transcripts. This strategy takes advantage of the multilayered cellular structure of the retina and overcomes a number of limitations of previous studies. In triplicate, we compared the transcriptomes of RPE, photoreceptor and choroidal cells and we deduced RPE-specific expression. We identified at least 114 entries with RPE-specific gene expression. Thirty-nine of these 114 genes also show high expression in the RPE; comparison with the literature showed that 85% of these 39 were previously identified to be expressed in the RPE. In the group of 114 RPE-specific genes there was an overrepresentation of genes involved in (membrane) transport, vision and ophthalmic disease. More fundamentally, we found RPE-specific involvement in the RAR-activation, retinol metabolism and GABA receptor signaling pathways. Conclusions In this study we provide a further specification and understanding of the RPE transcriptome by identifying and analyzing genes that are specifically expressed in the RPE. PMID:20479888

  14. FastAnnotator- an efficient transcript annotation web tool

    PubMed Central

    2012-01-01

    Background Recent developments in high-throughput sequencing (HTS) technologies have made it feasible to sequence the complete transcriptomes of non-model organisms or metatranscriptomes from environmental samples. The challenge after generating hundreds of millions of sequences is to annotate these transcripts and classify the transcripts based on their putative functions. Because many biological scientists lack the knowledge to install Linux-based software packages or maintain databases used for transcript annotation, we developed an automatic annotation tool with an easy-to-use interface. Methods To elucidate the potential functions of gene transcripts, we integrated well-established annotation tools: Blast2GO, PRIAM and RPS BLAST in a web-based service, FastAnnotator, which can assign Gene Ontology (GO) terms, Enzyme Commission numbers (EC numbers) and functional domains to query sequences. Results Using six transcriptome sequence datasets as examples, we demonstrated the ability of FastAnnotator to assign functional annotations. FastAnnotator annotated 88.1% and 81.3% of the transcripts from the well-studied organisms Caenorhabditis elegans and Streptococcus parasanguinis, respectively. Furthermore, FastAnnotator annotated 62.9%, 20.4%, 53.1% and 42.0% of the sequences from the transcriptomes of sweet potato, clam, amoeba, and Trichomonas vaginalis, respectively, which lack reference genomes. We demonstrated that FastAnnotator can complete the annotation process in a reasonable amount of time and is suitable for the annotation of transcriptomes from model organisms or organisms for which annotated reference genomes are not available. Conclusions The sequencing process no longer represents the bottleneck in the study of genomics, and automatic annotation tools have become invaluable as the annotation procedure has become the limiting step. We present FastAnnotator, an automated annotation web tool designed to efficiently annotate sequences with their gene functions, enzyme functions or domains. FastAnnotator is useful in transcriptome studies and especially for those focusing on non-model organisms or metatranscriptomes. FastAnnotator does not require local installation and is freely available at http://fastannotator.cgu.edu.tw. PMID:23281853

  15. RefSeq curation and annotation of antizyme and antizyme inhibitor genes in vertebrates

    PubMed Central

    Rajput, Bhanu; Murphy, Terence D.; Pruitt, Kim D.

    2015-01-01

    Polyamines are ubiquitous cations that are involved in regulating fundamental cellular processes such as cell growth and proliferation; hence, their intracellular concentration is tightly regulated. Antizyme and antizyme inhibitor have a central role in maintaining cellular polyamine levels. Antizyme is unique in that it is expressed via a novel programmed ribosomal frameshifting mechanism. Conventional computational tools are unable to predict a programmed frameshift, resulting in misannotation of antizyme transcripts and proteins on transcript and genomic sequences. Correct annotation of a programmed frameshifting event requires manual evaluation. Our goal was to provide an accurately curated and annotated Reference Sequence (RefSeq) data set of antizyme transcript and protein records across a broad taxonomic scope that would serve as standards for accurate representation of these gene products. As antizyme and antizyme inhibitor proteins are functionally connected, we also curated antizyme inhibitor genes to more fully represent the elegant biology of polyamine regulation. Manual review of genes for three members of the antizyme family and two members of the antizyme inhibitor family in 91 vertebrate organisms resulted in a total of 461 curated RefSeq records. PMID:26170238

  16. Systematic functional genomics resource and annotation for poplar.

    PubMed

    Si, Jingna; Zhao, Xiyang; Zhao, Xinyin; Wu, Rongling

    2015-08-01

    Poplar, as a model species for forestry research, has many excellent characteristics. Studies on functional genes have provided the foundation, at the molecular level, for improving genetic traits and cultivating elite lines. Although studies on functional genes have been performed for many years, large amounts of experimental data remain scattered across various reports and have not been unified via comprehensive statistical analysis. This problem can be addressed by employing bioinformatic methodology and technology to gather and organise data to construct a Poplar Functional Gene Database, containing data on 207 poplar functional genes. As an example, the authors investigated genes of Populus euphratica involved in the response to salt stress. Four small cDNA libraries were constructed and treated with 300 mM NaCl or pure water for 6 and 24 h. Using high-throughput sequencing, they identified conserved and novel miRNAs that were differentially expressed. Target genes were next predicted and detailed functional information derived using the Gene Ontology database and Kyoto Encyclopedia of Genes and Genomes pathway analysis. This information provides a primary visual schema allowing us to understand the dynamics of the regulatory gene network responding to salt stress in Populus. PMID:26243833

  17. Cloning, analysis and functional annotation of expressed sequence tags from the Earthworm Eisenia fetida

    PubMed Central

    Pirooznia, Mehdi; Gong, Ping; Guan, Xin; Inouye, Laura S; Yang, Kuan; Perkins, Edward J; Deng, Youping

    2007-01-01

    Background Eisenia fetida, commonly known as red wiggler or compost worm, belongs to the Lumbricidae family of the Annelida phylum. Little is known about its genome sequence although it has been extensively used as a test organism in terrestrial ecotoxicology. In order to understand its gene expression response to environmental contaminants, we cloned 4032 cDNAs or expressed sequence tags (ESTs) from two E. fetida libraries enriched with genes responsive to ten ordnance related compounds using suppressive subtractive hybridization-PCR. Results A total of 3144 good quality ESTs (GenBank dbEST accession numbers EH669363–EH672369 and EL515444–EL515580) were obtained from the raw clone sequences after cleaning. Clustering analysis yielded 2231 unique sequences including 448 contigs (from 1361 ESTs) and 1783 singletons. Comparative genomic analysis showed that 743 or 33% of the unique sequences shared high similarity with existing genes in the GenBank nr database. Provisional function annotation assigned 830 Gene Ontology terms to 517 unique sequences based on their homology with the annotated genomes of four model organisms: Drosophila melanogaster, Mus musculus, Saccharomyces cerevisiae, and Caenorhabditis elegans. Seven percent of the unique sequences were further mapped to 99 Kyoto Encyclopedia of Genes and Genomes pathways based on their matching Enzyme Commission numbers. All the information is stored in and retrievable from a high-performance, web-based and user-friendly relational database called EST model database or ESTMD version 2. Conclusion The ESTMD containing the sequence and annotation information of 4032 E. fetida ESTs is publicly accessible at . PMID:18047730
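
    The clustering figures reported above are internally consistent, which the short check below makes explicit. The variable names are chosen for this illustration only; the numbers are those stated in the abstract.

```python
# Counts as stated in the abstract.
contigs = 448            # clusters assembled from two or more ESTs
ests_in_contigs = 1361   # ESTs that ended up inside those contigs
singletons = 1783        # ESTs that did not cluster with any other EST
similar_to_nr = 743      # unique sequences with high similarity in GenBank nr

# Unique sequences are contigs plus singletons.
unique_sequences = contigs + singletons

# Every good-quality EST is either inside a contig or a singleton.
good_quality_ests = ests_in_contigs + singletons

# Share of unique sequences with a high-similarity match, in percent.
percent_similar = 100 * similar_to_nr / unique_sequences
```

    The totals reproduce the 2231 unique sequences and 3144 good-quality ESTs given in the text, and the similarity share rounds to the reported 33%.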

  18. The Gene3D Web Services: a platform for identifying, annotating and comparing structural domains in protein sequences

    PubMed Central

    Yeats, Corin; Lees, Jonathan; Carter, Phil; Sillitoe, Ian; Orengo, Christine

    2011-01-01

    The Gene3D structural domain database provides domain annotations for 7 million proteins, based on the manually curated structural domain superfamilies in CATH. These annotations are integrated with functional, genomic and molecular information from external resources, such as GO, EC, UniProt and the NCBI Taxonomy database. We have constructed a set of web services that provide programmatic access to this integrated database, as well as the Gene3D domain recognition tool (Gene3DScan) and protein sequence annotation pipeline for analysing novel protein sequences. Example queries include retrieving all curated GO terms for a domain superfamily or all the multi-domain architectures for the human genome. The services can be accessed using simple HTTP calls and are able to return results in a range of formats for quick downloading and easy parsing, graphical rendering and data storage. Hence, they provide a simple, but flexible means of integrating domain annotations and associated data sets into locally run pipelines and analysis software. The services can be found at http://gene3d.biochem.ucl.ac.uk/WebServices/. PMID:21646335

  19. The Gene3D Web Services: a platform for identifying, annotating and comparing structural domains in protein sequences.

    PubMed

    Yeats, Corin; Lees, Jonathan; Carter, Phil; Sillitoe, Ian; Orengo, Christine

    2011-07-01

    The Gene3D structural domain database provides domain annotations for 7 million proteins, based on the manually curated structural domain superfamilies in CATH. These annotations are integrated with functional, genomic and molecular information from external resources, such as GO, EC, UniProt and the NCBI Taxonomy database. We have constructed a set of web services that provide programmatic access to this integrated database, as well as the Gene3D domain recognition tool (Gene3DScan) and protein sequence annotation pipeline for analysing novel protein sequences. Example queries include retrieving all curated GO terms for a domain superfamily or all the multi-domain architectures for the human genome. The services can be accessed using simple HTTP calls and are able to return results in a range of formats for quick downloading and easy parsing, graphical rendering and data storage. Hence, they provide a simple, but flexible means of integrating domain annotations and associated data sets into locally run pipelines and analysis software. The services can be found at http://gene3d.biochem.ucl.ac.uk/WebServices/. PMID:21646335

  20. Computational prediction of over-annotated protein-coding genes in the genome of Agrobacterium tumefaciens strain C58

    NASA Astrophysics Data System (ADS)

    Yu, Jia-Feng; Sui, Tian-Xiang; Wang, Hong-Mei; Wang, Chun-Ling; Jing, Li; Wang, Ji-Hua

    2015-12-01

    Agrobacterium tumefaciens strain C58 is a type of pathogen that can cause tumors in some dicotyledonous plants. Ever since the genome of A. tumefaciens strain C58 was sequenced, the quality of annotation of its protein-coding genes has been queried continually, because the annotation varies greatly among different databases. In this paper, the questionable hypothetical genes were re-predicted by integrating the TN curve and Z curve methods. As a result, 30 genes originally annotated as “hypothetical” were discriminated as being non-coding sequences. By testing the re-prediction program 10 times on data sets composed of the function-known genes, the mean accuracy of 99.99% and mean Matthews correlation coefficient value of 0.9999 were obtained. Further sequence analysis and COG analysis showed that the re-annotation results were very reliable. This work can provide an efficient tool and data resources for future studies of A. tumefaciens strain C58. Project supported by the National Natural Science Foundation of China (Grant Nos. 61302186 and 61271378) and the Funding from the State Key Laboratory of Bioelectronics of Southeast University.
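
    For readers unfamiliar with the two evaluation measures quoted above, the sketch below shows how accuracy and the Matthews correlation coefficient are computed from a binary confusion matrix. The confusion-matrix counts are hypothetical and are not taken from the study.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient for a binary classifier.

    Ranges from -1 (total disagreement) through 0 (chance level) to +1
    (perfect prediction). Returns 0.0 when a marginal total is zero, a common
    convention for the otherwise undefined case.
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Hypothetical counts: coding sequences are the positive class.
ACC = accuracy(tp=45, tn=40, fp=10, fn=5)
MCC = matthews_corrcoef(tp=45, tn=40, fp=10, fn=5)
```

    Unlike accuracy, the coefficient uses all four cells of the confusion matrix, so it stays informative when the coding and non-coding classes differ in size.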

  1. Comparison of GENCODE and RefSeq gene annotation and the impact of reference geneset on variant effect prediction

    PubMed Central

    2015-01-01

    Background A vast amount of DNA variation is being identified by increasingly large-scale exome and genome sequencing projects. To be useful, variants require accurate functional annotation and a wide range of tools are available to this end. McCarthy et al recently demonstrated the large differences in prediction of loss-of-function (LoF) variation when RefSeq and Ensembl transcripts are used for annotation, highlighting the importance of the reference transcripts on which variant functional annotation is based. Results We describe a detailed analysis of the similarities and differences between the gene and transcript annotation in the GENCODE and RefSeq genesets. We demonstrate that the GENCODE Comprehensive set is richer in alternative splicing, novel CDSs, novel exons and has higher genomic coverage than RefSeq, while the GENCODE Basic set is very similar to RefSeq. Using RNAseq data we show that exons and introns unique to one geneset are expressed at a similar level to those common to both. We present evidence that the differences in gene annotation lead to large differences in variant annotation where GENCODE and RefSeq are used as reference transcripts, although this is predominantly confined to non-coding transcripts and UTR sequence, with at most ~30% of LoF variants annotated discordantly. We also describe an investigation of dominant transcript expression, showing that it both supports the utility of the GENCODE Basic set in providing a smaller set of more highly expressed transcripts and provides a useful, biologically-relevant filter for further reducing the complexity of the transcriptome. Conclusions The reference transcripts selected for variant functional annotation do have a large effect on the outcome. The GENCODE Comprehensive transcripts contain more exons, have greater genomic coverage and capture many more variants than RefSeq in both genome and exome datasets, while the GENCODE Basic set shows a higher degree of concordance with RefSeq and has fewer unique features. We propose that the GENCODE Comprehensive set has great utility for the discovery of new variants with functional potential, while the GENCODE Basic set is more suitable for applications demanding less complex interpretation of functional variants. PMID:26110515

  2. GeneSense: a new approach for human gene annotation integrated with protein-protein interaction networks

    PubMed Central

    Chen, Zhongzhong; Zhang, Tianhong; Lin, Jun; Yan, Zidan; Wang, Yongren; Zheng, Weiqiang; Weng, Kevin C.

    2014-01-01

    Virtually all cellular functions involve protein-protein interactions (PPIs). As an increasing number of PPIs are identified and vast amount of information accumulated, researchers are finding different ways to interrogate the data and understand the interactions in context. However, it is widely recognized that a significant portion of the data is scattered, redundant, not considered high quality, and not readily accessible to researchers in a systematic fashion. In addition, it is challenging to identify the optimal protein targets in the current PPI networks. The GeneSense server was developed to integrate gene annotation and PPI networks in an expandable architecture that incorporates selected databases with the aim to assemble, analyze, evaluate and disseminate protein-protein association information in a comprehensive and user-friendly manner. Three network models including nodenet, leafnet and loopnet are used to identify the optimal protein targets in the complex networks. GeneSense is freely available at www.biomedsense.org/genesense.php. PMID:24667292

  3. Gene Annotation and Drug Target Discovery in Candida albicans with a Tagged Transposon Mutant Collection

    PubMed Central

    Oh, Julia; Fung, Eula; Schlecht, Ulrich; Davis, Ronald W.; Giaever, Guri; St. Onge, Robert P.; Deutschbauer, Adam; Nislow, Corey

    2010-01-01

    Candida albicans is the most common human fungal pathogen, causing infections that can be lethal in immunocompromised patients. Although Saccharomyces cerevisiae has been used as a model for C. albicans, it lacks C. albicans' diverse morphogenic forms and is primarily non-pathogenic. Comprehensive genetic analyses that have been instrumental for determining gene function in S. cerevisiae are hampered in C. albicans, due in part to limited resources to systematically assay phenotypes of loss-of-function alleles. Here, we constructed and screened a library of 3633 tagged heterozygous transposon disruption mutants, using them in a competitive growth assay to examine nutrient- and drug-dependent haploinsufficiency. We identified 269 genes that were haploinsufficient in four growth conditions, the majority of which were condition-specific. These screens identified two new genes necessary for filamentous growth as well as ten genes that function in essential processes. We also screened 57 chemically diverse compounds that more potently inhibited growth of C. albicans versus S. cerevisiae. For four of these compounds, we examined the genetic basis of this differential inhibition. Notably, Sec7p was identified as the target of brefeldin A in C. albicans screens, while S. cerevisiae screens with this compound failed to identify this target. We also uncovered a new C. albicans-specific target, Tfp1p, for the synthetic compound 0136-0228. These results highlight the value of haploinsufficiency screens directly in this pathogen for gene annotation and drug target identification. PMID:20949076

  4. Integrative structural annotation of de novo RNA-Seq provides an accurate reference gene set of the enormous genome of the onion (Allium cepa L.)

    PubMed Central

    Kim, Seungill; Kim, Myung-Shin; Kim, Yong-Min; Yeom, Seon-In; Cheong, Kyeongchae; Kim, Ki-Tae; Jeon, Jongbum; Kim, Sunggil; Kim, Do-Sun; Sohn, Seong-Han; Lee, Yong-Hwan; Choi, Doil

    2015-01-01

    The onion (Allium cepa L.) is one of the most widely cultivated and consumed vegetable crops in the world. Although a considerable amount of onion transcriptome data has been deposited into public databases, the sequences of the protein-coding genes are not accurate enough to be used, owing to non-coding sequences intermixed with the coding sequences. We generated a high-quality, annotated onion transcriptome from de novo sequence assembly and intensive structural annotation using the integrated structural gene annotation pipeline (ISGAP), which identified 54,165 protein-coding genes among 165,179 assembled transcripts totalling 203.0 Mb by eliminating the intron sequences. ISGAP performed reliable annotation, recognizing accurate gene structures based on reference proteins, and ab initio gene models of the assembled transcripts. Integrative functional annotation and gene-based SNP analysis revealed a whole biological repertoire of genes and transcriptomic variation in the onion. The method developed in this study provides a powerful tool for the construction of reference gene sets for organisms based solely on de novo transcriptome data. Furthermore, the reference genes and their variation described here for the onion represent essential tools for molecular breeding and gene cloning in Allium spp. PMID:25362073

  5. Integrative structural annotation of de novo RNA-Seq provides an accurate reference gene set of the enormous genome of the onion (Allium cepa L.).

    PubMed

    Kim, Seungill; Kim, Myung-Shin; Kim, Yong-Min; Yeom, Seon-In; Cheong, Kyeongchae; Kim, Ki-Tae; Jeon, Jongbum; Kim, Sunggil; Kim, Do-Sun; Sohn, Seong-Han; Lee, Yong-Hwan; Choi, Doil

    2015-02-01

    The onion (Allium cepa L.) is one of the most widely cultivated and consumed vegetable crops in the world. Although a considerable amount of onion transcriptome data has been deposited into public databases, the sequences of the protein-coding genes are not accurate enough to be used, owing to non-coding sequences intermixed with the coding sequences. We generated a high-quality, annotated onion transcriptome from de novo sequence assembly and intensive structural annotation using the integrated structural gene annotation pipeline (ISGAP), which identified 54,165 protein-coding genes among 165,179 assembled transcripts totalling 203.0 Mb by eliminating the intron sequences. ISGAP performed reliable annotation, recognizing accurate gene structures based on reference proteins, and ab initio gene models of the assembled transcripts. Integrative functional annotation and gene-based SNP analysis revealed a whole biological repertoire of genes and transcriptomic variation in the onion. The method developed in this study provides a powerful tool for the construction of reference gene sets for organisms based solely on de novo transcriptome data. Furthermore, the reference genes and their variation described here for the onion represent essential tools for molecular breeding and gene cloning in Allium spp. PMID:25362073

  6. Approaching the Functional Annotation of Fungal Virulence Factors Using Cross-Species Genetic Interaction Profiling

    PubMed Central

    Brown, Jessica C. S.; Madhani, Hiten D.

    2012-01-01

    In many human fungal pathogens, genes required for disease remain largely unannotated, limiting the impact of virulence gene discovery efforts. We tested the utility of a cross-species genetic interaction profiling approach to obtain clues to the molecular function of unannotated pathogenicity factors in the human pathogen Cryptococcus neoformans. This approach involves expression of C. neoformans genes of interest in each member of the Saccharomyces cerevisiae gene deletion library, quantification of their impact on growth, and calculation of the cross-species genetic interaction profiles. To develop functional predictions, we computed and analyzed the correlations of these profiles with existing genetic interaction profiles of S. cerevisiae deletion mutants. For C. neoformans LIV7, which has no S. cerevisiae ortholog, this profiling approach predicted an unanticipated role in the Golgi apparatus. Validation studies in C. neoformans demonstrated that Liv7 is a functional Golgi factor where it promotes the suppression of the exposure of a specific immunostimulatory molecule, mannose, on the cell surface, thereby inhibiting phagocytosis. The genetic interaction profile of another pathogenicity gene that lacks an S. cerevisiae ortholog, LIV6, strongly predicted a role in endosome function. This prediction was also supported by studies of the corresponding C. neoformans null mutant. Our results demonstrate the utility of quantitative cross-species genetic interaction profiling for the functional annotation of fungal pathogenicity proteins of unknown function including, surprisingly, those that are not conserved in sequence across fungi. PMID:23300468

  7. Genome Wide Re-Annotation of Caldicellulosiruptor saccharolyticus with New Insights into Genes Involved in Biomass Degradation and Hydrogen Production

    PubMed Central

    Chowdhary, Nupoor; Selvaraj, Ashok; KrishnaKumaar, Lakshmi; Kumar, Gopal Ramesh

    2015-01-01

    Caldicellulosiruptor saccharolyticus has proven itself to be an excellent candidate for biological hydrogen (H2) production, but it still has major drawbacks, such as sensitivity to high osmotic pressure and low volumetric H2 productivity, which should be addressed before it can be used industrially. A whole-genome re-annotation was carried out to update the incomplete genome information, which leaves gaps in knowledge (especially in the area of metabolic engineering), with the aim of improving the H2-producing capabilities of C. saccharolyticus. Whole-genome re-annotation was performed manually for 2,682 Coding Sequences (CDSs). Bioinformatics tools based on sequence similarity, motif search, phylogenetic analysis and fold recognition were employed for re-annotation. Our methodology successfully added functions for 409 hypothetical proteins (HPs) and 46 proteins previously annotated as putative, and assigned more accurate functions to the known protein sequences. Homology-based gene annotation has been used as a standard method for assigning function to novel proteins, but over the past few years many non-homology-based methods, such as genomic context approaches for protein function prediction, have been developed. Using non-homology-based functional prediction methods, we were able to assign cellular processes or physical complexes for 249 hypothetical sequences. Our re-annotation pipeline highlights the addition of 231 new CDSs generated from the MicroScope Platform to the original genome, with functional prediction for 49 of them. The re-annotation of HPs and new CDSs is stored in the relational database that is available on the MicroScope web-based platform. In parallel, comparative genome analyses were performed among the members of the genus Caldicellulosiruptor to understand their function and evolutionary processes. Further, with results from integrated re-annotation studies (homology and genomic context approach), we strongly suggest that Csac_0437 and Csac_0424 encode glycoside hydrolases (GH) and are proposed to be involved in the decomposition of recalcitrant plant polysaccharides. Similarly, the HPs Csac_0732, Csac_1862, Csac_1294 and Csac_0668 are suggested to play a significant role in biohydrogen production. Function prediction of these HPs by using our integrated approach will considerably enhance the interpretation of large-scale experiments targeting this industrially important organism. PMID:26196387

  8. Large-scale collection and annotation of gene models for date palm (Phoenix dactylifera, L.).

    PubMed

    Zhang, Guangyu; Pan, Linlin; Yin, Yuxin; Liu, Wanfei; Huang, Dawei; Zhang, Tongwu; Wang, Lei; Xin, Chengqi; Lin, Qiang; Sun, Gaoyuan; Ba Abdullah, Mohammed M; Zhang, Xiaowei; Hu, Songnian; Al-Mssallem, Ibrahim S; Yu, Jun

    2012-08-01

    The date palm (Phoenix dactylifera L.), famed for its sugar-rich fruits (dates) and cultivated by humans since 4,000 B.C., is an economically important crop in the Middle East, Northern Africa, and increasingly other places where climates are suitable. Despite a long history of human cultivation, the understanding of P. dactylifera genetics and molecular biology is rather limited, hindered by a lack of high-quality basic data from genomics and transcriptomics. Here we report a large-scale effort in generating gene models (expressed sequence tags, or ESTs, assembled and mapped to a genome assembly) for P. dactylifera, using the long-read pyrosequencing platform (Roche/454 GS FLX Titanium) in high coverage. We built fourteen cDNA libraries from different P. dactylifera tissues (cultivar Khalas) and acquired 15,778,993 raw sequencing reads (about one million sequencing reads per library), and the pooled sequences were assembled into 67,651 non-redundant contigs and 301,978 singletons. We annotated 52,725 contigs based on the plant databases and 45 contigs based on functional domains referencing the Pfam database. From the annotated contigs, we assigned GO (Gene Ontology) terms to 36,086 contigs and KEGG pathways to 7,032 contigs. Our comparative analysis showed that 70.6 % (47,930), 69.4 % (47,089), 68.4 % (46,441), and 69.3 % (47,048) of the P. dactylifera gene models are shared with rice, sorghum, Arabidopsis, and grapevine, respectively. We also classified our gene models into house-keeping and tissue-specific genes based on their tissue specificity. PMID:22736259

  9. Initiating the mollusk genomics annotation community: toward creating the complete curated gene-set of the Japanese Pearl Oyster, Pinctada fucata.

    PubMed

    Kawashima, Takeshi; Takeuchi, Takeshi; Koyanagi, Ryo; Kinoshita, Shigeharu; Endo, Hirotoshi; Endo, Kazuyoshi

    2013-10-01

    The genome sequence of the Japanese pearl oyster, the first draft genome from a mollusk, was published in February 2012. In order to curate the draft genome assemblies and annotate the predicted gene models, two annotation Jamborees were held in Okinawa and Tokyo. To date, 761 genes have been surveyed and curated. A preparatory meeting and a debriefing were held at the Misaki Marine Biological Station before and after the Jamborees. These four events, in conjunction with the sequence-decoding project, have facilitated the first series of gene annotations. Genome annotators among the Jamboree participants added 22 functional categories to the annotation system to date. Of these, 17 are included in Generic Gene Ontology. The other five categories are specific to molluskan biology, such as "Byssus Formation" and "Shell Formation", including Biomineralization and Acidic Proteins. A total of 731 genes from our latest version of gene models are annotated and classified into these 22 categories. The resulting data will serve as a useful reference for future genomic analyses of this species as well as comparative analyses among mollusks. PMID:24125643

  10. High-resolution functional annotation of human transcriptome: predicting isoform functions by a novel multiple instance-based label propagation method.

    PubMed

    Li, Wenyuan; Kang, Shuli; Liu, Chun-Chi; Zhang, Shihua; Shi, Yi; Liu, Yan; Zhou, Xianghong Jasmine

    2014-04-01

    Alternative transcript processing is an important mechanism for generating functional diversity in genes. However, little is known about the precise functions of individual isoforms. In fact, proteins (translated from transcript isoforms), not genes, are the function carriers. By integrating multiple human RNA-seq data sets, we carried out the first systematic prediction of isoform functions, enabling high-resolution functional annotation of the human transcriptome. Unlike gene function prediction, isoform function prediction faces a unique challenge: a lack of training data, because all known functional annotations are at the gene level. To address this challenge, we modelled the gene-isoform relationships as multiple instance data and developed a novel label propagation method to predict functions. Our method achieved an average area under the receiver operating characteristic curve of 0.67 and assigned functions to 15 572 isoforms. Interestingly, we observed that different functions have different sensitivities to alternative isoform processing, and that the function diversity of isoforms from the same gene is positively correlated with their tissue expression diversity. Finally, we surveyed the literature to validate our predictions for a number of apoptotic genes. Strikingly, for the famous 'TP53' gene, we not only accurately identified the apoptosis regulation function of its five isoforms, but also correctly predicted the precise direction of the regulation. PMID:24369432

  11. A Novel Method for Functional Annotation Prediction Based on Combination of Classification Methods

    PubMed Central

    Jung, Jaehee; Lee, Heung Ki

    2014-01-01

    Automated protein function prediction is the assignment of functions to uncharacterized proteins by using computational methods. This technique is useful to automatically assign gene functional annotations for undefined sequences in next generation genome analysis (NGS). NGS is a popular research method since high-throughput technologies such as DNA sequencing and microarrays have created large sets of genes. These huge sequences have greatly increased the need for analysis. Previous research has been based on the similarities of sequences as this is strongly related to the functional homology. However, this study aimed to designate protein functions by automatically predicting the function of the genome by utilizing InterPro (IPR), which can represent the properties of the protein family and groups of the protein function. Moreover, we used gene ontology (GO), which is the controlled vocabulary used to comprehensively describe the protein function. To define the relationship between IPR and GO terms, three pattern recognition techniques have been employed under different conditions, such as feature selection and weighted value, instead of a binary one. PMID:25133242

  12. Reduce Manual Curation by Combining Gene Predictions from Multiple Annotation Engines, a Case Study of Start Codon Prediction

    PubMed Central

    Ederveen, Thomas H. A.; Overmars, Lex; van Hijum, Sacha A. F. T.

    2013-01-01

    Nowadays, prokaryotic genomes are sequenced faster than the capacity to manually curate gene annotations. Automated genome annotation engines provide users a straight-forward and complete solution for predicting ORF coordinates and function. For many labs, the use of AGEs is therefore essential to decrease the time necessary for annotating a given prokaryotic genome. However, it is not uncommon for AGEs to provide different and sometimes conflicting predictions. Combining multiple AGEs might allow for more accurate predictions. Here we analyzed the ab initio open reading frame (ORF) calling performance of different AGEs based on curated genome annotations of eight strains from different bacterial species with GC% ranging from 35–52%. We present a case study which demonstrates a novel way of comparative genome annotation, using combinations of AGEs in a pre-defined order (or path) to predict ORF start codons. The order of AGE combinations is from high to low specificity, where the specificity is based on the eight genome annotations. For each AGE combination we are able to derive a so-called projected confidence value, which is the average specificity of ORF start codon prediction based on the eight genomes. The projected confidence enables estimating likeliness of a correct prediction for a particular ORF start codon by a particular AGE combination, pinpointing ORFs notoriously difficult to predict start codons. We correctly predict start codons for 90.5±4.8% of the genes in a genome (based on the eight genomes) with an accuracy of 81.1±7.6%. Our consensus-path methodology allows a marked improvement over majority voting (9.7±4.4%) and with an optimal path ORF start prediction sensitivity is gained while maintaining a high specificity. PMID:23675487
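
    The consensus-path idea in the abstract above can be illustrated with a small sketch. It assumes each annotation engine reports one start coordinate per ORF and that the path is an ordered list of engine combinations, each with a projected confidence. The engine names, coordinates, confidence values and function names are all hypothetical and are not taken from the paper.

```python
from collections import Counter


def majority_vote(starts):
    """Return the most frequently predicted start, or None on a tie or no calls."""
    counts = Counter(s for s in starts.values() if s is not None)
    if not counts:
        return None
    ranked = counts.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None  # tie: majority voting cannot decide
    return ranked[0][0]


def consensus_path(starts, path):
    """Walk the path in order and return (start, confidence) for the first
    engine combination whose members all agree on a start coordinate."""
    for engines, confidence in path:
        calls = {starts.get(engine) for engine in engines}
        if len(calls) == 1 and None not in calls:
            return calls.pop(), confidence
    return None, 0.0


# Hypothetical start calls for one ORF from four engines.
starts = {"A": 120, "B": 120, "C": 150, "D": 150}

# Hypothetical path, ordered from high to low projected confidence.
path = [
    (("A", "B", "C"), 0.95),
    (("A", "B"), 0.90),
    (("C",), 0.70),
]

vote = majority_vote(starts)
start, confidence = consensus_path(starts, path)
```

    In this toy case majority voting is undecided because the engines split two against two, while the path still yields a call together with a confidence estimate for it.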

  13. Genome-wide Annotation, Identification, and Global Transcriptomic Analysis of Regulatory or Small RNA Gene Expression in Staphylococcus aureus

    PubMed Central

    Weiss, Andy; Broach, William H.; Wiemels, Richard E.; Mogen, Austin B.; Rice, Kelly C.

    2016-01-01

    In Staphylococcus aureus, hundreds of small regulatory or small RNAs (sRNAs) have been identified, yet this class of molecule remains poorly understood and severely understudied. sRNA genes are typically absent from genome annotation files, and as a consequence, their existence is often overlooked, particularly in global transcriptomic studies. To facilitate improved detection and analysis of sRNAs in S. aureus, we generated updated GenBank files for three commonly used S. aureus strains (MRSA252, NCTC 8325, and USA300), in which we added annotations for >260 previously identified sRNAs. These files, the first to include genome-wide annotation of sRNAs in S. aureus, were then used as a foundation to identify novel sRNAs in the community-associated methicillin-resistant strain USA300. This analysis led to the discovery of 39 previously unidentified sRNAs. Investigating the genomic loci of the newly identified sRNAs revealed a surprising degree of inconsistency in genome annotation in S. aureus, which may be hindering the analysis and functional exploration of these elements. Finally, using our newly created annotation files as a reference, we perform a global analysis of sRNA gene expression in S. aureus and demonstrate that the newly identified tsr25 is the most highly upregulated sRNA in human serum. This study provides an invaluable resource to the S. aureus research community in the form of our newly generated annotation files, while at the same time presenting the first examination of differential sRNA expression in pathophysiologically relevant conditions. PMID:26861020

  14. High-resolution functional annotation of human transcriptome: predicting isoform functions by a novel multiple instance-based label propagation method

    PubMed Central

    Li, Wenyuan; Kang, Shuli; Liu, Chun-Chi; Zhang, Shihua; Shi, Yi; Liu, Yan; Zhou, Xianghong Jasmine

    2014-01-01

    Alternative transcript processing is an important mechanism for generating functional diversity in genes. However, little is known about the precise functions of individual isoforms. In fact, proteins (translated from transcript isoforms), not genes, are the function carriers. By integrating multiple human RNA-seq data sets, we carried out the first systematic prediction of isoform functions, enabling high-resolution functional annotation of human transcriptome. Unlike gene function prediction, isoform function prediction faces a unique challenge: the lack of the training data—all known functional annotations are at the gene level. To address this challenge, we modelled the gene–isoform relationships as multiple instance data and developed a novel label propagation method to predict functions. Our method achieved an average area under the receiver operating characteristic curve of 0.67 and assigned functions to 15 572 isoforms. Interestingly, we observed that different functions have different sensitivities to alternative isoform processing, and that the function diversity of isoforms from the same gene is positively correlated with their tissue expression diversity. Finally, we surveyed the literature to validate our predictions for a number of apoptotic genes. Strikingly, for the famous ‘TP53’ gene, we not only accurately identified the apoptosis regulation function of its five isoforms, but also correctly predicted the precise direction of the regulation. PMID:24369432

  15. SpectroGene: A Tool for Proteogenomic Annotations Using Top-Down Spectra.

    PubMed

    Kolmogorov, Mikhail; Liu, Xiaowen; Pevzner, Pavel A

    2016-01-01

    In the past decade, proteogenomics has emerged as a valuable technique that contributes to the state-of-the-art in genome annotation; however, previous proteogenomic studies were limited to bottom-up mass spectrometry and did not take advantage of top-down approaches. We show that top-down proteogenomics allows one to address the problems that remained beyond the reach of traditional bottom-up proteogenomics. In particular, we show that top-down proteogenomics leads to the discovery of previously unannotated genes even in extensively studied bacterial genomes and present SpectroGene, a software tool for genome annotation using top-down tandem mass spectra. We further show that top-down proteogenomics searches (against the six-frame translation of a genome) identify nearly all proteoforms found in traditional top-down proteomics searches (against the annotated proteome). SpectroGene is freely available at http://github.com/fenderglass/SpectroGene . PMID:26629978

  16. Finding genes in Schistosoma japonicum: annotating novel genomes with help of extrinsic evidence

    PubMed Central

    Brejová, Broňa; Vinař, Tomáš; Chen, Yangyi; Wang, Shengyue; Zhao, Guoping; Brown, Daniel G.; Li, Ming; Zhou, Yan

    2009-01-01

    We have developed a novel method for estimating the parameters of hidden Markov models for gene finding in newly sequenced species. Our approach does not rely on curated training data sets, but instead uses extrinsic evidence (including paired-end ditags that have not been used in gene finding previously) and iterative training. This new method is particularly suitable for annotation of species with large evolutionary distance to the closest annotated species. We have used our approach to produce an initial annotation of more than 16 000 genes in the newly sequenced Schistosoma japonicum draft genome. We established the high quality of our predictions by comparison to full-length cDNAs (withdrawn from the extrinsic evidence) and to CEGMA core genes. We also evaluated the effectiveness of the new training procedure on Caenorhabditis elegans genome. ExonHunter and the newest parametric files for S. japonicum genome are available for download at www.bioinformatics.uwaterloo.ca/downloads/exonhunter PMID:19264800

  17. In silico functional pathway annotation of 86 established prostate cancer risk variants.

    PubMed

    Loo, Lenora W M; Fong, Aaron Y W; Cheng, Iona; Le Marchand, Loïc

    2015-01-01

    Heritability is one of the strongest risk factors of prostate cancer, emphasizing the importance of the genetic contribution towards prostate cancer risk. To date, 86 established prostate cancer risk variants have been identified by genome-wide association studies (GWAS). To determine if these risk variants are located near genes that interact together in biological networks or pathways contributing to prostate cancer initiation or progression, we generated gene sets based on proximity to the 86 prostate cancer risk variants. We took two approaches to generate gene lists. The first strategy included all immediate flanking genes, up- and downstream of the risk variant, regardless of distance from the index variant, and the second strategy included genes closest to the index GWAS marker and to variants in high LD (r² ≥ 0.8 in Europeans) with the index variant, within a 100 kb window up- and downstream. Pathway mapping of the two gene sets supported the importance of the androgen receptor-mediated signaling in prostate cancer biology. In addition, the hedgehog and Wnt/β-catenin signaling pathways were identified in pathway mapping for the flanking gene set. We also used the HaploReg resource to examine the 86 risk loci and variants in high LD (r² ≥ 0.8) for functional elements. We found that there was a 12.8 fold (p = 2.9 × 10⁻⁴) enrichment for enhancer motifs in a stem cell line and a 4.4 fold (p = 1.1 × 10⁻³) enrichment of DNase hypersensitivity in a prostate adenocarcinoma cell line, indicating that the risk and correlated variants are enriched for transcriptional regulatory motifs. Our pathway-based functional annotation of the prostate cancer risk variants highlights the potential regulatory function that GWAS risk markers, and their highly correlated variants, exert on genes. Our study also shows that these genes may function cooperatively in key signaling pathways in prostate cancer biology. PMID:25658610
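
    The two gene-list strategies described in this abstract can be sketched on toy coordinates. The sketch assumes genes are given as (name, start, end) tuples on a single chromosome. The gene names, positions and function names are hypothetical, and the handling of variants in high LD is left out for brevity.

```python
def flanking_genes(pos, genes):
    """Strategy 1: nearest gene on each side of the variant, at any distance.
    Genes that overlap the variant position are not treated as flanking here."""
    upstream = [g for g in genes if g[2] < pos]      # gene ends before the variant
    downstream = [g for g in genes if g[1] > pos]    # gene starts after the variant
    result = []
    if upstream:
        result.append(max(upstream, key=lambda g: g[2])[0])
    if downstream:
        result.append(min(downstream, key=lambda g: g[1])[0])
    return result


def closest_gene_in_window(pos, genes, window=100_000):
    """Strategy 2: single closest gene, only if it lies within the window."""
    def distance(gene):
        _, start, end = gene
        if start <= pos <= end:
            return 0
        return min(abs(pos - start), abs(pos - end))

    candidates = [g for g in genes if distance(g) <= window]
    return min(candidates, key=distance)[0] if candidates else None


# Hypothetical gene models: (name, start, end).
genes = [("G1", 1_000, 2_000), ("G2", 500_000, 510_000), ("G3", 1_000_000, 1_010_000)]

flank = flanking_genes(300_000, genes)
near_far = closest_gene_in_window(300_000, genes)
near_close = closest_gene_in_window(450_000, genes)
```

    A variant in a gene desert still receives two flanking genes under the first strategy but no gene under the windowed strategy, which is one way the two gene sets can differ.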

  18. Re-annotation of the CAZy genes of Trichoderma reesei and transcription in the presence of lignocellulosic substrates

    PubMed Central

    2012-01-01

    Background Trichoderma reesei is a soft rot Ascomycota fungus utilised for industrial production of secreted enzymes, especially lignocellulose degrading enzymes. About 30 carbohydrate active enzymes (CAZymes) of T. reesei have been biochemically characterised. Genome sequencing has revealed a large number of novel candidates for CAZymes, thus increasing the potential for identification of enzymes with novel activities and properties. Plenty of data exists on the carbon source dependent regulation of the characterised hydrolytic genes. However, information on the expression of the novel CAZyme genes, especially on complex biomass material, is very limited. Results In this study, the CAZyme gene content of the T. reesei genome was updated and the annotations of the genes refined using both computational and manual approaches. Phylogenetic analysis was done to assist the annotation and to identify functionally diversified CAZymes. The analyses identified 201 glycoside hydrolase genes, 22 carbohydrate esterase genes and five polysaccharide lyase genes. Updated or novel functional predictions were assigned to 44 genes, and the phylogenetic analysis indicated further functional diversification within enzyme families or groups of enzymes. GH3 β-glucosidases, GH27 α-galactosidases and GH18 chitinases were especially functionally diverse. The expression of the lignocellulose degrading enzyme system of T. reesei was studied by cultivating the fungus in the presence of different inducing substrates and by subjecting the cultures to transcriptional profiling. The substrates included both defined and complex lignocellulose related materials, such as pretreated bagasse, wheat straw, spruce, xylan, Avicel cellulose and sophorose. The analysis revealed co-regulated groups of CAZyme genes, such as genes induced in all the conditions studied and also genes induced preferentially by a certain set of substrates. Conclusions In this study, the CAZyme content of the T. reesei genome was updated, the discrepancies between the different genome versions and published literature were removed and the annotation of many of the genes was refined. Expression analysis of the genes gave information on the enzyme activities potentially induced by the presence of the different substrates. Comparison of the expression profiles of the CAZyme genes under the different conditions identified co-regulated groups of genes, suggesting common regulatory mechanisms for the gene groups. PMID:23035824

  19. A spectral approach integrating functional genomic annotations for coding and noncoding variants.

    PubMed

    Ionita-Laza, Iuliana; McCallum, Kenneth; Xu, Bin; Buxbaum, Joseph D

    2016-02-01

    Over the past few years, substantial effort has been put into the functional annotation of variation in human genome sequences. Such annotations can have a critical role in identifying putatively causal variants for a disease or trait among the abundant natural variation that occurs at a locus of interest. The main challenges in using these various annotations include their large numbers and their diversity. Here we develop an unsupervised approach to integrate these different annotations into one measure of functional importance (Eigen) that, unlike most existing methods, is not based on any labeled training data. We show that the resulting meta-score has better discriminatory ability using disease-associated and putatively benign variants from published studies (in both coding and noncoding regions) than the recently proposed CADD score. Across varied scenarios, the Eigen score performs generally better than any single individual annotation, representing a powerful single functional score that can be incorporated in fine-mapping studies. PMID:26727659

  20. Neurolinguistic Annotated Bibliography (Brain Research and Language Function) with Implications for Education.

    ERIC Educational Resources Information Center

    Davis, Wesley K.

    This bibliography presents annotations of 91 journal articles, books, chapters in books, and conference papers dating from 1967 to 1984 concerning neurolinguistics, language processing, and educational implications of brain research. The annotated bibliography includes eight items on neuroanatomy and language function; 20 items on neurolinguistics…

  1. Insect genome content phylogeny and functional annotation of core insect genomes.

    PubMed

    Rosenfeld, Jeffrey A; Foox, Jonathan; DeSalle, Rob

    2016-04-01

    Twenty-one fully sequenced and well annotated insect genomes were examined for genome content in a phylogenetic context. Gene presence/absence matrices and phylogenetic trees were constructed using several phylogenetic criteria. The role of e-value on phylogenetic analysis and genome content characterization is examined using scaled e-value cutoffs and a single linkage clustering approach to orthology determination. Previous studies have focused on the role of gene loss in terminals in the insect tree of life. The present study examines several common ancestral nodes in the insect tree. We suggest that the common ancestors of major insect groups like Diptera, Hymenoptera, Hemiptera and Holometabola experience more gene gain than gene loss. This suggests that as major insect groups arose, their genomic repertoire expanded through gene duplication (segmental duplications), followed by contraction by gene loss in specific terminal lineages. In addition, we examine the functional significance of the loss and gain of genes in the divergence of some of the major insect groups. PMID:26549428
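
    Single linkage clustering under an e-value cutoff, as mentioned in this abstract, can be sketched with a union-find structure: any two genes joined by a hit at or below the cutoff end up in the same cluster, and loosening the cutoff merges clusters. The gene identifiers, e-values and function name below are hypothetical, and this is a generic illustration, not the authors' pipeline.

```python
def single_linkage_clusters(hits, cutoff):
    """Group genes into clusters by single linkage.

    hits: iterable of (gene_a, gene_b, evalue) pairwise similarity hits.
    Genes are linked when evalue <= cutoff; links are transitive.
    Returns a sorted list of sorted gene lists.
    """
    parent = {}

    def find(gene):
        parent.setdefault(gene, gene)
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]  # path halving
            gene = parent[gene]
        return gene

    for gene_a, gene_b, evalue in hits:
        root_a, root_b = find(gene_a), find(gene_b)  # registers both genes
        if evalue <= cutoff and root_a != root_b:
            parent[root_a] = root_b

    clusters = {}
    for gene in list(parent):
        clusters.setdefault(find(gene), set()).add(gene)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical hits between genes labelled "species|gene".
hits = [
    ("dmel|g1", "amel|g7", 1e-40),
    ("amel|g7", "tcas|g3", 1e-12),
    ("dmel|g2", "tcas|g9", 1e-3),
]

strict = single_linkage_clusters(hits, 1e-20)
medium = single_linkage_clusters(hits, 1e-10)
loose = single_linkage_clusters(hits, 1e-2)
```

    Each cluster can then be turned into one row of a gene presence/absence matrix by recording which species prefixes occur among its members.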

  2. Rice DB: an Oryza Information Portal linking annotation, subcellular location, function, expression, regulation, and evolutionary information for rice and Arabidopsis

    PubMed Central

    Narsai, Reena; Devenish, James; Castleden, Ian; Narsai, Kabir; Xu, Lin; Shou, Huixia; Whelan, James

    2013-01-01

    Omics research in Oryza sativa (rice) relies on the use of multiple databases to obtain different types of information to define gene function. We present Rice DB, an Oryza information portal that is a functional genomics database, linking gene loci to comprehensive annotations, expression data and the subcellular location of encoded proteins. Rice DB has been designed to integrate the direct comparison of rice with Arabidopsis (Arabidopsis thaliana), based on orthology or ‘expressology’, thus using and combining available information from two pre-eminent plant models. To establish Rice DB, gene identifiers (more than 40 types) and annotations from a variety of sources were compiled, functional information based on large-scale and individual studies was manually collated, hundreds of microarrays were analysed to generate expression annotations, and the occurrences of potential functional regulatory motifs in promoter regions were calculated. A range of computational subcellular localization predictions were also run for all putative proteins encoded in the rice genome, and experimentally confirmed protein localizations have been collated, curated and linked to functional studies in rice. A single search box allows anything from gene identifiers (for rice and/or Arabidopsis), motif sequences, subcellular location, to keyword searches to be entered, with the capability of Boolean searches (such as AND/OR). To demonstrate the utility of Rice DB, several examples are presented including a rice mitochondrial proteome, which draws on a variety of sources for subcellular location data within Rice DB. Comparisons of subcellular location, functional annotations, as well as transcript expression in parallel with Arabidopsis reveals examples of conservation between rice and Arabidopsis, using Rice DB (http://ricedb.plantenergy.uwa.edu.au). PMID:24147765

  3. PANDA: pathway and annotation explorer for visualizing and interpreting gene-centric data.

    PubMed

    Hart, Steven N; Moore, Raymond M; Zimmermann, Michael T; Oliver, Gavin R; Egan, Jan B; Bryce, Alan H; Kocher, Jean-Pierre A

    2015-01-01

    Objective. Bringing together genomics, transcriptomics, proteomics, and other -omics technologies is an important step towards developing highly personalized medicine. However, instrumentation has advanced far beyond expectations and now we are able to generate data faster than it can be interpreted. Materials and Methods. We have developed PANDA (Pathway AND Annotation) Explorer, a visualization tool that integrates gene-level annotation in the context of biological pathways to help interpret complex data from disparate sources. PANDA is a web-based application that displays data in the context of well-studied pathways like KEGG, BioCarta, and PharmGKB. PANDA represents data/annotations as icons in the graph while maintaining the other data elements (i.e., other columns for the table of annotations). Custom pathways from underrepresented diseases can be imported when existing data sources are inadequate. PANDA also allows sharing annotations among collaborators. Results. In our first use case, we show how easy it is to view supplemental data from a manuscript in the context of a user's own data. Another use-case is provided describing how PANDA was leveraged to design a treatment strategy from the somatic variants found in the tumor of a patient with metastatic sarcomatoid renal cell carcinoma. Conclusion. PANDA facilitates the interpretation of gene-centric annotations by visually integrating this information with context of biological pathways. The application can be downloaded or used directly from our website: http://bioinformaticstools.mayo.edu/research/panda-viewer/. PMID:26038725

  4. Structural and Functional Annotation of the Porcine Immunome

    Technology Transfer Automated Retrieval System (TEKTRAN)

    The domestic pig is known as an excellent model for human immunology and the two species share many pathogens. Susceptibility to infectious disease is one of the major constraints on swine performance, yet the structure and function of genes comprising the pig immunome are not well-characterized. H...

  5. GOsummaries: an R Package for Visual Functional Annotation of Experimental Data

    PubMed Central

    Kolde, Raivo; Vilo, Jaak

    2015-01-01

    Functional characterisation of gene lists using Gene Ontology (GO) enrichment analysis is a common approach in computational biology, since many analysis methods end up with a list of genes as a result. Often there can be hundreds of functional terms that are significantly associated with a single list of genes and proper interpretation of such results can be a challenging endeavour. There are methods to visualise and aid the interpretation of these results, but most of them are limited to the results associated with one list of genes. However, in practice the number of gene lists can be considerably higher and common tools are not effective in such situations. We introduce a novel R package, 'GOsummaries' that visualises the GO enrichment results as concise word clouds that can be combined together if the number of gene lists is larger. By also adding the graphs of corresponding raw experimental data, GOsummaries can create informative summary plots for various analyses such as differential expression or clustering. The case studies show that the GOsummaries plots allow rapid functional characterisation of complex sets of gene lists. The GOsummaries approach is particularly effective for Principal Component Analysis (PCA). By adding functional annotation to the principal components, GOsummaries improves significantly the interpretability of PCA results. The GOsummaries layout for PCA can be effective even in situations where we cannot directly apply the GO analysis. For example, in case of metabolomics or metagenomics data it is possible to show the features with significant associations to the components instead of GO terms. The GOsummaries package is available under GPL-2 licence at Bioconductor (http://www.bioconductor.org/packages/release/bioc/html/GOsummaries.html). PMID:26913188

  6. RNA-Seq Analysis of Quercus pubescens Leaves: De Novo Transcriptome Assembly, Annotation and Functional Markers Development

    PubMed Central

    Torre, Sara; Tattini, Massimiliano; Brunetti, Cecilia; Fineschi, Silvia; Fini, Alessio; Ferrini, Francesco; Sebastiani, Federico

    2014-01-01

    Quercus pubescens Willd., a species distributed from Spain to southwest Asia, ranks high for drought tolerance among European oaks. Q. pubescens performs a role of outstanding significance in most Mediterranean forest ecosystems, but few mechanistic studies have been conducted to explore its response to environmental constraints, due to the lack of genomic resources. In our study, we performed a deep transcriptomic sequencing in Q. pubescens leaves, including de novo assembly, functional annotation and the identification of new molecular markers. Our results are a pre-requisite for undertaking molecular functional studies, and may give support in population and association genetic studies. 254,265,700 clean reads were generated by the Illumina HiSeq 2000 platform, with an average length of 98 bp. De novo assembly, using CLC Genomics, produced 96,006 contigs, having a mean length of 618 bp. Sequence similarity analyses against seven public databases (Uniprot, NR, RefSeq and KOGs at NCBI, Pfam, InterPro and KEGG) resulted in 83,065 transcripts annotated with gene descriptions, conserved protein domains, or gene ontology terms. These annotations and local BLAST allowed the identification of genes specifically associated with mechanisms of drought avoidance. Finally, 14,202 microsatellite markers and 18,425 single nucleotide polymorphisms (SNPs) were discovered in silico in assembled and annotated sequences. We completed a successful global analysis of the Q. pubescens leaf transcriptome using RNA-seq. The assembled and annotated sequences together with newly discovered molecular markers provide genomic information for functional genomic studies in Q. pubescens, with special emphasis on response mechanisms to the severe constraints of the Mediterranean climate. Our tools enable comparative genomics studies on other Quercus species taking advantage of large intra-specific ecophysiological differences. PMID:25393112

  7. RNA-seq analysis of Quercus pubescens Leaves: de novo transcriptome assembly, annotation and functional markers development.

    PubMed

    Torre, Sara; Tattini, Massimiliano; Brunetti, Cecilia; Fineschi, Silvia; Fini, Alessio; Ferrini, Francesco; Sebastiani, Federico

    2014-01-01

    Quercus pubescens Willd., a species distributed from Spain to southwest Asia, ranks high for drought tolerance among European oaks. Q. pubescens performs a role of outstanding significance in most Mediterranean forest ecosystems, but few mechanistic studies have been conducted to explore its response to environmental constraints, due to the lack of genomic resources. In our study, we performed a deep transcriptomic sequencing in Q. pubescens leaves, including de novo assembly, functional annotation and the identification of new molecular markers. Our results are a pre-requisite for undertaking molecular functional studies, and may give support in population and association genetic studies. 254,265,700 clean reads were generated by the Illumina HiSeq 2000 platform, with an average length of 98 bp. De novo assembly, using CLC Genomics, produced 96,006 contigs, having a mean length of 618 bp. Sequence similarity analyses against seven public databases (Uniprot, NR, RefSeq and KOGs at NCBI, Pfam, InterPro and KEGG) resulted in 83,065 transcripts annotated with gene descriptions, conserved protein domains, or gene ontology terms. These annotations and local BLAST allowed the identification of genes specifically associated with mechanisms of drought avoidance. Finally, 14,202 microsatellite markers and 18,425 single nucleotide polymorphisms (SNPs) were discovered in silico in assembled and annotated sequences. We completed a successful global analysis of the Q. pubescens leaf transcriptome using RNA-seq. The assembled and annotated sequences together with newly discovered molecular markers provide genomic information for functional genomic studies in Q. pubescens, with special emphasis on response mechanisms to the severe constraints of the Mediterranean climate. Our tools enable comparative genomics studies on other Quercus species taking advantage of large intra-specific ecophysiological differences. PMID:25393112

  8. Genome-wide computational identification and manual annotation of human long noncoding RNA genes

    PubMed Central

    Jia, Hui; Osak, Maureen; Bogu, Gireesh K.; Stanton, Lawrence W.; Johnson, Rory; Lipovich, Leonard

    2010-01-01

    Experimental evidence suggests that half or more of the mammalian transcriptome consists of noncoding RNA. Noncoding RNAs are divided into short noncoding RNAs (including microRNAs) and long noncoding RNAs (lncRNAs). We defined complementary DNAs (cDNAs) lacking any positive-strand open reading frames (ORFs) longer than 30 amino acids, as well as cDNAs lacking any evidence of interspecies conservation of their longer-than-30-amino acid ORFs, as noncoding. We have identified 5446 lncRNA genes in the human genome from ~24,000 full-length cDNAs, using our new ORF-prediction pipeline. We combined them nonredundantly with lncRNAs from four published sources to derive 6736 lncRNA genes. In an effort to distinguish standalone and antisense lncRNA genes from database artifacts, we stratified our catalog of lncRNAs according to the distance between each lncRNA gene candidate and its nearest known protein-coding gene. We concurrently examined the protein-coding capacity of known genes overlapping with lncRNAs. Remarkably, 62% of known genes with “hypothetical protein” names actually lacked protein-coding capacity. This study has greatly expanded the known human lncRNA catalog, increased its accuracy through manual annotation of cDNA-to-genome alignments, and revealed that a large set of hypothetical-protein genes in GenBank lacks protein-coding capacity. In addition, we have developed, independently of existing NCBI tools, command-line programs with high-throughput ORF-finding and BLASTP-parsing functionality, suitable for future automated assessments of protein-coding capacity of novel transcripts. PMID:20587619

  9. Genome-wide computational identification and manual annotation of human long noncoding RNA genes.

    PubMed

    Jia, Hui; Osak, Maureen; Bogu, Gireesh K; Stanton, Lawrence W; Johnson, Rory; Lipovich, Leonard

    2010-08-01

    Experimental evidence suggests that half or more of the mammalian transcriptome consists of noncoding RNA. Noncoding RNAs are divided into short noncoding RNAs (including microRNAs) and long noncoding RNAs (lncRNAs). We defined complementary DNAs (cDNAs) lacking any positive-strand open reading frames (ORFs) longer than 30 amino acids, as well as cDNAs lacking any evidence of interspecies conservation of their longer-than-30-amino acid ORFs, as noncoding. We have identified 5446 lncRNA genes in the human genome from approximately 24,000 full-length cDNAs, using our new ORF-prediction pipeline. We combined them nonredundantly with lncRNAs from four published sources to derive 6736 lncRNA genes. In an effort to distinguish standalone and antisense lncRNA genes from database artifacts, we stratified our catalog of lncRNAs according to the distance between each lncRNA gene candidate and its nearest known protein-coding gene. We concurrently examined the protein-coding capacity of known genes overlapping with lncRNAs. Remarkably, 62% of known genes with "hypothetical protein" names actually lacked protein-coding capacity. This study has greatly expanded the known human lncRNA catalog, increased its accuracy through manual annotation of cDNA-to-genome alignments, and revealed that a large set of hypothetical-protein genes in GenBank lacks protein-coding capacity. In addition, we have developed, independently of existing NCBI tools, command-line programs with high-throughput ORF-finding and BLASTP-parsing functionality, suitable for future automated assessments of protein-coding capacity of novel transcripts. PMID:20587619
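
    The first noncoding criterion in this abstract (no positive-strand ORF longer than 30 amino acids) can be sketched as a scan of the three forward reading frames. The sketch assumes an ORF runs from an ATG to the first in-frame stop codon and ignores ORFs that run off the end of the sequence. The conservation test is not modelled, the function names are hypothetical, and this is not the authors' command-line program.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_forward_orf_aa(seq):
    """Length in amino acids (start Met included, stop excluded) of the longest
    complete ATG-to-stop ORF in the three forward frames; 0 if there is none."""
    seq = seq.upper().replace("U", "T")
    best = 0
    for frame in range(3):
        length = None  # None means we are not currently inside an ORF
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if length is None:
                if codon == "ATG":
                    length = 1
            elif codon in STOP_CODONS:
                best = max(best, length)
                length = None
            else:
                length += 1
    return best


def is_noncoding_candidate(seq, max_orf_aa=30):
    """True when no forward-strand ORF is longer than max_orf_aa amino acids."""
    return longest_forward_orf_aa(seq) <= max_orf_aa


# Toy sequences: a 31-codon ORF and an 11-codon ORF, each followed by a stop.
long_orf = "ATG" + "GCT" * 30 + "TAA"
short_orf = "ATG" + "GCT" * 10 + "TAA"

long_len = longest_forward_orf_aa(long_orf)
short_len = longest_forward_orf_aa(short_orf)
```

    A transcript passing this test would only be a candidate; the abstract's second criterion, which concerns conservation of longer ORFs, needs cross-species comparison that is outside this sketch.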

  10. Functional annotation of the transcriptome of Sorghum bicolor in response to osmotic stress and abscisic acid

    PubMed Central

    2011-01-01

    Background Higher plants exhibit remarkable phenotypic plasticity allowing them to adapt to an extensive range of environmental conditions. Sorghum is a cereal crop that exhibits exceptional tolerance to adverse conditions, in particular, water-limiting environments. This study utilized next generation sequencing (NGS) technology to examine the transcriptome of sorghum plants challenged with osmotic stress and exogenous abscisic acid (ABA) in order to elucidate genes and gene networks that contribute to sorghum's tolerance to water-limiting environments with a long-term aim of developing strategies to improve plant productivity under drought. Results RNA-Seq results revealed transcriptional activity of 28,335 unique genes from sorghum root and shoot tissues subjected to polyethylene glycol (PEG)-induced osmotic stress or exogenous ABA. Differential gene expression analyses in response to osmotic stress and ABA revealed a strong interplay among various metabolic pathways including abscisic acid and 13-lipoxygenase, salicylic acid, jasmonic acid, and plant defense pathways. Transcription factor analysis indicated that groups of genes may be co-regulated by similar regulatory sequences to which the expressed transcription factors bind. We successfully exploited the data presented here in conjunction with published transcriptome analyses for rice, maize, and Arabidopsis to discover more than 50 differentially expressed, drought-responsive gene orthologs for which no function had been previously ascribed. Conclusions The present study provides an initial assemblage of sorghum genes and gene networks regulated by osmotic stress and hormonal treatment. We are providing an RNA-Seq data set and an initial collection of transcription factors, which offer a preliminary look into the cascade of global gene expression patterns that arise in a drought tolerant crop subjected to abiotic stress. 
These resources will allow scientists to query gene expression and functional annotation in response to drought. PMID:22008187

  11. In Silico Structural and Functional Annotation of Hypothetical Proteins of Vibrio cholerae O139

    PubMed Central

    Islam, Md. Saiful; Shahik, Shah Md.; Sohel, Md.; Patwary, Noman I. A.

    2015-01-01

    In developing countries, the threat of cholera is a significant health concern whenever water purification and sewage disposal systems are inadequate. Vibrio cholerae is one of the bacteria responsible for cholera. The complete genome sequence of V. cholerae reveals the presence of various genes and hypothetical proteins whose functions are not yet understood. Hence, analyzing and annotating the structure and function of hypothetical proteins is important for understanding V. cholerae. V. cholerae O139 is the most common and pathogenic bacterial strain among various V. cholerae strains. In this study, the sequences of six hypothetical proteins of V. cholerae O139 from NCBI have been annotated. Various computational tools and databases have been used to determine domain family, protein-protein interaction, protein solubility, ligand binding sites, etc. The three-dimensional structures of two proteins were modeled and their ligand binding sites were identified. We found domains and families for only one protein. The analysis revealed that these proteins might have antibiotic resistance activity, DNA breaking-rejoining activity, integrase enzyme activity, restriction endonuclease activity, etc. Structural prediction of these proteins and detection of their binding sites in this study indicate potential targets that could aid docking studies for therapeutic design against cholera. PMID:26175663

  12. Multi-Trait GWAS and New Candidate Genes Annotation for Growth Curve Parameters in Brahman Cattle

    PubMed Central

    Crispim, Aline Camporez; Kelly, Matthew John; Guimarães, Simone Eliza Facioni; e Silva, Fabyano Fonseca; Fortes, Marina Rufino Salinas; Wenceslau, Raphael Rocha; Moore, Stephen

    2015-01-01

    Understanding the genetic architecture of beef cattle growth cannot be limited simply to the genome-wide association study (GWAS) for body weight at any specific ages, but should be extended to a more general purpose by considering the whole growth trajectory over time using a growth curve approach. For such an approach, the parameters that are used to describe growth curves were treated as phenotypes under a GWAS model. Data from 1,255 Brahman cattle that were weighed at birth, 6, 12, 15, 18, and 24 months of age were analyzed. Parameter estimates, such as mature weight (A) and maturity rate (K) from nonlinear models are utilized as substitutes for the original body weights for the GWAS analysis. We chose the best nonlinear model to describe the weight-age data, and the estimated parameters were used as phenotypes in a multi-trait GWAS. Our aims were to identify and characterize associated SNP markers to indicate SNP-derived candidate genes and annotate their function as related to growth processes in beef cattle. The Brody model presented the best goodness of fit, and the heritability values for the parameter estimates for mature weight (A) and maturity rate (K) were 0.23 and 0.32, respectively, proving that these traits can be a feasible alternative when the objective is to change the shape of growth curves within genetic improvement programs. The genetic correlation between A and K was -0.84, indicating that animals with lower mature body weights reached that weight at younger ages. One hundred and sixty seven (167) and two hundred and sixty two (262) significant SNPs were associated with A and K, respectively. The annotated genes closest to the most significant SNPs for A had direct biological functions related to muscle development (RAB28), myogenic induction (BTG1), fetal growth (IL2), and body weights (APEX2); K genes were functionally associated with body weight, body height, average daily gain (TMEM18), and skeletal muscle development (SMN1). 
Candidate genes emerging from this GWAS may inform the search for causative mutations that could underpin genomic breeding for improved growth rates. PMID:26445451
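
    The growth-curve parameters used as GWAS phenotypes above come from the Brody model, commonly written W(t) = A(1 - B·exp(-K·t)). The sketch below evaluates that curve at the weighing ages listed in the abstract. The parameter values are hypothetical and chosen only for illustration; they are not estimates from the study, and no model fitting is attempted here.

```python
import math


def brody_weight(age_months: float, A: float, B: float, K: float) -> float:
    """Brody growth curve: A is mature weight, K is maturity rate,
    B is an integration constant tied to weight at birth."""
    return A * (1.0 - B * math.exp(-K * age_months))


def age_at_fraction(u: float, B: float, K: float) -> float:
    """Age at which the curve reaches fraction u (0 < u < 1) of A.
    A negative result means that fraction is already exceeded at birth."""
    return -math.log((1.0 - u) / B) / K


# Weighing ages from the abstract (birth, 6, 12, 15, 18 and 24 months).
AGES = [0, 6, 12, 15, 18, 24]

# Hypothetical parameters for illustration only (kg, unitless, per month).
A, B, K = 500.0, 0.9, 0.1

curve = [brody_weight(t, A, B, K) for t in AGES]
```

    With A and B held fixed, a larger K shortens `age_at_fraction`, which is the sense in which K describes how quickly an animal approaches its mature weight.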

  13. eggNOG 4.5: a hierarchical orthology framework with improved functional annotations for eukaryotic, prokaryotic and viral sequences

    PubMed Central

    Huerta-Cepas, Jaime; Szklarczyk, Damian; Forslund, Kristoffer; Cook, Helen; Heller, Davide; Walter, Mathias C.; Rattei, Thomas; Mende, Daniel R.; Sunagawa, Shinichi; Kuhn, Michael; Jensen, Lars Juhl; von Mering, Christian; Bork, Peer

    2016-01-01

    eggNOG is a public resource that provides Orthologous Groups (OGs) of proteins at different taxonomic levels, each with integrated and summarized functional annotations. Developments since the latest public release include changes to the algorithm for creating OGs across taxonomic levels, making nested groups hierarchically consistent. This allows for a better propagation of functional terms across nested OGs and led to the novel annotation of 95 890 previously uncharacterized OGs, increasing overall annotation coverage from 67% to 72%. The functional annotations of OGs have been expanded to also provide Gene Ontology terms, KEGG pathways and SMART/Pfam domains for each group. Moreover, eggNOG now provides pairwise orthology relationships within OGs based on analysis of phylogenetic trees. We have also incorporated a framework for quickly mapping novel sequences to OGs based on precomputed HMM profiles. Finally, eggNOG version 4.5 incorporates a novel data set spanning 2605 viral OGs, covering 5228 proteins from 352 viral proteomes. All data are accessible for bulk downloading, as a web-service, and through a completely redesigned web interface. The new access points provide faster searches and a number of new browsing and visualization capabilities, facilitating the needs of both experts and less experienced users. eggNOG v4.5 is available at http://eggnog.embl.de. PMID:26582926

  14. eggNOG 4.5: a hierarchical orthology framework with improved functional annotations for eukaryotic, prokaryotic and viral sequences.

    PubMed

    Huerta-Cepas, Jaime; Szklarczyk, Damian; Forslund, Kristoffer; Cook, Helen; Heller, Davide; Walter, Mathias C; Rattei, Thomas; Mende, Daniel R; Sunagawa, Shinichi; Kuhn, Michael; Jensen, Lars Juhl; von Mering, Christian; Bork, Peer

    2016-01-01

    eggNOG is a public resource that provides Orthologous Groups (OGs) of proteins at different taxonomic levels, each with integrated and summarized functional annotations. Developments since the latest public release include changes to the algorithm for creating OGs across taxonomic levels, making nested groups hierarchically consistent. This allows for a better propagation of functional terms across nested OGs and led to the novel annotation of 95 890 previously uncharacterized OGs, increasing overall annotation coverage from 67% to 72%. The functional annotations of OGs have been expanded to also provide Gene Ontology terms, KEGG pathways and SMART/Pfam domains for each group. Moreover, eggNOG now provides pairwise orthology relationships within OGs based on analysis of phylogenetic trees. We have also incorporated a framework for quickly mapping novel sequences to OGs based on precomputed HMM profiles. Finally, eggNOG version 4.5 incorporates a novel data set spanning 2605 viral OGs, covering 5228 proteins from 352 viral proteomes. All data are accessible for bulk downloading, as a web-service, and through a completely redesigned web interface. The new access points provide faster searches and a number of new browsing and visualization capabilities, facilitating the needs of both experts and less experienced users. eggNOG v4.5 is available at http://eggnog.embl.de. PMID:26582926

  15. Comparison of assembly algorithms for improving rate of metatranscriptomic functional annotation

    PubMed Central

    2014-01-01

    Background Microbiome-wide gene expression profiling through high-throughput RNA sequencing (‘metatranscriptomics’) offers a powerful means to functionally interrogate complex microbial communities. Key to successful exploitation of these datasets is the ability to confidently match relatively short sequence reads to known bacterial transcripts. In the absence of reference genomes, such annotation efforts may be enhanced by assembling reads into longer contiguous sequences (‘contigs’), prior to database search strategies. Since reads from homologous transcripts may derive from several species, represented at different abundance levels, it is not clear how well current assembly pipelines perform for metatranscriptomic datasets. Here we evaluate the performance of four currently employed assemblers including de novo transcriptome assemblers - Trinity and Oases; the metagenomic assembler - Metavelvet; and the recently developed metatranscriptomic assembler IDBA-MT. Results We evaluated the performance of the assemblers on a previously published dataset of single-end RNA sequence reads derived from the large intestine of an inbred non-obese diabetic mouse model of type 1 diabetes. We found that Trinity performed best as judged by contigs assembled, reads assigned to contigs, and number of reads that could be annotated to a known bacterial transcript. Only 15.5% of RNA sequence reads could be annotated to a known transcript in contrast to 50.3% with Trinity assembly. Paired-end reads generated from the same mouse samples resulted in modest performance gains. A database search estimated that the assemblies are unlikely to erroneously merge multiple unrelated genes sharing a region of similarity (<2% of contigs). A simulated dataset based on ten species confirmed these findings. A more complex simulated dataset based on 72 species found that greater assembly errors were introduced than is expected by sequencing quality. 
Through the detailed evaluation of assembly performance, the insights provided by this study will help drive the design of future metatranscriptomic analyses. Conclusion Assembly of metatranscriptome datasets greatly improved read annotation. Of the four assemblers evaluated, Trinity provided the best performance. For more complex datasets, reads generated from transcripts sharing considerable sequence similarity can be a source of significant assembly error, suggesting a need to collate reads on the basis of common taxonomic origin prior to assembly. PMID:25411636

  16. Re-Annotation of Protein-Coding Genes in 10 Complete Genomes of Neisseriaceae Family by Combining Similarity-Based and Composition-Based Methods

    PubMed Central

    Guo, Feng-Biao; Xiong, Lifeng; Teng, Jade L. L.; Yuen, Kwok-Yung; Lau, Susanna K. P.; Woo, Patrick C. Y.

    2013-01-01

    In this paper, we performed a comprehensive re-annotation of protein-coding genes by a systematic method combining composition- and similarity-based approaches in 10 complete bacterial genomes of the family Neisseriaceae. First, 418 hypothetical genes were predicted as non-coding using the composition-based method and 413 were eliminated from the gene list. Both the scatter plot and cluster of orthologous groups (COG) fraction analyses supported the result. Second, between 20 and 400 hypothetical proteins were assigned functions in each of the 10 strains based on the homology search. Among the newly assigned functions, 397 are detailed enough to have definite gene names. Third, 106 genes missed by the original annotations were picked up by an ab initio gene finder combined with similarity alignment. Transcriptional experiments validated the effectiveness of this method in Laribacter hongkongensis and Chromobacterium violaceum. Among the 106 newly found genes, some deserve particular interest. For example, 27 transposases were newly found in Neisseria meningitidis alpha14. In Neisseria gonorrhoeae NCCP11945, four new genes with putative functions and definite names (nusG, rpsN, rpmD and infA) were found, and their homologues are usually essential for survival in bacteria. The updated annotations for the 10 Neisseriaceae genomes provide a more accurate prediction of protein-coding genes and more detailed functional information on hypothetical proteins. They will benefit research into the lifestyle, metabolism, environmental adaptation and pathogenicity of the Neisseriaceae species. The re-annotation procedure could be used directly, or after the adaptation of detailed methods, for checking annotations of any other bacterial or archaeal genomes. PMID:23571676

  17. Re-annotation of protein-coding genes in 10 complete genomes of Neisseriaceae family by combining similarity-based and composition-based methods.

    PubMed

    Guo, Feng-Biao; Xiong, Lifeng; Teng, Jade L L; Yuen, Kwok-Yung; Lau, Susanna K P; Woo, Patrick C Y

    2013-06-01

    In this paper, we performed a comprehensive re-annotation of protein-coding genes by a systematic method combining composition- and similarity-based approaches in 10 complete bacterial genomes of the family Neisseriaceae. First, 418 hypothetical genes were predicted as non-coding using the composition-based method and 413 were eliminated from the gene list. Both the scatter plot and cluster of orthologous groups (COG) fraction analyses supported the result. Second, between 20 and 400 hypothetical proteins were assigned functions in each of the 10 strains based on the homology search. Among the newly assigned functions, 397 are detailed enough to have definite gene names. Third, 106 genes missed by the original annotations were picked up by an ab initio gene finder combined with similarity alignment. Transcriptional experiments validated the effectiveness of this method in Laribacter hongkongensis and Chromobacterium violaceum. Among the 106 newly found genes, some deserve particular interest. For example, 27 transposases were newly found in Neisseria meningitidis alpha14. In Neisseria gonorrhoeae NCCP11945, four new genes with putative functions and definite names (nusG, rpsN, rpmD and infA) were found, and their homologues are usually essential for survival in bacteria. The updated annotations for the 10 Neisseriaceae genomes provide a more accurate prediction of protein-coding genes and more detailed functional information on hypothetical proteins. They will benefit research into the lifestyle, metabolism, environmental adaptation and pathogenicity of the Neisseriaceae species. The re-annotation procedure could be used directly, or after the adaptation of detailed methods, for checking annotations of any other bacterial or archaeal genomes. PMID:23571676

  18. Djinn Lite: a tool for customised gene transcript modelling, annotation-data enrichment and exploration

    PubMed Central

    Teber, Erdahl T; Crawford, Edward; Bolton, Kent B; Van Dyk, Derek; Schofield, Peter R; Kapoor, Vimal; Church, W Bret

    2006-01-01

    Background There is an ever increasing rate of data made available on genetic variation, transcriptomes and proteomes. Similarly, a growing variety of bioinformatic programs are becoming available from many diverse sources, designed to identify a myriad of sequence patterns considered to have potential biological importance within inter-genic regions, genes, transcripts, and proteins. However, biologists require easy to use, uncomplicated tools to integrate this information, visualise and print gene annotations. Integrating this information usually requires considerable informatics skills, and comprehensive knowledge of the data format to make full use of this information. Tools are needed to explore gene model variants by allowing users to create alternative transcript models using novel combinations of exons not necessarily represented in current database deposits of mRNA/cDNA sequences. Results Djinn Lite is designed to be an intuitive program for storing and visually exploring custom annotations relating to a eukaryotic gene sequence and its modelled gene products. In particular, it is helpful in developing hypotheses regarding alternate splicing of transcripts by allowing the construction of model transcripts and inspection of their resulting translations. It makes it possible to view a gene and its gene products in one synchronised graphical view, allowing one to drill down into sequence related data. Colour highlighting of selected sequences and added annotations further supports exploration and visualisation of sequence regions and motifs known or predicted to be biologically significant. Conclusion Gene annotating remains an ongoing and challenging task that will continue as gene structures, gene transcription repertoires, disease loci, protein products and their interactions become more precisely defined. 
Djinn Lite offers an accessible interface to help accumulate, enrich, and individualise sequence annotations relating to a gene, its transcripts and translations. The mechanism of transcript definition and creation, and subsequent navigation and exploration of features, are very intuitive and demand only a short learning curve. Ultimately, Djinn Lite can form the basis for providing valuable clues to plan new experiments, providing storage of sequences and annotations for dedication to customised projects. The application is appropriate for Windows 98-ME-2000-XP-2003 operating systems. PMID:16426464

  19. Identification and computational annotation of genes differentially expressed in pulp development of Cocos nucifera L. by suppression subtractive hybridization

    PubMed Central

    2014-01-01

    Background Coconut (Cocos nucifera L.) is one of the world’s most versatile, economically important tropical crops. Little is known about the physiological and molecular basis of coconut pulp (endosperm) development and only a few coconut genes and gene product sequences are available in public databases. This study identified genes that were differentially expressed during development of coconut pulp and functionally annotated these identified genes using bioinformatics analysis. Results Pulp from three different coconut developmental stages was collected. Four suppression subtractive hybridization (SSH) libraries were constructed (forward and reverse libraries A and B between stages 1 and 2, and C and D between stages 2 and 3), and identified sequences were computationally annotated using Blast2GO software. A total of 1272 clones were obtained for analysis from four SSH libraries with 63% showing similarity to known proteins. Pairwise comparing of stage-specific gene ontology ids from libraries B-D, A-C, B-C and A-D showed that 32 genes were continuously upregulated and seven downregulated; 28 were transiently upregulated and 23 downregulated. KEGG (Kyoto Encyclopedia of Genes and Genomes) analysis showed that 1-acyl-sn-glycerol-3-phosphate acyltransferase (LPAAT), phospholipase D, acetyl-CoA carboxylase carboxyltransferase beta subunit, 3-hydroxyisobutyryl-CoA hydrolase-like and pyruvate dehydrogenase E1 ? subunit were associated with fatty acid biosynthesis or metabolism. Triose phosphate isomerase, cellulose synthase and glucan 1,3-?-glucosidase were related to carbohydrate metabolism, and phosphoenolpyruvate carboxylase was related to both fatty acid and carbohydrate metabolism. 
Of 737 unigenes, 103 encoded enzymes were involved in fatty acid and carbohydrate biosynthesis and metabolism, and a number of transcription factors and other interesting genes with stage-specific expression were confirmed by real-time PCR, with validation of the SSH results as high as 66.6%. Based on determination of coconut endosperm fatty acids content by gas chromatography–mass spectrometry, a number of candidate genes in fatty acid anabolism were selected for further study. Conclusion Functional annotation of genes differentially expressed in coconut pulp development helped determine the molecular basis of coconut endosperm development. The SSH method identified genes related to fatty acids, carbohydrate and secondary metabolites. The results will be important for understanding gene functions and regulatory networks in coconut fruit. PMID:25084812

  20. The Physalis peruviana leaf transcriptome: assembly, annotation and gene model prediction

    PubMed Central

    2012-01-01

    Background Physalis peruviana commonly known as Cape gooseberry is a member of the Solanaceae family that has an increasing popularity due to its nutritional and medicinal values. A broad range of genomic tools is available for other Solanaceae, including tomato and potato. However, limited genomic resources are currently available for Cape gooseberry. Results We report the generation of a total of 652,614 P. peruviana Expressed Sequence Tags (ESTs), using 454 GS FLX Titanium technology. ESTs, with an average length of 371 bp, were obtained from a normalized leaf cDNA library prepared using a Colombian commercial variety. De novo assembling was performed to generate a collection of 24,014 isotigs and 110,921 singletons, with an average length of 1,638 bp and 354 bp, respectively. Functional annotation was performed using NCBI’s BLAST tools and Blast2GO, which identified putative functions for 21,191 assembled sequences, including gene families involved in all the major biological processes and molecular functions as well as defense response and amino acid metabolism pathways. Gene model predictions in P. peruviana were obtained by using the genomes of Solanum lycopersicum (tomato) and Solanum tuberosum (potato). We predict 9,436 P. peruviana sequences with multiple-exon models and conserved intron positions with respect to the potato and tomato genomes. Additionally, to study species diversity we developed 5,971 SSR markers from assembled ESTs. Conclusions We present the first comprehensive analysis of the Physalis peruviana leaf transcriptome, which will provide valuable resources for development of genetic tools in the species. Assembled transcripts with gene models could serve as potential candidates for marker discovery with a variety of applications including: functional diversity, conservation and improvement to increase productivity and fruit quality. P. peruviana was estimated to be phylogenetically branched out before the divergence of five other Solanaceae family members, S. lycopersicum, S. tuberosum, Capsicum spp, S. melongena and Petunia spp. PMID:22533342

  1. Comparative Analysis of Chloroplast Genomes: Functional Annotation, Genome-Based Phylogeny, and Deduced Evolutionary Patterns

    PubMed Central

    Rivas, Javier De Las; Lozano, Juan Jose; Ortiz, Angel R.

    2002-01-01

    All protein sequences from 19 complete chloroplast genomes (cpDNA) have been studied using a new computational method able to analyze functional correlations among series of protein sequences contained in complete proteomes. First, all open reading frames (ORFs) from the cpDNAs, comprising a total of 2266 protein sequences, were compared against the 3168 proteins from the Synechocystis PCC6803 complete genome to find functionally related orthologous proteins. Additionally, all cpDNA genomes were pairwise compared to find orthologous groups not present in cyanobacteria. Annotations in the cluster of orthologous proteins database and CyanoBase were used as reference for the functional assignments. Following this protocol, new functional assignments were made for ORFs of unknown function and for ycfs (hypothetical chloroplast frames), which still lack a functional assignment. Using this information, a matrix of functional relationships was derived from profiles of the presence and/or absence of orthologous proteins; the matrix included 1837 proteins in 277 orthologous clusters. A factor analysis study of this matrix, followed by cluster analysis, allowed us to obtain accurate phylogenetic reconstructions and the detection of genes probably involved in speciation as phylogenetic correlates. Finally, by grouping common evolutionary patterns, we show that it is possible to determine functionally linked protein networks. This has allowed us to suggest putative associations for some unknown ORFs. PMID:11932241
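
    The presence/absence profiles described in the abstract above can be pictured as a binary matrix with one row per genome and one column per orthologous cluster. The sketch below builds such a matrix from toy data and compares rows with a Jaccard distance. The genome labels, cluster assignments, and the use of Jaccard distance are assumptions made for illustration; the study itself applied factor analysis followed by cluster analysis, which is not reproduced here.

```python
def presence_absence_matrix(genome_clusters):
    """Build a binary matrix: rows are genomes, columns are orthologous
    clusters, 1 marks a cluster present in that genome."""
    genomes = sorted(genome_clusters)
    clusters = sorted(set().union(*genome_clusters.values()))
    matrix = [
        [1 if c in genome_clusters[g] else 0 for c in clusters]
        for g in genomes
    ]
    return genomes, clusters, matrix


def jaccard_distance(row_a, row_b):
    """1 - |shared clusters| / |clusters present in either genome|."""
    both = sum(1 for a, b in zip(row_a, row_b) if a and b)
    either = sum(1 for a, b in zip(row_a, row_b) if a or b)
    return 0.0 if either == 0 else 1.0 - both / either


# Toy input: hypothetical genomes and illustrative cluster labels.
toy = {
    "genome_A": {"psbA", "rbcL", "ycf1"},
    "genome_B": {"psbA", "rbcL"},
    "genome_C": {"psbA"},
}

genomes, clusters, matrix = presence_absence_matrix(toy)
d_ab = jaccard_distance(matrix[0], matrix[1])
d_ac = jaccard_distance(matrix[0], matrix[2])
```

    In this toy case genome_A is closer to genome_B than to genome_C, since they share two of three clusters; a pairwise distance table of this kind is one simple input for a clustering step.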

  2. New in protein structure and function annotation: hotspots, single nucleotide polymorphisms and the 'Deep Web'.

    PubMed

    Bromberg, Yana; Yachdav, Guy; Ofran, Yanay; Schneider, Reinhard; Rost, Burkhard

    2009-05-01

    The rapidly increasing quantity of protein sequence data continues to widen the gap between available sequences and annotations. Comparative modeling suggests some aspects of the 3D structures of approximately half of all known proteins; homology- and network-based inferences annotate some aspect of function for a similar fraction of the proteome. For most known protein sequences, however, there is detailed knowledge about neither their function nor their structure. Comprehensive efforts towards the expert curation of sequence annotations have failed to meet the demand of the rapidly increasing number of available sequences. Only the automated prediction of protein function in the absence of homology can close the gap between available sequences and annotations in the foreseeable future. This review focuses on two novel methods for automated annotation, and briefly presents an outlook on how modern web software may revolutionize the field of protein sequence annotation. First, predictions of protein binding sites and functional hotspots, and the evolution of these into the most successful type of prediction of protein function from sequence will be discussed. Second, a new tool, comprehensive in silico mutagenesis, which contributes important novel predictions of function and at the same time prepares for the onset of the next sequencing revolution, will be described. While these two new sub-fields of protein prediction represent the breakthroughs that have been achieved methodologically, it will then be argued that a different development might further change the way biomedical researchers benefit from annotations: modern web software can connect the worldwide web in any browser with the 'Deep Web' (ie, proprietary data resources). The availability of this direct connection, and the resulting access to a wealth of data, may impact drug discovery and development more than any existing method that contributes to protein annotation. PMID:19396742

  3. ANNOTATION OF TRIBOLIUM CUTICLE PROTEIN AND PERITROPHIN GENES

    Technology Transfer Automated Retrieval System (TEKTRAN)

    The recently completed genome sequence of the hard-bodied beetle, Tribolium, could reveal new insights into genetic mechanisms for chitin and cuticle production in pest insects. The genome sequence is being "mined" for cuticle genes using a combination of automated and manual gene-finding procedure...

  4. Functional annotation by identification of local surface similarities: a novel tool for structural genomics

    PubMed Central

    Ferrè, Fabrizio; Ausiello, Gabriele; Zanzoni, Andreas; Helmer-Citterich, Manuela

    2005-01-01

    Background Protein function is often dependent on subsets of solvent-exposed residues that may exist in a similar three-dimensional configuration in non homologous proteins thus having different order and/or spacing in the sequence. Hence, functional annotation by means of sequence or fold similarity is not adequate for such cases. Results We describe a method for the function-related annotation of protein structures by means of the detection of local structural similarity with a library of annotated functional sites. An automatic procedure was used to annotate the function of local surface regions. Next, we employed a sequence-independent algorithm to compare exhaustively these functional patches with a larger collection of protein surface cavities. After tuning and validating the algorithm on a dataset of well annotated structures, we applied it to a list of protein structures that are classified as being of unknown function in the Protein Data Bank. By this strategy, we were able to provide functional clues to proteins that do not show any significant sequence or global structural similarity with proteins in the current databases. Conclusion This method is able to spot structural similarities associated to function-related similarities, independently on sequence or fold resemblance, therefore is a valuable tool for the functional analysis of uncharacterized proteins. Results are available at PMID:16076399

  5. Variation analysis and gene annotation of eight MHC haplotypes: The MHC Haplotype Project

    PubMed Central

    Horton, Roger; Gibson, Richard; Coggill, Penny; Miretti, Marcos; Allcock, Richard J.; Almeida, Jeff; Forbes, Simon; Gilbert, James G. R.; Halls, Karen; Harrow, Jennifer L.; Hart, Elizabeth; Howe, Kevin; Jackson, David K.; Palmer, Sophie; Roberts, Anne N.; Sims, Sarah; Stewart, C. Andrew; Traherne, James A.; Trevanion, Steve; Wilming, Laurens; Rogers, Jane; de Jong, Pieter J.; Elliott, John F.; Sawcer, Stephen; Todd, John A.; Trowsdale, John

    2008-01-01

    The human major histocompatibility complex (MHC) is contained within about 4 Mb on the short arm of chromosome 6 and is recognised as the most variable region in the human genome. The primary aim of the MHC Haplotype Project was to provide a comprehensively annotated reference sequence of a single, human leukocyte antigen-homozygous MHC haplotype and to use it as a basis against which variations could be assessed from seven other similarly homozygous cell lines, representative of the most common MHC haplotypes in the European population. Comparison of the haplotype sequences, including four haplotypes not previously analysed, resulted in the identification of >44,000 variations, both substitutions and indels (insertions and deletions), which have been submitted to the dbSNP database. The gene annotation uncovered haplotype-specific differences and confirmed the presence of more than 300 loci, including over 160 protein-coding genes. Combined analysis of the variation and annotation datasets revealed 122 gene loci with coding substitutions of which 97 were non-synonymous. The haplotype (A3-B7-DR15; PGF cell line) designated as the new MHC reference sequence, has been incorporated into the human genome assembly (NCBI35 and subsequent builds), and constitutes the largest single-haplotype sequence of the human genome to date. The extensive variation and annotation data derived from the analysis of seven further haplotypes have been made publicly available and provide a framework and resource for future association studies of all MHC-associated diseases and transplant medicine. PMID:18193213

  6. A computational approach to candidate gene prioritization for X-linked mental retardation using annotation-based binary filtering and motif-based linear discriminatory analysis

    PubMed Central

    2011-01-01

    Background Several computational candidate gene selection and prioritization methods have recently been developed. These in silico selection and prioritization techniques are usually based on two central approaches - the examination of similarities to known disease genes and/or the evaluation of functional annotation of genes. Each of these approaches has its own caveats. Here we employ a previously described method of candidate gene prioritization based mainly on gene annotation, in accompaniment with a technique based on the evaluation of pertinent sequence motifs or signatures, in an attempt to refine the gene prioritization approach. We apply this approach to X-linked mental retardation (XLMR), a group of heterogeneous disorders for which some of the underlying genetics is known. Results The gene annotation-based binary filtering method yielded a ranked list of putative XLMR candidate genes with good plausibility of being associated with the development of mental retardation. In parallel, a motif finding approach based on linear discriminatory analysis (LDA) was employed to identify short sequence patterns that may discriminate XLMR from non-XLMR genes. High rates (>80%) of correct classification were achieved, suggesting that the identification of these motifs effectively captures genomic signals associated with XLMR vs. non-XLMR genes. The computational tools developed for the motif-based LDA are integrated into the freely available genomic analysis portal Galaxy (http://main.g2.bx.psu.edu/). Nine genes (APLN, ZC4H2, MAGED4, MAGED4B, RAP2C, FAM156A, FAM156B, TBL1X, and UXT) were highlighted as highly ranked XLMR candidates. Conclusions The combination of gene annotation information and sequence motif-orientated computational candidate gene prediction methods highlights an added benefit in generating a list of plausible candidate genes, as has been demonstrated for XLMR.
Reviewers: This article was reviewed by Dr Barbara Bardoni (nominated by Prof Juergen Brosius); Prof Neil Smalheiser and Dr Dustin Holloway (nominated by Prof Charles DeLisi). PMID:21668950

  7. Annotation and comparative analysis of the glycoside hydrolase genes in Brachypodium distachyon

    PubMed Central

    2010-01-01

    Background Glycoside hydrolases cleave the bond between a carbohydrate and another carbohydrate, a protein, lipid or other moiety. Genes encoding glycoside hydrolases are found in a wide range of organisms, from archaea to animals, and are relatively abundant in plant genomes. In plants, these enzymes are involved in diverse processes, including starch metabolism, defense, and cell-wall remodeling. Glycoside hydrolase genes have been previously cataloged for Oryza sativa (rice), the model dicotyledonous plant Arabidopsis thaliana, and the fast-growing tree Populus trichocarpa (poplar). To improve our understanding of glycoside hydrolases in plants generally and in grasses specifically, we annotated the glycoside hydrolase genes in the grasses Brachypodium distachyon (an emerging monocotyledonous model) and Sorghum bicolor (sorghum). We then compared the glycoside hydrolases across species, at the levels of the whole genome and individual glycoside hydrolase families. Results We identified 356 glycoside hydrolase genes in Brachypodium and 404 in sorghum. The corresponding proteins fell into the same 34 families that are represented in rice, Arabidopsis, and poplar, helping to define a glycoside hydrolase family profile which may be common to flowering plants. For several glycoside hydrolase families (GH5, GH13, GH18, GH19, GH28, and GH51), we present a detailed literature review together with an examination of the family structures. This analysis of individual families revealed both similarities and distinctions between monocots and eudicots, as well as between species. Shared evolutionary histories appear to be modified by lineage-specific expansions or deletions. Within GH families, the Brachypodium and sorghum proteins generally cluster with those from other monocots. Conclusions This work provides the foundation for further comparative and functional analyses of plant glycoside hydrolases.
Defining the Brachypodium glycoside hydrolases sets the stage for Brachypodium to be a grass model for investigations of these enzymes and their diverse roles in planta. Insights gained from Brachypodium will inform translational research studies, with applications for the improvement of cereal crops and bioenergy grasses. PMID:20973991

  8. Annotation and comparative analysis of the glycoside hydrolase genes in Brachypodium distachyon

    SciTech Connect

    Tyler, Ludmila; Bragg, Jennifer; Wu, Jiajie; Yang, Xiaohan; Tuskan, Gerald A; Vogel, John

    2010-01-01

    Background Glycoside hydrolases cleave the bond between a carbohydrate and another carbohydrate, a protein, lipid or other moiety. Genes encoding glycoside hydrolases are found in a wide range of organisms, from archaea to animals, and are relatively abundant in plant genomes. In plants, these enzymes are involved in diverse processes, including starch metabolism, defense, and cell-wall remodeling. Glycoside hydrolase genes have been previously cataloged for Oryza sativa (rice), the model dicotyledonous plant Arabidopsis thaliana, and the fast-growing tree Populus trichocarpa (poplar). To improve our understanding of glycoside hydrolases in plants generally and in grasses specifically, we annotated the glycoside hydrolase genes in the grasses Brachypodium distachyon (an emerging monocotyledonous model) and Sorghum bicolor (sorghum). We then compared the glycoside hydrolases across species, both at the whole-genome level and at the level of individual glycoside hydrolase families. Results We identified 356 glycoside hydrolase genes in Brachypodium and 404 in sorghum. The corresponding proteins fell into the same 34 families that are represented in rice, Arabidopsis, and poplar, helping to define a glycoside hydrolase family profile which may be common to flowering plants. Examination of individual glycoside hydrolase families (GH5, GH13, GH18, GH19, GH28, and GH51) revealed both similarities and distinctions between monocots and dicots, as well as between species. Shared evolutionary histories appear to be modified by lineage-specific expansions or deletions. Within families, the Brachypodium and sorghum proteins generally cluster with those from other monocots. Conclusions This work provides the foundation for further comparative and functional analyses of plant glycoside hydrolases. Defining the Brachypodium glycoside hydrolases sets the stage for Brachypodium to be a monocot model for investigations of these enzymes and their diverse roles in planta.
Insights gained from Brachypodium will inform translational research studies, with applications for the improvement of cereal crops and bioenergy grasses.

  9. Likelihood-based gene annotations for gap filling and quality assessment in genome-scale metabolic models

    SciTech Connect

    Benedict, Matthew N.; Mundy, Michael B.; Henry, Christopher S.; Chia, Nicholas; Price, Nathan D.; Maranas, Costas D.

    2014-10-16

    Genome-scale metabolic models provide a powerful means to harness information from genomes to deepen biological insights. With exponentially increasing sequencing capacity, there is an enormous need for automated reconstruction techniques that can provide more accurate models in a short time frame. Current methods for automated metabolic network reconstruction rely on gene and reaction annotations to build draft metabolic networks and algorithms to fill gaps in these networks. However, automated reconstruction is hampered by database inconsistencies, incorrect annotations, and gap filling largely without considering genomic information. Here we develop an approach for applying genomic information to predict alternative functions for genes and estimate their likelihoods from sequence homology. We show that computed likelihood values were significantly higher for annotations found in manually curated metabolic networks than those that were not. We then apply these alternative functional predictions to estimate reaction likelihoods, which are used in a new gap filling approach called likelihood-based gap filling to predict more genomically consistent solutions. To validate the likelihood-based gap filling approach, we applied it to models where essential pathways were removed, finding that likelihood-based gap filling identified more biologically relevant solutions than parsimony-based gap filling approaches. We also demonstrate that models gap filled using likelihood-based gap filling provide greater coverage and genomic consistency with metabolic gene functions compared to parsimony-based approaches. Interestingly, despite these findings, we found that likelihoods did not significantly affect consistency of gap filled models with Biolog and knockout lethality data. 
This indicates that the phenotype data alone cannot necessarily be used to discriminate between alternative solutions for gap filling and therefore, that the use of other information is necessary to obtain a more accurate network. All described workflows are implemented as part of the DOE Systems Biology Knowledgebase (KBase) and are publicly available via API or command-line web interface.

  10. Functional annotation of an expressed sequence tag library from Haliotis diversicolor and analysis of its plant-like sequences.

    PubMed

    Jiang, Jing-Zhe; Zhang, Wei; Guo, Zhi-Xun; Cai, Chen-Chen; Su, You-Lu; Wang, Rui-Xuan; Wang, Jiang-Yong

    2011-09-01

    The small abalone, Haliotis diversicolor, is a widely distributed and cultured species in the subtropical coastal area of China. To identify and classify functional genes of this important species, a normalized expressed sequence tag (EST) library, including 7069 high quality ESTs from the total body of H. diversicolor, was analyzed. A total of 4781 unigenes were assembled and 2991 novel abalone genes were identified. The GC content, codon and amino acid usage of the transcriptome were analyzed. For the accurate annotation of the abalone library, different influencing factors were evaluated. The gene ontology (GO) database provided a higher annotation rate (69.6%), and sequences longer than 800 bp were easily subjected to a BLAST search. The taxonomy of the BLAST results showed that lancelet and invertebrates are most closely related to abalone. Sixty-seven identified plant-like genes were further examined by reverse transcription-polymerase chain reaction (RT-PCR) and sequencing; only seven of these were real transcripts in abalone. Phylogenetic trees were also constructed to illustrate the positions of two Cystatin sequences and one Calmodulin protein sequence identified in abalone. To perform functional classification, three different databases (GO, KEGG and COG) were used and 60 immune or disease-related unigenes were determined. This work has greatly enlarged the known gene pool of H. diversicolor and will have important implications for future molecular and genetic analyses in this organism. PMID:21867971
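
    The record above mentions analysing the GC content of the transcriptome. As an illustration only, a minimal Python sketch of a per-sequence GC calculation follows. The function name gc_content, the toy sequences, and the choice to ignore ambiguous bases such as N are assumptions made here; the abstract does not describe how the authors computed it.

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C among unambiguous bases (A, C, G, T).

    Ambiguous symbols such as N are ignored (an assumption of this sketch).
    Returns 0.0 for a sequence with no unambiguous bases.
    """
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return gc / total if total else 0.0


# Hypothetical EST fragments, used only to show the call.
ests = {
    "est_001": "ATGGCGTACGCTTAG",
    "est_002": "ATATATGCNNAT",
}

for name, seq in ests.items():
    print(name, round(gc_content(seq), 3))
```

    For a whole library, the same function could be applied to each unigene and the values averaged or plotted as a distribution.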

  11. ARG-ANNOT, a New Bioinformatic Tool To Discover Antibiotic Resistance Genes in Bacterial Genomes

    PubMed Central

    Gupta, Sushim Kumar; Padmanabhan, Babu Roshan; Diene, Seydina M.; Lopez-Rojas, Rafael; Kempf, Marie; Landraud, Luce

    2014-01-01

    ARG-ANNOT (Antibiotic Resistance Gene-ANNOTation) is a new bioinformatic tool that was created to detect existing and putative new antibiotic resistance (AR) genes in bacterial genomes. ARG-ANNOT uses a local BLAST program in Bio-Edit software that allows the user to analyze sequences without a Web interface. All AR genetic determinants were collected from published works and online resources; nucleotide and protein sequences were retrieved from the NCBI GenBank database. After building a database that includes 1,689 antibiotic resistance genes, the software was tested in a blind manner using 100 random sequences selected from the database to verify that the sensitivity and specificity were at 100% even when partial sequences were queried. Notably, BLAST analysis results obtained using the rmtF gene sequence (a new aminoglycoside-modifying enzyme gene sequence that is not included in the database) as a query revealed that the tool was able to link this sequence to short sequences (17 to 40 bp) found in other genes of the rmt family with significant E values. Finally, the analysis of 178 Acinetobacter baumannii and 20 Staphylococcus aureus genomes allowed the detection of a significantly higher number of AR genes than the Resfinder gene analyzer and 11 point mutations in target genes known to be associated with AR. The average time for the analysis of a genome was 3.35 ± 0.13 min. We have created a concise database for BLAST using a Bio-Edit interface that can detect AR genetic determinants in bacterial genomes and can rapidly and easily discover putative new AR genetic determinants. PMID:24145532
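
    The blind test in the record above is reported in terms of sensitivity and specificity. For readers who want the conventional definitions, a minimal Python sketch follows. The counts are hypothetical and are not taken from the paper; only the standard confusion-matrix formulas are shown.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """True positive rate: TP / (TP + FN)."""
    denom = true_pos + false_neg
    return true_pos / denom if denom else float("nan")


def specificity(true_neg: int, false_pos: int) -> float:
    """True negative rate: TN / (TN + FP)."""
    denom = true_neg + false_pos
    return true_neg / denom if denom else float("nan")


# Hypothetical counts: every resistance gene found, no false calls.
print(sensitivity(true_pos=100, false_neg=0))
print(specificity(true_neg=50, false_pos=0))
```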

  12. Functional Annotation of Proteomic Data from Chicken Heterophils and Macrophages Induced by Carbon Nanotube Exposure

    PubMed Central

    Li, Yun-Ze; Cheng, Chung-Shi; Chen, Chao-Jung; Li, Zi-Lin; Lin, Yao-Tung; Chen, Shuen-Ei; Huang, San-Yuan

    2014-01-01

    With the expanding applications of carbon nanotubes (CNT) in biomedicine and agriculture, questions about the toxicity and biocompatibility of CNT in humans and domestic animals are becoming matters of serious concern. This study used proteomic methods to profile gene expression in chicken macrophages and heterophils in response to CNT exposure. Two-dimensional gel electrophoresis identified 12 proteins in macrophages and 15 in heterophils, with differential expression patterns in response to CNT co-incubation (0, 1, 10, and 100 μg/mL of CNT for 6 h) (p < 0.05). Gene ontology analysis showed that most of the differentially expressed proteins are associated with protein interactions, cellular metabolic processes, and cell mobility, suggesting activation of innate immune functions. Western blot analysis with heat shock protein 70, high mobility group protein, and peptidylprolyl isomerase A confirmed the alterations of the profiled proteins. The functional annotations were further confirmed by effective cell migration, promoted interleukin-1β secretion, and more cell death in both macrophages and heterophils exposed to CNT (p < 0.05). In conclusion, results of this study suggest that CNT exposure affects protein expression, leading to activation of macrophages and heterophils, resulting in altered cytoskeleton remodeling, cell migration, and cytokine production, and thereby mediates tissue immune responses. PMID:24823882

  13. The power of EST sequence data: Relation to Acyrthosiphon pisum genome annotation and functional genomics initiatives

    Technology Transfer Automated Retrieval System (TEKTRAN)

    Genes important to aphid biology, survival and reproduction were successfully identified by use of a genomics approach. We created and described the sequencing, compilation, and annotation of the approximately 525 Mb nuclear genome of the pea aphid, Acyrthosiphon pisum, which represents an important ...

  14. Evading the annotation bottleneck: using sequence similarity to search non-sequence gene data

    PubMed Central

    Gilchrist, Michael J; Christensen, Mikkel B; Harland, Richard; Pollet, Nicolas; Smith, James C; Ueno, Naoto; Papalopulu, Nancy

    2008-01-01

    Background Non-sequence gene data (images, literature, etc.) can be found in many different public databases. Access to these data is mostly by text based methods using gene names; however, gene annotation is neither complete, nor fully systematic between organisms, and is also not generally stable over time. This provides some challenges for text based access, especially for cross-species searches. We propose a method for non-sequence data retrieval based on sequence similarity, which removes dependence on annotation and text searches. This work was motivated by the need to provide better access to large numbers of in situ images, and the observation that such image data were usually associated with a specific gene sequence. Sequence similarity searches are found in existing gene oriented databases, but mostly give indirect access to non-sequence data via navigational links. Results Three applications were built to explore the proposed method: accessing image data, literature and gene names. Searches are initiated with the sequence of the user's gene of interest, which is searched against a database of sequences associated with the target data. The matching (non-sequence) target data are returned directly to the user's browser, organised by sequence similarity. The method worked well for the intended application in image data management. Comparison with text based searches of the image data set showed the accuracy of the method. Applied to literature searches it facilitated retrieval of mostly high relevance references. Applied to gene name data it provided a useful analysis of name variation of related genes within and between species. Conclusion This method makes a powerful and useful addition to existing methods for searching gene data based on text retrieval or curated gene lists. In particular the method facilitates cross-species comparisons, and enables the handling of novel or otherwise un-annotated genes. 
Applications using the method are quick and easy to build, and the data require little maintenance. This approach largely circumvents the need for annotation, which can be a major obstacle to the development of genomic scale data resources. PMID:18928517
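
    The retrieval idea in the record above (search with a sequence, return the attached non-sequence data ranked by similarity) can be sketched in a few lines of Python. This is a toy illustration under stated assumptions: k-mer Jaccard similarity stands in for a real sequence-similarity search, and the record layout, file names, function names, and the min_score threshold are all hypothetical. It is not the authors' implementation.

```python
def kmers(seq: str, k: int = 4) -> set:
    """Set of overlapping k-mers in a sequence (case-insensitive)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard index of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def search(query: str, records: list, k: int = 4, min_score: float = 0.1) -> list:
    """Return (score, data) pairs for records whose sequence resembles the
    query, best match first. Records scoring below min_score are dropped."""
    query_kmers = kmers(query, k)
    hits = []
    for rec in records:
        score = jaccard(query_kmers, kmers(rec["sequence"], k))
        if score >= min_score:
            hits.append((score, rec["data"]))
    hits.sort(key=lambda hit: hit[0], reverse=True)
    return hits


# Hypothetical database: each sequence is linked to a non-sequence item.
records = [
    {"sequence": "ATGGCCAAGTTGACCAGT", "data": "image_A.png"},
    {"sequence": "ATGGCCAAGTTGACGTCA", "data": "image_B.png"},
    {"sequence": "TTTTTTTTTTTTTTTTTT", "data": "image_C.png"},
]

hits = search("ATGGCCAAGTTGACCAGT", records)
for score, data in hits:
    print(f"{score:.2f}  {data}")
```

    Note that no gene name or text annotation is consulted at any point, which is the property the abstract emphasises for cross-species searches and un-annotated genes.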

  15. A Statistical Framework to Predict Functional Non-Coding Regions in the Human Genome Through Integrated Analysis of Annotation Data

    PubMed Central

    Lu, Qiongshi; Hu, Yiming; Sun, Jiehuan; Cheng, Yuwei; Cheung, Kei-Hoi; Zhao, Hongyu

    2015-01-01

    Identifying functional regions in the human genome is a major goal in human genetics. Great efforts have been made to functionally annotate the human genome either through computational predictions, such as genomic conservation, or high-throughput experiments, such as the ENCODE project. These efforts have resulted in a rich collection of functional annotation data of diverse types that need to be jointly analyzed for integrated interpretation and annotation. Here we present GenoCanyon, a whole-genome annotation method that performs unsupervised statistical learning using 22 computational and experimental annotations thereby inferring the functional potential of each position in the human genome. With GenoCanyon, we are able to predict many of the known functional regions. The ability of predicting functional regions as well as its generalizable statistical framework makes GenoCanyon a unique and powerful tool for whole-genome annotation. The GenoCanyon web server is available at http://genocanyon.med.yale.edu PMID:26015273

  16. An atlas of bovine gene expression reveals novel distinctive tissue characteristics and evidence for improving genome annotation

    PubMed Central

    2010-01-01

    Background A comprehensive transcriptome survey, or gene atlas, provides information essential for a complete understanding of the genomic biology of an organism. We present an atlas of RNA abundance for 92 adult, juvenile and fetal cattle tissues and three cattle cell lines. Results The Bovine Gene Atlas was generated from 7.2 million unique digital gene expression tag sequences (300.2 million total raw tag sequences), from which 1.59 million unique tag sequences were identified that mapped to the draft bovine genome accounting for 85% of the total raw tag abundance. Filtering these tags yielded 87,764 unique tag sequences that unambiguously mapped to 16,517 annotated protein-coding loci in the draft genome accounting for 45% of the total raw tag abundance. Clustering of tissues based on tag abundance profiles generally confirmed ontology classification based on anatomy. There were 5,429 constitutively expressed loci and 3,445 constitutively expressed unique tag sequences mapping outside annotated gene boundaries that represent a resource for enhancing current gene models. Physical measures such as inferred transcript length or antisense tag abundance identified tissues with atypical transcriptional tag profiles. We report for the first time the tissue-specific variation in the proportion of mitochondrial transcriptional tag abundance. Conclusions The Bovine Gene Atlas is the deepest and broadest transcriptome survey of any livestock genome to date. Commonalities and variation in sense and antisense transcript tag profiles identified in different tissues facilitate the examination of the relationship between gene expression, tissue, and gene function. PMID:20961407

  17. Annotation and retrieval system of CAD models based on functional semantics

    NASA Astrophysics Data System (ADS)

    Wang, Zhansong; Tian, Ling; Duan, Wenrui

    2014-11-01

    CAD model retrieval based on functional semantics is more significant than content-based 3D model retrieval during the mechanical conceptual design phase. However, relevant research is still not fully discussed. Therefore, a functional semantic-based CAD model annotation and retrieval method is proposed to support mechanical conceptual design and design reuse, inspire designer creativity through existing CAD models, shorten design cycle, and reduce costs. Firstly, the CAD model functional semantic ontology is constructed to formally represent the functional semantics of CAD models and describe the mechanical conceptual design space comprehensively and consistently. Secondly, an approach to represent CAD models as attributed adjacency graphs (AAG) is proposed. In this method, the geometry and topology data are extracted from STEP models. On the basis of AAG, the functional semantics of CAD models are annotated semi-automatically by matching CAD models that contain the partial features of which functional semantics have been annotated manually, thereby constructing CAD Model Repository that supports model retrieval based on functional semantics. Thirdly, a CAD model retrieval algorithm that supports multi-function extended retrieval is proposed to explore more potential creative design knowledge in the semantic level. Finally, a prototype system, called Functional Semantic-based CAD Model Annotation and Retrieval System (FSMARS), is implemented. A case demonstrates that FSMARS can successfully obtain multiple potential CAD models that conform to the desired function. The proposed research addresses actual needs and presents a new way to acquire CAD models in the mechanical conceptual design phase.

  18. TriAnnot: A Versatile and High Performance Pipeline for the Automated Annotation of Plant Genomes

    PubMed Central

    Leroy, Philippe; Guilhot, Nicolas; Sakai, Hiroaki; Bernard, Aurélien; Choulet, Frédéric; Theil, Sébastien; Reboux, Sébastien; Amano, Naoki; Flutre, Timothée; Pelegrin, Céline; Ohyanagi, Hajime; Seidel, Michael; Giacomoni, Franck; Reichstadt, Mathieu; Alaux, Michael; Gicquello, Emmanuelle; Legeai, Fabrice; Cerutti, Lorenzo; Numa, Hisataka; Tanaka, Tsuyoshi; Mayer, Klaus; Itoh, Takeshi; Quesneville, Hadi; Feuillet, Catherine

    2012-01-01

    In support of the international effort to obtain a reference sequence of the bread wheat genome and to provide plant communities dealing with large and complex genomes with a versatile, easy-to-use online automated tool for annotation, we have developed the TriAnnot pipeline. Its modular architecture allows for the annotation and masking of transposable elements, the structural and functional annotation of protein-coding genes with an evidence-based quality indexing, and the identification of conserved non-coding sequences and molecular markers. The TriAnnot pipeline is parallelized on a 712 CPU computing cluster that can run a 1-Gb sequence annotation in less than 5 days. It is accessible through a web interface for small scale analyses or through a server for large scale annotations. The performance of TriAnnot was evaluated in terms of sensitivity, specificity, and general fitness using curated reference sequence sets from rice and wheat. In less than 8 h, TriAnnot was able to predict more than 83% of the 3,748 CDS from rice chromosome 1 with a fitness of 67.4%. On a set of 12 reference Mb-sized contigs from wheat chromosome 3B, TriAnnot predicted and annotated 93.3% of the genes among which 54% were perfectly identified in accordance with the reference annotation. It also allowed the curation of 12 genes based on new biological evidence, increasing the percentage of perfect gene prediction to 63%. TriAnnot systematically showed a higher fitness than other annotation pipelines that are not improved for wheat. As it is easily adaptable to the annotation of other plant genomes, TriAnnot should become a useful resource for the annotation of large and complex genomes in the future. PMID:22645565

  19. BioBuilder as a database development and functional annotation platform for proteins

    PubMed Central

    Navarro, J Daniel; Talreja, Naveen; Peri, Suraj; Vrushabendra, BM; Rashmi, BP; Padma, N; Surendranath, Vineeth; Jonnalagadda, Chandra Kiran; Kousthub, PS; Deshpande, Nandan; Shanker, K; Pandey, Akhilesh

    2004-01-01

    Background The explosion in biological information creates the need for databases that are easy to develop, easy to maintain and can be easily manipulated by annotators who are most likely to be biologists. However, deployment of scalable and extensible databases is not an easy task and generally requires substantial expertise in database development. Results BioBuilder is a Zope-based software tool that was developed to facilitate intuitive creation of protein databases. Protein data can be entered and annotated through web forms along with the flexibility to add customized annotation features to protein entries. A built-in review system permits a global team of scientists to coordinate their annotation efforts. We have already used BioBuilder to develop Human Protein Reference Database, a comprehensive annotated repository of the human proteome. The data can be exported in the extensible markup language (XML) format, which is rapidly becoming the standard format for data exchange. Conclusions As the proteomic data for several organisms begins to accumulate, BioBuilder will prove to be an invaluable platform for functional annotation and development of customizable protein centric databases. BioBuilder is open source and is available under the terms of LGPL. PMID:15099404

  20. DBH2H: vertebrate head-to-head gene pairs annotated at genomic and post-genomic levels

    PubMed Central

    Yu, Hui; Yu, Fu-Dong; Zhang, Guo-Qing; Shen, Xiang; Chen, Yun-Qin; Li, Yuan-Yuan; Li, Yi-Xue

    2009-01-01

    DBH2H collects head-to-head (h2h) gene pairs identified from human, mouse, rat, chicken and fugu genomes, and distinguishes the ortholog mapping relationship among them. The gene pairs in DBH2H are annotated with sequential features including single nucleotide polymorphisms, CpG islands and transcription factor binding sites, as well as functional terms and genetic disorders. In addition, the expression correlation information based on 117 microarray datasets is included. By providing user-friendly access to these data, DBH2H represents a valuable resource for further analyses of this important gene arrangement in terms of transcriptional regulation mechanisms, evolutionary conservation, disease relevance, etc. Database URL: http://lifecenter.sgst.cn/h2h/ PMID:20157479

  1. Rapid Annotation of Anonymous Sequences from Genome Projects Using Semantic Similarities and a Weighting Scheme in Gene Ontology

    PubMed Central

    Fontana, Paolo; Cestaro, Alessandro; Velasco, Riccardo; Formentin, Elide; Toppo, Stefano

    2009-01-01

    Background Large-scale sequencing projects have now become routine lab practice and this has led to the development of a new generation of tools involving function prediction methods, bringing the latter back to the fore. The advent of Gene Ontology, with its structured vocabulary and paradigm, has provided computational biologists with an appropriate means for this task. Methodology We present here a novel method called ARGOT (Annotation Retrieval of Gene Ontology Terms) that is able to process quickly thousands of sequences for functional inference. The tool exploits for the first time an integrated approach which combines clustering of GO terms, based on their semantic similarities, with a weighting scheme which assesses retrieved hits sharing a certain number of biological features with the sequence to be annotated. These hits may be obtained by different methods and in this work we have based ARGOT processing on BLAST results. Conclusions The extensive benchmark involved 10,000 protein sequences, the complete S. cerevisiae genome and a small subset of proteins for purposes of comparison with other available tools. The algorithm was proven to outperform existing methods and to be suitable for function prediction of single proteins due to its high degree of sensitivity, specificity and coverage. PMID:19247487

  2. Likelihood-based gene annotations for gap filling and quality assessment in genome-scale metabolic models

    DOE PAGESBeta

    Benedict, Matthew N.; Mundy, Michael B.; Henry, Christopher S.; Chia, Nicholas; Price, Nathan D.; Maranas, Costas D.

    2014-10-16

    Genome-scale metabolic models provide a powerful means to harness information from genomes to deepen biological insights. With exponentially increasing sequencing capacity, there is an enormous need for automated reconstruction techniques that can provide more accurate models in a short time frame. Current methods for automated metabolic network reconstruction rely on gene and reaction annotations to build draft metabolic networks and algorithms to fill gaps in these networks. However, automated reconstruction is hampered by database inconsistencies, incorrect annotations, and gap filling largely without considering genomic information. Here we develop an approach for applying genomic information to predict alternative functions for genes and estimate their likelihoods from sequence homology. We show that computed likelihood values were significantly higher for annotations found in manually curated metabolic networks than those that were not. We then apply these alternative functional predictions to estimate reaction likelihoods, which are used in a new gap filling approach called likelihood-based gap filling to predict more genomically consistent solutions. To validate the likelihood-based gap filling approach, we applied it to models where essential pathways were removed, finding that likelihood-based gap filling identified more biologically relevant solutions than parsimony-based gap filling approaches. We also demonstrate that models gap filled using likelihood-based gap filling provide greater coverage and genomic consistency with metabolic gene functions compared to parsimony-based approaches. Interestingly, despite these findings, we found that likelihoods did not significantly affect consistency of gap filled models with Biolog and knockout lethality data. 
This indicates that the phenotype data alone cannot necessarily be used to discriminate between alternative solutions for gap filling and therefore, that the use of other information is necessary to obtain a more accurate network. All described workflows are implemented as part of the DOE Systems Biology Knowledgebase (KBase) and are publicly available via API or command-line web interface.

  3. BioGPS: building your own mash-up of gene annotations and expression profiles

    PubMed Central

    Wu, Chunlei; Jin, Xuefeng; Tsueng, Ginger; Afrasiabi, Cyrus; Su, Andrew I.

    2016-01-01

    BioGPS (http://biogps.org) is a centralized gene-annotation portal that enables researchers to access distributed gene annotation resources. This article focuses on the updates to BioGPS since our last paper (2013 database issue). The unique features of BioGPS, compared to those of other gene portals, are its community extensibility and user customizability. Users contribute the gene-specific resources accessible from BioGPS (‘plugins’), which helps ensure that the resource collection is always up-to-date and that it will continue expanding over time (since the 2013 paper, 162 resources have been added, for a 34% increase in the number of resources available). BioGPS users can create their own collections of relevant plugins and save them as customized gene-report pages or ‘layouts’ (since the 2013 paper, 488 user-created layouts have been added, for a 22% increase in the number of layouts). In addition, we recently updated the most popular plugin, the ‘Gene expression/activity chart’, to include ~6000 datasets (from ~2000 datasets) and we enhanced user interactivity. We also added a new ‘gene list’ feature that allows users to save query results for future reference. PMID:26578587

  4. BioGPS: building your own mash-up of gene annotations and expression profiles.

    PubMed

    Wu, Chunlei; Jin, Xuefeng; Tsueng, Ginger; Afrasiabi, Cyrus; Su, Andrew I

    2016-01-01

    BioGPS (http://biogps.org) is a centralized gene-annotation portal that enables researchers to access distributed gene annotation resources. This article focuses on the updates to BioGPS since our last paper (2013 database issue). The unique features of BioGPS, compared to those of other gene portals, are its community extensibility and user customizability. Users contribute the gene-specific resources accessible from BioGPS ('plugins'), which helps ensure that the resource collection is always up-to-date and that it will continue expanding over time (since the 2013 paper, 162 resources have been added, for a 34% increase in the number of resources available). BioGPS users can create their own collections of relevant plugins and save them as customized gene-report pages or 'layouts' (since the 2013 paper, 488 user-created layouts have been added, for a 22% increase in the number of layouts). In addition, we recently updated the most popular plugin, the 'Gene expression/activity chart', to include ~6000 datasets (from ~2000 datasets) and we enhanced user interactivity. We also added a new 'gene list' feature that allows users to save query results for future reference. PMID:26578587

  5. PanCoreGen - Profiling, detecting, annotating protein-coding genes in microbial genomes.

    PubMed

    Paul, Sandip; Bhardwaj, Archana; Bag, Sumit K; Sokurenko, Evgeni V; Chattopadhyay, Sujay

    2015-12-01

    A large amount of genomic data, especially from multiple isolates of a single species, has opened new vistas for microbial genomics analysis. Analyzing the pan-genome (i.e. the sum of genetic repertoire) of microbial species is crucial in understanding the dynamics of molecular evolution, where virulence evolution is of major interest. Here we present PanCoreGen - a standalone application for pan- and core-genomic profiling of microbial protein-coding genes. PanCoreGen overcomes key limitations of the existing pan-genomic analysis tools, and develops an integrated annotation-structure for a species-specific pan-genomic profile. It provides important new features for annotating draft genomes/contigs and detecting unidentified genes in annotated genomes. It also generates user-defined group-specific datasets within the pan-genome. Interestingly, analyzing an example-set of Salmonella genomes, we detect potential footprints of adaptive convergence of horizontally transferred genes in two human-restricted pathogenic serovars - Typhi and Paratyphi A. Overall, PanCoreGen represents a state-of-the-art tool for microbial phylogenomics and pathogenomics study. PMID:26456591

  6. Gene function prediction based on the Gene Ontology hierarchical structure.

    PubMed

    Cheng, Liangxi; Lin, Hongfei; Hu, Yuncui; Wang, Jian; Yang, Zhihao

    2014-01-01

    The information of the Gene Ontology annotation is helpful in the explanation of life science phenomena, and can provide great support for the research of the biomedical field. The use of the Gene Ontology is gradually affecting the way people store and understand bioinformatic data. To facilitate the prediction of gene functions with the aid of text mining methods and existing resources, we transform it into a multi-label top-down classification problem and develop a method that uses the hierarchical relationships in the Gene Ontology structure to relieve the quantitative imbalance of positive and negative training samples. Meanwhile the method enhances the discriminating ability of classifiers by retaining and highlighting the key training samples. Additionally, the top-down classifier based on a tree structure takes the relationship of target classes into consideration and thus solves the incompatibility between the classification results and the Gene Ontology structure. Our experiment on the Gene Ontology annotation corpus achieves an F-value performance of 50.7% (precision: 52.7% recall: 48.9%). The experimental results demonstrate that when the size of training set is small, it can be expanded via topological propagation of associated documents between the parent and child nodes in the tree structure. The top-down classification model applies to the set of texts in an ontology structure or with a hierarchical relationship. PMID:25192339
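
The reported F-value can be checked against the stated precision and recall. The sketch below assumes the F-value is the balanced F1 score (the harmonic mean of precision and recall); the abstract does not say which weighting was used, and the function name `f_measure` is only illustrative.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (F1 when beta == 1)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall as reported in the abstract above.
f1 = f_measure(0.527, 0.489)
print(f"F1 = {f1:.1%}")
```

Under that assumption the result rounds to the 50.7% given in the abstract.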

  7. Generation, functional annotation and comparative analysis of black spruce (Picea mariana) ESTs: an important conifer genomic resource

    PubMed Central

    2013-01-01

    Background EST (expressed sequence tag) sequences and their annotation provide a highly valuable resource for gene discovery, genome sequence annotation, and other genomics studies that can be applied in genetics, breeding and conservation programs for non-model organisms. Conifers are long-lived plants that are ecologically and economically important globally, and have a large genome size. Black spruce (Picea mariana), is a transcontinental species of the North American boreal and temperate forests. However, there are limited transcriptomic and genomic resources for this species. The primary objective of our study was to develop a black spruce transcriptomic resource to facilitate on-going functional genomics projects related to growth and adaptation to climate change. Results We conducted bidirectional sequencing of cDNA clones from a standard cDNA library constructed from black spruce needle tissues. We obtained 4,594 high quality (2,455 5' end and 2,139 3' end) sequence reads, with an average read-length of 532 bp. Clustering and assembly of ESTs resulted in 2,731 unique sequences, consisting of 2,234 singletons and 497 contigs. Approximately two-thirds (63%) of unique sequences were functionally annotated. Genes involved in 36 molecular functions and 90 biological processes were discovered, including 24 putative transcription factors and 232 genes involved in photosynthesis. Most abundantly expressed transcripts were associated with photosynthesis, growth factors, stress and disease response, and transcription factors. A total of 216 full-length genes were identified. About 18% (493) of the transcripts were novel, representing an important addition to the Genbank EST database (dbEST). Fifty-seven di-, tri-, tetra- and penta-nucleotide simple sequence repeats were identified. 
Conclusions We have developed the first high quality EST resource for black spruce and identified 493 novel transcripts, which may be species-specific related to life history and ecological traits. We have also identified full-length genes and microsatellite-containing ESTs. Based on EST sequence similarities, black spruce showed close evolutionary relationships with congeneric Picea glauca and Picea sitchensis compared to other Pinaceae members and angiosperms. The EST sequences reported here provide an important resource for genome annotation, functional and comparative genomics, molecular breeding, conservation and management studies and applications in black spruce and related conifer species. PMID:24119028

  8. HapScope: a software system for automated and visual analysis of functionally annotated haplotypes

    PubMed Central

    Zhang, Jinghui; Rowe, William L.; Struewing, Jeffery P.; Buetow, Kenneth H.

    2002-01-01

    We have developed a software analysis package, HapScope, which includes a comprehensive analysis pipeline and a sophisticated visualization tool for analyzing functionally annotated haplotypes. The HapScope analysis pipeline supports: (i) computational haplotype construction with an expectation-maximization or Bayesian statistical algorithm; (ii) SNP classification by protein coding change, homology to model organisms or putative regulatory regions; and (iii) minimum SNP subset selection by either a Brute Force Algorithm or a Greedy Partition Algorithm. The HapScope viewer displays genomic structure with haplotype information in an integrated environment, providing eight alternative views for assessing genetic and functional correlation. It has a user-friendly interface for: (i) haplotype block visualization; (ii) SNP subset selection; (iii) haplotype consolidation with subset SNP markers; (iv) incorporation of both experimentally determined haplotypes and computational results; and (v) data export for additional analysis. Comparison of haplotypes constructed by the statistical algorithms with those determined experimentally shows variation in haplotype prediction accuracies in genomic regions with different levels of nucleotide diversity. We have applied HapScope in analyzing haplotypes for candidate genes and genomic regions with extensive SNP and genotype data. We envision that the systematic approach of integrating functional genomic analysis with population haplotypes, supported by HapScope, will greatly facilitate current genetic disease research. PMID:12466546
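
To illustrate what selecting a minimum SNP subset involves, here is a minimal greedy sketch. It is a generic set-cover heuristic written for this listing, not HapScope's Greedy Partition Algorithm, whose details the abstract does not give. Haplotypes are assumed to be equal-length allele strings, and the function name `greedy_snp_subset` is hypothetical.

```python
from itertools import combinations


def greedy_snp_subset(haplotypes):
    """Return SNP column indices that together distinguish all distinct haplotypes."""
    unique = sorted(set(haplotypes))
    n_snps = len(unique[0])
    # Pairs of haplotypes not yet told apart by the chosen SNPs.
    unresolved = set(combinations(range(len(unique)), 2))
    chosen = []

    def separated(snp):
        # Unresolved pairs whose alleles differ at this SNP.
        return {(i, j) for (i, j) in unresolved if unique[i][snp] != unique[j][snp]}

    while unresolved:
        # Take the SNP that separates the most unresolved pairs
        # (ties go to the lowest index).
        best = max(range(n_snps), key=lambda s: (len(separated(s)), -s))
        chosen.append(best)
        unresolved -= separated(best)
    return sorted(chosen)


haplotypes = ["ACGT", "ACGA", "TCGT", "TCCA"]
tag_snps = greedy_snp_subset(haplotypes)
print(tag_snps)
```

A greedy choice is fast but not guaranteed to be minimal, which is presumably why an exhaustive (brute force) alternative is also offered by the tool.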

  9. Identification and annotation of abiotic stress responsive candidate genes in peanut ESTs.

    PubMed

    Kumari, Archana; Kumar, Ashutosh; Wany, Aakanksha; Prajapati, Gopal Kumar; Pandey, Dev Mani

    2012-01-01

    Peanut (Arachis hypogaea L.) ranks fifth among the world oil crops and is widely grown in India and neighbouring countries. Due to its large and unknown genome size, studies on genomics and genetic modification of peanut are still scanty as compared to other model crops like Arabidopsis, rice, cotton and soybean. Because of its favourable cultivation in semi-arid regions, study of abiotic stress responsive genes and their regulation in peanut is very important. Therefore, we aim to identify and annotate the abiotic stress responsive candidate genes in peanut ESTs. Expression data of drought stress responsive corresponding genes and EST sequences were screened from dot blot experiments shown as heat maps and supplementary tables, respectively, as reported by Govind et al. (2009). Some of the screened genes having no information about their ESTs in the above-mentioned supplementary tables were retrieved from NCBI. A phylogenetic analysis was performed to find a group of the most similar ESTs for each selected gene. Individual ESTs of the said group were further searched in peanut ESTs (1,78,490 whole EST sequences) using standalone BLAST. For the prediction as well as annotation of abiotic stress responsive selected genes, various tools (like Vec-Screen, Repeat Masker, EST-Trimmer, DNA Baser, WISE2 and I-TASSER) were used. Here we report the predicted result of Contigs, domain as well as 3D structure for HSP 17.3KDa protein, DnaJ protein and Type 2 Metallothionein protein. PMID:23275722

  10. VAT: a computational framework to functionally annotate variants in personal genomes within a cloud-computing environment

    PubMed Central

    Habegger, Lukas; Balasubramanian, Suganthi; Chen, David Z.; Khurana, Ekta; Sboner, Andrea; Harmanci, Arif; Rozowsky, Joel; Clarke, Declan; Snyder, Michael; Gerstein, Mark

    2012-01-01

    Summary: The functional annotation of variants obtained through sequencing projects is generally assumed to be a simple intersection of genomic coordinates with genomic features. However, complexities arise for several reasons, including the differential effects of a variant on alternatively spliced transcripts, as well as the difficulty in assessing the impact of small insertions/deletions and large structural variants. Taking these factors into consideration, we developed the Variant Annotation Tool (VAT) to functionally annotate variants from multiple personal genomes at the transcript level as well as obtain summary statistics across genes and individuals. VAT also allows visualization of the effects of different variants, integrates allele frequencies and genotype data from the underlying individuals and facilitates comparative analysis between different groups of individuals. VAT can either be run through a command-line interface or as a web application. Finally, in order to enable on-demand access and to minimize unnecessary transfers of large data files, VAT can be run as a virtual machine in a cloud-computing environment. Availability and Implementation: VAT is implemented in C and PHP. The VAT web service, Amazon Machine Image, source code and detailed documentation are available at vat.gersteinlab.org. Contact: lukas.habegger@yale.edu or mark.gerstein@yale.edu Supplementary Information: Supplementary data are available at Bioinformatics online. PMID:22743228

  11. MeSH key terms for validation and annotation of gene expression clusters

    SciTech Connect

    Rechtsteiner, A.; Rocha, L. M.

    2004-01-01

    Integration of different sources of information is a great challenge for the analysis of gene expression data, and for the field of Functional Genomics in general. As the availability of numerical data from high-throughput methods increases, so does the need for technologies that assist in the validation and evaluation of the biological significance of results extracted from these data. In mRNA assaying with microarrays, for example, numerical analysis often attempts to identify clusters of co-expressed genes. The important task of finding the biological significance of the results and validating them has so far mostly fallen to the biological expert who had to perform this task manually. One of the most promising avenues to develop automated and integrative technology for such tasks lies in the application of modern Information Retrieval (IR) and Knowledge Management (KM) algorithms to databases with biomedical publications and data. Examples of databases available for the field are bibliographic databases containing scientific publications (e.g. MEDLINE/PUBMED), databases containing sequence data (e.g. GenBank) and databases of semantic annotations (e.g. the Gene Ontology Consortium and Medical Subject Headings (MeSH)). We present here an approach that uses the MeSH terms and their concept hierarchies to validate and obtain functional information for gene expression clusters. The controlled and hierarchical MeSH vocabulary is used by the National Library of Medicine (NLM) to index all the articles cited in MEDLINE. Such indexing with a controlled vocabulary eliminates some of the ambiguity due to polysemy (terms that have multiple meanings) and synonymy (multiple terms have similar meaning) that would be encountered if terms were extracted directly from the articles due to differing article contexts or author preferences and background.
Further, the hierarchical organization of the MeSH terms can illustrate the conceptual/functional relationships of genes associated with MeSH terms. MeSH terms can be associated with genes through co-occurrence of these in MEDLINE citations, i.e. the genes occur in titles or abstracts and the MeSH terms are assigned by experts. To identify MeSH terms associated with a group of genes we used the tool MESHGENE developed at the Information Dynamics Lab at HP Labs (http://www-idl.hpl.hp.com/meshgene/). When presented with a list of human genes, MESHGENE uses some sophisticated techniques to search for these gene symbols in the titles and abstracts of all MEDLINE citations. MeSH terms and the number of co-occurrences can be retrieved. Gene symbols that are aliases of each other are pooled from several databases. This addresses the problem of synonymy, the fact that several symbols can refer to the same gene. MESHGENE employs some sophisticated algorithms that disregard symbols that are likely to be acronyms for concepts other than a gene. This addresses the problem of polysemy, i.e. possible multiple meanings of a gene symbol. We applied our approach to gene expression data from herpes virus infected human fibroblast cells. The data contains 12 time-points, between 1/2 hrs and 48 hrs after infection. Singular Value Decomposition was used to identify the dominant modes of expression. 75% of the variance in the expression data was captured by the first two modes, the first exhibiting a monotonically increasing expression pattern and the second a more transient pattern. Projection of the gene expression vectors onto these first two modes identified 3 statistically significant clusters of co-expressed genes. 500 genes from cluster 1 and 300 genes from clusters 2 and 3 each were uploaded to MESHGENE and the MeSH terms and co-occurrence values were retrieved. MeSH terms were also obtained for 5 groups of randomly selected genes with similar numbers of genes.
The log was taken of the co-occurrence values and for each MeSH term these log co-occurrence values were summed for each group over the genes in that group. A matrix with 8 columns for the 8 groups of genes and with 14,000 rows with the MeSH terms

  12. De Novo Assembly, Characterization and Functional Annotation of Pineapple Fruit Transcriptome through Massively Parallel Sequencing

    PubMed Central

    Ong, Wen Dee; Voo, Lok-Yung Christopher; Kumar, Vijay Subbiah

    2012-01-01

    Background Pineapple (Ananas comosus var. comosus) is an important tropical non-climacteric fruit with high commercial potential. Understanding the mechanism and processes underlying fruit ripening would enable scientists to improve quality traits such as flavor, texture, appearance and fruit sweetness. Although the pineapple is an important fruit, there is insufficient transcriptomic or genomic information available in public databases. Application of high throughput transcriptome sequencing to profile the pineapple fruit transcripts is therefore needed. Methodology/Principal Findings To facilitate this, we have performed transcriptome sequencing of ripe yellow pineapple fruit flesh using Illumina technology. About 4.7 million Illumina paired-end reads were generated and assembled using the Velvet de novo assembler. The assembly produced 28,728 unique transcripts with a mean length of approximately 200 bp. Sequence similarity search against non-redundant NCBI database identified a total of 16,932 unique transcripts (58.93%) with significant hits. Out of these, 15,507 unique transcripts were assigned to gene ontology terms. Functional annotation against Kyoto Encyclopedia of Genes and Genomes pathway database identified 13,598 unique transcripts (47.33%) which were mapped to 126 pathways. The assembly revealed many transcripts that were previously unknown. Conclusions The unique transcripts derived from this work have rapidly increased the number of pineapple fruit mRNA transcripts now available in public databases. This information can be further utilized in gene expression, genomics and other functional genomics studies in pineapple. PMID:23091603

  13. High-throughput comparison, functional annotation, and metabolic modeling of plant genomes using the PlantSEED resource

    Technology Transfer Automated Retrieval System (TEKTRAN)

    The increasing number of sequenced plant genomes is placing new demands on the methods applied to analyze, annotate, and model these genomes. Today's annotation pipelines result in inconsistent gene assignments that complicate comparative analyses and prevent efficient construction of metabolic mode...

  14. The proteome of Toxoplasma gondii: integration with the genome provides novel insights into gene expression and annotation

    PubMed Central

    Xia, Dong; Sanderson, Sanya J; Jones, Andrew R; Prieto, Judith H; Yates, John R; Bromley, Elizabeth; Tomley, Fiona M; Lal, Kalpana; Sinden, Robert E; Brunk, Brian P; Roos, David S; Wastling, Jonathan M

    2008-01-01

    Background Although the genomes of many of the most important human and animal pathogens have now been sequenced, our understanding of the actual proteins expressed by these genomes and how well they predict protein sequence and expression is still deficient. We have used three complementary approaches (two-dimensional electrophoresis, gel-liquid chromatography linked tandem mass spectrometry and MudPIT) to analyze the proteome of Toxoplasma gondii, a parasite of medical and veterinary significance, and have developed a public repository for these data within ToxoDB, making for the first time proteomics data an integral part of this key genome resource. Results The draft genome for Toxoplasma predicts around 8,000 genes with varying degrees of confidence. Our data demonstrate how proteomics can inform these predictions and help discover new genes. We have identified nearly one-third (2,252) of all the predicted proteins, with 2,477 intron-spanning peptides providing supporting evidence for correct splice site annotation. Functional predictions for each protein and key pathways were determined from the proteome. Importantly, we show evidence for many proteins that match alternative gene models, or previously unpredicted genes. For example, approximately 15% of peptides matched more convincingly to alternative gene models. We also compared our data with existing transcriptional data in which we highlight apparent discrepancies between gene transcription and protein expression. Conclusion Our data demonstrate the importance of protein data in expression profiling experiments and highlight the necessity of integrating proteomic with genomic data so that iterative refinements of both annotation and expression models are possible. PMID:18644147

  15. Hunting for genes by functional screens.

    PubMed

    Kiss-Toth, Endre; Qwarnstrom, Eva E; Dower, Steven K

    2004-01-01

    Advances in high throughput sequencing technologies have led to an explosion of sequence information available for today's researchers. Efforts in the emerging next phase of the genomic era are focusing on the assignment of function to genes uncovered by genome sequencing programs. The main approaches include high throughput mutagenesis, predictions based on homology in primary sequence, microarray and proteomics. Despite the variety of strategies applied, only 30% of predicted human genes have any function assigned. There is a need, therefore, for additional tools to overcome some of the limitations of existing techniques. In this review we discuss some recent developments and their impact on gene function annotation, especially as they relate to the elucidation of signalling cascades activated by cytokines and growth factors. PMID:15110793

  16. Judging the quality of gene expression-based clustering methods using gene annotation.

    PubMed

    Gibbons, Francis D; Roth, Frederick P

    2002-10-01

    We compare several commonly used expression-based gene clustering algorithms using a figure of merit based on the mutual information between cluster membership and known gene attributes. By studying various publicly available expression data sets we conclude that enrichment of clusters for biological function is, in general, highest at rather low cluster numbers. As a measure of dissimilarity between the expression patterns of two genes, no method outperforms Euclidean distance for ratio-based measurements, or Pearson distance for non-ratio-based measurements at the optimal choice of cluster number. We show the self-organized-map approach to be best for both measurement types at higher numbers of clusters. Clusters of genes derived from single- and average-linkage hierarchical clustering tend to produce worse-than-random results. PMID:12368250
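
The quantity underlying such a figure of merit, the mutual information between cluster membership and a known gene attribute, can be sketched as follows. This shows only the basic measure for a single attribute; how the authors combine attributes or compare against random clusterings is not stated in the abstract. The function name `mutual_information` and the toy labels are assumptions.

```python
import math
from collections import Counter


def mutual_information(clusters, attributes):
    """Mutual information (in bits) between two equal-length label sequences."""
    n = len(clusters)
    joint = Counter(zip(clusters, attributes))
    cluster_counts = Counter(clusters)
    attribute_counts = Counter(attributes)
    mi = 0.0
    for (c, a), n_ca in joint.items():
        # p(c, a) * log2( p(c, a) / (p(c) * p(a)) )
        mi += (n_ca / n) * math.log2(n_ca * n / (cluster_counts[c] * attribute_counts[a]))
    return mi


# Toy data: cluster labels for four genes and one binary attribute per gene.
clusters = [0, 0, 1, 1]
informative = mutual_information(clusters, [True, True, False, False])
uninformative = mutual_information(clusters, [True, False, True, False])
print(informative, uninformative)
```

A clustering that matches the attribute exactly scores one bit here, while one that is independent of it scores zero.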

  17. Structuring osteosarcoma knowledge: an osteosarcoma-gene association database based on literature mining and manual annotation

    PubMed Central

    Poos, Kathrin; Smida, Jan; Nathrath, Michaela; Maugg, Doris; Baumhoer, Daniel; Neumann, Anna; Korsching, Eberhard

    2014-01-01

    Osteosarcoma (OS) is the most common primary bone cancer exhibiting high genomic instability. This genomic instability affects multiple genes and microRNAs to a varying extent depending on patient and tumor subtype. Massive research is ongoing to identify genes including their gene products and microRNAs that correlate with disease progression and might be used as biomarkers for OS. However, the genomic complexity hampers the identification of reliable biomarkers. Up to now, clinico-pathological factors are the key determinants to guide prognosis and therapeutic treatments. Each day, new studies about OS are published and complicate the acquisition of information to support biomarker discovery and therapeutic improvements. Thus, it is necessary to provide a structured and annotated view on the current OS knowledge that is quickly and easily accessible to researchers of the field. Therefore, we developed a publicly available database and Web interface that serves as a resource for OS-associated genes and microRNAs. Genes and microRNAs were collected using an automated dictionary-based gene recognition procedure followed by manual review and annotation by experts of the field. In total, 911 genes and 81 microRNAs related to 1331 PubMed abstracts were collected (last update: 29 October 2013). Users can evaluate genes and microRNAs according to their potential prognostic and therapeutic impact, the experimental procedures, the sample types, the biological contexts and microRNA target gene interactions. Additionally, a pathway enrichment analysis of the collected genes highlights different aspects of OS progression. OS requires pathways commonly deregulated in cancer but also features OS-specific alterations like deregulated osteoclast differentiation. To our knowledge, this is the first effort of an OS database containing manually reviewed and annotated up-to-date OS knowledge.
It might be a useful resource especially for the bone tumor research community, as specific information about genes or microRNAs is quickly and easily accessible. Hence, this platform can support the ongoing OS research and biomarker discovery. Database URL: http://osteosarcoma-db.uni-muenster.de PMID:24865352

  18. Structuring osteosarcoma knowledge: an osteosarcoma-gene association database based on literature mining and manual annotation.

    PubMed

    Poos, Kathrin; Smida, Jan; Nathrath, Michaela; Maugg, Doris; Baumhoer, Daniel; Neumann, Anna; Korsching, Eberhard

    2014-01-01

    Osteosarcoma (OS) is the most common primary bone cancer exhibiting high genomic instability. This genomic instability affects multiple genes and microRNAs to a varying extent depending on patient and tumor subtype. Massive research is ongoing to identify genes including their gene products and microRNAs that correlate with disease progression and might be used as biomarkers for OS. However, the genomic complexity hampers the identification of reliable biomarkers. Up to now, clinico-pathological factors are the key determinants to guide prognosis and therapeutic treatments. Each day, new studies about OS are published and complicate the acquisition of information to support biomarker discovery and therapeutic improvements. Thus, it is necessary to provide a structured and annotated view on the current OS knowledge that is quickly and easily accessible to researchers of the field. Therefore, we developed a publicly available database and Web interface that serves as a resource for OS-associated genes and microRNAs. Genes and microRNAs were collected using an automated dictionary-based gene recognition procedure followed by manual review and annotation by experts of the field. In total, 911 genes and 81 microRNAs related to 1331 PubMed abstracts were collected (last update: 29 October 2013). Users can evaluate genes and microRNAs according to their potential prognostic and therapeutic impact, the experimental procedures, the sample types, the biological contexts and microRNA target gene interactions. Additionally, a pathway enrichment analysis of the collected genes highlights different aspects of OS progression. OS requires pathways commonly deregulated in cancer but also features OS-specific alterations like deregulated osteoclast differentiation. To our knowledge, this is the first effort of an OS database containing manually reviewed and annotated up-to-date OS knowledge.
It might be a useful resource especially for the bone tumor research community, as specific information about genes or microRNAs is quickly and easily accessible. Hence, this platform can support the ongoing OS research and biomarker discovery. Database URL: http://osteosarcoma-db.uni-muenster.de. PMID:24865352

  19. Proteomics and transcriptomics of the BABA-induced resistance response in potato using a novel functional annotation approach

    PubMed Central

    2014-01-01

    Background Induced resistance (IR) can be part of a sustainable plant protection strategy against important plant diseases. β-aminobutyric acid (BABA) can induce resistance in a wide range of plants against several types of pathogens, including potato infected with Phytophthora infestans. However, the molecular mechanisms behind this are unclear and seem to be dependent on the system studied. To elucidate the defence responses activated by BABA in potato, a genome-wide transcript microarray analysis in combination with label-free quantitative proteomics analysis of the apoplast secretome were performed two days after treatment of the leaf canopy with BABA at two concentrations, 1 and 10 mM. Results Over 5000 transcripts were differentially expressed and over 90 secretome proteins changed in abundance indicating a massive activation of defence mechanisms with 10 mM BABA, the concentration effective against late blight disease. To aid analysis, we present a more comprehensive functional annotation of the microarray probes and gene models by retrieving information from orthologous gene families across 26 sequenced plant genomes. The new annotation provided GO terms to 8616 previously un-annotated probes. Conclusions BABA at 10 mM affected several processes related to plant hormones and amino acid metabolism. A major accumulation of PR proteins was also evident, and in the mevalonate pathway, genes involved in sterol biosynthesis were down-regulated, whereas several enzymes involved in the sesquiterpene phytoalexin biosynthesis were up-regulated. Interestingly, abscisic acid (ABA) responsive genes were not as clearly regulated by BABA in potato as previously reported in Arabidopsis. Together these findings provide candidates and markers for improved resistance in potato, one of the most important crops in the world. PMID:24773703

  20. VIRGO: computational prediction of gene functions.

    PubMed

    Massjouni, Naveed; Rivera, Corban G; Murali, T M

    2006-07-01

    Dramatic advances in sequencing technology and sophisticated experimental assays that interrogate the cell, combined with the public availability of the resulting data, herald the era of systems biology. However, the biological functions of more than 40% of the genes in sequenced genomes are unknown, posing a fundamental barrier to progress in systems biology. The large scale and diversity of available data require the development of techniques that can automatically utilize these datasets to make quantified and robust predictions of gene function that can be experimentally verified. We present a service called the VIRtual Gene Ontology (VIRGO) that (i) constructs a functional linkage network (FLN) from gene expression and molecular interaction data, (ii) labels genes in the FLN with their functional annotations in the Gene Ontology and (iii) systematically propagates these labels across the FLN in order to precisely predict the functions of unlabelled genes. VIRGO assigns confidence estimates to predicted functions so that a biologist can prioritize predictions for further experimental study. For each prediction, VIRGO also provides an informative 'propagation diagram' that traces the flow of information in the FLN that led to the prediction. VIRGO is available at http://whipple.cs.vt.edu:8080/virgo. PMID:16845022
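
    The three steps described in the VIRGO abstract (build a weighted network, label the annotated genes, propagate labels to unlabelled genes) can be illustrated with a generic sketch. This is not VIRGO's actual algorithm, which the abstract does not specify; it is a simple neighbour-averaging scheme, and the gene names, edge weights and the function name propagate_labels are all hypothetical. Edge weights are assumed to be positive.

```python
def propagate_labels(edges, labels, iterations=50):
    """Spread known labels over a weighted, undirected network.

    edges:  dict mapping (gene_a, gene_b) -> positive edge weight
    labels: dict mapping annotated genes to +1.0 (has the function)
            or -1.0 (does not); unlabelled genes are simply absent
    Returns a dict gene -> score in [-1, 1]; the sign is the predicted
    label and the magnitude is a rough confidence.
    """
    # Build an adjacency list from the edge dictionary.
    neighbours = {}
    for (a, b), weight in edges.items():
        neighbours.setdefault(a, []).append((b, weight))
        neighbours.setdefault(b, []).append((a, weight))

    # Unlabelled genes start at 0 (no evidence either way).
    scores = {gene: labels.get(gene, 0.0) for gene in neighbours}

    for _ in range(iterations):
        updated = {}
        for gene, nbrs in neighbours.items():
            if gene in labels:
                # Annotated genes keep their known label.
                updated[gene] = labels[gene]
            else:
                # Unlabelled genes take the weighted mean of their neighbours.
                total = sum(weight for _, weight in nbrs)
                updated[gene] = sum(weight * scores[n] for n, weight in nbrs) / total
        scores = updated
    return scores


# Toy network (hypothetical): a chain g1 - g2 - g3 - g4 with unit weights.
toy_edges = {("g1", "g2"): 1.0, ("g2", "g3"): 1.0, ("g3", "g4"): 1.0}
toy_labels = {"g1": 1.0, "g4": -1.0}
toy_scores = propagate_labels(toy_edges, toy_labels)
print(toy_scores)
```

    In this toy chain the unlabelled gene next to the positive example ends with a positive score and the one next to the negative example with a negative score. A real service would also need to calibrate such scores into the confidence estimates the abstract mentions.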

  1. Improved structural annotation of protein-coding genes in the Meloidogyne hapla genome using RNA-Seq

    PubMed Central

    Guo, Yuelong; Bird, David McK; Nielsen, Dahlia M

    2014-01-01

    As high-throughput cDNA sequencing (RNA-Seq) is increasingly applied to hypothesis-driven biological studies, the prediction of protein-coding genes based on these data is usurping strictly in silico approaches. Compared with computationally derived gene predictions, structural annotation is more accurate when based on biological evidence, particularly RNA-Seq data. Here, we refine the current genome annotation for the Meloidogyne hapla genome utilizing RNA-Seq data. Published structural annotation defines 14,420 protein-coding genes in the M. hapla genome. Of these, 25% (3751) were found to exhibit some incongruence with RNA-Seq data. Manual annotation enabled these discrepancies to be resolved. Our analysis revealed 544 new gene models that were missing from the prior annotation. Additionally, 1457 transcribed regions were newly identified on the ends of as-yet-unjoined contigs. We also searched for trans-spliced leaders, and based on RNA-Seq data, identified genes that appear to be trans-spliced. Four 22-bp trans-spliced leaders were identified using our pipeline, including the known trans-spliced leader, which is the M. hapla ortholog of SL1. In silico predictions of trans-splicing were validated by comparison with earlier results derived from an independent cDNA library constructed to capture trans-spliced transcripts. The new annotation, which we term HapPep5, is publicly available at www.hapla.org. PMID:25254153

  2. Integrative Annotation of 21,037 Human Genes Validated by Full-Length cDNA Clones

    SciTech Connect

    Imanishi, Tadashi; Itoh, Takeshi; Suzuki, Yutaka; O'Donovan, Claire; Fukuchi, Satoshi; Koyanagi, Kanako O.; Barrero, Roberto A.; Tamura, Takuro; Yamaguchi-Kabata, Yumi; Tanino, Motohiko; Yura, Kei; Miyazaki, Satoru; Ikeo, Kazuho; Homma, Keiichi; Kasprzyk, Arek; Nishikawa, Tetsuo; Hirakawa, Mika; Thierry-Mieg, Jean; Thierry-Mieg, Danielle; Ashurst, Jennifer; Jia, Libin; Nakao, Mitsuteru; Thomas, Michael A.; Mulder, Nicola; Karavidopoulou, Youla; Jin, Lihua; Kim, Sangsoo; Yasuda, Tomohiro; Lenhard, Boris; Eveno, Eric; Suzuki, Yoshiyuki; Yamasaki, Chisato; Takeda, Jun-ichi; Gough, Craig; Hilton, Phillip; Fujii, Yasuyuki; Sakai, Hiroaki; Tanaka, Susumu; Amid, Clara; Bellgard, Matthew; de Fatima Bonaldo, Maria; Bono Hidemasa; Bromberg, Susan K.; Brookes, Anthony J.; Bruford, Elspeth; Carninci Piero; Chelala, Claude; Couillault, Christine; de Souza, Sandro J.; Debily, Marie-Anne; Devignes, Marie-Dominique; Dubchak, Inna; Endo, Toshinori; Estreicher, Anne; Eyras, Eduardo; Fukami-Kobayashi, Kaoru; Gopinath, Gopal R.; Graudens, Esther; Hahn, Yoonsoo; Han, Michael; Han, Ze-Guang; Hanada, Kousuke; Hanaoka, Hideki; Harada, Erimi; Hashimoto, Katsuyuki; Hinz, Ursula; Hirai, Momoki; Hishiki, Teruyoshi; Hopkinson, Ian; Imbeaud, Sandrine; Inoko, Hidetoshi; Kanapin, Alexander; Kaneko, Yayoi; Kasukawa, Takeya; Kelso, Janet; Kersey, Paul; Kikuno Reiko; Kimura, Kouichi; Korn, Bernhard; Kuryshev, Vladimir; Makalowska, Izabela; Makino Takashi; Mano, Shuhei; Mariage-Samson, Regine; Mashima, Jun; Matsuda, Hideo; Mewes, Hans-Werner; Minoshima, Shinsei; Nagai, Keiichi; Nagasaki, Hideki; Nagata, Naoki; Nigam, Rajni; Ogasawara, Osamu; Ohara, Osamu; Ohtsubo, Masafumi; Okada, Norihiro; Okido, Toshihisa; Oota, Satoshi; Ota, Motonori; Ota, Toshio; Otsuki, Tetsuji; Piatier-Tonneau, Dominique; Poustka, Annemarie; Ren, Shuang-Xi; Saitou, Naruya; Sakai, Katsunaga; Sakamoto, Shigetaka; Sakate, Ryuichi; Schupp, Ingo; Servant, Florence; Sherry, Stephen; Shiba Rie; et al.

    2004-01-15

    The human genome sequence defines our inherent biological potential; the realization of the biology encoded therein requires knowledge of the function of each gene. Currently, our knowledge in this area is still limited. Several lines of investigation have been used to elucidate the structure and function of the genes in the human genome. Even so, gene prediction remains a difficult task, as the varieties of transcripts of a gene may vary to a great extent. We thus performed an exhaustive integrative characterization of 41,118 full-length cDNAs that capture the gene transcripts as complete functional cassettes, providing an unequivocal report of structural and functional diversity at the gene level. Our international collaboration has validated 21,037 human gene candidates by analysis of high-quality full-length cDNA clones through curation using unified criteria. This led to the identification of 5,155 new gene candidates. It also manifested the most reliable way to control the quality of the cDNA clones. We have developed a human gene database, called the H-Invitational Database (H-InvDB; http://www.h-invitational.jp/). It provides the following: integrative annotation of human genes, description of gene structures, details of novel alternative splicing isoforms, non-protein-coding RNAs, functional domains, subcellular localizations, metabolic pathways, predictions of protein three-dimensional structure, mapping of known single nucleotide polymorphisms (SNPs), identification of polymorphic microsatellite repeats within human genes, and comparative results with mouse full-length cDNAs. The H-InvDB analysis has shown that up to 4 percent of the human genome sequence (National Center for Biotechnology Information build 34 assembly) may contain misassembled or missing regions. 
We found that 6.5 percent of the human gene candidates (1,377 loci) did not have a good protein-coding open reading frame, of which 296 loci are strong candidates for non-protein-coding RNA genes. In addition, among 72,027 uniquely mapped SNPs and insertions/deletions localized within human genes, 13,215 nonsynonymous SNPs, 315 nonsense SNPs, and 452 indels occurred in coding regions. Together with 25 polymorphic microsatellite repeats present in coding regions, they may alter protein structure, causing phenotypic effects or resulting in disease. The H-InvDB platform represents a substantial contribution to resources needed for the exploration of human biology and pathology.

  3. Integrative Annotation of 21,037 Human Genes Validated by Full-Length cDNA Clones

    PubMed Central

    2004-01-01

    The human genome sequence defines our inherent biological potential; the realization of the biology encoded therein requires knowledge of the function of each gene. Currently, our knowledge in this area is still limited. Several lines of investigation have been used to elucidate the structure and function of the genes in the human genome. Even so, gene prediction remains a difficult task, as the varieties of transcripts of a gene may vary to a great extent. We thus performed an exhaustive integrative characterization of 41,118 full-length cDNAs that capture the gene transcripts as complete functional cassettes, providing an unequivocal report of structural and functional diversity at the gene level. Our international collaboration has validated 21,037 human gene candidates by analysis of high-quality full-length cDNA clones through curation using unified criteria. This led to the identification of 5,155 new gene candidates. It also manifested the most reliable way to control the quality of the cDNA clones. We have developed a human gene database, called the H-Invitational Database (H-InvDB; http://www.h-invitational.jp/). It provides the following: integrative annotation of human genes, description of gene structures, details of novel alternative splicing isoforms, non-protein-coding RNAs, functional domains, subcellular localizations, metabolic pathways, predictions of protein three-dimensional structure, mapping of known single nucleotide polymorphisms (SNPs), identification of polymorphic microsatellite repeats within human genes, and comparative results with mouse full-length cDNAs. The H-InvDB analysis has shown that up to 4% of the human genome sequence (National Center for Biotechnology Information build 34 assembly) may contain misassembled or missing regions. We found that 6.5% of the human gene candidates (1,377 loci) did not have a good protein-coding open reading frame, of which 296 loci are strong candidates for non-protein-coding RNA genes. 
In addition, among 72,027 uniquely mapped SNPs and insertions/deletions localized within human genes, 13,215 nonsynonymous SNPs, 315 nonsense SNPs, and 452 indels occurred in coding regions. Together with 25 polymorphic microsatellite repeats present in coding regions, they may alter protein structure, causing phenotypic effects or resulting in disease. The H-InvDB platform represents a substantial contribution to resources needed for the exploration of human biology and pathology. PMID:15103394

  4. GeneMANIA: Fast gene network construction and function prediction for Cytoscape

    PubMed Central

    Montojo, Jason; Zuberi, Khalid; Rodriguez, Harold; Bader, Gary D.; Morris, Quaid

    2014-01-01

    The GeneMANIA Cytoscape app enables users to construct a composite gene-gene functional interaction network from a gene list. The resulting network includes the genes most related to the original list, and functional annotations from Gene Ontology. The edges are annotated with details about the publication or data source the interactions were derived from. The app leverages GeneMANIA’s database of 1800+ networks, containing over 500 million interactions spanning 8 organisms: A. thaliana, C. elegans, D. melanogaster, D. rerio, H. sapiens, M. musculus, R. norvegicus, and S. cerevisiae. Users may also import their own organisms, networks, and expression profiles. The app is compatible with Cytoscape versions 2 and 3. PMID:25254104

  5. Comprehensive functional annotation of 18 missense mutations found in suspected hemochromatosis type 4 patients.

    PubMed

    Callebaut, Isabelle; Joubrel, Rozenn; Pissard, Serge; Kannengiesser, Caroline; Gérolami, Victoria; Ged, Cécile; Cadet, Estelle; Cartault, François; Ka, Chandran; Gourlaouen, Isabelle; Gourhant, Lénaick; Oudin, Claire; Goossens, Michel; Grandchamp, Bernard; De Verneuil, Hubert; Rochette, Jacques; Férec, Claude; Le Gac, Gérald

    2014-09-01

    Hemochromatosis type 4 is a rare form of primary iron overload transmitted as an autosomal dominant trait caused by mutations in the gene encoding the iron transport protein ferroportin 1 (SLC40A1). SLC40A1 mutations fall into two functional categories (loss- versus gain-of-function) underlying two distinct clinical entities (hemochromatosis type 4A versus type 4B). However, the vast majority of SLC40A1 mutations are rare missense variations, with only a few showing strong evidence of causality. The present study reports the results of an integrated approach collecting genetic and phenotypic data from 44 suspected hemochromatosis type 4 patients, with comprehensive structural and functional annotations. Causality was demonstrated for 10 missense variants, showing a clear dichotomy between the two hemochromatosis type 4 subtypes. Two subgroups of loss-of-function mutations were distinguished: one impairing cell-surface expression and one altering only iron egress. Additionally, a new gain-of-function mutation was identified, and the degradation of ferroportin on hepcidin binding was shown to probably depend on the integrity of a large extracellular loop outside of the hepcidin-binding domain. Eight further missense variations, on the other hand, were shown to have no discernible effects at either protein or RNA level; these were found in apparently isolated patients and were associated with a less severe phenotype. The present findings illustrate the importance of combining in silico and biochemical approaches to fully distinguish pathogenic SLC40A1 mutations from benign variants. This has profound implications for patient management. PMID:24714983

  6. Annokey: an annotation tool based on key term search of the NCBI Entrez Gene database

    PubMed Central

    2014-01-01

    Background The NCBI Entrez Gene and PubMed databases contain a wealth of high-quality information about genes for many different organisms. The NCBI Entrez online web-search interface is convenient for simple manual search for a small number of genes but impractical for the kinds of outputs seen in typical genomics projects. Results We have developed an efficient open source tool implemented in Python called Annokey, which annotates gene lists with the results of a keyword search of the NCBI Entrez Gene database and linked PubMed article information. The user steers the search by specifying a ranked list of keywords (including multi-word phrases and regular expressions) that are correlated with their topic of interest. Rank information of matched terms allows the user to guide further investigation. We applied Annokey to the entire human Entrez Gene database using the key-term “DNA repair” and assessed its performance in identifying the 176 members of a published “gold standard” list of genes established to be involved in this pathway. For this test case we observed a sensitivity and specificity of 97% and 96%, respectively. Conclusions Annokey facilitates the identification of genes related to an area of interest, a task which can be onerous if performed manually on a large number of genes. Annokey provides a way to capitalize on the high-quality information provided by the Entrez Gene database, allowing both scalability and compatibility with automated analysis pipelines, thus offering the potential to significantly enhance research productivity.
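
    The ranked keyword search and the sensitivity/specificity evaluation described in the Annokey abstract can be sketched as follows. This is an illustration only and does not use Annokey's code or command-line interface; the function names, gene identifiers and summary texts are invented for the example, and no Entrez query is performed.

```python
import re


def best_keyword_rank(text, ranked_patterns):
    """Return the 1-based rank of the highest-ranked pattern found in text.

    ranked_patterns is ordered from most to least relevant; each entry is a
    regular expression (a plain phrase is also a valid pattern).
    Returns None when no pattern matches.
    """
    for rank, pattern in enumerate(ranked_patterns, start=1):
        if re.search(pattern, text, flags=re.IGNORECASE):
            return rank
    return None


def sensitivity_specificity(predicted, gold, universe):
    """Compare a predicted gene set with a gold-standard set.

    Assumes the gold set and its complement within universe are both
    non-empty, otherwise a division by zero occurs.
    """
    true_pos = len(predicted & gold)
    false_neg = len(gold - predicted)
    false_pos = len(predicted - gold)
    true_neg = len(universe - predicted - gold)
    sensitivity = true_pos / (true_pos + false_neg)
    specificity = true_neg / (true_neg + false_pos)
    return sensitivity, specificity


# Invented gene summaries standing in for database records.
toy_records = {
    "geneA": "Involved in DNA repair and recombination.",
    "geneB": "Nucleotide excision repair factor.",
    "geneC": "Structural component of the ribosome.",
    "geneD": "Mismatch-repair pathway member.",
}
# Ranked keywords: a phrase, a second phrase, and a regular expression.
toy_patterns = [r"DNA repair", r"excision repair", r"mismatch[- ]repair"]

toy_ranks = {gene: best_keyword_rank(text, toy_patterns)
             for gene, text in toy_records.items()}
toy_hits = {gene for gene, rank in toy_ranks.items() if rank is not None}
print(toy_ranks)
print(sensitivity_specificity(toy_hits, {"geneA", "geneB"}, set(toy_records)))
```

    Keeping the rank of the best match, rather than a yes/no flag, mirrors the abstract's point that rank information lets the user decide which hits to investigate first.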

  7. Improved systematic tRNA gene annotation allows new insights into the evolution of mitochondrial tRNA structures and into the mechanisms of mitochondrial genome rearrangements

    PubMed Central

    Jühling, Frank; Pütz, Joern; Bernt, Matthias; Donath, Alexander; Middendorf, Martin; Florentz, Catherine; Stadler, Peter F.

    2012-01-01

    Transfer RNAs (tRNAs) are present in all types of cells as well as in organelles. tRNAs of animal mitochondria show a low level of primary sequence conservation and exhibit ‘bizarre’ secondary structures, lacking complete domains of the common cloverleaf. Such sequences are hard to detect and hence frequently missed in computational analyses and mitochondrial genome annotation. Here, we introduce an automatic annotation procedure for mitochondrial tRNA genes in Metazoa based on sequence and structural information in manually curated covariance models. The method, applied to re-annotate 1876 available metazoan mitochondrial RefSeq genomes, makes it possible to distinguish between remaining functional genes and degrading ‘pseudogenes’, even at early stages of divergence. The subsequent analysis of a comprehensive set of mitochondrial tRNA genes gives new insights into the evolution of structures of mitochondrial tRNA sequences as well as into the mechanisms of genome rearrangements. We find frequent losses of tRNA genes concentrated in basal Metazoa, frequent independent losses of individual parts of tRNA genes, particularly in Arthropoda, and widespread conserved overlaps of tRNAs in opposite reading direction. Direct evidence for several recent Tandem Duplication-Random Loss events is gained, demonstrating that this mechanism has an impact on the appearance of new mitochondrial gene orders. PMID:22139921

  8. Gene networks in Drosophila melanogaster: integrating experimental data to predict gene function

    PubMed Central

    Costello, James C; Dalkilic, Mehmet M; Beason, Scott M; Gehlhausen, Jeff R; Patwardhan, Rupali; Middha, Sumit; Eads, Brian D; Andrews, Justen R

    2009-01-01

    Background Discovering the functions of all genes is a central goal of contemporary biomedical research. Despite considerable effort, we are still far from achieving this goal in any metazoan organism. Collectively, the growing body of high-throughput functional genomics data provides evidence of gene function, but remains difficult to interpret. Results We constructed the first network of functional relationships for Drosophila melanogaster by integrating most of the available, comprehensive sets of genetic interaction, protein-protein interaction, and microarray expression data. The complete integrated network covers 85% of the currently known genes, which we refined to a high confidence network that includes 20,000 functional relationships among 5,021 genes. An analysis of the network revealed a remarkable concordance with prior knowledge. Using the network, we were able to infer a set of high-confidence Gene Ontology biological process annotations on 483 of the roughly 5,000 previously unannotated genes. We also show that this approach is a means of inferring annotations on a class of genes that cannot be annotated based solely on sequence similarity. Lastly, we demonstrate the utility of the network through reanalyzing gene expression data to both discover clusters of coregulated genes and compile a list of candidate genes related to specific biological processes. Conclusions Here we present the first genome-wide functional gene network in D. melanogaster. The network enables the exploration, mining, and reanalysis of experimental data, as well as the interpretation of new data. The inferred annotations provide testable hypotheses of previously uncharacterized genes. PMID:19758432

  9. Analysis and functional annotation of expressed sequence tags from the Asian longhorned beetle, Anoplophora glabripennis.

    PubMed

    Hunter, Wayne B; Smith, Michael T; Hunnicutt, Laura E

    2009-01-01

    The Asian longhorned beetle, Anoplophora glabripennis (Motschulsky) (Coleoptera: Cerambycidae), is one of the most economically and ecologically devastating forest insects to invade North America in recent years. Despite its substantial impact, limited effort has been expended to define the genetic and molecular make-up of this species. Considering the significant role played by late-stadia larvae in host tree decimation, a small-scale EST sequencing project was done using a cDNA library constructed from 5th-instar A. glabripennis. The resultant dataset consisted of 599 high quality ESTs that, upon assembly, yielded 381 potentially unique transcripts. Each of these transcripts was catalogued as to putative molecular function, biological process, and associated cellular component according to the Gene Ontology classification system. Using this annotated dataset, a subset of assembled sequences was identified that are putatively associated with A. glabripennis development and metamorphosis. This work will contribute to understanding of the diverse molecular mechanisms that underlie coleopteran morphogenesis and enable the future development of novel control strategies for management of this insect pest. PMID:19619025

  10. Analysis and Functional Annotation of Expressed Sequence Tags from the Asian Longhorned Beetle, Anoplophora glabripennis

    PubMed Central

    Hunter, Wayne B.; Smith, Michael T.; Hunnicutt, Laura E.

    2009-01-01

    The Asian longhorned beetle, Anoplophora glabripennis (Motschulsky) (Coleoptera: Cerambycidae), is one of the most economically and ecologically devastating forest insects to invade North America in recent years. Despite its substantial impact, limited effort has been expended to define the genetic and molecular make-up of this species. Considering the significant role played by late-stadia larvae in host tree decimation, a small-scale EST sequencing project was done using a cDNA library constructed from 5th-instar A. glabripennis. The resultant dataset consisted of 599 high quality ESTs that, upon assembly, yielded 381 potentially unique transcripts. Each of these transcripts was catalogued as to putative molecular function, biological process, and associated cellular component according to the Gene Ontology classification system. Using this annotated dataset, a subset of assembled sequences was identified that are putatively associated with A. glabripennis development and metamorphosis. This work will contribute to understanding of the diverse molecular mechanisms that underlie coleopteran morphogenesis and enable the future development of novel control strategies for management of this insect pest. PMID:19619025

  11. Cross-Population Joint Analysis of eQTLs: Fine Mapping and Functional Annotation

    PubMed Central

    Wen, Xiaoquan; Luca, Francesca; Pique-Regi, Roger

    2015-01-01

    Mapping expression quantitative trait loci (eQTLs) has been shown to be a powerful tool to uncover the genetic underpinnings of many complex traits at the molecular level. In this paper, we present an integrative analysis approach that leverages eQTL data collected from multiple population groups. In particular, our approach effectively identifies multiple independent cis-eQTL signals that are consistent across populations, accounting for population heterogeneity in allele frequencies and linkage disequilibrium patterns. Furthermore, by integrating genomic annotations, our analysis framework enables high-resolution functional analysis of eQTLs. We applied our statistical approach to analyze the GEUVADIS data consisting of samples from five population groups. From this analysis, we concluded that (i) joint analysis across population groups greatly improves the power of eQTL discovery and the resolution of fine mapping of causal eQTLs; (ii) many genes harbor multiple independent eQTLs in their cis regions; and (iii) genetic variants that disrupt transcription factor binding are significantly enriched in eQTLs (p-value = 4.93 × 10^-22). PMID:25906321

  12. Developmental gene discovery in a hemimetabolous insect: de novo assembly and annotation of a transcriptome for the cricket Gryllus bimaculatus.

    PubMed

    Zeng, Victor; Ewen-Campen, Ben; Horch, Hadley W; Roth, Siegfried; Mito, Taro; Extavour, Cassandra G

    2013-01-01

    Most genomic resources available for insects represent the Holometabola, which are insects that undergo complete metamorphosis like beetles and flies. In contrast, the Hemimetabola (direct developing insects), representing the basal branches of the insect tree, have very few genomic resources. We have therefore created a large and publicly available transcriptome for the hemimetabolous insect Gryllus bimaculatus (cricket), a well-developed laboratory model organism whose potential for functional genetic experiments is currently limited by the absence of genomic resources. cDNA was prepared using mRNA obtained from adult ovaries containing all stages of oogenesis, and from embryo samples on each day of embryogenesis. Using 454 Titanium pyrosequencing, we sequenced over four million raw reads, and assembled them into 21,512 isotigs (predicted transcripts) and 120,805 singletons with an average coverage per base pair of 51.3. We annotated the transcriptome manually for over 400 conserved genes involved in embryonic patterning, gametogenesis, and signaling pathways. BLAST comparison of the transcriptome against the NCBI non-redundant protein database (nr) identified significant similarity to nr sequences for 55.5% of transcriptome sequences, and suggested that the transcriptome may contain 19,874 unique transcripts. For predicted transcripts without significant similarity to known sequences, we assessed their similarity to other orthopteran sequences, and determined that these transcripts contain recognizable protein domains, largely of unknown function. We created a searchable, web-based database to allow public access to all raw, assembled and annotated data. This database is to our knowledge the largest de novo assembled and annotated transcriptome resource available for any hemimetabolous insect. We therefore anticipate that these data will contribute significantly to more effective and higher-throughput deployment of molecular analysis tools in Gryllus. 
PMID:23671567

  13. Developmental Gene Discovery in a Hemimetabolous Insect: De Novo Assembly and Annotation of a Transcriptome for the Cricket Gryllus bimaculatus

    PubMed Central

    Zeng, Victor; Ewen-Campen, Ben; Horch, Hadley W.; Roth, Siegfried; Mito, Taro; Extavour, Cassandra G.

    2013-01-01

    Most genomic resources available for insects represent the Holometabola, which are insects that undergo complete metamorphosis like beetles and flies. In contrast, the Hemimetabola (direct developing insects), representing the basal branches of the insect tree, have very few genomic resources. We have therefore created a large and publicly available transcriptome for the hemimetabolous insect Gryllus bimaculatus (cricket), a well-developed laboratory model organism whose potential for functional genetic experiments is currently limited by the absence of genomic resources. cDNA was prepared using mRNA obtained from adult ovaries containing all stages of oogenesis, and from embryo samples on each day of embryogenesis. Using 454 Titanium pyrosequencing, we sequenced over four million raw reads, and assembled them into 21,512 isotigs (predicted transcripts) and 120,805 singletons with an average coverage per base pair of 51.3. We annotated the transcriptome manually for over 400 conserved genes involved in embryonic patterning, gametogenesis, and signaling pathways. BLAST comparison of the transcriptome against the NCBI non-redundant protein database (nr) identified significant similarity to nr sequences for 55.5% of transcriptome sequences, and suggested that the transcriptome may contain 19,874 unique transcripts. For predicted transcripts without significant similarity to known sequences, we assessed their similarity to other orthopteran sequences, and determined that these transcripts contain recognizable protein domains, largely of unknown function. We created a searchable, web-based database to allow public access to all raw, assembled and annotated data. This database is to our knowledge the largest de novo assembled and annotated transcriptome resource available for any hemimetabolous insect. We therefore anticipate that these data will contribute significantly to more effective and higher-throughput deployment of molecular analysis tools in Gryllus. 
PMID:23671567

  14. De Novo Assembly, Gene Annotation, and Marker Discovery in Stored-Product Pest Liposcelis entomophila (Enderlein) Using Transcriptome Sequences

    PubMed Central

    Wei, Dan-Dan; Chen, Er-Hu; Ding, Tian-Bo; Chen, Shi-Chun; Dou, Wei; Wang, Jin-Jun

    2013-01-01

    Background As a major stored-product pest insect, Liposcelis entomophila has developed high levels of resistance to various insecticides in grain storage systems. However, the molecular mechanisms underlying resistance and environmental stress have not been characterized. To date, there is a lack of genomic information for this species. Therefore, studies aimed at profiling the L. entomophila transcriptome would provide a better understanding of the biological functions at the molecular level. Methodology/Principal Findings We applied Illumina sequencing technology to sequence the transcriptome of L. entomophila. A total of 54,406,328 clean reads were obtained and de novo assembled into 54,220 unigenes, with an average length of 571 bp. Through a similarity search, 33,404 (61.61%) unigenes were matched to known proteins in the NCBI non-redundant (Nr) protein database. These unigenes were further functionally annotated with gene ontology (GO), cluster of orthologous groups of proteins (COG), and Kyoto Encyclopedia of Genes and Genomes (KEGG) databases. A large number of genes potentially involved in insecticide resistance were manually curated, including 68 putative cytochrome P450 genes, 37 putative glutathione S-transferase (GST) genes, 19 putative carboxyl/cholinesterase (CCE) genes, and 126 other transcripts containing target site sequences or encoding detoxification genes representing eight types of resistance enzymes. Furthermore, to gain insight into the molecular basis of the L. entomophila response to thermal stresses, 25 heat shock protein (Hsp) genes were identified. In addition, 1,100 SSRs and 57,757 SNPs were detected and 231 pairs of SSR primers were designed for investigating genetic diversity in the future. Conclusions/Significance We developed a comprehensive transcriptomic database for L. entomophila. 
These sequences and putative molecular markers would further promote our understanding of the molecular mechanisms underlying insecticide resistance or environmental stress, and will facilitate studies on population genetics for psocids, as well as providing useful information for functional genomic research in the future. PMID:24244605

  15. NuChart: an R package to study gene spatial neighbourhoods with multi-omics annotations.

    PubMed

    Merelli, Ivan; Liò, Pietro; Milanesi, Luciano

    2013-01-01

    Long-range chromosomal associations between genomic regions, and their repositioning in the 3D space of the nucleus, are now considered to be key contributors to the regulation of gene expression and important links have been highlighted with other genomic features involved in DNA rearrangements. Recent Chromosome Conformation Capture (3C) measurements performed with high throughput sequencing (Hi-C) and molecular dynamics studies show that there is a large correlation between colocalization and coregulation of genes, but this important research is hampered by the lack of biologist-friendly analysis and visualisation software. Here, we describe NuChart, an R package that allows the user to annotate and statistically analyse a list of input genes with information relying on Hi-C data, integrating knowledge about genomic features that are involved in the chromosome spatial organization. NuChart works directly with sequenced reads to identify the related Hi-C fragments, with the aim of creating gene-centric neighbourhood graphs on which multi-omics features can be mapped. Predictions about CTCF binding sites, isochores and cryptic Recombination Signal Sequences are provided directly with the package for mapping, although other annotation data in bed format can be used (such as methylation profiles and histone patterns). Gene expression data can be automatically retrieved and processed from the Gene Expression Omnibus and ArrayExpress repositories to highlight the expression profile of genes in the identified neighbourhood. Moreover, statistical inferences about the graph structure and correlations between its topology and multi-omics features can be performed using Exponential-family Random Graph Models. The Hi-C fragment visualisation provided by NuChart allows the comparison of cells in different conditions, thus providing the possibility of identifying novel biomarkers. 
NuChart is compliant with the Bioconductor standard and it is freely available at ftp://fileserver.itb.cnr.it/nuchart. PMID:24069388

  16. Involving undergraduates in the annotation and analysis of global gene expression studies: creation of a maize shoot apical meristem expression database.

    PubMed

    Buckner, Brent; Beck, Jon; Browning, Kate; Fritz, Ashleigh; Grantham, Lisa; Hoxha, Eneda; Kamvar, Zhian; Lough, Ashley; Nikolova, Olga; Schnable, Patrick S; Scanlon, Michael J; Janick-Buckner, Diane

    2007-06-01

    Through a multi-university and interdisciplinary project we have involved undergraduate biology and computer science research students in the functional annotation of maize genes and the analysis of their microarray expression patterns. We have created a database to house the results of our functional annotation of >4400 genes identified as being differentially regulated in the maize shoot apical meristem (SAM). This database is located at http://sam.truman.edu and is now available for public use. The undergraduate students involved in constructing this unique SAM database received hands-on training in an intellectually challenging environment, which has prepared them for graduate and professional careers in biological sciences. We describe our experiences with this project as a model for effective research-based teaching of undergraduate biology and computer science students, as well as for a rich professional development experience for faculty at predominantly undergraduate institutions. PMID:17409087

  17. Transcriptome assembly, gene annotation and tissue gene expression atlas of the rainbow trout

    Technology Transfer Automated Retrieval System (TEKTRAN)

    Efforts to obtain a comprehensive genome sequence for rainbow trout are ongoing and will be complemented by transcriptome information that will enhance genome assembly and annotation. Previously, we reported a transcriptome reference sequence using a 19X coverage of Sanger and 454-pyrosequencing dat...

  18. Annotation and re-sequencing of genes from de novo transcriptome assembly of Abies alba (Pinaceae)

    PubMed Central

    Roschanski, Anna M.; Fady, Bruno; Ziegenhagen, Birgit; Liepelt, Sascha

    2013-01-01

    • Premise of the study: We present a protocol for the annotation of transcriptome sequence data and the identification of candidate genes therein using the example of the nonmodel conifer Abies alba. • Methods and Results: A normalized cDNA library was built from an A. alba seedling. The sequencing on a 454 platform yielded more than 1.5 million reads that were de novo assembled into 25149 contigs. Two complementary approaches were applied to annotate gene fragments that code for (1) well-known proteins and (2) proteins that are potentially adaptively relevant. Primer development and testing yielded 88 amplicons that could successfully be resequenced from genomic DNA. • Conclusions: The annotation workflow offers an efficient way to identify potential adaptively relevant genes from the large quantity of transcriptome sequence data. The primer set presented should be prioritized for single-nucleotide polymorphism detection in adaptively relevant genes in A. alba. PMID:25202477

  19. Discovery of germline-related genes in Cephalochordate amphioxus: A genome wide survey using genome annotation and transcriptome data.

    PubMed

    Yue, Jia-Xing; Li, Kun-Lung; Yu, Jr-Kai

    2015-12-01

    The generation of germline cells is a critical process in the reproduction of multicellular organisms. Studies in animal models have identified a common repertoire of genes that play essential roles in primordial germ cell (PGC) formation. However, comparative studies also indicate that the timing and regulation of this core genetic program vary considerably in different animals, raising the intriguing questions regarding the evolution of PGC developmental mechanisms in metazoans. Cephalochordates (commonly called amphioxus or lancelets) represent one of the invertebrate chordate groups and can provide important information about the evolution of developmental mechanisms in the chordate lineage. In this study, we used genome and transcriptome data to identify germline-related genes in two distantly related cephalochordate species, Branchiostoma floridae and Asymmetron lucayanum. Branchiostoma and Asymmetron diverged more than 120 MYA, and the most conspicuous difference between them is their gonadal morphology. We used important germline developmental genes in several model animals to search the amphioxus genome and transcriptome dataset for conserved homologs. We also annotated the assembled transcriptome data using Gene Ontology (GO) terms to facilitate the discovery of putative genes associated with germ cell development and reproductive functions in amphioxus. We further confirmed the expression of 14 genes in developing oocytes or mature eggs using whole mount in situ hybridization, suggesting their potential functions in amphioxus germ cell development. The results of this global survey provide a useful resource for testing potential functions of candidate germline-related genes in cephalochordates and for investigating differences in gonad developmental mechanisms between Branchiostoma and Asymmetron species. PMID:25847029

  20. BIOFILTER AS A FUNCTIONAL ANNOTATION PIPELINE FOR COMMON AND RARE COPY NUMBER BURDEN.

    PubMed

    Kim, Dokyoon; Lucas, Anastasia; Glessner, Joseph; Verma, Shefali S; Bradford, Yuki; Li, Ruowang; Frase, Alex T; Hakonarson, Hakon; Peissig, Peggy; Brilliant, Murray; Ritchie, Marylyn D

    2016-01-01

    Recent studies on copy number variation (CNV) have suggested that an increasing burden of CNVs is associated with susceptibility or resistance to disease. A large number of genes or genomic loci contribute to complex diseases such as autism. Thus, total genomic copy number burden, as an accumulation of copy number change, is a meaningful measure of genomic instability to identify the association between global genetic effects and phenotypes of interest. However, no systematic annotation pipeline has been developed to interpret biological meaning based on the accumulation of copy number change across the genome associated with a phenotype of interest. In this study, we develop a comprehensive and systematic pipeline for annotating copy number variants into genes/genomic regions and subsequently pathways and other gene groups using Biofilter - a bioinformatics tool that aggregates over a dozen publicly available databases of prior biological knowledge. Next we conduct enrichment tests of biologically defined groupings of CNVs including genes, pathways, Gene Ontology, or protein families. We applied the proposed pipeline to a CNV dataset from the Marshfield Clinic Personalized Medicine Research Project (PMRP) in a quantitative trait phenotype derived from the electronic health record - total cholesterol. We identified several significant pathways such as toll-like receptor signaling pathway and hepatitis C pathway, gene ontologies (GOs) of nucleoside triphosphatase activity (NTPase) and response to virus, and protein families such as cell morphogenesis that are associated with the total cholesterol phenotype based on CNV profiles (permutation p-value < 0.01). Based on the copy number burden analysis, it follows that the more and larger the copy number changes, the more likely that one or more target genes that influence disease risk and phenotypic severity will be affected. Thus, our study suggests the proposed enrichment pipeline could improve the interpretability of copy number burden analysis where hundreds of loci or genes contribute toward disease susceptibility via biological knowledge groups such as pathways. This CNV annotation pipeline with Biofilter can be used for CNV data from any genotyping or sequencing platform and to explore CNV enrichment for any traits or phenotypes. Biofilter continues to be a powerful bioinformatics tool for annotating, filtering, and constructing biologically informed models for association analysis - now including copy number variants. PMID:26776200

  1. BIOFILTER AS A FUNCTIONAL ANNOTATION PIPELINE FOR COMMON AND RARE COPY NUMBER BURDEN

    PubMed Central

    KIM, DOKYOON; LUCAS, ANASTASIA; GLESSNER, JOSEPH; VERMA, SHEFALI S.; BRADFORD, YUKI; LI, RUOWANG; FRASE, ALEX T.; HAKONARSON, HAKON; PEISSIG, PEGGY; BRILLIANT, MURRAY; RITCHIE, MARYLYN D.

    2015-01-01

    Recent studies on copy number variation (CNV) have suggested that an increasing burden of CNVs is associated with susceptibility or resistance to disease. A large number of genes or genomic loci contribute to complex diseases such as autism. Thus, total genomic copy number burden, as an accumulation of copy number change, is a meaningful measure of genomic instability to identify the association between global genetic effects and phenotypes of interest. However, no systematic annotation pipeline has been developed to interpret biological meaning based on the accumulation of copy number change across the genome associated with a phenotype of interest. In this study, we develop a comprehensive and systematic pipeline for annotating copy number variants into genes/genomic regions and subsequently pathways and other gene groups using Biofilter – a bioinformatics tool that aggregates over a dozen publicly available databases of prior biological knowledge. Next we conduct enrichment tests of biologically defined groupings of CNVs including genes, pathways, Gene Ontology, or protein families. We applied the proposed pipeline to a CNV dataset from the Marshfield Clinic Personalized Medicine Research Project (PMRP) in a quantitative trait phenotype derived from the electronic health record – total cholesterol. We identified several significant pathways such as toll-like receptor signaling pathway and hepatitis C pathway, gene ontologies (GOs) of nucleoside triphosphatase activity (NTPase) and response to virus, and protein families such as cell morphogenesis that are associated with the total cholesterol phenotype based on CNV profiles (permutation p-value < 0.01). Based on the copy number burden analysis, it follows that the more and larger the copy number changes, the more likely that one or more target genes that influence disease risk and phenotypic severity will be affected. Thus, our study suggests the proposed enrichment pipeline could improve the interpretability of copy number burden analysis where hundreds of loci or genes contribute toward disease susceptibility via biological knowledge groups such as pathways. This CNV annotation pipeline with Biofilter can be used for CNV data from any genotyping or sequencing platform and to explore CNV enrichment for any traits or phenotypes. Biofilter continues to be a powerful bioinformatics tool for annotating, filtering, and constructing biologically informed models for association analysis – now including copy number variants. PMID:26776200
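
    The abstract above reports group-level associations at a permutation p-value threshold. As a rough illustration of what a permutation p-value is, the sketch below shuffles group membership labels and compares a simple statistic against its shuffled distribution. It is a generic label-permutation test under stated assumptions, not the Biofilter pipeline: the function name, the mean-difference statistic, and the toy cholesterol values are all hypothetical.

```python
import random

def permutation_p_value(trait, in_group, n_perm=2000, seed=0):
    """Two-sided permutation p-value for a difference in mean trait value.

    trait    -- one numeric trait value per individual (hypothetical data)
    in_group -- booleans: does the individual carry a CNV in the gene group?
    """
    rng = random.Random(seed)

    def stat(labels):
        carriers = [t for t, g in zip(trait, labels) if g]
        others = [t for t, g in zip(trait, labels) if not g]
        return abs(sum(carriers) / len(carriers) - sum(others) / len(others))

    observed = stat(in_group)
    labels = list(in_group)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)  # break any link between trait and membership
        if stat(labels) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)

# Toy data: five carriers with higher values, five non-carriers with lower ones.
toy_trait = [240, 250, 245, 255, 260, 180, 175, 190, 185, 170]
toy_membership = [True] * 5 + [False] * 5
p_value = permutation_p_value(toy_trait, toy_membership)
print(f"permutation p-value: {p_value:.4f}")
```

    Shuffling the labels rather than the trait values keeps the group sizes fixed, so every permuted statistic is computed on the same number of carriers as the observed one.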

  2. Using The ENCODE Resource For Functional Annotation Of Genetic Variants

    PubMed Central

    Pazin, Michael J.

    2015-01-01

    Summary: This article illustrates the use of the Encyclopedia of DNA Elements (ENCODE) resource to generate or refine hypotheses from genomic data on disease and other phenotypic traits. First, the goals and history of ENCODE and related epigenomics projects are reviewed. Second, the rationale for ENCODE and the major data types used by ENCODE are briefly described, as are some standard heuristics for their interpretation. Third, the use of the ENCODE resource is examined. Standard use cases for ENCODE, accessing the ENCODE resource, and accessing data from related projects are discussed. Finally, access to resources from ENCODE and related epigenomics projects is reviewed. (Although the focus of this article is the use of ENCODE data, some of the same approaches can be used with the data from other projects.) While this article is focused on the case of interpreting genetic variation data, essentially the same approaches can be used with the ENCODE resource, or with data from other projects, to interpret epigenomic and gene regulation data, with appropriate modification (Rakyan et al. 2011; Ng et al. 2012). Such approaches could allow investigators to use genomic methods to study environmental and stochastic processes, in addition to genetic processes. PMID:25762420

  3. Annotated genetic linkage maps of Pinus pinaster Ait. from a Central Spain population using microsatellite and gene based markers

    PubMed Central

    2012-01-01

    Background: Pinus pinaster Ait. is a major resin producing species in Spain. Genetic linkage mapping can facilitate marker-assisted selection (MAS) through the identification of Quantitative Trait Loci and selection of allelic variants of interest in breeding populations. In this study, we report annotated genetic linkage maps for two individuals (C14 and C15) belonging to a breeding program aiming to increase resin production. We use different types of DNA markers, including last-generation molecular markers. Results: We obtained 13 and 14 linkage groups for C14 and C15 maps, respectively. A total of 211 and 215 markers were positioned on the respective maps, and estimated genome length was between 1,870 and 2,166 cM, respectively, which represents nearly 65% genome coverage. Comparative mapping with previously developed genetic linkage maps for P. pinaster based on about 60 common markers enabled aligning linkage groups to this reference map. The comparison of our annotated linkage maps and linkage maps reporting QTL information revealed 11 annotated SNPs in candidate genes that co-localized with previously reported QTLs for wood properties and water use efficiency. Conclusions: This study provides genetic linkage maps from a Spanish population that shows high levels of genetic divergence with French populations from which segregating progenies have been previously mapped. These genetic maps will be of interest to construct a reliable consensus linkage map for the species. The importance of developing functional genetic linkage maps is highlighted, especially when working with breeding populations, for their future application in MAS for traits of interest. PMID:23036012

  4. SNPit: a federated data integration system for the purpose of functional SNP annotation.

    PubMed

    Shen, Terry H; Carlson, Christopher S; Tarczy-Hornoch, Peter

    2009-08-01

    Genome-wide association studies can potentially identify the genetic causes behind the majority of human diseases. With the advent of more advanced genotyping techniques, there is now an explosion of data gathered on single nucleotide polymorphisms (SNPs). The need exists for an integrated system that can provide up-to-date functional annotation information on SNPs. We have developed the SNP Integration Tool (SNPit) system to address this need. Built upon a federated data integration system, SNPit provides current information on a comprehensive list of SNP data sources. Additional logical inference analysis was included through an inference engine plug-in. The SNPit web servlet is available online for use. SNPit allows users to go to one source for up-to-date information on the functional annotation of SNPs. A tool that can help to integrate and analyze the potential functional significance of SNPs is important for understanding the results from genome-wide association studies. PMID:19327864

  5. Coordinated international action to accelerate genome-to-phenome with FAANG, The Functional Annotation of Animal Genomes project

    Technology Transfer Automated Retrieval System (TEKTRAN)

    We describe the organization of a nascent international effort - the "Functional Annotation of ANimal Genomes" project - whose aim is to produce comprehensive maps of functional elements in the genomes of domesticated animal species....

  6. Analysis of CATMA transcriptome data identifies hundreds of novel functional genes and improves gene models in the Arabidopsis genome

    PubMed Central

    Aubourg, Sébastien; Martin-Magniette, Marie-Laure; Brunaud, Véronique; Taconnat, Ludivine; Bitton, Frédérique; Balzergue, Sandrine; Jullien, Pauline E; Ingouff, Mathieu; Thareau, Vincent; Schiex, Thomas; Lecharny, Alain; Renou, Jean-Pierre

    2007-01-01

    Background Since the finishing of the sequencing of the Arabidopsis thaliana genome, the Arabidopsis community and the annotator centers have been working on the improvement of gene annotation at the structural and functional levels. In this context, we have used the large CATMA resource on the Arabidopsis transcriptome to search for genes missed by different annotation processes. Probes on the CATMA microarrays are specific gene sequence tags (GSTs) based on the CDS models predicted by the Eugene software. Among the 24 576 CATMA v2 GSTs, 677 are in regions considered as intergenic by the TAIR annotation. We analyzed the cognate transcriptome data in the CATMA resource and carried out data-mining to characterize novel genes and improve gene models. Results The statistical analysis of the results of more than 500 hybridized samples distributed among 12 organs provides an experimental validation for 465 novel genes. The hybridization evidence was confirmed by RT-PCR approaches for 88% of the 465 novel genes. Comparisons with the current annotation show that these novel genes often encode small proteins, with an average size of 137 aa. Our approach has also led to the improvement of pre-existing gene models through both the extension of 16 CDS and the identification of 13 gene models erroneously constituted of two merged CDS. Conclusion This work is a noticeable step forward in the improvement of the Arabidopsis genome annotation. We increased the number of Arabidopsis validated genes by 465 novel transcribed genes to which we associated several functional annotations such as expression profiles, sequence conservation in plants, cognate transcripts and protein motifs. PMID:17980019

  7. Protein intrinsic disorder within the Potyvirus genus: from proteome-wide analysis to functional annotation.

    PubMed

    Charon, Justine; Theil, Sébastien; Nicaise, Valérie; Michon, Thierry

    2016-01-26

    Within proteins, intrinsically disordered regions (IDRs) are devoid of stable secondary and tertiary structures under physiological conditions and rather exist as dynamic ensembles of inter-converting conformers. Although ubiquitous in all domains of life, the intrinsic disorder content is highly variable in viral genomes. Over the years, functional annotations of disordered regions at the scale of the whole proteome have been conducted for several animal viruses. But to date, similar studies applied to plant viruses are still missing. Based on disorder prediction tools combined with annotation programs and evolutionary studies, we analyzed the intrinsic disorder content in Potyvirus, using a 10-species dataset representative of this genus diversity. In this paper, we revealed that: (i) the Potyvirus proteome displays high disorder content, (ii) disorder is conserved during Potyvirus evolution, suggesting a functional advantage of IDRs, (iii) IDRs evolve faster than ordered regions, and (iv) IDRs may be associated with major biological functions required for the Potyvirus cycle. Notably, the proteins P1, Coat protein (CP) and Viral genome-linked protein (VPg) display a high content of conserved disorder, enriched in specific motifs mimicking eukaryotic functional modules and suggesting strategies of host machinery hijacking. In these three proteins, IDRs are particularly conserved despite their high amino acid polymorphism, indicating a link to adaptive processes. Through this comprehensive study, we further investigate the biological relevance of intrinsic disorder in Potyvirus biology and we propose a functional annotation of potyviral proteome IDRs. PMID:26699268

  8. Estimating the annotation error rate of curated GO database sequence annotations

    PubMed Central

    Jones, Craig E; Brown, Alfred L; Baumann, Ute

    2007-01-01

    Background Annotations that describe the function of sequences are enormously important to researchers during laboratory investigations and when making computational inferences. However, there has been little investigation into the data quality of sequence function annotations. Here we have developed a new method of estimating the error rate of curated sequence annotations, and applied this to the Gene Ontology (GO) sequence database (GOSeqLite). This method involved artificially adding errors to sequence annotations at known rates, and used regression to model the impact on the precision of annotations based on BLAST matched sequences. Results We estimated the error rate of curated GO sequence annotations in the GOSeqLite database (March 2006) at between 28% and 30%. Annotations made without use of sequence similarity based methods (non-ISS) had an estimated error rate of between 13% and 18%. Annotations made with the use of sequence similarity methodology (ISS) had an estimated error rate of 49%. Conclusion While the overall error rate is reasonably low, it would be prudent to treat all ISS annotations with caution. Electronic annotators that use ISS annotations as the basis of predictions are likely to have higher false prediction rates, and for this reason designers of these systems should consider avoiding ISS annotations where possible. Electronic annotators that use ISS annotations to make predictions should be viewed sceptically. We recommend that curators thoroughly review ISS annotations before accepting them as valid. Overall, users of curated sequence annotations from the GO database should feel assured that they are using a comparatively high quality source of information. PMID:17519041

  9. The FEATURE framework for protein function annotation: modeling new functions, improving performance, and extending to novel applications

    PubMed Central

    Halperin, Inbal; Glazer, Dariya S; Wu, Shirley; Altman, Russ B

    2008-01-01

    Structural genomics efforts contribute new protein structures that often lack significant sequence and fold similarity to known proteins. Traditional sequence and structure-based methods may not be sufficient to annotate the molecular functions of these structures. Techniques that combine structural and functional modeling can be valuable for functional annotation. FEATURE is a flexible framework for modeling and recognition of functional sites in macromolecular structures. Here, we present an overview of the main components of the FEATURE framework, and describe the recent developments in its use. These include automating training sets selection to increase functional coverage, coupling FEATURE to structural diversity generating methods such as molecular dynamics simulations and loop modeling methods to improve performance, and using FEATURE in large-scale modeling and structure determination efforts. PMID:18831785

  10. Generation, analysis and functional annotation of expressed sequence tags from the sheepshead minnow (Cyprinodon variegatus)

    PubMed Central

    2010-01-01

    Background: Sheepshead minnow (Cyprinodon variegatus) are small fish capable of withstanding exposure to very low levels of dissolved oxygen, as well as extreme temperatures and salinities. It is an important model in understanding the impacts and biological response to hypoxia and co-occurring compounding stressors such as polycyclic aromatic hydrocarbons, endocrine disrupting chemicals, metals and herbicides. Here, we initiated a project to sequence and analyze over 10,000 ESTs generated from the Sheepshead minnow (Cyprinodon variegatus) as a resource for investigating stressor responses. Results: We sequenced 10,858 EST clones using a normalized cDNA library made from larval, embryonic and adult suppression subtractive hybridization-PCR (SSH) libraries. Post-sequencing processing led to 8,099 high quality sequences. Clustering analysis of these ESTs identified 4,223 unique sequences containing 1,053 contigs and 3,170 singletons. BLASTX searches produced 1,394 significant (E-value < 10^-5) hits and further Gene Ontology (GO) analysis annotated 388 of these genes. All the EST sequences were deposited in the Expressed Sequence Tags database (dbEST) of GenBank (GenBank: GE329585 to GE337683). Gene discovery and annotations are presented and discussed. This set of ESTs represents a significant proportion of the Sheepshead minnow (Cyprinodon variegatus) transcriptome, and provides a material basis for the development of microarrays useful for further gene expression studies in association with stressors such as hypoxia, cadmium, chromium and pyrene. PMID:21047385
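
    The counts reported in this abstract can be cross-checked with simple arithmetic. The sketch below uses only numbers quoted in the record; the variable names are illustrative, and the accession-range check assumes the GenBank range is contiguous and inclusive, which the abstract does not state.

```python
# Counts quoted from the abstract; variable names are illustrative only.
est_clones = 10_858
high_quality = 8_099
contigs = 1_053
singletons = 3_170
unique_sequences = 4_223
blastx_hits = 1_394
go_annotated = 388

# GenBank range GE329585 to GE337683, assumed contiguous and inclusive.
accession_span = 337_683 - 329_585 + 1

def fraction(part, whole):
    """Return part / whole as a float."""
    return part / whole

checks = {
    "contigs + singletons == unique sequences":
        contigs + singletons == unique_sequences,
    "accession span == high-quality sequences":
        accession_span == high_quality,
}

for label, ok in checks.items():
    print(f"{label}: {ok}")

print(f"high-quality share of clones: {fraction(high_quality, est_clones):.1%}")
print(f"unique sequences with a BLASTX hit: {fraction(blastx_hits, unique_sequences):.1%}")
print(f"BLASTX hits with GO annotation: {fraction(go_annotated, blastx_hits):.1%}")
```

    Both consistency checks hold for the figures as quoted, which suggests the accession range covers the high-quality sequences rather than all sequenced clones.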

  11. Functional Gene Networks: R/Bioc package to generate and analyse gene networks derived from functional enrichment and clustering

    PubMed Central

    Aibar, Sara; Fontanillo, Celia; Droste, Conrad; De Las Rivas, Javier

    2015-01-01

    Summary: Functional Gene Networks (FGNet) is an R/Bioconductor package that generates gene networks derived from the results of functional enrichment analysis (FEA) and annotation clustering. The sets of genes enriched with specific biological terms (obtained from a FEA platform) are transformed into a network by establishing links between genes based on common functional annotations and common clusters. The network provides a new view of FEA results revealing gene modules with similar functions and genes that are related to multiple functions. In addition to building the functional network, FGNet analyses the similarity between the groups of genes and provides a distance heatmap and a bipartite network of functionally overlapping genes. The application includes an interface to directly perform FEA queries using different external tools: DAVID, GeneTerm Linker, TopGO or GAGE; and a graphical interface to facilitate the use. Availability and implementation: FGNet is available in Bioconductor, including a tutorial. URL: http://bioconductor.org/packages/release/bioc/html/FGNet.html Contact: jrivas@usal.es Supplementary information: Supplementary data are available at Bioinformatics online. PMID:25600944
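
    The idea of linking genes through shared functional annotations can be sketched in a few lines. The example below is a minimal Python illustration of that idea, not FGNet's R API: the function names, the edge-weight convention (number of shared terms), and the toy term-to-gene mapping are all assumptions made for this sketch.

```python
from collections import defaultdict
from itertools import combinations

def build_functional_network(term_to_genes):
    """Link every pair of genes that share an enriched term.

    Returns {(gene_a, gene_b): number_of_shared_terms}, with each pair
    stored once in sorted order.
    """
    edges = defaultdict(int)
    for genes in term_to_genes.values():
        for a, b in combinations(sorted(set(genes)), 2):
            edges[(a, b)] += 1
    return dict(edges)

def genes_in_multiple_terms(term_to_genes):
    """Return genes annotated with more than one term, sorted by name."""
    counts = defaultdict(int)
    for genes in term_to_genes.values():
        for gene in set(genes):
            counts[gene] += 1
    return sorted(g for g, n in counts.items() if n > 1)

# Hypothetical enrichment result: three terms over five genes.
toy_terms = {
    "term_A": ["g1", "g2", "g3"],
    "term_B": ["g2", "g3", "g4"],
    "term_C": ["g5"],
}

network = build_functional_network(toy_terms)
shared = genes_in_multiple_terms(toy_terms)
print(network)
print(shared)
```

    A term with a single gene contributes no edges, so such genes appear only if the caller tracks nodes separately from the edge dictionary.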

  12. Emerging applications of read profiles towards the functional annotation of the genome

    PubMed Central

    Pundhir, Sachin; Poirazi, Panayiota; Gorodkin, Jan

    2015-01-01

    Functional annotation of the genome is important to understand the phenotypic complexity of various species. The road toward functional annotation involves several challenges ranging from experiments on individual molecules to large-scale analysis of high-throughput sequencing (HTS) data. HTS data is typically a result of the protocol designed to address specific research questions. The sequencing results in reads, which when mapped to a reference genome often leads to the formation of distinct patterns (read profiles). Interpretation of these read profiles is essential for their analysis in relation to the research question addressed. Several strategies have been employed at varying levels of abstraction ranging from a somewhat ad hoc to a more systematic analysis of read profiles. These include methods which can compare read profiles, e.g., from direct (non-sequence based) alignments to classification of patterns into functional groups. In this review, we highlight the emerging applications of read profiles for the annotation of non-coding RNA and cis-regulatory elements (CREs) such as enhancers and promoters. We also discuss the biological rationale behind their formation. PMID:26042150

  13. tagtog: interactive and text-mining-assisted annotation of gene mentions in PLOS full-text articles.

    PubMed

    Cejuela, Juan Miguel; McQuilton, Peter; Ponting, Laura; Marygold, Steven J; Stefancsik, Raymund; Millburn, Gillian H; Rost, Burkhard

    2014-01-01

    The breadth and depth of biomedical literature are increasing year upon year. To keep abreast of these increases, FlyBase, a database for Drosophila genomic and genetic information, is constantly exploring new ways to mine the published literature to increase the efficiency and accuracy of manual curation and to automate some aspects, such as triaging and entity extraction. Toward this end, we present the 'tagtog' system, a web-based annotation framework that can be used to mark up biological entities (such as genes) and concepts (such as Gene Ontology terms) in full-text articles. tagtog leverages manual user annotation in combination with automatic machine-learned annotation to provide accurate identification of gene symbols and gene names. As part of the BioCreative IV Interactive Annotation Task, FlyBase has used tagtog to identify and extract mentions of Drosophila melanogaster gene symbols and names in full-text biomedical articles from the PLOS stable of journals. We show here the results of three experiments with different sized corpora and assess gene recognition performance and curation speed. We conclude that tagtog-named entity recognition improves with a larger corpus and that tagtog-assisted curation is quicker than manual curation. DATABASE URL: www.tagtog.net, www.flybase.org. PMID:24715220

  14. The Protein Information Resource: an integrated public resource of functional annotation of proteins

    PubMed Central

    Wu, Cathy H.; Huang, Hongzhan; Arminski, Leslie; Castro-Alvear, Jorge; Chen, Yongxing; Hu, Zhang-Zhi; Ledley, Robert S.; Lewis, Kali C.; Mewes, Hans-Werner; Orcutt, Bruce C.; Suzek, Baris E.; Tsugita, Akira; Vinayaka, C. R.; Yeh, Lai-Su L.; Zhang, Jian; Barker, Winona C.

    2002-01-01

    The Protein Information Resource (PIR) serves as an integrated public resource of functional annotation of protein data to support genomic/proteomic research and scientific discovery. The PIR, in collaboration with the Munich Information Center for Protein Sequences (MIPS) and the Japan International Protein Information Database (JIPID), produces the PIR-International Protein Sequence Database (PSD), the major annotated protein sequence database in the public domain, containing about 250 000 proteins. To improve protein annotation and the coverage of experimentally validated data, a bibliography submission system is developed for scientists to submit, categorize and retrieve literature information. Comprehensive protein information is available from iProClass, which includes family classification at the superfamily, domain and motif levels, structural and functional features of proteins, as well as cross-references to over 40 biological databases. To provide timely and comprehensive protein data with source attribution, we have introduced a non-redundant reference protein database, PIR-NREF. The database consists of about 800 000 proteins collected from PIR-PSD, SWISS-PROT, TrEMBL, GenPept, RefSeq and PDB, with composite protein names and literature data. To promote database interoperability, we provide XML data distribution and open database schema, and adopt common ontologies. The PIR web site (http://pir.georgetown.edu/) features data mining and sequence analysis tools for information retrieval and functional identification of proteins based on both sequence and annotation information. The PIR databases and other files are also available by FTP (ftp://nbrfa.georgetown.edu/pir_databases). PMID:11752247

  15. Metalloproteomics: high-throughput structural and functional annotation of proteins in structural genomics.

    PubMed

    Shi, Wuxian; Zhan, Chenyang; Ignatov, Alexander; Manjasetty, Babu A; Marinkovic, Nebojsa; Sullivan, Michael; Huang, Raymond; Chance, Mark R

    2005-10-01

    A high-throughput method for measuring transition metal content based on quantitation of X-ray fluorescence signals was used to analyze 654 proteins selected as targets by the New York Structural GenomiX Research Consortium. Over 10% showed the presence of transition metal atoms in stoichiometric amounts; these totals as well as the abundance distribution are similar to those of the Protein Data Bank. Bioinformatics analysis of the identified metalloproteins in most cases supported the metalloprotein annotation; identification of the conserved metal binding motif was also shown to be useful in verifying structural models of the proteins. Metalloproteomics provides a rapid structural and functional annotation for these sequences and is shown to be approximately 95% accurate in predicting the presence or absence of stoichiometric metal content. The project's goal is to assay at least 1 member from each Pfam family; approximately 500 Pfam families have been characterized with respect to transition metal content so far. PMID:16216579

  16. Transcriptional dynamics of the developing sweet cherry (Prunus avium L.) fruit: sequencing, annotation and expression profiling of exocarp-associated genes

    PubMed Central

    Alkio, Merianne; Jonas, Uwe; Declercq, Myriam; Van Nocker, Steven; Knoche, Moritz

    2014-01-01

    The exocarp, or skin, of fleshy fruit is a specialized tissue that protects the fruit, attracts seed dispersing fruit eaters, and has large economical relevance for fruit quality. Development of the exocarp involves regulated activities of many genes. This research analyzed global gene expression in the exocarp of developing sweet cherry (Prunus avium L., ‘Regina’), a fruit crop species with little public genomic resources. A catalog of transcript models (contigs) representing expressed genes was constructed from de novo assembled short complementary DNA (cDNA) sequences generated from developing fruit between flowering and maturity at 14 time points. Expression levels in each sample were estimated for 34 695 contigs from numbers of reads mapping to each contig. Contigs were annotated functionally based on BLAST, gene ontology and InterProScan analyses. Coregulated genes were detected using partitional clustering of expression patterns. The results are discussed with emphasis on genes putatively involved in cuticle deposition, cell wall metabolism and sugar transport. The high temporal resolution of the expression patterns presented here reveals finely tuned developmental specialization of individual members of gene families. Moreover, the de novo assembled sweet cherry fruit transcriptome with 7760 full-length protein coding sequences and over 20 000 other, annotated cDNA sequences together with their developmental expression patterns is expected to accelerate molecular research on this important tree fruit crop. PMID:26504533

  17. Genome Annotation by Shotgun Inactivation of a Native Gene in Hemizygous Cells: Application to BRCA2 with Implication of Hypomorphic Variants

    PubMed Central

    Ghosh, Soma; Bhunia, Anil K.; Paun, Bogdan C.; Gilbert, Samuel F.; Dhru, Urmil; Patel, Kalpesh; Kern, Scott E.

    2015-01-01

    The greatest interpretive challenge of modern medicine may be to functionally annotate the vast variation of human genomes. Demonstrating a proposed approach, we created a library of BRCA2 exon 27 shotgun-mutant plasmids including solitary and multiplex mutations to generate human knockin clones using homologous recombination. This 55-mutation, 13-clone syngeneic variance library (SyVaL) comprised severely affected clones having early-stop nonsense mutations, functionally hypomorphic clones having multiple missense mutations emphasizing the potential to identify and assess hypomorphic mutations in novel proteomic and epidemiologic studies, and neutral clones having multiple missense mutations. Efficient coverage of nonessential amino acids was provided by mutation multiplexing. Severe mutations were distinguished from hypomorphic or neutral changes by chemosensitivity assays (hypersensitivity to mitomycin C and acetaldehyde), by analysis of RAD51 focus formation, and by mitotic multipolarity. A multiplex unbiased approach of generating all-human SyVaLs in medically important genes, with random mutations in native genes, would provide databases of variants that could be functionally annotated without concerns arising from exogenous cDNA constructs or interspecies interactions, as a basis for subsequent proteomic domain mapping or clinical calibration if desired. Such gene-irrelevant approaches could be scaled up for multiple genes of clinical interest, providing distributable cellular libraries linked to public-shared functional databases. PMID:25451944

  18. Ribosome profiling reveals pervasive translation outside of annotated protein-coding genes.

    PubMed

    Ingolia, Nicholas T; Brar, Gloria A; Stern-Ginossar, Noam; Harris, Michael S; Talhouarne, Gaëlle J S; Jackson, Sarah E; Wills, Mark R; Weissman, Jonathan S

    2014-09-11

    Ribosome profiling suggests that ribosomes occupy many regions of the transcriptome thought to be noncoding, including 5' UTRs and long noncoding RNAs (lncRNAs). Apparent ribosome footprints outside of protein-coding regions raise the possibility of artifacts unrelated to translation, particularly when they occupy multiple, overlapping open reading frames (ORFs). Here, we show hallmarks of translation in these footprints: copurification with the large ribosomal subunit, response to drugs targeting elongation, trinucleotide periodicity, and initiation at early AUGs. We develop a metric for distinguishing between 80S footprints and nonribosomal sources using footprint size distributions, which validates the vast majority of footprints outside of coding regions. We present evidence for polypeptide production beyond annotated genes, including the induction of immune responses following human cytomegalovirus (HCMV) infection. Translation is pervasive on cytosolic transcripts outside of conserved reading frames, and direct detection of this expanded universe of translated products enables efforts at understanding how cells manage and exploit its consequences. PMID:25159147
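
    The trinucleotide periodicity listed above as a hallmark of translation can be illustrated with a minimal Python sketch. It assumes a hypothetical input, a list of footprint offsets (in nucleotides) measured from the first base of a start codon, and simply reports the fraction of footprints in each reading frame. It is not the authors' pipeline, nor their footprint-size metric.

```python
from collections import Counter


def frame_fractions(offsets):
    """Return the fraction of footprints falling in reading frames 0, 1 and 2.

    `offsets` is a hypothetical list of integer footprint positions, measured
    in nucleotides from the first base of the start codon.
    """
    counts = Counter(offset % 3 for offset in offsets)
    total = sum(counts.values())
    if total == 0:
        return [0.0, 0.0, 0.0]
    return [counts[frame] / total for frame in range(3)]


if __name__ == "__main__":
    # Made-up offsets: a strong excess in one frame is what codon-wise
    # ribosome movement would be expected to produce.
    print(frame_fractions([0, 3, 6, 9, 12, 1, 15, 18, 2, 21]))
```

    A roughly even split across the three frames would argue against translation; a marked bias toward one frame is consistent with it.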

  19. Ribosome Profiling Reveals Pervasive Translation Outside of Annotated Protein-Coding Genes

    PubMed Central

    Ingolia, Nicholas T.; Brar, Gloria A.; Stern-Ginossar, Noam; Harris, Michael S.; Talhouarne, Gaëlle J. S.; Jackson, Sarah E.; Wills, Mark R.; Weissman, Jonathan S.

    2014-01-01

    SUMMARY Ribosome profiling suggests that ribosomes occupy many regions of the transcriptome thought to be non-coding, including 5′ UTRs and lncRNAs. Apparent ribosome footprints outside of protein-coding regions raise the possibility of artifacts unrelated to translation, particularly when they occupy multiple, overlapping open reading frames (ORFs). Here we show hallmarks of translation in these footprints: co-purification with the large ribosomal subunit, response to drugs targeting elongation, trinucleotide periodicity, and initiation at early AUGs. We develop a metric for distinguishing between 80S footprints and nonribosomal sources using footprint size distributions, which validates the vast majority of footprints outside of coding regions. We present evidence for polypeptide production beyond annotated genes, including induction of immune responses following human cytomegalovirus (HCMV) infection. Translation is pervasive on cytosolic transcripts outside of conserved reading frames, and direct detection of this expanded universe of translated products enables efforts to understand how cells manage and exploit its consequences. PMID:25159147

  20. Tiling Assembly: a new tool for reference annotation-independent transcript assembly and novel gene identification by RNA-sequencing.

    PubMed

    Watanabe, Kenneth A; Homayouni, Arielle; Tufano, Tara; Lopez, Jennifer; Ringler, Patricia; Rushton, Paul; Shen, Qingxi J

    2015-10-01

    Annotation of the rice (Oryza sativa) genome has evolved significantly since release of its draft sequence, but it is far from complete. Several published transcript assembly programmes were tested on RNA-sequencing (RNA-seq) data to determine their effectiveness in identifying novel genes to improve the rice genome annotation. Cufflinks, a popular assembly software, did not identify all transcripts suggested by the RNA-seq data. Other assembly software was CPU intensive, lacked documentation, or lacked software updates. To overcome these shortcomings, a heuristic ab initio transcript assembly algorithm, Tiling Assembly, was developed to identify genes based on short read and junction alignment. Tiling Assembly was compared with Cufflinks to evaluate its gene-finding capabilities. Additionally, a pipeline was developed to eliminate false-positive gene identification due to noise or repetitive regions in the genome. By combining Tiling Assembly and Cufflinks, 767 unannotated genes were identified in the rice genome, demonstrating that combining both programmes proved highly efficient for novel gene identification. We also demonstrated that Tiling Assembly can accurately determine transcription start sites by comparing the Tiling Assembly genes with their corresponding full-length cDNA. We applied our pipeline to additional organisms and identified numerous unannotated genes, demonstrating that Tiling Assembly is an organism-independent tool for genome annotation. PMID:26341416
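
    The idea of identifying candidate genes from short-read alignments can be sketched, in a deliberately simplified form, as merging overlapping read intervals into contiguous transcribed regions. The function name, the `max_gap` parameter and the input layout below are assumptions made for illustration. This is not the published Tiling Assembly algorithm, and it ignores splice junctions, strand and noise filtering.

```python
def merge_read_intervals(reads, max_gap=0):
    """Merge (start, end) read alignments into candidate transcribed regions.

    `reads` is a hypothetical list of half-open intervals on one chromosome.
    Intervals that overlap, or that lie within `max_gap` bases of the
    previous region, are merged into that region.
    """
    regions = []
    for start, end in sorted(reads):
        if regions and start <= regions[-1][1] + max_gap:
            # Extend the current region if this read reaches further.
            regions[-1][1] = max(regions[-1][1], end)
        else:
            regions.append([start, end])
    return [tuple(region) for region in regions]


if __name__ == "__main__":
    alignments = [(100, 150), (140, 190), (400, 450)]
    print(merge_read_intervals(alignments))
    print(merge_read_intervals(alignments, max_gap=250))
```

    Regions that fall outside every annotated gene would then be the starting point for a novel-gene search, subject to the kind of false-positive filtering the abstract describes.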

  1. Tiling Assembly: a new tool for reference annotation-independent transcript assembly and novel gene identification by RNA-sequencing

    PubMed Central

    Watanabe, Kenneth A.; Homayouni, Arielle; Tufano, Tara; Lopez, Jennifer; Ringler, Patricia; Rushton, Paul; Shen, Qingxi J.

    2015-01-01

    Annotation of the rice (Oryza sativa) genome has evolved significantly since release of its draft sequence, but it is far from complete. Several published transcript assembly programmes were tested on RNA-sequencing (RNA-seq) data to determine their effectiveness in identifying novel genes to improve the rice genome annotation. Cufflinks, a popular assembly software, did not identify all transcripts suggested by the RNA-seq data. Other assembly software was CPU intensive, lacked documentation, or lacked software updates. To overcome these shortcomings, a heuristic ab initio transcript assembly algorithm, Tiling Assembly, was developed to identify genes based on short read and junction alignment. Tiling Assembly was compared with Cufflinks to evaluate its gene-finding capabilities. Additionally, a pipeline was developed to eliminate false-positive gene identification due to noise or repetitive regions in the genome. By combining Tiling Assembly and Cufflinks, 767 unannotated genes were identified in the rice genome, demonstrating that combining both programmes proved highly efficient for novel gene identification. We also demonstrated that Tiling Assembly can accurately determine transcription start sites by comparing the Tiling Assembly genes with their corresponding full-length cDNA. We applied our pipeline to additional organisms and identified numerous unannotated genes, demonstrating that Tiling Assembly is an organism-independent tool for genome annotation. PMID:26341416

  2. Functional characterization of two M42 aminopeptidases erroneously annotated as cellulases.

    PubMed

    Dutoit, Raphaël; Brandt, Nathalie; Legrain, Christianne; Bauvois, Cédric

    2012-01-01

    Several aminopeptidases of the M42 family have been described as tetrahedral-shaped dodecameric (TET) aminopeptidases. A current hypothesis suggests that these enzymes are involved, along with the tricorn peptidase, in degrading peptides produced by the proteasome. Yet the M42 family remains ill defined, as some members have been annotated as cellulases because of their homology with CelM, formerly described as an endoglucanase of Clostridium thermocellum. Here we describe the catalytic functions and substrate profiles of CelM and of TmPep1050, the latter having been annotated as an endoglucanase of Thermotoga maritima. Both enzymes were shown to catalyze hydrolysis of nonpolar aliphatic L-amino acid-pNA substrates, the L-leucine derivative appearing as the best substrate. No significant endoglucanase activity was measured, either for TmPep1050 or CelM. Addition of cobalt ions enhanced the activity of both enzymes significantly, while both the chelating agent EDTA and bestatin, a specific inhibitor of metalloaminopeptidases, proved inhibitory. Our results strongly suggest that one should avoid annotating members of the M42 aminopeptidase family as cellulases. In an updated assessment of the distribution of M42 aminopeptidases, we found TET aminopeptidases to be distributed widely amongst archaea and bacteria. We additionally observed that several phyla lack both TET and tricorn. This suggests that other complexes may act downstream from the proteasome. PMID:23226342

  3. Genome-wide annotation, expression profiling, and protein interaction studies of the core cell-cycle genes in Phalaenopsis aphrodite.

    PubMed

    Lin, Hsiang-Yin; Chen, Jhun-Chen; Wei, Miao-Ju; Lien, Yi-Chen; Li, Huang-Hsien; Ko, Swee-Suak; Liu, Zin-Huang; Fang, Su-Chiung

    2014-01-01

    Orchidaceae is one of the most abundant and diverse families in the plant kingdom and its unique developmental patterns have drawn the attention of many evolutionary biologists. Particular areas of interest have included the co-evolution of pollinators and distinct floral structures, and symbiotic relationships with mycorrhizal flora. However, comprehensive studies to decipher the molecular basis of growth and development in orchids remain scarce. Cell proliferation governed by cell-cycle regulation is fundamental to growth and development of the plant body. We took advantage of recently released transcriptome information to systematically isolate and annotate the core cell-cycle regulators in the moth orchid Phalaenopsis aphrodite. Our data verified that Phalaenopsis cyclin-dependent kinase A (CDKA) is an evolutionarily conserved CDK. Expression profiling studies suggested that core cell-cycle genes functioning during the G1/S, S, and G2/M stages were preferentially enriched in the meristematic tissues that have high proliferation activity. In addition, subcellular localization and pairwise interaction analyses of various combinations of CDKs and cyclins, and of E2 promoter-binding factors and dimerization partners confirmed interactions of the functional units. Furthermore, our data showed that expression of the core cell-cycle genes was coordinately regulated during pollination-induced reproductive development. The data obtained establish a fundamental framework for study of the cell-cycle machinery in Phalaenopsis orchids. PMID:24222213

  4. Gene prediction and annotation in Penstemon (Plantaginaceae): A workflow for marker development from extremely low-coverage genome sequencing

    PubMed Central

    Blischak, Paul D.; Wenzel, Aaron J.; Wolfe, Andrea D.

    2014-01-01

    • Premise of the study: Penstemon (Plantaginaceae) is a large and diverse genus endemic to North America. However, determining the phylogenetic relationships among its 280 species has been difficult due to its recent evolutionary radiation. The development of a large, multilocus data set can help to resolve this challenge. • Methods: Using both previously sequenced genomic libraries and our own low-coverage whole-genome shotgun sequencing libraries, we used the MAKER2 Annotation Pipeline to identify gene regions for the development of sequencing loci from six extremely low-coverage Penstemon genomes (~0.005×–0.007×). We also compared this approach to BLAST searches, and conducted analyses to characterize sequence divergence across the species sequenced. • Results: Annotations and gene predictions were successfully added to more than 10,000 contigs for potential use in downstream primer design. Primers were then designed for chloroplast, mitochondrial, and nuclear loci from these annotated sequences. MAKER2 identified longer gene regions in all six Penstemon genomes when compared with BLASTN and BLASTX searches. The average level of sequence divergence among the six species was 7.14%. • Discussion: Combining bioinformatics tools into a workflow that produces annotations can be useful for creating potential phylogenetic markers from thousands of sequences even when genome coverage is extremely low and reference data are only available from distant relatives. Furthermore, the output from MAKER2 contains information about important gene features, such as exon boundaries, and can be easily integrated with visualization tools to facilitate the process of marker development. PMID:25506519

  5. Gene identification signature (GIS) analysis for transcriptome characterization and genome annotation.

    PubMed

    Ng, Patrick; Wei, Chia-Lin; Sung, Wing-Kin; Chiu, Kuo Ping; Lipovich, Leonard; Ang, Chin Chin; Gupta, Sanjay; Shahab, Atif; Ridwan, Azmi; Wong, Chee Hong; Liu, Edison T; Ruan, Yijun

    2005-02-01

    We have developed a DNA tag sequencing and mapping strategy called gene identification signature (GIS) analysis, in which 5' and 3' signatures of full-length cDNAs are accurately extracted into paired-end ditags (PETs) that are concatenated for efficient sequencing and mapped to genome sequences to demarcate the transcription boundaries of every gene. GIS analysis is potentially 30-fold more efficient than standard cDNA sequencing approaches for transcriptome characterization. We demonstrated this approach with 116,252 PET sequences derived from mouse embryonic stem cells. Initial analysis of this dataset identified hundreds of previously uncharacterized transcripts, including alternative transcripts of known genes. We also uncovered several intergenically spliced and unusual fusion transcripts, one of which was confirmed as a trans-splicing event and was differentially expressed. The concept of paired-end ditagging described here for transcriptome analysis can also be applied to whole-genome analysis of cis-regulatory and other DNA elements and represents an important technological advance for genome annotation. PMID:15782207

  6. RASTtk: A modular and extensible implementation of the RAST algorithm for building custom annotation pipelines and annotating batches of genomes

    SciTech Connect

    Brettin, Thomas; Davis, James J.; Disz, Terry; Edwards, Robert A.; Gerdes, Svetlana; Olsen, Gary J.; Olson, Robert; Overbeek, Ross; Parrello, Bruce; Pusch, Gordon D.; Shukla, Maulik; Thomason, III, James A.; Stevens, Rick; Vonstein, Veronika; Wattam, Alice R.; Xia, Fangfang

    2015-02-10

    The RAST (Rapid Annotation using Subsystem Technology) annotation engine was built in 2008 to annotate bacterial and archaeal genomes. It works by offering a standard software pipeline for identifying genomic features (i.e., protein-encoding genes and RNA) and annotating their functions. Recently, in order to make RAST a more useful research tool and to keep pace with advancements in bioinformatics, it has become desirable to build a version of RAST that is both customizable and extensible. In this paper, we describe the RAST tool kit (RASTtk), a modular version of RAST that enables researchers to build custom annotation pipelines. RASTtk offers a choice of software for identifying and annotating genomic features as well as the ability to add custom features to an annotation job. RASTtk also accommodates the batch submission of genomes and the ability to customize annotation protocols for batch submissions. This is the first major software restructuring of RAST since its inception.

  7. The Gene Wiki in 2011: community intelligence applied to human gene annotation

    PubMed Central

    Good, Benjamin M.; Clarke, Erik L.; de Alfaro, Luca; Su, Andrew I.

    2012-01-01

    The Gene Wiki is an open-access and openly editable collection of Wikipedia articles about human genes. Initiated in 2008, it has grown to include articles about more than 10 000 genes that, collectively, contain more than 1.4 million words of gene-centric text with extensive citations back to the primary scientific literature. This growing body of useful, gene-centric content is the result of the work of thousands of individuals throughout the scientific community. Here, we describe recent improvements to the automated system that keeps the structured data presented on Gene Wiki articles in sync with the data from trusted primary databases. We also describe the expanding contents, editors and users of the Gene Wiki. Finally, we introduce a new automated system, called WikiTrust, which can effectively compute the quality of Wikipedia articles, including Gene Wiki articles, at the word level. All articles in the Gene Wiki can be freely accessed and edited at Wikipedia, and additional links and information can be found at the project's Wikipedia portal page: http://en.wikipedia.org/wiki/Portal:Gene_Wiki. PMID:22075991

  8. Integration of multiethnic fine-mapping and genomic annotation to prioritize candidate functional SNPs at prostate cancer susceptibility regions.

    PubMed

    Han, Ying; Hazelett, Dennis J; Wiklund, Fredrik; Schumacher, Fredrick R; Stram, Daniel O; Berndt, Sonja I; Wang, Zhaoming; Rand, Kristin A; Hoover, Robert N; Machiela, Mitchell J; Yeager, Merideth; Burdette, Laurie; Chung, Charles C; Hutchinson, Amy; Yu, Kai; Xu, Jianfeng; Travis, Ruth C; Key, Timothy J; Siddiq, Afshan; Canzian, Federico; Takahashi, Atsushi; Kubo, Michiaki; Stanford, Janet L; Kolb, Suzanne; Gapstur, Susan M; Diver, W Ryan; Stevens, Victoria L; Strom, Sara S; Pettaway, Curtis A; Al Olama, Ali Amin; Kote-Jarai, Zsofia; Eeles, Rosalind A; Yeboah, Edward D; Tettey, Yao; Biritwum, Richard B; Adjei, Andrew A; Tay, Evelyn; Truelove, Ann; Niwa, Shelley; Chokkalingam, Anand P; Isaacs, William B; Chen, Constance; Lindstrom, Sara; Le Marchand, Loic; Giovannucci, Edward L; Pomerantz, Mark; Long, Henry; Li, Fugen; Ma, Jing; Stampfer, Meir; John, Esther M; Ingles, Sue A; Kittles, Rick A; Murphy, Adam B; Blot, William J; Signorello, Lisa B; Zheng, Wei; Albanes, Demetrius; Virtamo, Jarmo; Weinstein, Stephanie; Nemesure, Barbara; Carpten, John; Leske, M Cristina; Wu, Suh-Yuh; Hennis, Anselm J M; Rybicki, Benjamin A; Neslund-Dudas, Christine; Hsing, Ann W; Chu, Lisa; Goodman, Phyllis J; Klein, Eric A; Zheng, S Lilly; Witte, John S; Casey, Graham; Riboli, Elio; Li, Qiyuan; Freedman, Matthew L; Hunter, David J; Gronberg, Henrik; Cook, Michael B; Nakagawa, Hidewaki; Kraft, Peter; Chanock, Stephen J; Easton, Douglas F; Henderson, Brian E; Coetzee, Gerhard A; Conti, David V; Haiman, Christopher A

    2015-10-01

    Interpretation of biological mechanisms underlying genetic risk associations for prostate cancer is complicated by the relatively large number of risk variants (n = 100) and the thousands of surrogate SNPs in linkage disequilibrium. Here, we combined three distinct approaches: multiethnic fine-mapping, putative functional annotation (based upon epigenetic data and genome-encoded features), and expression quantitative trait loci (eQTL) analyses, in an attempt to reduce this complexity. We examined 67 risk regions using genotyping and imputation-based fine-mapping in populations of European (cases/controls: 8600/6946), African (cases/controls: 5327/5136), Japanese (cases/controls: 2563/4391) and Latino (cases/controls: 1034/1046) ancestry. Markers at 55 regions passed a region-specific significance threshold (P-value cutoff range: 3.9 × 10^-4 to 5.6 × 10^-3) and in 30 regions we identified markers that were more significantly associated with risk than the previously reported variants in the multiethnic sample. Novel secondary signals (P < 5.0 × 10^-6) were also detected in two regions (rs13062436/3q21 and rs17181170/3p12). Among 666 variants in the 55 regions with P-values within one order of magnitude of the most-associated marker, 193 variants (29%) in 48 regions overlapped with epigenetic or other putative functional marks. In 11 of the 55 regions, cis-eQTLs were detected with nearby genes. For 12 of the 55 regions (22%), the most significant region-specific, prostate-cancer associated variant represented the strongest candidate functional variant based on our annotations; the number of regions increased to 20 (36%) and 27 (49%) when examining the 2 and 3 most significantly associated variants in each region, respectively. These results have prioritized subsets of candidate variants for downstream functional evaluation. PMID:26162851
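
    The selection of variants "within one order of magnitude of the most-associated marker" can be written as a simple filter. The sketch below is a minimal Python illustration under stated assumptions: the function name, the `fold` parameter, the variant identifiers and the P-values are all made up, and the study's actual fine-mapping and annotation steps are not reproduced.

```python
def candidate_variants(p_values, fold=10.0):
    """Return variant IDs whose P-value is within `fold` of the smallest one.

    `p_values` is a hypothetical mapping of variant ID to association
    P-value for a single risk region. Results are ordered from most to
    least significant.
    """
    if not p_values:
        return []
    cutoff = min(p_values.values()) * fold
    kept = [variant for variant, p in p_values.items() if p <= cutoff]
    return sorted(kept, key=lambda variant: p_values[variant])


if __name__ == "__main__":
    # Invented identifiers and P-values for one region.
    region = {"varA": 1e-8, "varB": 5e-8, "varC": 2e-6, "varD": 9e-8}
    print(candidate_variants(region))
```

    The retained variants are the ones that would then be checked for overlap with epigenetic or other putative functional marks.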

  9. Biological functional annotation of retinoic acid alpha and beta in mouse liver based on genome-wide binding

    PubMed Central

    He, Yuqi; Tsuei, Jessica

    2014-01-01

    Retinoic acid (RA) has diverse biological effects. The liver stores vitamin A, generates RA, and expresses receptors for RA. The current study examines the hepatic binding profile of two RA receptor isoforms, RARA (RARα) and RARB (RARβ), in response to RA treatment in mouse livers. Our data uncovered 35,521 and 14,968 genomic bindings for RARA and RARB, respectively. Each expressed unique and common bindings, implying their redundant and specific roles. RARB has higher RA responsiveness than RARA. RA treatment generated 18,821 novel RARB bindings but only 14,798 RARA bindings, compared with the control group. RAR frequently bound the consensus hormone response element [HRE; (A/G)G(G/T)TCA], which often contained the motifs assigned to SP1, GABPA, and FOXA2, suggesting potential interactions between those transcriptional factors. Functional annotation coupled with principal component analysis revealed that the functions of RAR target genes were motif dependent. Taken together, the cistrome of RARA and RARB revealed their extensive biological roles in the mouse liver. RAR target genes are enriched in various biological processes. The hepatic RAR genome-wide binding data can help us understand the global molecular mechanisms underlying RAR and RA-mediated gene and pathway regulation. PMID:24833708

  10. Inferring mouse gene functions from genomic-scale data using a combined functional network/classification strategy

    PubMed Central

    Kim, Wan Kyu; Krumpelman, Chase; Marcotte, Edward M

    2008-01-01

    The complete set of mouse genes, as with the set of human genes, is still largely uncharacterized, with many pieces of experimental evidence accumulating regarding the activities and expression of the genes, but the majority of genes as yet still of unknown function. Within the context of the MouseFunc competition, we developed and applied two distinct large-scale data mining approaches to infer the functions (Gene Ontology annotations) of mouse genes from experimental observations from available functional genomics, proteomics, comparative genomics, and phenotypic data. The two strategies (the first using classifiers to map features to annotations, the second propagating annotations from characterized genes to uncharacterized genes along edges in a network constructed from the features) offer alternative and possibly complementary approaches to providing functional annotations. Here, we re-implement and evaluate these approaches and their combination for their ability to predict the proper functional annotations of genes in the MouseFunc data set. We show that, when controlling for the same set of input features, the network approach generally outperformed a naïve Bayesian classifier approach, while their combination offers some improvement over either independently. We make our observations of predictive performance on the MouseFunc competition hold-out set, as well as on a ten-fold cross-validation of the MouseFunc data. Across all 1,339 annotated genes in the MouseFunc test set, the median predictive power was quite strong (median area under a receiver operating characteristic plot of 0.865 and average precision of 0.195), indicating that a mining-based strategy with existing data is a promising path towards discovering mammalian gene functions. As one product of this work, a high-confidence subset of the functional mouse gene network was produced, spanning >70% of mouse genes with >1.6 million associations, that is predictive of mouse (and therefore often human) gene function and functional associations. The network should be generally useful for mammalian gene functional analyses, such as for predicting interactions, inferring functional connections between genes and pathways, and prioritizing candidate genes. The network and all predictions are available on the worldwide web. PMID:18613949
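
    The network strategy described above, propagating annotations from characterized genes to their neighbours, can be illustrated with a minimal guilt-by-association sketch in Python. The scoring rule used here (the weighted fraction of a gene's neighbours that carry the label) and all gene names and weights are assumptions chosen for illustration; the abstract does not specify the scoring function the authors used.

```python
def propagate_scores(edges, annotated):
    """Score each gene by the weighted fraction of its neighbours carrying a label.

    `edges` is a hypothetical iterable of (gene_a, gene_b, weight) tuples for
    an undirected functional network. `annotated` is the set of genes already
    known to carry the functional label of interest.
    """
    labelled = {}
    total = {}
    for gene_a, gene_b, weight in edges:
        # Count each undirected edge from both endpoints.
        for gene, neighbour in ((gene_a, gene_b), (gene_b, gene_a)):
            total[gene] = total.get(gene, 0.0) + weight
            if neighbour in annotated:
                labelled[gene] = labelled.get(gene, 0.0) + weight
    return {
        gene: labelled.get(gene, 0.0) / total[gene]
        for gene in total
        if total[gene] > 0
    }


if __name__ == "__main__":
    network = [("g1", "g2", 1.0), ("g2", "g3", 3.0), ("g3", "g4", 1.0)]
    print(propagate_scores(network, annotated={"g1", "g3"}))
```

    Uncharacterized genes with high scores would be the candidates for the label; ranking them against held-out annotations is what yields measures such as the area under the ROC curve quoted in the abstract.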

  11. BambooGDB: a bamboo genome database with functional annotation and an analysis platform

    PubMed Central

    Zhao, Hansheng; Peng, Zhenhua; Fei, Benhua; Li, Lubin; Hu, Tao; Gao, Zhimin; Jiang, Zehui

    2014-01-01

    Bamboo, as one of the most important non-timber forest products and fastest-growing plants in the world, represents the only major lineage of grasses that is native to forests. Recent success on the first high-quality draft genome sequence of moso bamboo (Phyllostachys edulis) provides new insights on bamboo genetics and evolution. To further extend our understanding on bamboo genome and facilitate future studies on the basis of previous achievements, here we have developed BambooGDB, a bamboo genome database with functional annotation and analysis platform. The de novo sequencing data, together with the full-length complementary DNA and RNA-seq data of moso bamboo composed the main contents of this database. Based on these sequence data, a comprehensively functional annotation for bamboo genome was made. Besides, an analytical platform composed of comparative genomic analysis, protein–protein interactions network, pathway analysis and visualization of genomic data was also constructed. As discovery tools to understand and identify biological mechanisms of bamboo, the platform can be used as a systematic framework for helping and designing experiments for further validation. Moreover, diverse and powerful search tools and a convenient browser were incorporated to facilitate the navigation of these data. As far as we know, this is the first genome database for bamboo. Through integrating high-throughput sequencing data, a full functional annotation and several analysis modules, BambooGDB aims to provide worldwide researchers with a central genomic resource and an extensible analysis platform for bamboo genome. BambooGDB is freely available at http://www.bamboogdb.org/. Database URL: http://www.bamboogdb.org PMID:24602877

  12. Reveal genes functionally associated with ACADS by a network study.

    PubMed

    Chen, Yulong; Su, Zhiguang

    2015-09-15

    Establishing a systematic network is aimed at finding essential human gene-gene/gene-disease pathways by means of network inter-connecting patterns and functional annotation analysis. In the present study, we have analyzed functional gene interactions of the short-chain acyl-coenzyme A dehydrogenase gene (ACADS). ACADS plays a vital role in free fatty acid β-oxidation and regulates energy homeostasis. Modules of highly inter-connected genes in the disease-specific ACADS network are derived by integrating gene function and protein interaction data. Among the 8 genes in the ACADS web retrieved from both STRING and GeneMANIA, ACADS is effectively conjoined with 4 genes: HADHA, HADHB, ECHS1 and ACAT1. The functional analysis is done via ontological briefing and candidate disease identification. We observed that the highly efficient-interlinked genes connected with ACADS are HADHA, HADHB, ECHS1 and ACAT1. Interestingly, the ontological aspect of genes in the ACADS network reveals that ACADS, HADHA and HADHB play equally vital roles in fatty acid metabolism. The gene ACAT1 together with ACADS participates in ketone metabolism. Our computational gene web analysis also predicts potential candidate disease recognition, thus indicating the involvement of ACADS, HADHA, HADHB, ECHS1 and ACAT1 not only with lipid metabolism but also with infant death syndrome, skeletal myopathy, acute hepatic encephalopathy, Reye-like syndrome, episodic ketosis, and metabolic acidosis. The current study presents a comprehensible layout of the ACADS network, its functional strategies and the candidate disease approach associated with the ACADS network. PMID:26045367

  13. Integrating genome annotation and QTL position to identify candidate genes for productivity, architecture and water-use efficiency in Populus spp

    PubMed Central

    2012-01-01

    Background Hybrid poplars species are candidates for biomass production but breeding efforts are needed to combine productivity and water use efficiency in improved cultivars. The understanding of the genetic architecture of growth in poplar by a Quantitative Trait Loci (QTL) approach can help us to elucidate the molecular basis of such integrative traits but identifying candidate genes underlying these QTLs remains difficult. Nevertheless, the increase of genomic information together with the accessibility to a reference genome sequence (Populus trichocarpa Nisqually-1) allow to bridge QTL information on genetic maps and physical location of candidate genes on the genome. The objective of the study is to identify QTLs controlling productivity, architecture and leaf traits in a P. deltoides x P. trichocarpa F1 progeny and to identify candidate genes underlying QTLs based on the anchoring of genetic maps on the genome and the gene ontology information linked to genome annotation. The strategy to explore genome annotation was to use Gene Ontology enrichment tools to test if some functional categories are statistically over-represented in QTL regions. Results Four leaf traits and 7 growth traits were measured on 330 F1 P. deltoides x P. trichocarpa progeny. A total of 77 QTLs controlling 11 traits were identified explaining from 1.8 to 17.2% of the variation of traits. For 58 QTLs, confidence intervals could be projected on the genome. An extended functional annotation was built based on data retrieved from the plant genome database Phytozome and from an inference of function using homology between Populus and the model plant Arabidopsis. Genes located within QTL confidence intervals were retrieved and enrichments in gene ontology (GO) terms were determined using different methods. Significant enrichments were found for all traits. 
    Particularly relevant biological processes GO terms were identified for QTLs controlling number of sylleptic branches: intervals were enriched in GO terms of biological process like ‘ripening’ and ‘adventitious roots development’. Conclusion Beyond the simple identification of QTLs, this study is the first to use a global approach of GO terms enrichment analysis to fully explore gene function under QTLs confidence intervals in plants. This global approach may lead to identification of new candidate genes for traits of interest. PMID:23013168
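
    Note: the abstract above says only that GO enrichment within QTL confidence intervals was determined "using different methods". As a hedged illustration of one common choice, the sketch below computes a one-sided hypergeometric over-representation p-value. The function name, parameter names and example counts are hypothetical and are not taken from the study.

```python
from math import comb


def enrichment_p_value(population, annotated, sample, hits):
    """Upper-tail hypergeometric probability P(X >= hits).

    population: genes in the annotated genome (background)
    annotated:  background genes carrying the GO term
    sample:     genes inside the QTL confidence interval
    hits:       interval genes carrying the GO term
    """
    if not 0 <= annotated <= population:
        raise ValueError("annotated must lie between 0 and population")
    if not 0 <= sample <= population:
        raise ValueError("sample must lie between 0 and population")
    lowest = max(hits, 0)
    highest = min(sample, annotated)
    # Count the ways to draw i annotated and (sample - i) unannotated genes.
    favourable = sum(
        comb(annotated, i) * comb(population - annotated, sample - i)
        for i in range(lowest, highest + 1)
    )
    return favourable / comb(population, sample)


# Hypothetical counts, chosen only to show the call.
p = enrichment_p_value(population=40000, annotated=120, sample=300, hits=6)
print(f"P(X >= 6) = {p:.3g}")
```

    In practice many GO terms are tested at once, so a multiple-testing correction would normally follow; that step is omitted here.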

  14. Coordinated and sequential transcription of the cyprinid herpesvirus-3 annotated genes.

    PubMed

    Ilouze, Maya; Dishon, Arnon; Kotler, Moshe

    2012-10-01

    Cyprinid herpesvirus-3 (CyHV-3) is the cause of a fatal disease in carp and koi fish. The disease is seasonal and appears when water temperatures range from 18 to 28°C. CyHV-3 is a member of the Alloherpesviridae, a family in the Herpesvirales order that encompasses mammalian, avian and reptilian viruses. CyHV-3 is a large double-stranded DNA (dsDNA) herpesvirus with a genome of approximately 295kbp, divergent from other mammalian, avian and reptilian herpesviruses, but bearing several genes similar to cyprinid herpesvirus-1 (CyHV-1), CyHV-2, anguillid herpesvirus-1 (AngHV-1), ictalurid herpesvirus-1 (IcHV-1) and ranid herpes virus-1 (RaHV-1). Here we show that viral DNA synthesis commences 4-8h post-infection (p.i.), and is completely inhibited by pre-treatment with cytosine β-d-arabinofuranoside (Ara-C). Transcription of CyHV-3 genes initiates after infection as early as 1-2h p.i., and precedes viral DNA synthesis. All 156 annotated open reading frames (ORFs) of the CyHV-3 genome are transcribed into RNAs, most of which can be classified into immediate early (IE or α), early (E or β) and late (L or γ) classes, similar to all other herpesviruses. Several ORFs belonging to these groups are clustered along the viral genome. PMID:22841491

  15. Meta4: a web application for sharing and annotating metagenomic gene predictions using web services

    PubMed Central

    Richardson, Emily J.; Escalettes, Franck; Fotheringham, Ian; Wallace, Robert J.; Watson, Mick

    2013-01-01

    Whole-genome shotgun metagenomics experiments produce DNA sequence data from entire ecosystems, and provide a huge amount of novel information. Gene discovery projects require up-to-date information about sequence homology and domain structure for millions of predicted proteins to be presented in a simple, easy-to-use system. There is a lack of simple, open, flexible tools that allow the rapid sharing of metagenomics datasets with collaborators in a format they can easily interrogate. We present Meta4, a flexible and extensible web application that can be used to share and annotate metagenomic gene predictions. Proteins and predicted domains are stored in a simple relational database, with a dynamic front-end which displays the results in an internet browser. Web services are used to provide up-to-date information about the proteins from homology searches against public databases. Information about Meta4 can be found on the project website, code is available on Github, a cloud image is available, and an example implementation can be seen at PMID:24046776

  16. Meta4: a web application for sharing and annotating metagenomic gene predictions using web services.

    PubMed

    Richardson, Emily J; Escalettes, Franck; Fotheringham, Ian; Wallace, Robert J; Watson, Mick

    2013-01-01

    Whole-genome shotgun metagenomics experiments produce DNA sequence data from entire ecosystems, and provide a huge amount of novel information. Gene discovery projects require up-to-date information about sequence homology and domain structure for millions of predicted proteins to be presented in a simple, easy-to-use system. There is a lack of simple, open, flexible tools that allow the rapid sharing of metagenomics datasets with collaborators in a format they can easily interrogate. We present Meta4, a flexible and extensible web application that can be used to share and annotate metagenomic gene predictions. Proteins and predicted domains are stored in a simple relational database, with a dynamic front-end which displays the results in an internet browser. Web services are used to provide up-to-date information about the proteins from homology searches against public databases. Information about Meta4 can be found on the project website, code is available on Github, a cloud image is available, and an example implementation can be seen at. PMID:24046776

  17. Warehousing re-annotated cancer genes for biomarker meta-analysis.

    PubMed

    Orsini, M; Travaglione, A; Capobianco, E

    2013-07-01

    Translational research in cancer genomics assigns a fundamental role to bioinformatics in support of candidate gene prioritization with regard to both biomarker discovery and target identification for drug development. Efforts in both such directions rely on the existence and constant update of large repositories of gene expression data and omics records obtained from a variety of experiments. Users who interactively interrogate such repositories may have problems in retrieving sample fields that present limited associated information, due for instance to incomplete entries or sometimes unusable files. Cancer-specific data sources present similar problems. Given that source integration usually improves data quality, one of the objectives is keeping the computational complexity sufficiently low to allow an optimal assimilation and mining of all the information. In particular, the scope of integrating intraomics data can be to improve the exploration of gene co-expression landscapes, while the scope of integrating interomics sources can be that of establishing genotype-phenotype associations. Both integrations are relevant to cancer biomarker meta-analysis, as the proposed study demonstrates. Our approach is based on re-annotating cancer-specific data available at the EBI's ArrayExpress repository and building a data warehouse aimed to biomarker discovery and validation studies. Cancer genes are organized by tissue with biomedical and clinical evidences combined to increase reproducibility and consistency of results. For better comparative evaluation, multiple queries have been designed to efficiently address all types of experiments and platforms, and allow for retrieval of sample-related information, such as cell line, disease state and clinical aspects. PMID:23639751

  18. Annotation extension through protein family annotation coherence metrics.

    PubMed

    Bastos, Hugo P; Clarke, Luka A; Couto, Francisco M

    2013-01-01

    Protein functional annotation consists in associating proteins with textual descriptors elucidating their biological roles. The bulk of annotation is done via automated procedures that ultimately rely on annotation transfer. Despite a large number of existing protein annotation procedures the ever growing protein space is never completely annotated. One of the facets of annotation incompleteness derives from annotation uncertainty. Often when protein function cannot be predicted with enough specificity it is instead conservatively annotated with more generic terms. In a scenario of protein families or functionally related (or even dissimilar) sets this leads to a more difficult task of using annotations to compare the extent of functional relatedness among all family or set members. However, we postulate that identifying sub-sets of functionally coherent proteins annotated at a very specific level, can help the annotation extension of other incompletely annotated proteins within the same family or functionally related set. As an example we analyse the status of annotation of a set of CAZy families belonging to the Polysaccharide Lyase class. We show that through the use of visualization methods and semantic similarity based metrics it is possible to identify families and respective annotation terms within them that are suitable for possible annotation extension. Based on our analysis we then propose a semi-automatic methodology leading to the extension of single annotation terms within these partially annotated protein sets or families. PMID:24130572

  19. Annotation extension through protein family annotation coherence metrics

    PubMed Central

    Bastos, Hugo P.; Clarke, Luka A.; Couto, Francisco M.

    2013-01-01

    Protein functional annotation consists in associating proteins with textual descriptors elucidating their biological roles. The bulk of annotation is done via automated procedures that ultimately rely on annotation transfer. Despite a large number of existing protein annotation procedures the ever growing protein space is never completely annotated. One of the facets of annotation incompleteness derives from annotation uncertainty. Often when protein function cannot be predicted with enough specificity it is instead conservatively annotated with more generic terms. In a scenario of protein families or functionally related (or even dissimilar) sets this leads to a more difficult task of using annotations to compare the extent of functional relatedness among all family or set members. However, we postulate that identifying sub-sets of functionally coherent proteins annotated at a very specific level, can help the annotation extension of other incompletely annotated proteins within the same family or functionally related set. As an example we analyse the status of annotation of a set of CAZy families belonging to the Polysaccharide Lyase class. We show that through the use of visualization methods and semantic similarity based metrics it is possible to identify families and respective annotation terms within them that are suitable for possible annotation extension. Based on our analysis we then propose a semi-automatic methodology leading to the extension of single annotation terms within these partially annotated protein sets or families. PMID:24130572

  20. Parallel-META 2.0: Enhanced Metagenomic Data Analysis with Functional Annotation, High Performance Computing and Advanced Visualization

    PubMed Central

    Song, Baoxing; Xu, Jian; Ning, Kang

    2014-01-01

    The metagenomic method directly sequences and analyses genome information from microbial communities. The main computational tasks for metagenomic analyses include taxonomical and functional structure analysis for all genomes in a microbial community (also referred to as a metagenomic sample). With the advancement of Next Generation Sequencing (NGS) techniques, the number of metagenomic samples and the data size for each sample are increasing rapidly. Current metagenomic analysis is both data- and computation- intensive, especially when there are many species in a metagenomic sample, and each has a large number of sequences. As such, metagenomic analyses require extensive computational power. The increasing analytical requirements further augment the challenges for computation analysis. In this work, we have proposed Parallel-META 2.0, a metagenomic analysis software package, to cope with such needs for efficient and fast analyses of taxonomical and functional structures for microbial communities. Parallel-META 2.0 is an extended and improved version of Parallel-META 1.0, which enhances the taxonomical analysis using multiple databases, improves computation efficiency by optimized parallel computing, and supports interactive visualization of results in multiple views. Furthermore, it enables functional analysis for metagenomic samples including short-reads assembly, gene prediction and functional annotation. Therefore, it could provide accurate taxonomical and functional analyses of the metagenomic samples in high-throughput manner and on large scale. PMID:24595159

  1. IsoSeq analysis and functional annotation of the infratentorial ependymoma tumor tissue on PacBio RSII platform.

    PubMed

    Singh, Neetu; Sahu, Dinesh Kumar; Chowdhry, Rebecca; Mishra, Archana; Goel, Madhu Mati; Faheem, Mohd; Srivastava, Chhitij; Ojha, Bal Krishna; Gupta, Devendra Kumar; Kant, Ravi

    2016-02-01

    Here, we sequenced and functionally annotated the long-read (1-2 kb) cDNA library of an infratentorial ependymoma tumor tissue on the PacBio RSII using the Iso-Seq protocol and SMRT technology. In total, 577 MB of data was generated from the brain tissue of an ependymoma tumor patient, producing 119,313 high-quality reads that were assembled into 19,878 contigs using the Celera assembler followed by Quiver pipelines; these yielded 2952 unique protein accessions in the nr protein database and 307 KEGG pathways. Additionally, when we compared second- and third-level GO terms with alternative splicing data obtained through HTA Array2.0, we identified four and twelve transcript cluster IDs in Level-2 and Level-3 scores, respectively, with an alternative splicing index predicting mainly the major pathways of the hallmarks of cancer. Of these, only the transcript cluster IDs of the genes PNMT, SNN and LAMB1 showed Reads Per Kilobase of exon model per Million mapped reads (RPKM) values in the gene-level expression (GE) and transcript-level (TE) tracks. Most importantly, the brain-specific genes PNMT, SNN and LAMB1 show involvement in ependymoma. PMID:26862483
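
    Note: the RPKM measure named in the abstract above is conventionally defined as reads mapped to a gene, scaled per kilobase of exon model and per million mapped reads. The sketch below spells out that arithmetic. The function and parameter names are hypothetical, and the example numbers are illustrative rather than values from the study.

```python
def rpkm(reads_in_gene, exon_length_bp, total_mapped_reads):
    """Reads Per Kilobase of exon model per Million mapped reads.

    RPKM = reads_in_gene / (exon_length_bp / 1e3) / (total_mapped_reads / 1e6)
         = reads_in_gene * 1e9 / (exon_length_bp * total_mapped_reads)
    """
    if exon_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total read count must be positive")
    return reads_in_gene * 1e9 / (exon_length_bp * total_mapped_reads)


# Illustrative values only: 100 reads on a 1 kb exon model,
# out of one million mapped reads in the library.
print(rpkm(reads_in_gene=100, exon_length_bp=1000, total_mapped_reads=1_000_000))
```

    Normalising by both length and library size is what makes values comparable across genes of different lengths and across runs of different depth.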

  2. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation.

    PubMed

    O'Leary, Nuala A; Wright, Mathew W; Brister, J Rodney; Ciufo, Stacy; Haddad, Diana; McVeigh, Rich; Rajput, Bhanu; Robbertse, Barbara; Smith-White, Brian; Ako-Adjei, Danso; Astashyn, Alexander; Badretdin, Azat; Bao, Yiming; Blinkova, Olga; Brover, Vyacheslav; Chetvernin, Vyacheslav; Choi, Jinna; Cox, Eric; Ermolaeva, Olga; Farrell, Catherine M; Goldfarb, Tamara; Gupta, Tripti; Haft, Daniel; Hatcher, Eneida; Hlavina, Wratko; Joardar, Vinita S; Kodali, Vamsi K; Li, Wenjun; Maglott, Donna; Masterson, Patrick; McGarvey, Kelly M; Murphy, Michael R; O'Neill, Kathleen; Pujar, Shashikant; Rangwala, Sanjida H; Rausch, Daniel; Riddick, Lillian D; Schoch, Conrad; Shkeda, Andrei; Storz, Susan S; Sun, Hanzhen; Thibaud-Nissen, Francoise; Tolstoy, Igor; Tully, Raymond E; Vatsan, Anjana R; Wallin, Craig; Webb, David; Wu, Wendy; Landrum, Melissa J; Kimchi, Avi; Tatusova, Tatiana; DiCuccio, Michael; Kitts, Paul; Murphy, Terence D; Pruitt, Kim D

    2016-01-01

    The RefSeq project at the National Center for Biotechnology Information (NCBI) maintains and curates a publicly available database of annotated genomic, transcript, and protein sequence records (http://www.ncbi.nlm.nih.gov/refseq/). The RefSeq project leverages the data submitted to the International Nucleotide Sequence Database Collaboration (INSDC) against a combination of computation, manual curation, and collaboration to produce a standard set of stable, non-redundant reference sequences. The RefSeq project augments these reference sequences with current knowledge including publications, functional features and informative nomenclature. The database currently represents sequences from more than 55 000 organisms (>4800 viruses, >40 000 prokaryotes and >10 000 eukaryotes; RefSeq release 71), ranging from a single record to complete genomes. This paper summarizes the current status of the viral, prokaryotic, and eukaryotic branches of the RefSeq project, reports on improvements to data access and details efforts to further expand the taxonomic representation of the collection. We also highlight diverse functional curation initiatives that support multiple uses of RefSeq data including taxonomic validation, genome annotation, comparative genomics, and clinical testing. We summarize our approach to utilizing available RNA-Seq and other data types in our manual curation process for vertebrate, plant, and other species, and describe a new direction for prokaryotic genomes and protein name management. PMID:26553804

  3. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation

    PubMed Central

    O'Leary, Nuala A.; Wright, Mathew W.; Brister, J. Rodney; Ciufo, Stacy; Haddad, Diana; McVeigh, Rich; Rajput, Bhanu; Robbertse, Barbara; Smith-White, Brian; Ako-Adjei, Danso; Astashyn, Alexander; Badretdin, Azat; Bao, Yiming; Blinkova, Olga; Brover, Vyacheslav; Chetvernin, Vyacheslav; Choi, Jinna; Cox, Eric; Ermolaeva, Olga; Farrell, Catherine M.; Goldfarb, Tamara; Gupta, Tripti; Haft, Daniel; Hatcher, Eneida; Hlavina, Wratko; Joardar, Vinita S.; Kodali, Vamsi K.; Li, Wenjun; Maglott, Donna; Masterson, Patrick; McGarvey, Kelly M.; Murphy, Michael R.; O'Neill, Kathleen; Pujar, Shashikant; Rangwala, Sanjida H.; Rausch, Daniel; Riddick, Lillian D.; Schoch, Conrad; Shkeda, Andrei; Storz, Susan S.; Sun, Hanzhen; Thibaud-Nissen, Francoise; Tolstoy, Igor; Tully, Raymond E.; Vatsan, Anjana R.; Wallin, Craig; Webb, David; Wu, Wendy; Landrum, Melissa J.; Kimchi, Avi; Tatusova, Tatiana; DiCuccio, Michael; Kitts, Paul; Murphy, Terence D.; Pruitt, Kim D.

    2016-01-01

    The RefSeq project at the National Center for Biotechnology Information (NCBI) maintains and curates a publicly available database of annotated genomic, transcript, and protein sequence records (http://www.ncbi.nlm.nih.gov/refseq/). The RefSeq project leverages the data submitted to the International Nucleotide Sequence Database Collaboration (INSDC) against a combination of computation, manual curation, and collaboration to produce a standard set of stable, non-redundant reference sequences. The RefSeq project augments these reference sequences with current knowledge including publications, functional features and informative nomenclature. The database currently represents sequences from more than 55 000 organisms (>4800 viruses, >40 000 prokaryotes and >10 000 eukaryotes; RefSeq release 71), ranging from a single record to complete genomes. This paper summarizes the current status of the viral, prokaryotic, and eukaryotic branches of the RefSeq project, reports on improvements to data access and details efforts to further expand the taxonomic representation of the collection. We also highlight diverse functional curation initiatives that support multiple uses of RefSeq data including taxonomic validation, genome annotation, comparative genomics, and clinical testing. We summarize our approach to utilizing available RNA-Seq and other data types in our manual curation process for vertebrate, plant, and other species, and describe a new direction for prokaryotic genomes and protein name management. PMID:26553804

  4. How to learn about gene function: text-mining or ontologies?

    PubMed

    Soldatos, Theodoros G; Perdigão, Nelson; Brown, Nigel P; Sabir, Kenneth S; O'Donoghue, Seán I

    2015-03-01

    As the amount of genome information increases rapidly, there is a correspondingly greater need for methods that provide accurate and automated annotation of gene function. For example, many high-throughput technologies--e.g., next-generation sequencing--are being used today to generate lists of genes associated with specific conditions. However, their functional interpretation remains a challenge and many tools exist that try to characterize the function of gene-lists. Such systems typically rely on enrichment analysis and aim to give a quick insight into the underlying biology by presenting it in the form of a summary report. While the load of annotation may be alleviated by such computational approaches, the main challenge in modern annotation remains to develop a systems form of analysis in which a pipeline can effectively analyze gene-lists quickly and identify aggregated annotations through computerized resources. In this article we survey some of the many such tools and methods that have been developed to automatically interpret the biological functions underlying gene-lists. We overview current functional annotation aspects from the perspective of their epistemology (i.e., the underlying theories used to organize information about gene function into a body of verified and documented knowledge) and find that most of the currently used functional annotation methods fall broadly into one of two categories: they are based either on 'known' formally-structured ontology annotations created by 'experts' (e.g., the GO terms used to describe the function of Entrez Gene entries), or--perhaps more adventurously--on annotations inferred from literature (e.g., many text-mining methods use computer-aided reasoning to acquire knowledge represented in natural languages). Overall however, deriving detailed and accurate insight from such gene lists remains a challenging task, and improved methods are called for.
    In particular, future methods need to (1) provide more holistic insight into the underlying molecular systems; (2) provide better follow-up experimental testing and treatment options, and (3) better manage gene lists derived from organisms that are not well-studied. We discuss some promising approaches that may help achieve these advances, especially the use of extended dictionaries of biomedical concepts and molecular mechanisms, as well as greater use of annotation benchmarks. PMID:25088781

  5. Automated update, revision, and quality control of the maize genome annotations using MAKER-P improves the B73 RefGen_v3 gene models and identifies new genes.

    PubMed

    Law, MeiYee; Childs, Kevin L; Campbell, Michael S; Stein, Joshua C; Olson, Andrew J; Holt, Carson; Panchy, Nicholas; Lei, Jikai; Jiao, Dian; Andorf, Carson M; Lawrence, Carolyn J; Ware, Doreen; Shiu, Shin-Han; Sun, Yanni; Jiang, Ning; Yandell, Mark

    2015-01-01

    The large size and relative complexity of many plant genomes make creation, quality control, and dissemination of high-quality gene structure annotations challenging. In response, we have developed MAKER-P, a fast and easy-to-use genome annotation engine for plants. Here, we report the use of MAKER-P to update and revise the maize (Zea mays) B73 RefGen_v3 annotation build (5b+) in less than 3 h using the iPlant Cyberinfrastructure. MAKER-P identified and annotated 4,466 additional, well-supported protein-coding genes not present in the 5b+ annotation build, added additional untranslated regions to 1,393 5b+ gene models, identified 2,647 5b+ gene models that lack any supporting evidence (despite the use of large and diverse evidence data sets), identified 104,215 pseudogene fragments, and created an additional 2,522 noncoding gene annotations. We also describe a method for de novo training of MAKER-P for the annotation of newly sequenced grass genomes. Collectively, these results lead to the 6a maize genome annotation and demonstrate the utility of MAKER-P for rapid annotation, management, and quality control of grasses and other difficult-to-annotate plant genomes. PMID:25384563

  6. Automated Update, Revision, and Quality Control of the Maize Genome Annotations Using MAKER-P Improves the B73 RefGen_v3 Gene Models and Identifies New Genes

    PubMed Central

    Law, MeiYee; Childs, Kevin L.; Campbell, Michael S.; Stein, Joshua C.; Olson, Andrew J.; Holt, Carson; Panchy, Nicholas; Lei, Jikai; Jiao, Dian; Andorf, Carson M.; Lawrence, Carolyn J.; Ware, Doreen; Shiu, Shin-Han; Sun, Yanni; Jiang, Ning; Yandell, Mark

    2015-01-01

    The large size and relative complexity of many plant genomes make creation, quality control, and dissemination of high-quality gene structure annotations challenging. In response, we have developed MAKER-P, a fast and easy-to-use genome annotation engine for plants. Here, we report the use of MAKER-P to update and revise the maize (Zea mays) B73 RefGen_v3 annotation build (5b+) in less than 3 h using the iPlant Cyberinfrastructure. MAKER-P identified and annotated 4,466 additional, well-supported protein-coding genes not present in the 5b+ annotation build, added additional untranslated regions to 1,393 5b+ gene models, identified 2,647 5b+ gene models that lack any supporting evidence (despite the use of large and diverse evidence data sets), identified 104,215 pseudogene fragments, and created an additional 2,522 noncoding gene annotations. We also describe a method for de novo training of MAKER-P for the annotation of newly sequenced grass genomes. Collectively, these results lead to the 6a maize genome annotation and demonstrate the utility of MAKER-P for rapid annotation, management, and quality control of grasses and other difficult-to-annotate plant genomes. PMID:25384563

  7. Functional Annotation and Three-Dimensional Structure of an Incorrectly Annotated Dihydroorotase from cog3964 in the Amidohydrolase Superfamily

    PubMed Central

    Ornelas, Argentina; Korczynska, Magdalena; Ragumani, Sugadev; Kumaran, Desigan; Narindoshvili, Tamari; Shoichet, Brian K.; Swaminathan, Subramanyam; Raushel, Frank M.

    2012-01-01

    The substrate specificities of two incorrectly annotated enzymes belonging to cog3964 from the amidohydrolase superfamily (AHS) were determined. This group of enzymes is currently misannotated as either dihydroorotase or adenine deaminase. Atu3266 from Agrobacterium tumefaciens C58 and Oant2987 from Ochrobactrum anthropi ATCC 49188 were determined to catalyze the hydrolysis of acetyl-R-mandelate and similar esters with values of kcat/Km that exceed 10⁵ M⁻¹ s⁻¹. These enzymes do not catalyze the deamination of adenine or the hydrolysis of dihydroorotate. Atu3266 was crystallized and the structure determined to a resolution of 2.62 Å. The protein folds as a distorted (β/α)₈-barrel and binds two zincs in the active site. The substrate profile was determined via a combination of computational docking to the three-dimensional structure of Atu3266 and screening of a highly focused library of potential substrates. The initial weak hit was the hydrolysis of N-acetyl-D-serine (kcat/Km = 4 M⁻¹ s⁻¹). This was followed by the progressive identification of acetyl-R-glycerate (4 × 10² M⁻¹ s⁻¹), acetyl glycolate (kcat/Km = 1.3 × 10⁴ M⁻¹ s⁻¹) and ultimately acetyl-R-mandelate (kcat/Km = 2.8 × 10⁵ M⁻¹ s⁻¹). PMID:23214420

  8. Analysis of the leaf transcriptome of Musa acuminata during interaction with Mycosphaerella musicola: gene assembly, annotation and marker development

    PubMed Central

    2013-01-01

    Background Although banana (Musa sp.) is an important edible crop, contributing towards poverty alleviation and food security, limited transcriptome datasets are available for use in accelerated molecular-based breeding in this genus. 454 GS-FLX Titanium technology was employed to determine the sequence of gene transcripts in genotypes of Musa acuminata ssp. burmannicoides Calcutta 4 and M. acuminata subgroup Cavendish cv. Grande Naine, contrasting in resistance to the fungal pathogen Mycosphaerella musicola, causal organism of Sigatoka leaf spot disease. To enrich for transcripts under biotic stress responses, full length-enriched cDNA libraries were prepared from whole plant leaf materials, both uninfected and artificially challenged with pathogen conidiospores. Results The study generated 846,762 high quality sequence reads, with an average length of 334 bp and totalling 283 Mbp. De novo assembly generated 36,384 and 35,269 unigene sequences for M. acuminata Calcutta 4 and Cavendish Grande Naine, respectively. A total of 64.4% of the unigenes were annotated through Basic Local Alignment Search Tool (BLAST) similarity analyses against public databases. Assembled sequences were functionally mapped to Gene Ontology (GO) terms, with unigene functions covering a diverse range of molecular functions, biological processes and cellular components. Genes from a number of defense-related pathways were observed in transcripts from each cDNA library. Over 99% of contig unigenes mapped to exon regions in the reference M. acuminata DH Pahang whole genome sequence. A total of 4068 genic-SSR loci were identified in Calcutta 4 and 4095 in Cavendish Grande Naine. A subset of 95 potential defense-related gene-derived simple sequence repeat (SSR) loci were validated for specific amplification and polymorphism across M. acuminata accessions. 
    Fourteen loci were polymorphic, with alleles per polymorphic locus ranging from 3 to 8 and polymorphism information content ranging from 0.34 to 0.82. Conclusions A large set of unigenes were characterized in this study for both M. acuminata Calcutta 4 and Cavendish Grande Naine, increasing the number of public domain Musa ESTs. This transcriptome is an invaluable resource for furthering our understanding of biological processes elicited during biotic stresses in Musa. Gene-based markers will facilitate molecular breeding strategies, forming the basis of genetic linkage mapping and analysis of quantitative trait loci. PMID:23379821
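
    Note: the abstract above reports polymorphism information content (PIC) per SSR locus but does not say how it was computed. The sketch below assumes the widely used formulation attributed to Botstein and colleagues, PIC = 1 - sum(p_i^2) - sum over i<j of 2 p_i^2 p_j^2, where p_i are allele frequencies at the locus. The function name and the example frequencies are hypothetical.

```python
from itertools import combinations


def polymorphism_information_content(allele_freqs):
    """PIC for one locus from its allele frequencies (must sum to 1)."""
    freqs = list(allele_freqs)
    if any(p < 0 for p in freqs) or abs(sum(freqs) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must be non-negative and sum to 1")
    # Sum of squared frequencies (expected homozygosity).
    homozygosity = sum(p * p for p in freqs)
    # Correction for matings that are uninformative despite heterozygosity.
    uninformative = sum(
        2 * (p * p) * (q * q) for p, q in combinations(freqs, 2)
    )
    return 1.0 - homozygosity - uninformative


# Hypothetical three-allele locus, for illustration only.
print(round(polymorphism_information_content([0.5, 0.3, 0.2]), 3))
```

    A locus with a single allele gives 0 under this definition, and the value rises as alleles become more numerous and more evenly distributed.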

  9. The Zebrafish GenomeWiki: a crowdsourcing approach to connect the long tail for zebrafish gene annotation

    PubMed Central

    Singh, Meghna; Bhartiya, Deeksha; Maini, Jayant; Sharma, Meenakshi; Singh, Angom Ramcharan; Kadarkaraisamy, Subburaj; Rana, Rajiv; Sabharwal, Ankit; Nanda, Srishti; Ramachandran, Aravindhakshan; Mittal, Ashish; Kapoor, Shruti; Sehgal, Paras; Asad, Zainab; Kaushik, Kriti; Vellarikkal, Shamsudheen Karuthedath; Jagga, Divya; Muthuswami, Muthulakshmi; Chauhan, Rajendra K.; Leonard, Elvin; Priyadarshini, Ruby; Halimani, Mahantappa; Malhotra, Sunny; Patowary, Ashok; Vishwakarma, Harinder; Joshi, Prateek; Bhardwaj, Vivek; Bhaumik, Arijit; Bhatt, Bharat; Jha, Aamod; Kumar, Aalok; Budakoti, Prerna; Lalwani, Mukesh Kumar; Meli, Rajeshwari; Jalali, Saakshi; Joshi, Kandarp; Pal, Koustav; Dhiman, Heena; Laddha, Saurabh V.; Jadhav, Vaibhav; Singh, Naresh; Pandey, Vikas; Sachidanandan, Chetana; Ekker, Stephen C.; Klee, Eric W.; Scaria, Vinod; Sivasubbu, Sridhar

    2014-01-01

    A large repertoire of gene-centric data has been generated in the field of zebrafish biology. Although the bulk of these data are available in the public domain, most of them are not readily accessible or available in nonstandard formats. One major challenge is to unify and integrate these widely scattered data sources. We tested the hypothesis that active community participation could be a viable option to address this challenge. We present here our approach to create standards for assimilation and sharing of information and a system of open standards for database intercommunication. We have attempted to address this challenge by creating a community-centric solution for zebrafish gene annotation. The Zebrafish GenomeWiki is a ‘wiki’-based resource, which aims to provide an altruistic shared environment for collective annotation of the zebrafish genes. The Zebrafish GenomeWiki has features that enable users to comment, annotate, edit and rate this gene-centric information. The credits for contributions can be tracked through a transparent microattribution system. In contrast to other wikis, the Zebrafish GenomeWiki is a ‘structured wiki’ or rather a ‘semantic wiki’. The Zebrafish GenomeWiki implements a semantically linked data structure, which in the future would be amenable to semantic search. Database URL: http://genome.igib.res.in/twiki PMID:24578356

  10. The Zebrafish GenomeWiki: a crowdsourcing approach to connect the long tail for zebrafish gene annotation.

    PubMed

    Singh, Meghna; Bhartiya, Deeksha; Maini, Jayant; Sharma, Meenakshi; Singh, Angom Ramcharan; Kadarkaraisamy, Subburaj; Rana, Rajiv; Sabharwal, Ankit; Nanda, Srishti; Ramachandran, Aravindhakshan; Mittal, Ashish; Kapoor, Shruti; Sehgal, Paras; Asad, Zainab; Kaushik, Kriti; Vellarikkal, Shamsudheen Karuthedath; Jagga, Divya; Muthuswami, Muthulakshmi; Chauhan, Rajendra K; Leonard, Elvin; Priyadarshini, Ruby; Halimani, Mahantappa; Malhotra, Sunny; Patowary, Ashok; Vishwakarma, Harinder; Joshi, Prateek; Bhardwaj, Vivek; Bhaumik, Arijit; Bhatt, Bharat; Jha, Aamod; Kumar, Aalok; Budakoti, Prerna; Lalwani, Mukesh Kumar; Meli, Rajeshwari; Jalali, Saakshi; Joshi, Kandarp; Pal, Koustav; Dhiman, Heena; Laddha, Saurabh V; Jadhav, Vaibhav; Singh, Naresh; Pandey, Vikas; Sachidanandan, Chetana; Ekker, Stephen C; Klee, Eric W; Scaria, Vinod; Sivasubbu, Sridhar

    2014-01-01

    A large repertoire of gene-centric data has been generated in the field of zebrafish biology. Although the bulk of these data are available in the public domain, most of them are not readily accessible or are available only in nonstandard formats. One major challenge is to unify and integrate these widely scattered data sources. We tested the hypothesis that active community participation could be a viable option to address this challenge. We present here our approach to create standards for assimilation and sharing of information and a system of open standards for database intercommunication. We have attempted to address this challenge by creating a community-centric solution for zebrafish gene annotation. The Zebrafish GenomeWiki is a 'wiki'-based resource, which aims to provide an altruistic shared environment for collective annotation of the zebrafish genes. The Zebrafish GenomeWiki has features that enable users to comment, annotate, edit and rate this gene-centric information. The credits for contributions can be tracked through a transparent microattribution system. In contrast to other wikis, the Zebrafish GenomeWiki is a 'structured wiki' or rather a 'semantic wiki'. The Zebrafish GenomeWiki implements a semantically linked data structure, which in the future would be amenable to semantic search. Database URL: http://genome.igib.res.in/twiki. PMID:24578356

  11. An Innovative Plant Genomics and Gene Annotation Program for High School, Community College, and University Faculty

    PubMed Central

    Hilgert, Uwe; Nash, E. Bruce; Micklos, David A.

    2008-01-01

    Today's biology educators face the challenge of training their students in modern molecular biology techniques including genomics and bioinformatics. The Dolan DNA Learning Center (DNALC) of Cold Spring Harbor Laboratory has developed and disseminated a bench- and computer-based plant genomics curriculum for biology faculty. In 2007, a five-day “Plant Genomics and Gene Annotation” workshop was held at Florida A&M University in Tallahassee, FL, to enhance participants' knowledge and understanding of plant molecular genetics and assist them in developing and honing their laboratory and computer skills. Florida A&M University is a historically black university with over 95% African-American student enrollment. Sixteen participants, including high school (56%) and community college faculty (25%), attended the workshop. Participants carried out in vitro and in silico experiments with maize, Arabidopsis, soybean, and food products to determine the genotype of the samples. Benefits of the workshop included increased awareness of plant biology research for high school and college level students. Participants completed pre- and postworkshop evaluations for the measurement of effectiveness. Participants demonstrated an overall improvement in their postworkshop evaluation scores. This article provides a detailed description of workshop activities, as well as assessment and long-term support for broad classroom implementation. PMID:18765753

  12. Analysis and Functional Annotation of an Expressed Sequence Tag Collection for Tropical Crop Sugarcane

    PubMed Central

    Vettore, André L.; da Silva, Felipe R.; Kemper, Edson L.; Souza, Glaucia M.; da Silva, Aline M.; Ferro, Maria Inês T.; Henrique-Silva, Flavio; Giglioti, Éder A.; Lemos, Manoel V.F.; Coutinho, Luiz L.; Nobrega, Marina P.; Carrer, Helaine; França, Suzelei C.; Bacci, Maurício; Goldman, Maria Helena S.; Gomes, Suely L.; Nunes, Luiz R.; Camargo, Luis E.A.; Siqueira, Walter J.; Van Sluys, Marie-Anne; Thiemann, Otavio H.; Kuramae, Eiko E.; Santelli, Roberto V.; Marino, Celso L.; Targon, Maria L.P.N.; Ferro, Jesus A.; Silveira, Henrique C.S.; Marini, Danyelle C.; Lemos, Eliana G.M.; Monteiro-Vitorello, Claudia B.; Tambor, José H.M.; Carraro, Dirce M.; Roberto, Patrícia G.; Martins, Vanderlei G.; Goldman, Gustavo H.; de Oliveira, Regina C.; Truffi, Daniela; Colombo, Carlos A.; Rossi, Magdalena; de Araujo, Paula G.; Sculaccio, Susana A.; Angella, Aline; Lima, Marleide M.A.; de Rosa, Vicente E.; Siviero, Fábio; Coscrato, Virginia E.; Machado, Marcos A.; Grivet, Laurent; Di Mauro, Sonia M.Z.; Nobrega, Francisco G.; Menck, Carlos F.M.; Braga, Marilia D.V.; Telles, Guilherme P.; Cara, Frank A.A.; Pedrosa, Guilherme; Meidanis, João; Arruda, Paulo

    2003-01-01

    To contribute to our understanding of the genome complexity of sugarcane, we undertook a large-scale expressed sequence tag (EST) program. More than 260,000 cDNA clones were partially sequenced from 26 standard cDNA libraries generated from different sugarcane tissues. After the processing of the sequences, 237,954 high-quality ESTs were identified. These ESTs were assembled into 43,141 putative transcripts. Of the assembled sequences, 35.6% presented no matches with existing sequences in public databases. A global analysis of the whole SUCEST data set indicated that 14,409 assembled sequences (33% of the total) contained at least one cDNA clone with a full-length insert. Annotation of the 43,141 assembled sequences associated almost 50% of the putative identified sugarcane genes with protein metabolism, cellular communication/signal transduction, bioenergetics, and stress responses. Inspection of the translated assembled sequences for conserved protein domains revealed 40,821 amino acid sequences with 1415 Pfam domains. Reassembling the consensus sequences of the 43,141 transcripts revealed a 22% redundancy in the first assembling. This indicated that possibly 33,620 unique genes had been identified and indicated that >90% of the sugarcane expressed genes were tagged. PMID:14613979
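    The percentages quoted in this abstract can be cross-checked with a few lines of arithmetic. The sketch below uses only the counts stated above; the variable names are illustrative, and the formula for redundancy (one minus the ratio of unique genes to assembled transcripts) is an assumption about how the authors derived the figure.

```python
# Counts taken directly from the abstract.
assembled_transcripts = 43_141
full_length_sequences = 14_409
estimated_unique_genes = 33_620

# Share of assembled sequences with at least one full-length clone.
full_length_fraction = full_length_sequences / assembled_transcripts

# Assumed definition: redundancy is the share of transcripts that
# collapse onto another transcript when the consensus is reassembled.
redundancy = 1 - estimated_unique_genes / assembled_transcripts

print(f"full-length fraction: {full_length_fraction:.1%}")
print(f"implied redundancy:   {redundancy:.1%}")
```

    Under that assumption, both values round to the figures reported in the abstract (33% and 22%).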

  13. Functional Analysis of the Molecular Interactions of TATA Box-Containing Genes and Essential Genes

    PubMed Central

    Moon, Jisook

    2015-01-01

    Genes can be divided into TATA-containing genes and TATA-less genes according to the presence of TATA box elements at promoter regions. TATA-containing genes tend to be stress-responsive, whereas many TATA-less genes are known to be related to cell growth or “housekeeping” functions. In a previous study, we demonstrated that there are striking differences among four gene sets defined by the presence of TATA box (TATA-containing) and essentiality (TATA-less) with respect to number of associated transcription factors, amino acid usage, and functional annotation. Extending this research in yeast, we identified KEGG (Kyoto Encyclopedia of Genes and Genomes) pathways that are statistically enriched in TATA-containing or TATA-less genes and evaluated the possibility that the enriched pathways are related to stress or growth as reflected by the individual functions of the genes involved. According to their enrichment for either of these two gene sets, we sorted KEGG pathways into TATA-containing-gene-enriched pathways (TEPs) and essential-gene-enriched pathways (EEPs). As expected, genes in TEPs and EEPs exhibited opposite results in terms of functional category, transcriptional regulation, codon adaptation index, and network properties, suggesting the possibility that the bipolar patterns in these pathways also contribute to the regulation of the stress response and to cell survival. Our findings provide the novel insight that significant enrichment of TATA-binding or TATA-less genes defines pathways as stress-responsive or growth-related. PMID:25789484

  14. Functional Annotation of Conserved Hypothetical Proteins from Haemophilus influenzae Rd KW20

    PubMed Central

    Shahbaaz, Mohd; Hassan, Md. Imtaiyaz; Ahmad, Faizan

    2013-01-01

    Haemophilus influenzae is a Gram-negative bacterium of the family Pasteurellaceae that causes bacteremia, pneumonia and acute bacterial meningitis in infants. The emergence of multidrug-resistant H. influenzae strains in clinical isolates demands the development of better or new drugs against this pathogen. Our study combines a number of bioinformatics tools to predict the function of previously unassigned proteins in the genome of H. influenzae. Extensive analysis of this genome found 1,657 functional proteins, of which 429 have unknown function and are termed hypothetical proteins (HPs). Amino acid sequences of all 429 HPs were extensively annotated, and we successfully assigned function to 296 HPs with high confidence. We also characterized the function of 124 HPs precisely, but with less confidence. We believe that the sequence of a protein can be used as a framework to explain known functional properties. Here we have combined the latest versions of protein family databases, protein motifs, intrinsic features from the amino acid sequence, pathway and genome context methods to assign a precise function to hypothetical proteins for which no experimental information is available. We found that these HPs belong to various classes of proteins such as enzymes, transporters, carriers, receptors, signal transducers, binding proteins, virulence and other proteins. The outcome of this work will be helpful for a better understanding of the mechanism of pathogenesis and in finding novel therapeutic targets for H. influenzae. PMID:24391926

  15. IsoSeq analysis and functional annotation of the infratentorial ependymoma tumor tissue on PacBio RSII platform

    PubMed Central

    Singh, Neetu; Sahu, Dinesh Kumar; Chowdhry, Rebecca; Mishra, Archana; Goel, Madhu Mati; Faheem, Mohd; Srivastava, Chhitij; Ojha, Bal Krishna; Gupta, Devendra Kumar; Kant, Ravi

    2015-01-01

    Here, we sequenced and functionally annotated the long-read (1–2 kb) cDNA library of an infratentorial ependymoma tumor tissue on PacBio RSII by the Iso-Seq protocol using SMRT technology. 577 MB of data was generated from the brain tissue of an ependymoma tumor patient, producing 119,313 high-quality reads assembled into 19,878 contigs using the Celera assembler followed by Quiver pipelines, which produced 2952 unique protein accessions in the nr protein database and 307 KEGG pathways. Additionally, when we compared GO terms of the second and third level with alternative splicing data obtained through HTA Array2.0, we identified four and twelve transcript cluster IDs in Level-2 and Level-3 scores respectively with alternative splicing index predicting mainly the major pathways of hallmarks of cancer. Out of these transcript cluster IDs, only those of the genes PNMT, SNN and LAMB1 showed Reads Per Kilobase of exon model per Million mapped reads (RPKM) values at the gene-level expression (GE) and transcript-level (TE) tracks. Most importantly, the brain-specific genes PNMT, SNN and LAMB1 show their involvement in ependymoma. PMID:26862483
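    The RPKM measure named in this abstract follows directly from its expansion: reads are normalized by exon length in kilobases and by sequencing depth in millions of mapped reads. A minimal sketch, with a hypothetical function name and made-up input values (none of these numbers come from the study):

```python
def rpkm(read_count: int, exon_length_bp: int, total_mapped_reads: int) -> float:
    """Reads Per Kilobase of exon model per Million mapped reads."""
    if exon_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("exon length and total mapped reads must be positive")
    kilobases = exon_length_bp / 1_000           # exon model length in kb
    millions = total_mapped_reads / 1_000_000    # library size in millions
    return read_count / (kilobases * millions)


# Illustrative values only: 500 reads on a 2 kb exon model
# in a library of 10 million mapped reads.
example_value = rpkm(read_count=500, exon_length_bp=2_000, total_mapped_reads=10_000_000)
print(example_value)
```

    Dividing by both length and depth is what makes values comparable across genes of different sizes and across libraries of different depths.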

  16. The relationship between protein sequences and their gene ontology functions

    PubMed Central

    Duan, Zhong-Hui; Hughes, Brent; Reichel, Lothar; Perez, Dianne M; Shi, Ting

    2006-01-01

    Background One main research challenge in the post-genomic era is to understand the relationship between protein sequences and their biological functions. In recent years, several automated annotation systems have been developed for the functional assignment of uncharacterized proteins. The underlying assumption of these systems is that similar sequences imply similar biological functions. However, it has been noted that matching sequences do not always imply similar functions. Results In this paper, we present the correlation between protein sequences and protein functions for the yeast proteome in the context of gene ontology. A novel measure is introduced to define the overall similarity between two protein sequences. The effects of the level as well as the size of a gene ontology group on the degree of similarity were studied. The similarity distributions at different levels of gene ontology trees are presented. To evaluate the theoretical prediction power of similar sequences, we computed the posterior probability of correct predictions. Conclusion The results indicate that protein pairs of similar biological functions tend to have higher sequence similarity, although the similarity distribution in each functional group is heterogeneous and varies from group to group. We conclude that sequence similarity can serve as a key measure in protein function prediction. However, the resulting annotations must be verified through other means. A method that combines a broader range of measures is more likely to provide more accurate prediction. Our study indicates that the posterior probability of a correct prediction could serve as one of the key measures. PMID:17217503

  17. Gene Set Enrichment in eQTL Data Identifies Novel Annotations and Pathway Regulators

    PubMed Central

    Wu, Chunlei; Delano, David L.; Mitro, Nico; Su, Stephen V.; Janes, Jeff; McClurg, Phillip; Batalov, Serge; Welch, Genevieve L.; Zhang, Jie; Orth, Anthony P.; Walker, John R.; Glynne, Richard J.; Cooke, Michael P.; Takahashi, Joseph S.; Shimomura, Kazuhiro; Kohsaka, Akira; Bass, Joseph; Saez, Enrique; Wiltshire, Tim; Su, Andrew I.

    2008-01-01

    Genome-wide gene expression profiling has been extensively used to generate biological hypotheses based on differential expression. Recently, many studies have used microarrays to measure gene expression levels across genetic mapping populations. These gene expression phenotypes have been used for genome-wide association analyses, an analysis referred to as expression QTL (eQTL) mapping. Here, eQTL analysis was performed in adipose tissue from 28 inbred strains of mice. We focused our analysis on “trans-eQTL bands”, defined as instances in which the expression patterns of many genes were all associated to a common genetic locus. Genes comprising trans-eQTL bands were screened for enrichments in functional gene sets representing known biological pathways, and genes located at associated trans-eQTL band loci were considered candidate transcriptional modulators. We demonstrate that these patterns were enriched for previously characterized relationships between known upstream transcriptional regulators and their downstream target genes. Moreover, we used this strategy to identify both novel regulators and novel members of known pathways. Finally, based on a putative regulatory relationship identified in our analysis, we identified and validated a previously uncharacterized role for cyclin H in the regulation of oxidative phosphorylation. We believe that the specific molecular hypotheses generated in this study will reveal many additional pathway members and regulators, and that the analysis approaches described herein will be broadly applicable to other eQTL data sets. PMID:18464898
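    Screening a group of genes for enrichment in a functional gene set is often done with a one-sided hypergeometric test. The abstract does not say which statistic the authors used, so the sketch below is an assumption that illustrates the general idea; the function name and the toy numbers are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(universe: int, in_set: int, drawn: int, overlap: int) -> float:
    """P(X >= overlap) when `drawn` genes are sampled without replacement
    from `universe` genes, `in_set` of which belong to the gene set."""
    upper = min(in_set, drawn)
    total = comb(universe, drawn)
    tail = sum(
        comb(in_set, k) * comb(universe - in_set, drawn - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Toy example: 20,000 genes measured, 150 in a pathway,
# 200 genes in a trans-eQTL band, 12 of them in the pathway.
p_value = hypergeom_enrichment_p(universe=20_000, in_set=150, drawn=200, overlap=12)
print(f"enrichment p-value: {p_value:.3g}")
```

    In a real screen the resulting p-values would still need correction for the number of gene sets tested.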

  18. Transcriptome Analysis of the Emerald Ash Borer (EAB), Agrilus planipennis: De Novo Assembly, Functional Annotation and Comparative Analysis

    PubMed Central

    Duan, Jun; Ladd, Tim; Doucet, Daniel; Cusson, Michel; vanFrankenhuyzen, Kees; Mittapalli, Omprakash; Krell, Peter J.; Quan, Guoxing

    2015-01-01

    Background The Emerald ash borer (EAB), Agrilus planipennis, is an invasive phloem-feeding insect pest of ash trees. Since its initial discovery near the Detroit, US-Windsor, Canada area in 2002, the spread of EAB has had strong negative economic, social and environmental impacts in both countries. Several transcriptomes from specific tissues including midgut, fat body and antenna have recently been generated. However, the relatively low sequence depth, gene coverage and completeness limited the usefulness of these EAB databases. Methodology and Principal Findings High-throughput deep RNA-Sequencing (RNA-Seq) was used to obtain 473.9 million pairs of 100 bp length paired-end reads from various life stages and tissues. These reads were assembled into 88,907 contigs using the Trinity strategy and integrated into 38,160 unigenes after redundant sequences were removed. We annotated 11,229 unigenes by searching against the public nr, Swiss-Prot and COG. The EAB transcriptome assembly was compared with 13 other sequenced insect species, resulting in the prediction of 536 unigenes that are Coleoptera-specific. Differential gene expression revealed that 290 unigenes are expressed during larval molting and 3,911 unigenes during metamorphosis from larvae to pupae, respectively (FDR < 0.01 and log2 FC > 2). In addition, 1,167 differentially expressed unigenes were identified from larval and adult midguts, 435 unigenes were up-regulated in larval midgut and 732 unigenes were up-regulated in adult midgut. Most of the genes involved in RNA interference (RNAi) pathways were identified, which implies the existence of a system RNAi in EAB. Conclusions and Significance This study provides one of the most fundamental and comprehensive transcriptome resources available for EAB to date. Identification of the tissue-, stage- or species-specific unigenes will benefit the further study of gene functions during growth and metamorphosis processes in EAB and other pest insects. PMID:26244979
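    The differential-expression cutoffs quoted in this abstract (FDR below 0.01 and log2 fold change above 2) amount to a simple two-condition filter. The sketch below applies them exactly as written; the record type, the unigene identifiers and the values are hypothetical, and whether the authors applied the fold-change cutoff to the absolute value is not stated.

```python
from typing import NamedTuple


class DeResult(NamedTuple):
    unigene: str
    log2_fc: float  # log2 fold change between two conditions
    fdr: float      # false discovery rate (adjusted p-value)


def significant(results, fdr_cutoff=0.01, log2_fc_cutoff=2.0):
    """Keep records passing both thresholds, applied as stated in the abstract."""
    return [r for r in results if r.fdr < fdr_cutoff and r.log2_fc > log2_fc_cutoff]


# Made-up records for illustration only.
toy_results = [
    DeResult("unigene_0001", 3.4, 0.0005),  # passes both cutoffs
    DeResult("unigene_0002", 1.2, 0.0001),  # fold change too small
    DeResult("unigene_0003", 4.1, 0.0400),  # FDR too high
]
hits = significant(toy_results)
print([r.unigene for r in hits])
```

    Replacing `r.log2_fc` with `abs(r.log2_fc)` would also retain strongly down-regulated unigenes, which is a common variant of this filter.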

  19. Reannotation and extended community resources for the genome of the non-seed plant Physcomitrella patens provide insights into the evolution of plant gene structures and functions

    PubMed Central

    2013-01-01

    Background The moss Physcomitrella patens as a model species provides an important reference for early-diverging lineages of plants and the release of the genome in 2008 opened the doors to genome-wide studies. The usability of a reference genome greatly depends on the quality of the annotation and the availability of centralized community resources. Therefore, in the light of accumulating evidence for missing genes, fragmentary gene structures, false annotations and a low rate of functional annotations on the original release, we decided to improve the moss genome annotation. Results Here, we report the complete moss genome re-annotation (designated V1.6) incorporating the increased transcript availability from a multitude of developmental stages and tissue types. We demonstrate the utility of the improved P. patens genome annotation for comparative genomics and new extensions to the cosmoss.org resource as a central repository for this plant “flagship” genome. The structural annotation of 32,275 protein-coding genes results in 8387 additional loci including 1456 loci with known protein domains or homologs in Plantae. This is the first release to include information on transcript isoforms, suggesting alternative splicing events for at least 10.8% of the loci. Furthermore, this release now also provides information on non-protein-coding loci. Functional annotations were improved regarding quality and coverage, resulting in 58% annotated loci (previously: 41%) that comprise also 7200 additional loci with GO annotations. Access and manual curation of the functional and structural genome annotation is provided via the http://www.cosmoss.org model organism database. Conclusions Comparative analysis of gene structure evolution along the green plant lineage provides novel insights, such as a comparatively high number of loci with 5’-UTR introns in the moss. Comparative analysis of functional annotations reveals expansions of moss house-keeping and metabolic genes and further possibly adaptive, lineage-specific expansions and gains including at least 13% orphan genes. PMID:23879659

  20. Improving Functional Annotation in the DRE-TIM Metallolyase Superfamily through Identification of Active Site Fingerprints.

    PubMed

    Kumar, Garima; Johnson, Jordyn L; Frantom, Patrick A

    2016-03-29

    Within the DRE-TIM metallolyase superfamily, members of the Claisen-like condensation (CC-like) subgroup catalyze C-C bond-forming reactions between various α-ketoacids and acetyl-coenzyme A. These reactions are important in the metabolic pathways of many bacterial pathogens and serve as engineering scaffolds for the production of long-chain alcohol biofuels. To improve functional annotation and identify sequences that might use novel substrates in the CC-like subgroup, a combination of structural modeling and multiple-sequence alignments identified active site residues on the third, fourth, and fifth β-strands of the TIM-barrel catalytic domain that are differentially conserved within the substrate-diverse enzyme families. Using α-isopropylmalate synthase and citramalate synthase from Methanococcus jannaschii (MjIPMS and MjCMS), site-directed mutagenesis was used to test the role of each identified position in substrate selectivity. Kinetic data suggest that residues at the β3-5 and β4-7 positions play a significant role in the selection of α-ketoisovalerate over pyruvate in MjIPMS. However, complementary substitutions in MjCMS fail to alter substrate specificity, suggesting residues in these positions do not contribute to substrate selectivity in this enzyme. Analysis of the kinetic data with respect to a protein similarity network for the CC-like subgroup suggests that evolutionarily distinct forms of IPMS utilize residues at the β3-5 and β4-7 positions to affect substrate selectivity while the different versions of CMS use unique architectures. Importantly, mapping the identities of residues at the β3-5 and β4-7 positions onto the protein similarity network allows for rapid annotation of probable IPMS enzymes as well as several outlier sequences that may represent novel functions in the subgroup. PMID:26935545

  1. De novo cloning and annotation of genes associated with immunity, detoxification and energy metabolism from the fat body of the oriental fruit fly, Bactrocera dorsalis.

    PubMed

    Yang, Wen-Jia; Yuan, Guo-Rui; Cong, Lin; Xie, Yi-Fei; Wang, Jin-Jun

    2014-01-01

    The oriental fruit fly, Bactrocera dorsalis, is a destructive pest in tropical and subtropical areas. In this study, we performed transcriptome-wide analysis of the fat body of B. dorsalis and obtained more than 59 million sequencing reads, which were assembled into 27,787 unigenes with an average length of 591 bp. Among them, 17,442 (62.8%) unigenes matched known proteins in the NCBI database. The assembled sequences were further annotated with gene ontology, cluster of orthologous group terms, and Kyoto encyclopedia of genes and genomes. In-depth analysis was performed to identify genes putatively involved in immunity, detoxification, and energy metabolism. Many new genes were identified including serpins, peptidoglycan recognition proteins and defensins, which were potentially linked to immune defense. Many detoxification genes were identified, including cytochrome P450s, glutathione S-transferases and ATP-binding cassette (ABC) transporters. Many new transcripts possibly involved in energy metabolism, including fatty acid desaturases, lipases, alpha amylases, and trehalose-6-phosphate synthases, were identified. Moreover, we randomly selected some genes to examine their expression patterns in different tissues by quantitative real-time PCR, which indicated that some genes exhibited fat body-specific expression in B. dorsalis. The identification of numerous transcripts in the fat body of B. dorsalis laid the foundation for future studies on the functions of these genes. PMID:24710118

  2. De novo Cloning and Annotation of Genes Associated with Immunity, Detoxification and Energy Metabolism from the Fat Body of the Oriental Fruit Fly, Bactrocera dorsalis

    PubMed Central

    Yang, Wen-Jia; Yuan, Guo-Rui; Cong, Lin; Xie, Yi-Fei; Wang, Jin-Jun

    2014-01-01

    The oriental fruit fly, Bactrocera dorsalis, is a destructive pest in tropical and subtropical areas. In this study, we performed transcriptome-wide analysis of the fat body of B. dorsalis and obtained more than 59 million sequencing reads, which were assembled into 27,787 unigenes with an average length of 591 bp. Among them, 17,442 (62.8%) unigenes matched known proteins in the NCBI database. The assembled sequences were further annotated with gene ontology, cluster of orthologous group terms, and Kyoto encyclopedia of genes and genomes. In-depth analysis was performed to identify genes putatively involved in immunity, detoxification, and energy metabolism. Many new genes were identified including serpins, peptidoglycan recognition proteins and defensins, which were potentially linked to immune defense. Many detoxification genes were identified, including cytochrome P450s, glutathione S-transferases and ATP-binding cassette (ABC) transporters. Many new transcripts possibly involved in energy metabolism, including fatty acid desaturases, lipases, alpha amylases, and trehalose-6-phosphate synthases, were identified. Moreover, we randomly selected some genes to examine their expression patterns in different tissues by quantitative real-time PCR, which indicated that some genes exhibited fat body-specific expression in B. dorsalis. The identification of numerous transcripts in the fat body of B. dorsalis laid the foundation for future studies on the functions of these genes. PMID:24710118

  3. Data-poor categorization and passage retrieval for Gene Ontology Annotation in Swiss-Prot

    PubMed Central

    Ehrler, Frédéric; Geissbühler, Antoine; Jimeno, Antonio; Ruch, Patrick

    2005-01-01

    Background In the context of the BioCreative competition, where training data were very sparse, we investigated two complementary tasks: 1) given a Swiss-Prot triplet, containing a protein, a GO (Gene Ontology) term and a relevant article, extraction of a short passage that justifies the GO category assignment; 2) given a Swiss-Prot pair, containing a protein and a relevant article, automatic assignment of a set of categories. Methods The sentence is the basic retrieval unit. Our classifier computes a distance between each sentence and the GO category provided with the Swiss-Prot entry. The Text Categorizer computes a distance between each GO term and the text of the article. Evaluations are reported both based on annotator judgements as established by the competition and based on mean average precision measures computed using a curated sample of Swiss-Prot. Results Our system achieved the best recall and precision combination both for passage retrieval and text categorization as evaluated by official evaluators. However, text categorization results were far below those in other data-poor text categorization experiments. The top proposed term is relevant in less than 20% of cases, whereas for categorization with other biomedical controlled vocabularies, such as the Medical Subject Headings, we achieved more than 90% precision. We also observe that the scoring methods used in our experiments, based on the retrieval status value of our engines, exhibit effective confidence estimation capabilities. Conclusion From a comparative perspective, the combination of retrieval and natural language processing methods we designed achieved very competitive performance. Largely data-independent, our systems were no less effective than data-intensive approaches. These results suggest that the overall strategy could benefit a large class of information extraction tasks, especially when training data are missing. However, from a user perspective, results were disappointing. Further investigations are needed to design applicable end-user text mining tools for biologists. PMID:15960836

  4. De Novo Assembly and Annotation of the Transcriptome of the Agricultural Weed Ipomoea purpurea Uncovers Gene Expression Changes Associated with Herbicide Resistance

    PubMed Central

    Leslie, Trent; Baucom, Regina S.

    2014-01-01

    Human-mediated selection can lead to rapid evolution in very short time scales, and the evolution of herbicide resistance in agricultural weeds is an excellent example of this phenomenon. The common morning glory, Ipomoea purpurea, is resistant to the herbicide glyphosate, but genetic investigations of this trait have been hampered by the lack of genomic resources for this species. Here, we present the annotated transcriptome of the common morning glory, Ipomoea purpurea, along with an examination of whole genome expression profiling to assess potential gene expression differences between three artificially selected herbicide resistant lines and three susceptible lines. The assembled Ipomoea transcriptome reported in this work contains 65,459 assembled transcripts, ~28,000 of which were functionally annotated by assignment to Gene Ontology categories. Our RNA-seq survey using this reference transcriptome identified 19 differentially expressed genes associated with resistance—one of which, a cytochrome P450, belongs to a large plant family of genes involved in xenobiotic detoxification. The differentially expressed genes also broadly implicated receptor-like kinases, which were down-regulated in the resistant lines, and other growth and defense genes, which were up-regulated in resistant lines. Interestingly, the target of glyphosate—EPSP synthase—was not overexpressed in the resistant Ipomoea lines as in other glyphosate resistant weeds. Overall, this work identifies potential candidate resistance loci for future investigations and dramatically increases genomic resources for this species. The assembled transcriptome presented herein will also provide a valuable resource to the Ipomoea community, as well as to those interested in utilizing the close relationship between the Convolvulaceae and the Solanaceae for phylogenetic and comparative genomics examinations. PMID:25155274

  5. Comprehensive investigation of parameter choice in viral integration site analysis and its effects on the gene annotations produced.

    PubMed

    Huston, Marshall W; Brugman, Martijn H; Horsman, Sebastiaan; Stubbs, Andrew; van der Spek, Peter; Wagemaker, Gerard

    2012-11-01

    Introducing therapeutic genes into hematopoietic stem cells using retroviral vector-mediated gene transfer is an effective treatment for monogenic diseases. The risks of therapeutic gene integration include aberrant expression of a neighboring gene, resulting in oncogenesis at low frequencies (10^-7 to 10^-6 per transduced cell). Mechanisms governing insertional mutagenesis are the subject of intensive ongoing studies that produce large amounts of sequencing data representing genomic regions flanking viral integration sites (IS). Validating and analyzing these data require automated bioinformatics applications. The exact methods used vary between applications, based on the requirements and preferences of the designer. The parameters used to analyze sequence data are capable of shaping the resulting integration site annotations, but a comprehensive examination of these effects is lacking. Here we present a web-based tool for integration site analysis, called Methods for Analyzing ViRal Integration Collections (MAVRIC), and use its highly customizable interface to look at how IS annotations can vary based on the analysis parameters. We used the integration data of the previously published adenosine deaminase severe combined immunodeficiency (ADA-SCID) gene therapy trials for evaluation of MAVRIC. The output illustrates how MAVRIC allows for direct multiparameter comparison of integration patterns. Careful analysis of the SCID data and reanalyses using different parameters for trimming, alignment, and repeat masking revealed the degree of variation that can be expected to arise due to changes in these parameters. We observed mainly small differences in annotation, with the largest effects caused by masking repeat sequences and by changing the size of the window around the IS. PMID:22909036

  6. Cloning, Annotation and Developmental Expression of the Chicken Intestinal MUC2 Gene

    PubMed Central

    Jiang, Zhengyu; Applegate, Todd J.; Lossie, Amy C.

    2013-01-01

    Intestinal mucin 2 (MUC2) encodes a heavily glycosylated, gel-forming mucin, which creates an important protective mucosal layer along the gastrointestinal tract in humans and other species. This first line of defense guards against attacks from microorganisms and is integral to the innate immune system. As a first step towards characterizing the innate immune response of MUC2 in different species, we report the cloning of a full-length, 11,359 bp chicken MUC2 cDNA, and describe the genomic organization and functional annotation of this complex, 74.5 kb locus. MUC2 contains 64 exons and demonstrates distinct spatiotemporal expression profiles throughout development in the gastrointestinal tract; expression increases with gestational age and from anterior to posterior along the gut. The chicken protein has a similar domain organization to the human orthologue, with a signal peptide and several von Willebrand domains in the N-terminus and the characteristic cystine knot at the C-terminus. The PTS domain of the chicken MUC2 protein spans ~1600 amino acids and is interspersed with four CysD motifs. However, the PTS domain in the chicken diverges significantly from the human orthologue; although the chicken domain is shorter, the repetitive unit is 69 amino acids in length, which is three times longer than the human. The amino acid composition shows very little similarity to the human motif, which potentially contributes to differences in the innate immune response between species, as glycosylation across this rapidly evolving domain provides much of the mucosal barrier. Future studies of the function of MUC2 in the innate immune response system in chicken could provide an important model organism to increase our understanding of the biological significance of MUC2 in host defense and highlight the potential of the chicken for creating new immune-based therapies. PMID:23349743

  7. MetaGeneAnnotator: detecting species-specific patterns of ribosomal binding site for precise gene prediction in anonymous prokaryotic and phage genomes.

    PubMed

    Noguchi, Hideki; Taniguchi, Takeaki; Itoh, Takehiko

    2008-12-01

    Recent advances in DNA sequencers are accelerating genome sequencing, especially in microbes, and complete and draft genomes from various species have been sequenced in rapid succession. Here, we present a comprehensive gene prediction tool, the MetaGeneAnnotator (MGA), which precisely predicts all kinds of prokaryotic genes from a single or a set of anonymous genomic sequences having a variety of lengths. The MGA integrates statistical models of prophage genes, in addition to those of bacterial and archaeal genes, and also uses a self-training model from input sequences for predictions. As a result, the MGA sensitively detects not only typical genes but also atypical genes, such as horizontally transferred and prophage genes in a prokaryotic genome. In this paper, we also propose a novel approach for analyzing the ribosomal binding site (RBS), which enables us to detect species-specific patterns of the RBSs. The MGA has the ingenious RBS model based on this approach, and precisely predicts translation starts of genes. The MGA also succeeds in improving prediction accuracies for short sequences by using the adapted RBS models (96% sensitivity and 93% specificity for 700 bp fragments). These features of the MGA expedite wide ranges of microbial genome studies, such as genome annotations and metagenome analyses. PMID:18940874

  8. Comparison of lists of genes based on functional profiles

    PubMed Central

    2011-01-01

    Background How to compare studies on the basis of their biological significance is a problem of central importance in high-throughput genomics. Many methods for performing such comparisons are based on the information in databases of functional annotation, such as those that form the Gene Ontology (GO). Typically, they consist of analyzing gene annotation frequencies in some pre-specified GO classes, in a class-by-class way, followed by p-value adjustment for multiple testing. Enrichment analysis, where a list of genes is compared against a wider universe of genes, is the most common example. Results A new global testing procedure and a method incorporating it are presented. Instead of testing separately for each GO class, a single global test for all classes under consideration is performed. The test is based on the distance between the functional profiles, defined as the joint frequencies of annotation in a given set of GO classes. These classes may be chosen at one or more GO levels. The new global test is more powerful and accurate with respect to type I errors than the usual class-by-class approach. When applied to some real datasets, the results suggest that the method may also provide useful information that complements the tests performed using a class-by-class approach if gene counts are sparse in some classes. An R library, goProfiles, implements these methods and is available from Bioconductor, http://bioconductor.org/packages/release/bioc/html/goProfiles.html. Conclusions The method provides an inferential basis for deciding whether two lists are functionally different. For global comparisons it is preferable to the global chi-square test of homogeneity. Furthermore, it may provide additional information if used in conjunction with class-by-class methods. PMID:21999355
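    The profile-distance idea in the abstract above can be illustrated with a minimal sketch. It is not the goProfiles implementation: the function names, the toy gene and class labels, the use of per-class (marginal) frequencies instead of the joint frequencies the abstract mentions, and the choice of squared Euclidean distance are all assumptions made for illustration. The sketch only computes the distance; it does not perform the global test.

```python
from collections import Counter


def functional_profile(gene_list, annotations, classes):
    """Fraction of genes in gene_list annotated to each class (hypothetical helper).

    annotations maps a gene ID to the set of classes it is annotated to.
    """
    counts = Counter()
    for gene in gene_list:
        for go_class in annotations.get(gene, ()):
            if go_class in classes:
                counts[go_class] += 1
    n = len(gene_list)
    return [counts[c] / n for c in classes]


def squared_euclidean(p, q):
    """Squared Euclidean distance between two equal-length profiles."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


# Toy data: gene and class labels are invented for illustration only.
annotations = {
    "g1": {"classA"},
    "g2": {"classA", "classB"},
    "g3": {"classB"},
    "g4": {"classB"},
}
classes = ["classA", "classB"]

profile_1 = functional_profile(["g1", "g2"], annotations, classes)
profile_2 = functional_profile(["g3", "g4"], annotations, classes)
distance = squared_euclidean(profile_1, profile_2)
print(profile_1, profile_2, distance)
```

    A large distance suggests that the two lists differ functionally; deciding whether it is significant requires the inferential procedure described in the paper.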

  9. Sequencing, De novo Assembly, Functional Annotation and Analysis of Phyllanthus amarus Leaf Transcriptome Using the Illumina Platform

    PubMed Central

    Bose Mazumdar, Aparupa; Chattopadhyay, Sharmila

    2016-01-01

    Phyllanthus amarus Schum. and Thonn., a widely distributed annual medicinal herb has a long history of use in the traditional system of medicine for over 2000 years. However, the lack of genomic data for P. amarus, a non-model organism hinders research at the molecular level. In the present study, high-throughput sequencing technology has been employed to enhance better understanding of this herb and provide comprehensive genomic information for future work. Here P. amarus leaf transcriptome was sequenced using the Illumina Miseq platform. We assembled 85,927 non-redundant (nr) “unitranscript” sequences with an average length of 1548 bp, from 18,060,997 raw reads. Sequence similarity analyses and annotation of these unitranscripts were performed against databases like green plants nr protein database, Gene Ontology (GO), Clusters of Orthologous Groups (COG), PlnTFDB, KEGG databases. As a result, 69,394 GO terms, 583 enzyme codes (EC), 134 KEGG maps, and 59 Transcription Factor (TF) families were generated. Functional and comparative analyses of assembled unitranscripts were also performed with the most closely related species like Populus trichocarpa and Ricinus communis using TRAPID. KEGG analysis showed that a number of assembled unitranscripts were involved in secondary metabolites, mainly phenylpropanoid, flavonoid, terpenoids, alkaloids, and lignan biosynthetic pathways that have significant medicinal attributes. Further, Fragments Per Kilobase of transcript per Million mapped reads (FPKM) values of the identified secondary metabolite pathway genes were determined and Reverse Transcription PCR (RT-PCR) of a few of these genes were performed to validate the de novo assembled leaf transcriptome dataset. In addition 65,273 simple sequence repeats (SSRs) were also identified. To the best of our knowledge, this is the first transcriptomic dataset of P. amarus till date. 
Our study provides the largest genetic resource that will lead to drug development and pave the way in deciphering various secondary metabolite biosynthetic pathways in P. amarus, especially those conferring the medicinal attributes of this potent herb. PMID:26858723

  10. Sequencing, De novo Assembly, Functional Annotation and Analysis of Phyllanthus amarus Leaf Transcriptome Using the Illumina Platform.

    PubMed

    Bose Mazumdar, Aparupa; Chattopadhyay, Sharmila

    2015-01-01

    Phyllanthus amarus Schum. and Thonn., a widely distributed annual medicinal herb has a long history of use in the traditional system of medicine for over 2000 years. However, the lack of genomic data for P. amarus, a non-model organism hinders research at the molecular level. In the present study, high-throughput sequencing technology has been employed to enhance better understanding of this herb and provide comprehensive genomic information for future work. Here P. amarus leaf transcriptome was sequenced using the Illumina Miseq platform. We assembled 85,927 non-redundant (nr) "unitranscript" sequences with an average length of 1548 bp, from 18,060,997 raw reads. Sequence similarity analyses and annotation of these unitranscripts were performed against databases like green plants nr protein database, Gene Ontology (GO), Clusters of Orthologous Groups (COG), PlnTFDB, KEGG databases. As a result, 69,394 GO terms, 583 enzyme codes (EC), 134 KEGG maps, and 59 Transcription Factor (TF) families were generated. Functional and comparative analyses of assembled unitranscripts were also performed with the most closely related species like Populus trichocarpa and Ricinus communis using TRAPID. KEGG analysis showed that a number of assembled unitranscripts were involved in secondary metabolites, mainly phenylpropanoid, flavonoid, terpenoids, alkaloids, and lignan biosynthetic pathways that have significant medicinal attributes. Further, Fragments Per Kilobase of transcript per Million mapped reads (FPKM) values of the identified secondary metabolite pathway genes were determined and Reverse Transcription PCR (RT-PCR) of a few of these genes were performed to validate the de novo assembled leaf transcriptome dataset. In addition 65,273 simple sequence repeats (SSRs) were also identified. To the best of our knowledge, this is the first transcriptomic dataset of P. amarus till date. 
Our study provides the largest genetic resource that will lead to drug development and pave the way in deciphering various secondary metabolite biosynthetic pathways in P. amarus, especially those conferring the medicinal attributes of this potent herb. PMID:26858723

  11. Leveraging Functional-Annotation Data in Trans-ethnic Fine-Mapping Studies

    PubMed Central

    Kichaev, Gleb; Pasaniuc, Bogdan

    2015-01-01

    Localization of causal variants underlying known risk loci is one of the main research challenges following genome-wide association studies. Risk loci are typically dissected through fine-mapping experiments in trans-ethnic cohorts for leveraging the variability in the local genetic structure across populations. More recent works have shown that genomic functional annotations (i.e., localization of tissue-specific regulatory marks) can be integrated for increasing fine-mapping performance within single-population studies. Here, we introduce methods that integrate the strength of association between genotype and phenotype, the variability in the genetic backgrounds across populations, and the genomic map of tissue-specific functional elements to increase trans-ethnic fine-mapping accuracy. Through extensive simulations and empirical data, we have demonstrated that our approach increases fine-mapping resolution over existing methods. We analyzed empirical data from a large-scale trans-ethnic rheumatoid arthritis (RA) study and showed that the functional genetic architecture of RA is consistent across European and Asian ancestries. In these data, we used our proposed methods to reduce the average size of the 90% credible set from 29 variants per locus for standard non-integrative approaches to 22 variants. PMID:26189819
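    The "90% credible set" reported above is conventionally the smallest set of variants whose posterior probabilities of being causal sum to at least 0.9. A minimal sketch of that construction follows; the function name, the variant IDs, and the posterior values are hypothetical, and the sketch does not reproduce how the authors compute the posteriors themselves.

```python
def credible_set(posteriors, level=0.9):
    """Return the smallest list of variants whose posterior mass reaches level.

    posteriors maps a variant ID to its posterior probability of being causal.
    Values are renormalised so that they sum to 1 within the locus.
    """
    total = sum(posteriors.values())
    ranked = sorted(posteriors.items(), key=lambda kv: kv[1], reverse=True)
    chosen, mass = [], 0.0
    for variant, p in ranked:
        chosen.append(variant)
        mass += p / total
        if mass >= level:
            break
    return chosen


# Invented posteriors for a single locus (illustration only).
example = {"rs1": 0.50, "rs2": 0.30, "rs3": 0.15, "rs4": 0.05}
print(credible_set(example))
```

    A method that sharpens the posteriors concentrates more mass on fewer variants, which is why a smaller average credible set indicates better fine-mapping resolution.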

  12. Heterologous expression of plasmodial proteins for structural studies and functional annotation

    PubMed Central

    Birkholtz, Lyn-Marie; Blatch, Gregory; Coetzer, Theresa L; Hoppe, Heinrich C; Human, Esmaré; Morris, Elizabeth J; Ngcete, Zoleka; Oldfield, Lyndon; Roth, Robyn; Shonhai, Addmore; Stephens, Linda; Louw, Abraham I

    2008-01-01

    Malaria remains the world's most devastating tropical infectious disease with as many as 40% of the world population living in risk areas. The widespread resistance of Plasmodium parasites to the cost-effective chloroquine and antifolates has forced the introduction of more costly drug combinations, such as Coartem®. In the absence of a vaccine in the foreseeable future, one strategy to address the growing malaria problem is to identify and characterize new and durable antimalarial drug targets, the majority of which are parasite proteins. Biochemical and structure-activity analysis of these proteins is ultimately essential in the characterization of such targets but requires large amounts of functional protein. Even though heterologous protein production has now become a relatively routine endeavour for most proteins of diverse origins, the functional expression of soluble plasmodial proteins is highly problematic and slows the progress of antimalarial drug target discovery. Here the status quo of heterologous production of plasmodial proteins is presented, constraints are highlighted and alternative strategies and hosts for functional expression and annotation of plasmodial proteins are reviewed. PMID:18828893

  13. Leveraging Functional-Annotation Data in Trans-ethnic Fine-Mapping Studies.

    PubMed

    Kichaev, Gleb; Pasaniuc, Bogdan

    2015-08-01

    Localization of causal variants underlying known risk loci is one of the main research challenges following genome-wide association studies. Risk loci are typically dissected through fine-mapping experiments in trans-ethnic cohorts for leveraging the variability in the local genetic structure across populations. More recent works have shown that genomic functional annotations (i.e., localization of tissue-specific regulatory marks) can be integrated for increasing fine-mapping performance within single-population studies. Here, we introduce methods that integrate the strength of association between genotype and phenotype, the variability in the genetic backgrounds across populations, and the genomic map of tissue-specific functional elements to increase trans-ethnic fine-mapping accuracy. Through extensive simulations and empirical data, we have demonstrated that our approach increases fine-mapping resolution over existing methods. We analyzed empirical data from a large-scale trans-ethnic rheumatoid arthritis (RA) study and showed that the functional genetic architecture of RA is consistent across European and Asian ancestries. In these data, we used our proposed methods to reduce the average size of the 90% credible set from 29 variants per locus for standard non-integrative approaches to 22 variants. PMID:26189819

  14. Homology modeling, comparative genomics and functional annotation of Mycoplasma genitalium hypothetical protein MG_237.

    PubMed

    Butt, Azeem Mehmood; Batool, Maria; Tong, Yigang

    2011-01-01

    Mycoplasma genitalium is a human pathogen associated with several sexually transmitted diseases. The complete genome of M. genitalium G37 has been sequenced and provides an opportunity to understand the pathogenesis and identification of therapeutic targets. However, complete understanding of bacterial function requires proper annotation of its proteins. The genome of M. genitalium consists of 475 proteins. Among these, 94 are without any known function and are described as 'hypothetical proteins'. We selected MG_237 for sequence and structural analysis using a bioinformatics approach. Primary and secondary structure analysis suggested that MG_237 is a hydrophilic protein containing a significant proportion of alpha helices, and subcellular localization predictions suggested it is a cytoplasmic protein. Homology modeling was used to define the three-dimensional (3D) structure of MG_237. A search for templates revealed that MG_237 shares 63% homology to a hypothetical protein of Mycoplasma pneumoniae, indicating this protein is evolutionarily conserved. The refined 3D model was generated using the (PS)2-v2 server, which incorporates MODELLER. Several quality assessment and validation parameters were computed and indicated that the homology model is reliable. Furthermore, comparative genomics analysis suggested that MG_237 is a non-homologous protein involved in four different metabolic pathways. Experimental validation will provide more insight into the actual function of this protein in microbial pathways. PMID:22355225

  15. Homology modeling and assigned functional annotation of an uncharacterized antitoxin protein from Streptomyces xinghaiensis

    PubMed Central

    Oany, Arafat Rahman; Ahmed, Md Shahabuddin; Jahan, Nasreen; Latif, Md Abdul; Mahmud, Shahin; Hossain, Md. Ahmed; Akter, Fatema; Rakib, Hasibul Haque; Islam, Md. Shariful

    2015-01-01

    Streptomyces xinghaiensis is a Gram-positive, aerobic and non-motile bacterium. The bacterial genome is known. Therefore, it is of interest to study the uncharacterized proteins in the genome. An uncharacterized protein (gi|518540893|86 residues) in the genome was selected for a comprehensive computational sequence-structure-function analysis using available data and tools. Subcellular localization of the targeted protein with conserved residues and assigned secondary structures is documented. Sequence homology search against the protein data bank (PDB) and non-redundant GenBank proteins using BLASTp showed different homologous proteins with known antitoxin function. A homology model of the target protein was developed using a known template (PDB ID: 3CTO:A) with 62% sequence similarity in HHpred after assessment using programs PROCHECK and QMEAN6. The predicted active site using CASTp is analyzed for assigned anti-toxin function. This information finds specific utility in annotating the said uncharacterized protein in the bacterial genome. PMID:26912949

  16. PHYLOGENOMICS - GUIDED VALIDATION OF FUNCTION FOR CONSERVED UNKNOWN GENES

    SciTech Connect

    DE CRECY-LAGARD, V; HANSON, A D

    2012-01-03

    Identifying functions for all gene products in all sequenced organisms is a central challenge of the post-genomic era. However, at least 30-50% of the proteins encoded by any given genome are of unknown function, or wrongly or vaguely annotated. Many of these 'unknown' proteins are common to prokaryotes and plants. We accordingly set out to predict and experimentally test the functions of such proteins. Our approach to functional prediction is integrative, coupling the extensive post-genomic resources available for plants with comparative genomics based on hundreds of microbial genomes, and functional genomic datasets from model microorganisms. The early phase is computer-assisted; later phases incorporate intellectual input from expert plant and microbial biochemists. The approach thus bridges the gap between automated homology-based annotations and the classical gene discovery efforts of experimentalists, and is much more powerful than purely computational approaches to identifying gene-function associations. Among Arabidopsis genes, we focused on those (2,325 in total) that (i) are unique or belong to families with no more than three members, (ii) are conserved between plants and prokaryotes, and (iii) have unknown or poorly known functions. Computer-assisted selection of promising targets for deeper analysis was based on homology-independent characteristics associated in the SEED database with the prokaryotic members of each family, specifically gene clustering and phyletic spread, as well as availability of functional genomics data, and publications that could link candidate families to general metabolic areas, or to specific functions. In-depth comparative genomic analysis was then performed for about 500 top candidate families, which connected ~55 of them to general areas of metabolism and led to specific functional predictions for a subset of ~25 more. Twenty predicted functions were experimentally tested in at least one prokaryotic organism via reverse genetics, metabolic profiling, functional complementation, and recombinant protein biochemistry. Our approach predicted and validated functions for 10 formerly uncharacterized protein families common to plants and prokaryotes; none of these functions had previously been correctly predicted by computational methods. The functions of five more are currently being validated. Experimental testing of diverse representatives of these families combined with in silico analysis allowed accurate projection of the annotations to hundreds more sequenced genomes.

  17. The Subsystems Approach to Genome Annotation and its Use in the Project to Annotate 1000 Genomes

    PubMed Central

    Overbeek, Ross; Begley, Tadhg; Butler, Ralph M.; Choudhuri, Jomuna V.; Chuang, Han-Yu; Cohoon, Matthew; de Crécy-Lagard, Valérie; Diaz, Naryttza; Disz, Terry; Edwards, Robert; Fonstein, Michael; Frank, Ed D.; Gerdes, Svetlana; Glass, Elizabeth M.; Goesmann, Alexander; Hanson, Andrew; Iwata-Reuyl, Dirk; Jensen, Roy; Jamshidi, Neema; Krause, Lutz; Kubal, Michael; Larsen, Niels; Linke, Burkhard; McHardy, Alice C.; Meyer, Folker; Neuweger, Heiko; Olsen, Gary; Olson, Robert; Osterman, Andrei; Portnoy, Vasiliy; Pusch, Gordon D.; Rodionov, Dmitry A.; Rückert, Christian; Steiner, Jason; Stevens, Rick; Thiele, Ines; Vassieva, Olga; Ye, Yuzhen; Zagnitko, Olga; Vonstein, Veronika

    2005-01-01

    The release of the 1000th complete microbial genome will occur in the next two to three years. In anticipation of this milestone, the Fellowship for Interpretation of Genomes (FIG) launched the Project to Annotate 1000 Genomes. The project is built around the principle that the key to improved accuracy in high-throughput annotation technology is to have experts annotate single subsystems over the complete collection of genomes, rather than having an annotation expert attempt to annotate all of the genes in a single genome. Using the subsystems approach, all of the genes implementing the subsystem are analyzed by an expert in that subsystem. An annotation environment was created where populated subsystems are curated and projected to new genomes. A portable notion of a populated subsystem was defined, and tools developed for exchanging and curating these objects. Tools were also developed to resolve conflicts between populated subsystems. The SEED is the first annotation environment that supports this model of annotation. Here, we describe the subsystem approach, and offer the first release of our growing library of populated subsystems. The initial release of data includes 180,177 distinct proteins with 2133 distinct functional roles. This data comes from 173 subsystems and 383 different organisms. PMID:16214803

  18. Protein Sequence Annotation Tool (PSAT): A centralized web-based meta-server for high-throughput sequence annotations

    DOE PAGESBeta

    Leung, Elo; Huang, Amy; Cadag, Eithon; Montana, Aldrin; Soliman, Jan Lorenz; Zhou, Carol L. Ecale

    2016-01-20

    In this study, we introduce the Protein Sequence Annotation Tool (PSAT), a web-based, sequence annotation meta-server for performing integrated, high-throughput, genome-wide sequence analyses. Our goals in building PSAT were to (1) create an extensible platform for integration of multiple sequence-based bioinformatics tools, (2) enable functional annotations and enzyme predictions over large input protein fasta data sets, and (3) provide a web interface for convenient execution of the tools. In this paper, we demonstrate the utility of PSAT by annotating the predicted peptide gene products of Herbaspirillum sp. strain RV1423, importing the results of PSAT into EC2KEGG, and using the resulting functional comparisons to identify a putative catabolic pathway, thereby distinguishing RV1423 from a well annotated Herbaspirillum species. This analysis demonstrates that high-throughput enzyme predictions, provided by PSAT processing, can be used to identify metabolic potential in an otherwise poorly annotated genome. Lastly, PSAT is a meta server that combines the results from several sequence-based annotation and function prediction codes, and is available at http://psat.llnl.gov/psat/. PSAT stands apart from other sequence-based genome annotation systems in providing a high-throughput platform for rapid de novo enzyme predictions and sequence annotations over large input protein sequence data sets in FASTA. PSAT is most appropriately applied in annotation of large protein FASTA sets that may or may not be associated with a single genome.

  19. BABELOMICS: a systems biology perspective in the functional annotation of genome-scale experiments

    PubMed Central

    Al-Shahrour, Fátima; Minguez, Pablo; Tárraga, Joaquín; Montaner, David; Alloza, Eva; Vaquerizas, Juan M.; Conde, Lucía; Blaschke, Christian; Vera, Javier; Dopazo, Joaquín

    2006-01-01

    We present a new version of Babelomics, a complete suite of web tools for functional analysis of genome-scale experiments, with new and improved tools. New functionally relevant terms have been included such as CisRed motifs or bioentities obtained by text-mining procedures. An improved indexing has considerably speeded up several of the modules. An improved version of the FatiScan method for studying the coordinate behaviour of groups of functionally related genes is presented, along with a similar tool, the Gene Set Enrichment Analysis. Babelomics is now more oriented to test systems biology inspired hypotheses. Babelomics can be found at . PMID:16845052

  20. BASys: a web server for automated bacterial genome annotation.

    PubMed

    Van Domselaar, Gary H; Stothard, Paul; Shrivastava, Savita; Cruz, Joseph A; Guo, AnChi; Dong, Xiaoli; Lu, Paul; Szafron, Duane; Greiner, Russ; Wishart, David S

    2005-07-01

    BASys (Bacterial Annotation System) is a web server that supports automated, in-depth annotation of bacterial genomic (chromosomal and plasmid) sequences. It accepts raw DNA sequence data and an optional list of gene identification information and provides extensive textual annotation and hyperlinked image output. BASys uses >30 programs to determine approximately 60 annotation subfields for each gene, including gene/protein name, GO function, COG function, possible paralogues and orthologues, molecular weight, isoelectric point, operon structure, subcellular localization, signal peptides, transmembrane regions, secondary structure, 3D structure, reactions and pathways. The depth and detail of a BASys annotation matches or exceeds that found in a standard SwissProt entry. BASys also generates colorful, clickable and fully zoomable maps of each query chromosome to permit rapid navigation and detailed visual analysis of all resulting gene annotations. The textual annotations and images that are provided by BASys can be generated in approximately 24 h for an average bacterial chromosome (5 Mb). BASys annotations may be viewed and downloaded anonymously or through a password protected access system. The BASys server and databases can also be downloaded and run locally. BASys is accessible at http://wishart.biology.ualberta.ca/basys. PMID:15980511

  1. BASys: a web server for automated bacterial genome annotation

    PubMed Central

    Van Domselaar, Gary H.; Stothard, Paul; Shrivastava, Savita; Cruz, Joseph A.; Guo, AnChi; Dong, Xiaoli; Lu, Paul; Szafron, Duane; Greiner, Russ; Wishart, David S.

    2005-01-01

    BASys (Bacterial Annotation System) is a web server that supports automated, in-depth annotation of bacterial genomic (chromosomal and plasmid) sequences. It accepts raw DNA sequence data and an optional list of gene identification information and provides extensive textual annotation and hyperlinked image output. BASys uses >30 programs to determine ~60 annotation subfields for each gene, including gene/protein name, GO function, COG function, possible paralogues and orthologues, molecular weight, isoelectric point, operon structure, subcellular localization, signal peptides, transmembrane regions, secondary structure, 3D structure, reactions and pathways. The depth and detail of a BASys annotation matches or exceeds that found in a standard SwissProt entry. BASys also generates colorful, clickable and fully zoomable maps of each query chromosome to permit rapid navigation and detailed visual analysis of all resulting gene annotations. The textual annotations and images that are provided by BASys can be generated in ~24 h for an average bacterial chromosome (5 Mb). BASys annotations may be viewed and downloaded anonymously or through a password protected access system. The BASys server and databases can also be downloaded and run locally. BASys is accessible at . PMID:15980511

  2. Reconstruction of signaling network from protein interactions based on function annotations.

    PubMed

    Liu, Wei; Li, Dong; Zhu, Yunping; Xie, Hongwei; He, Fuchu

    2013-01-01

    The directionality of protein interactions is the prerequisite for forming various signaling networks, and the construction of signaling networks is a critical issue in discovering the mechanisms of life processes. In this paper, we proposed a novel method to infer the directionality in protein-protein interaction networks and furthermore construct signaling networks. Based on the functional annotations of proteins, we proposed a novel parameter, GODS, and established the prediction model. This method shows high sensitivity and specificity in predicting the directionality of protein interactions, as evaluated by fivefold cross validation. By taking the threshold value of GODS as 2, we achieved an accuracy of 95.56 percent and a coverage of 74.69 percent in the human test set. Also, this method was successfully applied to reconstruct the classical signaling pathways in human. This study not only provided an effective method to unravel unknown signaling pathways, but also a deeper understanding of signaling networks from the aspect of protein function. PMID:23929874

  3. In silico identification and functional annotation of yeast E3 ubiquitin ligase Rsp5 substrates.

    PubMed

    Song, Xiaofeng; Hu, Lizhen; Han, Ping; Guo, Xuejiang; Sha, Jiahao

    2015-01-01

    Rsp5, an E3 ligase conserved from yeast to mammals, plays a key role in diverse processes in yeast. However, many Rsp5 substrates are still unknown. Therefore we proposed an in silico method to recognise new substrates of Rsp5. To investigate the molecular determinants that affect the interaction between Rsp5 and its substrates, we have systematically analysed many features that may correlate with Rsp5 substrate recognition. It is found that the PPxY motif, transmembrane region, disorder region and N-linked glycosylation modification are the most important features for substrate recognition. We have constructed an SVM-based classifier to recognise Rsp5 substrates, obtaining on average 81.5% sensitivity and 74.1% specificity on ten independent testing datasets. We also applied the model to the whole yeast proteome, and identified ~66 new Rsp5 substrates. Functional annotation reveals that half of these novel substrates function in Rsp5-involved cell processes as Rsp5-interacting proteins. PMID:26547982

  4. Annotation of Protein Domains Reveals Remarkable Conservation in the Functional Make up of Proteomes Across Superkingdoms

    PubMed Central

    Nasir, Arshan; Naeem, Aisha; Khan, Muhammad Jawad; Lopez-Nicora, Horacio D.; Caetano-Anollés, Gustavo

    2011-01-01

    The functional repertoire of a cell is largely embodied in its proteome, the collection of proteins encoded in the genome of an organism. The molecular functions of proteins are the direct consequence of their structure and structure can be inferred from sequence using hidden Markov models of structural recognition. Here we analyze the functional annotation of protein domain structures in almost a thousand sequenced genomes, exploring the functional and structural diversity of proteomes. We find there is a remarkable conservation in the distribution of domains with respect to the molecular functions they perform in the three superkingdoms of life. In general, most of the protein repertoire is spent in functions related to metabolic processes but there are significant differences in the usage of domains for regulatory and extra-cellular processes both within and between superkingdoms. Our results support the hypotheses that the proteomes of superkingdom Eukarya evolved via genome expansion mechanisms that were directed towards innovating new domain architectures for regulatory and extra/intracellular process functions needed for example to maintain the integrity of multicellular structure or to interact with environmental biotic and abiotic factors (e.g., cell signaling and adhesion, immune responses, and toxin production). Proteomes of microbial superkingdoms Archaea and Bacteria retained fewer numbers of domains and maintained simple and smaller protein repertoires. Viruses appear to play an important role in the evolution of superkingdoms. We finally identify few genomic outliers that deviate significantly from the conserved functional design. These include Nanoarchaeum equitans, proteobacterial symbionts of insects with extremely reduced genomes, Tenericutes and Guillardia theta. 
These organisms spend most of their domains on information functions, including translation and transcription, rather than on metabolism and harbor a domain repertoire characteristic of parasitic organisms. In contrast, the functional repertoire of the proteomes of the Planctomycetes-Verrucomicrobia-Chlamydiae superphylum was no different than the rest of bacteria, failing to support claims of them representing a separate superkingdom. In turn, Protista and Bacteria shared similar functional distribution patterns suggesting an ancestral evolutionary link between these groups. PMID:24710297

  5. Annotation of Protein Domains Reveals Remarkable Conservation in the Functional Make up of Proteomes Across Superkingdoms.

    PubMed

    Nasir, Arshan; Naeem, Aisha; Khan, Muhammad Jawad; Nicora, Horacio D Lopez; Caetano-Anollés, Gustavo

    2011-01-01

    The functional repertoire of a cell is largely embodied in its proteome, the collection of proteins encoded in the genome of an organism. The molecular functions of proteins are the direct consequence of their structure, and structure can be inferred from sequence using hidden Markov models of structural recognition. Here we analyze the functional annotation of protein domain structures in almost a thousand sequenced genomes, exploring the functional and structural diversity of proteomes. We find there is a remarkable conservation in the distribution of domains with respect to the molecular functions they perform in the three superkingdoms of life. In general, most of the protein repertoire is spent in functions related to metabolic processes, but there are significant differences in the usage of domains for regulatory and extra-cellular processes both within and between superkingdoms. Our results support the hypotheses that the proteomes of superkingdom Eukarya evolved via genome expansion mechanisms that were directed towards innovating new domain architectures for regulatory and extra/intracellular process functions needed, for example, to maintain the integrity of multicellular structure or to interact with environmental biotic and abiotic factors (e.g., cell signaling and adhesion, immune responses, and toxin production). Proteomes of microbial superkingdoms Archaea and Bacteria retained fewer domains and maintained simpler and smaller protein repertoires. Viruses appear to play an important role in the evolution of superkingdoms. Finally, we identify a few genomic outliers that deviate significantly from the conserved functional design. These include Nanoarchaeum equitans, proteobacterial symbionts of insects with extremely reduced genomes, Tenericutes and Guillardia theta. These organisms spend most of their domains on information functions, including translation and transcription, rather than on metabolism and harbor a domain repertoire characteristic of parasitic organisms. In contrast, the functional repertoire of the proteomes of the Planctomycetes-Verrucomicrobia-Chlamydiae superphylum was no different from that of the rest of bacteria, failing to support claims of them representing a separate superkingdom. In turn, Protista and Bacteria shared similar functional distribution patterns suggesting an ancestral evolutionary link between these groups. PMID:24710297

  6. Oncotator: cancer variant annotation tool.

    PubMed

    Ramos, Alex H; Lichtenstein, Lee; Gupta, Manaswi; Lawrence, Michael S; Pugh, Trevor J; Saksena, Gordon; Meyerson, Matthew; Getz, Gad

    2015-04-01

    Oncotator is a tool for annotating genomic point mutations and short nucleotide insertions/deletions (indels) with variant- and gene-centric information relevant to cancer researchers. This information is drawn from 14 different publicly available resources that have been pooled and indexed, and we provide an extensible framework for adding further data sources. Annotations linked to variants range from basic information, such as gene names and functional classification (e.g. missense), to cancer-specific data from resources such as the Catalogue of Somatic Mutations in Cancer (COSMIC), the Cancer Gene Census, and The Cancer Genome Atlas (TCGA). For local use, Oncotator is freely available as a Python module hosted on GitHub (https://github.com/broadinstitute/oncotator). Oncotator is also available as a web service and web application at http://www.broadinstitute.org/oncotator/. PMID:25703262

  7. GeneViTo: Visualizing gene-product functional and structural features in genomic datasets

    PubMed Central

    Vernikos, Georgios S; Gkogkas, Christos G; Promponas, Vasilis J; Hamodrakas, Stavros J

    2003-01-01

    Background: The availability of increasing amounts of sequence data from completely sequenced genomes boosts the development of new computational methods for automated genome annotation and comparative genomics. Therefore, there is a need for tools that facilitate the visualization of raw data and results produced by bioinformatics analysis, providing new means for interactive genome exploration. Visual inspection can be used as a basis to assess the quality of various analysis algorithms and to aid in-depth genomic studies. Results: GeneViTo is a Java-based computer application that serves as a workbench for genome-wide analysis through visual interaction. The application deals with various kinds of experimental information concerning both DNA and protein sequences (derived from public sequence databases or proprietary data sources) and meta-data obtained by various prediction algorithms, classification schemes or user-defined features. Interaction with a Graphical User Interface (GUI) allows easy extraction of genomic and proteomic data referring to the sequence itself, sequence features, or general structural and functional features. Emphasis is laid on the potential comparison between annotation and prediction data in order to offer a supplement to the provided information, especially in cases of "poor" annotation, or an evaluation of available predictions. Moreover, desired information can be output in high quality JPEG image files for further elaboration and scientific use. A compilation of properly formatted GeneViTo input data for demonstration is available to interested readers for two completely sequenced prokaryotes, Chlamydia trachomatis and Methanococcus jannaschii. Conclusions: GeneViTo offers an inspectional view of genomic functional elements, concerning data stemming both from database annotation and analysis tools for an overall analysis of existing genomes. The application is compatible with Linux or Windows ME-2000-XP operating systems, provided that the appropriate Java Runtime Environment is already installed in the system. PMID:14594459

  8. Automated update, revision, and quality control of the maize genome annotations using MAKER-P improves the B73 RefGen_v3 gene models and identifies new genes

    Technology Transfer Automated Retrieval System (TEKTRAN)

    The large size and relative complexity of many plant genomes make creation, quality control, and dissemination of high-quality gene structure annotations challenging. In response, we have developed MAKER-P, a fast and easy-to-use genome annotation engine for plants. Here, we report the use of MAKER-...

  9. dbNSFP v3.0: A One-Stop Database of Functional Predictions and Annotations for Human Nonsynonymous and Splice-Site SNVs.

    PubMed

    Liu, Xiaoming; Wu, Chunlei; Li, Chang; Boerwinkle, Eric

    2016-03-01

    The purpose of dbNSFP is to provide a one-stop resource for functional predictions and annotations for human nonsynonymous single-nucleotide variants (nsSNVs) and splice-site variants (ssSNVs), and to facilitate the steps of filtering and prioritizing SNVs from a large list of SNVs discovered in an exome-sequencing study. A list of all potential nsSNVs and ssSNVs based on the human reference sequence was created, and functional predictions and annotations were curated and compiled for each SNV. Here, we report a recent major update of the database to version 3.0. The SNV list has been rebuilt based on GENCODE 22 and currently the database includes 82,832,027 nsSNVs and ssSNVs. An attached database, dbscSNV, which compiles all potential human SNVs within splicing consensus regions and their deleteriousness predictions, adds another 15,030,459 potentially functional SNVs. Eleven prediction scores (MetaSVM, MetaLR, CADD, VEST3, PROVEAN, 4× fitCons, fathmm-MKL, and DANN) and allele frequencies from the UK10K cohorts and the Exome Aggregation Consortium (ExAC), among others, have been added. The original seven prediction scores in v2.0 (SIFT, 2× Polyphen2, LRT, MutationTaster, MutationAssessor, and FATHMM) as well as many SNV and gene functional annotations have been updated. dbNSFP v3.0 is freely available at http://sites.google.com/site/jpopgen/dbNSFP. PMID:26555599

  10. De Novo Assembly and Functional Annotation of the Olive (Olea europaea) Transcriptome

    PubMed Central

    Muñoz-Mérida, Antonio; González-Plaza, Juan José; Cañada, Andrés; Blanco, Ana María; García-López, Maria del Carmen; Rodríguez, José Manuel; Pedrola, Laia; Sicardo, M. Dolores; Hernández, M. Luisa; De la Rosa, Raúl; Belaj, Angjelina; Gil-Borja, Mayte; Luque, Francisco; Martínez-Rivas, José Manuel; Pisano, David G.; Trelles, Oswaldo; Valpuesta, Victoriano; Beuzón, Carmen R.

    2013-01-01

    Olive breeding programmes are focused on selecting for traits such as a short juvenile period, plant architecture suited for mechanical harvest, or oil characteristics, including fatty acid composition, phenolic, and volatile compounds to suit new markets. Understanding the molecular basis of these characteristics and improving the efficiency of such breeding programmes require the development of genomic information and tools. However, despite its economic relevance, genomic information on olive or closely related species is still scarce. We have applied Sanger and 454 pyrosequencing technologies to generate close to 2 million reads from 12 cDNA libraries obtained from the Picual, Arbequina, and Lechin de Sevilla cultivars and seedlings from a segregating progeny of a Picual × Arbequina cross. The libraries include fruit mesocarp and seeds at three relevant developmental stages, young stems and leaves, active juvenile and adult buds as well as dormant buds, and juvenile and adult roots. The reads were assembled by library or tissue and then assembled together into 81 020 unigenes with an average size of 496 bases. Here, we report their assembly and their functional annotation. PMID:23297299

  11. GenoQuery: a new querying module for functional annotation in a genomic warehouse

    PubMed Central

    Lemoine, Frédéric; Labedan, Bernard; Froidevaux, Christine

    2008-01-01

    Motivation: We have to cope with both a deluge of new genome sequences and a huge amount of data produced by high-throughput approaches used to exploit these genomic features. Crossing and comparing such heterogeneous and disparate data will help improve functional annotation of genomes. This requires designing elaborate integration systems such as warehouses for storing and querying these data. Results: We have designed a relational genomic warehouse with an original multi-layer architecture made of a databases layer and an entities layer. We describe a new querying module, GenoQuery, which is based on this architecture. We use the entities layer to define mixed queries. These mixed queries allow searching for instances of biological entities and their properties in the different databases, without specifying in which database they should be found. Accordingly, we further introduce the central notion of alternative queries. Such queries have the same meaning as the original mixed queries, while exploiting complementarities yielded by the various integrated databases of the warehouse. We explain how GenoQuery computes all the alternative queries of a given mixed query. We illustrate how useful this querying module is by means of a thorough example. Availability: http://www.lri.fr/~lemoine/GenoQuery/ Contact: chris@lri.fr, lemoine@lri.fr PMID:18586731

  12. Automated annotation of functional imaging experiments via multi-label classification

    PubMed Central

    Turner, Matthew D.; Chakrabarti, Chayan; Jones, Thomas B.; Xu, Jiawei F.; Fox, Peter T.; Luger, George F.; Laird, Angela R.; Turner, Jessica A.

    2013-01-01

    Identifying the experimental methods in human neuroimaging papers is important for grouping meaningfully similar experiments for meta-analyses. Currently, this can only be done by human readers. We present the performance of common machine learning (text mining) methods applied to the problem of automatically classifying or labeling this literature. Labeling terms are from the Cognitive Paradigm Ontology (CogPO), the text corpora are abstracts of published functional neuroimaging papers, and the methods use the performance of a human expert as training data. We aim to replicate the expert's annotation of multiple labels per abstract identifying the experimental stimuli, cognitive paradigms, response types, and other relevant dimensions of the experiments. We use several standard machine learning methods: naive Bayes (NB), k-nearest neighbor, and support vector machines (specifically SMO or sequential minimal optimization). Exact match performance ranged from only 15% in the worst cases to 78% in the best cases. NB methods combined with binary relevance transformations performed strongly and were robust to overfitting. This collection of results demonstrates what can be achieved with off-the-shelf software components and little to no pre-processing of raw text. PMID:24409112
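
    The binary relevance transformation and the exact match score mentioned in this abstract can be illustrated with a small sketch. Everything below is an assumption made for illustration: the toy texts and labels are invented (they are not CogPO terms or the authors' corpus), and the classifier is a minimal multinomial naive Bayes with add-one smoothing written from scratch, not the software used in the study. Binary relevance trains one independent yes/no classifier per label; exact match counts a prediction as correct only when the whole predicted label set equals the expert's label set.

```python
import math
from collections import Counter


def train_nb(docs, labels):
    """Train one binary multinomial naive Bayes model (add-one smoothing)."""
    vocab = {word for doc in docs for word in doc.split()}
    counts = {True: Counter(), False: Counter()}
    n_docs = Counter(labels)
    for doc, y in zip(docs, labels):
        counts[y].update(doc.split())
    return vocab, counts, n_docs


def predict_nb(model, doc):
    """Return True if the 'label applies' class has the higher log score."""
    vocab, counts, n_docs = model
    total = sum(n_docs.values())
    best, best_score = False, -math.inf
    for y in (True, False):
        score = math.log((n_docs[y] + 1) / (total + 2))  # smoothed prior
        denom = sum(counts[y].values()) + len(vocab)
        for word in doc.split():
            if word in vocab:  # ignore words never seen in training
                score += math.log((counts[y][word] + 1) / denom)
        if score > best_score:
            best, best_score = y, score
    return best


def train_binary_relevance(docs, label_sets):
    """Binary relevance: one independent binary classifier per label."""
    all_labels = sorted(set().union(*label_sets))
    return {
        label: train_nb(docs, [label in s for s in label_sets])
        for label in all_labels
    }


def predict_labels(models, doc):
    """Predicted label set = every label whose classifier says yes."""
    return {label for label, model in models.items() if predict_nb(model, doc)}


def exact_match(predicted_sets, true_sets):
    """Fraction of documents whose predicted label set is exactly right."""
    hits = sum(p == t for p, t in zip(predicted_sets, true_sets))
    return hits / len(true_sets)


# Hypothetical toy data: three "abstracts" and their label sets.
docs = ["faces pictures visual", "tones auditory sounds", "faces tones visual auditory"]
label_sets = [{"visual"}, {"auditory"}, {"visual", "auditory"}]
models = train_binary_relevance(docs, label_sets)
print(predict_labels(models, "faces visual"))
```

    Exact match is a strict criterion: an abstract with three of four labels right still scores zero, which is one reason the reported range is so wide.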

  13. CSCdb: a cancer stem cells portal for markers, related genes and functional information.

    PubMed

    Shen, Yi; Yao, Heming; Li, Ao; Wang, Minghui

    2016-01-01

    Cancer stem cells (CSCs), which have the ability to self-renew and differentiate into various tumor cell types, are a special class of tumor cells. Characterizing the genes involved in CSCs regulation is fundamental to understanding the mechanisms underlying the biological process and developing treatment methods for tumor therapy. Recently, much effort has been expended in the study of CSCs and a large amount of data has been generated. However, to the best of our knowledge, no database dedicated to CSCs has been available until now. We have thus developed a CSCs database (CSCdb), which includes marker genes, CSCs-related genes/microRNAs and functional annotations. The information in the CSCdb was manually collected from about 13 000 articles. The CSCdb provides detailed information of 1769 genes that have been reported to participate in the functional regulation of CSCs and 74 marker genes that can be used for identification or isolation of CSCs. The CSCdb also provides 9475 annotations about 13 CSCs-related functions, such as oncogenesis, radio resistance, tumorigenesis, differentiation, etc. Annotations of the identified genes, which include protein function description, post-transcription modification information, related literature, Gene Ontology (GO), protein-protein interaction (PPI) information and regulatory relationships, are integrated into the CSCdb to help users get information more easily. CSCdb provides a comprehensive resource for CSCs research work, which would assist in finding new CSCs-related genes and would be a useful tool for biologists. Database URL: http://bioinformatics.ustc.edu.cn/cscdb. PMID:26989154

  14. CSCdb: a cancer stem cells portal for markers, related genes and functional information

    PubMed Central

    Shen, Yi; Yao, Heming; Wang, Minghui

    2016-01-01

    Cancer stem cells (CSCs), which have the ability to self-renew and differentiate into various tumor cell types, are a special class of tumor cells. Characterizing the genes involved in CSCs regulation is fundamental to understanding the mechanisms underlying the biological process and developing treatment methods for tumor therapy. Recently, much effort has been expended in the study of CSCs and a large amount of data has been generated. However, to the best of our knowledge, no database dedicated to CSCs has been available until now. We have thus developed a CSCs database (CSCdb), which includes marker genes, CSCs-related genes/microRNAs and functional annotations. The information in the CSCdb was manually collected from about 13 000 articles. The CSCdb provides detailed information of 1769 genes that have been reported to participate in the functional regulation of CSCs and 74 marker genes that can be used for identification or isolation of CSCs. The CSCdb also provides 9475 annotations about 13 CSCs-related functions, such as oncogenesis, radio resistance, tumorigenesis, differentiation, etc. Annotations of the identified genes, which include protein function description, post-transcription modification information, related literature, Gene Ontology (GO), protein-protein interaction (PPI) information and regulatory relationships, are integrated into the CSCdb to help users get information more easily. CSCdb provides a comprehensive resource for CSCs research work, which would assist in finding new CSCs-related genes and would be a useful tool for biologists. Database URL: http://bioinformatics.ustc.edu.cn/cscdb PMID:26989154

  15. An Introduction to Genome Annotation.

    PubMed

    Campbell, Michael S; Yandell, Mark

    2015-01-01

    Genome projects have evolved from large international undertakings to tractable endeavors for a single lab. Accurate genome annotation is critical for successful genomic, genetic, and molecular biology experiments. These annotations can be generated using a number of approaches and available software tools. This unit describes methods for genome annotation and a number of software tools commonly used in gene annotation. © 2015 by John Wiley & Sons, Inc. PMID:26678385

  16. Characterization of transcriptome dynamics during watermelon fruit development: sequencing, assembly, annotation and gene expression profiles

    PubMed Central

    2011-01-01

    Background: Cultivated watermelon [Citrullus lanatus (Thunb.) Matsum. & Nakai var. lanatus] is an important agricultural crop worldwide. The fruit of watermelon undergoes distinct stages of development with dramatic changes in its size, color, sweetness, texture and aroma. In order to better understand the genetic and molecular basis of these changes and significantly expand the watermelon transcript catalog, we have selected four critical stages of watermelon fruit development and used Roche/454 next-generation sequencing technology to generate a large expressed sequence tag (EST) dataset and a comprehensive transcriptome profile for watermelon fruit flesh tissues. Results: We performed half a Roche/454 GS-FLX run for each of the four watermelon fruit developmental stages (immature white, white-pink flesh, red flesh and over-ripe) and obtained 577,023 high quality ESTs with an average length of 302.8 bp. De novo assembly of these ESTs together with 11,786 watermelon ESTs collected from GenBank produced 75,068 unigenes with a total length of approximately 31.8 Mb. Overall 54.9% of the unigenes showed significant similarities to known sequences in the GenBank non-redundant (nr) protein database and around two-thirds of them matched proteins of cucumber, the most closely-related species with a sequenced genome. The unigenes were further assigned gene ontology (GO) terms and mapped to biochemical pathways. More than 5,000 SSRs were identified from the EST collection. Furthermore, we carried out digital gene expression analysis of these ESTs and identified 3,023 genes that were differentially expressed during watermelon fruit development and ripening, which provided novel insights into watermelon fruit biology and a comprehensive resource of candidate genes for future functional analysis. We then generated profiles of several interesting metabolites that are important to fruit quality including pigmentation and sweetness. Integrative analysis of metabolite and digital gene expression profiles helped elucidate molecular mechanisms governing these important quality-related traits during watermelon fruit development. Conclusion: We have generated a large collection of watermelon ESTs, which represents a significant expansion of the current transcript catalog of watermelon and a valuable resource for future studies on the genomics of watermelon and other closely-related species. Digital expression analysis of this EST collection allowed us to identify a large set of genes that were differentially expressed during watermelon fruit development and ripening, which provide a rich source of candidates for future functional analysis and represent a valuable increase in our knowledge base of watermelon fruit biology. PMID:21936920

  17. The DOE-JGI Standard Operating Procedure for the Annotations of the Microbial Genomes

    SciTech Connect

    Mavromatis, Konstantinos; Ivanova, Natalia; Chen, I-Min A.; Szeto, Ernest; Markowitz, Victor; Kyrpides, Nikos C.

    2009-05-20

    The DOE-JGI Microbial Annotation Pipeline (DOE-JGI MAP) supports gene prediction and/or functional annotation of microbial genomes towards comparative analysis with the Integrated Microbial Genome (IMG) system. DOE-JGI MAP annotation is applied on nucleotide sequence datasets included in the IMG-ER (Expert Review) version of IMG via the IMG ER submission site. Users can submit the sequence datasets consisting of one or more contigs in a multi-fasta file. DOE-JGI MAP annotation includes prediction of protein coding and RNA genes, as well as repeats and assignment of product names to these genes.

  18. A transcriptomic analysis of striped catfish (Pangasianodon hypophthalmus) in response to salinity adaptation: De novo assembly, gene annotation and marker discovery.

    PubMed

    Thanh, Nguyen Minh; Jung, Hyungtaek; Lyons, Russell E; Chand, Vincent; Tuan, Nguyen Viet; Thu, Vo Thi Minh; Mather, Peter

    2014-06-01

    The striped catfish (Pangasianodon hypophthalmus) culture industry in the Mekong Delta in Vietnam has developed rapidly over the past decade. The culture industry now, however, faces some significant challenges, especially related to climate change impacts, notably from predicted extensive saltwater intrusion into many low topographical coastal provinces across the Mekong Delta. This problem highlights a need for development of culture stocks that can tolerate more saline culture environments as a response to expansion of saline water-intruded land. While a traditional artificial selection program can potentially address this need, understanding the genomic basis of salinity tolerance can assist development of more productive culture lines. The current study applied a transcriptomic approach using Ion PGM technology to generate expressed sequence tag (EST) resources from the intestine and swim bladder from striped catfish reared at a salinity level of 9 ppt, which showed the best growth performance. Total sequence data generated was 467.8 Mbp, consisting of 4,116,424 reads with an average length of 112 bp. De novo assembly was employed that generated 51,188 contigs, and allowed identification of 16,116 putative genes based on the GenBank non-redundant database. GO annotation, KEGG pathway mapping, and functional annotation of the EST sequences recovered a wide diversity of biological functions and processes. In addition, more than 11,600 simple sequence repeats were also detected. This is the first comprehensive analysis of a striped catfish transcriptome, and provides a valuable genomic resource for future selective breeding programs and functional or evolutionary studies of genes that influence salinity tolerance in this important culture species. PMID:24841517

  19. A novel method to quantify gene set functional association based on gene ontology

    PubMed Central

    Lv, Sali; Li, Yan; Wang, Qianghu; Ning, Shangwei; Huang, Teng; Wang, Peng; Sun, Jie; Zheng, Yan; Liu, Weisha; Ai, Jing; Li, Xia

    2012-01-01

    Numerous gene sets have been used as molecular signatures for exploring the genetic basis of complex disorders. These gene sets are distinct but related to each other in many cases; therefore, efforts have been made to compare gene sets for studies such as those evaluating the reproducibility of different experiments. Comparison in terms of biological function has been demonstrated to be helpful to biologists. We improved the measurement of semantic similarity to quantify the functional association between gene sets in the context of gene ontology and developed a web toolkit named Gene Set Functional Similarity (GSFS; http://bioinfo.hrbmu.edu.cn/GSFS). Validation based on protein complexes for which the functional associations are known demonstrated that the GSFS scores tend to be correlated with sequence similarity scores and that complexes with high GSFS scores tend to be involved in the same functional catalogue. Compared with the pairwise method and the annotation method, the GSFS shows better discrimination and more accurately reflects the known functional catalogues shared between complexes. Case studies comparing differentially expressed genes of prostate tumour samples from different microarray platforms and identifying coronary heart disease susceptibility pathways revealed that the method could contribute to future studies exploring the molecular basis of complex disorders. PMID:21998111

  20. Proteomics for Validation of Automated Gene Model Predictions

    SciTech Connect

    Zhou, Kemin; Panisko, Ellen A.; Magnuson, Jon K.; Baker, Scott E.; Grigoriev, Igor V.

    2008-02-14

    High-throughput liquid chromatography mass spectrometry (LC-MS)-based proteomic analysis has emerged as a powerful tool for functional annotation of genome sequences. These analyses complement the bioinformatic and experimental tools used for deriving, verifying, and functionally annotating models of genes and their transcripts. Furthermore, proteomics extends verification and functional annotation to the level of the translation product of the gene model.

  1. The Disease Portals, disease-gene annotation and the RGD disease ontology at the Rat Genome Database.

    PubMed

    Hayman, G Thomas; Laulederkind, Stanley J F; Smith, Jennifer R; Wang, Shur-Jen; Petri, Victoria; Nigam, Rajni; Tutaj, Marek; De Pons, Jeff; Dwinell, Melinda R; Shimoyama, Mary

    2016-01-01

    The Rat Genome Database (RGD; http://rgd.mcw.edu/) provides critical datasets and software tools to a diverse community of rat and non-rat researchers worldwide. To meet the needs of the many users whose research is disease oriented, RGD has created a series of Disease Portals and has prioritized its curation efforts on the datasets important to understanding the mechanisms of various diseases. Gene-disease relationships for three species, rat, human and mouse, are annotated to capture biomarkers, genetic associations, molecular mechanisms and therapeutic targets. To generate gene-disease annotations more effectively and in greater detail, RGD initially adopted the MEDIC disease vocabulary from the Comparative Toxicogenomics Database and adapted it for use by expanding this framework with the addition of over 1000 terms to create the RGD Disease Ontology (RDO). The RDO provides the foundation for, at present, 10 comprehensive disease area-related dataset and analysis platforms at RGD, the Disease Portals. Two major disease areas are the focus of data acquisition and curation efforts each year, leading to the release of the related Disease Portals. Collaborative efforts to realize a more robust disease ontology are underway. Database URL: http://rgd.mcw.edu. PMID:27009807

  2. Improving the Annotation of Arabidopsis lyrata Using RNA-Seq Data

    PubMed Central

    Rawat, Vimal; Abdelsamad, Ahmed; Pietzenuk, Björn; Seymour, Danelle K.; Koenig, Daniel; Weigel, Detlef; Pecinka, Ales; Schneeberger, Korbinian

    2015-01-01

    Gene model annotations are important community resources that ensure comparability and reproducibility of analyses and are typically the first step for functional annotation of genomic regions. Without up-to-date genome annotations, genome sequences cannot be used to maximum advantage. It is therefore essential to regularly update gene annotations by integrating the latest information to guarantee that reference annotations can remain a common basis for various types of analyses. Here, we report an improvement of the Arabidopsis lyrata gene annotation using extensive RNA-seq data. This new annotation consists of 31,132 protein coding gene models in addition to 2,089 genes with high similarity to transposable elements. Overall, ~87% of the gene models are corroborated by evidence of expression and 2,235 of these models feature multiple transcripts. Our updated gene annotation corrects hundreds of incorrectly split or merged gene models in the original annotation, and as a result the identification of alternative splicing events and differential isoform usage are vastly improved. PMID:26382944

  3. Analysis of mammalian gene function through broad based phenotypic screens across a consortium of mouse clinics

    PubMed Central

    Adams, David J; Adams, Niels C; Adler, Thure; Aguilar-Pimentel, Antonio; Ali-Hadji, Dalila; Amann, Gregory; André, Philippe; Atkins, Sarah; Auburtin, Aurelie; Ayadi, Abdel; Becker, Julien; Becker, Lore; Bedu, Elodie; Bekeredjian, Raffi; Birling, Marie-Christine; Blake, Andrew; Bottomley, Joanna; Bowl, Mike; Brault, Véronique; Busch, Dirk H; Bussell, James N; Calzada-Wack, Julia; Cater, Heather; Champy, Marie-France; Charles, Philippe; Chevalier, Claire; Chiani, Francesco; Codner, Gemma F; Combe, Roy; Cox, Roger; Dalloneau, Emilie; Dierich, André; Di Fenza, Armida; Doe, Brendan; Duchon, Arnaud; Eickelberg, Oliver; Esapa, Chris T; El Fertak, Lahcen; Feigel, Tanja; Emelyanova, Irina; Estabel, Jeanne; Favor, Jack; Flenniken, Ann; Gambadoro, Alessia; Garrett, Lilian; Gates, Hilary; Gerdin, Anna-Karin; Gkoutos, George; Greenaway, Simon; Glasl, Lisa; Goetz, Patrice; Da Cruz, Isabelle Goncalves; Götz, Alexander; Graw, Jochen; Guimond, Alain; Hans, Wolfgang; Hicks, Geoff; Hölter, Sabine M; Höfler, Heinz; Hancock, John M; Hoehndorf, Robert; Hough, Tertius; Houghton, Richard; Hurt, Anja; Ivandic, Boris; Jacobs, Hughes; Jacquot, Sylvie; Jones, Nora; Karp, Natasha A; Katus, Hugo A; Kitchen, Sharon; Klein-Rodewald, Tanja; Klingenspor, Martin; Klopstock, Thomas; Lalanne, Valerie; Leblanc, Sophie; Lengger, Christoph; le Marchand, Elise; Ludwig, Tonia; Lux, Aline; McKerlie, Colin; Maier, Holger; Mandel, Jean-Louis; Marschall, Susan; Mark, Manuel; Melvin, David G; Meziane, Hamid; Micklich, Kateryna; Mittelhauser, Christophe; Monassier, Laurent; Moulaert, David; Muller, Stéphanie; Naton, Beatrix; Neff, Frauke; Nolan, Patrick M; Nutter, Lauryl MJ; Ollert, Markus; Pavlovic, Guillaume; Pellegata, Natalia S; Peter, Emilie; Petit-Demoulière, Benoit; Pickard, Amanda; Podrini, Christine; Potter, Paul; Pouilly, Laurent; Puk, Oliver; Richardson, David; Rousseau, Stephane; Quintanilla-Fend, Leticia; Quwailid, Mohamed M; Racz, Ildiko; Rathkolb, Birgit; Riet, Fabrice; Rossant, Janet; Roux, Michel; Rozman, Jan; Ryder, Ed; Salisbury, Jennifer; Santos, Luis; Schäble, Karl-Heinz; Schiller, Evelyn; Schrewe, Anja; Schulz, Holger; Steinkamp, Ralf; Simon, Michelle; Stewart, Michelle; Stöger, Claudia; Stöger, Tobias; Sun, Minxuan; Sunter, David; Teboul, Lydia; Tilly, Isabelle; Tocchini-Valentini, Glauco P; Tost, Monica; Treise, Irina; Vasseur, Laurent; Velot, Emilie; Vogt-Weisenhorn, Daniela; Wagner, Christelle; Walling, Alison; Weber, Bruno; Wendling, Olivia; Westerberg, Henrik; Willershäuser, Monja; Wolf, Eckhard; Wolter, Anne; Wood, Joe; Wurst, Wolfgang; Yildirim, Ali Önder; Zeh, Ramona; Zimmer, Andreas; Zimprich, Annemarie

    2015-01-01

    The function of the majority of genes in the mouse and human genomes remains unknown. The mouse ES cell knockout resource provides a basis for characterisation of relationships between gene and phenotype. The EUMODIC consortium developed and validated robust methodologies for broad-based phenotyping of knockouts through a pipeline comprising 20 disease-orientated platforms. We developed novel statistical methods for pipeline design and data analysis aimed at detecting reproducible phenotypes with high power. We acquired phenotype data from 449 mutant alleles, representing 320 unique genes, of which half had no prior functional annotation. We captured data from over 27,000 mice finding that 83% of the mutant lines are phenodeviant, with 65% demonstrating pleiotropy. Surprisingly, we found significant differences in phenotype annotation according to zygosity. Novel phenotypes were uncovered for many genes with unknown function providing a powerful basis for hypothesis generation and further investigation in diverse systems. PMID:26214591

  4. In-depth transcriptome analysis of Coilia ectenes, an important fish resource in the Yangtze River: de novo assembly, gene annotation.

    PubMed

    Shen, Huaishun; Gu, Ruobo; Xu, Gangchun; Xu, Pao; Nie, Zijuan; Hu, Yacheng

    2015-10-01

    Coilia ectenes is an important teleost species in the Yangtze River and a model organism that can be used to study the protection of fish resources. In this report, we performed de novo transcriptome sequencing of ten cDNA libraries from the brain, gill, heart, intestine, kidney, liver, muscle, stomach, ovary, and testis tissues. A total of 352 million raw reads of 100 base pairs were generated, and 130,113 transcripts, corresponding to 65,350 non-redundant transcripts, with a mean length of 1520 bp, were assembled. BLASTx-based gene annotation (E-value < 1 × 10^-5) allowed the identification of 73,900 transcripts against at least one of four databases, including the NCBI non-redundant database, the GO database, the COG database, and the KEGG database. Our study provides a valuable resource for C. ectenes genomic and transcriptomic data that will facilitate future functional studies of C. ectenes. PMID:25795024
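
    The E-value threshold quoted in this abstract can be illustrated with a minimal sketch. It assumes BLAST results saved in the standard 12-column tabular layout, where the E-value is the eleventh field; the function name, identifiers, and scores below are hypothetical and are not taken from the study.

```python
import csv
import io

EVALUE_CUTOFF = 1e-5  # threshold quoted in the abstract


def annotated_queries(tabular_text, cutoff=EVALUE_CUTOFF):
    """Return the query IDs that have at least one hit below the cutoff."""
    hits = set()
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        evalue = float(row[10])  # 11th column holds the E-value
        if evalue < cutoff:
            hits.add(row[0])
    return hits


# Made-up example: tx1 has a strong hit, tx2 only a weak one.
example = (
    "tx1\tprotA\t90.0\t100\t10\t0\t1\t300\t1\t100\t1e-30\t200\n"
    "tx2\tprotB\t30.0\t50\t35\t0\t1\t150\t1\t50\t0.5\t25\n"
)
print(sorted(annotated_queries(example)))
```

    Only transcripts with a hit under the cutoff would count as annotated under this reading; how the authors combined hits across the four databases is not stated in the abstract.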

  5. A Factor Graph Approach to Automated GO Annotation

    PubMed Central

    Spetale, Flavio E.; Tapia, Elizabeth; Krsticevic, Flavia; Roda, Fernando; Bulacio, Pilar

    2016-01-01

    As volume of genomic data grows, computational methods become essential for providing a first glimpse onto gene annotations. Automated Gene Ontology (GO) annotation methods based on hierarchical ensemble classification techniques are particularly interesting when interpretability of annotation results is a main concern. In these methods, raw GO-term predictions computed by base binary classifiers are leveraged by checking the consistency of predefined GO relationships. Both formal leveraging strategies, with main focus on annotation precision, and heuristic alternatives, with main focus on scalability issues, have been described in literature. In this contribution, a factor graph approach to the hierarchical ensemble formulation of the automated GO annotation problem is presented. In this formal framework, a core factor graph is first built based on the GO structure and then enriched to take into account the noisy nature of GO-term predictions. Hence, starting from raw GO-term predictions, an iterative message passing algorithm between nodes of the factor graph is used to compute marginal probabilities of target GO-terms. Evaluations on Saccharomyces cerevisiae, Arabidopsis thaliana and Drosophila melanogaster protein sequences from the GO Molecular Function domain showed significant improvements over competing approaches, even when protein sequences were naively characterized by their physicochemical and secondary structure properties or when loose noisy annotation datasets were considered. Based on these promising results and using Arabidopsis thaliana annotation data, we extend our approach to the identification of most promising molecular function annotations for a set of proteins of unknown function in Solanum lycopersicum. PMID:26771463
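
    The idea of combining noisy GO-term predictions with a consistency constraint from the GO structure can be sketched on a toy example. This is not the authors' algorithm: it assumes a single parent-child pair, made-up prediction scores, and a hard rule that a child term cannot be assigned without its parent. The marginals are computed by brute-force enumeration, which on such a tiny tree gives the same answer that message passing would.

```python
from itertools import product

# Hypothetical raw classifier scores, read as P(term = 1) before any
# consistency check. The values are invented for illustration.
raw_scores = {"parent": 0.3, "child": 0.8}


def consistent(parent, child):
    """Assumed hierarchy factor: a child term implies its parent term."""
    return not (child == 1 and parent == 0)


def marginals(scores):
    """Marginal P(term = 1) under the unary scores and hierarchy factor."""
    names = ["parent", "child"]
    total = 0.0
    on_weight = {name: 0.0 for name in names}
    for values in product((0, 1), repeat=len(names)):
        assignment = dict(zip(names, values))
        if not consistent(assignment["parent"], assignment["child"]):
            continue  # inconsistent labelings get zero weight
        weight = 1.0
        for name in names:
            p = scores[name]
            weight *= p if assignment[name] == 1 else 1.0 - p
        total += weight
        for name in names:
            if assignment[name] == 1:
                on_weight[name] += weight
    return {name: on_weight[name] / total for name in names}


result = marginals(raw_scores)
print(result)
```

    In this toy setting the confident child prediction pulls the parent's probability up, and the weak parent prediction pulls the child's probability down, which is the qualitative effect a consistency check is meant to have.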

  6. A Factor Graph Approach to Automated GO Annotation.

    PubMed

    Spetale, Flavio E; Tapia, Elizabeth; Krsticevic, Flavia; Roda, Fernando; Bulacio, Pilar

    2016-01-01

    As volume of genomic data grows, computational methods become essential for providing a first glimpse onto gene annotations. Automated Gene Ontology (GO) annotation methods based on hierarchical ensemble classification techniques are particularly interesting when interpretability of annotation results is a main concern. In these methods, raw GO-term predictions computed by base binary classifiers are leveraged by checking the consistency of predefined GO relationships. Both formal leveraging strategies, with main focus on annotation precision, and heuristic alternatives, with main focus on scalability issues, have been described in literature. In this contribution, a factor graph approach to the hierarchical ensemble formulation of the automated GO annotation problem is presented. In this formal framework, a core factor graph is first built based on the GO structure and then enriched to take into account the noisy nature of GO-term predictions. Hence, starting from raw GO-term predictions, an iterative message passing algorithm between nodes of the factor graph is used to compute marginal probabilities of target GO-terms. Evaluations on Saccharomyces cerevisiae, Arabidopsis thaliana and Drosophila melanogaster protein sequences from the GO Molecular Function domain showed significant improvements over competing approaches, even when protein sequences were naively characterized by their physicochemical and secondary structure properties or when loose noisy annotation datasets were considered. Based on these promising results and using Arabidopsis thaliana annotation data, we extend our approach to the identification of most promising molecular function annotations for a set of proteins of unknown function in Solanum lycopersicum. PMID:26771463

  7. Functionally Enigmatic Genes: A Case Study of the Brain Ignorome

    PubMed Central

    Pandey, Ashutosh K.; Lu, Lu; Wang, Xusheng; Homayouni, Ramin; Williams, Robert W.

    2014-01-01

    What proportion of genes with intense and selective expression in specific tissues, cells, or systems are still almost completely uncharacterized with respect to biological function? In what ways do these functionally enigmatic genes differ from well-studied genes? To address these two questions, we devised a computational approach that defines so-called ignoromes. As proof of principle, we extracted and analyzed a large subset of genes with intense and selective expression in brain. We find that publications associated with this set are highly skewed—the top 5% of genes absorb 70% of the relevant literature. In contrast, approximately 20% of genes have essentially no neuroscience literature. Analysis of the ignorome over the past decade demonstrates that it is stubbornly persistent, and the rapid expansion of the neuroscience literature has not had the expected effect on numbers of these genes. Surprisingly, ignorome genes do not differ from well-studied genes in terms of connectivity in coexpression networks. Nor do they differ with respect to numbers of orthologs, paralogs, or protein domains. The major distinguishing characteristic between these sets of genes is date of discovery, early discovery being associated with greater research momentum—a genomic bandwagon effect. Finally we ask to what extent massive genomic, imaging, and phenotype data sets can be used to provide high-throughput functional annotation for an entire ignorome. In a majority of cases we have been able to extract and add significant information for these neglected genes. In several cases—ELMOD1, TMEM88B, and DZANK1—we have exploited sequence polymorphisms, large phenome data sets, and reverse genetic methods to evaluate the function of ignorome genes. PMID:24523945

  8. Annotation of Ehux ESTs

    SciTech Connect

    Kuo, Alan; Grigoriev, Igor

    2009-06-12

    22 percent of ESTs do not align with scaffolds. The EST Pipeline assembles 17126 consensi from the nonaligned ESTs. The Annotation Pipeline predicts 8564 ORFs on the consensi. Domain analysis of ORFs reveals missing genes. Cluster analysis reveals missing genes. Expression analysis reveals potential strain-specific genes.

  9. Transcriptome sequencing and annotation of the microalgae Dunaliella tertiolecta: Pathway description and gene discovery for production of next-generation biofuels

    PubMed Central

    2011-01-01

    Background: Biodiesel or ethanol derived from lipids or starch produced by microalgae may overcome many of the sustainability challenges previously ascribed to petroleum-based fuels and first generation plant-based biofuels. The paucity of microalgae genome sequences, however, limits gene-based biofuel feedstock optimization studies. Here we describe the sequencing and de novo transcriptome assembly for the non-model microalgae species, Dunaliella tertiolecta, and identify pathways and genes of importance related to biofuel production. Results: Next generation DNA pyrosequencing technology applied to D. tertiolecta transcripts produced 1,363,336 high quality reads with an average length of 400 bases. Following quality and size trimming, ~45% of the high quality reads were assembled into 33,307 isotigs with a 31-fold coverage and 376,482 singletons. Assembled sequences and singletons were subjected to BLAST similarity searches and annotated with Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) orthology (KO) identifiers. These analyses identified the majority of lipid and starch biosynthesis and catabolism pathways in D. tertiolecta. Conclusions: The construction of metabolic pathways involved in the biosynthesis and catabolism of fatty acids, triacylglycerols, and starch in D. tertiolecta as well as the assembled transcriptome provide a foundation for the molecular genetics and functional genomics required to direct metabolic engineering efforts that seek to enhance the quantity and character of microalgae-based biofuel feedstock. PMID:21401935

  10. The institute for genomic research Osa1 rice genome annotation database.

    PubMed

    Yuan, Qiaoping; Ouyang, Shu; Wang, Aihui; Zhu, Wei; Maiti, Rama; Lin, Haining; Hamilton, John; Haas, Brian; Sultana, Razvan; Cheung, Foo; Wortman, Jennifer; Buell, C Robin

    2005-05-01

    We have developed a rice (Oryza sativa) genome annotation database (Osa1) that provides structural and functional annotation for this emerging model species. Using the sequence of O. sativa subsp. japonica cv Nipponbare from the International Rice Genome Sequencing Project, pseudomolecules, or virtual contigs, of the 12 rice chromosomes were constructed. Our most recent release, version 3, represents our third build of the pseudomolecules and is composed of 98% finished sequence. Genes were identified using a series of computational methods developed for Arabidopsis (Arabidopsis thaliana) that were modified for use with the rice genome. In release 3 of our annotation, we identified 57,915 genes, of which 14,196 are related to transposable elements. Of these 43,719 non-transposable element-related genes, 18,545 (42.4%) were annotated with a putative function, 5,777 (13.2%) were annotated as encoding an expressed protein with no known function, and the remaining 19,397 (44.4%) were annotated as encoding a hypothetical protein. Multiple splice forms (5,873) were detected for 2,538 genes, resulting in a total of 61,250 gene models in the rice genome. We incorporated experimental evidence into 18,252 gene models to improve the quality of the structural annotation. A series of functional data types has been annotated for the rice genome that includes alignment with genetic markers, assignment of gene ontologies, identification of flanking sequence tags, alignment with homologs from related species, and syntenic mapping with other cereal species. All structural and functional annotation data are available through interactive search and display windows as well as through download of flat files. To integrate the data with other genome projects, the annotation data are available through a Distributed Annotation System and a Genome Browser. All data can be obtained through the project Web pages at http://rice.tigr.org. PMID:15888674
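
    The counts in this abstract are internally consistent, which a short arithmetic check makes visible. All figures below are copied from the abstract; the variable names are ours, and the gene-model formula is one reading that happens to reproduce the stated total (it assumes the 5,873 splice forms include the primary model of each of the 2,538 genes).

```python
# Figures quoted in the abstract.
total_genes = 57_915
te_related = 14_196
putative_function = 18_545
expressed_unknown = 5_777
hypothetical = 19_397
splice_forms = 5_873
genes_with_splice_forms = 2_538

# Genes not related to transposable elements.
non_te = total_genes - te_related

# Share of each functional class among non-TE genes, in percent.
shares = {
    "putative function": round(100 * putative_function / non_te, 1),
    "expressed, no known function": round(100 * expressed_unknown / non_te, 1),
    "hypothetical": round(100 * hypothetical / non_te, 1),
}

# Each multi-isoform gene already counts once in total_genes, so only
# the additional splice forms add new gene models.
gene_models = total_genes + (splice_forms - genes_with_splice_forms)

print(non_te, shares, gene_models)
```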

  11. Use of Modern Chemical Protein Synthesis and Advanced Fluorescent Assay Techniques to Experimentally Validate the Functional Annotation of Microbial Genomes

    SciTech Connect

    Kent, Stephen

    2012-07-20

    The objective of this research program was to prototype methods for the chemical synthesis of predicted protein molecules in annotated microbial genomes. High throughput chemical methods were to be used to make large numbers of predicted proteins and protein domains, based on microbial genome sequences. Microscale chemical synthesis methods for the parallel preparation of peptide-thioester building blocks were developed; these peptide segments are used for the parallel chemical synthesis of proteins and protein domains. Ultimately, it is envisaged that these synthetic molecules would be ‘printed’ in spatially addressable arrays. The unique ability of total synthesis to precision label protein molecules with dyes and with chemical or biochemical ‘tags’ can be used to facilitate novel assay technologies adapted from state-of-the-art single molecule fluorescence detection techniques. In the future, in conjunction with modern laboratory automation, this integrated set of techniques will enable high throughput experimental validation of the functional annotation of microbial genomes.

  12. Solving the Problem: Genome Annotation Standards before the Data Deluge

    PubMed Central

    Klimke, William; O'Donovan, Claire; White, Owen; Brister, J. Rodney; Clark, Karen; Fedorov, Boris; Mizrachi, Ilene; Pruitt, Kim D.; Tatusova, Tatiana

    2011-01-01

    The promise of genome sequencing was that the vast undiscovered country would be mapped out by comparison of the multitude of sequences available and would aid researchers in deciphering the role of each gene in every organism. Researchers recognize that there is a need for high quality data. However, different annotation procedures, numerous databases, and a diminishing percentage of experimentally determined gene functions have resulted in a spectrum of annotation quality. NCBI in collaboration with sequencing centers, archival databases, and researchers, has developed the first international annotation standards, a fundamental step in ensuring that high quality complete prokaryotic genomes are available as gold standard references. Highlights include the development of annotation assessment tools, community acceptance of protein naming standards, comparison of annotation resources to provide consistent annotation, and improved tracking of the evidence used to generate a particular annotation. The development of a set of minimal standards, including the requirement for annotated complete prokaryotic genomes to contain a full set of ribosomal RNAs, transfer RNAs, and proteins encoding core conserved functions, is an historic milestone. The use of these standards in existing genomes and future submissions will increase the quality of databases, enabling researchers to make accurate biological discoveries. PMID:22180819

  13. Annotation of Differential Gene Expression in Small Yellow Follicles of a Broiler-Type Strain of Taiwan Country Chickens in Response to Acute Heat Stress

    PubMed Central

    Wang, Shih-Han; Tang, Pin-Chi; Chen, Chih-Feng; Chen, Hsin-Hsin; Lee, Yen-Pai; Chen, Shuen-Ei; Huang, San-Yuan

    2015-01-01

    This study investigated global gene expression in the small yellow follicles (6–8 mm diameter) of broiler-type B strain Taiwan country chickens (TCCs) in response to acute heat stress. Twelve 30-wk-old TCC hens were divided into four groups: control hens maintained at 25°C and hens subjected to 38°C acute heat stress for 2 h without recovery (H2R0), with 2-h recovery (H2R2), and with 6-h recovery (H2R6). Small yellow follicles were collected for RNA isolation and microarray analysis at the end of each time point. Results showed that 69, 51, and 76 genes were upregulated and 58, 15, and 56 genes were downregulated after heat treatment of H2R0, H2R2, and H2R6, respectively, using a cutoff value of two-fold or higher. Gene ontology analysis revealed that these differentially expressed genes are associated with the biological processes of cell communication, developmental process, protein metabolic process, immune system process, and response to stimuli. Upregulation of heat shock protein 25, interleukin 6, metallopeptidase 1, and metalloproteinase 13, and downregulation of type II alpha 1 collagen, discoidin domain receptor tyrosine kinase 2, and Kruppel-like factor 2 suggested that acute heat stress induces proteolytic disintegration of the structural matrix and inflamed damage and adaptive responses of gene expression in the follicle cells. These suggestions were validated through gene expression, using quantitative real-time polymerase chain reaction. Functional annotation clarified that interleukin 6-related pathways play a critical role in regulating acute heat stress responses in the small yellow follicles of TCC hens. PMID:26587838
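
    A two-fold cutoff of the kind mentioned in this abstract can be sketched as follows. The sketch assumes each gene is summarized by a treated-to-control expression ratio, so that a ratio of 2 or more counts as upregulated and 0.5 or less as downregulated; the gene names and ratios are invented, and the study's actual normalization and statistics are not described here.

```python
FOLD_CUTOFF = 2.0  # "two-fold or higher", as stated in the abstract


def classify_by_fold_change(ratios, cutoff=FOLD_CUTOFF):
    """Split genes into up- and downregulated sets by expression ratio."""
    up = {gene for gene, ratio in ratios.items() if ratio >= cutoff}
    down = {gene for gene, ratio in ratios.items() if ratio <= 1.0 / cutoff}
    return up, down


# Hypothetical treated/control ratios for four made-up genes.
example_ratios = {"geneA": 3.1, "geneB": 0.4, "geneC": 1.2, "geneD": 2.0}
up_genes, down_genes = classify_by_fold_change(example_ratios)
print(sorted(up_genes), sorted(down_genes))
```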

  14. Managing the data deluge: data-driven GO category assignment improves while complexity of functional annotation increases.

    PubMed

    Gobeill, Julien; Pasche, Emilie; Vishnyakova, Dina; Ruch, Patrick

    2013-01-01

    The available curated data lag behind current biological knowledge contained in the literature. Text mining can assist biologists and curators to locate and access this knowledge, for instance by characterizing the functional profile of publications. Gene Ontology (GO) category assignment in free text already supports various applications, such as powering ontology-based search engines, finding curation-relevant articles (triage) or helping the curator to identify and encode functions. Popular text mining tools for GO classification are based on so-called thesaurus-based (or dictionary-based) approaches, which exploit similarities between the input text and GO terms themselves. But their effectiveness remains limited owing to the complex nature of GO terms, which rarely occur in text. In contrast, machine learning approaches exploit similarities between the input text and already curated instances contained in a knowledge base to infer a functional profile. GO Annotations (GOA) and MEDLINE make it possible to exploit a growing amount of curated abstracts (97 000 in November 2012) for populating this knowledge base. Our study compares a state-of-the-art thesaurus-based system with a machine learning system (based on a k-Nearest Neighbours algorithm) for the task of proposing a functional profile for unseen MEDLINE abstracts, and shows how resources and performances have evolved. Systems are evaluated on their ability to propose for a given abstract the GO terms (2.8 on average) used for curation in GOA. We show that since 2006, although a massive effort was put into adding synonyms in GO (+300%), our thesaurus-based system effectiveness is rather constant, moving from 0.28 to 0.31 for Recall at 20 (R20). In contrast, thanks to its knowledge base growth, our machine learning system has steadily improved, rising from 0.38 in 2006 to 0.56 for R20 in 2012. Integrated in semi-automatic workflows or in fully automatic pipelines, such systems are increasingly effective at assisting biologists. DATABASE URL: http://eagl.unige.ch/GOCat/ PMID:23842461
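
    The Recall at 20 (R20) measure reported in this abstract can be written down in a few lines. This is a generic sketch of recall at rank k for a single abstract, assuming a ranked list of proposed terms and a set of curated terms; the term labels are placeholders, and how the authors average over abstracts is not specified here.

```python
def recall_at_k(ranked_terms, gold_terms, k=20):
    """Fraction of curated terms found among the top-k proposed terms."""
    gold = set(gold_terms)
    if not gold:
        raise ValueError("gold_terms must not be empty")
    top_k = set(ranked_terms[:k])
    return len(top_k & gold) / len(gold)


# Placeholder labels: the system proposes three terms in ranked order,
# and the curators used two terms, one of which was never proposed.
proposed = ["term_a", "term_b", "term_c"]
curated = {"term_b", "term_z"}
print(recall_at_k(proposed, curated, k=2))
```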

  15. The Disease Portals, disease–gene annotation and the RGD disease ontology at the Rat Genome Database

    PubMed Central

    Hayman, G. Thomas; Laulederkind, Stanley J. F.; Smith, Jennifer R.; Wang, Shur-Jen; Petri, Victoria; Nigam, Rajni; Tutaj, Marek; De Pons, Jeff; Dwinell, Melinda R.; Shimoyama, Mary

    2016-01-01

    The Rat Genome Database (RGD; http://rgd.mcw.edu/) provides critical datasets and software tools to a diverse community of rat and non-rat researchers worldwide. To meet the needs of the many users whose research is disease oriented, RGD has created a series of Disease Portals and has prioritized its curation efforts on the datasets important to understanding the mechanisms of various diseases. Gene-disease relationships for three species, rat, human and mouse, are annotated to capture biomarkers, genetic associations, molecular mechanisms and therapeutic targets. To generate gene–disease annotations more effectively and in greater detail, RGD initially adopted the MEDIC disease vocabulary from the Comparative Toxicogenomics Database and adapted it for use by expanding this framework with the addition of over 1000 terms to create the RGD Disease Ontology (RDO). The RDO provides the foundation for, at present, 10 comprehensive disease area-related dataset and analysis platforms at RGD, the Disease Portals. Two major disease areas are the focus of data acquisition and curation efforts each year, leading to the release of the related Disease Portals. Collaborative efforts to realize a more robust disease ontology are underway. Database URL: http://rgd.mcw.edu PMID:27009807

  16. Smoking Gun or Circumstantial Evidence? Comparison of Statistical Learning Methods using Functional Annotations for Prioritizing Risk Variants

    PubMed Central

    Gagliano, Sarah A.; Ravji, Reena; Barnes, Michael R.; Weale, Michael E.; Knight, Jo

    2015-01-01

    Although technology has triumphed in facilitating routine genome sequencing, new challenges have been created for the data-analyst. Genome-scale surveys of human variation generate volumes of data that far exceed capabilities for laboratory characterization. By incorporating functional annotations as predictors, statistical learning has been widely investigated for prioritizing genetic variants likely to be associated with complex disease. We compared three published prioritization procedures, which use different statistical learning algorithms and different predictors with regard to the quantity, type and coding. We also explored different combinations of algorithm and annotation set. As an application, we tested which methodology performed best for prioritizing variants using data from a large schizophrenia meta-analysis by the Psychiatric Genomics Consortium. Results suggest that all methods have considerable (and similar) predictive accuracies (AUCs 0.64–0.71) in test set data, but there is more variability in the application to the schizophrenia GWAS. In conclusion, a variety of algorithms and annotations seem to have a similar potential to effectively enrich true risk variants in genome-scale datasets, however none offer more than incremental improvement in prediction. We discuss how methods might be evolved for risk variant prediction to address the impending bottleneck of the new generation of genome re-sequencing studies. PMID:26300220
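
    The AUC figures quoted in this abstract summarize how well a score separates true risk variants from the rest. A minimal sketch of the measure, assuming two lists of scores (positives and negatives) and using the pairwise-comparison definition of AUC; the scores are invented and unrelated to the study's data.

```python
def auc(positive_scores, negative_scores):
    """Probability that a random positive outranks a random negative.

    Ties count as half a win, the usual convention for this definition.
    """
    if not positive_scores or not negative_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for pos in positive_scores:
        for neg in negative_scores:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Made-up scores: five of the six positive/negative pairs are ordered
# correctly.
example_auc = auc([0.9, 0.6, 0.4], [0.5, 0.3])
print(example_auc)
```

    An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which gives some context for the 0.64–0.71 range reported above.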

  17. Epigenomic annotation of gene regulatory alterations during evolution of the primate brain.

    PubMed

    Vermunt, Marit W; Tan, Sander C; Castelijns, Bas; Geeven, Geert; Reinink, Peter; de Bruijn, Ewart; Kondova, Ivanela; Persengiev, Stephan; Bontrop, Ronald; Cuppen, Edwin; de Laat, Wouter; Creyghton, Menno P

    2016-03-01

    Although genome sequencing has identified numerous noncoding alterations between primate species, which of those are regulatory and potentially relevant to the evolution of the human brain is unclear. Here we annotated cis-regulatory elements (CREs) in the human, rhesus macaque and chimpanzee genomes using chromatin immunoprecipitation followed by sequencing (ChIP-seq) in different anatomical regions of the adult brain. We found high similarity in the genomic positioning of rhesus macaque and human CREs, suggesting that the majority of these elements were already present in a common ancestor 25 million years ago. Most of the observed regulatory changes between humans and rhesus macaques occurred before the ancestral separation of humans and chimpanzees, leaving a modest set of regulatory elements with predicted human specificity. Our data refine previous predictions and hypotheses on the consequences of genomic changes between primate species and allow the identification of regulatory alterations relevant to the evolution of the brain. PMID:26807951

  18. The UniProt-GO Annotation database in 2011.

    PubMed

    Dimmer, Emily C; Huntley, Rachael P; Alam-Faruque, Yasmin; Sawford, Tony; O'Donovan, Claire; Martin, Maria J; Bely, Benoit; Browne, Paul; Mun Chan, Wei; Eberhardt, Ruth; Gardner, Michael; Laiho, Kati; Legge, Duncan; Magrane, Michele; Pichler, Klemens; Poggioli, Diego; Sehra, Harminder; Auchincloss, Andrea; Axelsen, Kristian; Blatter, Marie-Claude; Boutet, Emmanuel; Braconi-Quintaje, Silvia; Breuza, Lionel; Bridge, Alan; Coudert, Elizabeth; Estreicher, Anne; Famiglietti, Livia; Ferro-Rojas, Serenella; Feuermann, Marc; Gos, Arnaud; Gruaz-Gumowski, Nadine; Hinz, Ursula; Hulo, Chantal; James, Janet; Jimenez, Silvia; Jungo, Florence; Keller, Guillaume; Lemercier, Phillippe; Lieberherr, Damien; Masson, Patrick; Moinat, Madelaine; Pedruzzi, Ivo; Poux, Sylvain; Rivoire, Catherine; Roechert, Bernd; Schneider, Michael; Stutz, Andre; Sundaram, Shyamala; Tognolli, Michael; Bougueleret, Lydie; Argoud-Puy, Ghislaine; Cusin, Isabelle; Duek-Roggli, Paula; Xenarios, Ioannis; Apweiler, Rolf

    2012-01-01

    The GO annotation dataset provided by the UniProt Consortium (GOA: http://www.ebi.ac.uk/GOA) is a comprehensive set of evidenced-based associations between terms from the Gene Ontology resource and UniProtKB proteins. Currently supplying over 100 million annotations to 11 million proteins in more than 360,000 taxa, this resource has increased 2-fold over the last 2 years and has benefited from a wealth of checks to improve annotation correctness and consistency as well as now supplying a greater information content enabled by GO Consortium annotation format developments. Detailed, manual GO annotations obtained from the curation of peer-reviewed papers are directly contributed by all UniProt curators and supplemented with manual and electronic annotations from 36 model organism and domain-focused scientific resources. The inclusion of high-quality, automatic annotation predictions ensures the UniProt GO annotation dataset supplies functional information to a wide range of proteins, including those from poorly characterized, non-model organism species. UniProt GO annotations are freely available in a range of formats accessible by both file downloads and web-based views. In addition, the introduction of a new, normalized file format in 2010 has made for easier handling of the complete UniProt-GOA data set. PMID:22123736

  19. The RAST server : rapid annotations using subsystems technology.

    SciTech Connect

    Aziz, R. K.; Bartels, D.; Best, A. A.; DeJongh, M.; Disz, T.; Edwards, R. A.; Formsma, K.; Gerdes, S.; Glass, E. M.; Kubal, M.; Meyer, F.; Olsen, G. J.; Olson, R.; Osterman, A. L.; Overbeek, R. A.; McNeil, L. K.; Paarmann, D.; Paczian, T.; Parrello, B.; Pusch, G. D.; Reich, C.; Stevens, R.; Vassieva, O.; Vonstein, V.; Wilke, A.; Zagnitko, O.; Mathematics and Computer Science; Fellowship for Interpretation of Genomes; Univ. of Chicago; Univ. of Illinois; The Burnham Inst.; Hope Coll.; Univ. of Tenn.; Cairo Univ.

    2008-02-08

    The number of prokaryotic genome sequences becoming available is growing steadily and is growing faster than our ability to accurately annotate them. We describe a fully automated service for annotating bacterial and archaeal genomes. The service identifies protein-encoding, rRNA and tRNA genes, assigns functions to the genes, predicts which subsystems are represented in the genome, uses this information to reconstruct the metabolic network and makes the output easily downloadable for the user. In addition, the annotated genome can be browsed in an environment that supports comparative analysis with the annotated genomes maintained in the SEED environment. The service normally makes the annotated genome available within 12-24 hours of submission, but ultimately the quality of such a service will be judged in terms of accuracy, consistency, and completeness of the produced annotations. We summarize our attempts to address these issues and discuss plans for incrementally enhancing the service. By providing accurate, rapid annotation freely to the community we have created an important community resource. The service has now been utilized by over 120 external users annotating over 350 distinct genomes.

  20. Predicting Gene Function using Predictive Clustering Trees

    NASA Astrophysics Data System (ADS)

    Vens, Celine; Schietgat, Leander; Struyf, Jan; Blockeel, Hendrik; Kocev, Dragi; Džeroski, Sašo

    In this chapter, we show how the predictive clustering tree framework can be used to predict the functions of genes. The gene function prediction task is an example of a hierarchical multi-label classification (HMC) task: genes may have multiple functions and these functions are organized in a hierarchy. The hierarchy of functions can be such that each function has at most one parent (tree structure) or such that functions may have multiple parents (DAG structure).
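
    The tree and DAG cases described in this abstract can be illustrated with a small label-propagation sketch. It assumes the usual hierarchy constraint in hierarchical multi-label classification (a gene assigned a function is also assigned every ancestor of that function), which the abstract does not spell out; the function names in the example hierarchy are hypothetical.

```python
def with_ancestors(labels, parents):
    """Close a label set under the hierarchy by adding all ancestors.

    `parents` maps each function to its parent functions; a function
    may have several parents, so this works for trees and DAGs alike.
    """
    closed = set()
    stack = list(labels)
    while stack:
        term = stack.pop()
        if term in closed:
            continue  # already visited via another path in the DAG
        closed.add(term)
        stack.extend(parents.get(term, ()))
    return closed


# Hypothetical DAG: "c" has two parents, which share the parent "root".
example_parents = {"c": ["a", "b"], "a": ["root"], "b": ["root"], "root": []}
print(sorted(with_ancestors({"c"}, example_parents)))
```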

  1. Bioinformatic approaches for functional annotation and pathway inference in metagenomics data

    PubMed Central

    De Filippo, Carlotta; Ramazzotti, Matteo; Fontana, Paolo; Cavalieri, Duccio

    2012-01-01

    Metagenomic approaches are increasingly recognized as a baseline for understanding the ecology and evolution of microbial ecosystems. The development of methods for pathway inference from metagenomics data is of paramount importance to link a phenotype to a cascade of events stemming from a series of connected sets of genes or proteins. Biochemical and regulatory pathways have until recently been thought of and modelled within one cell type, one organism, one species. This vision is being dramatically changed by the advent of whole microbiome sequencing studies, revealing the role of symbiotic microbial populations in fundamental biochemical functions. The new landscape we face requires a clear picture of the potentialities of existing tools and development of new tools to characterize, reconstruct and model biochemical and regulatory pathways as the result of integration of function in complex symbiotic interactions of ontologically and evolutionarily distinct cell types. PMID:23175748

  2. Annotation of functional variation within non-MHC MS susceptibility loci through bioinformatics analysis.

    PubMed

    Briggs, F B S; Leung, L J; Barcellos, L F

    2014-10-01

    There is a strong and complex genetic component to multiple sclerosis (MS). In addition to variation in the major histocompatibility complex (MHC) region on chromosome 6p21.3, 110 non-MHC susceptibility variants have been identified in Northern Europeans, thus far. The majority of the MS-associated genes are immune related; however, similar to most other complex genetic diseases, the causal variants and biological processes underlying pathogenesis remain largely unknown. We created a comprehensive catalog of putative functional variants that reside within linkage disequilibrium regions of the MS-associated genic variants to guide future studies. Bioinformatics analyses were also conducted using publicly available resources to identify plausible pathological processes relevant to MS and functional hypotheses for established MS-associated variants. PMID:25030428

  3. Discovery and annotation of functional chromatin signatures in the human genome.

    PubMed

    Hon, Gary; Wang, Wei; Ren, Bing

    2009-11-01

    Transcriptional regulation in human cells is a complex process involving a multitude of regulatory elements encoded by the genome. Recent studies have shown that distinct chromatin signatures mark a variety of functional genomic elements and that subtle variations of these signatures mark elements with different functions. To identify novel chromatin signatures in the human genome, we apply a de novo pattern-finding algorithm to genome-wide maps of histone modifications. We recover previously known chromatin signatures associated with promoters and enhancers. We also observe several chromatin signatures with strong enrichment of H3K36me3 marking exons. Closer examination reveals that H3K36me3 is found on well-positioned nucleosomes at exon 5' ends, and that this modification is a global mark of exon expression that also correlates with alternative splicing. Additionally, we observe strong enrichment of H2BK5me1 and H4K20me1 at highly expressed exons near the 5' end, in contrast to the opposite distribution of H3K36me3-marked exons. Finally, we also recover frequently occurring chromatin signatures displaying enrichment of repressive histone modifications. These signatures mark distinct repeat sequences and are associated with distinct modes of gene repression. Together, these results highlight the rich information embedded in the human epigenome and underscore its value in studying gene regulation. PMID:19918365

  4. Rotavirus gene structure and function.

    PubMed Central

    Estes, M K; Cohen, J

    1989-01-01

    Knowledge of the structure and function of the genes and proteins of the rotaviruses has expanded rapidly. Information obtained in the last 5 years has revealed unexpected and unique molecular properties of rotavirus proteins of general interest to virologists, biochemists, and cell biologists. Rotaviruses share some features of replication with reoviruses, yet antigenic and molecular properties of the outer capsid proteins, VP4 (a protein whose cleavage is required for infectivity, possibly by mediating fusion with the cell membrane) and VP7 (a glycoprotein), show more similarities with those of other viruses such as the orthomyxoviruses, paramyxoviruses, and alphaviruses. Rotavirus morphogenesis is a unique process, during which immature subviral particles bud through the membrane of the endoplasmic reticulum (ER). During this process, transiently enveloped particles form, the outer capsid proteins are assembled onto particles, and mature particles accumulate in the lumen of the ER. Two ER-specific viral glycoproteins are involved in virus maturation, and these glycoproteins have been shown to be useful models for studying protein targeting and retention in the ER and for studying mechanisms of virus budding. New ideas and approaches to understanding how each gene functions to replicate and assemble the segmented viral genome have emerged from knowledge of the primary structure of rotavirus genes and their proteins and from knowledge of the properties of domains on individual proteins. Localization of type-specific and cross-reactive neutralizing epitopes on the outer capsid proteins is becoming increasingly useful in dissecting the protective immune response, including evaluation of vaccine trials, with the practical possibility of enhancing the production of new, more effective vaccines. 
    Finally, future analyses with recently characterized immunologic and gene probes and new animal models can be expected to provide a basic understanding of what regulates the primary interactions of these viruses with the gastrointestinal tract and the subsequent responses of infected hosts. PMID:2556635

  5. RNA-seq-Based Gene Annotation and Comparative Genomics of Four Fungal Grass Pathogens in the Genus Zymoseptoria Identify Novel Orphan Genes and Species-Specific Invasions of Transposable Elements.

    PubMed

    Grandaubert, Jonathan; Bhattacharyya, Amitava; Stukenbrock, Eva H

    2015-07-01

    The fungal pathogen Zymoseptoria tritici (synonym Mycosphaerella graminicola) is a prominent pathogen of wheat. The reference genome of the isolate IPO323 is one of the best-assembled eukaryotic genomes and encodes more than 10,000 predicted genes. However, a large proportion of the previously annotated gene models are incomplete, with either no start or no stop codons. The availability of RNA-seq data allows better predictions of gene structure. We here used two different RNA-seq datasets, de novo transcriptome assemblies, homology-based comparisons, and trained ab initio gene callers to generate a new gene annotation of Z. tritici IPO323. The annotation pipeline was also applied to re-sequenced genomes of three closely related species of Z. tritici: Z. pseudotritici, Z. ardabiliae, and Z. brevis. Comparative analyses of the predicted gene models using the four Zymoseptoria species revealed sets of species-specific orphan genes enriched with putative pathogenicity-related genes encoding small secreted proteins that may play essential roles in virulence and host specificity. De novo repeat identification allowed us to show that few families of transposable elements are shared between Zymoseptoria species while we observe many species-specific invasions and expansions. The annotation data presented here provide a high-quality resource for future studies of Z. tritici and its sister species and provide detailed insight into gene and genome evolution of fungal plant pathogens. PMID:25917918

  6. RNA-seq-Based Gene Annotation and Comparative Genomics of Four Fungal Grass Pathogens in the Genus Zymoseptoria Identify Novel Orphan Genes and Species-Specific Invasions of Transposable Elements

    PubMed Central

    Grandaubert, Jonathan; Bhattacharyya, Amitava; Stukenbrock, Eva H.

    2015-01-01

    The fungal pathogen Zymoseptoria tritici (synonym Mycosphaerella graminicola) is a prominent pathogen of wheat. The reference genome of the isolate IPO323 is one of the best-assembled eukaryotic genomes and encodes more than 10,000 predicted genes. However, a large proportion of the previously annotated gene models are incomplete, with either no start or no stop codons. The availability of RNA-seq data allows better predictions of gene structure. We here used two different RNA-seq datasets, de novo transcriptome assemblies, homology-based comparisons, and trained ab initio gene callers to generate a new gene annotation of Z. tritici IPO323. The annotation pipeline was also applied to re-sequenced genomes of three closely related species of Z. tritici: Z. pseudotritici, Z. ardabiliae, and Z. brevis. Comparative analyses of the predicted gene models using the four Zymoseptoria species revealed sets of species-specific orphan genes enriched with putative pathogenicity-related genes encoding small secreted proteins that may play essential roles in virulence and host specificity. De novo repeat identification allowed us to show that few families of transposable elements are shared between Zymoseptoria species while we observe many species-specific invasions and expansions. The annotation data presented here provide a high-quality resource for future studies of Z. tritici and its sister species and provide detailed insight into gene and genome evolution of fungal plant pathogens. PMID:25917918

  7. Annotation of metabolic and biosynthesis genes from Hessian fly (Diptera: Cecidomyiidae)

    Technology Transfer Automated Retrieval System (TEKTRAN)

    The Hessian fly is the major insect pest of wheat in the southeastern United States and has traditionally been controlled through the utilization of Hessian fly resistance (R) genes in wheat. Such R genes are a limited resource, and once deployed lose their field effectiveness with time. Using 21 ...

  8. Using PPI network autocorrelation in hierarchical multi-label classification trees for gene function prediction

    PubMed Central

    2013-01-01

    Background: Ontologies and catalogs of gene functions, such as the Gene Ontology (GO) and MIPS-FUN, assume that functional classes are organized hierarchically, that is, general functions include more specific ones. This has recently motivated the development of several machine learning algorithms for gene function prediction that leverage this hierarchical organization, where instances may belong to multiple classes. In addition, it is possible to exploit relationships among examples, since it is plausible that related genes tend to share functional annotations. Although these relationships have been identified and extensively studied in the area of protein-protein interaction (PPI) networks, they have not received much attention in hierarchical and multi-class gene function prediction. Relations between genes introduce autocorrelation in functional annotations and violate the assumption that instances are independently and identically distributed (i.i.d.), which underlies most machine learning algorithms. Although the explicit consideration of these relations brings additional complexity to the learning process, we expect substantial benefits in predictive accuracy of learned classifiers. Results: This article demonstrates the benefits (in terms of predictive accuracy) of considering autocorrelation in multi-class gene function prediction. We develop a tree-based algorithm for considering network autocorrelation in the setting of Hierarchical Multi-label Classification (HMC). We empirically evaluate the proposed algorithm, called NHMC (Network Hierarchical Multi-label Classification), on 12 yeast datasets using each of the MIPS-FUN and GO annotation schemes and exploiting 2 different PPI networks. The results clearly show that taking autocorrelation into account improves the predictive performance of the learned models for predicting gene function. Conclusions: Our newly developed method for HMC takes into account network information in the learning phase. When used for gene function prediction in the context of PPI networks, the explicit consideration of network autocorrelation increases the predictive performance of the learned models. Overall, we found that this holds for different gene features/descriptions, functional annotation schemes, and PPI networks. Best results are achieved when the PPI network is dense and contains a large proportion of function-relevant interactions. PMID:24070402

  9. Accumulation, functional annotation, and comparative analysis of expressed sequence tags in eggplant (Solanum melongena L.), the third pole of the genus Solanum species after tomato and potato.

    PubMed

    Fukuoka, Hiroyuki; Yamaguchi, Hirotaka; Nunome, Tsukasa; Negoro, Satomi; Miyatake, Koji; Ohyama, Akio

    2010-01-15

    Eggplant (Solanum melongena L.) is a widely grown vegetable crop that belongs to the genus Solanum, which comprises more than 1000 species of wide genetic and phenotypic variation. Unlike tomato and potato, Solanum crops that belong to subgenus Potatoe and have been targets for comprehensive genomic studies, eggplant is endemic to the Old World and belongs to a different subgenus, Leptostemonum, and therefore would be a unique member for comparative molecular biology in Solanum. In this study, more than 60,000 eggplant cDNA clones from various tissues and treatments were sequenced from both the 5'- and 3'-ends, and a unigene set consisting of 16,245 unique sequences was constructed. Functional annotations based on sequence similarity to known plant reference datasets revealed a distribution of functional categories closely similar to that of tomato, while 1316 unigenes were suggested to be eggplant-specific. Sequence-based comparative analysis using putative orthologous gene groups set up by reciprocal sequence comparison among six solanaceous species suggested that eggplant and its wild ally Solanum torvum were clustered separately from subgenus Potatoe species, and that all Solanum species were in turn clustered separately from the genus Capsicum. Microsatellite motif distribution differed among species and is likely to coincide with the phylogenetic relationships. Furthermore, the eggplant unigene dataset exhibited its utility in transcriptome analysis by the SAGE strategy, where a considerable number of short tag sequences of interest were successfully assigned to unigenes and their functional annotations. The eggplant ESTs and 16k unigene set developed in this study would be a useful resource not only for molecular genetics and breeding in eggplant itself, but also for expanding the scope of comparative biology in Solanum species. PMID:19857557

  10. Gene discovery and gene function assignment in filamentous fungi

    PubMed Central

    Hamer, Lisbeth; Adachi, Kiichi; Montenegro-Chamorro, Maria V.; Tanzer, Matthew M.; Mahanty, Sanjoy K.; Lo, Clive; Tarpey, Rex W.; Skalchunes, Amy R.; Heiniger, Ryan W.; Frank, Sheryl A.; Darveaux, Blaise A.; Lampe, David J.; Slater, Ted M.; Ramamurthy, Lakshman; DeZwaan, Todd M.; Nelson, Grant H.; Shuster, Jeffrey R.; Woessner, Jeffrey; Hamer, John E.

    2001-01-01

    Filamentous fungi are a large group of diverse and economically important microorganisms. Large-scale gene disruption strategies developed in budding yeast are not applicable to these organisms because of their larger genomes and lower rate of targeted integration (TI) during transformation. We developed transposon-arrayed gene knockouts (TAGKO) to discover genes and simultaneously create gene disruption cassettes for subsequent transformation and mutant analysis. Transposons carrying a bacterial and fungal drug resistance marker are used to mutagenize individual cosmids or entire libraries in vitro. Cosmids are annotated by DNA sequence analysis at the transposon insertion sites, and cosmid inserts are liberated to direct insertional mutagenesis events in the genome. Based on saturation analysis of a cosmid insert and insertions in a fungal cosmid library, we show that TAGKO can be used to rapidly identify and mutate genes. We further show that insertions can create alterations in gene expression, and we have used this approach to investigate an amino acid oxidation pathway in two important fungal phytopathogens. PMID:11296265

  11. Elucidating gene function and function evolution through comparison of co-expression networks of plants

    PubMed Central

    Hansen, Bjoern O.; Vaid, Neha; Musialak-Lange, Magdalena; Janowski, Marcin; Mutwil, Marek

    2014-01-01

    The analysis of gene expression data has shown that transcriptionally coordinated (co-expressed) genes are often functionally related, enabling scientists to use expression data in gene function prediction. This Focused Review discusses our original paper (Large-scale co-expression approach to dissect secondary cell wall formation across plant species, Frontiers in Plant Science 2:23). In this paper we applied cross-species analysis to co-expression networks of genes involved in cellulose biosynthesis. We showed that the co-expression networks from different species are highly similar, indicating that whole biological pathways are conserved across species. This finding has two important implications. First, the analysis can transfer gene function annotation from well-studied plants, such as Arabidopsis, to other, uncharacterized plant species. As the analysis finds genes that have similar sequence and similar expression pattern across different organisms, functionally equivalent genes can be identified. Second, since co-expression analyses are often noisy, a comparative analysis should have higher performance, as parts of co-expression networks that are conserved are more likely to be functionally relevant. In this Focused Review, we outline the comparative analysis done in the original paper and comment on the recent advances and approaches that allow comparative analyses of co-function networks. We hypothesize that in comparison to simple co-expression analysis, comparative analysis would yield more accurate gene function predictions. Finally, by combining comparative analysis with genomic information of green plants, we propose a possible composition of cellulose biosynthesis machinery during earlier stages of plant evolution. PMID:25191328
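    The co-expression idea in this abstract can be illustrated with a minimal sketch. It only shows single-species co-expression scoring by Pearson correlation, not the cross-species comparison the review describes. The gene names, expression values, and the 0.8 threshold are hypothetical choices made for illustration and do not come from the paper.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat profile carries no co-expression signal
    return cov / math.sqrt(var_x * var_y)


def coexpressed_pairs(profiles, threshold=0.8):
    """Return (gene1, gene2, r) for every gene pair with r >= threshold."""
    genes = sorted(profiles)
    pairs = []
    for i, g1 in enumerate(genes):
        for g2 in genes[i + 1:]:
            r = pearson(profiles[g1], profiles[g2])
            if r >= threshold:
                pairs.append((g1, g2, r))
    return pairs


# Hypothetical expression values across four conditions.
profiles = {
    "gene_a": [1.0, 2.0, 3.0, 4.0],
    "gene_b": [2.1, 3.9, 6.2, 8.0],
    "gene_c": [4.0, 1.0, 3.5, 0.5],
}
edges = coexpressed_pairs(profiles)
```

    In this toy data only gene_a and gene_b rise and fall together, so they form the single network edge. A comparative analysis would additionally check whether such an edge is conserved between orthologous genes in another species.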

  12. De Novo Assembly, Gene Annotation and Marker Development Using Illumina Paired-End Transcriptome Sequences in Celery (Apium graveolens L.)

    PubMed Central

    Fu, Nan; Wang, Qian; Shen, Huo-Lin

    2013-01-01

    Background: Celery is an increasingly popular vegetable species, but limited transcriptome and genomic data hinder research on it. In addition, a lack of celery molecular markers limits the process of molecular genetic breeding. High-throughput transcriptome sequencing is an efficient method to generate a large transcriptome sequence dataset for gene discovery, molecular marker development and marker-assisted selection breeding. Principal Findings: Celery transcriptomes from four tissues were sequenced using Illumina paired-end sequencing technology. De novo assembly was performed to generate a collection of 42,280 unigenes (average length of 502.6 bp) that represent the first transcriptome of the species. Of the unigenes, 78.43% and 48.93% had significant similarity with proteins in the National Center for Biotechnology Information (NCBI) non-redundant protein database (Nr) and the Swiss-Prot database, respectively, and 10,473 (24.77%) unigenes were assigned to Clusters of Orthologous Groups (COG). A total of 21,126 (49.97%) unigenes harboring InterPro domains were annotated, of which 15,409 (36.45%) were assigned to Gene Ontology (GO) categories. Additionally, 7,478 unigenes were mapped onto 228 pathways using the Kyoto Encyclopedia of Genes and Genomes Pathway database (KEGG). Large numbers of simple sequence repeats (SSRs) were identified, and the rates of successful amplification and polymorphism were then investigated among 31 celery accessions. Conclusions: This study demonstrates the feasibility of generating large-scale sequence information by Illumina paired-end sequencing and efficient assembly. Our results provide a valuable resource for celery research. The developed molecular markers are the foundation of further genetic linkage analysis and gene localization, and they will be essential to accelerate the process of breeding. PMID:23469050

  13. Functional Annotation of the Ophiostoma novo-ulmi Genome: Insights into the Phytopathogenicity of the Fungal Agent of Dutch Elm Disease

    PubMed Central

    Comeau, André M.; Dufour, Josée; Bouvet, Guillaume F.; Jacobi, Volker; Nigg, Martha; Henrissat, Bernard; Laroche, Jérôme; Levesque, Roger C.; Bernier, Louis

    2015-01-01

    The ascomycete fungus Ophiostoma novo-ulmi is responsible for the pandemic of Dutch elm disease that has been ravaging Europe and North America for 50 years. We proceeded to annotate the genome of the O. novo-ulmi strain H327 that was sequenced in 2012. The 31.784-Mb nuclear genome (50.1% GC) is organized into 8 chromosomes containing a total of 8,640 protein-coding genes that we validated with RNA sequencing analysis. Approximately 53% of these genes have their closest match to Grosmannia clavigera kw1407, followed by 36% in other close Sordariomycetes, 5% in other Pezizomycotina, and surprisingly few (5%) orphans. A relatively small portion (∼3.4%) of the genome is occupied by repeat sequences; however, the mechanism of repeat-induced point mutation appears active in this genome. Approximately 76% of the proteins could be assigned functions using Gene Ontology analysis; we identified 311 carbohydrate-active enzymes, 48 cytochrome P450s, and 1,731 proteins potentially involved in pathogen–host interaction, along with 7 clusters of fungal secondary metabolites. Complementary mating-type locus sequencing, mating tests, and culturing in the presence of elm terpenes were conducted. Our analysis identified a specific genetic arsenal impacting the sexual and vegetative growth, phytopathogenicity, and signaling/plant–defense–degradation relationship between O. novo-ulmi and its elm host and insect vectors. PMID:25539722

  14. INTERFEROME v2.0: an updated database of annotated interferon-regulated genes

    PubMed Central

    Rusinova, Irina; Forster, Sam; Yu, Simon; Kannan, Anitha; Masse, Marion; Cumming, Helen; Chapman, Ross; Hertzog, Paul J.

    2013-01-01

    Interferome v2.0 (http://interferome.its.monash.edu.au/interferome/) is an update of an earlier version of the Interferome DB published in the 2009 NAR database edition. Vastly improved computational infrastructure now enables more complex and faster queries, and supports more data sets from types I, II and III interferon (IFN)-treated cells, mice or humans. Quantitative, MIAME compliant data are collected, subjected to thorough, standardized, quantitative and statistical analyses and then significant changes in gene expression are uploaded. Comprehensive manual collection of metadata in v2.0 allows flexible, detailed search capacity including the parameters: range of -fold change, IFN type, concentration and time, and cell/tissue type. There is no limit to the number of genes that can be used to search the database in a single query. Secondary analysis such as gene ontology, regulatory factors, chromosomal location or tissue expression plots of IFN-regulated genes (IRGs) can be performed in Interferome v2.0, or data can be downloaded in convenient text formats compatible with common secondary analysis programs. Given the importance of IFN to innate immune responses in infectious, inflammatory diseases and cancer, this upgrade of the Interferome to version 2.0 will facilitate the identification of gene signatures of importance in the pathogenesis of these diseases. PMID:23203888

  15. Interestingness measures and strategies for mining multi-ontology multi-level association rules from gene ontology annotations for the discovery of new GO relationships.

    PubMed

    Manda, Prashanti; McCarthy, Fiona; Bridges, Susan M

    2013-10-01

    The Gene Ontology (GO), a set of three sub-ontologies, is one of the most popular bio-ontologies used for describing gene product characteristics. GO annotation data containing terms from multiple sub-ontologies and at different levels in the ontologies is an important source of implicit relationships between terms from the three sub-ontologies. Data mining techniques such as association rule mining that are tailored to mine from multiple ontologies at multiple levels of abstraction are required for effective knowledge discovery from GO annotation data. We present a data mining approach, Multi-ontology data mining at All Levels (MOAL) that uses the structure and relationships of the GO to mine multi-ontology multi-level association rules. We introduce two interestingness measures: Multi-ontology Support (MOSupport) and Multi-ontology Confidence (MOConfidence) customized to evaluate multi-ontology multi-level association rules. We also describe a variety of post-processing strategies for pruning uninteresting rules. We use publicly available GO annotation data to demonstrate our methods with respect to two applications (1) the discovery of co-annotation suggestions and (2) the discovery of new cross-ontology relationships. PMID:23850840
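    For readers unfamiliar with association rule mining, the sketch below computes the standard support and confidence of a rule over annotation sets. These are the textbook measures, not the MOSupport and MOConfidence measures introduced in the paper, whose definitions are not given in this abstract. The term labels and the toy annotation sets are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) for the rule antecedent -> consequent."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0
    return support(transactions, antecedent | consequent) / base


# Hypothetical data: each set holds the terms annotated to one gene product.
# "MF:x" and "BP:y" are placeholder labels, not real GO identifiers.
annotations = [
    frozenset({"MF:x", "BP:y"}),
    frozenset({"MF:x", "BP:y", "CC:z"}),
    frozenset({"MF:x"}),
    frozenset({"CC:z"}),
]

rule_support = support(annotations, frozenset({"MF:x", "BP:y"}))
rule_confidence = confidence(annotations, frozenset({"MF:x"}), frozenset({"BP:y"}))
```

    Here the rule MF:x -> BP:y spans two sub-ontologies, which is the kind of cross-ontology relationship the abstract refers to. A multi-level miner would also count annotations to descendant terms toward their ancestors, which this sketch does not attempt.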

  16. Discovering Functions of Unannotated Genes from a Transcriptome Survey of Wild Fungal Isolates

    PubMed Central

    Ellison, Christopher E.; Kowbel, David; Glass, N. Louise; Taylor, John W.

    2014-01-01

    Most fungal genomes are poorly annotated, and many fungal traits of industrial and biomedical relevance are not well suited to classical genetic screens. Assigning genes to phenotypes on a genomic scale thus remains an urgent need in the field. We developed an approach to infer gene function from expression profiles of wild fungal isolates, and we applied our strategy to the filamentous fungus Neurospora crassa. Using transcriptome measurements in 70 strains from two well-defined clades of this microbe, we first identified 2,247 cases in which the expression of an unannotated gene rose and fell across N. crassa strains in parallel with the expression of well-characterized genes. We then used image analysis of hyphal morphologies, quantitative growth assays, and expression profiling to test the functions of four genes predicted from our population analyses. The results revealed two factors that influenced regulation of metabolism of nonpreferred carbon and nitrogen sources, a gene that governed hyphal architecture, and a gene that mediated amino acid starvation resistance. These findings validate the power of our population-transcriptomic approach for inference of novel gene function, and we suggest that this strategy will be of broad utility for genome-scale annotation in many fungal systems. PMID:24692637

  17. Augmented annotation and orthologue analysis for Oryctolagus cuniculus: Better Bunny

    PubMed Central

    2012-01-01

    Background: The rabbit is an important model organism used in a wide range of biomedical research. However, the rabbit genome is still sparsely annotated, thus prohibiting extensive functional analysis of gene sets derived from whole-genome experiments. We developed a web-based application that provides augmented annotation and orthologue analysis for rabbit genes. Importantly, the application allows comprehensive functional analysis through the use of orthologous relationships. Results: Using data extracted from several public bioinformatics repositories we created Better Bunny, a database and query tool that extensively augments the available functional annotation for rabbit genes. Using the complete set of target genes from a commercial rabbit gene expression microarray as our benchmark, we are able to obtain functional information for 88% of the genes on the microarray. Previously, functional information was available for fewer than 10% of the rabbit genes. Conclusions: We have developed a freely available, web-accessible bioinformatics tool that enables investigators to quickly and easily perform extensive functional analysis of rabbit genes (http://cptweb.cpt.wayne.edu). The software application fills a critical void for a wide range of biomedical research that relies on the rabbit model and requires characterization of biological function for large sets of genes. PMID:22568790

  18. Gene3D: merging structure and function for a Thousand genomes.

    PubMed

    Lees, Jonathan; Yeats, Corin; Redfern, Oliver; Clegg, Andrew; Orengo, Christine

    2010-01-01

    Over the last 2 years the Gene3D resource has been significantly improved, and is now more accurate and with a much richer interactive display via the Gene3D website (http://gene3d.biochem.ucl.ac.uk/). Gene3D provides accurate structural domain family assignments for over 1100 genomes and nearly 10,000,000 proteins. A hidden Markov model library, constructed from the manually curated CATH structural domain hierarchy, is used to search UniProt, RefSeq and Ensembl protein sequences. The resulting matches are refined into simple multi-domain architectures using a recently developed in-house algorithm, DomainFinder 3 (available at: ftp://ftp.biochem.ucl.ac.uk/pub/gene3d_data/DomainFinder3/). The domain assignments are integrated with multiple external protein function descriptions (e.g. Gene Ontology and KEGG), structural annotations (e.g. coiled coils, disordered regions and sequence polymorphisms) and family resources (e.g. Pfam and eggNog) and displayed on the Gene3D website. The website allows users to view descriptions for both single proteins and genes and large protein sets, such as superfamilies or genomes. Subsets can then be selected for detailed investigation or associated functions and interactions can be used to expand explorations to new proteins. Gene3D also provides a set of services, including an interactive genome coverage graph visualizer, DAS annotation resources, sequence search facilities and SOAP services. PMID:19906693

  19. RASTtk: A modular and extensible implementation of the RAST algorithm for building custom annotation pipelines and annotating batches of genomes

    DOE PAGESBeta

    Brettin, Thomas; Davis, James J.; Disz, Terry; Edwards, Robert A.; Gerdes, Svetlana; Olsen, Gary J.; Olson, Robert; Overbeek, Ross; Parrello, Bruce; Pusch, Gordon D.; et al

    2015-02-10

    The RAST (Rapid Annotation using Subsystem Technology) annotation engine was built in 2008 to annotate bacterial and archaeal genomes. It works by offering a standard software pipeline for identifying genomic features (i.e., protein-encoding genes and RNA) and annotating their functions. Recently, in order to make RAST a more useful research tool and to keep pace with advancements in bioinformatics, it has become desirable to build a version of RAST that is both customizable and extensible. In this paper, we describe the RAST tool kit (RASTtk), a modular version of RAST that enables researchers to build custom annotation pipelines. RASTtk offers a choice of software for identifying and annotating genomic features as well as the ability to add custom features to an annotation job. RASTtk also accommodates the batch submission of genomes and the ability to customize annotation protocols for batch submissions. This is the first major software restructuring of RAST since its inception.

  20. Experimental Strategies for Functional Annotation and Metabolism Discovery: Targeted Screening of Solute Binding Proteins and Unbiased Panning of Metabolomes

    PubMed Central

    2015-01-01

    The rate at which genome sequencing data is accruing demands enhanced methods for functional annotation and metabolism discovery. Solute binding proteins (SBPs) facilitate the transport of the first reactant in a metabolic pathway, thereby constraining the regions of chemical space and the chemistries that must be considered for pathway reconstruction. We describe high-throughput protein production and differential scanning fluorimetry platforms, which enabled the screening of 158 SBPs against a 189 component library specifically tailored for this class of proteins. Like all screening efforts, this approach is limited by the practical constraints imposed by construction of the library, i.e., we can study only those metabolites that are known to exist and which can be made in sufficient quantities for experimentation. To move beyond these inherent limitations, we illustrate the promise of crystallographic- and mass spectrometric-based approaches for the unbiased use of entire metabolomes as screening libraries. Together, our approaches identified 40 new SBP ligands, generated experiment-based annotations for 2084 SBPs in 71 isofunctional clusters, and defined numerous metabolic pathways, including novel catabolic pathways for the utilization of ethanolamine as sole nitrogen source and the use of d-Ala-d-Ala as sole carbon source. These efforts begin to define an integrated strategy for realizing the full value of amassing genome sequence data. PMID:25540822

  1. Functional analyses of cellulose synthase genes in flax (Linum usitatissimum) by virus-induced gene silencing.

    PubMed

    Chantreau, Maxime; Chabbert, Brigitte; Billiard, Sylvain; Hawkins, Simon; Neutelings, Godfrey

    2015-12-01

    Flax (Linum usitatissimum) bast fibres are located in the stem cortex where they play an important role in mechanical support. They contain high amounts of cellulose and so are used for linen textiles and in the composite industry. In this study, we screened the annotated flax genome and identified 14 distinct cellulose synthase (CESA) genes using orthologous sequences previously identified. Transcriptomics of 'primary cell wall' and 'secondary cell wall' flax CESA genes showed that some were preferentially expressed in different organs and stem tissues providing clues as to their biological role(s) in planta. The development for the first time in flax of a virus-induced gene silencing (VIGS) approach was used to functionally evaluate the biological role of different CESA genes in stem tissues. Quantification of transcript accumulation showed that in many cases, silencing not only affected targeted CESA clades, but also had an impact on other CESA genes. Whatever the targeted clade, inactivation by VIGS affected plant growth. In contrast, only clade 1- and clade 6-targeted plants showed modifications in outer-stem tissue organization and secondary cell wall formation. In these plants, bast fibre number and structure were severely impacted, suggesting that the targeted genes may play an important role in the establishment of the fibre cell wall. Our results provide new fundamental information about cellulose biosynthesis in flax that should facilitate future plant improvement/engineering. PMID:25688574

  2. Automatic annotation of organellar genomes with DOGMA

    SciTech Connect

    Wyman, Stacia; Jansen, Robert K.; Boore, Jeffrey L.

    2004-06-01

    Dual Organellar GenoMe Annotator (DOGMA) automates the annotation of extra-nuclear organellar (chloroplast and animal mitochondrial) genomes. It is a web-based package that allows the use of comparative BLAST searches to identify and annotate genes in a genome. DOGMA presents a list of putative genes to the user in a graphical format for viewing and editing. Annotations are stored on our password-protected server. Complete annotations can be extracted for direct submission to GenBank. Furthermore, intergenic regions of specified length can be extracted, as well as the nucleotide sequences and amino acid sequences of the genes.

  3. IMG ER: A System for Microbial Genome Annotation Expert Review and Curation

    SciTech Connect

    Markowitz, Victor M.; Mavromatis, Konstantinos; Ivanova, Natalia N.; Chen, I-Min A.; Chu, Ken; Kyrpides, Nikos C.

    2009-05-25

    A rapidly increasing number of microbial genomes are sequenced by organizations worldwide and are eventually included into various public genome data resources. The quality of the annotations depends largely on the original dataset providers, with erroneous or incomplete annotations often carried over into the public resources and difficult to correct. We have developed an Expert Review (ER) version of the Integrated Microbial Genomes (IMG) system, with the goal of supporting systematic and efficient revision of microbial genome annotations. IMG ER provides tools for the review and curation of annotations of both new and publicly available microbial genomes within IMG's rich integrated genome framework. New genome datasets are included into IMG ER prior to their public release either with their native annotations or with annotations generated by IMG ER's annotation pipeline. IMG ER tools allow addressing annotation problems detected with IMG's comparative analysis tools, such as genes missed by gene prediction pipelines or genes without an associated function. Over the past year, IMG ER was used for improving the annotations of about 150 microbial genomes.

  4. Analysis of mammalian gene function through broad-based phenotypic screens across a consortium of mouse clinics.

    PubMed

    Hrabě de Angelis, Martin; Nicholson, George; Selloum, Mohammed; White, Jacqueline K; Morgan, Hugh; Ramirez-Solis, Ramiro; Sorg,